Commits · Dovakiins/qwerrwe

Efficiently get the length of the tokenized docs (#1063)

81d3845
unverified

ricdomolm

winglian committed on Jan 8

streaming multipack for pretraining dataset (#959)

553c80f
unverified

jinwonkim93 jinwonkim93@github.com

winglian committed on Jan 6

fix: revert local dir dataset load (#878)

575a082
unverified

Nanobit committed on Nov 18, 2023

don't train if eval split is too small (#873)

797f3dd
unverified

winglian committed on Nov 16, 2023

Feat: Add dataset loading from S3, GCS (#765)

3cc67d2
unverified

Nanobit committed on Nov 16, 2023

Update data.py for signature generation (#851)

48630f5
unverified

MilesQLi

winglian committed on Nov 15, 2023

cleanup the old multipack dataloader (#841)

1a6309c
unverified

winglian committed on Nov 12, 2023

multipack w batch sampler (#795)

641e6f7
unverified

winglian committed on Nov 8, 2023

update table for rwkv4 support, fix process count for dataset (#822)

cdc71f7
unverified

winglian committed on Nov 5, 2023

Create preprocess CLI (#785)

e50ab07
unverified

casperhansen committed on Oct 26, 2023

catch ConnectionError when checking dataset from HuggingFace (#743)

992d57f
unverified

Napuh committed on Oct 19, 2023

improve handling of the prepared ds path and other cfg defaults (#701)

1c412c7
unverified

winglian committed on Oct 13, 2023

Fix: Future deprecation warning with use_auth_token (#680)

69fac9a
unverified

Nanobit committed on Oct 5, 2023

prepared dataset caching, other misc fixes (#665)

e50a64e
unverified

winglian committed on Oct 3, 2023

add support for defined train split (#654)

409ca0f
unverified

winglian committed on Sep 29, 2023

Fix bug in dataset loading (#284)

8fe0e63
unverified

ethanhs committed on Sep 27, 2023

use fastchat conversations template (#578)

e7d3e2d
unverified

winglian committed on Sep 27, 2023

attention_mask not needed for training (#642)

e8cbf50
unverified

winglian committed on Sep 27, 2023

Feat(data): Allow loading local csv and text (#594)

00dce35
unverified

Nanobit committed on Sep 17, 2023

support custom field for completion from yml (#580)

f7a2263
unverified

winglian committed on Sep 15, 2023

remove columns after tokenizing for pretraining (#571)

1157950
unverified

winglian committed on Sep 14, 2023

Fix pretraining with iterable/streaming Dataset (#556)

2f586d1
unverified

Jan Philipp Harries committed on Sep 13, 2023

workaround for md5 variations (#533)

0b4cf5b
unverified

winglian committed on Sep 8, 2023

support for datasets with multiple names (#480)

5ac3392
unverified

winglian committed on Aug 29, 2023

improve llama pad token handling (#475)

cb9797e
unverified

winglian committed on Aug 24, 2023

support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (#348)

d2e7f27
unverified

winglian committed on Aug 20, 2023

add utils.data.prepare_dataset

2e22404

tmm1 committed on Aug 15, 2023

use context manager to run things on rank0 before others (#397)

fc2d6be
unverified

winglian committed on Aug 15, 2023

Attention mask and position id fixes for packing (#285)

2bb0b78
unverified

winglian committed on Aug 12, 2023

experimental llama 2 chat support (#296)

3392270
unverified

Jan Philipp Harries committed on Aug 6, 2023

optimize the iteration when tokenizing large datasets (#332)

fe28543
unverified

winglian committed on Aug 4, 2023

Merge pull request #276 from theobjectivedad/logging_enhancement

6f16c45
unverified

winglian committed on Jul 16, 2023

Fixed pre-commit problems, fixed small bug in logging_config to handle LOG_LEVEL env var

b1f4f7a

theobjectivedad committed on Jul 15, 2023

Add ability to pass 'name' argument to load_dataset

88089e8

chargoddard committed on Jul 14, 2023

Adding logging enhancement

553a86b

theobjectivedad committed on Jul 14, 2023

Support loading data files from a local directory

9bdd30c

utensil committed on Jun 21, 2023

Merge branch 'main' into flash-optimum

fd2c981
unverified

winglian committed on Jun 12, 2023

add new sharegpt, refactor prompt so it can be customized later, add exception if no data is processed

aac4b76

winglian committed on Jun 11, 2023

address PR feedback

0c6f928

winglian committed on Jun 10, 2023

add streaming dataset support for pretraining datasets

eea2731

winglian committed on Jun 10, 2023

more gpt-neox long ctx fixes

ab5cd28

winglian committed on Jun 1, 2023

more tweaks to do pre-training with bettertransformers

1210dc8

winglian committed on Jun 1, 2023

experimental expansion of ctx len

488a67d

winglian committed on May 31, 2023

Set to use cfg.seed or 42 for backward compat

2cfe9e9

Nanobit committed on Jun 8, 2023