Commit History

Efficiently get the length of the tokenized docs (#1063)
81d3845
unverified

ricdomolm winglian commited on

streaming multipack for pretraining dataset (#959)
553c80f
unverified

jinwonkim93 winglian commited on

fix: revert local dir dataset load (#878)
575a082
unverified

Nanobit commited on

don't train if eval split is too small (#873)
797f3dd
unverified

winglian commited on

Feat: Add dataset loading from S3, GCS (#765)
3cc67d2
unverified

Nanobit commited on

Update data.py for signature generation (#851)
48630f5
unverified

MilesQLi winglian commited on

cleanup the old multipack dataloader (#841)
1a6309c
unverified

winglian commited on

multipack w batch sampler (#795)
641e6f7
unverified

winglian commited on

update table for rwkv4 support, fix process count for dataset (#822)
cdc71f7
unverified

winglian commited on

Create preprocess CLI (#785)
e50ab07
unverified

casperhansen commited on

catch ConnectionError when checking dataset from HuggingFace (#743)
992d57f
unverified

Napuh commited on

improve handling of the prepared ds path and other cfg defaults (#701)
1c412c7
unverified

winglian commited on

Fix: Future deprecation warning with use_auth_token (#680)
69fac9a
unverified

Nanobit commited on

prepared dataset caching, other misc fixes (#665)
e50a64e
unverified

winglian commited on

add support for defined train split (#654)
409ca0f
unverified

winglian commited on

Fix bug in dataset loading (#284)
8fe0e63
unverified

ethanhs commited on

use fastchat conversations template (#578)
e7d3e2d
unverified

winglian commited on

attention_mask not needed for training (#642)
e8cbf50
unverified

winglian commited on

Feat(data): Allow loading local csv and text (#594)
00dce35
unverified

Nanobit commited on

support custom field for completion from yml (#580)
f7a2263
unverified

winglian commited on

remove columns after tokenizing for pretraining (#571)
1157950
unverified

winglian commited on

Fix pretraining with iterable/streaming Dataset (#556)
2f586d1
unverified

Jan Philipp Harries commited on

workaround for md5 variations (#533)
0b4cf5b
unverified

winglian commited on

support for datasets with multiple names (#480)
5ac3392
unverified

winglian commited on

improve llama pad token handling (#475)
cb9797e
unverified

winglian commited on

support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (#348)
d2e7f27
unverified

winglian commited on

add utils.data.prepare_dataset
2e22404

tmm1 commited on

use context manager to run things on rank0 before others (#397)
fc2d6be
unverified

winglian commited on

Attention mask and position id fixes for packing (#285)
2bb0b78
unverified

winglian commited on

experimental llama 2 chat support (#296)
3392270
unverified

Jan Philipp Harries commited on

optimize the iteration when tokenizeing large datasets (#332)
fe28543
unverified

winglian commited on

Merge pull request #276 from theobjectivedad/logging_enhancement
6f16c45
unverified

winglian commited on

Fixed pre-commit problems, fixed small bug in logging_config to handle LOG_LEVEL env var
b1f4f7a

theobjectivedad commited on

Add ability to pass 'name' argument to load_dataset
88089e8

chargoddard commited on

Support loading data files from a local directory
9bdd30c

utensil commited on

Merge branch 'main' into flash-optimum
fd2c981
unverified

winglian commited on

add new sharegpt, refactor prompt so it can be customized later, add exception if no data is processed
aac4b76

winglian commited on

address PR feedback
0c6f928

winglian commited on

add streaming dataset support for pretraining datasets
eea2731

winglian commited on

more gpt-neox long ctx fixes
ab5cd28

winglian commited on

more tweaks to do pre-training with bettertransformers
1210dc8

winglian commited on

experimental expansion of ctx len
488a67d

winglian commited on

Set to use cfg.seed or 42 for backward compat
2cfe9e9

Nanobit commited on

fix batch size calculation
5a631b3

winglian commited on

Fix security issue or ignore false positives
a1f9850

Nanobit commited on

Apply isort then black
37293dc

Nanobit commited on

Fix mypy typing
e9650d3

Nanobit commited on

Black formatting
b832a0a

Nanobit commited on

Refactor
4c0eddb

Nanobit commited on