hamel winglian committed
Commit 712fd27 • 1 Parent(s): ef24342

Add docs (#947)

* move section

* update README

* update README

* update README

* update README

* update README

* Update README.md

Co-authored-by: Wing Lian <wing.lian@gmail.com>

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>

Files changed (1)
  1. README.md +39 -0
README.md CHANGED

@@ -36,7 +36,9 @@ Features:
  - [Train](#train)
  - [Inference](#inference)
  - [Merge LORA to Base](#merge-lora-to-base)
+ - [Special Tokens](#special-tokens)
  - [Common Errors](#common-errors-)
+ - [Tokenization Mismatch b/w Training & Inference](#tokenization-mismatch-bw-inference--training)
  - [Need Help?](#need-help-)
  - [Badge](#badge-)
  - [Community Showcase](#community-showcase)

@@ -251,6 +253,13 @@ Have dataset(s) in one of the following format (JSONL recommended):
  ```json
  {"conversations": [{"from": "...", "value": "..."}]}
  ```
+ - `llama-2`: the JSON is the same format as `sharegpt` above, with the following config (see the [config section](#config) for more details)
+   ```yml
+   datasets:
+     - path: <your-path>
+       type: sharegpt
+       conversation: llama-2
+   ```
  - `completion`: raw corpus
  ```json
  {"text": "..."}

@@ -970,6 +979,22 @@ wandb_name:
  wandb_log_model:
  ```

+ ##### Special Tokens
+
+ It is important to have special tokens like delimiters, end-of-sequence, and beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:
+
+ ```yml
+ special_tokens:
+   bos_token: "<s>"
+   eos_token: "</s>"
+   unk_token: "<unk>"
+ tokens: # these are delimiters
+   - "<|im_start|>"
+   - "<|im_end|>"
+ ```
+
+ When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.
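
To double-check that the tokens actually made it into the vocabulary, you can load the saved tokenizer and inspect it. A minimal sketch (assuming the `transformers` library and a hypothetical output directory `./lora-out` holding the trained tokenizer):

```python
from transformers import AutoTokenizer

# Hypothetical path: wherever axolotl wrote your tokenizer (your output_dir).
tokenizer = AutoTokenizer.from_pretrained("./lora-out")

# Special tokens declared under `special_tokens` in the config.
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token)

# Delimiters declared under `tokens` should map to a single id each,
# not get split into several sub-word pieces.
for tok in ["<|im_start|>", "<|im_end|>"]:
    ids = tokenizer(tok, add_special_tokens=False)["input_ids"]
    print(tok, ids)
```

If a delimiter comes back as several ids, it was not added to the vocabulary and you are likely to hit the tokenization issues described further down.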
+
  ### Inference

  Pass the appropriate flag to the train command:
 
@@ -1048,6 +1073,20 @@ It's safe to ignore it.

  See the [NCCL](docs/nccl.md) guide.

+
+ ### Tokenization Mismatch b/w Inference & Training
+
+ For many formats, Axolotl constructs prompts by concatenating token ids _after_ tokenizing strings. The reason for concatenating token ids rather than operating on strings is to maintain precise accounting for attention masks.
+
+ If you decode a prompt constructed by axolotl, you might see spaces between tokens (or lack thereof) that you do not expect, especially around delimiters and special tokens. When you are starting out with a new format, you should always do the following:
+
+ 1. Materialize some data using `python -m axolotl.cli.preprocess your_config.yml --debug`, and then decode the first few rows with your model's tokenizer.
+ 2. During inference, right before you pass a tensor of token ids to your model, decode these tokens back into a string.
+ 3. Make sure the inference string from #2 looks **exactly** like the data you fine-tuned on from #1, including spaces and new lines. If they aren't the same, adjust your inference server accordingly.
+ 4. As an additional troubleshooting step, you can look at the token ids between #1 and #2 to make sure they are identical.
+
+ Having misalignment between your prompts during training and inference can cause models to perform very poorly, so it is worth checking this. See [this blog post](https://hamel.dev/notes/llm/05_tokenizer_gotchas.html) for a concrete example.
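
A minimal sketch of checks 2–4 (assuming the `transformers` library; the tokenizer path and the two id lists are placeholders standing in for a row materialized in step 1 and for the ids your inference server is about to send to the model):

```python
from transformers import AutoTokenizer

# Placeholder values for illustration only.
tokenizer = AutoTokenizer.from_pretrained("./lora-out")  # hypothetical tokenizer path
train_ids = [1, 32001, 15043, 32002]      # ids from a preprocessed training row
inference_ids = [1, 32001, 15043, 32002]  # ids built by your inference code

# Steps 2-3: decode both and compare the strings exactly, spaces and newlines included.
train_text = tokenizer.decode(train_ids)
inference_text = tokenizer.decode(inference_ids)
print(repr(train_text))
print(repr(inference_text))
assert inference_text == train_text, "prompt text differs between training and inference"

# Step 4: the raw token ids should be identical as well.
assert list(train_ids) == list(inference_ids), "token ids differ between training and inference"
```

Printing with `repr` makes stray spaces or newlines around delimiters easy to spot.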
+
  ## Need help? 🙋‍♂️

  Join our [Discord server](https://discord.gg/HhrNrHJPRb) where we can help you