Nanobit committed
Commit 04a42b6
1 Parent(s): 919f4ca

feat(docs): improve user customized prompts (#443)


* feat(docs): improve user customized prompts

* feat(doc): add custom pretokenized instructions

* chore: clean old data folder

* chore: add new line

README.md CHANGED
@@ -16,6 +16,7 @@ Axolotl is a tool designed to streamline the fine-tuning of various AI models, o
 - [LambdaLabs Installation](#lambdalabs)
 - [Dataset](#dataset)
 - [How to Add Custom Prompts](#how-to-add-custom-prompts)
+- [How to Use Custom Pretokenized Dataset](#how-to-use-your-custom-pretokenized-dataset)
 - [Config](#config)
 - [Train](#train)
 - [Inference](#inference)
@@ -99,7 +100,7 @@ accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml \
   ```

 - Conda/Pip venv
-  1. Install python **3.9**
+  1. Install python >=**3.9**

   2. Install pytorch stable https://pytorch.org/get-started/locally/

@@ -273,11 +274,29 @@ Have dataset(s) in one of the following format (JSONL recommended):

 #### How to add custom prompts

-1. Add your method to a file in [prompt_strategies](src/axolotl/prompt_strategies). Please see other files as example.
-2. Use your custom file name as the dataset type `<prompt_strategies_file>.load_<load_fn>`.
-
-Optionally, download some datasets, see [data/README.md](data/README.md)
+Using yaml. Example:
+```yaml
+datasets:
+  - path: repo
+    type:
+      system_prompt: ""
+      no_input_format: |-
+        User: {instruction}<|end_of_turn|>
+        Assistant:
+      format: |-
+        User: {instruction}
+        {input}<|end_of_turn|>
+        Assistant:
+```
+
+Using file:
+1. Add your method to a file in [prompt_strategies](src/axolotl/prompt_strategies). Please see other files as example.
+2. Use your custom file name as the dataset type `<prompt_strategies_file>.load_<load_fn>`.
+
+#### How to use your custom pretokenized dataset
+
+- Do not pass a `type:`
+- Dataset must contain `input_ids`, `attention_mask`, `labels` in columns


 ### Config
@@ -307,9 +326,9 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod

 # local
 datasets:
-  - path: json
-    data_files: data.jsonl # or json
-    type: alpaca # format from earlier
+  - path: data.jsonl # or json
+    ds_type: json # see other options below
+    type: alpaca
 ```

 - loading
@@ -395,6 +414,24 @@ datasets:
   shards: # number of shards to split data into
   name: # name of dataset configuration to load

+  # custom user prompt
+  - path: repo
+    type:
+      # the below are defaults. only set what's needed.
+      system_prompt: ""
+      field_system: system
+      field_instruction: instruction
+      field_output: input
+
+      # customizable to be single line or multi-line
+      system_format: "{system}"
+      # 'format' can include {input}
+      format: |-
+        User: {instruction} {input}
+        Assistant:
+      # 'no_input_format' cannot include {input}
+      no_input_format: "{instruction} "
+
 # axolotl attempts to save the dataset as an arrow after packing the data together so
 # subsequent training attempts load faster, relative path
 dataset_prepared_path: data/last_run_prepared
@@ -667,7 +704,9 @@ Please reduce any below
 - `gradient_accumulation_steps`
 - `sequence_len`

-> `failed (exitcode: -9)` usually means your system has run out of system memory.
+> `failed (exitcode: -9)`
+
+Usually means your system has run out of system memory.
 Similarly, you should consider reducing the same settings as when you run out of VRAM.
 Additionally, look into upgrading your system RAM which should be simpler than GPU upgrades.
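As a sketch of the file-based custom-prompt route described above: once your strategy file is in place, the dataset entry in your config could reference it like this (the file name `my_strategy`, function `load_custom`, and dataset repo are hypothetical, not part of this commit):

```yaml
datasets:
  # assumes you added src/axolotl/prompt_strategies/my_strategy.py
  # containing a load function named load_custom
  - path: my_org/my_dataset  # hypothetical dataset repo
    type: my_strategy.load_custom
```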
 
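To make the pretokenized-dataset requirements above concrete, here is a minimal, self-contained sketch of one JSONL row with the three expected columns (the token ids are made-up illustrative values, not from a real tokenizer):

```python
import json

# A made-up pretokenized example: when no `type:` is set, axolotl expects
# these three columns, already tokenized, in the dataset.
example = {
    "input_ids": [1, 4911, 29901, 2],    # token ids (illustrative values)
    "attention_mask": [1, 1, 1, 1],      # 1 = attend to this position
    "labels": [-100, -100, 29901, 2],    # -100 masks positions from the loss
}

required = {"input_ids", "attention_mask", "labels"}
assert required <= example.keys()
# all three columns must have the same length
assert len({len(example[k]) for k in required}) == 1

print(json.dumps(example))  # one line of a JSONL file
```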
data/README.md DELETED
@@ -1,24 +0,0 @@
-
-## Download some datasets
-```shell
-curl https://raw.githubusercontent.com/tloen/alpaca-lora/main/alpaca_data_gpt4.json -o data/raw/alpaca_data_gpt4.json
-curl https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -L -o data/raw/vicuna_cleaned.json
-curl https://github.com/teknium1/GPTeacher/blob/main/Instruct/gpt4-instruct-similarity-0.6-dataset.json?raw=true -L -o data/raw/gpt4-instruct-similarity-0.6-dataset.json
-curl https://github.com/teknium1/GPTeacher/blob/main/Roleplay/roleplay-similarity_0.6-instruct-dataset.json?raw=true -L -o data/raw/roleplay-similarity_0.6-instruct-dataset.json
-```
-
-## Convert the JSON data files to JSONL.
-
-```shell
-python3 ./scripts/alpaca_json_to_jsonl.py --file data/alpaca_data_gpt4.json --output data/alpaca_data_gpt4.jsonl
-python3 ./scripts/alpaca_json_to_jsonl.py --file data/raw/vicuna_cleaned.json --output data/vicuna_cleaned.jsonl
-python3 ./scripts/alpaca_json_to_jsonl.py --file data/raw/roleplay-similarity_0.6-instruct-dataset.json --output data/roleplay-similarity_0.6-instruct-dataset.jsonl
-python3 ./scripts/alpaca_json_to_jsonl.py --file data/raw/gpt4-instruct-similarity-0.6-dataset.json --output data/gpt4-instruct-similarity-0.6-dataset.jsonl
-```
----
-
-Using JSONL makes it easier to subset the data if you want a smaller training set, i.e. get 2000 random examples.
-
-```shell
-shuf -n2000 data/vicuna_cleaned.jsonl > data/vicuna_cleaned.subset0.jsonl
-```
data/raw/.gitignore DELETED
@@ -1 +0,0 @@
-**
 
 
scripts/alpaca_json_to_jsonl.py DELETED
@@ -1,52 +0,0 @@
-"""Module to convert json file to jsonl"""
-
-import os
-import sys
-from pathlib import Path
-from typing import Optional, Union
-
-import fire
-
-from axolotl.convert import (
-    FileReader,
-    FileWriter,
-    JsonlSerializer,
-    JsonParser,
-    JsonToJsonlConverter,
-    StdoutWriter,
-)
-from axolotl.logging_config import configure_logging
-
-configure_logging()
-
-# add src to the pythonpath so we don't need to pip install this
-project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
-src_dir = os.path.join(project_root, "src")
-sys.path.insert(0, src_dir)
-
-
-def main(
-    file: Path,
-    output: Optional[Path] = None,
-    to_stdout: Optional[bool] = False,
-):
-    """
-    Convert a json file to jsonl
-    """
-
-    file_reader = FileReader()
-    writer: Union[StdoutWriter, FileWriter]
-    if to_stdout or output is None:
-        writer = StdoutWriter()
-    else:
-        writer = FileWriter(output)
-    json_parser = JsonParser()
-    jsonl_serializer = JsonlSerializer()
-
-    converter = JsonToJsonlConverter(file_reader, writer, json_parser, jsonl_serializer)
-
-    converter.convert(file, output)
-
-
-if __name__ == "__main__":
-    fire.Fire(main)