Axolotl prompt format (sharegpt, chatml) could differ from yours

#13
by timlim123 - opened

Hi @teknium,

I believe you trained using axolotl with this dataset config:

datasets:
  - path: /data/chat_data/full_dataset_chat.jsonl
    type: sharegpt
    conversation: chatml
dataset_prepared_path: last_run_prepared
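
For reference, a sharegpt-type file such as full_dataset_chat.jsonl holds one JSON object per line with a conversations list. Below is a minimal sketch of such a record; the contents are made up, and the standard from/value keys are my assumption about the dataset, since it is not shown in this thread:

import json

# Hypothetical sharegpt-style record; the actual dataset contents are not in this thread.
record = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "Hi there!"},
        {"from": "gpt", "value": "Hello! How can I help?"},
    ]
}
print(json.dumps(record))  # one such line per example in the .jsonl file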

Did you realize that Axolotl actually adds an extra line break (somehow), so the end of each turn becomes <|im_end|>\n\n? Or did you create your own custom dataset and dataloader? I hope to see a release of your configuration file and dataset format.

I found this out by stepping through the repo in a debugger: the last few label tokens are always [....., 28766, 321, 28730, 416, 28766, 28767, 13, 13, 2], which decodes to <|im_end|>\n\n</s>.
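
To reproduce the decode without stepping through the trainer, here is a minimal sketch. It assumes the base Mistral-7B tokenizer (mistralai/Mistral-7B-v0.1), i.e. that the ChatML markers were not added as single special tokens and therefore get split into the sub-word ids quoted above; the id list is the one from my debugging session:

from transformers import AutoTokenizer

# Assumption: base Mistral-7B tokenizer; swap in this model's own tokenizer if it differs.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# The trailing label ids observed above (everything after the "....." prefix).
tail_ids = [28766, 321, 28730, 416, 28766, 28767, 13, 13, 2]

# Should end with "\n\n</s>", i.e. the doubled newline right before the EOS token.
print(repr(tok.decode(tail_ids)))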

The issue could be here (extra \n in the sep): https://github.com/OpenAccess-AI-Collective/axolotl/blob/a48dbf6561cc74c275a48070f397334a2c367dd5/src/axolotl/prompt_strategies/sharegpt.py#L16
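
If that sep is the culprit, the mechanism is plain string concatenation: a separator that already ends in \n picks up another newline when the turn is rendered. A schematic illustration of that hypothesis (not the actual axolotl/FastChat template code):

# Hypothetical sketch of the suspected bug, not axolotl's real template code.
sep = "<|im_end|>\n"                      # suspected: \n baked into the separator
role, message = "<|im_start|>assistant", "Hello!"

# If the rendering step also appends its own "\n" after each turn,
# the two newlines stack up into the <|im_end|>\n\n seen in the labels.
turn = role + "\n" + message + sep + "\n"
print(repr(turn))                         # '<|im_start|>assistant\nHello!<|im_end|>\n\n'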

I believe they changed things for the chatml format after this was trained.
