Idea: adding support for chatml & abbreviations

#5
by ngxson HF staff - opened

Hi, thank you for providing such a high-quality model. It's fun and useful to play with this AI assistant.

I would like to suggest two changes that I think would make the model much more convenient to use.

Use chatml template

Currently, tokenizer_config.json uses the Llama 2 format (with [INST], [/INST], ...). Most Mistral-based models have already moved to ChatML (<|im_start|>, <|im_end|>) because it's easier to work with.

To migrate to ChatML, all we need to do is replace the [INST] and [/INST] tokens with <|im_start|> and <|im_end|>, update the chat template, and then fine-tune so that the model learns the new format. In my case, I used the bkai-foundation-models/vi-self-chat-sharegpt-format dataset and fine-tuned with QLoRA. Training took about 1 hour on a single Nvidia V100, and the result works fine. Still, it would be nice if you could update the upstream model.

You can take the tokenizer_*, added_tokens, and special_tokens_map files from my repo here: https://huggingface.co/ngxson/Vistral-7B-ChatML
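
For readers unfamiliar with ChatML, here is a minimal sketch of what a prompt looks like once the messages are rendered in that format. The helper name formatChatML and its addGenerationPrompt parameter are hypothetical and only for illustration; the actual template shipped in the linked repo may differ in details such as whitespace or a default system prompt.

```javascript
// Hypothetical helper (not from the repo): renders { role, content } messages as ChatML.
// Each turn is wrapped as: <|im_start|>ROLE\nCONTENT<|im_end|>\n
const formatChatML = (messages, addGenerationPrompt = true) => {
  const turns = messages.map(
    ({ role, content }) => `<|im_start|>${role}\n${content}<|im_end|>\n`
  );
  // Optionally open an assistant turn so the model knows it should answer next.
  return turns.join('') + (addGenerationPrompt ? '<|im_start|>assistant\n' : '');
};

const prompt = formatChatML([
  { role: 'system', content: 'Bạn là một trợ lý hữu ích.' },
  { role: 'user', content: 'Xin chào!' },
]);
console.log(prompt);
```

Compared with [INST] ... [/INST], every turn carries an explicit role name, which is a large part of why the format is easier to work with for system prompts and multi-turn chats.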

Abbreviations

In Vietnamese, we very often use abbreviations like t, c, e, a, ko, đc, ...

I imagine the model was trained on a high-quality "textbook" dataset. This lowers the chance that the model sees abbreviations during training, so it pays less attention to these words (for example, the negation "ko" is sometimes ignored).

I tried fine-tuning on my own dataset (only 40 examples). Half of the samples contain these abbreviations (only in the user messages, not in the assistant responses), and the end result was great. The repo is here: https://huggingface.co/ngxson/vistral-meow

This is the Node.js function that I used to generate these abbreviations. Feel free to adapt it to Python if you want to use it:

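// Pick a random element from an array.
const sample = items => items[Math.floor(Math.random()*items.length)];
// Maps each full word or phrase to its possible abbreviations.
const TRICKS = {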
  'tớ': ['t'],
  'em': ['e'],
  'cậu': ['c'],
  'anh': ['a'],
  'cũng': ['cx'],
  'được': ['đc'],
  'không': ['ko'],
  'gì': ['j'],
  'với': ['vs'],
  'rồi': ['r', 'rùi'],
  'hôm nay': ['hnay'],
  'hôm trước': ['htrc'],
  'hôm qua': ['hqua'],
  'nói chuyện': ['nchn'],
  'chia tay': ['ctay'],
  'người yêu': ['ny'],
  'người yêu cũ': ['nyc'],
};
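// Add capitalized variants (e.g. 'Không' -> 'Ko') so sentence-initial words are covered too.
const upperCaseFirstChar = (t) => t.charAt(0).toUpperCase() + t.slice(1);
for (const [from, to] of Object.entries(TRICKS)) {
  TRICKS[upperCaseFirstChar(from)] = to.map(upperCaseFirstChar);
}
// Try longer phrases first, so 'người yêu cũ' becomes 'nyc' before 'người yêu' can match inside it.
const ORDERED_TRICKS = Object.entries(TRICKS).sort((a, b) => b[0].length - a[0].length);
// Match whole words only, so 'em' is not replaced inside 'xem'.
// \b only knows ASCII word characters in JavaScript, hence the Unicode-aware lookarounds.
const wholeWord = (w) => new RegExp(`(?<![\\p{L}\\p{N}])${w}(?![\\p{L}\\p{N}])`, 'gu');
const applyTricks = (text) => {
  // Abbreviate only about 40% of the texts; leave the rest untouched.
  const shouldApply = Math.random() < 0.4;
  if (!shouldApply) return text;
  let newText = text;
  for (const [from, to] of ORDERED_TRICKS) {
    newText = newText.replace(wholeWord(from), sample(to));
  }
  return newText;
};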
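
As a rough sketch of how such a function could be applied to only the user side of a conversation, as described above, the snippet below walks over one sample and rewrites the user turns while leaving the assistant turns alone. The ShareGPT-style shape (conversations, from, value) and the names abbreviateDemo and augmentUserTurns are assumptions for illustration; abbreviateDemo is a tiny stand-in, and any (text) => text function can be passed instead.

```javascript
// Stand-in for the abbreviation function (hypothetical, covers a single word only).
const abbreviateDemo = (text) => text.replaceAll('không', 'ko');

// Rewrites only the user ('human') turns; assistant turns are returned unchanged.
// Assumes a ShareGPT-style sample: { conversations: [{ from, value }, ...] }.
const augmentUserTurns = (conversationSample, transform) => ({
  ...conversationSample,
  conversations: conversationSample.conversations.map((turn) =>
    turn.from === 'human' ? { ...turn, value: transform(turn.value) } : turn
  ),
});

const exampleSample = {
  conversations: [
    { from: 'human', value: 'Hôm nay tớ không vui' },
    { from: 'gpt', value: 'Tớ không biết chuyện gì đã xảy ra, kể tớ nghe nhé.' },
  ],
};

const augmentedSample = augmentUserTurns(exampleSample, abbreviateDemo);
console.log(augmentedSample.conversations[0].value);
console.log(augmentedSample.conversations[1].value);
```

Keeping the assistant turns untouched means the model learns to read abbreviations without learning to write them.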

Further idea: make it more resilient to typos

I know this is quite speculative, but Vietnamese sentences are long and prone to typos. Using the same technique as for abbreviations, we may be able to introduce user typos into the dataset. For example, a word ends up typed as "cos" (with a mistyped "s"), ...
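
One possible way to generate such typos is sketched below. It rests on an assumption that is not stated in the post: that "cos" is a Telex input slip for "có", where the tone key is left as a plain letter at the end of the word. The names TELEX_TONE_KEYS, telexTypo and addTypos are hypothetical, and the 10% default rate is an arbitrary placeholder.

```javascript
// Assumed Telex tone keys: each combining tone mark maps to the key that types it.
const TELEX_TONE_KEYS = {
  '\u0301': 's', // sắc (acute)
  '\u0300': 'f', // huyền (grave)
  '\u0309': 'r', // hỏi (hook above)
  '\u0303': 'x', // ngã (tilde)
  '\u0323': 'j', // nặng (dot below)
};
const TONE_MARK = /[\u0300\u0301\u0303\u0309\u0323]/;

// Removes the first tone mark of a word and appends the matching Telex key.
// Words without a tone mark are returned unchanged.
const telexTypo = (word) => {
  const decomposed = word.normalize('NFD'); // split letters from combining marks
  const match = decomposed.match(TONE_MARK);
  if (!match) return word;
  return decomposed.replace(TONE_MARK, '').normalize('NFC') + TELEX_TONE_KEYS[match[0]];
};

// Applies telexTypo to each space-separated word with probability `rate`.
// `random` is injectable so the behavior can be made deterministic.
const addTypos = (text, rate = 0.1, random = Math.random) =>
  text.split(' ').map((w) => (random() < rate ? telexTypo(w) : w)).join(' ');

console.log(telexTypo('có'));
console.log(addTypos('tôi có', 1, () => 0));
```

This sketch splits on spaces only, so punctuation attached to a word stays inside it; a real pipeline would likely need to handle that, as well as other kinds of typos.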

ngxson changed discussion status to closed
ngxson changed discussion status to open
Vietnamese Mistral org

@ngxson
Thank you for your feedback! We appreciate your suggestions and will consider them for future improvements to the model.

chiennv changed discussion status to closed
