No tokenizer.model???

#1
by wyxwangmed - opened

It should be here: https://huggingface.co/brucethemoose/Yi-34B-200K-DARE-merge-v7/blob/main/tokenizer.json

What backend are you using? Do you mean that some require the .model file instead?

Oobabooga needs it, at least with ExLlama 2.
I just used the tokenizer.model from the DARE Merge v5.

Interesting. I believe tokenizer.model is a legacy SentencePiece format; not sure why ooba wants it.

I converted it with llama.cpp and uploaded it, lemme know if it works.

Also, is there any reason you're not using an exl2 quant in ooba? Do y'all need an 8bpw upload?

I think the tokenizer.model has some issue.

exllamav2 0.0.11's convert.py fails with:

Traceback (most recent call last):
  File "/exllamav2/convert.py", line 69, in <module>
    tokenizer = ExLlamaV2Tokenizer(config)
  File "/exllamav2/exllamav2/tokenizer.py", line 65, in __init__
    if os.path.exists(path_spm) and not force_json: self.tokenizer = ExLlamaV2TokenizerSPM(path_spm)
  File "/exllamav2/exllamav2/tokenizers/spm.py", line 9, in __init__
    self.spm = SentencePieceProcessor(model_file = tokenizer_model)
  File "/venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 447, in Init
    self.Load(model_file=model_file, model_proto=model_proto)
  File "/venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

Quantizing without the model file seems to work.
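That ParseFromArray error means sentencepiece couldn't deserialize the file at all. A quick stdlib sniff (a hypothetical helper, not part of exllamav2) can rule out the two most common causes: the file being JSON renamed to .model, or an unfetched git-lfs pointer:

```python
def sniff_tokenizer_model(path):
    # Peek at the first bytes. A real sentencepiece model is a binary
    # protobuf, so human-readable text here means the wrong file was saved.
    with open(path, "rb") as f:
        head = f.read(64)
    if head.lstrip().startswith(b"{"):
        return "looks like JSON (maybe a renamed tokenizer.json)"
    if head.startswith(b"version https://git-lfs"):
        return "git-lfs pointer file -- run `git lfs pull` to fetch the real file"
    return "binary data -- plausibly a real sentencepiece model"
```

If it passes both checks and sentencepiece still rejects it, the file is likely a protobuf from some other tool or a truncated download.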

Yeah I just got the same error actually.

Does anyone know how to even generate an old-style tokenizer.model file?

Technically another tokenizer could be substituted, but some tokens from the union tokenizer merge may be missing.
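For anyone who wants to see what would actually be lost, here's a rough sketch for diffing two tokenizers' vocabularies. It assumes a BPE-style Hugging Face tokenizer.json where the vocab lives under `model.vocab` as a token-to-id map; Unigram-style files store it differently, so adjust accordingly:

```python
import json

def load_vocab(path):
    # Read the token -> id map from a Hugging Face tokenizer.json
    # (BPE layout assumed: {"model": {"vocab": {token: id, ...}}}).
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return set(data["model"]["vocab"].keys())

def missing_tokens(merged_path, substitute_path):
    # Tokens present in the merged model's tokenizer but absent from
    # the substitute -- these are the ones that would be lost.
    return load_vocab(merged_path) - load_vocab(substitute_path)
```

If the returned set is empty, the substitute tokenizer covers everything and swapping it in should be safe.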
