Wrong Tokenizer?

#4
by JJinho - opened

The model card states that the 'bert-base-uncased' tokenizer is used, but the paper makes clear that modifications were made (e.g. padding the vocabulary to a multiple of 64), and the config.json files show different vocabulary sizes (30528 for nomic-bert-2048 vs. 30522 for bert-base-uncased).

Is this intentional? If not, could the tokenizer also be included? Thank you.
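For reference, a minimal sketch of how the mismatch can be observed from the two configs (assuming the repo id nomic-ai/nomic-bert-2048; since the model ships custom code, trust_remote_code is likely required):

```python
from transformers import AutoConfig

# Assumed repo ids; nomic-bert-2048 uses custom modeling code.
nomic_cfg = AutoConfig.from_pretrained("nomic-ai/nomic-bert-2048", trust_remote_code=True)
bert_cfg = AutoConfig.from_pretrained("bert-base-uncased")

print(nomic_cfg.vocab_size)  # 30528 -- padded up to a multiple of 64
print(bert_cfg.vocab_size)   # 30522
```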

Nomic AI org

It is just the bert-base tokenizer; we add extra tokens only to increase training throughput, and those extra tokens are never used. Either way, I pushed the bert tokenizer to this repo.
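With the tokenizer now pushed to the repo, a quick check like the sketch below (again assuming the repo id nomic-ai/nomic-bert-2048) should show that the tokenizer vocabulary matches bert-base-uncased; only the model's embedding matrix is padded, so the extra ids are never emitted by the tokenizer:

```python
from transformers import AutoTokenizer

# Assumed repo id; the pushed tokenizer is expected to be the plain bert-base-uncased tokenizer.
tok = AutoTokenizer.from_pretrained("nomic-ai/nomic-bert-2048")
ref = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tok.vocab_size, ref.vocab_size)  # expected to match (30522)
```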

zpn changed discussion status to closed

Thank you for clarifying that!
