Wrong Tokenizer?
#4
by JJinho - opened
The model card states that the 'bert-base-uncased' tokenizer is used, but the paper makes clear that modifications were made (e.g. the vocabulary is padded to a multiple of 64), and in config.json the vocabulary sizes do not match (30528 for nomic-bert-2048 vs. 30522 for bert-base-uncased).
Is this intentional? If not, could the tokenizer also be included? Thank you.
It is just the bert-base tokenizer; we add extra tokens to pad the vocabulary to a multiple of 64, which increases training throughput. Those extra tokens are never used. Either way, I pushed the BERT tokenizer to this repo.
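A minimal sketch of how to verify this, assuming the repo id is nomic-ai/nomic-bert-2048 and that the `transformers` library is installed (the `trust_remote_code` flag is an assumption for the custom model code):

```python
from transformers import AutoTokenizer, AutoConfig

# Assumed repo id; the discussion only refers to "nomic-bert-2048".
MODEL_ID = "nomic-ai/nomic-bert-2048"

# The tokenizer pushed to the repo is the plain bert-base-uncased tokenizer.
tok = AutoTokenizer.from_pretrained(MODEL_ID)
cfg = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)

print(tok.vocab_size)  # 30522 -- the standard BERT vocabulary
print(cfg.vocab_size)  # 30528 -- padded up to a multiple of 64

# The extra embedding rows (30528 - 30522 = 6) are never emitted by the
# tokenizer; they only exist so the embedding matrix has a throughput-friendly
# size, so the vocabulary mismatch is harmless in practice.
```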
zpn changed discussion status to closed
Thank you for clarifying that!