Wrong Tokenizer?
#4
by JJinho - opened
The model card states that the 'bert-base-uncased' tokenizer is used, but the paper makes clear that modifications were made (e.g. the vocabulary is padded to a multiple of 64), and in config.json the vocabulary sizes do not match (30528 for nomic-bert-2048 vs. 30522 for bert-base-uncased).
Is this intentional? If not, could the tokenizer also be included? Thank you.
It is just the bert-base tokenizer; we add extra tokens to pad the vocabulary to a multiple of 64, which increases training throughput. Those extra tokens are never used. Either way, I pushed the BERT tokenizer to this repo.
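A minimal sketch of how to verify this, assuming the repo id is nomic-ai/nomic-bert-2048 and that the `transformers` library is installed (the `trust_remote_code` flag is an assumption for the custom model code):

```python
from transformers import AutoTokenizer, AutoConfig

# Assumed repo id; the discussion only refers to "nomic-bert-2048".
MODEL_ID = "nomic-ai/nomic-bert-2048"

# The tokenizer pushed to the repo is the plain bert-base-uncased tokenizer.
tok = AutoTokenizer.from_pretrained(MODEL_ID)
cfg = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)

print(tok.vocab_size)  # 30522 -- the standard BERT vocabulary
print(cfg.vocab_size)  # 30528 -- padded up to a multiple of 64

# The extra embedding rows (30528 - 30522 = 6) are never emitted by the
# tokenizer; they only exist so the embedding matrix has a throughput-friendly
# size, so the vocabulary mismatch is harmless in practice.
```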
zpn changed discussion status to closed
Thank you for clarifying that!