Setting model_max_length in tokenizer appears to have no effect

#14
by fCola - opened

Hi and thanks for the great model!
It seems that the setting needed for long-context embeddings (in plain Transformers) is not taking effect. Doing this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased',
                                          model_max_length=8192)
print(tokenizer.model_max_length)

outputs 512. I also verified that embedding two texts that are identical up to the 512th token produces the same embedding, so anything beyond that point appears to be truncated. It works fine with the Sentence Transformers example, however. Am I doing something wrong? Thanks!
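Continuing from the snippet above, a quick way to see the truncation (the long_text placeholder is illustrative, not from this thread; any input longer than 512 tokens will do):

long_text = "word " * 2000  # comfortably more than 512 tokens
input_ids = tokenizer(long_text, truncation=True)["input_ids"]
print(len(input_ids))  # 512, i.e. truncated at tokenizer.model_max_length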

Nomic AI org

Hm, does it work if you do the following?

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.model_max_length = 8192

Yes, setting it this way makes it work in Transformers too! Should I close this?
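For anyone else who hits this, the working pattern is roughly the sketch below (the long_text placeholder is illustrative); passing max_length=8192 together with truncation=True at call time should also work as a per-call alternative to overriding the attribute:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.model_max_length = 8192  # override after loading

long_text = "word " * 2000
input_ids = tokenizer(long_text, truncation=True)["input_ids"]
print(len(input_ids))  # no longer capped at 512; truncation now happens at 8192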

Nomic AI org

Yes, thanks for bearing with us!

zpn changed discussion status to closed
