Setting model_max_length in tokenizer appears to have no effect

#14
by fCola - opened

Hi and thanks for the great model!
It seems that the setting needed for long-context embeddings (in plain Transformers) is not taking effect. Doing this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased',
                                          model_max_length=8192)
print(tokenizer.model_max_length)

outputs 512. I also verified that embedding two texts that are identical up to the 512th token produces the same embedding, so anything beyond that point appears to be truncated. It works fine with the Sentence Transformers example, however. Am I doing something wrong? Thanks!
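Continuing from the snippet above, a quick way to see the truncation (the long_text placeholder is illustrative, not from this thread; any input longer than 512 tokens will do):

long_text = "word " * 2000  # comfortably more than 512 tokens
input_ids = tokenizer(long_text, truncation=True)["input_ids"]
print(len(input_ids))  # 512, i.e. truncated at tokenizer.model_max_length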

Nomic AI org

Hm, does it work if you do the following?

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.model_max_length = 8192

Yes, setting it this way makes it work in Transformers too! Should I close this?
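For anyone else who hits this, the working pattern is roughly the sketch below (the long_text placeholder is illustrative); passing max_length=8192 together with truncation=True at call time should also work as a per-call alternative to overriding the attribute:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.model_max_length = 8192  # override after loading

long_text = "word " * 2000
input_ids = tokenizer(long_text, truncation=True)["input_ids"]
print(len(input_ids))  # no longer capped at 512; truncation now happens at 8192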

Nomic AI org

Yes, thanks for bearing with us!

zpn changed discussion status to closed
