Fine-tuned model generates sequences very different from those in the fine-tuning set

#11 opened by atqamar

We fine-tuned a ZymCTRL model using EC 4.2.1.1 (carbonic anhydrase) as the context label, with 131 carbonic anhydrase sequences that are all highly similar and roughly 190 residues long.

However, when we generate sequences with the fine-tuned model using EC 4.2.1.1 as the context label, the results differ substantially from the training set: the generated sequences show an average Levenshtein distance of ~62 to the training sequences.
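For concreteness, the kind of generation-and-scoring loop we mean looks roughly like the sketch below. The checkpoint path, the bare EC-number prompt, and the sampling parameters are illustrative assumptions, not our exact script:

```python
# Minimal sketch: sample sequences from the fine-tuned checkpoint with the EC
# label as the prompt, then measure edit distance to the fine-tuning set.
# The checkpoint path, prompt format, and sampling settings are placeholders.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "./zymctrl-ca-finetuned"  # placeholder path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path).to(device)

inputs = tokenizer("4.2.1.1", return_tensors="pt").to(device)  # EC label as context
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=9,                  # illustrative sampling settings
    repetition_penalty=1.2,
    max_length=400,           # ~190 residues plus special tokens
    num_return_sequences=20,
    pad_token_id=tokenizer.eos_token_id,
)
# Decoded text may still carry the EC prefix / separator tokens, which should be
# stripped before computing distances.
generated = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def mean_distance_to_training(generated_seqs, training_seqs):
    """Average distance from each generated sequence to its nearest training sequence."""
    nearest = [min(levenshtein(g, t) for t in training_seqs) for g in generated_seqs]
    return sum(nearest) / len(nearest)


# training_seqs would hold the 131 carbonic anhydrase sequences used for fine-tuning:
# print(mean_distance_to_training(generated, training_seqs))
```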

What adjustments can we make to obtain generated sequences more similar to those used in the fine-tuning step?

AI for protein design org

Hi atqamar,

Sorry for the late response. I'm surprised this is the case. How long did you train for? Are the training curves looking good?
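In case it helps, if the fine-tuning was done with the transformers Trainer, one quick way to look at the loss curve is a sketch like the one below (the output directory path is an assumption):

```python
# Sketch: plot the training loss logged by a transformers Trainer run, either
# from a live Trainer object (trainer.state.log_history) or from the
# trainer_state.json written alongside checkpoints.
import json
import matplotlib.pyplot as plt

with open("output_dir/trainer_state.json") as f:  # placeholder path
    history = json.load(f)["log_history"]

steps = [e["step"] for e in history if "loss" in e]
losses = [e["loss"] for e in history if "loss" in e]

plt.plot(steps, losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.show()
```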
