Feature extraction suitability?

#52
by ivoras - opened

Does it make sense to use gemma-2b for feature extraction / generation of embeddings for vector similarity search?

I'm generating vectors with:

import os
from transformers import pipeline

def dataset():
    # Yield inputs one at a time so the pipeline can stream them.
    for x in data:
        yield x

# 'token' is the kwarg pipeline() expects for Hugging Face auth
# ('access_token' is not a recognized argument).
p = pipeline('feature-extraction', framework='pt', model='google/gemma-2b',
             device='cuda', token=os.environ['HF_TOKEN'])
for i, vec in enumerate(p(dataset())):
    save_vec(i, data[i], vec)

But after the vectors are generated, searching for the nearest vectors to the query vector (by L2 distance) returns gibberish. The exact same code works fine with other models, including BERT-based ones and Phi-2.
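
For reference, the feature-extraction pipeline returns per-token hidden states rather than one vector per input, so the output has to be pooled before any distance search. A minimal sketch of that pooling plus a brute-force L2 lookup (the mean_pool and l2_nearest helpers are illustrative, not part of the code above, and the [1, seq_len, hidden_dim] nesting is an assumption that can vary by transformers version):

import numpy as np

def mean_pool(pipeline_output):
    # Assumes the pipeline returns nested lists shaped roughly
    # [1, seq_len, hidden_dim]; average over tokens to get one vector.
    arr = np.asarray(pipeline_output, dtype=np.float32)
    return arr.reshape(-1, arr.shape[-1]).mean(axis=0)

def l2_nearest(query_vec, vectors, k=5):
    # Brute-force L2 nearest-neighbour search; smaller distance = closer.
    dists = np.linalg.norm(vectors - query_vec, axis=1)
    return np.argsort(dists)[:k]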

Google org

Hi @ivoras, Gemma-2B is a decoder-only large language model pre-trained by Google. It is not trained for feature extraction or vector similarity search, so while it can produce hidden-state vectors, those vectors are not optimized to place semantically similar texts close together in embedding space. For vector similarity search it's recommended to use a model explicitly trained for that purpose, such as a sentence-embedding model; encoder models like BERT typically yield more usable vectors out of the box, which matches the behavior you're seeing. Thank you.
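
If it helps, here is a minimal sketch of that recommendation using the sentence-transformers library (the all-MiniLM-L6-v2 checkpoint is just one example of a model trained for sentence similarity; any comparable embedding model would do):

from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding checkpoint can stand in here; this one is
# small and commonly used for semantic search.
model = SentenceTransformer('all-MiniLM-L6-v2')

docs = ["first document", "second document"]
doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode("query text", convert_to_tensor=True)

# Cosine similarity: higher score = more similar.
scores = util.cos_sim(query_emb, doc_emb)[0]
print(docs[int(scores.argmax())])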
