Retrieval results are mediocre, only on par with BM25

#26
by Stefan8 - opened

When running first-stage retrieval tests on CSL (a pure-Chinese corpus), the results are not great: only about on par with BM25. I'm wondering whether I'm using the model incorrectly; could you give some quick guidance?

Alibaba-NLP org

Could you paste a couple of data samples for reference?

The data is mostly abstracts of scientific papers: https://huggingface.co/datasets/neuclir/csl
corpus:{"id": "csl-387565", "contents": "谈若敖氏\n"'若敖之鬼馁而',也是一件人生的大哀",这句话是说阿Q需要一个女人的念头不仅合符礼制:所谓"不孝有三,无后为大";而且也是十分现实的事情,即自己饿死了尚不要紧,连祖先也没有人供奉了则是一件大事.其实这已经说得非常冠冕堂皇,阿Q私心至多不过这么想:若没有一个女人,"断子绝孙便没有人供一碗饭".自然这就同"若敖之鬼馁而"的境遇一样了.当然无论阿Q是否知书识礼,此话用在阿Q身上就成了歪理,不过歪理对于阿Q就是真理."}
{"id": "csl-093481", "contents": ""(X)整个一(个)Y"格式试析\n"(X)整个一(个)Y"是口语中的一种常见格式.本文讨论该格式的表达功能和句法语义特点,具体分析格式中的Y由哪些成分充当以及X和Y之间的语义关系.文章最后分析"(X)整个一(个)Y"格式的篇章功能和格式中"整个一(个)"的语法化倾向."}
{"id": "csl-170933", "contents": "0-14,迪拜,阿拉伯联合酋长国\n"0-14"大厦位于迪拜商务湾的核心区.大厦的混凝土外壳在作为建筑物支撑结构的同时,还创造出对光线、空气和视线等均通透的多孔织物状立面.21层定制设计的办公空间中没有常规意义上的墙和柱子的阻碍.在地面层,高档专卖店一直贯通到商业湾的海滨演艺广场,将高档消费文化和大众娱乐生活结合起来.地下4层停车场提供了逾400个泊车位."}
Topics:
200 相分离颗粒影响神经退行性疾病 由 RNA 和蛋白质组成的相分离颗粒如何导致神经退行性疾病?
201 小企业经济增长趋势 小企业经济增长趋势如何?
202 分类点云深度学习 我正在寻找利用深度学习方法对点云进行分类的研究。
203 社交网络关系对工作流动性的影响 较弱的社交网络联系是否比较强的联系更有利于工作流动?
204 混沌中的极端事件预测 我正在寻找描述混沌系统中极端事件预测的摘要。
205 中国开放移民 中国开放政策对移民有何影响?

Could you give some pointers on how to reproduce your BEIR results? Are there any configs to watch out for? I'm a research assistant in Jimmy Lin's group, and if the retrieval results look good I'd like to try integrating the model into the pyserini module.

Alibaba-NLP org

We suspect it's a usage issue. Could you paste the code you used for encoding?

No problem. If you need more of the code, please contact me by email: j28min@uwaterloo

import argparse
import torch
from transformers import AutoTokenizer, AutoModel
import json
import os
from tqdm import tqdm

# Set environment variables to optimize CUDA memory management
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# Parse command-line arguments
parser = argparse.ArgumentParser(description="Generate embeddings with configurable batch size and precision.")
parser.add_argument('--batchsize', type=int, default=2, help='Batch size for processing')
parser.add_argument('--half_precision', type=int, choices=[0, 1], default=1, help='Use half precision (1) or full precision (0)')
parser.add_argument('--input_file', type=str, required=True, help='Input JSONL file with id and contents')
parser.add_argument('--output_file', type=str, default='embeddings.jsonl', help='Output file to save embeddings')
args = parser.parse_args()

# Clear CUDA cache before loading the model
torch.cuda.empty_cache()

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Determine precision
precision = torch.float16 if args.half_precision == 1 else torch.float32

# Load the tokenizer and model with the specified precision
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct")
model = AutoModel.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct", torch_dtype=precision).to(device)

# Function to generate embeddings for a batch of texts
def generate_embeddings(texts):
    inputs = tokenizer(texts, return_tensors='pt', truncation=True, padding=True, max_length=128).to(device)
    with torch.no_grad():
        embeddings = model(**inputs).last_hidden_state.mean(dim=1)
    return embeddings.cpu().numpy()  # Convert to numpy array for saving to file

# Process dataset and generate embeddings with a configurable batch size
def process_dataset(input_file, batch_size, output_file):
    with open(output_file, 'w') as outfile:
        batch_texts = []
        batch_doc_ids = []

        with open(input_file, 'r') as infile:
            for line in tqdm(infile):
                doc = json.loads(line)
                batch_texts.append(doc['contents'])
                batch_doc_ids.append(doc['id'])

                # When the batch is full, process it
                if len(batch_texts) == batch_size:
                    embeddings = generate_embeddings(batch_texts)
                    # Zip ids, texts, and embeddings directly; looking each id
                    # up with list.index() is O(n) per document and returns the
                    # wrong text if two documents share an id
                    for doc_id, text, embedding in zip(batch_doc_ids, batch_texts, embeddings):
                        item = {
                            "id": doc_id,
                            "contents": text,
                            "vector": embedding.tolist()
                        }
                        json.dump(item, outfile, ensure_ascii=False)
                        outfile.write('\n')

                    # Clear batch lists
                    batch_texts = []
                    batch_doc_ids = []

                    # Clear CUDA cache periodically to manage memory
                    torch.cuda.empty_cache()

            # Process any remaining documents in the last batch
            if batch_texts:
                embeddings = generate_embeddings(batch_texts)
                for doc_id, text, embedding in zip(batch_doc_ids, batch_texts, embeddings):
                    item = {
                        "id": doc_id,
                        "contents": text,
                        "vector": embedding.tolist()
                    }
                    json.dump(item, outfile, ensure_ascii=False)
                    outfile.write('\n')

    print(f"Embeddings saved to {output_file}")

if __name__ == '__main__':
    # Run the dataset processing with the specified batch size
    process_dataset(args.input_file, args.batchsize, args.output_file)

Search is done with pyserini's faiss-search.
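
For reference, the search side looks roughly like this (a sketch only: 'indexes/csl-faiss' is a placeholder index path, and the AutoQueryEncoder settings just mirror my mean-pooling encoding script above):

from pyserini.search.faiss import FaissSearcher, AutoQueryEncoder

# Query encoder configured to match the encoding script: mean pooling,
# no L2 normalization
encoder = AutoQueryEncoder(encoder_dir='Alibaba-NLP/gte-Qwen2-7B-instruct',
                           pooling='mean', l2_norm=False)
searcher = FaissSearcher('indexes/csl-faiss', encoder)

hits = searcher.search('分类点云深度学习', k=10)
for hit in hits:
    print(f'{hit.docid} {hit.score:.4f}')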

Alibaba-NLP org

Two issues: 1. gte-Qwen2-7B-instruct takes its embedding from the last token position, while the code above uses mean pooling. 2. For query-side encoding, we recommend adding an instruct in the format shown in the README. Please modify your encoding along the lines of the README code:

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define')
]
# No need to add instruction for retrieval documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-Qwen2-7B-instruct', trust_remote_code=True)
model = AutoModel.from_pretrained('Alibaba-NLP/gte-Qwen2-7B-instruct', trust_remote_code=True)

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
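
Concretely, the minimal change to your batch script would look something like this (a sketch, untested: last_token_pool is the function above, tokenizer/model/device come from your script, and the task description is just the README example, so pick one that fits your retrieval task):

import torch.nn.functional as F

task = 'Given a web search query, retrieve relevant passages that answer the query'

def generate_embeddings(texts, is_query=False):
    # Queries get the instruct prefix; documents are encoded as-is
    if is_query:
        texts = [f'Instruct: {task}\nQuery: {t}' for t in texts]
    inputs = tokenizer(texts, return_tensors='pt', truncation=True,
                       padding=True, max_length=8192).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
        # Last-token pooling instead of mean pooling, then L2-normalize
        embeddings = last_token_pool(outputs.last_hidden_state, inputs['attention_mask'])
        embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings.cpu().numpy()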
