
A Guide to Using the MTEB Evaluation Benchmark

news 2025/7/10 19:47:22

Contents

- Introduction
- Evaluation data

Introduction

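Text embeddings are usually evaluated on a small set of datasets from a single task, which does not cover how they might be used in other tasks. It is unclear whether embeddings that are state of the art on semantic textual similarity (STS) work equally well for other tasks such as clustering or reranking. This makes progress in the field hard to track, because new models keep being proposed without proper evaluation.
To address this, the Hugging Face team introduced the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks, covering 58 datasets and 112 languages, making it the most comprehensive text embedding benchmark to date.
MTEB source code: https://github.com/embeddings-benchmark/mteb
MTEB paper: https://arxiv.org/abs/2210.07316
MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard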

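Evaluation data

For well-known reasons, the Hugging Face site cannot be accessed directly, so this article provides a fairly convenient mirror-based workaround for downloading the datasets.

mteb 1.12.4 uses ISO language codes, so the task_langs parameter no longer behaves as expected there. For now, this article pins version 1.1.1.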
pip install mteb==1.1.1
pip install C_MTEB
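
Because the script in this article depends on the pinned version, it can help to confirm which mteb version is actually installed before running anything. The following is a minimal sketch using only the standard library; the helper name check_mteb_version and the message wording are assumptions made for illustration and are not part of mteb.

```python
from importlib.metadata import PackageNotFoundError, version

PINNED_MTEB = "1.1.1"  # the version this article pins


def check_mteb_version(installed=None):
    """Return a short status message about the installed mteb version.

    `installed` can be passed explicitly (useful for testing); when it is
    None, the version is looked up from the current environment.
    """
    if installed is None:
        try:
            installed = version("mteb")
        except PackageNotFoundError:
            return "mteb is not installed"
    if installed == PINNED_MTEB:
        return "ok: mteb {}".format(installed)
    return "warning: mteb {} installed, article assumes {}".format(
        installed, PINNED_MTEB
    )


if __name__ == "__main__":
    print(check_mteb_version())
```

A plain string comparison is used on purpose: the goal is only to flag any deviation from the pinned version, not to decide which versions are compatible.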

# -*- coding: utf-8 -*-
# Author  : liyanpeng
# Email   : yanpeng.li@cumt.edu.cn
# Datetime: 2024/5/28 18:23
# Filename: download_data.py
import os
import subprocess

from mteb import MTEB

os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
data_path = '/root/data3/liyanpeng/hf_data'


def show_dataset():
    evaluation = MTEB(task_langs=["zh", "zh-CN"])
    dataset_list = []
    for task in evaluation.tasks:
        if task.description.get('name') not in dataset_list:
            dataset_list.append(task.description.get('name'))
            desc = 'name: {}\t\thf_name: {}\t\ttype: {}\t\tcategory: {}'.format(
                task.description.get('name'), task.description.get('hf_hub_name'),
                task.description.get('type'), task.description.get('category'),
            )
            print(desc)
    print(len(dataset_list))


def download_dataset():
    evaluation = MTEB(task_langs=["zh", "zh-CN"])
    err_list = []
    for task in evaluation.tasks:
        # task.load_data()
        # https://huggingface.co/datasets/
        task_name = task.description.get('hf_hub_name')
        print(task_name)
        cmd = [
            'huggingface-cli', 'download', '--repo-type', 'dataset', '--resume-download',
            '--local-dir-use-symlinks', 'False', task_name,
            '--local-dir', os.path.join(data_path, task_name),
        ]
        try:
            subprocess.run(cmd, check=True)
        except subprocess.CalledProcessError:
            err_list.append(task_name)
            print("{} is error".format(task_name))
    if err_list:
        print('download failed: \n', '\n'.join(err_list))
    else:
        print('download success.')


if __name__ == '__main__':
    download_dataset()
    show_dataset()
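
The script above records failed downloads but does not retry them, and mirror downloads can fail intermittently. The following is a rough sketch of one way to add retries. The helper names build_download and download_with_retry, the attempts parameter, and the injectable runner argument are all assumptions introduced here for illustration. The sketch passes the mirror endpoint through the subprocess environment instead of modifying os.environ globally, and it uses only the --repo-type and --local-dir options of huggingface-cli download.

```python
import os
import subprocess

MIRROR = "https://hf-mirror.com"  # mirror endpoint used in this article


def build_download(repo_id, data_root, endpoint=MIRROR):
    """Return (cmd, env) for downloading one dataset repo through the mirror."""
    cmd = [
        "huggingface-cli", "download",
        "--repo-type", "dataset",
        repo_id,
        "--local-dir", os.path.join(data_root, repo_id),
    ]
    # Copy the current environment and set HF_ENDPOINT only for the child process.
    env = dict(os.environ, HF_ENDPOINT=endpoint)
    return cmd, env


def download_with_retry(repo_id, data_root, attempts=3, runner=subprocess.run):
    """Try the download up to `attempts` times; return True on success.

    `runner` defaults to subprocess.run and exists only so the function
    can be exercised without network access.
    """
    cmd, env = build_download(repo_id, data_root)
    for attempt in range(1, attempts + 1):
        try:
            runner(cmd, check=True, env=env)
            return True
        except subprocess.CalledProcessError:
            print("{}: attempt {}/{} failed".format(repo_id, attempt, attempts))
    return False
```

Inside download_dataset, the subprocess.run call could be replaced with download_with_retry(task_name, data_path), appending task_name to err_list when it returns False.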