
LLMs / Embedding: A Detailed Guide to Qwen3 Embedding (Overview, Installation and Usage, and Application Cases)


Contents

Overview of Qwen3 Embedding

1. Features

2. Model List

3. Evaluation Results

MTEB (Multilingual)

MTEB (Eng v2)

C-MTEB (MTEB Chinese)

Reranker

How to Use Qwen3 Embedding

1. Installation

2. Usage

2.1 Using the Text Embedding Models

Usage with Transformers

Usage with vLLM

Usage with Sentence Transformers

2.2 Using the Text Reranking Models

Usage with Transformers

Usage with vLLM

Application Cases


Overview of Qwen3 Embedding

Released in June 2025, the Qwen3 Embedding model series is the latest model family in the Qwen lineup, designed specifically for text embedding and ranking tasks. Built on the dense foundation models of the Qwen3 series, it provides text embedding and reranking models in a range of sizes (0.6B, 4B, and 8B). The series inherits the exceptional multilingual capability, long-text understanding, and reasoning ability of its foundation models, and represents a significant advance across multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.

GitHub: https://github.com/QwenLM/Qwen3-Embedding

1. Features

>> Exceptional versatility: The embedding models achieve state-of-the-art performance across a wide range of downstream application evaluations. As of June 5, 2025, the 8B embedding model ranks No. 1 on the MTEB multilingual leaderboard (score 70.58), and the reranking models excel in a variety of text retrieval scenarios.

>> Comprehensive flexibility: The Qwen3 Embedding series offers a full range of sizes (from 0.6B to 8B) for both embedding and reranking models, covering use cases that prioritize either efficiency or effectiveness, and developers can combine the two modules seamlessly. In addition, the embedding models allow flexible output dimensions across the full range, and both the embedding and reranking models support user-defined instructions to boost performance on specific tasks, languages, or scenarios.

>> Multilingual capability: Leveraging the multilingual capability of the Qwen3 models, the Qwen3 Embedding series supports over 100 languages, including a variety of programming languages, and delivers strong multilingual, cross-lingual, and code retrieval performance.

>> MRL support: MRL (Matryoshka Representation Learning) support indicates whether an embedding model supports custom dimensions for the final embedding.

>> Instruction aware: Instruction awareness indicates whether an embedding or reranking model supports customizing the input instruction for different tasks (see the sketch after this list).
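To illustrate the last two points, below is a minimal sketch of requesting a custom embedding dimension together with a custom task instruction via Sentence Transformers (assuming sentence-transformers>=2.7.0; the 256-dimension truncation and the instruction text are illustrative choices, not fixed values):

from sentence_transformers import SentenceTransformer

# MRL: `truncate_dim` keeps only the first 256 components of each embedding
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", truncate_dim=256)

# Instruction aware: prepend a task-specific instruction to the query
prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:"
emb = model.encode(["What is the capital of China?"], prompt=prompt)
print(emb.shape)  # (1, 256)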

2. Model List

| Model Type | Model | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruction Aware |
|---|---|---|---|---|---|---|---|
| Text Embedding | Qwen3-Embedding-0.6B | 0.6B | 28 | 32K | 1024 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-4B | 4B | 36 | 32K | 2560 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-8B | 8B | 36 | 32K | 4096 | Yes | Yes |
| Text Reranking | Qwen3-Reranker-0.6B | 0.6B | 28 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-4B | 4B | 36 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-8B | 8B | 36 | 32K | - | - | Yes |

3. Evaluation Results

The Qwen3 Embedding series achieves excellent results on the MTEB multilingual leaderboard; detailed results are given in the tables below.

MTEB (Multilingual)

| Model | Size | Mean (Task) | Mean (Type) | Bitext Mining | Class. | Clust. | Inst. Retri. | Multi. Class. | Pair. Class. | Rerank | Retri. | STS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NV-Embed-v2 | 7B | 56.29 | 49.58 | 57.84 | 57.29 | 40.80 | 1.04 | 18.63 | 78.94 | 63.82 | 56.72 | 71.10 |
| GritLM-7B | 7B | 60.92 | 53.74 | 70.53 | 61.83 | 49.75 | 3.45 | 22.77 | 79.94 | 63.78 | 58.31 | 73.33 |
| BGE-M3 | 0.6B | 59.56 | 52.18 | 79.11 | 60.35 | 40.88 | -3.11 | 20.1 | 80.76 | 62.79 | 54.60 | 74.12 |
| multilingual-e5-large-instruct | 0.6B | 63.22 | 55.08 | 80.13 | 64.94 | 50.75 | -0.40 | 22.91 | 80.86 | 62.61 | 57.12 | 76.81 |
| gte-Qwen2-1.5B-instruct | 1.5B | 59.45 | 52.69 | 62.51 | 58.32 | 52.05 | 0.74 | 24.02 | 81.58 | 62.58 | 60.78 | 71.61 |
| gte-Qwen2-7B-instruct | 7B | 62.51 | 55.93 | 73.92 | 61.55 | 52.77 | 4.94 | 25.48 | 85.13 | 65.55 | 60.08 | 73.98 |
| text-embedding-3-large | - | 58.93 | 51.41 | 62.17 | 60.27 | 46.89 | -2.68 | 22.03 | 79.17 | 63.89 | 59.27 | 71.68 |
| Cohere-embed-multilingual-v3.0 | - | 61.12 | 53.23 | 70.50 | 62.95 | 46.89 | -1.89 | 22.74 | 79.88 | 64.07 | 59.16 | 74.80 |
| gemini-embedding-exp-03-07 | - | 68.37 | 59.59 | 79.28 | 71.82 | 54.59 | 5.18 | 29.16 | 83.63 | 65.58 | 67.71 | 79.40 |
| Qwen3-Embedding-0.6B | 0.6B | 64.33 | 56.00 | 72.22 | 66.83 | 52.33 | 5.09 | 24.59 | 80.83 | 61.41 | 64.64 | 76.17 |
| Qwen3-Embedding-4B | 4B | 69.45 | 60.86 | 79.36 | 72.33 | 57.15 | 11.56 | 26.77 | 85.05 | 65.08 | 69.60 | 80.86 |
| Qwen3-Embedding-8B | 8B | 70.58 | 61.69 | 80.89 | 74.00 | 57.65 | 10.06 | 28.66 | 86.40 | 65.63 | 70.88 | 81.08 |

MTEB (Eng v2)

| Model | Param. | Mean (Task) | Mean (Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
|---|---|---|---|---|---|---|---|---|---|---|
| multilingual-e5-large-instruct | 0.6B | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| NV-Embed-v2 | 7.8B | 69.81 | 65.00 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| GritLM-7B | 7.2B | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.20 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| stella_en_1.5B_v5 | 1.5B | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| gte-Qwen2-7B-instruct | 7.6B | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| gemini-embedding-exp-03-07 | - | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| Qwen3-Embedding-0.6B | 0.6B | 70.70 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| Qwen3-Embedding-4B | 4B | 74.60 | 68.10 | 89.84 | 57.51 | 87.01 | 50.76 | 68.46 | 88.72 | 34.39 |
| Qwen3-Embedding-8B | 8B | 75.22 | 68.71 | 90.43 | 58.57 | 87.52 | 51.56 | 69.44 | 88.58 | 34.83 |

C-MTEB (MTEB Chinese)

| Model | Param. | Mean (Task) | Mean (Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
|---|---|---|---|---|---|---|---|---|---|
| multilingual-e5-large-instruct | 0.6B | 58.08 | 58.24 | 69.80 | 48.23 | 64.52 | 57.45 | 63.65 | 45.81 |
| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.12 | 67.79 | 72.53 | 54.61 | 79.5 | 68.21 | 71.86 | 60.05 |
| gte-Qwen2-7B-instruct | 7.6B | 71.62 | 72.19 | 75.77 | 66.06 | 81.16 | 69.24 | 75.70 | 65.20 |
| ritrieve_zh_v1 | 0.3B | 72.71 | 73.85 | 76.88 | 66.5 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-0.6B | 0.6B | 66.33 | 67.45 | 71.40 | 68.74 | 76.42 | 62.58 | 71.03 | 54.52 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |

Reranker

| Model | Param | MTEB-R | CMTEB-R | MMTEB-R | MLDR | MTEB-Code | FollowIR |
|---|---|---|---|---|---|---|---|
| Qwen3-Embedding-0.6B | 0.6B | 61.82 | 71.02 | 64.64 | 50.26 | 75.41 | 5.09 |
| Jina-multilingual-reranker-v2-base | 0.3B | 58.22 | 63.37 | 63.73 | 39.66 | 58.98 | -0.68 |
| gte-multilingual-reranker-base | 0.3B | 59.51 | 74.08 | 59.44 | 66.33 | 54.18 | -1.64 |
| BGE-reranker-v2-m3 | 0.6B | 57.03 | 72.16 | 58.36 | 59.51 | 41.38 | -0.01 |
| Qwen3-Reranker-0.6B | 0.6B | 65.80 | 71.31 | 66.36 | 67.28 | 73.42 | 5.41 |
| Qwen3-Reranker-4B | 4B | 69.76 | 75.94 | 72.74 | 69.97 | 81.20 | 14.84 |
| Qwen3-Reranker-8B | 8B | 69.02 | 77.45 | 72.94 | 70.19 | 81.22 | 8.05 |

How to Use Qwen3 Embedding

1. Installation

Model downloads: https://huggingface.co/collections/Qwen/qwen3-embedding-6841b2055b99c44d9a4c371f
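Assuming a standard pip environment, the dependencies used by the samples below can be installed with `pip install "transformers>=4.51.0" "sentence-transformers>=2.7.0" "vllm>=0.8.5"` (the version pins are taken from the requirement comments in the sample code).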

2. Usage

The following sample code shows how to load and use the Qwen3-Embedding models (with Transformers, vLLM, and Sentence Transformers).

2.1 Using the Text Embedding Models

Usage with Transformers

# Requires transformers>=4.51.0
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Embedding-0.6B', padding_side='left')
model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-0.6B')
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-0.6B', attn_implementation="flash_attention_2", torch_dtype=torch.float16).cuda()

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7645568251609802, 0.14142508804798126], [0.13549736142158508, 0.5999549627304077]]
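Design note: the Qwen3 embedding models use last-token pooling, which is why the tokenizer is loaded with padding_side='left'. With left padding, the final position of every sequence in the batch holds a real token, so last_token_pool can simply take the last hidden state; the function also handles right padding by indexing each sequence at its own length.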
Usage with vLLM

# Requires vllm>=0.8.5
import torch
from vllm import LLM

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
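The printed scores differ from the Transformers output above only in the later decimal places; small numerical deviations between inference backends are expected and do not change the ranking.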

Usage with Sentence Transformers
# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# together with setting `padding_side` to "left":
# model = SentenceTransformer(
#     "Qwen/Qwen3-Embedding-0.6B",
#     model_kwargs={"attn_implementation": "flash_attention_2", "device_map": "auto"},
#     tokenizer_kwargs={"padding_side": "left"},
# )

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

# Encode the queries and documents. Note that queries benefit from using a prompt
# Here we use the prompt called "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.7646, 0.1414], [0.1355, 0.6000]])
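Note that `model.similarity` computes cosine similarity by default. Because the Transformers example above L2-normalizes its embeddings before the dot product, the two examples compute the same quantity, which is why the score matrices agree up to rounding.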

2.2 Using the Text Reranking Models

Usage with Transformers

# Requires transformers>=4.51.0
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    output = "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(
        instruction=instruction, query=query, doc=doc)
    return output

def process_inputs(pairs):
    inputs = tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False,
        max_length=max_length - len(prefix_tokens) - len(suffix_tokens)
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs

@torch.no_grad()
def compute_logits(inputs, **kwargs):
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-0.6B", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B").eval()
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B", torch_dtype=torch.float16, attn_implementation="flash_attention_2").cuda().eval()

token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")
max_length = 8192

prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)

task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]
pairs = [format_instruction(task, query, doc) for query, doc in zip(queries, documents)]

# Tokenize the input texts
inputs = process_inputs(pairs)
scores = compute_logits(inputs)
print("scores: ", scores)
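Each returned score is the probability of the "yes" token under a softmax over the "yes"/"no" logits at the final position, i.e. the model's estimate that the document meets the instruction for the query; higher means more relevant.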

Usage with vLLM
# Requires vllm>=0.8.5
import math

import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel
from vllm.inputs.data import TokensPrompt

def format_instruction(instruction, query, doc):
    text = [
        {"role": "system", "content": "Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\"."},
        {"role": "user", "content": f"<Instruct>: {instruction}\n\n<Query>: {query}\n\n<Document>: {doc}"}
    ]
    return text

def process_inputs(pairs, instruction, max_length, suffix_tokens):
    messages = [format_instruction(instruction, query, doc) for query, doc in pairs]
    messages = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=False, enable_thinking=False
    )
    # Truncate each prompt, then append the fixed assistant suffix tokens
    messages = [ele[:max_length] + suffix_tokens for ele in messages]
    messages = [TokensPrompt(prompt_token_ids=ele) for ele in messages]
    return messages

def compute_logits(model, messages, sampling_params, true_token, false_token):
    outputs = model.generate(messages, sampling_params, use_tqdm=False)
    scores = []
    for i in range(len(outputs)):
        final_logits = outputs[i].outputs[0].logprobs[-1]
        if true_token not in final_logits:
            true_logit = -10
        else:
            true_logit = final_logits[true_token].logprob
        if false_token not in final_logits:
            false_logit = -10
        else:
            false_logit = final_logits[false_token].logprob
        true_score = math.exp(true_logit)
        false_score = math.exp(false_logit)
        score = true_score / (true_score + false_score)
        scores.append(score)
    return scores

number_of_gpu = torch.cuda.device_count()
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Reranker-0.6B')
model = LLM(model='Qwen/Qwen3-Reranker-0.6B', tensor_parallel_size=number_of_gpu,
            max_model_len=10000, enable_prefix_caching=True, gpu_memory_utilization=0.8)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
max_length = 8192
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
true_token = tokenizer("yes", add_special_tokens=False).input_ids[0]
false_token = tokenizer("no", add_special_tokens=False).input_ids[0]
sampling_params = SamplingParams(
    temperature=0, max_tokens=1,
    logprobs=20, allowed_token_ids=[true_token, false_token],
)

task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]
pairs = list(zip(queries, documents))
inputs = process_inputs(pairs, task, max_length - len(suffix_tokens), suffix_tokens)
scores = compute_logits(model, inputs, sampling_params, true_token, false_token)
print('scores', scores)

destroy_model_parallel()
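Design note: the SamplingParams above generate exactly one token (max_tokens=1) constrained to the "yes"/"no" token ids, and compute_logits turns the returned logprobs into a normalized relevance probability. enable_prefix_caching=True lets vLLM reuse the KV cache for the shared system-prompt prefix across all query-document pairs.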

Application Cases

The Qwen3 Embedding series is suitable for a wide range of text embedding and ranking tasks, including:
>> Text retrieval: retrieve documents relevant to a query.
>> Code retrieval: retrieve code snippets relevant to a query.
>> Text classification: assign texts to categories.
>> Text clustering: group similar texts together.
>> Bitext mining: find corresponding text pairs across two collections.
A minimal retrieve-then-rerank sketch that combines the embedding and reranking models follows below.
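The sketch ranks a small corpus with the embedding model and then hands the top candidates to the reranker (assumptions: sentence-transformers>=2.7.0 is installed; the corpus, query, and top_k value are illustrative; stage 2 reuses the process_inputs/compute_logits helpers from section 2.2 rather than redefining them here):

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

corpus = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other.",
    "Qwen3 Embedding models support over 100 languages.",
]
query = "What is the capital of China?"

# Stage 1: embedding-based recall, ranking the corpus by cosine similarity
query_emb = embedder.encode([query], prompt_name="query")
corpus_emb = embedder.encode(corpus)
scores = embedder.similarity(query_emb, corpus_emb)[0]
top_k = 2
candidate_ids = scores.argsort(descending=True)[:top_k].tolist()
candidates = [corpus[i] for i in candidate_ids]
print("recall candidates:", candidates)

# Stage 2: precision reranking, scoring each (query, candidate) pair with
# Qwen3-Reranker via the helpers from section 2.2:
# pairs = [format_instruction(task, query, doc) for doc in candidates]
# rerank_scores = compute_logits(process_inputs(pairs))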

