OpenSearch 向量搜索与 Qwen3-Embedding 集成示例
本项目演示了如何将 OpenSearch 的 k-NN (k-Nearest Neighbors) 向量搜索功能与通过 OpenAI 兼容接口调用的文本嵌入模型(如 Qwen3-Embedding)相结合,以实现强大的语义搜索。
核心概念
- 文本嵌入 (Text Embedding): 将文本(单词、句子、段落)转换为一个高维的数字向量。语义上相似的文本在向量空间中的距离会更近。
- Qwen3-Embedding: 我们通过阿里云 DashScope 的 OpenAI 兼容接口调用 Qwen 嵌入模型(示例中为 text-embedding-v4),为文本生成这些高质量的向量。
- k-NN 向量搜索: OpenSearch 接收一个查询向量,并利用专门的 k-NN 算法在索引中快速找到与该查询向量最“邻近”的 N 个文档向量,从而实现语义搜索。
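下面是一个极简示意(向量为手工构造的示例数据,并非真实嵌入),用来说明"语义相近的文本在向量空间中距离更近"这一概念:

```python
# 极简示意:用余弦相似度比较向量(示例向量为手工构造,并非真实嵌入)
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vec_cat    = [0.90, 0.10, 0.00]  # "猫" 的示例向量
vec_kitten = [0.85, 0.15, 0.05]  # "小猫" 的示例向量
vec_car    = [0.00, 0.20, 0.95]  # "汽车" 的示例向量

print(cosine_similarity(vec_cat, vec_kitten))  # 相似度高 -> 语义接近
print(cosine_similarity(vec_cat, vec_car))     # 相似度低 -> 语义较远
```

真实场景中,这些向量由嵌入模型生成,OpenSearch 的 k-NN 索引负责在海量向量中高效地找出最邻近的若干个。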
第 1 步:环境准备
在运行脚本之前,请确保完成以下设置。
1.1. 启动 OpenSearch
请确保您已经通过 docker-compose.yml 文件启动了 OpenSearch 和 OpenSearch Dashboards 服务。
```yaml
version: '3.8'
services:
  opensearch-node1:
    image: opensearchproject/opensearch:2.19.1
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.type=single-node
      - bootstrap.memory_lock=true # along with the memlock settings below.
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
      - "DISABLE_SECURITY_PLUGIN=true" # Disables security plugin for easier local development
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the OpenSearch user, set to at least 65536 on modern systems.
        hard: 65536
    volumes:
      - opensearch-data:/usr/share/opensearch/data
    ports:
      - 9200:9200
      - 9600:9600 # required for Performance Analyzer
    networks:
      - opensearch-net
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.19.1
    container_name: opensearch-dashboards
    ports:
      - 5601:5601
    expose:
      - "5601"
    environment:
      OPENSEARCH_HOSTS: '["http://opensearch-node1:9200"]'
      DISABLE_SECURITY_DASHBOARDS_PLUGIN: "true" # Disables security plugin for easier local development
    networks:
      - opensearch-net
    depends_on:
      - opensearch-node1
volumes:
  opensearch-data:
networks:
  opensearch-net:
```
```bash
docker-compose up -d
```
1.2. 安装 Python 依赖库
此脚本需要 opensearch-py、openai 和 python-dotenv 库。使用 uv(或 pip)安装它们:

```bash
uv pip install opensearch-py openai python-dotenv
```
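安装完成后,可以用下面这段小脚本快速验证能否连上集群(示意,假设 OpenSearch 按上文配置以无认证、无 SSL 的方式运行在 localhost:9200):

```python
# 连接本地 OpenSearch 并打印集群信息(假设无认证、无 SSL,端口 9200)
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    use_ssl=False,
    verify_certs=False,
)
print(client.info())  # 正常情况下会返回集群名称和版本信息
```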
1.3. 设置 API 密钥(DashScope / Qwen)
这是一个关键步骤!
- 在项目根目录下创建一个名为 .env 的新文件。
- 打开 .env 文件,并按以下格式添加您的 DashScope API 密钥:

```
QWEN_API_KEY=sk-YourActualOpenAIKeyHere
QWEN_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
QWEN_MODEL_NAME=qwen-turbo
QWEN_EMBEDDING_MODEL_NAME=text-embedding-v4
```
重要提示: 请将 sk-YourActualOpenAIKeyHere 替换为您自己的、真实的 API 密钥。脚本会从此文件中自动加载密钥,避免了将其硬编码在代码中的安全风险。
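如果想确认 .env 是否被正确读取,可以参考下面这段小示例(示意,变量名与上文 .env 保持一致):

```python
# 验证 .env 中的配置是否能被 python-dotenv 正确加载
import os
from dotenv import load_dotenv

load_dotenv()  # 默认从当前工作目录读取 .env 文件

for name in ("QWEN_API_KEY", "QWEN_BASE_URL", "QWEN_EMBEDDING_MODEL_NAME"):
    value = os.getenv(name)
    # 只打印是否存在,避免把密钥输出到终端
    print(f"{name}: {'已设置' if value else '未设置'}")
```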
第 2 步:Python 脚本 (opensearch_openai_vector_search.py)
以下是完整的 Python 脚本。它负责:
- 从 .env 文件加载 API 密钥。
- 创建一个匹配嵌入模型向量维度(text-embedding-v4 为 1024)的 OpenSearch 索引。
- 定义一个函数,通过 OpenAI 兼容接口调用嵌入 API,将文本转换为向量。
- 将示例文本转换为向量并存入 OpenSearch。
- 执行一个向量搜索,找到与查询最相关的文档。
```python
# -*- coding: utf-8 -*-
"""
This script demonstrates how to use OpenAI-compatible (Qwen) embeddings for vector search in OpenSearch.

It requires the following libraries:
- opensearch-py
- openai
- python-dotenv

You can install them using: pip install opensearch-py openai python-dotenv

Setup:
1. Make sure your OpenSearch instance is running (e.g., via docker-compose).
2. Create a file named .env in the same directory as this script.
3. Add your API key to the .env file like this:
   QWEN_API_KEY="sk-YourActualOpenAIKeyHere"
"""

import os
import time

from dotenv import load_dotenv
from openai import OpenAI
from opensearchpy import OpenSearch

# --- 1. Configuration ---
# Load environment variables from .env file
load_dotenv()

QWEN_API_KEY = os.getenv("QWEN_API_KEY")
QWEN_BASE_URL = os.getenv("QWEN_BASE_URL")
QWEN_EMBEDDING_MODEL_NAME = os.getenv("QWEN_EMBEDDING_MODEL_NAME")

client_openai = OpenAI(base_url=QWEN_BASE_URL, api_key=QWEN_API_KEY)

# Connect to OpenSearch
client_opensearch = OpenSearch(
    hosts=[{'host': 'localhost', 'port': 9200}],
    http_auth=None,  # No authentication
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
)

# Dimension of vectors produced by text-embedding-v4
VECTOR_DIMENSION = 1024

INDEX_NAME = "my-openai-vector-index"


# --- 2. Embedding Function ---
def get_openai_embedding(text):
    """Generates a vector embedding for the given text using the OpenAI-compatible API."""
    # OpenAI recommends replacing newlines with spaces for better performance
    text = text.replace("\n", " ")
    response = client_openai.embeddings.create(input=[text], model=QWEN_EMBEDDING_MODEL_NAME)
    return response.data[0].embedding


# --- 3. Index Setup ---
def create_index_with_vector_mapping():
    """Creates an OpenSearch index with a mapping for k-NN vector search."""
    if client_opensearch.indices.exists(index=INDEX_NAME):
        print(f"Index '{INDEX_NAME}' already exists. Deleting it.")
        client_opensearch.indices.delete(index=INDEX_NAME)

    settings = {
        "settings": {
            "index": {
                "knn": True,
                "knn.algo_param.ef_search": 100
            }
        },
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "text_vector": {
                    "type": "knn_vector",
                    "dimension": VECTOR_DIMENSION,  # Crucial: Must match the model's output dimension
                    "method": {
                        "name": "hnsw",
                        "space_type": "l2",
                        "engine": "nmslib",
                        "parameters": {
                            "ef_construction": 128,
                            "m": 24
                        }
                    }
                }
            }
        }
    }
    client_opensearch.indices.create(index=INDEX_NAME, body=settings)
    print(f"Index '{INDEX_NAME}' created successfully with dimension {VECTOR_DIMENSION}.")


# --- 4. Indexing Documents ---
def index_documents():
    """Generates vector embeddings for sample documents and indexes them."""
    documents = [
        {"text": "The sky is blue and the sun is bright."},
        {"text": "I enjoy walking in the park on a sunny day."},
        {"text": "Artificial intelligence is transforming many industries."},
        {"text": "The new AI model shows impressive capabilities in natural language understanding."},
        {"text": "My favorite food is pizza, especially with pepperoni."},
        {"text": "I'm planning a trip to Italy to enjoy the local cuisine."}
    ]

    for i, doc in enumerate(documents):
        print(f"Generating embedding for document {i+1}...")
        vector = get_openai_embedding(doc["text"])
        doc_body = {
            "text": doc["text"],
            "text_vector": vector  # The embedding is already a list
        }
        client_opensearch.index(index=INDEX_NAME, body=doc_body, id=i+1, refresh=True)
        print(f"Indexed document {i+1}")
        time.sleep(2)


# --- 5. Vector Search ---
def search_with_vector(query_text, k=3):
    """Performs a k-NN search for the most similar documents using a query embedding."""
    print(f"\n--- Performing k-NN search for: '{query_text}' ---")
    query_vector = get_openai_embedding(query_text)

    search_query = {
        "size": k,
        "query": {
            "knn": {
                "text_vector": {
                    "vector": query_vector,
                    "k": k
                }
            }
        }
    }

    response = client_opensearch.search(index=INDEX_NAME, body=search_query)

    print("Search Results:")
    for hit in response["hits"]["hits"]:
        print(f"  - Score: {hit['_score']:.4f}, Text: {hit['_source']['text']}")


# --- 6. Main Execution ---
if __name__ == "__main__":
    create_index_with_vector_mapping()
    index_documents()

    # Perform a simple vector search
    search_with_vector("intelligent machines")

    # Perform another vector search
    search_with_vector("sunny weather activities")

    # Clean up the index (optional)
    # client_opensearch.indices.delete(index=INDEX_NAME)
    # print(f"\nIndex '{INDEX_NAME}' deleted.")
```
第 3 步:运行脚本
完成上述所有准备工作后,在您的终端中运行以下命令:
```bash
uv run opensearch_openai_vector_search.py
```
输出
OpenSearch Dashboards 可视化
- 登录 OpenSearch Dashboards(http://localhost:5601)
- 创建 index pattern
- 在 Discover 中观察索引数据(也可以用下面的小脚本快速验证)
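除了在 Dashboards 中查看,也可以用一小段脚本确认文档确实写入了索引(示意,假设索引名与上文脚本一致,OpenSearch 运行在本地 9200 端口):

```python
# 确认示例文档已写入索引(假设索引名与上文脚本一致)
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    use_ssl=False,
    verify_certs=False,
)

index_name = "my-openai-vector-index"
print(client.count(index=index_name))                 # 文档总数,预期为 6
print(client.indices.get_mapping(index=index_name))   # 查看 knn_vector 字段的映射
```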
代码链接: https://github.com/zhouruiliangxian/Awesome-demo/blob/main/Database/opensearch_test/opensearch_openai_vector_search.py