Building a Video Analysis Application

news 2025/7/13 5:45:26

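In video surveillance and analytics, large volumes of footage often need to be analyzed automatically, for example to find the time at which a certain event occurs. With multimodal large models, this kind of video analysis becomes straightforward.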

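Here I use the Qwen 2.5 VL multimodal model together with the Qwen 3 large language model to analyze videos automatically. The models are deployed on the corporate intranet and served through an API. Because of the server's security design, it is not appropriate to store videos on the server itself for analysis. Instead, keyframes are first extracted from the video, each keyframe image is analyzed through an API call, and the per-keyframe results are then summarized. This pipeline is similar to what the Qwen VL model itself does: when given a video file directly, it also samples frames for analysis. For example, by default Qwen VL samples one frame every 0.5 seconds.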
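As a rough illustration of the fixed-interval sampling idea, the helper below computes which frame indices to read when sampling a video every so many seconds. `sample_indices` and its parameters are hypothetical names used only for this sketch; it is not taken from Qwen VL or from the application code.

```python
def sample_indices(total_frames: int, fps: float, interval_s: float) -> list:
    """Return indices of frames spaced about interval_s seconds apart."""
    if total_frames <= 0 or fps <= 0 or interval_s <= 0:
        return []
    # Convert the time interval to a whole number of frames, at least 1
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))
```

For a 100-frame clip at 25 fps, `sample_indices(100, 25.0, 1.0)` selects frames 0, 25, 50 and 75, that is, one frame per second.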

Keyframe extraction

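The Decord library is used to sample the video at one frame per second, and OpenCV is then used to detect changes between frames. A frame counts as a keyframe only when the change exceeds a threshold. Each keyframe is saved as a JPG file whose name includes its timestamp. The code is as follows: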

import cv2
import decord

parallel_calls = 4  # number of threads Decord uses for decoding

# folder_path, file_name, pic_path and progressbar1 (a progress-bar widget)
# are defined elsewhere in the application.

# Check whether two grayscale frames differ significantly
def motion_detect(prev, current) -> bool:
    diff = cv2.absdiff(prev, current)
    _, thresh = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # A changed region with an area above 30000 pixels counts as significant
    return any(cv2.contourArea(cnt) > 30000 for cnt in contours)

vr = decord.VideoReader(folder_path + file_name, num_threads=parallel_calls)
frame = vr[0]
orig_height, orig_width = frame.shape[:2]
fps = vr.get_avg_fps()
total_frames = len(vr)
total_duration = total_frames / fps

# Downscale to at most 448 pixels wide, preserving the aspect ratio
ratio = orig_height / orig_width
if orig_width > 448:
    width = 448
    height = int(width * ratio)
else:
    width = orig_width
    height = orig_height

start_frame = 0
files_list = []

# Step through the video about one frame per second (step of at least 1 frame)
for i in range(0, total_frames, max(1, int(fps))):
    progressbar1.progress(i/total_frames)
    frame = cv2.cvtColor(vr[i].asnumpy(), cv2.COLOR_RGB2BGR)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if i == 0:
        prev_gray = gray
    # Compare against the reference frame: the first frame, then the most recent keyframe
    flag = motion_detect(prev_gray, gray)
    if flag:
        prev_gray = gray.copy()
        time_idx = int(i/fps)  # timestamp in seconds, embedded in the file name
        filename = f"{folder_path}{pic_path}test_{str(time_idx)}.jpg"
        resized_img = cv2.resize(frame, (width, height), interpolation=cv2.INTER_LINEAR)
        cv2.imwrite(filename, resized_img, [cv2.IMWRITE_JPEG_QUALITY, 95])
        files_list.append(filename)
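Because the timestamp is carried only in the file name, later stages have to parse it back out. Below is a minimal sketch of that round trip, assuming the `test_<second>.jpg` naming used above. `keyframe_name` and `keyframe_second` are hypothetical helpers, not part of the original code.

```python
import os
import re
from typing import Optional

# Matches names such as "test_42.jpg" at the end of a path
_SECOND_RE = re.compile(r"test_(\d+)\.jpg$")

def keyframe_name(folder: str, second: int) -> str:
    """Build the file path for the keyframe captured at the given second."""
    return os.path.join(folder, f"test_{second}.jpg")

def keyframe_second(path: str) -> Optional[int]:
    """Recover the second from a keyframe path, or None if it does not match."""
    match = _SECOND_RE.search(path)
    return int(match.group(1)) if match else None
```

Sorting with `sorted(paths, key=keyframe_second)` then orders keyframes numerically, so `test_2.jpg` comes before `test_10.jpg`, which a plain string sort would not guarantee.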

Keyframe content analysis

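Once the keyframes have been extracted, their content can be analyzed with the Qwen VL multimodal model. The implementation is below; to improve throughput, it uses multiprocessing to issue the API calls in parallel.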

import base64
import json
import multiprocessing
import re

import requests

max_retries = 3
timeout = 30
# Assumed pattern: extracts the second from file names such as test_12.jpg
pattern = r"test_(\d+)\.jpg"

# auth, model_vl, temperature, top_p, multimodal_api_url, files_list, parallel_calls,
# progressbar2 and placeholder3 are defined elsewhere in the application.

def call_api(url, filename, result_queue):
    with open(filename, 'rb') as f:
        img_bytes = f.read()
    second = re.findall(pattern, filename)[0]  # which second of the video this frame shows
    base64_str = base64.b64encode(img_bytes).decode("utf-8")
    image_url = f"data:image/jpeg;base64,{base64_str}"
    header = {"Authorization": auth}
    data = {
        "model": model_vl,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Prompt: "Please describe the content of the image."
                    {"type": "text", "text": "请描述画面的内容。"},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ],
        "temperature": temperature,
        "top_p": top_p
    }
    for _ in range(max_retries):
        try:
            response = requests.post(url, json=data, timeout=timeout, headers=header)
            content = response.json()["choices"][0]["message"]["content"]
            response.close()
            result_queue.put((response.status_code, second, content))
            return
        except Exception:
            pass  # request or parsing failed; try again
    # Every attempt failed: report a single failure result
    result_queue.put((0, second, "API call failed"))

responses = []
# Process the keyframes in groups of parallel_calls, one process per image
groups = [files_list[i:i+parallel_calls] for i in range(0, len(files_list), parallel_calls)]
for idx, group in enumerate(groups):
    if idx > 0:
        progressbar2.progress(idx/len(groups))
    result_queue = multiprocessing.Queue()
    processes = []
    for filename in group:
        p = multiprocessing.Process(target=call_api, args=(multimodal_api_url, filename, result_queue))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    while not result_queue.empty():
        status_code, second, result = result_queue.get()
        placeholder3.text(f"Second {second}: {result}")
        if status_code == 200:
            # The key "内容" means "content"
            responses.append({"id": int(second), "内容": result})

# Order the descriptions by time and serialize each one as a JSON string
# (the key "时间" means "time"; the value reads "second N")
sorted_responses = sorted(responses, key=lambda x: x["id"])
responses = []
for item in sorted_responses:
    responses.append(json.dumps({"时间": f"第{item['id']}秒", "内容": item['内容']}, ensure_ascii=False))
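Since each worker spends most of its time waiting on an HTTP response, a thread pool is one possible alternative to starting a process per image. The sketch below is not the author's method: it swaps in `concurrent.futures.ThreadPoolExecutor` from the standard library, and `describe_frame` is a hypothetical stand-in for the real API call so that the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor

def describe_frame(filename: str) -> dict:
    # Hypothetical stand-in for the HTTP request to the multimodal model.
    # Assumes file names of the form test_<second>.jpg.
    second = int(filename.rsplit("_", 1)[1].split(".")[0])
    return {"id": second, "content": f"description of {filename}"}

def analyze_frames(filenames, worker=describe_frame, max_workers=4):
    """Run worker on every file concurrently and return results sorted by second."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, filenames))
    return sorted(results, key=lambda r: r["id"])
```

With a pool, results come back as return values, so no queue has to be drained and the number of concurrent requests is capped by `max_workers`. Whether this is faster than separate processes depends on the server and network, which this sketch does not measure.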

Video content analysis

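After every keyframe has been analyzed, the per-keyframe descriptions can be combined to summarize the whole video. I use Qwen 3 for this step because it performs better on text. Qwen 3 has a limited input length, so when there are too many keyframe descriptions they need to be split into slices, analyzed slice by slice, and then merged. Here the slice size is set to 500 keyframes (roughly 10 minutes of video). The code is as follows: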

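import json

import requests

slices_num = 500  # number of keyframes per slice

# auth, model_llm, temperature, top_p, chat_api_url, prompt and responses
# are defined earlier in the application.

def video_analysis(chat_api_url, message):
    header = {"Authorization": auth}
    data = {
        "model": model_llm,
        "messages": [{"role": "user", "content": message}],
        "temperature": temperature,
        "top_p": top_p
    }
    response = requests.post(chat_api_url, json=data, headers=header)
    reply = response.json()["choices"][0]["message"]["content"]
    return response.status_code, reply

video_analysis_slice = []
for i in range(0, len(responses), slices_num):
    responses_slice = responses[i:i+slices_num]
    # Prompt: the keyframe descriptions follow, "时间" is the second and "内容" is what
    # the frame shows; summarize the video and answer the question in {prompt}.
    # Only the current slice is sent, not the full list.
    message = f"以下内容是视频里面每个时间点的内容描述，其中时间表示是第几秒发生的画面，内容表示该画面的内容。请根据这些内容对视频进行归纳，回答问题:{prompt}\n. 视频的内容是###" + "\n".join(responses_slice) + "###"
    status_code, reply = video_analysis(chat_api_url, message)
    # "片段id" means "slice id"; "片段内容" means "slice content"
    video_analysis_slice.append(json.dumps({"片段id": i // slices_num, "片段内容": reply}, ensure_ascii=False))

# With more than one slice, merge the per-slice summaries in a final call
if len(video_analysis_slice) > 1:
    # Joined outside the f-string: nested double quotes and backslashes inside
    # an f-string expression are a syntax error before Python 3.12
    slices_text = "\n".join(video_analysis_slice)
    message = f"以下内容是视频里面多个片段的内容描述，其中片段id表示是第几个片段，片段内容表示该视频片段的内容。请根据这些片段内容来对这个完整的视频内容进行归纳，并回答问题:{prompt}\n. 视频片段的内容是###{slices_text}###"
    status_code, reply = video_analysis(chat_api_url, message)

print(f"Video summary:\n{reply}")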
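The slice-then-merge flow can be exercised without a model by passing in a stub. Below is a minimal sketch, assuming a caller-supplied `summarize` function that takes a prompt string and returns text. `chunked` and `summarize_in_slices` are hypothetical names, not part of the original code.

```python
from typing import Callable, List

def chunked(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive slices of at most size elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]

def summarize_in_slices(descriptions: List[str],
                        summarize: Callable[[str], str],
                        slice_size: int = 500) -> str:
    """Summarize each slice, then merge the partial summaries if there are several."""
    partial = [summarize("\n".join(chunk)) for chunk in chunked(descriptions, slice_size)]
    if not partial:
        return ""          # nothing to summarize
    if len(partial) == 1:
        return partial[0]  # a single slice needs no merge step
    return summarize("\n".join(partial))
```

With 1200 descriptions and a slice size of 500, `summarize` is called three times for the slices and once more for the merge. A longer video could need the merge step to be sliced as well, which this sketch does not handle.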