
Source Code Analysis of Midscene.js, the Star AI Automated Testing Tool

news 2025/7/8 18:18:28

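Preface

In the earlier article "A Revolution in UI Automated Testing: Hands-On with the New-Generation AI Tool MidScene.js!", we looked at Midscene.js, the AI testing tool released by ByteDance. Its intelligent parsing, test execution, and final report generation are all impressive, and in addition to browser-based web applications it also supports automation of Android apps.

So how exactly does the project use AI to carry out test execution? In this article we work through the open-source Midscene.js code to analyze how the project is implemented and how it applies large models.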

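Overall Project Architecture

Based on an analysis of the source code, the overall architecture of the project can be summarized by the following diagram:
[Figure: overall architecture of Midscene.js]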

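User: The user is the starting point of Midscene and defines automation tasks and goals through natural-language descriptions, the JavaScript SDK, or YAML scripts.
MCP Clients: Midscene also lets other MCP clients use its capabilities directly, which implies an API or integration point that other systems can call.
Midscene Core: This is Midscene's core logic layer. It parses the user's instructions, interacts with the AI model, and coordinates the automation agent that performs the operations. It also manages report generation and caching.
AI Models: An AI model receives requests from Midscene Core and returns instructions for performing actions or extracting information. Midscene supports several kinds of model:
- Multimodal LLMs, such as GPT-4o and Gemini-2.5-Pro, used to understand more complex instructions and context.
- Visual-Language Models, such as Qwen2.5-VL, Doubao-1.5-thinking-vision-pro, and UI-TARS, which are especially recommended for UI automation because they understand visual information better.
Automation Agent: This is the key execution layer. Following instructions from Midscene Core, it actually operates the target application or UI. It can also capture the UI state and screenshots and feed them back to Midscene Core.
Target Application/UI: This is the actual object of the automation, which can be:
- Browser: web automation through tools such as Playwright or Puppeteer.
- Android App: Android automation.
Visual Reports: Midscene provides visual reports that help users understand, replay, and debug the whole automation process.
Playground: A built-in Playground environment that lets users debug with natural-language instructions.
Caching Mechanism: Midscene uses caching to improve efficiency, so that scripts can be replayed faster to obtain results.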
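The interaction between Midscene Core, the AI model, and the automation agent described above can be pictured as a plan-and-execute loop. The following is only a minimal sketch of that data flow under stated assumptions: every name in it (`UiContext`, `PlannedStep`, `PlanningModel`, `AutomationAgent`, `runInstruction`, `maxRounds`) is hypothetical and is not taken from the Midscene.js source, and reporting and caching are left out.

```typescript
// Hypothetical types that illustrate the data flow; not Midscene's actual API.
interface UiContext {
  screenshotBase64: string; // UI state captured by the automation agent
}

interface PlannedStep {
  type: string; // e.g. 'Tap' or 'Input'
  description: string; // what the step acts on
}

interface PlanResult {
  steps: PlannedStep[];
  done: boolean; // whether the model expects the task to be finished after these steps
}

interface PlanningModel {
  plan(instruction: string, context: UiContext, history: string[]): Promise<PlanResult>;
}

interface AutomationAgent {
  captureContext(): Promise<UiContext>;
  perform(step: PlannedStep): Promise<void>;
}

// Core loop: capture the UI, ask the model for steps, execute them, and
// re-plan with the execution history until the model reports completion.
async function runInstruction(
  instruction: string,
  model: PlanningModel,
  agent: AutomationAgent,
  maxRounds = 3,
): Promise<string[]> {
  const history: string[] = [];
  for (let round = 0; round < maxRounds; round++) {
    const context = await agent.captureContext();
    const result = await model.plan(instruction, context, history);
    for (const step of result.steps) {
      await agent.perform(step);
      history.push(`${step.type}: ${step.description}`);
    }
    if (result.done) return history;
  }
  throw new Error(`Instruction not completed within ${maxRounds} planning rounds`);
}
```

Passing the history back into each planning round mirrors the idea that the model should know what has already been done and not repeat it; the bound on rounds is an assumption added here so that the sketch always terminates.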

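The Main Built-in System Prompts

Midscene's intelligent parsing relies mainly on large language models, so the system prompts it sets when calling the LLM are especially important. The source code shows that Midscene defines different system prompts for different types of task. They are summarized below:

1. Task-Planning System Prompts

The project contains three different task-planning prompts:

Task-planning prompt for traditional LLMs: guides a traditional language model in decomposing the user's instruction into a series of executable UI actions.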
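To make the result of this decomposition concrete, here is a minimal sketch of checking a model reply against the fields named in the prompt text (`actions`, `locate`, `taskWillBeAccomplished`, `furtherPlan`, `error`) and the action types it lists. The function `parsePlanningResponse` and the helper types are hypothetical and are not part of Midscene.js; the shape of `furtherPlan` is not visible in this excerpt, so it is left as `unknown`.

```typescript
// Field and action names come from the prompt text; the helper names and the
// validation rules are assumptions made for illustration only.
type LocateParam = { id: string; prompt?: string } | null;

interface PlannedAction {
  type: string;
  locate: LocateParam;
  param: unknown;
}

interface PlanningResponse {
  actions: PlannedAction[];
  taskWillBeAccomplished: boolean;
  furtherPlan: unknown; // shape not shown in this excerpt
  error?: string;
}

const BASE_ACTIONS = [
  'Tap', 'RightClick', 'Hover', 'Input', 'KeyboardPress',
  'Scroll', 'ExpectedFalsyCondition', 'Sleep',
];
const ANDROID_ACTIONS = ['AndroidBackButton', 'AndroidHomeButton', 'AndroidRecentAppsButton'];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

function parseLocate(value: unknown): LocateParam {
  if (value === undefined || value === null) return null;
  if (isRecord(value)) {
    const { id, prompt } = value;
    if (typeof id === 'string') {
      return { id, prompt: typeof prompt === 'string' ? prompt : undefined };
    }
  }
  throw new Error('locate must be null or an object with a string id');
}

function parsePlanningResponse(raw: string, pageType: string): PlanningResponse {
  // JSON.parse throws if the model wrapped its reply in a markdown fence,
  // which the prompt explicitly forbids.
  const data: unknown = JSON.parse(raw);
  if (!isRecord(data)) throw new Error('response must be a JSON object');
  const rawActions = data.actions;
  if (!Array.isArray(rawActions)) throw new Error('response must contain an actions array');

  // Android system-button actions are only offered when pageType is 'android'.
  const allowed = new Set(
    pageType === 'android' ? [...BASE_ACTIONS, ...ANDROID_ACTIONS] : BASE_ACTIONS,
  );
  const actions = rawActions.map((action: unknown): PlannedAction => {
    if (!isRecord(action)) throw new Error('each action must be an object');
    const type = action.type;
    if (typeof type !== 'string' || !allowed.has(type)) {
      throw new Error(`unsupported action type: ${String(type)}`);
    }
    return { type, locate: parseLocate(action.locate), param: action.param ?? null };
  });

  return {
    actions,
    taskWillBeAccomplished: data.taskWillBeAccomplished === true,
    furtherPlan: data.furtherPlan ?? null,
    error: typeof data.error === 'string' ? data.error : undefined,
  };
}
```

For example, a reply such as `{"actions":[{"type":"Tap","locate":{"id":"abc","prompt":"login button"}}],"taskWillBeAccomplished":true}` passes this check, while a reply that uses `AndroidBackButton` on a non-Android page type is rejected. The prompt that asks for this output is defined in the following file.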

packages/core/src/ai-model/prompt/llm-planning.ts

const llmLocateParam = `locate: {{"id": string, "prompt": string}} | null`;
const systemTemplateOfLLM = ({ pageType }: { pageType: PageType }) => `
## Role

You are a versatile professional in software UI automation. Your outstanding contributions will impact the user experience of billions of users.

## Objective

- Decompose the instruction user asked into a series of actions
- Locate the target element if possible
- If the instruction cannot be accomplished, give a further plan.

## Workflow

1. Receive the screenshot, element description of screenshot(if any), user's instruction and previous logs.
2. Decompose the user's task into a sequence of actions, and place it in the \`actions\` field. There are different types of actions (Tap / Hover / Input / KeyboardPress / Scroll / FalsyConditionStatement / Sleep ${pageType === 'android' ? '/ AndroidBackButton / AndroidHomeButton / AndroidRecentAppsButton' : ''}). The "About the action" section below will give you more details.
3. Precisely locate the target element if it's already shown in the screenshot, put the location info in the \`locate\` field of the action.
4. If some target elements is not shown in the screenshot, consider the user's instruction is not feasible on this page. Follow the next steps.
5. Consider whether the user's instruction will be accomplished after all the actions
 - If yes, set \`taskWillBeAccomplished\` to true
 - If no, don't plan more actions by closing the array. Get ready to reevaluate the task. Some talent people like you will handle this. Give him a clear description of what have been done and what to do next. Put your new plan in the \`furtherPlan\` field. The "How to compose the \`taskWillBeAccomplished\` and \`furtherPlan\` fields" section will give you more details.

## Constraints

- All the actions you composed MUST be based on the page context information you get.
- Trust the "What have been done" field about the task (if any), don't repeat actions in it.
- Respond only with valid JSON. Do not write an introduction or summary or markdown prefix like \`\`\`json\`\`\`.
- If the screenshot and the instruction are totally irrelevant, set reason in the \`error\` field.

## About the \`actions\` field

The \`locate\` param is commonly used in the \`param\` field of the action, means to locate the target element to perform the action, it conforms to the following scheme:

type LocateParam = {{
  "id": string, // the id of the element found. It should either be the id marked with a rectangle in the screenshot or the id described in the description.
  "prompt"?: string // the description of the element to find. It can only be omitted when locate is null.
}} | null // If it's not on the page, the LocateParam should be null

## Supported actions

Each action has a \`type\` and corresponding \`param\`. To be detailed:
- type: 'Tap'
  * {{ ${llmLocateParam} }}
- type: 'RightClick'
  * {{ ${llmLocateParam} }}
- type: 'Hover'
  * {{ ${llmLocateParam} }}
- type: 'Input', replace the value in the input field
  * {{ ${llmLocateParam}, param: {{ value: string }} }}
  * \`value\` is the final value that should be filled in the input field. No matter what modifications are required, just provide the final value user should see after the action is done. 
- type: 'KeyboardPress', press a key
  * {{ param: {{ value: string }} }}
- type: 'Scroll', scroll up or down.
  * {{ ${llmLocateParam}, param: {{ direction: 'down'(default) | 'up' | 'right' | 'left', scrollType: 'once' (default) | 'untilBottom' | 'untilTop' | 'untilRight' | 'untilLeft', distance: null | number }} }}
  * To scroll some specific element, put the element at the center of the region in the \`locate\` field. If it's a page scroll, put \`null\` in the \`locate\` field. 
  * \`param\` is required in this action. If some fields are not specified, use direction \`down\`, \`once\` scroll type, and \`null\` distance.
  * {{ param: {{ button: 'Back' | 'Home' | 'RecentApp' }} }}
- type: 'ExpectedFalsyCondition'
  * {{ param: {{ reason: string }} }}
  * use this action when the conditional statement talked about in the instruction is falsy.
- type: 'Sleep'
  * {{ param: {{ timeMs: number }} }}
${pageType === 'android' ? `- type: 'AndroidBackButton', trigger the system "back" operation on Android devices
  * {{ param: {{}} }}
- type: 'AndroidHomeButton', trigger the system "home" operation on Android devices
  * {{ param: {{}} }}
- type: 'AndroidRecentAppsButton', trigger the system "recent apps" operation on Android devices
  * {{ param: {{}} }}