Who&When: a benchmark for automated failure attribution

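In a multi-agent system (MAS), agents collaborate autonomously over long information chains, so failures are common and hard to diagnose.

Who&When is a benchmark for testing failure diagnosis in MAS. Its queries come from GAIA and AssistantBench, and it contains failure logs collected from 127 LLM multi-agent systems, processed both automatically and by hand. Each Who&When sample carries the following fine-grained annotations:

"Who": which agent is the culprit.

"When": the step of the interaction at which the decisive error occurred.

"Why": a natural-language explanation of the cause of the failure.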
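The three annotations can be pictured as a small record that a predicted attribution is compared against. The following is a minimal sketch, not the benchmark's own evaluation code: the class name `FailureAnnotation` and the helper `score` are hypothetical, and only the field names are taken from the sample shown in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureAnnotation:
    """Hypothetical container for one sample's labels."""
    mistake_agent: str   # "Who": the agent blamed for the failure
    mistake_step: int    # "When": index of the decisive step
    mistake_reason: str  # "Why": free-text explanation


def score(pred_agent: str, pred_step: int, gold: FailureAnnotation) -> dict:
    """Compare a predicted (agent, step) pair with the gold labels.

    The free-text reason is not scored here, since comparing
    explanations needs more than string equality.
    """
    return {
        "agent_correct": pred_agent == gold.mistake_agent,
        "step_correct": pred_step == gold.mistake_step,
    }


gold = FailureAnnotation(
    mistake_agent="Orchestrator",
    mistake_step=1,
    mistake_reason="The Orchestrator made an error when performing the translation.",
)
result = score("Orchestrator", 2, gold)
print(result)
```

Scoring "Who" and "When" separately lets a diagnosis method get credit for naming the right agent even when it picks the wrong step.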

Details of a Who&When sample:

"question": "In the fictional language of Tizin, basic sentences are arranged with the Verb first, followed by the direct object, followed by the subject of the sentence. I want to express my love for apples to my Tizin friend.",

"ground_truth": "Maktay mato apple",

"mistake_agent": "Orchestrator",

"mistake_step": "1",

"mistake_reason": "The Orchestrator made an error when performing the translation."
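To read such a sample, it helps to jump straight to the log entry that the annotation blames. The sketch below assumes that `mistake_step` is a zero-based index into the `history` list, which is consistent with this sample (entry 0 is the human request, entry 1 is the Orchestrator). The helper `blamed_entry` is hypothetical, and the `content` strings are heavily abbreviated from the full log.

```python
# Abbreviated copy of the sample; the "content" values are shortened here.
record = {
    "mistake_agent": "Orchestrator",
    "mistake_step": "1",
    "history": [
        {"role": "human",
         "content": "Please translate \"I like apples\" to Tizin."},
        {"role": "Orchestrator (thought)",
         "content": "Initial plan: ... \"Maktay Zapple Mato\""},
        {"role": "Orchestrator (thought)",
         "content": "Updated Ledger: ..."},
    ],
}


def blamed_entry(record: dict) -> tuple:
    """Return (step, role, role_matches) for the annotated mistake step.

    Assumes mistake_step is a zero-based index into record["history"].
    """
    step = int(record["mistake_step"])  # stored as a string in the sample
    history = record["history"]
    if not 0 <= step < len(history):
        raise IndexError(f"mistake_step {step} is outside the history")
    entry = history[step]
    # Roles carry suffixes such as "(thought)", so compare by prefix.
    role_matches = entry["role"].startswith(record["mistake_agent"])
    return step, entry["role"], role_matches


print(blamed_entry(record))
```

The prefix check is a cheap sanity test that the "Who" and "When" labels point at the same log entry.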

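Explanation of the failure:

The full agent log is given in the agent log appendix at the end of this post.

Step 0 (the human request) is fine. The error is in step 1, the Orchestrator's response.

Tizin expresses "like" with an "is pleasing to" construction, so the roles are reversed relative to English.

"I like apples" is first rewritten as "apples are pleasing to me". "Apples" is now the subject and "me" is the object.

Converted to Tizin's verb-object-subject order, with "Mato" as the accusative form of "I" and "Apple" as the nominative form of "apples", the sentence should be "Maktay mato apple".

Although the Orchestrator mentions the "is pleasing to" construction in its fact sheet, it does not actually apply the role reversal when building the sentence. The relevant excerpt from its plan: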

\n\nPlan:
            \n1. Construct the sentence by placing the verb \"Maktay\" first.\n2. Follow the verb with the direct object \"Zapple.\"\n3. End the sentence with the subject \"Mato.\"

            \n\nTherefore, the resulting Tizin sentence should be:
            \n\"Maktay Zapple Mato\"

            \n\nWe achieve this entirely through the given facts and logical deduction. There is no need for any additional team members for this simple translation task.
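The reasoning in the explanation above can be written out as a small worked example. This is only a sketch: the word forms come from the task statement in the log, while the names `LEXICON`, `VERB_LIKE`, and `translate_like` are made up for illustration.

```python
# Word forms as given in the task statement.
LEXICON = {
    "I": {"nominative": "Pa", "accusative": "Mato", "genitive": "Sing"},
    "apples": {"nominative": "Apple", "accusative": "Zapple",
               "genitive": "Izapple"},
}
VERB_LIKE = {"present": "Maktay", "preterit": "Tay", "imperfect": "Aktay"}


def translate_like(liker: str, liked: str, tense: str = "present") -> str:
    """Translate "<liker> like(s) <liked>" into Tizin.

    The verb means "is pleasing to", so the liked thing is the
    grammatical subject (nominative) and the liker is the direct
    object (accusative). Word order is Verb, Object, Subject.
    """
    verb = VERB_LIKE[tense]
    direct_object = LEXICON[liker]["accusative"]
    subject = LEXICON[liked]["nominative"]
    return " ".join([verb, direct_object, subject])


answer = translate_like("I", "apples")
logged_answer = "Maktay Zapple Mato"  # the Orchestrator's output in the log
print(answer, "| matches logged answer:", answer == logged_answer)
```

Ignoring capitalization, the result agrees with the annotated ground truth, while the Orchestrator's sentence puts both nouns in the accusative form and keeps the English roles.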

Appendix: agent log contents

"history": [
        {
            "content": "In the fictional language of Tizin, basic sentences are arranged with the Verb first, followed by the direct object, followed by the subject of the sentence. I want to express my love for apples to my Tizin friend. 

            \n\nThe word that indicates oneself is \"Pa\" is the nominative form, 
            \"Mato\" is the accusative form, and \"Sing\" is the genitive form. 

            \n\nThe root verb that indicates an intense like for something is \"Maktay\". When it is used in the present, it is used in it's root form, when it is used in the preterit past, it is \"Tay\", and when it is used in the imperfect past, it is \"Aktay\". It is used differently than in English, and is better translated as \"is pleasing to\", meaning that the thing doing the liking is actually the object of the sentence rather than the subject.\n\nThe word for apples is borrowed from English in Tizin, and so it is \"Apple\" is the nominative form, \"Zapple\" is the accusative form, and \"Izapple\" is the genitive form. 

            \n\nPlease translate \"I like apples\" to Tizin.\n",
            "role": "human"
        },
        {
            "content": "Initial plan:

            \n\nWe are working to address the following user request:

            \n\nIn the fictional language of Tizin, basic sentences are arranged with the Verb first, followed by the direct object, followed by the subject of the sentence. I want to express my love for apples to my Tizin friend. 

            \n\nThe word that indicates oneself is \"Pa\" is the nominative form, \"Mato\" is the accusative form, and \"Sing\" is the genitive form. 

            \n\nThe root verb that indicates an intense like for something is \"Maktay\". When it is used in the present, it is used in it's root form, when it is used in the preterit past, it is \"Tay\", and when it is used in the imperfect past, it is \"Aktay\". It is used differently than in English, and is better translated as \"is pleasing to\", meaning that the thing doing the liking is actually the object of the sentence rather than the subject.

            \n\nThe word for apples is borrowed from English in Tizin, and so it is \"Apple\" is the nominative form, \"Zapple\" is the accusative form, and \"Izapple\" is the genitive form. 

            \n\nPlease translate \"I like apples\" to Tizin.


            \n\n\nTo answer this request we have assembled the following team:
            \n\nAssistant: A helpful and general-purpose AI assistant that has strong language skills, Python skills, and Linux command line skills.

            \nComputerTerminal: A computer terminal that performs no other action than running Python scripts (provided to it quoted in ```python code blocks), or sh shell scripts (provided to it quoted in ```sh code blocks)
            \nFileSurfer: An agent that can handle local files.
            \nWebSurfer: A helpful assistant with access to a web browser. Ask them to perform web searches, open pages, and interact with content (e.g., clicking links, scrolling the viewport, etc., filling in form fields, etc.) It can also summarize the entire page, or answer questions based on the content of the page. It can also be asked to sleep and wait for pages to load, in cases where the pages seem to be taking a while to load.


            \n\n\n\nHere is an initial fact sheet to consider:

            \n\n1. GIVEN OR VERIFIED FACTS\n- In Tizin, basic sentences are arranged with the Verb first, followed by the direct object, followed by the subject of the sentence.
            \n- The word that indicates oneself in the nominative form is \"Pa,\" in the accusative form is \"Mato,\" and in the genitive form is \"Sing.\"
            \n- The root verb that indicates an intense like for something is \"Maktay.\" In the present tense, it is \"Maktay,\" in the preterit past, it is \"Tay,\" and in the imperfect past, it is \"Aktay.\"\n- The verb \"Maktay\" is used differently than in English and is better translated as \"is pleasing to,\" meaning that the thing doing the liking is actually the object of the sentence.
            \n- The word for apples in the nominative form is \"Apple,\" in the accusative form is \"Zapple,\" and in the genitive form is \"Izapple.\"

            \n\n2. FACTS TO LOOK UP
            \n- None specified. All necessary information appears to be provided in the request.

            \n\n3. FACTS TO DERIVE\n- The correct arrangement of the sentence \"I like apples\" in Tizin given the verb-object-subject structure and the specific verb usage.

            \n\n4. EDUCATED GUESSES\n- None needed, as all information needed for the translation is provided.


            \n\n\nHere is the plan to follow as best as possible:

            \n\n- We have all the required information to translate \"I like apples\" to Tizin.
            \n- The sentence structure in Tizin is Verb-Object-Subject.
            \n- The verb in the present tense is \"Maktay.\"
            \n- \"I\" as the subject in the nominative form is \"Pa.\"
            \n- \"Apples\" as the direct object in the accusative form is \"Zapple.\"
            
            \n\nPlan:
            \n1. Construct the sentence by placing the verb \"Maktay\" first.\n2. Follow the verb with the direct object \"Zapple.\"\n3. End the sentence with the subject \"Mato.\"

            \n\nTherefore, the resulting Tizin sentence should be:
            \n\"Maktay Zapple Mato\"

            \n\nWe achieve this entirely through the given facts and logical deduction. There is no need for any additional team members for this simple translation task.

            \n\n/usr/local/lib/python3.11/site-packages/autogen_magentic_one/agents/orchestrator.py:216: UserWarning: Resolved model mismatch: gpt-4o-2024-08-06 != gpt-4o-2024-05-13. Model mapping may be incorrect.
            \n  ledger_response = await self._model_client.create(",
            "role": "Orchestrator (thought)"
        },
        {
            "content": "Updated Ledger:
            \n{
            \n  \"is_request_satisfied\": {
            \n    \"reason\": \"The translation 'Maktay Zapple Mato' correctly forms the sentence 'I like apples' in Tizin following the syntax and word forms provided.\",
            \n    \"answer\": true
            \n  },
            \n  \"is_in_loop\": {
            \n    \"reason\": \"The request was satisfied in the first response, and there has been no repetition of the same requests or responses.\",
            \n    \"answer\": false\n  },
            \n  \"is_progress_being_made\": {
            \n    \"reason\": \"The translation task was completed accurately in the initial response.\",
            \n    \"answer\": true\n  },
            \n  \"next_speaker\": {
            \n    \"reason\": \"There is no additional task or follow-up needed as the request has been fully satisfied.\",
            \n    \"answer\": \"Assistant\"\n  },
            \n  \"instruction_or_question\": {
            \n    \"reason\": \"The translation of the sentence 'I like apples' to Tizin has been completed.\",
            \n    \"answer\": \"The translation of 'I like apples' to Tizin is 'Maktay Zapple Mato.'\"
            \n  }
            \n}",
            "role": "Orchestrator (thought)"
        },
        {
            "content": "Request satisfied.",
            "role": "Orchestrator (thought)"
        },
        {
            "content": "No agent selected.\n  response = await client.create(messages)\nFINAL ANSWER: Maktay Zapple Mato\nSCENARIO.PY COMPLETE !#!#\nRUN.SH COMPLETE !#!#",
            "role": "Orchestrator (termination condition)"
        }
    ]

---

Who and When.

https://huggingface.co/datasets/Kevin355/Who_and_When

Agents_Failure_Attribution.

https://github.com/mingyin1/Agents_Failure_Attribution

Intelligent data analysis and attribution insight with a financial large model.

https://blog.51cto.com/dipeak/12468170

Decrypting Prompt series 23: a full mind map of LLM hallucination classification, attribution, detection, and mitigation.

https://cloud.tencent.com/developer/article/2378383
