当前位置：首页 > news >正文

模型训练的常用方法及llama-factory支持的数据训练格式

news 2025/7/14 0:31:18

一、模型训练的常用方法：

通常模型训练包含如下五种技术方向：指令监督微调、预训练、偏好训练、KTO和多模态。

接下来开始详细解析各技术方向：

1. 指令监督微调 (Supervised Fine-Tuning, SFT)

1.1概念：

在预训练模型基础上，使用指令-响应对数据集进行有监督微调，使模型适应特定任务格式。

1.2核心方法：

完全微调：更新所有参数，性能最优但资源消耗大（需多个高端GPU微调8B模型）

参数高效微调：LoRA（添加低秩适配器，训练参数<1%）和QLoRA（量化感知，内存再降33%）

1.3优点：

高效任务适应：千级样本即可显著提升指令遵循能力（如AlpacaEval评估中43%胜率）

低成本定制：LoRA/QLoRA大幅降低计算需求，可在单GPU完成微调（如Unsloth库提速2倍）

可控性强：通过领域数据注入实现专业化（医疗/法律等）

1.4缺点：

灾难性遗忘：完全微调可能覆盖预训练获得的世界知识依赖基础能力：难以学习全新知识（如未知语言），易产生幻觉

泛化局限：窄域微调（如仅诗歌生成）会削弱其他能力

1.5典型应用：

领域专家模型：使用医疗/金融指令集微调临床诊断或财报分析模型

风格迁移：注入特定写作风格（如学术严谨 vs. 口语化）

2.预训练 (Pre-training)

2.1概念：

在大规模无标注语料上通过自监督目标（如掩码语言建模、Next Token Prediction）训练模型掌握语言规律。

2.2主流架构：

编码器式（如BERT）：双向上下文表征，适合理解任务

解码器式（如GPT）：自回归生成，适合文本生成

2.3优点：

通用表征能力：学习语言结构和语义，支持下游任务泛化（文本分类/翻译等）

数据利用高效：无需标注数据，千亿级token训练释放模型潜力

技术生态成熟：丰富工具链支持（Hugging Face/TRL等）

2.4缺点：

上下文理解弱：长文本依赖处理不足（如需复杂推理的摘要）

多语言支持有限：非英语语料质量参差，影响跨语言性能

训练成本极高：千亿参数模型需千卡集群训练数月

2.5演进方向

多任务预训练：联合学习MLM、文本蕴含等提升泛化性

高效架构：Mixture-of-Experts（MoE）降低计算开销

3. 偏好训练 (Preference Training) 与奖励建模

3.1概念：

通过人类偏好数据训练奖励模型（Reward Model, RM），引导模型生成符合人类价值观的输出。
3.2主流方法：

基于偏好：标注样本对（好/坏回答），需高质量数据（成本高、难扩展）

基于规则（RLVR）：依赖标准答案验证，仅适合数学/代码等明确任务

新范式POLAR：策略判别学习，同一策略轨迹为正例，不同策略为负例，无需人工标注

3.3优点：

对齐人类价值观：减少有害/偏见输出，提升有用性 & 安全性

POLAR突破瓶颈：自动化合成数据（3.6T Token训练POLAR-7B），实现策略分布建模

强化微调适配：通过RFT框架驱动策略逼近最优（参考答案相似性打分）

3.4缺点：

传统RM扩展难：人工标注成本高，主观偏好导致奖励黑客（Reward Hacking）

RLVR通用性差：开放域任务（如创作）缺乏明确规则

评估复杂性：偏好建模需人类评估或强LLM评判（如AlpacaEval 2）

3.5POLAR效果示例：
在单词计数任务中，POLAR-7B精准区分：

答案正确+思路正确 > 答案正确+无思路 > 答案错误+思路正确

4. KTO (Kahneman-Tversky Optimization)

4.1概念：

基于行为经济学前景理论，通过非对称效用函数优化偏好，无需成对偏好数据。

4.2核心机制：

效用函数：好样本权重λ_D强调稳定，坏样本权重λ_U强化规避（λ_U > λ_D）

参考点：基于参考模型KL散度动态调整，锚定预期表现

4.3优点：

数据需求低：单样本即可训练（无需偏好对），降低标注成本

损失非对称性：对坏样本惩罚力度 >> 好样本奖励，强化风险规避

训练稳定性高：通过KL约束防止过度优化，适合中小规模场景

4.4缺点：

微差异区分弱：对相似质量样本打分粗糙（如两个合理回答难排序）

心理模型简化：难以完全模拟人类复杂决策过程（如情感偏见）

需组合使用：常与DPO同步提升鲁棒性

4.5典型应用：

合规性强化：金融/医疗场景中严格规避错误陈述

快速迭代场景：标注资源有限时的偏好对齐

5.多模态 (Multimodal)

5.1概念：

融合文本、图像、音频等多种模态数据，实现跨模态理解与生成。

5.2关键方法：

联合表示（单塔）：多模态数据早期融合（如CLIP）

协作表示（双塔）：各模态独立编码后对齐（如CCA/DCCA）

5.3优点：

跨模态理解：图像描述/视频摘要等任务表现显著优于单模态

泛化性强：真实场景信息冗余性提升抗噪能力（如模糊图像+文本推断）

认知逼近人类：多源信息互补支持复杂推理（医疗影像+报告生成诊断）

5.4缺点：

数据需求复杂：需对齐的多模态数据集（如图像-文本对），构建成本高

计算开销大：融合模块增加显存/算力消耗（ViT+LLM训练需千卡级集群）

模态失衡风险：弱模态主导预测（如文本描述忽略关键视觉线索）

5.5场景适配建议：

任务类型	推荐模型	案例场景
单一模态处理	单模态	文本翻译/图像分类
跨模态关联	多模态	视频摘要（画面+语音+字幕）
低资源场景	协作表示	语音识别+文本纠错

6.以下为五大技术核心特性对比：

技术	核心优势	主要局限	适用阶段
预训练	通用语言表征	上下文捕捉弱/成本高	模型基座构建
指令微调 (SFT)	少样本任务适配	领域外泛化差	任务专项优化
偏好训练 (POLAR)	策略分布建模/自动化数据	需微调对齐人类偏好	价值观对齐
KTO	单样本训练/风险规避强化	微差异区分弱	轻量化偏好对齐
多模态	跨模态推理/真实场景鲁棒性	数据对齐难/算力需求高	复杂环境感知

二、llama-factory的数据格式

dataset_info.json 包含了所有经过预处理的本地数据集以及在线数据集。如果您希望使用自定义数据集，请务必在 dataset_info.json 文件中添加对数据集及其内容的定义。

目前我们支持 Alpaca 格式和 ShareGPT 格式的数据集。

1、Alpaca

针对不同任务，数据集格式要求如下：

指令监督微调
预训练
偏好训练
KTO
多模态

1.1 指令监督微调数据集:

指令监督微调(Instruct Tuning)通过让模型学习详细的指令以及对应的回答来优化模型在特定指令下的表现。

instruction 列对应的内容为人类指令， input 列对应的内容为人类输入， output 列对应的内容为模型回答。下面是一个例子

"alpaca_zh_demo.json"

{

  "instruction": "计算这些物品的总费用。 ",

  "input": "输入：汽车 - $3000，衣服 - $100，书 - $20。",

  "output": "汽车、衣服和书的总费用为 $3000 + $100 + $20 = $3120。"

},

在进行指令监督微调时， instruction 列对应的内容会与 input 列对应的内容拼接后作为最终的人类输入，即人类输入为 instruction\ninput。而 output 列对应的内容为模型回答。在上面的例子中，人类的最终输入是：

计算这些物品的总费用。

输入：汽车 - $3000，衣服 - $100，书 - $20。

模型的回答是：

汽车、衣服和书的总费用为 $3000 + $100 + $20 = $3120。

如果指定， system 列对应的内容将被作为系统提示词。

history 列是由多个字符串二元组构成的列表，分别代表历史消息中每轮对话的指令和回答。注意在指令监督微调时，历史消息中的回答内容也会被用于模型学习。

指令监督微调数据集格式要求如下：

[

  {

    "instruction": "人类指令（必填）",

    "input": "人类输入（选填）",

    "output": "模型回答（必填）",

    "system": "系统提示词（选填）",

    "history": [

      ["第一轮指令（选填）", "第一轮回答（选填）"],

      ["第二轮指令（选填）", "第二轮回答（选填）"]

    ]

  }

]

下面提供一个 alpaca 格式多轮对话的例子，对于单轮对话只需省略 history 列即可。

[

  {

    "instruction": "今天的天气怎么样？",

    "input": "",

    "output": "今天的天气不错，是晴天。",

    "history": [

      [

        "今天会下雨吗？",

        "今天不会下雨，是个好天气。"

      ],

      [

        "今天适合出去玩吗？",

        "非常适合，空气质量很好。"

      ]

    ]

  }

]

对于上述格式的数据， dataset_info.json 中的数据集描述应为：

"数据集名称": {

  "file_name": "data.json",

  "columns": {

    "prompt": "instruction",

    "query": "input",

    "response": "output",

    "system": "system",

    "history": "history"

  }

}

1.2 预训练数据集

大语言模型通过学习未被标记的文本进行预训练，从而学习语言的表征。通常，预训练数据集从互联网上获得，因为互联网上提供了大量的不同领域的文本信息，有助于提升模型的泛化能力。预训练数据集文本描述格式如下：

[

{"text": "document"},

{"text": "document"}

]

在预训练时，只有 text 列中的内容（即document）会用于模型学习。

对于上述格式的数据， dataset_info.json 中的数据集描述应为：

"数据集名称": {

  "file_name": "data.json",

  "columns": {

    "prompt": "text"

  }

}

数据样例如下：

{"text": "Don’t think you need all the bells and whistles? No problem. McKinley Heating Service Experts Heating & Air Conditioning offers basic air cleaners that work to improve the quality of the air in your home without breaking the bank. It is a low-cost solution that will ensure you and your family are living comfortably.\nIt’s a good idea to understand the efficiency rate of the filters, which measures what size of molecules can get through the filter. Basic air cleaners can filter some of the dust, dander and pollen that need to be removed. They are 85% efficient, and usually have a 6-inch cleaning surface.\nBasic air cleaners are not too expensive and do the job well. If you do want to hear more about upgrading from a basic air cleaner, let the NATE-certified experts at McKinley Heating Service Experts in Edmonton talk to you about their selection.\nEither way, now’s a perfect time to enhance and protect the indoor air quality in your home, for you and your loved ones.\nIf you want expert advice and quality service in Edmonton, give McKinley Heating Service Experts a call at 780-800-7092 to get your questions or concerns related to your HVAC system addressed."}

{"text": "To the apparent surprise of everyone, the Walt Disney Company has announced a deal to purchase Lucasfilm Ltd. According to the official press release, Disney has agreed to fork over $4.05 billion in cash and stock for George Lucas’ studio in a deal that brings together two of the world’s most important intellectual property libraries.\nAs you might expect, Disney is itching to take advantage of its new toys. “This transaction combines a world-class portfolio of content including Star Wars, one of the greatest family entertainment franchises of all time, with Disney’s unique and unparalleled creativity across multiple platforms, businesses, and markets to generate sustained growth and drive significant long-term value,” said Disney CEO Robert Iger in this afternoon’s announcement.\nUnder the terms of this agreement Disney will acquire control over all Lucasfilm iterations. This includes both its traditional film-making studio facilities, as well as the various technologies Lucasfilm has created over the years to further its various media properties. Thus, the gigantic Disney family now includes Lucasfilm itself, special effects house Industrial Light & Magic, Skywalker Sound and LucasArts, the company’s video game creation division.\nThis acquisition alone would be huge news, but as if to pre-empt fan speculation on the future of Star Wars the same announcement also mentions that a new Star Wars movie is scheduled to appear in 2015. Though the vast majority of recent Star Wars media has been focused on the property’s various animated iterations and LEGO crossovers, this new film will be the first official cinematic continuation of George Lucas’ original Star Wars trilogy. Though very few details are offered on this film, it has officially been dubbed Star Wars: Episode VII, and barring any major catastrophes it should hit theaters at some point in 2015 (if we had to guess, we’d assume an early summer release in keeping with the tradition established by its predecessors).\nPerhaps even more intriguing however, is the announcement’s claim that Episode VII’s release will herald a new era in which new Star Wars movies hit theaters “every two to three years.” It specifically mentions Episodes VIII and IX by name, though offers no solid details on either film.\nWhile the effects of the move won’t be fully known for at least a few months, we can think of a number of a things this new union might change. For instance, currently Dark Horse Comics publishes all Star Wars comic books, but with Disney owning Marvel Comics we can’t see that agreement lasting for long. Likewise, both Disney and Lucasfilm have sizable divisions dedicated to creating video games based on their various media properties. Normally these companies have had to seek outside publishing agreements, but now that they’ve joined forces and massively expanded the number of games either company is capable of releasing in any given year, it makes a lot of sense for Disney to invest in its own games publishing wing.\nFinally, this agreement almost certainly heralds future crossovers between Disney and Lucasfilm characters. We don’t know any specifics, but it’s only a matter of time before we see toys depicting Mickey Mouse dressed as Darth Vader. Whether that sounds awesome or stomach-churningly disgusting is entirely up to your rapidly waning sense of childhood whimsy.\nUpdate: Scratch that last prediction. Apparently Disney characters dressed as Star Wars characters is already a thing.\nOur partnership with LucasFilm has produced over 20 yrs worth of stories. We have Star Wars for the near future, and hope for years to come."}

1.3 偏好数据集

偏好数据集用于奖励模型训练、DPO 训练和 ORPO 训练。对于系统指令和人类输入，偏好数据集给出了一个更优的回答和一个更差的回答。

一些研究表明通过让模型学习“什么更好”可以使得模型更加迎合人类的需求。甚至可以使得参数相对较少的模型的表现优于参数更多的模型。

偏好数据集需要在 chosen 列中提供更优的回答，并在 rejected 列中提供更差的回答，在一轮问答中其格式如下：

[

  {

    "instruction": "人类指令（必填）",

    "input": "人类输入（选填）",

    "chosen": "优质回答（必填）",

    "rejected": "劣质回答（必填）"

  }

]

对于上述格式的数据，dataset_info.json 中的数据集描述应为：

"数据集名称": {

  "file_name": "data.json",

  "ranking": true,

  "columns": {

    "prompt": "instruction",

    "query": "input",

    "chosen": "chosen",

    "rejected": "rejected"

  }

}

1.4 KTO 数据集

KTO数据集与偏好数据集类似，但不同于给出一个更优的回答和一个更差的回答，KTO数据集对每一轮问答只给出一个 true/false 的 label。除了 instruction 以及 input 组成的人类最终输入和模型回答 output ，KTO 数据集还需要额外添加一个 kto_tag 列（true/false）来表示人类的反馈。

在一轮问答中其格式如下：

[

  {

    "instruction": "人类指令（必填）",

    "input": "人类输入（选填）",

    "output": "模型回答（必填）",

    "kto_tag": "人类反馈 [true/false]（必填）"

  }

]

对于上述格式的数据， dataset_info.json 中的数据集描述应为：

"数据集名称": {

  "file_name": "data.json",

  "columns": {

    "prompt": "instruction",

    "query": "input",

    "response": "output",

    "kto_tag": "kto_tag"

  }

}

1.5 多模态数据集

目前我们支持多模态图像数据集、视频数据集以及音频数据集的输入。

图像数据集

多模态图像数据集需要额外添加一个 images 列，包含输入图像的路径。注意图片的数量必须与文本中所有 <image> 标记的数量严格一致。

[

  {

    "instruction": "人类指令（必填）",

    "input": "人类输入（选填）",

    "output": "模型回答（必填）",

    "images": [

      "图像路径（必填）"

    ]

  }

]

对于上述格式的数据， dataset_info.json 中的数据集描述应为：

"数据集名称": {

  "file_name": "data.json",

  "columns": {

    "prompt": "instruction",

    "query": "input",

    "response": "output",

    "images": "images"

  }

}

视频数据集

多模态视频数据集需要额外添加一个 videos 列，包含输入视频的路径。注意视频的数量必须与文本中所有 <video> 标记的数量严格一致。

[

  {

    "instruction": "人类指令（必填）",

    "input": "人类输入（选填）",

    "output": "模型回答（必填）",

    "videos": [

      "视频路径（必填）"

    ]

  }

]

对于上述格式的数据， dataset_info.json 中的数据集描述应为：

"数据集名称": {

  "file_name": "data.json",

  "columns": {

    "prompt": "instruction",

    "query": "input",

    "response": "output",

    "videos": "videos"

  }

}

音频数据集

多模态音频数据集需要额外添加一个 audio 列，包含输入图像的路径。注意音频的数量必须与文本中所有 <audio> 标记的数量严格一致。

[

  {

    "instruction": "人类指令（必填）",

    "input": "人类输入（选填）",

    "output": "模型回答（必填）",

    "audios": [

      "音频路径（必填）"

    ]

  }

]

对于上述格式的数据， dataset_info.json 中的数据集描述应为：

"数据集名称": {

  "file_name": "data.json",

  "columns": {

    "prompt": "instruction",

    "query": "input",

    "response": "output",

    "audios": "audios"

  }

}

2. ShareGPT

针对不同任务，数据集格式要求如下：

指令监督微调
偏好训练
OpenAI格式

ShareGPT 格式中的 KTO数据集(样例)和多模态数据集(样例) 与 Alpaca 格式的类似。
预训练数据集不支持 ShareGPT 格式。

2.1 指令监督微调数据集

相比 alpaca 格式的数据集， sharegpt 格式支持更多的角色种类，例如 human、gpt、observation、function 等等。它们构成一个对象列表呈现在 conversations 列中。下面是 sharegpt 格式的一个例子：

{

    "conversations": [

      {

        "from": "human",

        "value": "我需要为John Doe生成一张发票。他购买了2个苹果，每个$1，以及3根香蕉，每根$0.5。"

      },

      {

        "from": "function_call",

        "value": "{\"name\": \"generate_invoice\", \"arguments\": {\"customer_name\": \"约翰·多伊\", \"items\": [{\"name\": \"苹果\", \"quantity\": 2, \"price\": 1}, {\"name\": \"香蕉\", \"quantity\": 3, \"price\": 0.5}]}}"

      },

      {

        "from": "observation",

        "value": "{\"invoice_id\": \"INV12345\", \"customer_name\": \"约翰·多伊\", \"items\": [{\"name\": \"苹果\", \"quantity\": 2, \"price\": 1, \"total\": 2}, {\"name\": \"香蕉\", \"quantity\": 3, \"price\": 0.5, \"total\": 1.5}], \"total\": 3.5, \"status\": \"生成\"}"

      },

      {

        "from": "gpt",

        "value": "发票已成功生成。发票编号为INV12345。约翰·多伊的总金额为$3.5。发票包含2个苹果，总金额为$2，以及3根香蕉，总金额为$1.5。"

      }

    ],

    "tools": "[{\"name\": \"generate_invoice\", \"description\": \"生成发票\", \"parameters\": {\"type\": \"object\", \"properties\": {\"customer_name\": {\"type\": \"string\", \"description\": \"客户名称\"}, \"items\": {\"type\": \"array\", \"items\": {\"type\": \"object\", \"properties\": {\"name\": {\"type\": \"string\", \"description\": \"The item name\"}, \"quantity\": {\"type\": \"integer\", \"description\": \"The quantity of the item\"}, \"price\": {\"type\": \"number\", \"description\": \"The price per unit\"}}, \"required\": [\"name\", \"quantity\", \"price\"]}}}, \"required\": [\"customer_name\", \"items\"]}}, {\"name\": \"generate_password\", \"description\": \"生成随机密码\", \"parameters\": {\"type\": \"object\", \"properties\": {\"length\": {\"type\": \"integer\", \"description\": \"密码的长度\"}}, \"required\": [\"length\"]}}]"

  },

注意其中 human 和 observation 必须出现在奇数位置，gpt 和 function 必须出现在偶数位置。

[

  {

    "conversations": [

      {

        "from": "human",

        "value": "人类指令"

      },

      {

        "from": "function_call",

        "value": "工具参数"

      },

      {

        "from": "observation",

        "value": "工具结果"

      },

      {

        "from": "gpt",

        "value": "模型回答"

      }

    ],

    "system": "系统提示词（选填）",

    "tools": "工具描述（选填）"

  }

]

对于上述格式的数据， dataset_info.json 中的数据集描述应为：

"数据集名称": {

  "file_name": "data.json",

  "formatting": "sharegpt",

  "columns": {

    "messages": "conversations",

    "system": "system",

    "tools": "tools"

  }

}

2.2 偏好数据集

Sharegpt 格式的偏好数据集同样需要在 chosen 列中提供更优的消息，并在 rejected 列中提供更差的消息。下面是一个例子：

{

    "conversations": [

      {

        "from": "human",

        "value": "国会的转发\n美国国会由众议院和参议院组成，每两年换届一次（参议员任期为6年，但参议院选举是错位的，使得国会的组成仍然每两年变化一次）。这两年期间按顺序标记，第115届国会发生在2017-2018年。\n\n密歇根大学信息学院的研究人员在这段时间内收集了现任国会议员（我们将“国会议员”缩写为MoC）的Twitter帖子，并对它们进行编码，标记为原创声明或其他用户提交的转发。我们将重点关注转发数据。这些发布的数据不仅包括转发的文本，还包括国会议员的信息和原始推文的帐户。\n#python:\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sb\nimport statsmodels.api as sm\nimport os\nfrom pandas.plotting import register\\_matplotlib\\_converters\nregister\\_matplotlib\\_converters()\n\npd.options.mode.chained\\_assignment = None\n\n# 在接下来的内容中，我们将加载数据，但出于速度原因，我们将排除推文本身的文本。\n\ntweets = pd.read\\_csv(os.path.join(base, \"retweets\\_congress\\_115th.csv.gz\"), \n usecols = lambda x: x != \"full\\_text\",\n dtype = {\"tweet\\_id\\_str\": \"str\"},\n index\\_col = \"tweet\\_id\\_str\")\ntweets.shape\nout: (209856, 21)\n\ntweets.columns\nout:Index(['user\\_id\\_str', 'user\\_screen\\_name', 'user\\_followers\\_count', 'created\\_at',\n 'retweet\\_count', 'retweeted\\_status', 'year', 'bioguide', 'birthday',\n 'gender', 'name', 'state', 'chamber', 'party', 'region', 'rt\\_user\\_id',\n 'rt\\_screen\\_name', 'rt\\_followers\\_count', 'rt\\_bio', 'age', 'age\\_bucket'],\n dtype='object')\n\n第（a）部分\n国会主要由两个政党（民主党和共和党）组成，还有一些独立派。哪个政党负责大部分的转发？\n\n计算推文的政党的边际概率，并将结果保存到party\\_marg。显示这个结果。\n\n第（b）部分\n我们可能会问的一个问题是，“国会成员彼此转发的频率是多少？”\n\n创建一个名为“is\\_congress\\_retweet”的新列，指示转发是否来自MoC。首先使用.unique()方法获取唯一的“user\\_id\\_str”值集合。然后使用这组id，在“rt\\_user\\_id”列上使用.isin(values)方法创建新的布尔列“is\\_congress\\_retweet”。\n\n找到转发MoC的边际概率，并将结果保存到“congress\\_retweet\\_marg”。显示这些结果。是更常见地转发国会议员还是来自国会外部？"

      }

    ],

    "chosen": {

      "from": "gpt",

      "value": "计算推文的政党边际概率，我们可以使用以下代码：\n```python\ntweets['party'] = tweets['party'].replace(0, 'Democrat')\ntweets['party'] = tweets['party'].replace(1, 'Republican')\ntweets['party'] = tweets['party'].replace(2, 'Independent')\n\nparty_counts = tweets.groupby('party').size().sort_values(ascending=False)\n\nparty_marg = party_counts / party_counts.sum()\n\nprint(\"政党的边际概率：\")\nprint(party_marg)\n```\n这段代码将'party'列中的数字值替换为相应的政党名称，然后按政党对推文进行分组并计算每个组的大小。最后，通过将每个组的大小除以总推文数来计算每个政党的边际概率。\n\n输出应该类似于：\n```markdown\n政党的边际概率：\n0.522152    民主党\n0.445541    共和党\n0.032307    独立派\n```\n这表明大多数转发来自民主党，其次是共和党，然后是独立派。\n\n要创建一个名为\"is\\_congress\\_retweet\"的新列，指示转发是否来自国会议员，我们可以使用以下代码：\n```python\ncongress_ids = tweets['user_id_str'].unique()\nis_congress_retweet = tweets['rt_user_id'].isin(congress_ids)\ntweets['is_congress_retweet'] = is_congress_retweet\n\ncongress_retweet_marg = tweets['is_congress_retweet'].mean()\nprint(\"转发国会议员的边际概率：\")\nprint(congress_retweet_marg)\n```\n这段代码首先获取唯一的\"user\\_id\\_str\"值集合，对应于国会议员的集合。然后，在\"rt\\_user\\_id\"列上使用.isin(values)方法创建一个新的布尔列\"is\\_congress\\_retweet\"，指示转发是否来自国会议员。最后，通过取\"is\\_congress\\_retweet\"列的平均值来计算转发国会议员的边际概率。\n\n输出应该类似于：\n```markdown\n转发国会议员的边际概率：\n0.775151\n```\n这表明大多数转发不是来自国会议员，而是来自国会之外。"

    },

    "rejected": {

      "from": "gpt",

      "value": "回答问题的第（a）部分，您可以通过首先使用groupby（）函数按政党对数据进行分组，然后计算每个组的转发数来计算政党的边际概率。具体来说，您可以使用以下代码：\n```scss\nparty_count = tweets.groupby('party').count()\nparty_count.loc['Democrats','retweet_count'] / party_count.loc['Republicans','retweet_count']\n```\n回答问题的第（b）部分，您可以创建一个新列来指示转发是否来自国会议员，方法是使用“bioguide”列上的.unique（）方法获取唯一的国会议员用户ID集合，然后使用该集合创建一个布尔列，指示转发是否来自国会议员，使用“rt_user_id”列上的.isin（）方法。最后，使用.sum（）方法来计算转发国会议员的边际概率，然后将其与转发来自国会以外的人的概率进行比较。以下是相关代码：\n```makefile\nimport pandas as pd\ntweets = pd.read_csv('retweets_congress_115th.csv')\n\n# 获取唯一的国会议员ID集合\nmovers_ids = tweets['bioguide'].unique()\n\n# 创建一个新的布尔列，指示转发是否来自国会议员\ntweets['is_congress_retweet'] = tweets['rt_user_id'].isin(movers_ids)\n\n# 计算转发国会议员或来自国会以外的人的边际概率\nmovers_retweets = tweets[tweets['is_congress_retweet'] == True].shape[0]\nnon_movers_retweets = tweets[tweets['is_congress_retweet'] == False].shape[0]\nproportion_movers_retweets = movers_retweets / (movers_retweets + non_movers_retweets)\nprint('Proportion of retweets from MoCs:', proportion_movers_retweets)\n```"

    }

  },

其格式为：

[

  {

    "conversations": [

      {

        "from": "human",

        "value": "人类指令"

      },

      {

        "from": "gpt",

        "value": "模型回答"

      },

      {

        "from": "human",

        "value": "人类指令"

      }

    ],

    "chosen": {

      "from": "gpt",

      "value": "优质回答"

    },

    "rejected": {

      "from": "gpt",

      "value": "劣质回答"

    }

  }

]

对于上述格式的数据，dataset_info.json 中的数据集描述应为：

"数据集名称": {

  "file_name": "data.json",

  "formatting": "sharegpt",

  "ranking": true,

  "columns": {

    "messages": "conversations",

    "chosen": "chosen",

    "rejected": "rejected"

  }

}

2.3 OpenAI格式

OpenAI 格式仅仅是 sharegpt 格式的一种特殊情况，其中第一条消息可能是系统提示词。

[

  {

    "messages": [

      {

        "role": "system",

        "content": "系统提示词（选填）"

      },

      {

        "role": "user",

        "content": "人类指令"

      },

      {

        "role": "assistant",

        "content": "模型回答"

      }

    ]

  }

]

对于上述格式的数据， dataset_info.json 中的数据集描述应为：

"数据集名称": {

  "file_name": "data.json",

  "formatting": "sharegpt",

  "columns": {

    "messages": "messages"

  },

  "tags": {

    "role_tag": "role",

    "content_tag": "content",

    "user_tag": "user",

    "assistant_tag": "assistant",

    "system_tag": "system"

  }

}