
1.2 Kaggle in Plain Language: Eedi Competition Transformer-Framework Solution 02 - Generating Missing Training-Set Data with GPT_4o

news 2025/9/14 7:37:14

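- 0. Competition summary table for this column
- 1. Purpose of this article
- 2. AI engineering architecture
- 3. Data preprocessing module
  - 3.1 Configure data paths and processing parameters
  - 3.2 Configure API parameters
  - 3.3 Configure output paths
- 4. AI parallel processing module
  - 4.1 Define the LLM client class
  - 4.2 Define the row-processing function
  - 4.3 Define the JSON saving function
  - 4.4 Define the range-slicing function
  - 4.5 Define the slice-pairing function
  - 4.6 Define the filename sorting function
- 5. Data integration module
  - 5.1 Load the data and generate slices
  - 5.2 Initialize the LLM client and test it
  - 5.3 Generate data in parallel
  - 5.4 Merge the results
  - 5.5 Save the final result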

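0. Competition summary table for this column

Kaggle competition summary

1. Purpose of this article

In plain language: the data exploration in the previous article showed that some training rows are missing their misconception explanations. This article fills those gaps by calling GPT_4o with a persona-style system prompt.
Skills covered: calling an AI model through an API, a worked example of persona prompt engineering, and batched data processing with on-disk caching.
Previous article: Eedi LLM distillation solution 01 - Competition overview and data understanding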

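2. AI engineering architecture

3. Data preprocessing module

3.1 Configure data paths and processing parameters

import glob, json, os, re
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
from openai import OpenAI
from tqdm import tqdm

data_path = "~/work/eedi_synthetic_data/MalAlgoQA_format.csv"
index_start = 0    # first row to process
index_end = None   # end row (exclusive); set to len(df) once df is loaded in 5.1
step = 100         # rows per cached batch
max_workers = 2    # parallel workers per batch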

3.2 Configure API parameters

model_config = dict(
    openai_api_base="https://testshellapi.kimi.asia/v1",
    api_key="****",
    model="gpt-4o",
    default_system_prompt="""##Task
You are a Mathematics teacher. Your task is to reason and identify the ConstructName and SubjectName and then the misconception behind the user input Incorrect Answers with the Question.
ConstructName is Most granular level of knowledge related to question, appears to describe the specific mathematical method or procedure used to solve the question. It explains the technique or approach needed to reach the answer.
SubjectName is More general context than the construct, represents the broader mathematical topic or category that the question belongs to.
Misconceptions are a mistake in conceptual understanding and they have relations with all the applications of those concepts. For example, a single misconception on the connections among proportional relationships (part/whole, part/part, whole/part) can cause problems in identifying those patterns in drawings and can be the cause of failing to realize all parts must be of equal size, therefore associating the denominator of the fraction with the total number of parts regardless their size.
Answer concisely what misconception it is to lead to getting the incorrect answer.
Do not use "The misconception is" to start your answers.
Do not mention the concrete details of the question or answers.

##User input
Question: The question text
A: multiple choice answer A text
B: multiple choice answer B text
C: multiple choice answer C text
D: multiple choice answer D text
Correct Answer: The correct answer text

##You should answer in the following JSON format
{
"ConstructName": "here writes the constructName",
"SubjectName": "here writes the SubjectName",
"MisconceptionAName": "here writes the answer A's misconception.",
"MisconceptionBName": "here writes the answer B's misconception.",
"MisconceptionCName": "here writes the answer C's misconception.",
"MisconceptionDName": "here writes the answer D's misconception."
}""",  # system prompt
    default_temperature=0.5,
    max_tokens=256,
)
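The system prompt asks for a JSON object with six named keys, so a reply can be checked against that list before it is merged into the training data. The following is a minimal sketch of such a check; `EXPECTED_KEYS` and `parse_reply` are hypothetical names that do not appear in the original code.

```python
import json

# Keys the system prompt asks the model to return.
EXPECTED_KEYS = (
    "ConstructName", "SubjectName",
    "MisconceptionAName", "MisconceptionBName",
    "MisconceptionCName", "MisconceptionDName",
)

def parse_reply(raw):
    """Parse a raw model reply and return (record, missing_keys).

    Hypothetical helper: a reply that is not valid JSON (for example one
    cut off by max_tokens) yields a record with every key missing.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    record = {key: data.get(key) for key in EXPECTED_KEYS}
    missing = [key for key in EXPECTED_KEYS if record[key] is None]
    return record, missing

record, missing = parse_reply('{"ConstructName": "c", "SubjectName": "s"}')
print(missing)
```

Rows with a non-empty `missing` list could be queued for another request instead of being written to the final CSV.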

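3.3 Configure output paths

# One cache folder per model; each processed batch is stored there as a JSON file.
cache_folder = f"./cache_{model_config['model']}_model_misconceptions_result"
os.makedirs(cache_folder, exist_ok=True)
# The final CSV name combines the input file stem and the model name.
output_data_path = f"misconception_data_{os.path.splitext(os.path.basename(data_path))[0]}_{model_config['model']}.csv"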
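To make the naming scheme concrete, the sketch below evaluates the same two f-strings with the values configured earlier. Nothing is created on disk; the `stem` and `model` variables are introduced here only for readability.

```python
import os

data_path = "~/work/eedi_synthetic_data/MalAlgoQA_format.csv"
model = "gpt-4o"  # same value as model_config['model']

# File name without directory or extension.
stem = os.path.splitext(os.path.basename(data_path))[0]

cache_folder = f"./cache_{model}_model_misconceptions_result"
output_data_path = f"misconception_data_{stem}_{model}.csv"

print(cache_folder)      # ./cache_gpt-4o_model_misconceptions_result
print(output_data_path)  # misconception_data_MalAlgoQA_format_gpt-4o.csv
```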

4. AI parallel processing module

4.1 Define the LLM client class

class LLMChat:
    """Thin wrapper around an OpenAI-compatible chat completions endpoint."""

    def __init__(self, openai_api_base, api_key, model, default_temperature,
                 default_system_prompt, max_tokens=512):
        self.client = OpenAI(api_key=api_key, base_url=openai_api_base)
        self.model = model
        self.default_temperature = default_temperature
        self.default_system_prompt = default_system_prompt
        self.max_tokens = max_tokens

    def chat(self, user_prompt, system_prompt=None, temperature=None):
        # Compare against None so that an explicit temperature of 0 is respected.
        if system_prompt is None:
            system_prompt = self.default_system_prompt
        if temperature is None:
            temperature = self.default_temperature
        chat_response = self.client.chat.completions.create(
            model=self.model,
            temperature=temperature,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            max_tokens=self.max_tokens,
            response_format={"type": "json_object"},  # request a JSON object reply
        )
        return chat_response.choices[0].message.content
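Remote API calls can fail transiently (rate limits, timeouts), and the client class above makes a single attempt per call. One possible addition, not part of the original code, is a small retry wrapper with exponential backoff. The sketch below assumes any callable that takes a prompt and returns a string; `chat_with_retry` and its parameters are hypothetical names.

```python
import time

def chat_with_retry(chat_fn, user_prompt, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call chat_fn(user_prompt), retrying with exponential backoff on any exception.

    Hypothetical helper. `sleep` is injectable so the delay can be stubbed in tests.
    """
    for attempt in range(retries):
        try:
            return chat_fn(user_prompt)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait base_delay, then 2x, 4x, ...

# Demonstration with a stand-in function that fails twice before succeeding.
attempts = []

def flaky_chat(prompt):
    attempts.append(prompt)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return '{"SubjectName": "demo"}'

reply = chat_with_retry(flaky_chat, "Question: ...", sleep=lambda seconds: None)
print(reply, len(attempts))
```

With the client from this section, the call would look like `chat_with_retry(vc.chat, input_user_prompt)`.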

4.2 Define the row-processing function

def process_row(args, debug=False):
    """Build the prompt for one (index, row) pair and return the model's reply as a JSON string."""
    user_prompt = """Question: {question}
A: {answer_a}
B: {answer_b}
C: {answer_c}
D: {answer_d}
Correct Answer: {correct_answer}"""
    index, row = args
    ca = row["CorrectAnswer"]                # letter of the correct option, e.g. "B"
    correctanswer = row[f"Answer{ca}Text"]   # text of the correct option
    input_user_prompt = user_prompt.format(
        question=row['QuestionText'],
        answer_a=row['AnswerAText'],
        answer_b=row['AnswerBText'],
        answer_c=row['AnswerCText'],
        answer_d=row['AnswerDText'],
        correct_answer=correctanswer,
    )
    try:
        ret_data = vc.chat(input_user_prompt)  # vc is the global LLMChat created in 5.2
    except Exception as e:
        print(f'An exception occurred: {e}')
        # Return a JSON string on failure too, so every result can pass through json.loads in 5.4.
        ret_data = json.dumps({'error': str(e)})
    if debug:
        print('system: ', model_config['default_system_prompt'])
        print('>' * 50)
        print('user_input: ', input_user_prompt)
        print('>' * 50)
        print('assistant: ', ret_data)
    return ret_data
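To see what the model actually receives, the prompt-building step can be run on its own without any API call. The sketch below assumes the template is laid out one field per line, matching the "User input" section of the system prompt; `build_user_prompt` is a hypothetical name, and the sample row is made up for illustration.

```python
USER_PROMPT = """Question: {question}
A: {answer_a}
B: {answer_b}
C: {answer_c}
D: {answer_d}
Correct Answer: {correct_answer}"""

def build_user_prompt(row):
    """Format one question row (a dict or a pandas Series) into the user prompt."""
    correct_letter = row["CorrectAnswer"]  # e.g. "B"
    return USER_PROMPT.format(
        question=row["QuestionText"],
        answer_a=row["AnswerAText"],
        answer_b=row["AnswerBText"],
        answer_c=row["AnswerCText"],
        answer_d=row["AnswerDText"],
        correct_answer=row[f"Answer{correct_letter}Text"],
    )

# Made-up sample row, for illustration only.
sample_row = {
    "QuestionText": "What is 2 + 3?",
    "AnswerAText": "6",
    "AnswerBText": "5",
    "AnswerCText": "23",
    "AnswerDText": "1",
    "CorrectAnswer": "B",
}
prompt = build_user_prompt(sample_row)
print(prompt)
```

The last line of the printed prompt repeats the text of the correct option (`5` here), not its letter.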

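4.3 Define the JSON saving function

def save_json(fn, obj):
    # Write UTF-8 explicitly, since ensure_ascii=False keeps non-ASCII characters as-is.
    with open(fn, 'w', encoding='utf-8') as f:
        json.dump(obj, f, ensure_ascii=False, indent=4)
    print(f"save file to {fn}")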

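4.4 Define the range-slicing function

def slice_range(start, end, step):
    """Return the boundaries start, start+step, ... with end always as the last one."""
    if step <= 0:
        raise ValueError("step must be greater than 0")
    result = []
    while start <= end:
        result.append(start)
        start += step
    # Append end when the last step stopped short of it (this also covers an empty result).
    if not result or result[-1] < end:
        result.append(end)
    return result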

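4.5 Define the slice-pairing function

def process_pairs(sliced_range):
    """Turn boundaries [b0, b1, b2, ...] into [start, end) pairs [[b0, b1], [b1, b2], ...]."""
    slices = []
    for first, second in zip(sliced_range, sliced_range[1:]):
        slices.append([first, second])
    return slices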
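A short worked example shows how the two helpers cooperate. The functions are repeated here so the snippet runs on its own; the row count of 250 is an arbitrary illustrative value, not the size of the real dataset.

```python
def slice_range(start, end, step):
    """Boundaries start, start+step, ... with end always included last."""
    if step <= 0:
        raise ValueError("step must be greater than 0")
    result = []
    while start <= end:
        result.append(start)
        start += step
    if not result or result[-1] < end:
        result.append(end)
    return result

def process_pairs(sliced_range):
    """Consecutive boundaries become [start, end) pairs."""
    return [[first, second] for first, second in zip(sliced_range, sliced_range[1:])]

boundaries = slice_range(0, 250, 100)
pairs = process_pairs(boundaries)
print(boundaries)  # [0, 100, 200, 250]
print(pairs)       # [[0, 100], [100, 200], [200, 250]]
```

Each pair is later used as `df.iloc[start:end]`, so the end index is exclusive and the final, shorter batch covers rows 200 to 249.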

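4.6 Define the filename sorting function

def natural_sort_key(filename):
    # Sort by the numbers embedded in the name, so cache_res_200 comes before cache_res_1000.
    parts = re.findall(r'\d+', filename)
    return tuple(map(int, parts))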
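The reason for a numeric sort key becomes clear when it is compared with plain string sorting. The file names below follow the `cache_res_{start}.json` pattern used in 5.3; the specific start indices are illustrative.

```python
import re

def natural_sort_key(filename):
    """Tuple of all integers found in the name, used as the sort key."""
    return tuple(map(int, re.findall(r"\d+", filename)))

names = ["cache_res_1000.json", "cache_res_200.json", "cache_res_0.json"]

lexicographic = sorted(names)
natural = sorted(names, key=natural_sort_key)

print(lexicographic)  # ['cache_res_0.json', 'cache_res_1000.json', 'cache_res_200.json']
print(natural)        # ['cache_res_0.json', 'cache_res_200.json', 'cache_res_1000.json']
```

Only the natural order matches the row order of the DataFrame, which matters because 5.5 joins the generated columns to the original rows by position.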

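5. Data integration module

5.1 Load the data and generate slices

df = pd.read_csv(data_path)
df.head()
index_end = len(df)  # df must exist before its length can serve as the end index
sliced_range = process_pairs(slice_range(index_start, index_end, step))

Inspecting df:
[Figure: preview of the df table]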

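5.2 Initialize the LLM client and test it

vc = LLMChat(**model_config)
# Smoke test on a single row (row 7) with debug output before launching the full run.
r = process_row((7, df.iloc[7]), debug=True)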

5.3 Generate data in parallel

for slices in tqdm(sliced_range, total=len(sliced_range)):
    output_filepath = f'{cache_folder}/cache_res_{slices[0]}.json'
    # Resume support: a batch that already has a cache file is not requested again.
    if os.path.exists(output_filepath):
        print(f'cache file exists, skip {output_filepath}')
        continue
    df_tasks = df.iloc[slices[0]:slices[1]]
    # The worker processes rely on inheriting the global vc (true for the fork start method).
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        results = list(tqdm(executor.map(process_row, df_tasks.iterrows()), total=len(df_tasks)))
    save_json(output_filepath, results)
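The loop combines two ideas: one cache file per batch, so an interrupted run can resume, and parallel requests inside each batch. The sketch below reproduces that pattern without touching any API. It swaps in `ThreadPoolExecutor` for the original `ProcessPoolExecutor` (a substitution made here because the work is network-bound, not the author's choice), and `fake_process_row` and `run_batches` are hypothetical stand-ins.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def fake_process_row(item):
    """Stand-in for process_row: returns a JSON string without calling any API."""
    index, row = item
    return json.dumps({"index": index, "echo": row})

def run_batches(rows, pairs, cache_dir, worker=fake_process_row, max_workers=2):
    """Process each [start, end) slice once; slices that have a cache file are skipped."""
    processed = []
    for start, end in pairs:
        path = os.path.join(cache_dir, f"cache_res_{start}.json")
        if os.path.exists(path):
            continue  # resume: this slice finished in an earlier run
        tasks = list(enumerate(rows[start:end], start=start))
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            results = list(executor.map(worker, tasks))  # map preserves input order
        with open(path, "w", encoding="utf-8") as f:
            json.dump(results, f, ensure_ascii=False)
        processed.append(start)
    return processed

rows = [f"question {i}" for i in range(5)]
pairs = [[0, 2], [2, 4], [4, 5]]
with tempfile.TemporaryDirectory() as cache_dir:
    first_run = run_batches(rows, pairs, cache_dir)
    second_run = run_batches(rows, pairs, cache_dir)

print(first_run)   # [0, 2, 4]
print(second_run)  # []
```

The second call does no work because every slice already has a cache file, which is the behavior that lets a long generation job be stopped and restarted safely.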

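5.4 Merge the results

f_names = glob.glob(f'{cache_folder}/*.json')
sorted_filenames = sorted(f_names, key=natural_sort_key)  # restore the original row order
f_names = sorted_filenames
results = []
for fn in f_names:
    with open(fn, 'r') as f:
        batch_results = json.load(f)
    results.extend(batch_results)
l = len(results)  # number of rows actually processed
# Each entry is normally a JSON string returned by the model; dicts are passed through unchanged.
results = [json.loads(r) if isinstance(r, str) else r for r in results]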
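Cached entries are not guaranteed to be clean: a request may have failed, or a reply may have been cut off mid-JSON. A possible way to make the merge tolerant and to find rows worth re-running is sketched below; `normalize` and the sample `cached` list are hypothetical and only illustrate the idea.

```python
import json

def normalize(item):
    """Return a dict for one cached result, whether stored as a JSON string or a dict."""
    if isinstance(item, dict):
        return item
    try:
        parsed = json.loads(item)
    except (TypeError, json.JSONDecodeError):
        return {"error": "unparseable reply"}
    return parsed if isinstance(parsed, dict) else {"error": "unexpected reply type"}

# Made-up cache contents: a good reply, a stored error, and a truncated reply.
cached = ['{"SubjectName": "Fractions"}', {"error": "timeout"}, '{"SubjectName": ']

records = [normalize(item) for item in cached]
failed_rows = [i for i, rec in enumerate(records) if "error" in rec]

print(records)
print(failed_rows)  # [1, 2]
```

The positions in `failed_rows` line up with DataFrame row positions as long as the cache files were loaded in natural order starting from row 0.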

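5.5 Save the final result

# Keep only the rows that were processed (this assumes index_start = 0).
df = df.iloc[:l].reset_index(drop=True)
gen_df = pd.DataFrame(results)
# concat with axis=1 aligns on the index, so both frames need the same 0..l-1 index.
df = pd.concat([df, gen_df], axis=1)
df.to_csv(output_data_path, index=False)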

(To be continued)
