Extracting and annotating KEGG results from eggNOG output

First, open the KEGG BRITE page for KEGG Orthology (KO).

Download the hierarchy as a JSON file (ko00001.json).

Process it with Python:

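import json
import re
import os

os.chdir("C:/Users/fordata/Downloads/")

# Load the KO hierarchy downloaded from KEGG BRITE
with open("ko00001.json", "r", encoding="utf-8") as f:
    kojson = json.load(f)

# Write one row per KO: level A, level B, level C, then the KO entry
with open("newKegg.tsv", "w", encoding="utf-8") as k:
    for i in kojson['children']:
        # Level A: "code name" -> "code<TAB>name"
        ii = i['name'].replace(" ", "\t", 1)
        for j in i['children']:
            # Level B
            jj = j['name'].replace(" ", "\t", 1)
            for m in j['children']:
                # Level C: prefix the code with "ko" when the name carries a ko pathway tag
                if re.findall(r"ko\d{5}", m['name']):
                    mm = "ko" + m['name'].replace(" ", "\t", 1)
                else:
                    mm = m['name'].replace(" ", "\t", 1)
                try:
                    for n in m['children']:
                        # Level D: "K number  gene; description" -> three fields
                        if ";" in n['name']:
                            nn = n['name'].replace(" ", "\t", 1).replace("; ", "\t", 1)
                        else:
                            # No gene symbol: leave that field blank
                            nn = n['name'].replace(" ", "\t \t", 1)
                        k.write(ii + "\t" + jj + "\t" + mm + "\t" + nn + "\n")
                except KeyError:
                    # Level C node without children: write blank KO fields
                    nn = " \t \t "
                    k.write(ii + "\t" + jj + "\t" + mm + "\t" + nn + "\n")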

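This produces newKegg.tsv, a tab-separated table with one row per KO together with its level A, B, and C categories.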
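To make the nested layout easier to picture, here is a minimal sketch that flattens a hand-written miniature of the hierarchy with a recursive generator. The `sample` dictionary, `split_name`, and `flatten` are hypothetical names for illustration; the real ko00001.json is much larger and may differ in detail.

```python
# Hypothetical miniature of the ko00001.json layout; each name follows
# the "code description" pattern. The real file is far larger.
sample = {
    "name": "ko00001",
    "children": [{
        "name": "09100 Metabolism",
        "children": [{
            "name": "09101 Carbohydrate metabolism",
            "children": [{
                "name": "00010 Glycolysis / Gluconeogenesis [PATH:ko00010]",
                "children": [
                    {"name": "K00844  HK; hexokinase [EC:2.7.1.1]"},
                ],
            }],
        }],
    }],
}


def split_name(name):
    """Split 'CODE rest' into (code, rest), tolerating repeated spaces."""
    code, _, rest = name.partition(" ")
    return code, rest.strip()


def flatten(node, path=()):
    """Yield one tuple per leaf holding the (code, description) pairs along the path."""
    children = node.get("children")
    if not children:
        yield path
        return
    for child in children:
        yield from flatten(child, path + split_name(child["name"]))


rows = list(flatten(sample))
for row in rows:
    print("\t".join(row))
```

Unlike the script above, this sketch does not pad rows for nodes that stop at level C, so such rows would come out shorter.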

Next, a script that links the KEGG KOs to the TPM values:

#! /usr/bin/env python
#########################################################
# mix eggnog(kegg) result with tpm
# written by PeiZhong in IFR of CAAS
import argparse
import pandas as pd

# Parse command-line arguments
parser = argparse.ArgumentParser(description='Mix eggnog(kegg) result with TPM')
parser.add_argument('--result', "-r", required=True, help='Path to eggnog result file')
parser.add_argument('--tpm', "-t", required=True, help='Path to TPM table file')
parser.add_argument('--out', "-o", required=True, help='Path to output file')
args = parser.parse_args()

# Step 1: Read input files
print("Reading input files")

# Read eggnog result
df_result = {}
df_kegg = set()  # Use a set to store unique KOs
with open(args.result, "r") as f:
    for line in f:
        # Skip comment and header lines
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        protein_id = fields[0]
        kegg_str = fields[11]  # KO column, e.g. "ko:K00001,ko:K00002"
        if "-" != kegg_str:
            df_result[protein_id] = kegg_str
            # Extract KOs and remove duplicates
            families = set(entry.split(":")[1].strip() for entry in kegg_str.split(','))
            df_kegg.update(families)  # Add unique KOs to the global set

# Read TPM file
df_tpm = pd.read_csv(args.tpm, sep='\t')

# Step 2: Process eggnog results and calculate TPM sums for each sample
print("Processing eggnog results and calculating TPM sums for each sample")

# Initialize a dictionary to store TPM sums for each KO and sample
kegg_tpm_sums = {ko: {sample: 0.0 for sample in df_tpm.columns[1:]} for ko in df_kegg}

# Convert TPM table to a dictionary for faster lookup
tpm_dict = df_tpm.set_index(df_tpm.columns[0]).to_dict(orient='index')

# Process each protein in the eggnog result
for protein_id, kegg_str in df_result.items():
    # Convert protein ID to gene ID by removing trailing "_number"
    if "_" in protein_id:
        gene_id = protein_id.rsplit("_", 1)[0]  # Split from right on the last "_"
    else:
        print(f"Warning: Protein ID {protein_id} has no underscore, using as gene ID")
        gene_id = protein_id

    # Get TPM values for this gene
    if gene_id not in tpm_dict:
        print(f"Warning: No TPM values found for {gene_id} (protein {protein_id})")
        continue
    tpm_values = tpm_dict[gene_id]

    # Extract unique KOs for this protein
    families = set(entry.split(':')[1].strip() for entry in kegg_str.split(','))

    # Update TPM sums for each unique KO
    for family in families:
        if family in kegg_tpm_sums:
            for sample in df_tpm.columns[1:]:
                kegg_tpm_sums[family][sample] += tpm_values[sample]
        else:
            # Dynamically add new KOs
            kegg_tpm_sums[family] = {sample: tpm_values[sample] for sample in df_tpm.columns[1:]}

# Create and save output DataFrame
output_df = pd.DataFrame.from_dict(kegg_tpm_sums, orient='index')
output_df.index.name = 'KO'  # Index holds KEGG KO identifiers
output_df.to_csv(args.out, sep='\t', float_format='%.2f')  # Round to 2 decimal places
print(f"Results saved to {args.out}")

This yields a table of summed TPM per KO for each sample.
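The same aggregation can also be sketched with vectorized pandas operations. The example below uses made-up toy data; the column names (`protein_id`, `KEGG_ko`, `gene_id`, `S1`, `S2`) and the IDs are assumptions for illustration, not taken from a real eggNOG or TPM file.

```python
import pandas as pd

# Toy inputs: IDs and values are made up for illustration.
annot = pd.DataFrame({
    "protein_id": ["geneA_1", "geneB_1", "geneC_1"],
    "KEGG_ko": ["ko:K00001,ko:K00002", "ko:K00001", "-"],
})
tpm = pd.DataFrame({
    "gene_id": ["geneA", "geneB", "geneC"],
    "S1": [1.0, 2.0, 5.0],
    "S2": [10.0, 20.0, 50.0],
})

# Drop unannotated proteins and derive the gene ID by removing the
# trailing "_number", as the script above does.
annot = annot[annot["KEGG_ko"] != "-"].copy()
annot["gene_id"] = annot["protein_id"].str.rsplit("_", n=1).str[0]

# Expand to one row per (protein, KO) pair and strip the "ko:" prefix.
annot["KO"] = annot["KEGG_ko"].str.split(",")
pairs = annot.explode("KO")
pairs["KO"] = pairs["KO"].str.replace("ko:", "", regex=False).str.strip()
pairs = pairs.drop_duplicates(["protein_id", "KO"])

# Attach TPM values by gene ID and sum them per KO.
merged = pairs.merge(tpm, on="gene_id", how="inner")
ko_tpm = merged.groupby("KO")[["S1", "S2"]].sum()
print(ko_tpm)
```

A protein annotated with several KOs contributes its full TPM to each of them, which matches the behavior of the loop-based script.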

To add the corresponding KEGG levels, look them up in Excel with the VLOOKUP function.
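For anyone who prefers to stay in Python, a left merge in pandas plays a similar role to VLOOKUP. This is a minimal sketch on toy tables; the column names (`level_A`, `level_C`, `KO`, `S1`) are assumptions, and `K99999` is a made-up KO used to show an unmatched row.

```python
import pandas as pd

# Toy stand-ins: in practice `levels` would be read from newKegg.tsv and
# `ko_tpm` from the TPM summary table. Column names are assumptions.
levels = pd.DataFrame({
    "level_A": ["Metabolism", "Metabolism"],
    "level_C": ["Glycolysis / Gluconeogenesis", "Fructose and mannose metabolism"],
    "KO": ["K00844", "K00844"],
})
ko_tpm = pd.DataFrame({"KO": ["K00844", "K99999"], "S1": [3.0, 1.0]})

# A left merge keeps every KO from the TPM table; KOs missing from the
# hierarchy get NaN in the level columns.
annotated = ko_tpm.merge(levels, on="KO", how="left")
print(annotated)
```

One difference worth keeping in mind: VLOOKUP returns only the first match, while the merge returns one row per pathway when a KO belongs to several, so summing TPM over the merged table would count such KOs more than once.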
