cellphoneDB v5更新与Python环境可视化
目录
导入环境
输入准备
矩阵与注释
DEG信息
空间微环境(可选)
调控因子信息(可选)
正式运行
读取结果可视化
热图
气泡图
之前发布过一篇使用cellphonedb进行细胞间通讯分析的文章
cellphoneDB进行CCI以及可视化-CSDN博客
后面发现这个方法的最新版本添加了很多新的输入文件,并且可视化也可以不用转到R环境,对应的python包就可以实现所有必要的可视化,于是更新记录一下
最新的安装包和说明详见githubhttps://github.com/ventolab/CellphoneDB
导入环境
import numpy as np
import pandas as pd
import scanpy as sc
import os
import sys
from scipy import sparse
from cellphonedb.src.core.methods import cpdb_degs_analysis_methodsc.settings.verbosity = 1
sys.executableimport anndata as ad
import ktplotspy as kpy
import matplotlib.pyplot as plt
%matplotlib inline
注意这里安装完记得检查一下cellphonedb版本是否>5.0.0
输入准备
# Our base directory for the analysis
os.chdir('/data/work/')# Path to our input files
cpdb_file_path = 'input/cellphonedb.zip'
meta_file_path = 'input/meta.tsv'
counts_file_path = 'input/counts_normloqTransformed.h5ad'
microenvs_file_path = 'input/microenvironments.tsv'
degs_file_path = 'input/DEGs_upregulated_genes.tsv'
active_tf_path = 'input/active_TFs.tsv'
out_path = 'results/'
矩阵与注释
首先还是正常先导入矩阵和注释信息
# 计数矩阵
adata = sc.read('/data/work/your_sc_or_sp.h5ad')
adata.write(counts_file_path) # 生成Metadata文件(假设注释列为annotation)
meta_df = adata.obs.reset_index()[['index', 'annotation']]
meta_df.columns = ['Cell', 'annotation']
meta_df.to_csv(meta_file_path, sep='\t', index=False)
接下来是相比旧版本多出的三个
DEG信息
degs_file_path 添加数据高表达基因的信息,其实只需要注释分群信息和对应的基因两列即可
这里我的DEG表格来自之前Seurat中的FindAllMarkers,并进行了简单的过滤
library(Seurat)
library(harmony)
library(ggplot2)data <- readRDS("/data/work/your_sc_sp.rds")
Idents(data) <- "annotation"
# 可以自定义过滤阈值
markers <- FindAllMarkers(data, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
write.csv(markers, file = "/data/work/annotation_deg.csv", row.names = FALSE)
导入结果表格如下
# 读取DEG文件
input_path = "/data/work/annotation_deg.csv"
df = pd.read_csv(input_path)
df_sub = df[['cluster', 'gene']]
df_sub.to_csv(degs_file_path, sep='\t', index=False, header=True)
空间微环境(可选)
microenvs_file_path 用来限制空间转录组数据的细胞微环境,最后的通路会被限制在单个微环境中,从而过滤一些较远的不真实信号,
这一步可以自行选择,可以不添加这个输入也可以设置成同一个微环境作为输入
## 假设我这里创建neuron与noneuron两种环境
cell_types = adata.obs['annotation'].unique().tolist()# 创建微环境映射
microenv_mapping = []
for cell_type in cell_types:# 判断是否为neuron微环境,EX开头和In的亚群if cell_type.startswith('EX_') or cell_type == 'In':microenv = 'neuron'else:microenv = 'noneuron'microenv_mapping.append({'annotation': cell_type, 'microenvironment': microenv})microenv_df = pd.DataFrame(microenv_mapping)
microenv_df.to_csv(microenvs_file_path, sep='\t', index=False)
调控因子信息(可选)
active_tf_path 纳入转录调控因子分析的信息,比如我们可以把scenic/pyscenic的结果进行适当修改作为这里转录调控信息的输入,这一步同样也是可选的
import networkx as nx
import matplotlib.pyplot as plt
from scipy.stats import wilcoxon
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import fdrcorrection# auc_matrix是每个细胞对应每个TF的分值矩阵
auc_df = pd.read_csv("/data/work/pyscenic_out/auc_matrix.csv")
auc_df.columns = auc_df.columns.str.replace(r'\(\+\)$', '', regex=True)
auc_df.set_index('Cell', inplace=True)# 对齐细胞名称
common_cells = adata.obs_names[adata.obs_names.isin(auc_df.index)]
adata = adata[common_cells]
auc_df = auc_df.loc[common_cells]# 提取TF名称(从列名中排除'Cell')
tfs = auc_df.columns.tolist() # 直接获取所有TF列名# 按细胞类型检测差异活性
active_tfs_list = []
for cluster in adata.obs['annotation'].unique():cluster_mask = (adata.obs['annotation'] == cluster)cluster_auc = auc_df.loc[cluster_mask]other_auc = auc_df.loc[~cluster_mask]for tf in tfs:# 跳过全零的TF以避免检验失败if cluster_auc[tf].max() == 0 and other_auc[tf].max() == 0:continue try:# Mann-Whitney U检验(单侧:当前类型活性更高)stat, pval = mannwhitneyu(cluster_auc[tf], other_auc[tf], alternative='greater')mean_activity = cluster_auc[tf].mean()active_tfs_list.append({'annotation': cluster, 'TF': tf,'pval': pval,'mean_activity': mean_activity})except ValueError: # 处理样本量不足continue# 多重检验校正
tf_activity_df = pd.DataFrame(active_tfs_list)
if not tf_activity_df.empty:_, qvals = fdrcorrection(tf_activity_df['pval'])tf_activity_df['qval'] = qvals
else:tf_activity_df['qval'] = np.nan # 避免空数据时报错# 筛选显著活跃的TF(q<0.05 & 平均活性>0.1)
sig_tf_df = tf_activity_df[(tf_activity_df['qval'] < 0.05) & (tf_activity_df['mean_activity'] > 0.1)
]# 保存为CellPhoneDB格式
sig_tf_df[['annotation', 'TF']].to_csv(active_tf_path, sep='\t', index=False, header=True)
这里的调控因子矩阵打分来自pyscenic流程,有空也梳理一下这个软件,具体显著活跃TF的筛选过程仅供参考,最终得到每个分群注释对应的活跃TF两列表格即可。
总之,通过对上述新输入信息的加权计算,最终我们得到的通路信息显然更加真实可信
正式运行
准备好所有的输入文件后,就可以进行正式计算了
cpdb_results = cpdb_degs_analysis_method.call(cpdb_file_path = cpdb_file_path, # mandatory: CellphoneDB database zip file.meta_file_path = meta_file_path, # mandatory: tsv file defining barcodes to cell label.counts_file_path = counts_file_path, # mandatory: normalized count matrix - a path to the counts file, or an in-memory AnnData objectdegs_file_path = degs_file_path, # mandatory: tsv file with DEG to account.counts_data = 'hgnc_symbol', # defines the gene annotation in counts matrix.
# active_tfs_file_path = active_tf_path, # optional: defines cell types and their active TFs.
# microenvs_file_path = microenvs_file_path, # optional (default: None): defines cells per microenvironment.score_interactions = True, # optional: whether to score interactions or not. threshold = 0.1, # defines the min % of cells expressing a gene for this to be employed in the analysis.result_precision = 3, # Sets the rounding for the mean values in significan_means.separator = '|', # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".debug = False, # Saves all intermediate tables emplyed during the analysis in pkl format.output_path = out_path, # Path to save resultsoutput_suffix = None, # Replaces the timestamp in the output files by a user defined string in the (default: None)threads = 25)
运行日志以及输出的文件如下
(这个方法运行速度要快一点,毕竟取了DEG)
读取结果可视化
首先读取我们上一步结果路径里的文件,注意DEG方法的pvals指的是relevant_interactions这个文件,里面是布尔值,旧方法里指的就是pvalues文件
# .h5ad file used for performing CellPhoneDB
adata = ad.read_h5ad("/data/work/input/counts_normloqTransformed.h5ad")
# output from CellPhoneDB
means = pd.read_csv("/data/work/results/degs_analysis_means_07_10_2025_103746.txt", sep="\t")
significant_means = pd.read_csv("data/work/results/degs_analysis_significant_means_07_10_2025_103746.txt", sep="\t")
decon = pd.read_csv("data/work/results/degs_analysis_deconvoluted_07_10_2025_103746.txt", sep="\t")
pvals = pd.read_csv("data/work/results/degs_analysis_relevant_interactions_07_10_2025_103746.txt", sep="\t")
热图
symmetrical参数可以选择是否强制对称,也可以挑选感兴趣的注释分群显示
kpy.plot_cpdb_heatmap(pvals=pvals, figsize=(5, 5), title="Sum of significant interactions")
kpy.plot_cpdb_heatmap(pvals=pvals, figsize=(5, 5), title="Sum of significant interactions", symmetrical=False)
kpy.plot_cpdb_heatmap(pvals=pvals, cell_types=["EX_BDNF","EX_NTNG1","Oligo","Astro","In","Endo","Opc"], figsize=(4, 4), title="Sum of significant interactions"
)
气泡图
将结果与Anndata矩阵文件结合,可以绘制表达量气泡图
p1 = kpy.plot_cpdb(adata = adata,cell_type1 = "EX_BDNF",#感兴趣分群,全部则使用"."cell_type2 = ".", means = means,pvals = pvals,celltype_key = "annotation",#分群标签genes = ["ADCYAP1"],#感兴趣基因figsize = (25,5),title = "Interactions involving ADCYAP1 ligand",max_size = 5,highlight_size = 0.75,degs_analysis = True,standard_scale = False
)
p1.save('dotplot_EX_BDNF_ADCYAP1.pdf')
当然可视化部分也完全可以在外部软件进行,结果表保存好就可以啦~