当前位置：首页 > news >正文

不足3个细胞怎么做差异分析？

news 2025/8/10 5:37:30

一、写在前面

如图，有粉丝发现，当差异基因时存在细胞类型不足三个细胞，则会出现报错。解决方法自然是有的，我们此前分享的应审稿人要求| pseudo bulk差异分析便可解决这一问题。只是，为何要强求这一两个细胞的分析？做出的差异结果可信度大大的存疑，后续的验证也只怕是难上加难，不如及时止损。系统性单细胞教程可见：scRNA-seq，从小白到文献复现大神！

二、重现报错

这里的测试数据来自经典教案，多样本整合

1、构建模拟数据

# 加载包：
library(Seurat)

## Loading required package: SeuratObject

## Loading required package: sp

## 'SeuratObject' was built with package 'Matrix' 1.7.1 but the current
## version is 1.7.2; it is recomended that you reinstall 'SeuratObject' as
## the ABI for 'Matrix' may have changed

## 
## Attaching package: 'SeuratObject'

## The following objects are masked from 'package:base':
## 
##     intersect, t

# 读取数据：
scRNA <-readRDS('./pbmcrenamed.rds')# 查看包含细胞类型：
DimPlot(scRNA)

# 取出EC：
EC <-subset(scRNA,idents ='EC')
# 保留两个EC
EC <- EC[,1:2]
# 整合EC与Macro
data4diff <-merge(EC,subset(scRNA,idents ='Macro'))# 此时我们就获得了包含EC、Macro的单细胞对象，其中EC的数量只有两个：  
table(Idents(data4diff))

## 
##    EC Macro 
##     2   100

2、查看报错

其实是我还挣扎过，尝试更改FindMarkers函数的test.use的参数来规避这个问题。事实上，这种情况下无论你采用什么内置差异分析算法，都会报Cell group 2 has fewer than 3 cells的错误，应该是ValidateCellGroups的底层逻辑了。

try({FindMarkers(data4diff,ident.1 ='Macro',ident.2 ='EC')})

## Error in ValidateCellGroups(object = object, cells.1 = cells.1, cells.2 = cells.2,  : 
##   Cell group 2 has fewer than 3 cells

try({FindMarkers(data4diff,ident.1 ='Macro',ident.2 ='EC',test.use ='wilcox')})

## Error in ValidateCellGroups(object = object, cells.1 = cells.1, cells.2 = cells.2,  : 
##   Cell group 2 has fewer than 3 cells

try({FindMarkers(data4diff,ident.1 ='Macro',ident.2 ='EC',test.use ='wilcox_limma')})

## Error in ValidateCellGroups(object = object, cells.1 = cells.1, cells.2 = cells.2,  : 
##   Cell group 2 has fewer than 3 cells

try({FindMarkers(data4diff,ident.1 ='Macro',ident.2 ='EC',test.use ='bimod')})

## Error in ValidateCellGroups(object = object, cells.1 = cells.1, cells.2 = cells.2,  : 
##   Cell group 2 has fewer than 3 cells

try({FindMarkers(data4diff,ident.1 ='Macro',ident.2 ='EC',test.use ='roc')})

## Error in ValidateCellGroups(object = object, cells.1 = cells.1, cells.2 = cells.2,  : 
##   Cell group 2 has fewer than 3 cells

try({FindMarkers(data4diff,ident.1 ='Macro',ident.2 ='EC',test.use ='t')})

## Error in ValidateCellGroups(object = object, cells.1 = cells.1, cells.2 = cells.2,  : 
##   Cell group 2 has fewer than 3 cells

try({FindMarkers(data4diff,ident.1 ='Macro',ident.2 ='EC',test.use ='negbinom')})

## Error in ValidateCellGroups(object = object, cells.1 = cells.1, cells.2 = cells.2,  : 
##   Cell group 2 has fewer than 3 cells

try({FindMarkers(data4diff,ident.1 ='Macro',ident.2 ='EC',test.use ='poisson')})

## Error in ValidateCellGroups(object = object, cells.1 = cells.1, cells.2 = cells.2,  : 
##   Cell group 2 has fewer than 3 cells

try({FindMarkers(data4diff,ident.1 ='Macro',ident.2 ='EC',test.use ='LR')})

## Error in ValidateCellGroups(object = object, cells.1 = cells.1, cells.2 = cells.2,  : 
##   Cell group 2 has fewer than 3 cells

三、解决问题

3.1 `Seurat`的伪`bulk`方案

### 生成拟bulk RNA数据 ###
bulk <-AggregateExpression(scRNA, return.seurat = T, slot ="counts", assays ="RNA",
group.by =c("celltype", # 细胞类型对应的注释
"orig.ident", # 样本对应的注释
"group")# 分组变量对应的注释 )

## Centering and scaling data matrix

# 生成的是一个新的Seurat对象
bulk

## An object of class Seurat 
## 20254 features across 26 samples within 1 assay 
## Active assay: RNA (20254 features, 0 variable features)
##  3 layers present: counts, data, scale.data

# 下面的额计算依赖DESeq2，做过Bulk RNA-Seq的同学都知道：
if(!require(DESeq2))BiocManager::install('DESeq2')

## Loading required package: DESeq2

## Loading required package: S4Vectors

## Loading required package: stats4

## Loading required package: BiocGenerics

## 
## Attaching package: 'BiocGenerics'

## The following object is masked from 'package:SeuratObject':
## 
##     intersect

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
##     table, tapply, union, unique, unsplit, which.max, which.min

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:sp':
## 
##     %over%

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: SummarizedExperiment

## Loading required package: MatrixGenerics

## Loading required package: matrixStats

## 
## Attaching package: 'MatrixGenerics'

## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## 
## Attaching package: 'Biobase'

## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians

## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians

## 
## Attaching package: 'SummarizedExperiment'

## The following object is masked from 'package:Seurat':
## 
##     Assays

## The following object is masked from 'package:SeuratObject':
## 
##     Assays

Idents(bulk) <-'celltype'
bulk_deg <-FindMarkers(bulk, 
ident.1 ="Macro", ident.2 ="EC", # 这样算出来的Fold Change就是Macro/EC
slot ="counts", test.use ="DESeq2",# 这里可以选择其它算法
verbose = F# 关闭进度提示)

## converting counts to integer mode

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

head(bulk_deg)# 看一下差异列表

##                p_val avg_log2FC pct.1 pct.2     p_val_adj
## Lyz2   1.319068e-167   8.390904 1.000     1 2.671640e-163
## Pecam1  1.676724e-48  -5.088600 1.000     1  3.396037e-44
## Fabp4   7.472478e-46  -6.226709 0.667     1  1.513476e-41
## Tyrobp  5.513580e-45   5.530070 1.000     1  1.116720e-40
## Cytl1   3.912488e-34  -6.691938 0.667     1  7.924352e-30
## Cd52    1.527090e-33   5.216027 1.000     1  3.092968e-29

3.2 生成`bulk`矩阵

当然我们也可以把sc直接变成Bulk矩阵，然后直接用Bulk RNASeq| 转录组实战里经典的差异分析方法即可完成。

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:Biobase':
## 
##     combine

## The following object is masked from 'package:matrixStats':
## 
##     count

## The following objects are masked from 'package:GenomicRanges':
## 
##     intersect, setdiff, union

## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect

## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union

## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union

## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# 获取countss数据：
rna_count <-GetAssayData(data4diff,'RNA','count') %>%as.data.frame() 
# 得到的数据是一个列名是细胞barcode名(现在应该叫做样本名)，行名是基因名的矩阵
rna_count[1:4,1:4]

##         AACCCAAAGGCCTTGC-1_1 AAGACTCAGGCTCACC-1_1 GATGTTGGTGACAGGT-1_1
## Xkr4                       0                    0                    0
## Gm19938                    0                    0                    0
## Sox17                      1                   14                    0
## Gm37587                    0                    0                    0
##         GCCTGTTAGAGCTTTC-1_1
## Xkr4                       0
## Gm19938                    0
## Sox17                      0
## Gm37587                    0

# 构建分组信息：
group_info <-data.frame(sample =colnames(rna_count),
group = data4diff$celltype)# 查看整理好的分组信息：
group_info

##                                    sample group
## AACCCAAAGGCCTTGC-1_1 AACCCAAAGGCCTTGC-1_1    EC
## AAGACTCAGGCTCACC-1_1 AAGACTCAGGCTCACC-1_1    EC
## GATGTTGGTGACAGGT-1_1 GATGTTGGTGACAGGT-1_1 Macro
## GCCTGTTAGAGCTTTC-1_1 GCCTGTTAGAGCTTTC-1_1 Macro
## TACTTGTGTAGAGACC-1_1 TACTTGTGTAGAGACC-1_1 Macro
## AAAGTCCCAAGCTCTA-1_2 AAAGTCCCAAGCTCTA-1_2 Macro
## AACCTTTGTGTGACCC-1_2 AACCTTTGTGTGACCC-1_2 Macro
## ACATTTCAGGGTGAGG-1_2 ACATTTCAGGGTGAGG-1_2 Macro
## ACGGTTAGTTGAATCC-1_2 ACGGTTAGTTGAATCC-1_2 Macro

# 安装和加载必要的包：
if(!require(limma))BiocManager::install("limma")

## Loading required package: limma

## 
## Attaching package: 'limma'

## The following object is masked from 'package:DESeq2':
## 
##     plotMA

## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA

# 设置因子的水平，保证后续求出的fold change是Macro vs EC的结果：
group_info$group <-factor(group_info$group, levels=c("EC", "Macro"))# 使用voom转换count数据：
v <-voom(rna_count[,group_info$sample], plot=TRUE, design=model.matrix(~ group_info$group))

# 创建设计矩阵
design <-model.matrix(~ group_info$group)# 拟合线性模型
fit <-lmFit(v, design)# 应用经验贝叶斯方法
fit <-eBayes(fit)# 获取差异表达结果
results <-topTable(fit, adjust="BH", sort.by="P", number=Inf)

## Removing intercept from test coefficients

# 查看差异计算结果：
head(results)

##            logFC  AveExpr         t      P.Value    adj.P.Val         B
## Cytl1  -7.373118 6.217164 -16.08714 3.166532e-58 6.413494e-54 120.39129
## Tm4sf1 -7.100361 6.313693 -15.44600 8.084909e-54 8.187587e-50 110.46178
## Mgp    -6.442328 7.011167 -14.15988 1.630966e-45 1.101120e-41  91.80967
## Apoe   -6.635193 6.164305 -14.06780 6.019744e-45 3.048097e-41  90.43574
## Pecam1 -5.754467 6.615233 -12.22598 2.264653e-34 9.173657e-31  66.70513
## Bgn    -5.872551 6.342975 -11.82165 3.024253e-32 1.020887e-28  61.83869

######## 火山图 ##########
data <-data.frame(gene =row.names(results),
pval =-log10(results$P.Value), #pval已做-log10处理
lfc = results$logFC)# 查看数据：
head(data)

##     gene     pval       lfc
## 1  Cytl1 57.49942 -7.373118
## 2 Tm4sf1 53.09232 -7.100361
## 3    Mgp 44.78755 -6.442328
## 4   Apoe 44.22042 -6.635193
## 5 Pecam1 33.64500 -5.754467
## 6    Bgn 31.51938 -5.872551

# pval 小于0.05，且上下调log(Fold change)大于0.5(差异倍数为1.41)的为显著差异基因并标注上下调，其余
data <-mutate(data, type =case_when(data$lfc >0.5& data$pval >abs(log10(0.05)) ~"Up",data$lfc <-0.5& data$pval >abs(log10(0.05)) ~"Down",data$pval <abs(log10(0.05)) ~"no.Sig"))
# 去除包含na的数据
data <-na.omit(data)# 通过ggplot定义火山图
if(!require(ggplot2))install.packages('ggplot2')

## Loading required package: ggplot2

myplot <-ggplot(data,aes(lfc, pval))+# 横坐标为差异倍数，纵坐标为显著性
# 横向水平参考线：
geom_hline(yintercept =-log10(0.05), linetype ="dashed", color ="#999999")+
# 纵向垂直参考线：
geom_vline(xintercept =c(-0.5,0.5), linetype ="dashed", color ="#999999")+
# 散点图:
geom_point(aes(size=pval, shape=type,color= pval))+
# 指定颜色渐变模式：
scale_shape_manual(values =c(6,1,2))+
scale_color_gradientn(values =seq(0,1,0.2),
colors =c("#39489f","#39bbec",'#FFA500',"#f38466","#b81f25"))+
# 指定散点大小渐变模式：
scale_size_continuous(range =c(0.5,4))+
# 主题调整：
theme_bw()+
guides(col =guide_colourbar(title ="-Log10_q-value"),
size ="none")+
xlab("Log2FC")+
ylab("-Log10(FDR q-value)")+
scale_y_continuous(trans ="log1p")# 这时的火山图还是光秃秃的：
print(myplot)

# 过滤上下调top5的基因：
top_m <- data %>%group_by(type) %>%top_n(n =5, wt = pval) %>%# 如果你想改成top10，把5替换成对应的数字就好
filter(type !='no.Sig')           # 把top标签添加到火山图中
if(!require(ggrepel))install.packages('ggrepel')

## Loading required package: ggrepel

myplot <- myplot +
geom_text_repel(data=top_m,aes(x=lfc,y=pval,label=gene), size =3)# 查看最终的火山图，这时标签已经被添加好啦：
print(myplot)

# 挑一个差异看起来比较大的基因，在单细胞里检查一番：
VlnPlot(data4diff,features ='Lyz2')

虽然是给大家捣鼓出来了，Lyz2的差异还是很明显的，确实是巨噬细胞的marker，不过火山图还是有些奇怪的。还是劝大家及时止损，这两三个细胞的，不研究也罢。

四、文件下载

一切不给测试文件和分析环境版本的教程都是耍流氓，（获取方式）

分析环境：

sessionInfo()## R version 4.4.2 (2024-10-31)## Platform: x86_64-pc-linux-gnu## Running under: Ubuntu 20.04.4 LTS## ## Matrix products: default## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 ## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3;  LAPACK version 3.9.0## ## locale:##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       ## ## time zone: Etc/UTC## tzcode source: system (glibc)## ## attached base packages:## [1] stats4    stats     graphics  grDevices utils     datasets  methods  ## [8] base     ## ## other attached packages:##  [1] ggrepel_0.9.6               ggplot2_3.5.1              ##  [3] limma_3.62.2                dplyr_1.1.4                ##  [5] DESeq2_1.36.0               SummarizedExperiment_1.36.0##  [7] Biobase_2.66.0              MatrixGenerics_1.18.1      ##  [9] matrixStats_1.5.0           GenomicRanges_1.58.0       ## [11] GenomeInfoDb_1.42.1         IRanges_2.40.1             ## [13] S4Vectors_0.44.0            BiocGenerics_0.52.0        ## [15] Seurat_5.2.1                SeuratObject_5.0.2         ## [17] sp_2.2-0                   ## ## loaded via a namespace (and not attached):##   [1] RcppAnnoy_0.0.22        splines_4.4.2           later_1.4.1            ##   [4] bitops_1.0-9            tibble_3.2.1            polyclip_1.10-7        ##   [7] XML_3.99-0.18           fastDummies_1.7.5       lifecycle_1.0.4        ##  [10] globals_0.16.3          lattice_0.22-6          MASS_7.3-64            ##  [13] magrittr_2.0.3          plotly_4.10.4           sass_0.4.9             ##  [16] rmarkdown_2.29          jquerylib_0.1.4         yaml_2.3.10            ##  [19] httpuv_1.6.15           sctransform_0.4.1       spam_2.11-1            ##  [22] spatstat.sparse_3.1-0   reticulate_1.40.0       cowplot_1.1.3          ##  [25] pbapply_1.7-2           DBI_1.2.3               RColorBrewer_1.1-3     ##  [28] abind_1.4-8             zlibbioc_1.52.0         Rtsne_0.17             ##  [31] purrr_1.0.2             RCurl_1.98-1.16         GenomeInfoDbData_1.2.13##  [34] irlba_2.3.5.1           listenv_0.9.1           spatstat.utils_3.1-4   ##  [37] genefilter_1.84.0       openintro_2.5.0         airports_0.1.0         ##  [40] goftest_1.2-3           RSpectra_0.16-2         spatstat.random_3.4-1  ##  [43] annotate_1.84.0         fitdistrplus_1.2-2      parallelly_1.42.0      ##  [46] codetools_0.2-20        DelayedArray_0.32.0     tidyselect_1.2.1       ##  [49] UCSC.utils_1.2.0        farver_2.1.2            spatstat.explore_3.4-3 ##  [52] jsonlite_1.8.9          progressr_0.15.1        ggridges_0.5.6         ##  [55] survival_3.8-3          tools_4.4.2             ica_1.0-3              ##  [58] Rcpp_1.0.14             glue_1.8.0              gridExtra_2.3          ##  [61] SparseArray_1.6.0       xfun_0.50               withr_3.0.2            ##  [64] fastmap_1.2.0           digest_0.6.37           R6_2.5.1               ##  [67] mime_0.12               colorspace_2.1-1        scattermore_1.2        ##  [70] tensor_1.5              spatstat.data_3.1-6     RSQLite_2.3.9          ##  [73] tidyr_1.3.1             generics_0.1.3          data.table_1.16.4      ##  [76] usdata_0.3.1            httr_1.4.7              htmlwidgets_1.6.4      ##  [79] S4Arrays_1.6.0          uwot_0.2.2              pkgconfig_2.0.3        ##  [82] gtable_0.3.6            blob_1.2.4              lmtest_0.9-40          ##  [85] XVector_0.46.0          htmltools_0.5.8.1       geneplotter_1.84.0     ##  [88] dotCall64_1.2           scales_1.3.0            png_0.1-8              ##  [91] spatstat.univar_3.1-3   knitr_1.49              rstudioapi_0.17.1      ##  [94] tzdb_0.4.0              reshape2_1.4.4          nlme_3.1-168           ##  [97] cachem_1.1.0            zoo_1.8-12              stringr_1.5.1          ## [100] KernSmooth_2.23-26      vipor_0.4.7             parallel_4.4.2         ## [103] miniUI_0.1.1.1          AnnotationDbi_1.68.0    ggrastr_1.0.2          ## [106] pillar_1.10.1           grid_4.4.2              vctrs_0.6.5            ## [109] RANN_2.6.2              promises_1.3.2          xtable_1.8-4           ## [112] cluster_2.1.8           beeswarm_0.4.0          evaluate_1.0.3         ## [115] readr_2.1.5             locfit_1.5-9.11         cli_3.6.3              ## [118] compiler_4.4.2          rlang_1.1.5             crayon_1.5.3           ## [121] future.apply_1.11.3     labeling_0.4.3          ggbeeswarm_0.7.2       ## [124] plyr_1.8.9              stringi_1.8.4           viridisLite_0.4.2      ## [127] deldir_2.0-4            BiocParallel_1.40.0     munsell_0.5.1          ## [130] Biostrings_2.74.1       lazyeval_0.2.2          spatstat.geom_3.4-1    ## [133] Matrix_1.7-2            RcppHNSW_0.6.0          hms_1.1.3              ## [136] patchwork_1.3.0         bit64_4.6.0-1           future_1.34.0          ## [139] statmod_1.5.0           KEGGREST_1.46.0         shiny_1.10.0           ## [142] ROCR_1.0-11             igraph_2.1.4            memoise_2.0.1          ## [145] bslib_0.9.0             bit_4.5.0.1             cherryblossom_0.1.0