jamesbang

V1

2022/10/16阅读：41主题：雁栖湖

# 🤩 scRNA-seq | 吐血整理的单细胞入门教程（差异分析）（十三）

## 2用到的包

``rm(list = ls())library(scRNA.seq.funcs)library(edgeR)#library(monocle)library(MAST)library(ROCR)``

## 3示例数据

1️⃣ 由于数据量比较大，这里我们只比较一下`NA19101``NA19239`的差异基因。🤨

``molecules <- read.table("./molecules.txt", sep = "\t")anno <- read.table("./annotation.txt", sep = "\t", header = TRUE)keep <- anno[,1] == "NA19101" | anno[,1] == "NA19239"data <- molecules[,keep]group <- anno[keep,1]batch <- anno[keep,4]# 过滤一下gkeep <- rowSums(data > 0) > 5;counts <- data[gkeep,]# Library size normalizationlib_size = colSums(counts)norm <- t(t(counts)/lib_size * median(lib_size)) # Variant of CPM for datasets with library sizes of fewer than 1 mil molecules``

2️⃣ 这里我们再提供一个事先计算好的差异基因数据，方便后面对不同方法的评估，`TPs`为真实的`差异基因``TNs`为真实的`无差异基因`

``DE <- read.table("./TPs.txt")notDE <- read.table("./TNs.txt")GroundTruth <- list(    DE = as.character(unlist(DE)),     notDE = as.character(unlist(notDE)))``

## 4Kolmogorov-Smirnov Test

### 4.1 差异分析

``pVals <- apply(    norm, 1, function(x) {        ks.test(            x[group == "NA19101"],             x[group == "NA19239"]        )\$p.value    })# 校正pVals <- p.adjust(pVals, method = "fdr")``

### 4.2 准确性评估

``sigDE <- names(pVals)[pVals < 0.05]length(sigDE) ``

``sum(GroundTruth\$DE %in% sigDE) ``

``sum(GroundTruth\$notDE %in% sigDE)``

### 4.3 准确性可视化

``DE_Quality_AUC <- function(pVals) {    pVals <- pVals[names(pVals) %in% GroundTruth\$DE |                    names(pVals) %in% GroundTruth\$notDE]    truth <- rep(1, times = length(pVals));    truth[names(pVals) %in% GroundTruth\$DE] = 0;    pred <- ROCR::prediction(pVals, truth)    perf <- ROCR::performance(pred, "tpr", "fpr")    ROCR::plot(perf)    aucObj <- ROCR::performance(pred, "auc")    return(aucObj@y.values[[1]])}``

``DE_Quality_AUC(pVals)``

`AUC = 0.7954796`

## 5Wilcox/Mann-Whitney-U Test

`Wilcox-rank-sum test`也是一种非参数检验，看一下它的表现吧。

``pVals <- apply(    norm, 1, function(x) {        wilcox.test(            x[group == "NA19101"],             x[group == "NA19239"]        )\$p.value    })# multiple testing correctionpVals <- p.adjust(pVals, method = "fdr")DE_Quality_AUC(pVals)``

`AUC = 0.8320326`，表现要比`KS-test`好一些。

## 6edgeR

`edgeR`基于广义线性模型(`generalized linear model`, `glm`)进行差异基因分析，而且还可以在模型中纳入多种影响因素，比如`batch`等等。

``dge <- DGEList(    counts = counts,     norm.factors = rep(1, length(counts[1,])),     group = group)group_edgeR <- factor(group)design <- model.matrix(~ group_edgeR)dge <- estimateDisp(dge, design = design, trend.method = "none")fit <- glmFit(dge, design)res <- glmLRT(fit)pVals <- res\$table[,4]names(pVals) <- rownames(res\$table)pVals <- p.adjust(pVals, method = "fdr")DE_Quality_AUC(pVals)``

`AUC = 0.8466764`，表现更好一些了。🤩

## 7MAST

``log_counts <- log(counts + 1) / log(2)fData <- data.frame(names = rownames(log_counts))rownames(fData) <- rownames(log_counts);cData <- data.frame(cond = group)rownames(cData) <- colnames(log_counts)obj <- FromMatrix(as.matrix(log_counts), cData, fData)colData(obj)\$cngeneson <- scale(colSums(assay(obj) > 0))cond <- factor(colData(obj)\$cond)# Model expression as function of condition & number of detected geneszlmCond <- zlm(~ cond + cngeneson, obj) summaryCond <- summary(zlmCond, doLRT = "condNA19239")summaryDt <- summaryCond\$datatablesummaryDt <- as.data.frame(summaryDt)pVals <- unlist(summaryDt[summaryDt\$component == "H",4]) # H = hurdle modelnames(pVals) <- unlist(summaryDt[summaryDt\$component == "H",1])pVals <- p.adjust(pVals, method = "fdr")DE_Quality_AUC(pVals)``

`AUC = 0.8284046`

## 8其他方法

### 8.1 BPSC

``library(BPSC)bpsc_data <- norm[,batch=="NA19101.r1" | batch=="NA19239.r1"]bpsc_group = group[batch=="NA19101.r1" | batch=="NA19239.r1"]control_cells <- which(bpsc_group == "NA19101")design <- model.matrix(~bpsc_group)coef=2 # group labelres=BPglm(data=bpsc_data, controlIds=control_cells, design=design, coef=coef,                 estIntPar=FALSE, useParallel = FALSE)pVals = res\$PVALpVals <- p.adjust(pVals, method = "fdr")DE_Quality_AUC(pVals)``

### 8.2 SCDE

``library(scde)cnts <- apply(    counts,    2,    function(x) {        storage.mode(x) <- 'integer'        return(x)    })names(group) <- 1:length(group)colnames(cnts) <- 1:length(group)o.ifm <- scde::scde.error.models(    counts = cnts,    groups = group,    n.cores = 1,    threshold.segmentation = TRUE,    save.crossfit.plots = FALSE,    save.model.plots = FALSE,    verbose = 0,    min.size.entries = 2)priors <- scde::scde.expression.prior(    models = o.ifm,    counts = cnts,    length.out = 400,    show.plot = FALSE)resSCDE <- scde::scde.expression.difference(    o.ifm,    cnts,    priors,    groups = group,    n.randomizations = 100,    n.cores = 1,    verbose = 0)# Convert Z-scores into 2-tailed p-valuespVals <- pnorm(abs(resSCDE\$cZ), lower.tail = FALSE) * 2DE_Quality_AUC(pVals)``

##### jamesbang
V1

wx🔍: Grassssss 卷起来了