chang
2023/01/09阅读:316主题:默认主题
RNA-seq上游数据处理
软件安装
此次需要用到的有
-
质控 fastqc
-
质控过滤 trim-galore
使用
trim_galore
命令 -
RNA序列对比软件 hisat2
-
序列比对 samtools
-
计算表达量(定量工具)subread
这里使用的是
featureCounts
-
转录本定量 salmon
安装代码如下
conda create -n rna
conda activate rna
conda install -y -c bioconda fastqc trim-galore hisat2 subread
conda install -y -c bioconda salmon # salmon-0.14.2
conda install -y -c bioconda samtools # samtools-1.6
# 如果是 STAR-Fusion 看融合基因,RNA编辑位点
# 可能性需要为 STAR-Fusion 专门的设置一个conda小环境,它毕竟特殊。
验证代码如下
fastqc --help 1>/dev/null
trim_galore --help 1>/dev/null ###注意:这里是下划线 不是横线
hisat2 --help 1>/dev/null
featureCounts --help 1>/dev/null
下载样本数据
使用prefetch下载
#下载单个数据
prefetch SRR11618610
#如果同时下载两个或多个,并行即可
prefetch SRR11618610 SRR11618616
#批量全部下载,样本量大时,需谨慎服务器资源
cat SRR_Acc_List.txt |while read id;do (prefetch -X 100G $id &);done
sra数据转化成fastq格式
注意路径的使用
-O输出路径 -e线程数 -A输入文件
*表示任何长度和数值的通配符 “TEXTGROUP”文件夹为测试的输出文件夹
-
批处理方法
ls -d SRR/*sra | while read id;do ( fasterq-dump -O ./TEXTGROUP --split-files -e 2 ./$id --include-technical & );done
-
单独处理的方法
*.sra-> *.fastq
fastq-dump --split-3 ~/SSR/SRR11618621.sra
fastq-dump --split-3 ~/SSR/SRR11618616.sra
(download) tree116@st8-Horsea-12U:~$ fastq-dump --split-3 -O ~/text -A ~/SSR/*.sra
Read 676511 spots for /home/st8/ssd2/tree116/SSR/SRR11618616.sra
Written 676511 spots for /home/st8/ssd2/tree116/SSR/SRR11618616.sra
Read 832516 spots for /home/st8/ssd2/tree116/SSR/SRR11618621.sra
Written 832516 spots for /home/st8/ssd2/tree116/SSR/SRR11618621.sra
(download) tree116@st8-Horsea-12U:~$ ls text
_home_st8_ssd2_tree116_SSR_SRR11618610.sra_1.fastq
_home_st8_ssd2_tree116_SSR_SRR11618610.sra_2.fastq
将fastq文件压缩
ls TEXTGROUP/*fastq |while read id;do ( gzip $id &);done
(rna) [tree116@st8-Horsea-12U 17:01:31 ~ ]$ ls TEXTGROUP/*fastq |while read id;do ( gzip $id &);done
(rna) [tree116@st8-Horsea-12U 17:01:48 ~ ]$ ls TEXTGROUP
SRR11618610_1.fastq.gz SRR11618616_1.fastq.gz SRR11618621_1.fastq.gz
SRR11618610_2.fastq SRR11618616_2.fastq SRR11618621_2.fastq
SRR11618610_2.fastq.gz SRR11618616_2.fastq.gz SRR11618621_2.fastq.gz
SRR11618616_1.fastq SRR11618621_1.fastq
质控
-
使用fastqc进行质控
nohup fastqc -t 6 -o ./ SRR*.fastq.gz >qc.log &
-
使用multiqc对fastqc结果进行整合
nohup multiqc ./*.zip -o ./multiqc > ./multiqc.log &
-
trim_galore对测序结果进行处理(但是运行这个命令时遇到了麻烦)
ls *gz |while read id;do (nohup trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -o ./ $id & );done
Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.1
single-core operation.
Please provide an even number of input files for paired-end FastQ trimming! Aborting ...
Cutadapt version: 4.1
single-core operation.
Please provide an even number of input files for paired-end FastQ trimming! Aborting ...
Cutadapt version: 4.1
single-core operation.
Please provide an even number of input files for paired-end FastQ trimming! Aborting ...
Cutadapt version: 4.1
single-core operation.
Please provide an even number of input files for paired-end FastQ trimming! Aborting ...
Cutadapt version: 4.1
single-core operation.
Cutadapt version: 4.1
Please provide an even number of input files for paired-end FastQ trimming! Aborting ...
single-core operation.
Please provide an even number of input files for paired-end FastQ trimming! Aborting ...
#觉得是双端变单端,输入参数不正确
将--paired
参数去掉,测试单端比对,这次出了结果,说明单次输入文件应为两个
(rna) [tree116@st8-Horsea-12U 11:30:42 ~/SRR/test ]$ ls out -l
total 141468
-rw-rw-r-- 1 tree116 tree116 3002 12月 5 11:30 SRR11618610_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 6371450 12月 5 11:30 SRR11618610_1_trimmed.fq.gz
-rw-rw-r-- 1 tree116 tree116 3026 12月 5 11:30 SRR11618610_2.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 6758028 12月 5 11:30 SRR11618610_2_trimmed.fq.gz
-rw-rw-r-- 1 tree116 tree116 3443 12月 5 11:30 SRR11618616_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 29130386 12月 5 11:30 SRR11618616_1_trimmed.fq.gz
-rw-rw-r-- 1 tree116 tree116 3432 12月 5 11:30 SRR11618616_2.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 30419370 12月 5 11:30 SRR11618616_2_trimmed.fq.gz
-rw-rw-r-- 1 tree116 tree116 3043 12月 5 11:30 SRR11618621_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 35856265 12月 5 11:30 SRR11618621_1_trimmed.fq.gz
-rw-rw-r-- 1 tree116 tree116 3094 12月 5 11:30 SRR11618621_2.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 36291370 12月 5 11:30 SRR11618621_2_trimmed.fq.gz
(rna) [tree116@st8-Horsea-12U 11:55:42 ~/SRR/test/out ]$ cat SRR11618610_1.fastq.gz_trimming_report.txt
SUMMARISING RUN PARAMETERS
==========================
Input filename: SRR11618610_1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 4.1
Number of cores used for trimming: 1
Quality Phred score cutoff: 25
Quality encoding type selected: ASCII+33
Using Nextera adapter for trimming (count: 452). Second best hit was smallRNA (count: 2)
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 3 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 36 bp
Output file will be GZIP compressed
但是样本是双端,修改了批处理命令,参考:https://cloud.tencent.com/developer/article/1703054
for i in SRR11618610 SRR11618616 SRR11618621; do trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired /home/st8/ssd2/tree116/fastqc/${i}_1.fastq.gz /home/st8/ssd2/tree116/fastqc/${i}_2.fastq.gz --gzip -o /home/st8/ssd2/tree116/fastqc/trim_galoreR/; done
( 12月6日更新,群内tree011小伙伴分享了一个更简单的命令
ls *gz |xargs -n 2 -t nohup trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -j 4 -o ./ &
管道符后面的xargs
指定传两个文件名作为参数,默认"_1"和"_2"是排好序的,可以直接使用。-t
参数可以显示即将执行的代码 )
在文件夹内成功输出
(rna) [tree116@st8-Horsea-12U 18:39:09 ~/fastqc ]$ for i in SRR11618610 SRR11618616 SRR11618621; do trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired /home/st8/ssd2/tree116/fastqc/${i}_1.fastq.gz /home/st8/ssd2/tree116/fastqc/${i}_2.fastq.gz --gzip -o /home/st8/ssd2/tree116/fastqc/trim_galoreR/; done
(rna) [tree116@st8-Horsea-12U 18:39:00 ~/fastqc ]$ ls trim_galoreR -ll
total 140612
-rw-rw-r-- 1 tree116 tree116 3022 12月 5 18:37 SRR11618610_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 6293855 12月 5 18:37 SRR11618610_1_val_1.fq.gz
-rw-rw-r-- 1 tree116 tree116 3245 12月 5 18:37 SRR11618610_2.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 6717968 12月 5 18:37 SRR11618610_2_val_2.fq.gz
-rw-rw-r-- 1 tree116 tree116 3462 12月 5 18:37 SRR11618616_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 28780467 12月 5 18:38 SRR11618616_1_val_1.fq.gz
-rw-rw-r-- 1 tree116 tree116 3652 12月 5 18:38 SRR11618616_2.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 30172038 12月 5 18:38 SRR11618616_2_val_2.fq.gz
-rw-rw-r-- 1 tree116 tree116 3062 12月 5 18:38 SRR11618621_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 35755068 12月 5 18:39 SRR11618621_1_val_1.fq.gz
-rw-rw-r-- 1 tree116 tree116 3313 12月 5 18:39 SRR11618621_2.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 36224861 12月 5 18:39 SRR11618621_2_val_2.fq.gz
这里看到了一种用脚本处理的方法 尝试一下 https://www.jianshu.com/p/74a328f4d23a
grep _1.fastq.gz>gz1
grep _2.fastq.gz>gz2
#创建名为config的矩阵,以1,2为两列,以进行分组
paste gz1 gz2>config
(rna) [tree116@st8-Horsea-12U 14:52:10 ~/SRR/test ]$ cat config
SRR11618610_1.fastq.gz SRR11618610_2.fastq.gz
SRR11618616_1.fastq.gz SRR11618616_2.fastq.gz
SRR11618621_1.fastq.gz SRR11618621_2.fastq.gz
ls /bin/sh -al
结果bin/sh指向为dash
而我们的脚本为bash脚本,所以运行时应使用bash
命令运行
#创建脚本文件
vi trim.sh
#以下为.sh文件内容
#!/bin/bash
#运行时也需要使用bash,否则会出现`Bad substitution`的问题
dir=/home/st8/ssd2/tree116/SRR/test/outsh/
cat config |while read id
do
arr=${id}
fq1=${arr[0]}
fq2=${arr[1]}
nohup trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired $fq1 $fq2 -o $dir & > $dir/bashtest.log
done
#运行
bash trim.sh
结果如下
(rna) [tree116@st8-Horsea-12U 16:22:56 ~/SRR/test/outsh ]$ ls -alh
total 94M
drwxrwxrwx 2 tree116 tree116 4.0K 12月 5 16:23 .
drwxrwxr-- 6 tree116 tree116 4.0K 12月 5 16:15 ..
-rw-rw-r-- 1 tree116 tree116 0 12月 5 16:22 bashtest.log
-rw-rw-r-- 1 tree116 tree116 2.9K 12月 5 16:22 SRR11618610_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 6.1M 12月 5 16:22 SRR11618610_1_val_1.fq.gz
-rw-rw-r-- 1 tree116 tree116 3.1K 12月 5 16:22 SRR11618610_2.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 6.5M 12月 5 16:22 SRR11618610_2_val_2.fq.gz
-rw-rw-r-- 1 tree116 tree116 3.3K 12月 5 16:22 SRR11618616_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 28M 12月 5 16:22 SRR11618616_1_trimmed.fq.gz
-rw-rw-r-- 1 tree116 tree116 670 12月 5 16:22 SRR11618616_2.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 13M 12月 5 16:23 SRR11618616_2_trimmed.fq.gz
-rw-rw-r-- 1 tree116 tree116 3.0K 12月 5 16:23 SRR11618621_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 35M 12月 5 16:23 SRR11618621_1_trimmed.fq.gz
-rw-rw-r-- 1 tree116 tree116 670 12月 5 16:23 SRR11618621_2.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 tree116 tree116 6.0M 12月 5 16:23 SRR11618621_2_trimmed.fq.gz
-
使用fastqc重新质控
nohup fastqc -t 6 -o ./ SRR*.fq.gz >qc.log &
这里的格式为fq.gz,上文为fastq.gz
nohup multiqc ./*.zip -o ./multiqc > ./multiqc.log &
质控'
fastp直接质控(双端) 参考: https://zhuanlan.zhihu.com/p/33601691
for i in SRR11618610 SRR11618616 SRR11618621
do
nohup fastp -i ../fastqc/${i}_1.fastq.gz -o ${i}_1.fq.gz -I ../fastqc/${i}_2.fastq.gz -O ${i}_2.fq.gz &
done
(rna) [tree116@st8-Horsea-12U 20:42:42 ~/RNA/fastp ]$ ls -lh
total 1.2G
-rw-rw-r-- 1 tree116 tree116 448K 12月 30 20:42 fastp.html
-rw-rw-r-- 1 tree116 tree116 111K 12月 30 20:42 fastp.json
-rw------- 1 tree116 tree116 3.1K 12月 30 20:42 nohup.out
-rw-rw-r-- 1 tree116 tree116 35M 12月 5 21:10 SRR11618610_1.fq
-rw-rw-r-- 1 tree116 tree116 6.9M 12月 30 20:42 SRR11618610_1.fq.gz
-rw-rw-r-- 1 tree116 tree116 35M 12月 5 21:10 SRR11618610_2.fq
-rw-rw-r-- 1 tree116 tree116 7.2M 12月 30 20:42 SRR11618610_2.fq.gz
-rw-rw-r-- 1 tree116 tree116 169M 12月 5 21:10 SRR11618616_1.fq
-rw-rw-r-- 1 tree116 tree116 32M 12月 30 20:42 SRR11618616_1.fq.gz
-rw-rw-r-- 1 tree116 tree116 169M 12月 5 21:10 SRR11618616_2.fq
-rw-rw-r-- 1 tree116 tree116 33M 12月 30 20:42 SRR11618616_2.fq.gz
-rw-rw-r-- 1 tree116 tree116 218M 12月 5 21:10 SRR11618621_1.fq
-rw-rw-r-- 1 tree116 tree116 41M 12月 30 20:42 SRR11618621_1.fq.gz
-rw-rw-r-- 1 tree116 tree116 218M 12月 5 21:10 SRR11618621_2.fq
-rw-rw-r-- 1 tree116 tree116 41M 12月 30 20:42 SRR11618621_2.fq.gz
-rw-rw-r-- 1 tree116 tree116 218M 12月 5 22:11 SRR11618621_2.fq.gzfor
输出文件nohup.out如下:
Read1 before filtering:
total reads: 137124
total bases: 13712400
Q20 bases: 13189766(96.1886%)
Q30 bases: 12571273(91.6781%)
Read2 before filtering:
total reads: 137124
total bases: 13712400
Q20 bases: 12892522(94.0209%)
Q30 bases: 12030598(87.7352%)
Read1 after filtering:
total reads: 131840
total bases: 13147049
Q20 bases: 12746851(96.956%)
Q30 bases: 12183232(92.6689%)
Read2 after filtering:
total reads: 131840
total bases: 13147049
Q20 bases: 12522143(95.2468%)
Q30 bases: 11724352(89.1786%)
Filtering result:
reads passed filter: 263680
reads failed due to low quality: 10552
reads failed due to too many N: 16
reads failed due to too short: 0
reads with adapter trimmed: 2116
bases trimmed due to adapters: 74124
Duplication rate: 35.6575%
Insert size peak (evaluated by paired-end reads): 129
JSON report: fastp.json
HTML report: fastp.html
fastp -i ../fastqc/SRR11618610_1.fastq.gz -o SRR11618610_1.fq.gz -I ../fastqc/SRR11618610_2.fastq.gz -O SRR11618610_2.fq.gz
fastp v0.23.2, time used: 1 seconds
Read1 before filtering:
total reads: 676511
total bases: 67651100
Q20 bases: 64491682(95.3298%)
Q30 bases: 61523379(90.9422%)
Read2 before filtering:
total reads: 676511
total bases: 67651100
Q20 bases: 63554856(93.945%)
Q30 bases: 59823489(88.4294%)
Read1 after filtering:
total reads: 640856
total bases: 63396915
Q20 bases: 61359387(96.7861%)
Q30 bases: 58807105(92.7602%)
Read2 after filtering:
total reads: 640856
total bases: 63396915
Q20 bases: 60742378(95.8128%)
Q30 bases: 57477207(90.6625%)
Filtering result:
reads passed filter: 1281712
reads failed due to low quality: 71226
reads failed due to too many N: 84
reads failed due to too short: 0
reads with adapter trimmed: 42276
bases trimmed due to adapters: 1384228
Duplication rate: 55.2044%
Insert size peak (evaluated by paired-end reads): 129
JSON report: fastp.json
HTML report: fastp.html
fastp -i ../fastqc/SRR11618616_1.fastq.gz -o SRR11618616_1.fq.gz -I ../fastqc/SRR11618616_2.fastq.gz -O SRR11618616_2.fq.gz
fastp v0.23.2, time used: 3 seconds
Read1 before filtering:
total reads: 832516
total bases: 83251600
Q20 bases: 81395406(97.7704%)
Q30 bases: 78398402(94.1704%)
Read2 before filtering:
total reads: 832516
total bases: 83251600
Q20 bases: 81146923(97.4719%)
Q30 bases: 78000409(93.6924%)
Read1 after filtering:
total reads: 824867
total bases: 82295395
Q20 bases: 80606696(97.948%)
Q30 bases: 77688078(94.4015%)
Read2 after filtering:
total reads: 824867
total bases: 82295395
Q20 bases: 80454675(97.7633%)
Q30 bases: 77395871(94.0464%)
Filtering result:
reads passed filter: 1649734
reads failed due to low quality: 15190
reads failed due to too many N: 108
reads failed due to too short: 0
reads with adapter trimmed: 14158
bases trimmed due to adapters: 382916
Duplication rate: 42.696%
Insert size peak (evaluated by paired-end reads): 169
JSON report: fastp.json
HTML report: fastp.html
fastp -i ../fastqc/SRR11618621_1.fastq.gz -o SRR11618621_1.fq.gz -I ../fastqc/SRR11618621_2.fastq.gz -O SRR11618621_2.fq.gz
fastp v0.23.2, time used: 3 seconds
比对
首先是人类参考基因组和注释文件 在hisat2官网下载 https://daehwankimlab.github.io/hisat2/download/#h-sapiens
这里使用Ch38中的 genome_tran
下载后用FTP拖进服务器
使用hisat2进行比对
for i in 10 16 21
do
hisat2 --dta -t -x ./grch38/genome -1 ./RNA/fastp/SRR116186${i}_1.fq.gz -2 RNA/fastp/SRR116186${i}_2.fq.gz -S ./RNA/hisat2/SRR116186${i}.sam &
done
#结果如下
(rna) [tree116@st8-Horsea-12U 17:13:52 ~/RNA/h2test ]$ ls -lh
total 1.2G
-rw-rw-r-- 1 tree116 tree116 90M 12月 29 17:11 SRR11618610.sam
-rw-rw-r-- 1 tree116 tree116 481M 12月 29 17:11 SRR11618616.sam
-rw-rw-r-- 1 tree116 tree116 616M 12月 29 17:11 SRR11618621.sam
输出如下 overall alignment rate分别为86.48% 66.24% 82.24%
Time loading forward index: 00:00:02
Time loading forward index: 00:00:02
Time loading forward index: 00:00:02
Time loading reference: 00:00:01
Time loading reference: 00:00:01
Time loading reference: 00:00:01
Multiseed full-index search: 00:00:09
131840 reads; of these:
131840 (100.00%) were paired; of these:
31540 (23.92%) aligned concordantly 0 times
97199 (73.72%) aligned concordantly exactly 1 time
3101 (2.35%) aligned concordantly >1 times
----
31540 pairs aligned concordantly 0 times; of these:
2189 (6.94%) aligned discordantly 1 time
----
29351 pairs aligned 0 times concordantly or discordantly; of these:
58702 mates make up the pairs; of these:
35646 (60.72%) aligned 0 times
20843 (35.51%) aligned exactly 1 time
2213 (3.77%) aligned >1 times
86.48% overall alignment rate
Time searching: 00:00:11
Overall time: 00:00:13
Multiseed full-index search: 00:00:44
640856 reads; of these:
640856 (100.00%) were paired; of these:
254355 (39.69%) aligned concordantly 0 times
344661 (53.78%) aligned concordantly exactly 1 time
41840 (6.53%) aligned concordantly >1 times
----
254355 pairs aligned concordantly 0 times; of these:
3482 (1.37%) aligned discordantly 1 time
----
250873 pairs aligned 0 times concordantly or discordantly; of these:
501746 mates make up the pairs; of these:
432723 (86.24%) aligned 0 times
52676 (10.50%) aligned exactly 1 time
16347 (3.26%) aligned >1 times
66.24% overall alignment rate
Time searching: 00:00:45
Overall time: 00:00:47
Multiseed full-index search: 00:00:53
824867 reads; of these:
824867 (100.00%) were paired; of these:
191323 (23.19%) aligned concordantly 0 times
587332 (71.20%) aligned concordantly exactly 1 time
46212 (5.60%) aligned concordantly >1 times
----
191323 pairs aligned concordantly 0 times; of these:
4757 (2.49%) aligned discordantly 1 time
----
186566 pairs aligned 0 times concordantly or discordantly; of these:
373132 mates make up the pairs; of these:
293021 (78.53%) aligned 0 times
66181 (17.74%) aligned exactly 1 time
13930 (3.73%) aligned >1 times
82.24% overall alignment rate
Time searching: 00:00:54
Overall time: 00:00:56
sam 转化成bam
#转换
for i in 10 16 21
do
samtools view SRR116186${i}.sam -b > SRR116186${i}.bam #转换
samtools sort SRR116186${i}.bam -o SRR116186${i}_sorted.bam #排序
samtools index SRR116186${i}_sorted.bam; #索引
done
查看单个结果比对后的总体质量 给出BAM文件的比对结果
samtools flagstat SRR11618610_sorted.bam
(rna) [tree116@st8-Horsea-12U 21:35:48 ~/RNA/fastp ]$ samtools flagstat SRR11618610_sorted.bam
288817 + 0 in total (QC-passed reads + QC-failed reads)
25137 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
253171 + 0 mapped (87.66% : N/A)
263680 + 0 paired in sequencing
131840 + 0 read1
131840 + 0 read2
200600 + 0 properly paired (76.08% : N/A)
207422 + 0 with itself and mate mapped
20612 + 0 singletons (7.82% : N/A)
2340 + 0 with mate mapped to a different chr
1979 + 0 with mate mapped to a different chr (mapQ>=5)
(rna) [tree116@st8-Horsea-12U 21:38:51 ~/RNA/fastp ]$ samtools flagstat SRR11618616_sorted.bam
1610889 + 0 in total (QC-passed reads + QC-failed reads)
329177 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
1178166 + 0 mapped (73.14% : N/A)
1281712 + 0 paired in sequencing
640856 + 0 read1
640856 + 0 read2
773002 + 0 properly paired (60.31% : N/A)
784814 + 0 with itself and mate mapped
64175 + 0 singletons (5.01% : N/A)
6486 + 0 with mate mapped to a different chr
5589 + 0 with mate mapped to a different chr (mapQ>=5)
(rna) [tree116@st8-Horsea-12U 21:43:24 ~/RNA/fastp ]$ samtools flagstat SRR11618621_sorted.bam
1983306 + 0 in total (QC-passed reads + QC-failed reads)
333572 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
1690285 + 0 mapped (85.23% : N/A)
1649734 + 0 paired in sequencing
824867 + 0 read1
824867 + 0 read2
1267088 + 0 properly paired (76.81% : N/A)
1285356 + 0 with itself and mate mapped
71357 + 0 singletons (4.33% : N/A)
6374 + 0 with mate mapped to a different chr
4009 + 0 with mate mapped to a different chr (mapQ>=5)
(rna) [tree116@st8-Horsea-12U 21:43:32 ~/RNA/fastp ]$
samtools sort -n test.bam -o sort_left.bam
-n
表示左
samtools view sort_left.bam | head

samtools view SRR11618621_sorted.bam 1:1234-123456 | head

对BAM文件进行统计分析
①RSeQC 三个文件的统计分析如下图:
bam_stat.py -i SRR11618610_sorted.bam



②使用bed文件作为参考 查看基因组覆盖率
需要提供bed文件,可以直接RSeQC的网站下载(本文统一使用hg38),或者可以用gtf转换 https://sourceforge.net/projects/rseqc/postdownload
2023.1.1日凌晨更新:事实上,此时定量已经成功,且对比率水平不错(如下所示,具体定量细节见下文“ 定量 ”部分),怀疑是bed文件出了问题。(未验证)
Load annotation file gencode.v42.annotation.gtfFeatures : 1643237
Meta-features : 62696
Chromosomes/contigs : 25
Process BAM file SRR11618610 sorted.bam...
Paired-end reads are included
Total alignments : 147824
Successfully assigned alignments : 57279 (38.7%)
Running time : 0.01 minutes
Process BAM file SRR11618616 sorted.bam...
Paired-end reads are included.
Total alignments : 831973
Successfully assigned alignments : 308850 (37.1%)
Running time : 0.01 minutes
Process BAM file SRR11618621 sorted.bam...
Paired-end reads are included.
Total alignments : 1012896
Successfully assigned alignments : 485746 (48.0%)Running time : 0.01 minutes
定量
这里使用featureCounts工具 https://www.bioinfo-scrounger.com/archives/407/
gtf相关阅读: https://mp.weixin.qq.com/s/YHWLcZYeKLEMufUS-TLHVQ
代码举例如下,这里主要讨论不同gtf带来的表达矩阵的结果差异及其原因。
gtf=$HOME/database/GRCh38.106/Homo_sapiens.GRCh38.106.chr.gtf
nohup featureCounts -T 5 -p -t exon -g gene_id -a $gtf \
-o all.id.txt *sorted.bam 1>counts.id.log 2>&1 &
下载参考gtf文件的方法
https://blog.csdn.net/u011262253/article/details/89363809 https://mp.weixin.qq.com/s/O9KZdU9XvqW0_ZWtJM-rnA
各个参数的说明:
-a 输入GTF/GFF基因组注释文件
-p 这个参数是针对paired-end数据
-F 指定-a注释文件的格式,默认是GTF
-g 从注释文件中提取Meta-features信息用于read count,默认是gene_id
-t 跟-g一样的意思,其是默认将exon作为一个feature
-o 输出文件
-T 多线程数
-
首先NCBI的gff文件 GRCh38_latest_genomic.gff
nohup featureCounts -T 5 -p -t exon -g ID -a $gtf -o resulNCBI/all.id.txt *sorted.bam 1> resulNCBI/counts.id.log 2>&1 &
-g 当提供参考的gtf/gff的时候,我们需要提供一个id identifier 来将feature水平的统计汇总为meta-feature水平的统计,默认为“gene_id”,注意!选择gtf/gff中提供的id identifier!!!GFF文件来自NCBI官网,该参数值改为“ID”(见GTF/GFF文件第9列(参考群内同学文章) 结果如下:counts数全部为0,且比对率也为0


-
GENECODE
https://www.gencodegenes.org/human/releases.html gencode.v42.annotation.gtf

Homo_sapiens.GRCh38.86.chr_patch_hapl_scaff.gtf
使用R查看gtf文件

nohup featureCounts -T 5 -p -t exon -g gene_id -a $gtf -o resultHomo/allHomo.id.txt *sorted.bam 1> resultHomo/countsHomo.id.log 2>&1 &

后续数据处理参考: https://blog.csdn.net/leo12354/article/details/107536729
这里可以在终端内将需要的几列单独保留下来(包括gene_id 和 各样本表达量) cut -f 1,7,8,9 allHomo.id.txt |grep -v '^#' > feacturCounts.txt
R语言内处理部分:
counts<-read.table("feacturCounts.txt",sep = "\t",header =T)
counts = subset(counts,SRR11618610_sorted.bam != 0 & SRR11618616_sorted.bam != 0 & SRR11618621_sorted.bam != 0) #去掉表达量为0的行

作者介绍