V1

2023/02/27阅读：10主题：姹紫

# 转录组实战02: 计算非冗余外显子长度之和

![生信交流与合作请关注公众号@生信探索]

## 计算非冗余外显子长度

``micromamba activate RNAmicromamba install -c bioconda gtftoolsgtftools -l gene_length.txt ~/DataHub/Genomics/GENCODE/release_42/HS.gencode.v42.annotation.gtf``

Calculate gene lengths. Since a gene may have multiple isoforms, there are multiple ways to calculate gene length based on literature. Three simple ways are considering the mean, median and maximum of the lengths of isoforms as the length of the gene. A fourth way is to calculate the length of merged exons of all isoforms (i.e. non-overlapping exonic length). So, in total, four different types of gene lengths(the mean, median and max of lengths of isoforms of a gene, and the length of merged exons of isoforms of a gene) are provided.

``````gene  mean  median  longest_isoform  merged
ENSG00000287252.3  1546  1323  2751  3266
ENSG00000242268.3  1709  1709  2708  2750
ENSG00000270112.5  1793  1818  4685  10685
ENSG00000280143.1  5326  5326  5326  5326
ENSG00000146083.12  1239  816  4122  5627
ENSG00000158486.15  5952  3404  12567  15023
``````

## 获取基因类型、基因名等信息

``gtftools -g gene_info.txt ~/DataHub/Genomics/GENCODE/release_42/HS.gencode.v42.annotation.gtf``

## 压缩保存

``import pandas as pdg1 = pd.read_csv("gene_info.txt",sep='\t',index_col=4,header=None)g1.columns = ["Chr","Start","Stop","Strand","Symbol","GeneType"]g2 = pd.read_csv("gene_length.txt",sep='\t',index_col=0,usecols=["gene","merged"])g = pd.merge(g1,g2,left_index=True,right_index=True)g.insert(0,column="Ensembl",value=g.index)g.rename(columns={"merged":"Length"},inplace=True)g.sort_values("Chr",inplace=True)g.to_csv("hg38_gene_info_v42.csv.gz",index=False)``

## 给count添加基因信息

``# 现在的目录是~/Project/Human_16_Asthma_Bulkfrom pathlib import Pathimport pandas as pdimport datatable as dtdir="alignment/STAR"count_list = []tpm_list = []paths = Path(dir).glob("*ReadsPerGene.out.tab")for x,y in enumerate(list(paths)):    if x < 1:        Ensembl =  pd.read_csv(y,usecols=[0],sep='\t',skiprows=4,header=None)        Ensembl.rename(columns={0:'Ensembl'},inplace=True)    _count_df = pd.read_csv(y,usecols=[1],sep='\t',skiprows=4,header=None)    count_list.append(_count_df.rename(columns={1:y.name.split('Reads')[0]}))count_df = pd.concat(count_list,axis=1)count_df.insert(0,column='Ensembl',value=Ensembl)count = pd.merge(gene_info,count_df,on="Ensembl")dt.Frame(count).to_csv('star_count.csv.gz')``

``````Ensembl  Chr  Start  Stop  Strand  Symbol  GeneType  Length  SRR1039516  SRR1039522  SRR1039508  SRR1039523  SRR1039511  SRR1039510  SRR1039514  SRR1039521  SRR1039512  SRR1039513  SRR1039519  SRR1039515  SRR1039520  SRR1039509  SRR1039518  SRR1039>
ENSG00000130762.15  1  3454664  3481113  +  ARHGEF16  protein_coding  5324  2  0  1  0  1  1  1  0  2  0  0  0  1  0  1  1
ENSG00000117472.10  1  46175072  46185962  +  TSPAN1  protein_coding  2370  16  2  1  2  2  4  11  1  11  1  17  11  1  0  8  9
ENSG00000227857.2  1  46134530  46139081  +  ENSG00000227857  lncRNA  339  0  1  2  0  0  2  0  0  2  0  0  1  1  0  0  0
ENSG00000233114.2  1  46104949  46105175  +  ENSG00000233114  processed_pseudogene  226  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

``````

V1