
生信探索
2023/03/25
Converting Gene Symbol Aliases to the Current Symbol for All Species
<For bioinformatics discussion and collaboration, follow the WeChat public account @生信探索>
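Motivation
During data analysis, a gene of interest is often missing from the matrix. The likely causes are that the gene was not detected, or that the matrix uses an outdated Symbol. We therefore need the mapping between old Symbols (Alias) and the latest Symbol (Current Symbol).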
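bq.tl.current_symbol updates the Symbols in an (expression) matrix to the latest version. It takes three arguments:
- First argument: a data frame whose index is the Symbol
- Second argument: the path to the Symbol-to-Alias mapping file
- Third argument: the species tax_id, for example 9606 for human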
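To make the idea concrete, here is a minimal pandas sketch of what such a conversion could look like. It is not the bioquest implementation: `update_symbols` is a hypothetical helper, the gene names and IDs are made up, and the choices to skip ambiguous aliases and to record the old name in an `Alias` column are assumptions.

```python
import pandas as pd


def update_symbols(frame, mapping, tax_id):
    """Hypothetical helper: rename alias index entries to the current Symbol."""
    ref = mapping[mapping["tax_id"] == tax_id]
    current = set(ref["Symbol"])
    # Use only non-empty aliases that are not themselves a current Symbol.
    ref = ref[(ref["Alias"] != "") & ~ref["Alias"].isin(current)]
    # Drop aliases that point to more than one Symbol (ambiguous).
    ref = ref.drop_duplicates(subset=["Alias", "Symbol"])
    ref = ref[~ref["Alias"].duplicated(keep=False)]
    alias_to_symbol = dict(zip(ref["Alias"], ref["Symbol"]))

    old = pd.Series(frame.index.to_numpy(), dtype="object")
    new = old.map(alias_to_symbol).fillna(old)
    out = frame.copy()
    out["Alias"] = old.where(new != old).to_numpy()  # NaN when unchanged
    out.index = pd.Index(new.to_numpy(), name=frame.index.name)
    return out


# Toy mapping with made-up gene names and IDs.
mapping = pd.DataFrame({
    "tax_id": [9606, 9606, 9606, 10090],
    "GeneID": [1, 1, 2, 3],
    "Symbol": ["NEWA", "NEWA", "GENEB", "Newa"],
    "Alias": ["OLDA", "OLDA2", "", "Olda"],
})
expr = pd.DataFrame(
    {"value": [1.0, 2.0]},
    index=pd.Index(["OLDA", "GENEB"], name="Gene Symbol"),
)
updated = update_symbols(expr, mapping, tax_id=9606)
print(updated)
```

Rows whose Symbol is already current keep their name and get a missing value in `Alias`; only rows matched through an alias are renamed.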
To obtain SymbolAlias_20230317.feather, send an email to victor@bioquest.cn.
Alternatively, build it yourself: download the latest gene information from NCBI at https://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz
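The gene_info file covers every species, so it is large. If only one species is needed, the file could be streamed in chunks and filtered on the fly instead of being loaded whole. The sketch below assumes this use case; `read_gene_info` and its `chunksize` default are hypothetical and not part of bioquest.

```python
import pandas as pd

COLUMNS = ["#tax_id", "GeneID", "Symbol", "Synonyms"]


def read_gene_info(path, tax_id, chunksize=1_000_000):
    """Hypothetical helper: stream gene_info and keep rows for one species."""
    kept = []
    reader = pd.read_csv(
        path, sep="\t", usecols=COLUMNS, chunksize=chunksize, compression="gzip"
    )
    for chunk in reader:
        # Keep only the rows that belong to the requested species.
        kept.append(chunk[chunk["#tax_id"] == tax_id])
    return pd.concat(kept, ignore_index=True)


# Example call (file name as used in this post):
# human = read_gene_info("gene_info_20230317.gz", tax_id=9606)
```

This keeps peak memory close to one chunk plus the rows retained, at the cost of producing a mapping for a single species only.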
import numpy as np
import pandas as pd
import bioquest as bq
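Build the Symbol-to-Alias mapping
# gene_info_20230317.gz is presumably the downloaded gene_info.gz, renamed with its date
# Read only the four columns needed
g = pd.read_csv("gene_info_20230317.gz", sep='\t', usecols=['#tax_id','GeneID','Symbol','Synonyms'])
g.rename(columns={"#tax_id": "tax_id"}, inplace=True)
# Synonyms are '|'-separated; split them and emit one row per alias
g["Alias"] = g.Synonyms.str.split('|')
g = g.explode("Alias")
g = bq.tl.select(g, columns=["tax_id","GeneID","Symbol","Alias"])
g.reset_index(drop=True, inplace=True)
# gene_info writes '-' when a gene has no synonyms; store that as an empty string
g.replace({'Alias': {'-': ''}}, inplace=True)
g.to_feather("SymbolAlias_20230317.feather", compression='zstd', compression_level=1)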
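To see what the split and explode steps do, here is the same reshaping on a two-gene toy table, using plain column selection in place of bq.tl.select. The gene names and IDs are made up for illustration.

```python
import pandas as pd

# Toy input in the shape of gene_info: one row per gene, synonyms joined by '|'.
toy = pd.DataFrame({
    "tax_id": [9606, 9606],
    "GeneID": [1, 2],  # made-up IDs
    "Symbol": ["GENE1", "GENE2"],
    "Synonyms": ["OLD1A|OLD1B", "-"],  # '-' means "no synonyms"
})

toy["Alias"] = toy["Synonyms"].str.split("|")  # each cell becomes a list
long_table = toy.explode("Alias")              # one row per list element
long_table = long_table[["tax_id", "GeneID", "Symbol", "Alias"]].reset_index(drop=True)
long_table = long_table.replace({"Alias": {"-": ""}})
print(long_table)
```

GENE1 ends up with two rows, one per alias, while GENE2 keeps a single row with an empty Alias.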
           tax_id    GeneID       Symbol Alias
0               7   5692769     NEWENTRY
1               9   2827857     NEWENTRY
2              11  10823747     NEWENTRY
3              14   6951813     NEWENTRY
4              19   3758873     NEWENTRY
...           ...       ...          ...   ...
44205723  3032134  60460443          ND6
44205724  3032134  60460444          ND1
44205725  3032134  60460445  I9997_mgr02
44205726  3032134  60460446  I9997_mgt22
44205727  3032134  60460447  I9997_mgr01
[44205728 rows x 4 columns]
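One property of this mapping worth checking before any conversion: within a species, the same alias may be listed for more than one gene, in which case it cannot be resolved to a single current Symbol. The sketch below lists such aliases on toy data; `ambiguous_aliases` is a hypothetical helper and the names are made up.

```python
import pandas as pd


def ambiguous_aliases(mapping, tax_id):
    """Hypothetical helper: aliases that map to more than one Symbol."""
    ref = mapping[(mapping["tax_id"] == tax_id) & (mapping["Alias"] != "")]
    n_symbols = ref.groupby("Alias")["Symbol"].nunique()
    return sorted(n_symbols[n_symbols > 1].index)


toy = pd.DataFrame({
    "tax_id": [9606, 9606, 9606, 10090],
    "Symbol": ["GENE1", "GENE2", "GENE3", "Gene4"],
    "Alias": ["SHARED", "SHARED", "UNIQUE", "SHARED"],
})
print(ambiguous_aliases(toy, 9606))
```

How to handle such aliases (skip them, or pick one Symbol by some rule) is a design decision; this post does not say how bq.tl.current_symbol treats them.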
Usage example
- Example data
df = pd.read_csv("BLCA.csv",index_col="Gene Symbol")
# Gene Name Species
# Gene Symbol
# ATP2B1 ATPase, Ca++ transporting, plasma membrane 1 Homo sapiens
# MYL6 myosin, light chain 6, alkali, smooth muscle a... Homo sapiens
# RPS16 ribosomal protein S16 Homo sapiens
# HIST1H2BA histone cluster 1, H2ba Homo sapiens
# H2AFY2 H2A histone family, member Y2 Homo sapiens
# ... ... ...
# UBB ubiquitin B Homo sapiens
# PYGB phosphorylase, glycogen; brain Homo sapiens
# HLA-A major histocompatibility complex, class I, A Homo sapiens
# HSPA1A heat shock 70kDa protein 1A Homo sapiens
# HSP90AB1 heat shock protein 90kDa alpha (cytosolic), cl... Homo sapiens
- Conversion
bq.tl.current_symbol(frame=df,reference="SymbolAlias_20230317.feather", tax_id=9606)
# Gene Name Species \
# H2BC1 histone cluster 1, H2ba Homo sapiens
# MACROH2A2 H2A histone family, member Y2 Homo sapiens
# H3-3B H3 histone, family 3B (H3.3B) Homo sapiens
# H1-5 histone cluster 1, H1b Homo sapiens
# DARS1 aspartyl-tRNA synthetase Homo sapiens
# ... ... ...
# UBB ubiquitin B Homo sapiens
# PYGB phosphorylase, glycogen; brain Homo sapiens
# HLA-A major histocompatibility complex, class I, A Homo sapiens
# HSPA1A heat shock 70kDa protein 1A Homo sapiens
# HSP90AB1 heat shock protein 90kDa alpha (cytosolic), cl... Homo sapiens
# Alias
# H2BC1 HIST1H2BA
# MACROH2A2 H2AFY2
# H3-3B H3F3B
# H1-5 HIST1H1B
# DARS1 DARS
# ... ...
# UBB NaN
# PYGB NaN
# HLA-A NaN
# HSPA1A NaN
# HSP90AB1 NaN
# [378 rows x 3 columns]