2022-11-18 Consensus and Profile

ROSALIND | Consensus and Profile

“
Problem

A matrix is a rectangular table of values divided into rows and columns. An m×nm×n matrix has mm rows and nn columns. Given a matrix AA, we write Ai,jAi,j to indicate the value found at the intersection of row ii and column jj.

Say that we have a collection of DNA strings, all having the same length nn. Their profile matrix is a 4×n4×n matrix PP in which P1,jP1,j represents the number of times that 'A' occurs in the jjth position of one of the strings, P2,jP2,j represents the number of times that C occurs in the jjth position, and so on (see below).

A consensus string cc is a string of length nn formed from our collection by taking the most common symbol at each position; the jjth symbol of cc therefore corresponds to the symbol having the maximum value in the jj-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

A T C C A G C TG G G C A A C TA T G G A T C TDNA StringsA A G C A A C CT T G G A A C TA T G C C A T TA T G G C A C TA 5 1 0 0 5 5 0 0ProfileC 0 0 1 4 2 0 6 1G 1 1 6 3 0 1 0 0T 1 5 0 0 0 1 1 6ConsensusA T G C A A C T

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

Sample Dataset
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
Sample Output
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

思考

说实在话，这个题，和我在硕士期间老板提的需求很类似

我当时没弄出来，就毕业了

我看了一下出题时间，是2012年7月，就已经在Rosalind上挂着了

很惭愧，没有好好学习，对不起老板。

这道题如果懂一点pandas 或者的话，可能很快就解决了

试一下不用pandas

我的解法

seqLists = []
with open("Consensus and Profile-sample.txt", "r") as fastqs:
    # 这里仿照之前读取fasta的写法
    buf = fastqs.readline().rstrip() # 先读第一行为 read ID
    while buf:
        seq = "" # seq置空用于拼接
        buf = fastqs.readline().rstrip() # 去除行尾换行符很重要
        while not buf.startswith(">") and buf: # 如果下一行不是 以 ">" 开头的readID的话，就一直读
            seq += buf # 把序列拼接起来
            buf = fastqs.readline().rstrip()
        seqLists.append(seq)

seqLen = len(seqLists[0])

# 这里是借鉴网上的代码，二维数组列表生成式。
tableDict = {"A": [0 for row in range(seqLen)],
             "C": [0 for row in range(seqLen)],
             "G": [0 for row in range(seqLen)],
             "T": [0 for row in range(seqLen)]}

# 下面也是仿写之前RNA转蛋白质的写法，通过一串有规律的字符串 制作字典
comparisonStr = "A 0 C 1 G 2 T 3"
comparisonStr = comparisonStr.split()
comparisonDict = dict(zip(comparisonStr[1::2], comparisonStr[::2]))

# 如果seqlist中的seq在 i 这个位置上出现了，就读取 tableDict[nucleic] 相应的位置 +1
for seq in seqLists:
    for i, nucleic in enumerate(seq):
        if nucleic not in tableDict:
            continue
        tableDict[nucleic][i] += 1

consensusSeq = ""
# tableDict中ACGT各个碱基的出现频率已经有了
# 接下来比较每个位置碱基出现的次数多少，来得到共识序列
for A, C, G, T in zip(tableDict["A"], tableDict["C"], tableDict["G"], tableDict["T"]):
    nucleics = [A, C, G, T]
    maxNum = max(nucleics)
    consensusSeq += comparisonDict[str(nucleics.index(maxNum))]

print(consensusSeq)
for nucleic, frequency in tableDict.items():
    print(f"{nucleic}: " + " ".join([str(i) for i in frequency]))

我的代码主要是对之前的回顾和总结了

分两次写完的，中间耽搁了一下，所以写的有点乱

回答里参考的一些链接

python 纯数字list转化为字符串_weixin_45903952的博客-CSDN博客

“
python 纯数字list转化为字符串在python中，如果将字符串类型的list转化为一个字符串用 s=[“one”,“two”,“three”] ““join(s)就可以达到目的，而用纯数字list s=[1,223,23] print(”,”.join(s)) TypeError: sequence item 0: expected str instance, int found 会出错，这样就得把list中的每个转为字符，如下：
s=[1,223,23]
print(",".join(str(i) for i in s))

python初始化list列表（1维、2维） - zqifa - 博客园 (cnblogs.com)

“
但是如下写法是对的： multilist = [[0] * 5 for row in range(3)] multilist[0][0] = 'Hello' print multilist 我们看输出结果： [['Hello', 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] 恩，没问题。但是，由于使用 * 的方法比较容易引起混淆导致Bug，所以还是推荐使用上面方法4，即：

multilist = [[0 for col in range(5)] for row in range(6)]

期待大神的解答

大神解答

1

strands = [x.strip() for x in f.readlines()]

matrix = zip(*strands)

profile_matrix = map(lambda x: dict((base, x.count(base)) for base in "ACGT"), matrix)

consensus = [max(x,key = x.get) for x in profile_matrix]

print "".join(consensus);

for base in "ACGT":
    print base+":",
    for x in profile_matrix:
        print x[base],
    print ""

2

def cons(strings):
    from collections import Counter
    counters = map(Counter, zip(*strings))
    consensus = "".join(c.most_common(1)[0][0] for c in counters)
    profile_matrix = "\n".join(b + ": " + \
        " ".join(str(c[b]) for c in counters) for b in "ACGT")
    return consensus + "\n" + profile_matrix

学习

第一个的代码是Python2写的，有点点老了，看看第二个

代码 zip(*strings) 的效果是这样

('A', 'G', 'A', 'A', 'T', 'A', 'A')
('T', 'G', 'T', 'A', 'T', 'T', 'T')
('C', 'G', 'G', 'G', 'G', 'G', 'G')
('C', 'C', 'G', 'C', 'G', 'C', 'G')
('A', 'A', 'A', 'A', 'A', 'C', 'C')
('G', 'A', 'T', 'A', 'A', 'A', 'A')
('C', 'C', 'C', 'C', 'C', 'T', 'C')
('T', 'T', 'T', 'C', 'T', 'T', 'T')

counters = map(Counter, zip(*strings)) 这段代码counters打出来是：

Counter({'A': 5, 'G': 1, 'T': 1})
Counter({'T': 5, 'G': 1, 'A': 1})
Counter({'G': 6, 'C': 1})
Counter({'C': 4, 'G': 3})
Counter({'A': 5, 'C': 2})
Counter({'A': 5, 'G': 1, 'T': 1})
Counter({'C': 6, 'T': 1})
Counter({'T': 6, 'C': 1})

collections.Counter 的相关操作在网上搜到

Python计数器collections.Counter用法详解_THEAQING的博客-CSDN博客

其中他自带方法 most_common() 可以输出出现次数最多的元素

“

#求序列中出现次数最多的元素
 
from collections import Counter
 
list_01 = [1,9,9,5,0,8,0,9]
temp = Counter(list_01)
 
#统计出现次数最多的一个元素
print(temp.most_common(1))   #[(9, 3)]  元素“9”出现3次。
print(temp.most_common(2)) #[(9, 3), (0, 2)]  统计出现次数最多个两个元素
 
#没有指定个数，就列出全部
print(temp.most_common())  #[(9, 3), (0, 2), (1, 1), (5, 1), (8, 1)]

consensus = "".join(c.most_common(1)[0][0] for c in counters) 所以他这句话数出来的是每个read中出现次数最多的碱基，拼成的序列。

感觉他这个代码 def cons(strings): 里的strings参数，输入的应该是每一列的碱基？

而不是每条read的碱基，要不然他的consensus不就成了每条read出现最多的碱基序列了？不是很懂

但是这个collections.Counter 的使用可以学习一下。