V1

2022/11/18阅读：16主题：嫩青

# 2022-11-18 Consensus and Profile

ROSALIND | Consensus and Profile

## Problem

A matrix is a rectangular table of values divided into rows and columns. An m×nm×n matrix has mm rows and nn columns. Given a matrix AA, we write Ai,jAi,j to indicate the value found at the intersection of row ii and column jj.

Say that we have a collection of DNA strings, all having the same length nn. Their profile matrix is a 4×n4×n matrix PP in which P1,jP1,j represents the number of times that 'A' occurs in the jjth position of one of the strings, P2,jP2,j represents the number of times that C occurs in the jjth position, and so on (see below).

A consensus string cc is a string of length nn formed from our collection by taking the most common symbol at each position; the jjth symbol of cc therefore corresponds to the symbol having the maximum value in the jj-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

A T C C A G C TG G G C A A C TA T G G A T C TDNA StringsA A G C A A C CT T G G A A C TA T G C C A T TA T G G C A C TA 5 1 0 0 5 5 0 0ProfileC 0 0 1 4 2 0 6 1G 1 1 6 3 0 1 0 0T 1 5 0 0 0 1 1 6ConsensusA T G C A A C T

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

## Sample Dataset

``>Rosalind_1ATCCAGCT>Rosalind_2GGGCAACT>Rosalind_3ATGGATCT>Rosalind_4AAGCAACC>Rosalind_5TTGGAACT>Rosalind_6ATGCCATT>Rosalind_7ATGGCACT``

## Sample Output

``ATGCAACTA: 5 1 0 0 5 5 0 0C: 0 0 1 4 2 0 6 1G: 1 1 6 3 0 1 0 0T: 1 5 0 0 0 1 1 6``

### 我的解法

``seqLists = []with open("Consensus and Profile-sample.txt", "r") as fastqs:    # 这里仿照之前读取fasta的写法    buf = fastqs.readline().rstrip() # 先读第一行为 read ID    while buf:        seq = "" # seq置空用于拼接        buf = fastqs.readline().rstrip() # 去除行尾换行符很重要        while not buf.startswith(">") and buf: # 如果下一行不是 以 ">" 开头的readID的话，就一直读            seq += buf # 把序列拼接起来            buf = fastqs.readline().rstrip()        seqLists.append(seq)seqLen = len(seqLists[0])# 这里是借鉴网上的代码，二维数组列表生成式。tableDict = {"A": [0 for row in range(seqLen)],             "C": [0 for row in range(seqLen)],             "G": [0 for row in range(seqLen)],             "T": [0 for row in range(seqLen)]}# 下面也是仿写之前RNA转蛋白质的写法，通过一串有规律的字符串 制作字典comparisonStr = "A 0 C 1 G 2 T 3"comparisonStr = comparisonStr.split()comparisonDict = dict(zip(comparisonStr[1::2], comparisonStr[::2]))# 如果seqlist中的seq在 i 这个位置上出现了，就读取 tableDict[nucleic] 相应的位置 +1for seq in seqLists:    for i, nucleic in enumerate(seq):        if nucleic not in tableDict:            continue        tableDict[nucleic][i] += 1consensusSeq = ""# tableDict中ACGT各个碱基的出现频率已经有了# 接下来比较每个位置碱基出现的次数多少，来得到共识序列for A, C, G, T in zip(tableDict["A"], tableDict["C"], tableDict["G"], tableDict["T"]):    nucleics = [A, C, G, T]    maxNum = max(nucleics)    consensusSeq += comparisonDict[str(nucleics.index(maxNum))]print(consensusSeq)for nucleic, frequency in tableDict.items():    print(f"{nucleic}: " + " ".join([str(i) for i in frequency]))``

python 纯数字list转化为字符串_weixin_45903952的博客-CSDN博客

python 纯数字list转化为字符串 在python中，如果将字符串类型的list转化为一个字符串 用 s=[“one”,“two”,“three”] ““join(s)就可以达到目的，而用 纯数字list s=[1,223,23] print(”,”.join(s)) TypeError: sequence item 0: expected str instance, int found 会出错，这样就得把list中的每个转为字符，如下：

``s=[1,223,23]print(",".join(str(i) for i in s))``

python初始化list列表（1维、2维） - zqifa - 博客园 (cnblogs.com)

multilist = [[0 for col in range(5)] for row in range(6)]

### 大神解答

1

``strands = [x.strip() for x in f.readlines()]matrix = zip(*strands)profile_matrix = map(lambda x: dict((base, x.count(base)) for base in "ACGT"), matrix)consensus = [max(x,key = x.get) for x in profile_matrix]print "".join(consensus);for base in "ACGT":    print base+":",    for x in profile_matrix:        print x[base],    print ""``

2

``def cons(strings):    from collections import Counter    counters = map(Counter, zip(*strings))    consensus = "".join(c.most_common(1)[0][0] for c in counters)    profile_matrix = "\n".join(b + ": " + \        " ".join(str(c[b]) for c in counters) for b in "ACGT")    return consensus + "\n" + profile_matrix``

### 学习

``('A', 'G', 'A', 'A', 'T', 'A', 'A')('T', 'G', 'T', 'A', 'T', 'T', 'T')('C', 'G', 'G', 'G', 'G', 'G', 'G')('C', 'C', 'G', 'C', 'G', 'C', 'G')('A', 'A', 'A', 'A', 'A', 'C', 'C')('G', 'A', 'T', 'A', 'A', 'A', 'A')('C', 'C', 'C', 'C', 'C', 'T', 'C')('T', 'T', 'T', 'C', 'T', 'T', 'T')``

`counters = map(Counter, zip(*strings))` 这段代码counters打出来是：

``Counter({'A': 5, 'G': 1, 'T': 1})Counter({'T': 5, 'G': 1, 'A': 1})Counter({'G': 6, 'C': 1})Counter({'C': 4, 'G': 3})Counter({'A': 5, 'C': 2})Counter({'A': 5, 'G': 1, 'T': 1})Counter({'C': 6, 'T': 1})Counter({'T': 6, 'C': 1})``

`collections.Counter` 的相关操作在网上搜到

Python计数器collections.Counter用法详解_THEAQING的博客-CSDN博客

``#求序列中出现次数最多的元素 from collections import Counter list_01 = [1,9,9,5,0,8,0,9]temp = Counter(list_01) #统计出现次数最多的一个元素print(temp.most_common(1))   #[(9, 3)]  元素“9”出现3次。print(temp.most_common(2)) #[(9, 3), (0, 2)]  统计出现次数最多个两个元素 #没有指定个数，就列出全部print(temp.most_common())  #[(9, 3), (0, 2), (1, 1), (5, 1), (8, 1)]``

`consensus = "".join(c.most_common(1)[0][0] for c in counters)` 所以他这句话数出来的是每个read中出现次数最多的碱基，拼成的序列。

V1