V1

2022/11/16阅读：83主题：嫩青

# 2022-11-16 【生信】 Rosalind 解题笔记

## 1 Rabbits and Recurrence Relations

ROSALIND | Rabbits and Recurrence Relations

## Problem

A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite. Two examples are the finite sequence (π,−2–√,0,π)(π,−2,0,π) and the infinite sequence of odd numbers (1,3,5,7,9,…)(1,3,5,7,9,…). We use the notation anan to represent the nn-th term of a sequence.

A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci's rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if FnFn represents the number of rabbit pairs alive after the nn-th month, then we obtain the Fibonacci sequence having terms FnFn that are defined by the recurrence relation Fn=Fn−1+Fn−2Fn=Fn−1+Fn−2 (with F1=F2=1F1=F2=1 to initiate the sequence). Although the sequence bears Fibonacci's name, it was known to Indian mathematicians over two millennia ago.

When finding the nn-th term of a sequence defined by a recurrence relation, we can simply use the recurrence relation to generate terms for progressively larger values of nn. This problem introduces us to the computational technique of dynamic programming, which successively builds up solutions by using the answers to smaller cases.

Given: Positive integers n≤40n≤40 and k≤5k≤5.

Return: The total number of rabbit pairs that will be present after nn months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of kk rabbit pairs (instead of only 1 pair).

## Sample Dataset

``5 3``

## Sample Output

``19``

### 答

``def RabiitsRelations(n, k):    if n == 1:        return 1    if n < 1:        return 0    return RabiitsRelations(n-1, k) + RabiitsRelations(n-2, k) * kn = int(input("输入月份数<=40: "))k = int(input("输入一窝产生的兔子数: "))if n in range(41) and k in range(6):    F = RabiitsRelations(n, k)    print(F)``

## 2 Computing GC Content

ROSALIND | Computing GC Content

## Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

## Sample Dataset

``>Rosalind_6404CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG>Rosalind_5959CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC>Rosalind_0808CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT``

## Sample Output

``Rosalind_080860.919540``

### 自己的答案

``resultID, resultGC = [], []fastaDict = {}def GCcontent(seq):    Gcount = seq.count("G")    Ccount = seq.count("C")    return ((Gcount + Ccount) / len(seq)) * 100with open("Computing_GC_Contentrosalind_gc.fasta", 'r') as fasta:    for line in fasta.readlines():        if line.startswith(">"):            id = line.split().strip(">")            fastaDict[id] = ""        else:            fastaDict[id] += line.strip()for id, seq in fastaDict.items():    resultID.append(id)    resultGC.append(GCcontent(seq))index = resultGC.index(max(resultGC))print(f"文件中GC含量最高的序列信息")print(resultID[index])print(resultGC[index])``

### 别人的解法

``f = open('rosalind_gc.txt', 'r')max_gc_name, max_gc_content = '', 0buf = f.readline().rstrip()while buf:    seq_name, seq = buf[1:], ''    buf = f.readline().rstrip()    while not buf.startswith('>') and buf:        seq = seq + buf        buf = f.readline().rstrip()    seq_gc_content = (seq.count('C') + seq.count('G'))/float(len(seq))    if seq_gc_content > max_gc_content:        max_gc_name, max_gc_content = seq_name, seq_gc_contentprint('%s\n%.6f%%' % (max_gc_name, max_gc_content * 100))f.close()``

### 解读

Python rstrip() 删除 string 字符串末尾的指定字符，默认为空白符，包括空格、换行符、回车符、制表符。

``f = open('rosalind_gc.txt', 'r') # 这里打开要读取的文件max_gc_name, max_gc_content = '', 0buf = f.readline().rstrip() # 直接读取第一行，这里他肯定第一行是id，也就是以">"开头while buf:    seq_name, seq = buf[1:], '' # 进while 循环，第二行切片把buf字符串">"后面的字符赋给seq_name，seq 置空字符串    buf = f.readline().rstrip() # f.readline开始读下一行，保持buf为true    while not buf.startswith('>') and buf: # 再来一个while循环        seq = seq + buf        buf = f.readline().rstrip() # 这几句意思就是，当读到下一个id（以">"开头的字符串）之前，所有的字符串 seq = seq + buf 拼起来 成为一个seq        # 这里利用了realine()方法的一个特点，就是每次调用这个方法的时候，就会读取下一行，他没有外层内层循环的影响。        # 比如说里面这个while循环调用readline()的时候，不会暂停外层的readline()。        # 所以当里面的while循环结束后，外层readline()开始读取的就是下一个id（以">"开头的字符串）了。            seq_gc_content = (seq.count('C') + seq.count('G'))/float(len(seq)) # 代码到这就已经把fasta文件读完了，这一行用count()方法计算GC值        # 这个if用的是一个原理，即每当有一个数 seq_gc_content 比 max_gc_content 大的时候，就让 max_gc_content = seq_gc_content    if seq_gc_content > max_gc_content:         max_gc_name, max_gc_content = seq_name, seq_gc_contentprint('%s\n%.6f%%' % (max_gc_name, max_gc_content * 100))f.close() # 记得关闭open的文件很重要``

# 3 Counting Point Mutations

ROSALIND | Counting Point Mutations

## Problem

Figure 2. The Hamming distance between these two strings is 7. Mismatched symbols are colored red.

Given two strings ss and tt of equal length, the Hamming distance between ss and tt, denoted dH(s,t)dH(s,t), is the number of corresponding symbols that differ in ss and tt. See Figure 2.

Given: Two DNA strings ss and tt of equal length (not exceeding 1 kbp).

Return: The Hamming distance dH(s,t)dH(s,t).

## Sample Dataset

``GAGCCTACTAACGGGATCATCGTAATGACGGCCT``

## Sample Output

``7``

### 自己的解法

``string1 = "TACGGCGGATCTAGACAGTAGTAAGGTACAAGACGTTAATTGCGTAAGTTACAGAATGACATAACTCATAGAATCAGGATCTACTTTAAACTAGTCAAGTAGTGAAGAGCCTAGAAAGTGATACATCGCACGAAGCGTAATCCACCCCATAAAGCCACCTATAATCCTGGAGCAGTCGCCCCCCTCGATGATCGATGAACTGGTCTTATTCTTGCCAGCGTCGGCCTCTGCCTGATTAAGGGAGTTATTAATCTGTATGATATCGACACCACTTTGAACGACCGAAAGACGATCGCGCCCCCTATCTTGGGGACCTGCCACACATAATCTCTCTAGGCAGTATATGTGTGCAGCGGTCAAGAACTAAGTACTACTCCGCCCTAGACGGTTAGACCTGTGCCAACGGTAGCATTTCTCGAAGCTACGGTCGATGGACAACGCTTCATAGCGCGTATGAAAAGGTGTGGTTCAGGGCTATAGTTGGTCCCTAGGTTCGCAGCCACCGACGTAGTCCCTTCGTGACTTCACGCTTGTGCGCCAGAACTAGAGGAACTCTTCCTCTAACCCGATAGGCAAGAGTCGTCTCACAAGATCCCCGCGGGATTGTCGTATATCTGTGGAGAACGGCGTACCAACAGGATGACCTCTGTACCAAAAACTGCGGGGCACTGAGCGTAATCTAGGGGGGGGACCTCCATAGCAGCGATAACTTTATGTCATCGGTCTTCGGTGGCTACTATTCGCCGTGTCTTTTCGCATTTTAATACTAAGCCCAAAACCCCCACAACGTACTTCGGCTACGGCACATGTTGTGGGAATTCTTAAACATCGACATGCCTAAGTGTTCGCAGATCACAAACAGCAAGGACATCTCTCATATGATCTCAGTAGGGATTTTCGACGTCGCGAAATGGCATCGGAAACTGACAAGGGAAAGTGTCTGTGAATGTCCAGGCAA"string2 = "TAACGCAGCTCTTGATGGTAGCACTGTGGCGGAGGTTTGGCGGTTCTGGAGGACTCGGGCATGACTTTTACCTGGAAGGTGTACTGACGAGCTTGTATGCAGGCAGGATCGCCTGCAGCTGTGCAATGCACACAGCGTAAATGACAATTTAATTCCTGACATGGGGGCTGGCCCCTCCCCGCTAGTGGTGAACCATAGAATCGCATTAAGGCTGCCACTGTCATCTTCAACCTAAATTAGTGGGATTCGAAGGCTGATGATAACGACTGTCGCTTTGACGAGCGGCTCTCCAGCGGTTCCCAGATGACGGAAGAATGCCACGTACTACCTAACCGCCCTGAAAAGTTCTGCAGGATTCAAGACTTCGTTACCAATTCTGCGCTCAGGACGCTATTTTACCCCTCGGTAACATGTCGCTGAGTGACAGTATCCTCAAAACGCTAAAAATTGTTTATGAAGGGGGGTGGTTCATTGCTTTAGCTGCTCCATAGGAGGGAGTGATGGCCGACATGAAGCTCGATACCTGGACTTTGGGCTGCAGAAGTACGTCAGCGCGTCTACCTACGCTGATGCCCACCGTAATAGTTCTCACACGGCCCCGGGTTACTACATATCTAAGTAATCATGCCCTCTCAAGGCAGAAGATGTTTACTTGAAAAAGCGTTGCTTCCAGAAGATATTACGCGGGTGGAGTACCTAGCATTTGTGAGTTTACCTAATCGTCCGTCGTTGCCAATTTTACCCCGGGTGCTTCCAACTTATTTATCTATCCTCAAACCGTGAACCTTGAATCGTGGATTCTACACGTCCTGGGACAAAGGTTAAATACTCATAGTGCTCAATTCTTATGCATGACAAATAGCAAGAATATATATGATATGTTCTCAGTAGGGAGTCTCCCAAATCGTATATGGGAAGACAAACTGATGTGGACAAAAGACAGTTAAGGAGAAGGCCC"def dh (s1, s2):    distance = 0    for i in range(len(s1)):        if s1[i] == s2[i]:            continue        distance+=1    return distanceprint(dh(string1, string2))``

### 别人的解法

1

``s1 = 'GAGCCTACTAACGGGAT's2 = 'CATCGTAATGACGGCCT'print [ a!=b for (a, b) in zip(s1, s2)].count(True)``

2

``s1 = 'GAGCCTACTAACGGGAT's2 = 'CATCGTAATGACGGCCT'sum([a != b for a, b in zip(s1, s2)])``

### 解析

Python zip() 函数 | 菜鸟教程 (runoob.com)

zip() 函数用于将可迭代的对象作为参数，将对象中对应的元素打包成一个个元组，然后返回由这些元组组成的列表。

## 实例(Python 3.0+)

>>> a = [1,2,3] >>> b = [4,5,6] >>> c = [4,5,6,7,8] >>> zipped = zip(a,b) # 返回一个对象 >>> zipped <zip object at 0x103abc288> >>> list(zipped) # list() 转换为列表 [(1, 4), (2, 5), (3, 6)] >>> list(zip(a,c)) # 元素个数与最短的列表一致 [(1, 4), (2, 5), (3, 6)]

>>> a1, a2 = zip(*zip(a,b)) # 与 zip 相反，zip(*) 可理解为解压，返回二维矩阵式 >>> list(a1) [1, 2, 3] >>> list(a2) [4, 5, 6] >>>

``s1 = 'GAGCCTACTAACGGGAT's2 = 'CATCGTAATGACGGCCT'sum([a != b for a, b in zip(s1, s2)])``

`print([a != b for a, b in zip(s1, s2)])`这行代码可以看到实际上生成的列表是这样的：

``[True, False, True, False, True, False, False, True, False, True, False, False, False, False, True, True, False]``

## 4 Mendel's First Law

ROSALIND | Mendel's First Law

## Problem

Figure 2. The probability of any outcome (leaf) in a probability tree diagram is given by the product of probabilities from the start of the tree to the outcome. For example, the probability that X is blue and Y is blue is equal to (2/5)(1/4), or 1/10.

Probability is the mathematical study of randomly occurring phenomena. We will model such a phenomenon with a random variable, which is simply a variable that can take a number of different distinct outcomes depending on the result of an underlying random process.

For example, say that we have a bag containing 3 red balls and 2 blue balls. If we let XX represent the random variable corresponding to the color of a drawn ball, then the probability of each of the two outcomes is given by

Random variables can be combined to yield new random variables. Returning to the ball example, let YY model the color of a second ball drawn from the bag (without replacing the first ball). The probability of YY being red depends on whether the first ball was red or blue. To represent all outcomes of XX and YY, we therefore use a probability tree diagram. This branching diagram represents all possible individual probabilities for XX and YY, with outcomes at the endpoints ("leaves") of the tree. The probability of any outcome is given by the product of probabilities along the path from the beginning of the tree; see Figure 2 for an illustrative example.

An event is simply a collection of outcomes. Because outcomes are distinct, the probability of an event can be written as the sum of the probabilities of its constituent outcomes. For our colored ball example, let AA be the event "YY is blue." Pr(A)Pr(A) is equal to the sum of the probabilities of two different outcomes: Pr(X=blue and Y=blue)+Pr(X=red and Y=blue)Pr(X=blue and Y=blue)+Pr(X=red and Y=blue), or 310+110=25310+110=25 (see Figure 2 above).

Given: Three positive integers kk, mm, and nn, representing a population containing k+m+nk+m+n organisms: kk individuals are homozygous dominant for a factor, mm are heterozygous, and nn are homozygous recessive.

Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.

## Sample Dataset

``2 2 2``

## Sample Output

``0.78333``

### 提示

Consider simulating inheritance on a number of small test cases in order to check your solution.

### 我的解法

``k = 27m = 18n = 26def acc(num):    return num * (num - 1) / 2  # 等差数列求和公式def MFL(k, m, n):    p1 = acc(k + m + n)  # 相互交配的可能性总数    kSelf = acc(k) / p1  # AA自交的概率    mSelf = acc(m) / p1  # Aa自交的概率    nSelf = acc(n) / p1  # aa自交的概率    km = (m * k) / p1  # AA 与 Aa 相交的概率    kn = (k * n) / p1  # AA 与 aa 相交的概率    mn = (m * n) / p1  # Aa 与 aa 相交的概率    return 1 - nSelf - (mn * 1 / 2) - (mSelf * 1 / 4)print(MFL(k, m, n))``

### 别人的解法

1

``def firstLaw(k,m,n):   N = float(k+m+n)   return(1 - 1/N/(N-1)*(n*(n-1) + n*m + m*(m-1)/4.))``

using the complementary probability of getting no dominant alleles. The terms in the sum correspond to the choice of the mating partners: from the subpopulation with recessive alleles only, one from the mixed and one from recessive only (counted twice, but only 50% of the offspring are homozygous), both from mixed (only 25% of the offspring are homozygous).

2

``def firstLaw(k,m,n):    N = float(k+m+n)    return 1 - ( m*n + .25*m*(m-1) + n*(n-1) ) / ( N*(N-1) )``

## 5 Translating RNA into Protein

ROSALIND | Translating RNA into Protein

## Problem

The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

Given: An RNA string ss corresponding to a strand of mRNA (of length at most 10 kbp).

Return: The protein string encoded by ss.

## Sample Dataset

``AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA``

## Sample Output

``MAMAPRTEINSTRING``

### 我的解法

``CodonDict = {    "F": ["UUU", "UUC"],    "L": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],    "S": ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"],    "Y": ["UAU", "UAC", ],    "Stop": ["UAA", "UAG", "UGA"],    "C": ["UGU", "UGC", ],    "W": ["UGG"],    "P": ["CCU", "CCC", "CCA", "CCG"],    "H": ["CAU", "CAC", ],    "Q": ["CAA", "CAG", ],    "R": ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],    "I": ["AUU", "AUC", "AUA", ],    "M": ["AUG"],    "T": ["ACU", "ACC", "ACA", "ACG"],    "N": ["AAU", "AAC"],    "K": ["AAA", "AAG"],    "V": ["GUU", "GUC", "GUA", "GUG"],    "A": ["GCU", "GCC", "GCA", "GCG"],    "D": ["GAU", "GAC"],    "E": ["GAA", "GAG"],    "G": ["GGU", "GGC", "GGA", "GGG"]    }seq = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"nucleics = ""import re# 用正则表达式，re 中的 findall方法分割字符串seq_codons = re.findall('.{3}', seq.upper())for codon in seq_codons:    for nucleic, codons in CodonDict.items():        if codon not in codons:            continue        nucleics += nucleicprint(nucleics.strip("Stop"))``

### 别人的解法

``string = """UUU F      CUU L      AUU I      GUU VUUC F      CUC L      AUC I      GUC VUUA L      CUA L      AUA I      GUA VUUG L      CUG L      AUG M      GUG VUCU S      CCU P      ACU T      GCU AUCC S      CCC P      ACC T      GCC AUCA S      CCA P      ACA T      GCA AUCG S      CCG P      ACG T      GCG AUAU Y      CAU H      AAU N      GAU DUAC Y      CAC H      AAC N      GAC DUAA Stop   CAA Q      AAA K      GAA EUAG Stop   CAG Q      AAG K      GAG EUGU C      CGU R      AGU S      GGU GUGC C      CGC R      AGC S      GGC GUGA Stop   CGA R      AGA R      GGA GUGG W      CGG R      AGG R      GGG G"""coded = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"decoded = ''traL =  string.split()traDict = dict(zip(traL[0::2], traL[1::2]))for i in range(0, len(coded)-3, 3):    decoded += traDict[coded[i:i+3]]print decoded``

### 尝试理解

`dict(zip(traL[0::2], traL[1::2]))` 这句话直接就把上面那一长串字符串做成了字典。

`traL[0::2]` 作为key

`traL[1::2]` 作为value

`for i in range(0, len(coded)-3, 3): decoded += traDict[coded[i:i+3]]`

I arrive on these threads a proud papa, thrilled that I have cracked the problem. And then I'm floored by these answers that get to the same result in half the lines (and elegantly, to boot) :)

Python

V1