李拖沓

V1

2022/11/16阅读:40主题:嫩青

2022-11-16 【生信】 Rosalind 解题笔记

1 Rabbits and Recurrence Relations

ROSALIND | Rabbits and Recurrence Relations

Problem

A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite. Two examples are the finite sequence (π,−2–√,0,π)(π,−2,0,π) and the infinite sequence of odd numbers (1,3,5,7,9,…)(1,3,5,7,9,…). We use the notation anan to represent the nn-th term of a sequence.

A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci's rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if FnFn represents the number of rabbit pairs alive after the nn-th month, then we obtain the Fibonacci sequence having terms FnFn that are defined by the recurrence relation Fn=Fn−1+Fn−2Fn=Fn−1+Fn−2 (with F1=F2=1F1=F2=1 to initiate the sequence). Although the sequence bears Fibonacci's name, it was known to Indian mathematicians over two millennia ago.

When finding the nn-th term of a sequence defined by a recurrence relation, we can simply use the recurrence relation to generate terms for progressively larger values of nn. This problem introduces us to the computational technique of dynamic programming, which successively builds up solutions by using the answers to smaller cases.

Given: Positive integers n≤40n≤40 and k≤5k≤5.

Return: The total number of rabbit pairs that will be present after nn months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of kk rabbit pairs (instead of only 1 pair).

Sample Dataset

5 3

Sample Output

19

题干翻译

问题

序列是一个有序的对象(通常是数字)的集合,允许重复。序列可以是有限的或无限的。两个例子是有限序列(π,-2-√,0,π)和奇数的无限序列(1,3,5,7,9,...)。我们用符号an来表示一个序列的第n项。

递归关系是一种定义序列中与前几项数值有关的项的方式。在前面介绍的斐波那契兔子的例子中,任何给定的月份都包含前一个月活着的兔子,加上任何新的后代。一个关键的观察是,任何一个月的后代数量都等于前两个月活着的兔子数量。因此,如果Fn代表第n个月后活着的兔子对的数量,那么我们得到斐波那契序列,其条款Fn由递归关系Fn=Fn-1+Fn-2定义(F1=F2=1以启动该序列)。虽然这个序列以斐波那契的名字命名,但它在两千多年前就为印度数学家所知。

在寻找由递归关系定义的序列的第n项时,我们可以简单地使用递归关系来生成n值逐渐变大的项。这个问题向我们介绍了动态编程的计算技术,它通过使用较小情况的答案来逐步建立起解决方案。

给定。正整数n≤40,k≤5。

返回。如果我们从1对兔子开始,并且在每一代中,每对繁殖年龄的兔子产生一窝k对兔子(而不是只有1对),那么n个月后将出现的兔子总数。

理解

这里面有个隐藏的条件,需要点开expand才能看到:就是每对兔子需要到第二个月才会开始繁殖。

def RabiitsRelations(n, k):
    if n == 1:
        return 1
    if n < 1:
        return 0
    return RabiitsRelations(n-1, k) + RabiitsRelations(n-2, k) * k

n = int(input("输入月份数<=40: "))
k = int(input("输入一窝产生的兔子数: "))

if n in range(41and k in range(6):
    F = RabiitsRelations(n, k)
    print(F)

Output

输入 36 5

输出 2442624980786496

2 Computing GC Content

ROSALIND | Computing GC Content

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Sample Dataset

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808
60.919540

自己的答案

resultID, resultGC = [], []
fastaDict = {}
def GCcontent(seq):
    Gcount = seq.count("G")
    Ccount = seq.count("C")
    return ((Gcount + Ccount) / len(seq)) * 100


with open("Computing_GC_Contentrosalind_gc.fasta"'r'as fasta:
    for line in fasta.readlines():
        if line.startswith(">"):
            id = line.split()[0].strip(">")
            fastaDict[id] = ""
        else:
            fastaDict[id] += line.strip()

for id, seq in fastaDict.items():
    resultID.append(id)
    resultGC.append(GCcontent(seq))


index = resultGC.index(max(resultGC))
print(f"文件中GC含量最高的序列信息")
print(resultID[index])
print(resultGC[index])

看上去比较复杂,不够优雅。

别人的解法

f = open('rosalind_gc.txt''r')

max_gc_name, max_gc_content = ''0

buf = f.readline().rstrip()
while buf:
    seq_name, seq = buf[1:], ''
    buf = f.readline().rstrip()
    while not buf.startswith('>'and buf:
        seq = seq + buf
        buf = f.readline().rstrip()
    seq_gc_content = (seq.count('C') + seq.count('G'))/float(len(seq))
    if seq_gc_content > max_gc_content:
        max_gc_name, max_gc_content = seq_name, seq_gc_content

print('%s\n%.6f%%' % (max_gc_name, max_gc_content * 100))
f.close()

解读

Python rstrip() 删除 string 字符串末尾的指定字符,默认为空白符,包括空格、换行符、回车符、制表符。

其实我一直不知道怎么读取这种 id 下面序列中间带换行符的文件,所以我是用的字典。但这个解法正好给了一个看上去很简洁的写法。

f = open('rosalind_gc.txt''r'# 这里打开要读取的文件

max_gc_name, max_gc_content = ''0

buf = f.readline().rstrip() # 直接读取第一行,这里他肯定第一行是id,也就是以">"开头
while buf:
    seq_name, seq = buf[1:], '' # 进while 循环,第二行切片把buf字符串">"后面的字符赋给seq_name,seq 置空字符串
    buf = f.readline().rstrip() # f.readline开始读下一行,保持buf为true
    while not buf.startswith('>'and buf: # 再来一个while循环
        seq = seq + buf
        buf = f.readline().rstrip() # 这几句意思就是,当读到下一个id(以">"开头的字符串)之前,所有的字符串 seq = seq + buf 拼起来 成为一个seq
        # 这里利用了realine()方法的一个特点,就是每次调用这个方法的时候,就会读取下一行,他没有外层内层循环的影响。
        # 比如说里面这个while循环调用readline()的时候,不会暂停外层的readline()。
        # 所以当里面的while循环结束后,外层readline()开始读取的就是下一个id(以">"开头的字符串)了。
        
    seq_gc_content = (seq.count('C') + seq.count('G'))/float(len(seq)) # 代码到这就已经把fasta文件读完了,这一行用count()方法计算GC值
    
    # 这个if用的是一个原理,即每当有一个数 seq_gc_content 比 max_gc_content 大的时候,就让 max_gc_content = seq_gc_content
    if seq_gc_content > max_gc_content: 
        max_gc_name, max_gc_content = seq_name, seq_gc_content

print('%s\n%.6f%%' % (max_gc_name, max_gc_content * 100))
f.close() # 记得关闭open的文件很重要

3 Counting Point Mutations

ROSALIND | Counting Point Mutations

Problem

img
img

Figure 2. The Hamming distance between these two strings is 7. Mismatched symbols are colored red.

Given two strings ss and tt of equal length, the Hamming distance between ss and tt, denoted dH(s,t)dH(s,t), is the number of corresponding symbols that differ in ss and tt. See Figure 2.

Given: Two DNA strings ss and tt of equal length (not exceeding 1 kbp).

Return: The Hamming distance dH(s,t)dH(s,t).

Sample Dataset

GAGCCTACTAACGGGAT
CATCGTAATGACGGCCT

Sample Output

7

自己的解法

string1 = "TACGGCGGATCTAGACAGTAGTAAGGTACAAGACGTTAATTGCGTAAGTTACAGAATGACATAACTCATAGAATCAGGATCTACTTTAAACTAGTCAAGTAGTGAAGAGCCTAGAAAGTGATACATCGCACGAAGCGTAATCCACCCCATAAAGCCACCTATAATCCTGGAGCAGTCGCCCCCCTCGATGATCGATGAACTGGTCTTATTCTTGCCAGCGTCGGCCTCTGCCTGATTAAGGGAGTTATTAATCTGTATGATATCGACACCACTTTGAACGACCGAAAGACGATCGCGCCCCCTATCTTGGGGACCTGCCACACATAATCTCTCTAGGCAGTATATGTGTGCAGCGGTCAAGAACTAAGTACTACTCCGCCCTAGACGGTTAGACCTGTGCCAACGGTAGCATTTCTCGAAGCTACGGTCGATGGACAACGCTTCATAGCGCGTATGAAAAGGTGTGGTTCAGGGCTATAGTTGGTCCCTAGGTTCGCAGCCACCGACGTAGTCCCTTCGTGACTTCACGCTTGTGCGCCAGAACTAGAGGAACTCTTCCTCTAACCCGATAGGCAAGAGTCGTCTCACAAGATCCCCGCGGGATTGTCGTATATCTGTGGAGAACGGCGTACCAACAGGATGACCTCTGTACCAAAAACTGCGGGGCACTGAGCGTAATCTAGGGGGGGGACCTCCATAGCAGCGATAACTTTATGTCATCGGTCTTCGGTGGCTACTATTCGCCGTGTCTTTTCGCATTTTAATACTAAGCCCAAAACCCCCACAACGTACTTCGGCTACGGCACATGTTGTGGGAATTCTTAAACATCGACATGCCTAAGTGTTCGCAGATCACAAACAGCAAGGACATCTCTCATATGATCTCAGTAGGGATTTTCGACGTCGCGAAATGGCATCGGAAACTGACAAGGGAAAGTGTCTGTGAATGTCCAGGCAA"
string2 = "TAACGCAGCTCTTGATGGTAGCACTGTGGCGGAGGTTTGGCGGTTCTGGAGGACTCGGGCATGACTTTTACCTGGAAGGTGTACTGACGAGCTTGTATGCAGGCAGGATCGCCTGCAGCTGTGCAATGCACACAGCGTAAATGACAATTTAATTCCTGACATGGGGGCTGGCCCCTCCCCGCTAGTGGTGAACCATAGAATCGCATTAAGGCTGCCACTGTCATCTTCAACCTAAATTAGTGGGATTCGAAGGCTGATGATAACGACTGTCGCTTTGACGAGCGGCTCTCCAGCGGTTCCCAGATGACGGAAGAATGCCACGTACTACCTAACCGCCCTGAAAAGTTCTGCAGGATTCAAGACTTCGTTACCAATTCTGCGCTCAGGACGCTATTTTACCCCTCGGTAACATGTCGCTGAGTGACAGTATCCTCAAAACGCTAAAAATTGTTTATGAAGGGGGGTGGTTCATTGCTTTAGCTGCTCCATAGGAGGGAGTGATGGCCGACATGAAGCTCGATACCTGGACTTTGGGCTGCAGAAGTACGTCAGCGCGTCTACCTACGCTGATGCCCACCGTAATAGTTCTCACACGGCCCCGGGTTACTACATATCTAAGTAATCATGCCCTCTCAAGGCAGAAGATGTTTACTTGAAAAAGCGTTGCTTCCAGAAGATATTACGCGGGTGGAGTACCTAGCATTTGTGAGTTTACCTAATCGTCCGTCGTTGCCAATTTTACCCCGGGTGCTTCCAACTTATTTATCTATCCTCAAACCGTGAACCTTGAATCGTGGATTCTACACGTCCTGGGACAAAGGTTAAATACTCATAGTGCTCAATTCTTATGCATGACAAATAGCAAGAATATATATGATATGTTCTCAGTAGGGAGTCTCCCAAATCGTATATGGGAAGACAAACTGATGTGGACAAAAGACAGTTAAGGAGAAGGCCC"

def dh (s1, s2):
    distance = 0
    for i in range(len(s1)):
        if s1[i] == s2[i]:
            continue
        distance+=1
    return distance

print(dh(string1, string2))

本来感觉已经很简洁了,没想下面还有更简洁的。

别人的解法

1

s1 = 'GAGCCTACTAACGGGAT'
s2 = 'CATCGTAATGACGGCCT'
print [ a!=b for (a, b) in zip(s1, s2)].count(True)

2

s1 = 'GAGCCTACTAACGGGAT'
s2 = 'CATCGTAATGACGGCCT'
sum([a != b for a, b in zip(s1, s2)])

解析

Python zip() 函数 | 菜鸟教程 (runoob.com)

zip() 函数用于将可迭代的对象作为参数,将对象中对应的元素打包成一个个元组,然后返回由这些元组组成的列表。

实例(Python 3.0+)

>>> a = [1,2,3] >>> b = [4,5,6] >>> c = [4,5,6,7,8] >>> zipped = zip(a,b) # 返回一个对象 >>> zipped <zip object at 0x103abc288> >>> list(zipped) # list() 转换为列表 [(1, 4), (2, 5), (3, 6)] >>> list(zip(a,c)) # 元素个数与最短的列表一致 [(1, 4), (2, 5), (3, 6)]

>>> a1, a2 = zip(*zip(a,b)) # 与 zip 相反,zip(*) 可理解为解压,返回二维矩阵式 >>> list(a1) [1, 2, 3] >>> list(a2) [4, 5, 6] >>>

感觉第二个可读性更好,而且更简洁。

s1 = 'GAGCCTACTAACGGGAT'
s2 = 'CATCGTAATGACGGCCT'
sum([a != b for a, b in zip(s1, s2)])

就是列表生成式里面套了个zip

生成一个包含所有 a != b 的情况的列表

print([a != b for a, b in zip(s1, s2)])这行代码可以看到实际上生成的列表是这样的:

[True, False, True, False, True, False, False, True, False, True, False, False, False, False, True, True, False]

然后用sum统计,从这里也可以看出sum()只会统计true

4 Mendel's First Law

ROSALIND | Mendel's First Law

Problem

img
img

Figure 2. The probability of any outcome (leaf) in a probability tree diagram is given by the product of probabilities from the start of the tree to the outcome. For example, the probability that X is blue and Y is blue is equal to (2/5)(1/4), or 1/10.

Probability is the mathematical study of randomly occurring phenomena. We will model such a phenomenon with a random variable, which is simply a variable that can take a number of different distinct outcomes depending on the result of an underlying random process.

For example, say that we have a bag containing 3 red balls and 2 blue balls. If we let XX represent the random variable corresponding to the color of a drawn ball, then the probability of each of the two outcomes is given by

Random variables can be combined to yield new random variables. Returning to the ball example, let YY model the color of a second ball drawn from the bag (without replacing the first ball). The probability of YY being red depends on whether the first ball was red or blue. To represent all outcomes of XX and YY, we therefore use a probability tree diagram. This branching diagram represents all possible individual probabilities for XX and YY, with outcomes at the endpoints ("leaves") of the tree. The probability of any outcome is given by the product of probabilities along the path from the beginning of the tree; see Figure 2 for an illustrative example.

An event is simply a collection of outcomes. Because outcomes are distinct, the probability of an event can be written as the sum of the probabilities of its constituent outcomes. For our colored ball example, let AA be the event "YY is blue." Pr(A)Pr(A) is equal to the sum of the probabilities of two different outcomes: Pr(X=blue and Y=blue)+Pr(X=red and Y=blue)Pr(X=blue and Y=blue)+Pr(X=red and Y=blue), or 310+110=25310+110=25 (see Figure 2 above).

Given: Three positive integers kk, mm, and nn, representing a population containing k+m+nk+m+n organisms: kk individuals are homozygous dominant for a factor, mm are heterozygous, and nn are homozygous recessive.

Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.

Sample Dataset

2 2 2

Sample Output

0.78333

问题翻译

给定:三个正整数k、m、n,代表一个包含k+m+n生物的种群:k个个体是同质显性因子,m个是杂合因子,n个是同质隐性因子。

返回:两个随机选择的交配生物体产生一个拥有显性等位基因的个体(从而显示显性表型)的概率。假设任何两个生物体都可以交配。

提示

Consider simulating inheritance on a number of small test cases in order to check your solution.

考虑在一些小的测试案例上模拟继承,以便检查你的解决方案。

思考

说实在话,没看懂这道题在问什么,题干和例题好像是分离的。

看了半天终于弄明白在问什么了

我的解法

k = 27
m = 18
n = 26


def acc(num):
    return num * (num - 1) / 2  # 等差数列求和公式


def MFL(k, m, n):
    p1 = acc(k + m + n)  # 相互交配的可能性总数

    kSelf = acc(k) / p1  # AA自交的概率
    mSelf = acc(m) / p1  # Aa自交的概率
    nSelf = acc(n) / p1  # aa自交的概率

    km = (m * k) / p1  # AA 与 Aa 相交的概率
    kn = (k * n) / p1  # AA 与 aa 相交的概率
    mn = (m * n) / p1  # Aa 与 aa 相交的概率

    return 1 - nSelf - (mn * 1 / 2) - (mSelf * 1 / 4)


print(MFL(k, m, n))

我这就是平铺直述了,估计有大神几行搞定

别人的解法

1

def firstLaw(k,m,n):
   N = float(k+m+n)
   return(1 - 1/N/(N-1)*(n*(n-1) + n*m + m*(m-1)/4.))

using the complementary probability of getting no dominant alleles. The terms in the sum correspond to the choice of the mating partners: from the subpopulation with recessive alleles only, one from the mixed and one from recessive only (counted twice, but only 50% of the offspring are homozygous), both from mixed (only 25% of the offspring are homozygous).

2

def firstLaw(k,m,n):
    N = float(k+m+n)
    return 1 - ( m*n + .25*m*(m-1) + n*(n-1) ) / ( N*(N-1) )

理解

第二种公式大概可以这么理解

5 Translating RNA into Protein

ROSALIND | Translating RNA into Protein

Problem

The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

Given: An RNA string ss corresponding to a strand of mRNA (of length at most 10 kbp).

Return: The protein string encoded by ss.

Sample Dataset

AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA

Sample Output

MAMAPRTEINSTRING

思考

最简单的办法就是写一个字典,把所有的密码子和氨基酸写到字典里,然后三个三个的读就行了。

这样感觉很复杂,有没有优雅一点的办法?

我的解法

CodonDict = {
    "F": ["UUU""UUC"],
    "L": ["UUA""UUG""CUU""CUC""CUA""CUG"],
    "S": ["UCU""UCC""UCA""UCG""AGU""AGC"],
    "Y": ["UAU""UAC", ],
    "Stop": ["UAA""UAG""UGA"],
    "C": ["UGU""UGC", ],
    "W": ["UGG"],
    "P": ["CCU""CCC""CCA""CCG"],
    "H": ["CAU""CAC", ],
    "Q": ["CAA""CAG", ],
    "R": ["CGU""CGC""CGA""CGG""AGA""AGG"],
    "I": ["AUU""AUC""AUA", ],
    "M": ["AUG"],
    "T": ["ACU""ACC""ACA""ACG"],
    "N": ["AAU""AAC"],
    "K": ["AAA""AAG"],
    "V": ["GUU""GUC""GUA""GUG"],
    "A": ["GCU""GCC""GCA""GCG"],
    "D": ["GAU""GAC"],
    "E": ["GAA""GAG"],
    "G": ["GGU""GGC""GGA""GGG"]
    }

seq = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
nucleics = ""
import re
# 用正则表达式,re 中的 findall方法分割字符串
seq_codons = re.findall('.{3}', seq.upper())
for codon in seq_codons:
    for nucleic, codons in CodonDict.items():
        if codon not in codons:
            continue
        nucleics += nucleic

print(nucleics.strip("Stop"))

绝对有更好的办法吧?

别人的解法

string = """UUU F      CUU L      AUU I      GUU V
UUC F      CUC L      AUC I      GUC V
UUA L      CUA L      AUA I      GUA V
UUG L      CUG L      AUG M      GUG V
UCU S      CCU P      ACU T      GCU A
UCC S      CCC P      ACC T      GCC A
UCA S      CCA P      ACA T      GCA A
UCG S      CCG P      ACG T      GCG A
UAU Y      CAU H      AAU N      GAU D
UAC Y      CAC H      AAC N      GAC D
UAA Stop   CAA Q      AAA K      GAA E
UAG Stop   CAG Q      AAG K      GAG E
UGU C      CGU R      AGU S      GGU G
UGC C      CGC R      AGC S      GGC G
UGA Stop   CGA R      AGA R      GGA G
UGG W      CGG R      AGG R      GGG G"""


coded = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
decoded = ''

traL =  string.split()
traDict = dict(zip(traL[0::2], traL[1::2]))

for i in range(0, len(coded)-33):
    decoded += traDict[coded[i:i+3]]

print decoded

尝试理解

不难看出,上面这个人极度优雅。

用一段文本制作字典原来这么简单,亏得我一个一个的手动敲

dict(zip(traL[0::2], traL[1::2])) 这句话直接就把上面那一长串字符串做成了字典。

traL[0::2] 作为key

traL[1::2] 作为value

然后用 zip 函数把上面两个包起来

然后 dict 直接转成字典

还有一个不用正则表达式就实现 以3为长度 分割字符串的切片方法

for i in range(0, len(coded)-3, 3): decoded += traDict[coded[i:i+3]]

引用一个 这个段代码下面的评论:

I arrive on these threads a proud papa, thrilled that I have cracked the problem. And then I'm floored by these answers that get to the same result in half the lines (and elegantly, to boot) :)

分类:

后端

标签:

Python

作者介绍

李拖沓
V1