import Bio

# 1. Create Sequence Object

from Bio.Seq import Seq

tatabox_seq = Seq("tataaaggcAATATGCAGTAG")
print(tatabox_seq)
print(type(tatabox_seq))

tataaaggcAATATGCAGTAG
<class 'Bio.Seq.Seq'>

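A Seq object on its own does not say what kind of sequence it holds. In Biopython releases before 1.78, the type of the sequence (DNA, RNA, or amino acid) can be attached using the Alphabet module. Bio.Alphabet was removed in Biopython 1.78, so the import in the next example fails on newer versions.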

#2. Alphabet Module

from Bio.Alphabet import IUPAC

tatabox_seq = Seq("tataaaggcAATATGCAGTAG", IUPAC.unambiguous_dna)
print(tatabox_seq)
print(type(tatabox_seq))

tataaaggcAATATGCAGTAG
<class 'Bio.Seq.Seq'>

Besides the alphabet objects for DNA, the IUPAC module also contains alphabet objects for RNA and protein sequences.
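
The idea behind an alphabet can be illustrated without Biopython. The following is a minimal sketch in plain Python, assuming that "unambiguous DNA" means only the letters A, C, G, and T. The helper name `is_unambiguous_dna` is hypothetical and is not part of Biopython.

```python
# Hypothetical helper for illustration; not a Biopython function.
UNAMBIGUOUS_DNA = set("ACGT")


def is_unambiguous_dna(sequence):
    """Return True if the sequence is non-empty and uses only A, C, G, T (any case)."""
    return len(sequence) > 0 and set(sequence.upper()) <= UNAMBIGUOUS_DNA


print(is_unambiguous_dna("tataaaggcAATATGCAGTAG"))  # True
print(is_unambiguous_dna("ATGN"))                   # False, N is an ambiguity code
```

A check like this only validates the letters. It cannot tell whether a string such as "ACG" was meant as DNA or as a short peptide, which is why the sequence type has to be recorded separately.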

Now that we have a Seq object, we can call its methods.

#3. Count Bases in a Sequence
from Bio.Seq import Seq

my_seq = Seq("ATGCAGTAG")
count_a = my_seq.count("A")

print(count_a)  # count the number of adenine bases

3
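
Like Python's `str.count`, `Seq.count` counts non-overlapping occurrences, which matters when searching for motifs longer than one base. The sketch below uses plain strings to show the difference and tallies every base at once with `collections.Counter`. The helper `count_overlapping` is a hypothetical name written for this illustration.

```python
from collections import Counter


def count_overlapping(sequence, sub):
    """Count occurrences of sub in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        # Resume the search one position later so overlapping hits are found.
        start = sequence.find(sub, start + 1)
    return count


seq = "ATGCAGTAG"
base_counts = Counter(seq)  # tally of every base in one pass

print(base_counts["A"])                 # 3
print("AAAA".count("AA"))               # 2 (non-overlapping)
print(count_overlapping("AAAA", "AA"))  # 3 (overlapping)
```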

You can calculate the GC content (%), which tells you what percentage of the bases in the sequence are G or C. GC content (%) = ((count_C + count_G) / count_totalbase) * 100

#4. Calculate the GC Content of this Sequence Object

count_c = my_seq.count("C")
count_g = my_seq.count("G")
count_totalbase = len(my_seq)

GC_contents = ((count_c + count_g)/count_totalbase)*100

print(GC_contents)

44.44444444444444
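
The same calculation can be wrapped in a small reusable function. This is a minimal sketch in plain Python. The function name `gc_content` and the choices to ignore case and to return 0.0 for an empty sequence are assumptions made for this example.

```python
def gc_content(sequence):
    """Return the GC content of a sequence as a percentage (0 to 100)."""
    sequence = sequence.upper()  # treat 'g' and 'G' the same
    if not sequence:
        return 0.0               # avoid dividing by zero on an empty sequence
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence) * 100


print(gc_content("ATGCAGTAG"))  # about 44.44, as in the example above
print(gc_content("atgcagtag"))  # same value, since case is ignored
```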

#5. Converting a Sequence Object to Uppercase or Lowercase

tatabox_seq = Seq("tataaggCAATATGCAGTAG")
print(tatabox_seq.upper())
print(tatabox_seq.lower())

TATAAGGCAATATGCAGTAG
tataaggcaatatgcagtag

DNA is transcribed into mRNA, and mRNA is translated into protein. This is the central dogma of molecular biology.

#6. Transcribing and Translating Sequence Objects

my_dna = Seq("ATGCAGTAGACT")
my_mrna = my_dna.transcribe()
my_protein = my_dna.translate()

print(my_mrna)
print(my_protein)

AUGCAGUAGACU
MQ*T
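
To see what these two methods do conceptually, here is a minimal sketch in plain Python. It is not Biopython's implementation. It assumes the standard genetic code, a coding-strand input, and only the letters A, C, G, T (or U); ambiguity codes such as N would raise a KeyError. The function names are chosen for this illustration only.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(nucleotides):
    """Translate DNA or mRNA codon by codon; '*' marks a stop codon."""
    dna = nucleotides.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # drop an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(transcribe("ATGCAGTAGACT"))  # AUGCAGUAGACU
print(translate("ATGCAGTAGACT"))   # MQ*T
```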

Translation should end when a stop codon is reached. Passing to_stop=True to translate() ends the translation at the first stop codon.

#7. Stop Translation at the First Stop Codon
my_mrna = Seq("AUGAACUAAGUUUAGAAU")
my_protein = my_mrna.translate()
my_protein_stop = my_mrna.translate(to_stop = True)

print(my_protein)
print(my_protein_stop)

MN*V*N
MN

#8. Split the Translation at Stop Codons
my_mrna = Seq("AUGAACUAAGUUUAGAAU")
my_protein = my_mrna.translate()
print(my_protein)

for seq in my_protein.split('*'):
    print(seq)

MN*V*N
MN
V
N
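
Not every fragment produced by splitting at '*' is a plausible protein, since a protein normally begins with methionine (M). The following is a minimal sketch in plain Python that keeps only the part of each fragment from its first M onward. The helper name `candidate_proteins` and this filtering rule are assumptions made for the illustration, not a Biopython feature.

```python
def candidate_proteins(translated):
    """Return the fragments of a translated string that contain an M, trimmed to start at it."""
    candidates = []
    for fragment in translated.split("*"):
        start = fragment.find("M")  # position of the first methionine, or -1
        if start != -1:
            candidates.append(fragment[start:])
    return candidates


print(candidate_proteins("MN*V*N"))    # ['MN']
print(candidate_proteins("KMAL*MT*"))  # ['MAL', 'MT']
```

Note that a trailing fragment with no stop codon after it is treated like any other fragment here; a stricter version might discard it.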

In DNA, adenine pairs with thymine through two hydrogen bonds, and guanine pairs with cytosine through three hydrogen bonds. This is called complementary base pairing.

#9-1. Create complementary and reverse complementary sequences of DNA sequence in Python

my_dna = "TATAAAGGCAATATGCAGTAG"
comp_dic = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}  # map each base to its complementary base

comp_seq = ""

for base in my_dna:
    comp_seq += comp_dic[base]

revcomp_seq = comp_seq[::-1]

print(comp_seq)
print(revcomp_seq)

ATATTTCCGTTATACGTCATC
CTACTGCATATTGCCTTTATA
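
The dictionary loop above raises a KeyError on lowercase bases. A more compact plain Python variant, sketched here under the assumption that the input contains only A, C, G, and T in either case, uses `str.maketrans` and `str.translate`. The function names are chosen for this example.

```python
# Translation table mapping each base to its complement, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Return the complementary strand, read in the same direction."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Return the complementary strand, reversed."""
    return complement(dna)[::-1]


print(complement("TATAAAGGCAATATGCAGTAG"))          # ATATTTCCGTTATACGTCATC
print(reverse_complement("TATAAAGGCAATATGCAGTAG"))  # CTACTGCATATTGCCTTTATA
print(reverse_complement("atgc"))                   # gcat
```

Characters that are not in the table, such as N, are passed through unchanged by `str.translate`.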

#9-2. Create complementary and reverse complementary sequences of DNA sequence in BioPython

my_dna = Seq("TATAAAGGCAATATGCAGTAG")

comp_seq = my_dna.complement()
revcomp_seq = my_dna.reverse_complement()

print(comp_seq)
print(revcomp_seq)

ATATTTCCGTTATACGTCATC
CTACTGCATATTGCCTTTATA

Transcription of DNA produces mRNA. Translation reads the mRNA three bases (one codon) at a time and produces the corresponding amino acid according to the codon table. You can print a codon table using BioPython.

#10-1. Standard Codon Table
from Bio.Data import CodonTable

codon_table = CodonTable.unambiguous_dna_by_name["Standard"] #standard codon table
print(codon_table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

#10-2. Vertebrate Mitochondrial Codon Table

mito_codon_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
print(mito_codon_table)

Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
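
The two tables differ in only a few codons, which is easier to see programmatically than by eye. The sketch below is plain Python and does not use Biopython. It builds the standard table from a compact string and then applies the four amino acid differences read off the Vertebrate Mitochondrial table printed above. Start-codon markers such as M(s) are ignored here, and all variable names are assumptions for this illustration.

```python
from itertools import product

BASES = "TCAG"
STANDARD_AA = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
standard = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AA)
}

# Differences taken from the printed Vertebrate Mitochondrial table ('*' = Stop).
mito = dict(standard)
mito.update({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})

# Collect every codon whose meaning differs between the two tables.
differences = {
    codon: (standard[codon], mito[codon])
    for codon in standard
    if standard[codon] != mito[codon]
}

for codon, (std_aa, mito_aa) in sorted(differences.items()):
    print(codon, std_aa, "->", mito_aa)
```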

An ORF (Open Reading Frame) is a stretch of sequence that could encode a protein: it begins with the start codon ATG and ends with a stop codon.

Finding an ORF therefore means finding the sequence between a start codon and a stop codon.

#11. Find Open Reading Frame

tatabox_seq = Seq("tataaaggcAATATGCAGTAG")
start_idx = tatabox_seq.find("ATG")
end_idx = tatabox_seq.find("TAG", start_idx)  # only 'TAG' is searched here; the stop codons are 'TAA', 'TAG', and 'TGA'

orf = tatabox_seq[start_idx:end_idx+3]  # add 3 so that the stop codon itself is included

print(orf)

ATGCAGTAG
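
The find-based search above works for this sequence, but it does not check that the stop codon lies in the same reading frame as the start codon, and it looks for only one of the three stop codons. The following is a minimal sketch in plain Python that steps through the sequence codon by codon from the first ATG. The function name `find_orf`, the decision to use only the first ATG, and the uppercase conversion are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orf(dna):
    """Return the ORF from the first ATG to the first in-frame stop codon, or None."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return None  # no start codon at all
    # Walk in steps of three so only in-frame codons are examined.
    for i in range(start, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            return dna[start:i + 3]  # include the stop codon itself
    return None  # start codon found, but no in-frame stop codon


print(find_orf("tataaaggcAATATGCAGTAG"))  # ATGCAGTAG
print(find_orf("ATGCTAGGCTAA"))           # ATGCTAGGCTAA
```

In the second call, a plain search for "TAG" would hit position 4, which is out of frame; stepping by codons skips it and stops at the in-frame TAA instead.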



Grace's Tech Blog

5. Chapter4-1. Gene Sequences - Sequence object
