
5. Chapter4-1. Gene Sequences - Sequence object

HJChung 2020. 3. 10. 09:13

The Sequence object is a basic building block of Biopython. In this chapter, we look at what a Sequence object is and use it to handle a target gene sequence.

 

For practice, we use a sequence that contains the TATA box.

In [1]:
import Bio
In [2]:
# 1. Create Sequence Object

from Bio.Seq import Seq

tatabox_seq = Seq("tataaaggcAATATGCAGTAG")
print(tatabox_seq)
print(type(tatabox_seq))
 
tataaaggcAATATGCAGTAG
<class 'Bio.Seq.Seq'>
 

A Sequence object should also carry information about itself. We can record the type of the sequence (DNA, RNA, or amino acid) using the Alphabet module. Note that Bio.Alphabet was removed in Biopython 1.78, so the next cell runs only on earlier versions.

In [3]:
#2. Alphabet Module

from Bio.Alphabet import IUPAC

tatabox_seq = Seq("tataaaggcAATATGCAGTAG", IUPAC.unambiguous_dna)
print(tatabox_seq)
print(type(tatabox_seq))
 
tataaaggcAATATGCAGTAG
<class 'Bio.Seq.Seq'>
 

Besides the DNA alphabets, the IUPAC module also provides alphabet objects for RNA and protein sequences.

 

Now that we have a Sequence object, we can call its methods.

In [4]:
#3. Count Base Number in Sequence 
from Bio.Seq import Seq

my_seq = Seq("ATGCAGTAG")
count_a = my_seq.count("A")

print(count_a)  # count the number of adenine bases
 
3
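
Counting one base at a time works, but the full base composition is often useful as well. The following is a minimal plain-Python sketch, not a Biopython feature; the helper name base_composition is hypothetical, and it assumes that upper- and lowercase letters should be counted together.

```python
from collections import Counter


def base_composition(sequence):
    """Return a Counter of the letters in a sequence, ignoring case."""
    # str() lets the helper accept a plain string or a Seq-like object.
    return Counter(str(sequence).upper())


composition = base_composition("ATGCAGTAG")
for base in "ATGC":
    print(base, composition[base])
```

A Counter returns 0 for letters that never occur, so looking up a missing base does not raise an error.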
 

You can calculate the GC content (%), which tells you what percentage of the bases in the sequence are G or C: GC content (%) = ((count_c + count_g) / count_totalbase) * 100

In [5]:
#4. Calculate the GC-contents in this Sequence Object

count_c = my_seq.count("C")
count_g = my_seq.count("G")
count_totalbase = len(my_seq)

GC_contents = ((count_c + count_g)/count_totalbase)*100

print(GC_contents)
 
44.44444444444444
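
The same formula can be wrapped in a small reusable function. This is a plain-Python sketch under two assumptions that are not in the cell above: lowercase bases are counted like uppercase ones, and an empty sequence reports 0 instead of raising a division error. The name gc_content is hypothetical.

```python
def gc_content(sequence):
    """Return the GC content of a sequence as a percentage."""
    bases = str(sequence).upper()
    if not bases:
        # Assumption: report 0.0 for an empty sequence rather than dividing by zero.
        return 0.0
    gc_count = bases.count("G") + bases.count("C")
    return gc_count / len(bases) * 100


print(gc_content("ATGCAGTAG"))
print(gc_content("atgcagtag"))  # same value, because case is ignored
```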
In [6]:
#5. Converting a Sequence Object to Uppercase or Lowercase

tatabox_seq = Seq("tataaggCAATATGCAGTAG")
print(tatabox_seq.upper())
print(tatabox_seq.lower())
 
TATAAGGCAATATGCAGTAG
tataaggcaatatgcagtag
 

DNA is transcribed into mRNA, which is then translated into protein. This is the central dogma of molecular biology.

In [7]:
#6. Transcribing and Translating Sequence Objects

my_dna = Seq("ATGCAGTAGACT")
my_mrna = my_dna.transcribe()
my_protein = my_dna.translate()

print(my_mrna)
print(my_protein)
 
AUGCAGUAGACU
MQ*T
 

Translation should end when a stop codon is reached. Passing to_stop=True to translate() ends the translation at the first stop codon.

In [8]:
#7. Stop Translation at the First Stop Codon
my_mrna = Seq("AUGAACUAAGUUUAGAAU")
my_protein = my_mrna.translate()
my_protein_stop = my_mrna.translate(to_stop=True)

print(my_protein)
print(my_protein_stop)
 
MN*V*N
MN
In [9]:
#8. Split the Translation at Stop Codons
my_mrna = Seq("AUGAACUAAGUUUAGAAU")
my_protein = my_mrna.translate()
print(my_protein)

for seq in my_protein.split('*'):
    print(seq)
 
MN*V*N
MN
V
N
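
To see what transcribe() and translate() do conceptually, the steps above can be sketched in plain Python. This is an illustration, not Biopython's implementation: the function names are hypothetical, the codon table is the standard one written as a 64-letter string in T, C, A, G order, and ambiguous letters such as N are not handled.

```python
from itertools import product

# Standard genetic code, one letter per codon, with codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Turn a coding-strand DNA string into mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(sequence, to_stop=False):
    """Translate a DNA or mRNA string codon by codon ('*' marks a stop)."""
    dna = sequence.upper().replace("U", "T")
    protein = []
    # Ignore trailing bases that do not form a complete codon.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]  # KeyError for letters outside T, C, A, G
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


print(transcribe("ATGCAGTAGACT"))
print(translate("AUGAACUAAGUUUAGAAU"))
print(translate("AUGAACUAAGUUUAGAAU", to_stop=True))
```

With to_stop=True the loop breaks before the stop symbol is appended, which is why the result contains no '*'.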
 

In DNA, adenine pairs with thymine through two hydrogen bonds, and guanine pairs with cytosine through three hydrogen bonds. This is called complementary base pairing.

In [10]:
#9-1. Create complementary and reverse complementary sequences of DNA sequence in Python

my_dna = "TATAAAGGCAATATGCAGTAG"
comp_dic = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}  # Dictionary mapping each base to its complementary base.

comp_seq = ""

for base in my_dna:
    comp_seq += comp_dic[base]

revcomp_seq = comp_seq[::-1]

print(comp_seq)
print(revcomp_seq)
 
ATATTTCCGTTATACGTCATC
CTACTGCATATTGCCTTTATA
In [11]:
#9-2. Create complementary and reverse complementary sequences of DNA sequence in BioPython

my_dna = Seq("TATAAAGGCAATATGCAGTAG")

comp_seq = my_dna.complement()
revcomp_seq = my_dna.reverse_complement()

print(comp_seq)
print(revcomp_seq)
 
ATATTTCCGTTATACGTCATC
CTACTGCATATTGCCTTTATA
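
The dictionary loop in 9-1 raises a KeyError on lowercase input such as "tataaaggcAATATGCAGTAG". As an alternative plain-Python sketch, a translation table built with str.maketrans handles both cases in one pass. The names below are hypothetical, and only the four unambiguous bases are mapped.

```python
# Translation table covering upper- and lowercase unambiguous DNA bases.
COMPLEMENT_TABLE = str.maketrans("ATGCatgc", "TACGtacg")


def complement(dna):
    """Return the complementary strand, preserving letter case."""
    # Characters that are not in the table are left unchanged by str.translate.
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Return the complementary strand read in the opposite direction."""
    return complement(dna)[::-1]


print(complement("TATAAAGGCAATATGCAGTAG"))
print(reverse_complement("TATAAAGGCAATATGCAGTAG"))
print(complement("tataaaggcAATATGCAGTAG"))
```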
 

Transcription of DNA produces mRNA. Translation reads the mRNA three bases (one codon) at a time and produces the corresponding amino acid according to the codon table. You can print a codon table with Biopython.

In [12]:
#10-1. Standard Codon Table
from Bio.Data import CodonTable

codon_table = CodonTable.unambiguous_dna_by_name["Standard"] #standard codon table
print(codon_table)
 
Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
In [13]:
#10-2. Mitochondria Codon Table

mito_codon_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
print(mito_codon_table)
 
Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
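
The two tables differ in only a few codons, which is easier to see programmatically than by eye. The sketch below is plain Python and does not use the CodonTable objects: the two 64-letter strings are transcribed from the tables printed above (codons ordered T, C, A, G, with '*' for Stop), and the start-codon markers "(s)" are ignored.

```python
from itertools import product

# All 64 codons in the same T, C, A, G order as the printed tables.
CODONS = ["".join(codon) for codon in product("TCAG", repeat=3)]

# One letter per codon; '*' stands for Stop.
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERTEBRATE_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"

# Keep only the codons whose meaning differs between the two tables.
differences = {
    codon: (standard, mito)
    for codon, standard, mito in zip(CODONS, STANDARD, VERTEBRATE_MITO)
    if standard != mito
}

for codon, (standard, mito) in differences.items():
    print(codon, standard, "->", mito)
```

In this comparison TGA changes from Stop to W, ATA from I to M, and AGA and AGG from R to Stop.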
 

An ORF (Open Reading Frame) is a stretch of bases that could encode a protein, starting with the start codon ATG and ending with a stop codon.

Therefore, finding an ORF means finding the sequence that runs from a start codon through the next stop codon.

In [14]:
#11. Find Open Reading Frame

tatabox_seq = Seq("tataaaggcAATATGCAGTAG")
start_idx = tatabox_seq.find("ATG")
end_idx = tatabox_seq.find("TAG", start_idx)  # Only 'TAG' is searched here; the stop codons are 'TAA', 'TAG', and 'TGA'.

orf = tatabox_seq[start_idx:end_idx+3]  # add 3 so the slice includes the stop codon

print(orf)
 
ATGCAGTAG
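
The cell above searches for "TAG" anywhere after the start codon, so it can pick up a stop codon that is out of frame and it misses 'TAA' and 'TGA'. A slightly more careful plain-Python sketch steps through the sequence three bases at a time from each ATG and accepts any of the three stop codons. The function name is hypothetical; the sketch uppercases its input and looks at the forward strand only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_first_orf(sequence):
    """Return the first ATG...stop stretch that is in frame, or None."""
    dna = str(sequence).upper()
    start = dna.find("ATG")
    while start != -1:
        # Walk codon by codon so that only in-frame stop codons count.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return dna[start:i + 3]
        # No in-frame stop after this ATG; try the next one.
        start = dna.find("ATG", start + 1)
    return None


print(find_first_orf("tataaaggcAATATGCAGTAG"))
print(find_first_orf("ATGATAGCCTAA"))  # the TAG at index 4 is out of frame and is skipped
```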