
4. Chapter 3. Introduction to Bioinformatics File Formats

HJChung 2020. 3. 8. 16:49

To get specific information out of a file, you first need to understand its format, because the format determines what kind of data the file contains and how that data is laid out.

 

So in this chapter, I learned about the various types of files used in bioinformatics.

 

1. FASTA/FASTQ

  • FASTA: a text-based format that represents nucleotide or protein sequences.
  • FASTQ: a text-based format that stores a quality score for each base alongside the sequence.
    • Use of FASTA: because the file stores sequence information, it is typically used as a reference.
    • Use of FASTQ: used when a quality value is needed for each base of the sequence.
      • File format of FASTQ: the information about one read appears in four lines.
        1. Header: written in the form @SEQ_ID, this is a unique value holding the name (ID) and other information about the read. ex) @DRR000615.149 HWUSI-EAS100R:6:73:941:1973 length=51
                (fields, in order: @public data accession, name, sequencing instrument name, coordinates _:_:_:_, length)
        2. Sequence: the DNA bases (A / T / G / C) in 5' to 3' order. ('N' stands for an unknown base.)
        3. Separator: a line that starts with a '+' character, optionally followed by the same sequence identifier again. It carries no additional information.
        4. Quality value: the quality of the sequence, with the quality of each base encoded as one ASCII character. From these values you can tell how accurately each base was sequenced.
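
The four-line layout above can be read with a few lines of plain Python. This is only a minimal sketch using the standard library, not the book's or Biopython's implementation. The names `FastqRecord`, `read_fastq`, `phred_scores` and `error_probability` are made up for illustration, the sample read is invented, and the quality decoding assumes the common Phred+33 encoding (ASCII code minus 33), which may not match every file.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # line 1 without the leading '@'
    sequence: str  # line 2, bases in 5' to 3' order
    quality: str   # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every four lines of a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality characters to Phred scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(score: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# Invented example read, not taken from a real data set.
sample = io.StringIO("@read1 example\nACGTN\n+\nII5!#\n")
for record in read_fastq(sample):
    print(record.header, record.sequence, phred_scores(record.quality))
```

With this encoding, a higher score means a lower chance that the base was called incorrectly: a score of 20 corresponds to an error probability of about 1 in 100.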

FASTQ file format 

2. SAM/BAM

  • SAM: Sequence Alignment Map, a text-based format that contains read alignment data. For example, the result of genome mapping with the BWA program is saved in SAM format.
  • BAM: Binary Alignment Map, a SAM file converted into binary format. It is much smaller than the equivalent SAM file.
    • File format of SAM/BAM: the format consists of a header section and an alignment section.
      1. Header section 
      2. Alignment section
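
To make the two sections concrete, here is a rough sketch that splits the text of a SAM file into header lines and alignment lines. It relies on two facts from the SAM specification: header lines begin with '@', and each alignment line has 11 mandatory tab-separated columns. The function name `split_sam` and the sample lines are invented for illustration, and the sketch only handles text SAM. A BAM file is binary and would have to be converted first.

```python
from typing import Dict, Iterable, List, Tuple

# The 11 mandatory columns of an alignment line, in order.
SAM_FIELDS = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)


def split_sam(lines: Iterable[str]) -> Tuple[List[str], List[Dict[str, str]]]:
    """Separate header lines ('@...') from alignment lines."""
    headers: List[str] = []
    alignments: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        if line.startswith("@"):
            headers.append(line)
            continue
        columns = line.split("\t")
        if len(columns) < len(SAM_FIELDS):
            raise ValueError(f"expected at least 11 columns, got {len(columns)}")
        # Optional tag columns after the first 11 are ignored in this sketch.
        alignments.append(dict(zip(SAM_FIELDS, columns[:11])))
    return headers, alignments


# Invented example content, not output from a real mapping run.
sample_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTN\tII5!#",
]
headers, alignments = split_sam(sample_lines)
print(len(headers), "header lines")
print(alignments[0]["RNAME"], alignments[0]["POS"], alignments[0]["CIGAR"])
```

All values are kept as strings here. In real use you would probably convert FLAG, POS and MAPQ to integers.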

Header Section (source: <바이오파이썬으로 만나는 생물정보학>)
Alignment Section