Sequence in BioPython module
Prerequisite: BioPython module
A sequence is a string of letters that represents a biological molecule such as DNA, RNA, or a protein. In Biopython, sequences are usually handled by the Seq object defined in the Bio.Seq module. The Seq object provides biological methods such as complement, reverse_complement, transcribe, back_transcribe, and translate. It also supports many string methods such as count(), find(), split(), and strip().
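To make concrete what some of these methods compute, here is a minimal plain-Python sketch for unambiguous DNA. It does not use Biopython, and the function names simply mirror the Seq method names for illustration. The real Seq methods also handle ambiguity codes, which this sketch ignores.

```python
# Illustration only: plain-string versions of a few Seq operations.
# Assumes unambiguous DNA/RNA letters (A, C, G, T/U); not Biopython's code.
DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(DNA_COMPLEMENT)


def reverse_complement(dna):
    """Complement the strand, then read it in the opposite direction."""
    return complement(dna)[::-1]


def transcribe(dna):
    """Turn a DNA coding strand into RNA by replacing T with U."""
    return dna.replace("T", "U").replace("t", "u")


def back_transcribe(rna):
    """Turn RNA back into DNA by replacing U with T."""
    return rna.replace("U", "T").replace("u", "t")


coding_strand = "ATGGCC"
print(complement(coding_strand))          # TACCGG
print(reverse_complement(coding_strand))  # GGCCAT
print(transcribe(coding_strand))          # AUGGCC
print(coding_strand.count("G"))           # 2
print(coding_strand.find("GCC"))          # 3
```

The last two lines use ordinary string methods. A Seq object accepts the same count() and find() calls.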
Below are some examples of Seq objects in Biopython:
Example 1:
Python3

    # Import libraries
    from Bio.Seq import Seq

    # Creating a sequence
    seq = Seq("GACT")

    # Printing the sequence
    print(seq)
Output:
GACT
In the above example, the Seq object itself does not say what kind of molecule GACT is. Read as DNA, the letters stand for guanine, adenine, cytosine, and thymine; read as a protein, they stand for glycine, alanine, cysteine, and threonine. In Biopython releases before 1.78, each Seq object had two important attributes:
- Data, which is the actual sequence string (GACT in this case).
- Alphabet, which represents the type of the sequence, i.e. DNA sequence, RNA sequence, etc. By default it is a generic alphabet that does not specify any particular sequence type.
Example 2:
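Python3

    # Import libraries
    from Bio.Seq import Seq

    # Creating a sequence that uses "=" as a gap character
    seq = Seq("ACGT=TT")

    # Removing the gap character from the sequence
    updatedSeq = seq.ungap("=")

    # Printing the sequence
    print(updatedSeq)
Output:
ACGTTT
Here, the letters A, C, G, and T represent adenine, cytosine, guanine, and thymine. The = character is treated as a gap symbol, and ungap("=") returns a copy of the sequence with every = removed, so ACGT=TT becomes ACGTTT. If ungap() is not available in your Biopython version, removing the character with a plain string replacement on str(seq) gives the same result.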
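The effect of removing a gap character can be checked on a plain string without Biopython. The sketch below is only an illustration; `remove_gaps` is a hypothetical helper name, not part of the Biopython API.

```python
def remove_gaps(sequence, gap="-"):
    """Return the sequence with every occurrence of the gap character removed."""
    return sequence.replace(gap, "")


print(remove_gaps("ACGT=TT", gap="="))  # ACGTTT
```

Only the gap character is dropped, so both trailing T letters are kept.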
Alphabet Class:
In addition to its string-like behaviour, a Seq object in Biopython releases before 1.78 also carried an alphabet. The alphabet was an instance of a class from the Bio.Alphabet module, for example IUPAC DNA or generic DNA. It described the type of molecule (DNA, RNA, or protein) and could also indicate the expected symbols.
The Alphabet module provides the following classes to represent various sequences:
Class | Property |
---|---|
SingleLetterAlphabet | Generic alphabet with letters of size one. It derives from Alphabet and is the base class for the single-letter alphabets listed below. |
ProteinAlphabet | Generic single letter protein alphabet |
NucleotideAlphabet | Generic single letter nucleotide alphabet |
DNAAlphabet | Generic single letter DNA alphabet. |
RNAAlphabet | Generic single letter RNA alphabet. |
SecondaryStructure | Alphabet used to describe secondary structure. |
ThreeLetterProtein | Three letter protein alphabet. |
AlphabetEncoder | Class used to construct a new and extended alphabet from an existing one. |
Gapped | Alphabets which contain a gap character. |
HasStopCodon | Alphabets which contain a stop symbol. |
Bio.Alphabet also provides an IUPAC module which gives sequence types as defined by the IUPAC community. Some classes in IUPAC module are listed below:
Class | Instance | Property |
---|---|---|
IUPACProtein | protein | IUPAC protein alphabet of 20 standard amino acids. |
ExtendedIUPACProtein | extended_protein | Extended uppercase IUPAC protein single letter alphabet. |
IUPACAmbiguousDNA | ambiguous_dna | Uppercase IUPAC ambiguous DNA. |
IUPACUnambiguousDNA | unambiguous_dna | Uppercase IUPAC unambiguous DNA (GATC). |
ExtendedIUPACDNA | extended_dna | Extended IUPAC DNA alphabet. |
IUPACAmbiguousRNA | ambiguous_rna | Uppercase IUPAC ambiguous RNA. |
IUPACUnambiguousRNA | unambiguous_rna | Uppercase IUPAC unambiguous RNA (GAUC). |
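The letter sets in the table show why a bare string can be ambiguous. The plain-Python sketch below checks which alphabets a string is compatible with. The names `ALPHABET_LETTERS` and `compatible_alphabets` are hypothetical, and the letter sets are assumed to be the unambiguous DNA letters (GATC), the unambiguous RNA letters (GAUC), and the 20 standard amino acid codes.

```python
# Hypothetical lookup table, not a Biopython object.
ALPHABET_LETTERS = {
    "DNA": set("GATC"),
    "RNA": set("GAUC"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def compatible_alphabets(sequence):
    """List every alphabet whose letter set contains all letters of the sequence."""
    letters = set(sequence.upper())
    return [
        name
        for name, allowed in ALPHABET_LETTERS.items()
        if letters <= allowed
    ]


print(compatible_alphabets("GACT"))  # ['DNA', 'protein']
print(compatible_alphabets("GAUC"))  # ['RNA']
print(compatible_alphabets("MKV"))   # ['protein']
```

GACT fits both the DNA and the protein letter sets, so the letters alone cannot settle the molecule type. That is the information an alphabet object was meant to supply.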
The Bio.Alphabet module was removed in Biopython 1.78. The intended purpose of the alphabet objects was never well defined, and the 20-year-old design had drawbacks. In particular, the AlphabetEncoder class was excessively complex, which made it hard to determine the type of molecule. Finding the consensus of several alphabet objects (e.g. when adding sequences together) was often difficult.
Without a concrete plan for how to improve or replace the existing system, it was decided to remove the Bio.Alphabet module altogether.