Previous Up Next

Chapter 9  Advanced

9.1  The SeqRecord and SeqFeature classes

You read all about the basic Biopython sequence class in Chapter 3, which described how to do many useful things with just the sequence. However, many times sequences have important additional properties associated with them -- as you will have seen with the SeqRecord object in Chapter 4. This section described how Biopython handles these higher level descriptions of a sequence.

9.1.1  Sequence ids and Descriptions -- dealing with SeqRecords

Immediately above the Sequence class is the Sequence Record class, defined in the Bio.SeqRecord module. This class allows higher level features such as ids and features to be associated with the sequence, and is used thoughout the sequence input/output interface Bio.SeqIO, described in Chapter 4. The SeqRecordclass itself is very simple, and offers the following information as attributes:
seq
-- The sequence itself -- A Seq object

id
-- The primary id used to identify the sequence. In most cases this is something like an accession number.

name
-- A ``common'' name/id for the sequence. In some cases this will be the same as the accession number, but it could also be a clone name. I think of this as being analagous to the LOCUS id in a GenBank record.

description
-- A human readible description or expressive name for the sequence. This is similar to what follows the id information in a FASTA formatted entry.

annotations
-- A dictionary of additional information about the sequence. The keys are the name of the information, and the information is contained in the value. This allows the addition of more ``unstructed'' information to the sequence.

features
-- A list of SeqFeature objects with more structured information about the features on a sequence. The structure of sequence features is described below in Section 9.1.2.
Using a SeqRecord class is not very complicated, since all of the information is stored as attributes of the class. Initializing the class just involves passing a Seq object to the SeqRecord:
>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)
Additionally, you can also pass the id, name and description to the initialization function, but if not they will be set as strings indicating they are unknown, and can be modified subsequently:
>>> simple_seq_r.id
'<unknown id>'
>>> simple_seq_r.id = 'AC12345'
>>> simple_seq_r.description = 'My little made up sequence I wish I could
write a paper about and submit to GenBank'
>>> print simple_seq_r.description
My little made up sequence I wish I could write a paper about and submit
to GenBank
>>> simple_seq_r.seq
Seq('GATC', Alphabet())
Adding annotations is almost as easy, and just involves dealing directly with the annotation dictionary:
>>> simple_seq_r.annotations['evidence'] = 'None. I just made it up.'
>>> print simple_seq_r.annotations
{'evidence': 'None. I just made it up.'}
That's just about all there is to it! Next, you may want to learn about SeqFeatures, which offer an additional structured way to represent information about a sequence.

9.1.2  Features and Annotations -- SeqFeatures

Sequence features are an essential part of describing a sequence. Once you get beyond the sequence itself, you need some way to organize and easily get at the more ``abstract'' information that is known about the sequence. While it is probably impossible to develop a general sequence feature class that will cover everything, the Biopython SeqFeature class attempts to encapsulate as much of the information about the sequence as possible. The design is heavily based on the GenBank/EMBL feature tables, so if you understand how they look, you'll probably have an easier time grasping the structure of the Biopython classes.

9.1.2.1  SeqFeatures themselves

The first level of dealing with Sequence features is the SeqFeature class itself. This class has a number of attributes, so first we'll list them and there general features, and then work through an example to show how this applies to a real life example, a GenBank feature table. The attributes of a SeqFeature are:
location
-- The location of the SeqFeature on the sequence that you are dealing with. The locations end-points may be fuzzy -- section 9.1.2.2 has a lot more description on how to deal with descriptions.

type
-- This is a textual description of the type of feature (for instance, this will be something like 'CDS' or 'gene').

ref
-- A reference to a different sequence. Some times features may be ``on'' a particular sequence, but may need to refer to a different sequence, and this provides the reference (normally an accession number). A good example of this is a genomic sequence that has most of a coding sequence, but one of the exons is on a different accession. In this case, the feature would need to refer to this different accession for this missing exon.

ref_db
-- This works along with ref to provide a cross sequence reference. If there is a reference, ref_db will be set as None if the reference is in the same database, and will be set to the name of the database otherwise.

strand
-- The strand on the sequence that the feature is located on. This may either be '1' for the top strand, '-1' for the bottom strand, or '0' for both strands (or if it doesn't matter). Keep in mind that this only really makes sense for double stranded DNA, and not for proteins or RNA.

qualifiers
-- This is a python dictionary of additional information about the feature. The key is some kind of terse one-word description of what the information contained in the value is about, and the value is the actual information. For example, a common key for a qualifier might be ``evidence'' and the value might be ``computational (non-experimental).'' This is just a way to let the person who is looking at the feature know that it has not be experimentally (i. e. in a wet lab) confirmed.

sub_features
-- A very important feature of a feature is that it can have additional sub_features underneath it. This allows nesting of features, and helps us to deal with things such as the GenBank/EMBL feature lines in a (we hope) intuitive way.
To show an example of SeqFeatures in action, let's take a look at the following feature from a GenBank feature table:
     mRNA            complement(join(<49223..49300,49780..>50208))
                     /gene="F28B23.12"
To look at the easiest attributes of the SeqFeature first, if you got a SeqFeature object for this it would have it type of 'mRNA', a strand of -1 (due to the 'complement'), and would have None for the ref and ref_db since there are no references to external databases. The qualifiers for this SeqFeature would be a python dictionarary that looked like {'gene' : 'F28B23.12'}.

Now let's look at the more tricky part, how the 'join' in the location line is handled. First, the location for the top level SeqFeature (the one we are dealing with right now) is set as going from '<49223' to '>50208' (see section 9.1.2.2 for the nitty gritty on how fuzzy locations like this are handled). So the location of the top level object is the entire span of the feature. So, how do you get at the information in the 'join?' Well, that's where the sub_features go in.

The sub_features attribute will have a list with two SeqFeature objects in it, and these contain the information in the join. Let's look at top_level_feature.sub_features[0]; the first sub_feature). This object is a SeqFeature object with a type of 'mRNA_join,' a strand of -1 (inherited from the parent SeqFeature) and a location going from '<49223' to '49300'.

So, the sub_features allow you to get at the internal information if you want it (i. e. if you were trying to get only the exons out of a genomic sequence), or just to deal with the broad picture (i. e. you just want to know that the coding sequence for a gene lies in a region). Hopefully this structuring makes it easy and intuitive to get at the sometimes complex information that can be contained in a SeqFeature.

9.1.2.2  Locations

In the section on SeqFeatures above, we skipped over one of the more difficult parts of Features, dealing with the locations. The reason this can be difficult is because of fuzziness of the positions in locations. Before we get into all of this, let's just define the vocabulary we'll use to talk about this. Basically there are two terms we'll use:
position
-- This refers to a single position on a sequence, which may be fuzzy or not. For instance, 5, 20, <100 and 3^5 are all positions.

location
-- A location is two positions that defines a region of a sequence. For instance 5..20 (i. e. 5 to 20) is a location.
I just mention this because sometimes I get confused between the two.

The complication in dealing with locations comes in the positions themselves. In biology many times things aren't entirely certain (as much as us wet lab biologists try to make them certain!). For instance, you might do a dinucleotide priming experiment and discover that the start of mRNA transcript starts at one of two sites. This is very useful information, but the complication comes in how to represent this as a position. To help us deal with this, we have the concept of fuzzy positions. Basically there are five types of fuzzy positions, so we have five classes do deal with them:
ExactPosition
-- As its name suggests, this class represents a position which is specified as exact along the sequence. This is represented as just a a number, and you can get the position by looking at the position attribute of the object.

BeforePosition
-- This class represents a fuzzy position that occurs prior to some specified site. In GenBank/EMBL notation, this is represented as something like '<13', signifying that the real position is located somewhere less then 13. To get the specified upper boundary, look at the position attribute of the object.

AfterPosition
-- Contrary to BeforePosition, this class represents a position that occurs after some specified site. This is represented in GenBank as '>13', and like BeforePosition, you get the boundary number by looking at the position attribute of the object.

WithinPosition
-- This class models a position which occurs somewhere between two specified nucleotides. In GenBank/EMBL notation, this would be represented as '(1.5)', to represent that the position is somewhere within the range 1 to 5. To get the information in this class you have to look at two attributes. The position attribute specifies the lower boundary of the range we are looking at, so in our example case this would be one. The extension attribute specifies the range to the higher boundary, so in this case it would be 4. So object.position is the lower boundary and object.position + object.extension is the upper boundary.

BetweenPosition
-- This class deals with a position that occurs between two coordinates. For instance, you might have a protein binding site that occurs between two nucleotides on a sequence. This is represented as '2^3', which indicates that the real position happens between position 2 and 3. Getting this information from the object is very similar to WithinPosition, the position attribute specifies the lower boundary (2, in this case) and the extension indicates the range to the higher boundary (1 in this case).
Now that we've got all of the types of fuzzy positions we can have taken care of, we are ready to actually specify a location on a sequence. This is handled by the FeatureLocation class. An object of this type basically just holds the potentially fuzzy start and end positions of a feature. You can create a FeatureLocation object by creating the positions and passing them in:
>>> from Bio import SeqFeature
>>> start_pos = SeqFeature.AfterPosition(5)
>>> end_pos = SeqFeature.BetweenPosition(8, 1)
>>> my_location = SeqFeature.FeatureLocation(start_pos, end_pos)
If you print out a FeatureLocation object, you can get a nice representation of the information:
>>> print my_location
[>5:(8^9)]
We can access the fuzzy start and end positions using the start and end attributes of the location:
>>> my_location.start
<Bio.SeqFeature.AfterPosition instance at 0x101d7164>
>>> print my_location.start
>5
>>> print my_location.end
(8^9)
If you don't want to deal with fuzzy positions and just want numbers, you just need to ask for the nofuzzy_start and nofuzzy_end attributes of the location:
>>> my_location.nofuzzy_start
5
>>> my_location.nofuzzy_end
8
Notice that this just gives you back the position attributes of the fuzzy locations.

Similary, to make it easy to create a position without worrying about fuzzy positions, you can just pass in numbers to the FeaturePosition constructors, and you'll get back out ExactPosition objects:
>>> exact_location = SeqFeature.FeatureLocation(5, 8)
>>> print exact_location
[5:8]
>>> exact_location.start
<Bio.SeqFeature.ExactPosition instance at 0x101dcab4>
That is all of the nitty gritty about dealing with fuzzy positions in Biopython. It has been designed so that dealing with fuzziness is not that much more complicated than dealing with exact positions, and hopefully you find that true!

9.1.2.3  References

Another common annotation related to a sequence is a reference to a journal or other published work dealing with the sequence. We have a fairly simple way of representing a Reference in Biopython -- we have a Bio.SeqFeature.Reference class that stores the relevant information about a reference as attributes of an object.

The attributes include things that you would expect to see in a reference like journal, title and authors. Additionally, it also can hold the medline_id and pubmed_id and a comment about the reference. These are all accessed simply as attributes of the object.

A reference also has a location object so that it can specify a particular location on the sequence that the reference refers to. For instance, you might have a journal that is dealing with a particular gene located on a BAC, and want to specify that it only refers to this position exactly. The location is a potentially fuzzy location, as described in section 9.1.2.2.

That's all there is too it. References are meant to be easy to deal with, and hopefully general enough to cover lots of usage cases.

9.2  Regression Testing Framework

Biopython has a regression testing framework originally written by Andrew Dalke and ported to PyUnit by Brad Chapman which helps us make sure the code is as bug-free as possible before going out.

9.2.1  Writing a Regression Test

Every module that goes into Biopython should have a test (and should also have documentation!). Let's say you've written a new module called Biospam -- here is what you should do to make a regression test:
  1. Write a script called test_Biospam.py

  2. If the script requires files to do the testing, these should go in the directory Tests/Biospam.

  3. Write out the test output and verify the output to be correct. There are two ways to do this:
    1. The long way:
      • Run the script and write its output to a file. On UNIX machines, you would do something like: python test_Biospam.py > test_Biospam which would write the output to the file test_Biospam.

      • Manually look at the file test_Biospam to make sure the output is correct. When you are sure it is all right and there are no bugs, you need to quickly edit the test_Biospam file so that the first line is: `test_Biospam' (no quotes).

      • copy the test_Biospam file to the directory Tests/output


    2. The quick way:
      • Run python run_tests.py -g test_Biospam.py. The regression testing framework is nifty enough that it'll put the output in the right place in just the way it likes it.

      • Go to the output (which should be in Tests/output/test_Biospam) and double check the output to make sure it is all correct.


  4. Now change to the Tests directory and run the regression tests with python run_tests.py. This will run all of the tests, and you should see your test run (and pass!).

  5. That's it! Now you've got a nice test for your module ready to check into CVS. Congratulations!

9.3  Parser Design

9.3.1  Design Overview

Many of the Biopython parsers are built around an event-oriented design that includes Scanner and Consumer objects.

Scanners take input from a data source and analyze it line by line, sending off an event whenever it recognizes some information in the data. For example, if the data includes information about an organism name, the scanner may generate an organism_name event whenever it encounters a line containing the name.

Consumers are objects that receive the events generated by Scanners. Following the previous example, the consumer receives the organism_name event, and the processes it in whatever manner necessary in the current application.

9.3.2  Events

There are two types of events: info events that tag the location of information within a data stream, and section events that mark sections within a stream. Info events are associated with specific lines within the data, while section events are not.

Section event names must be in the format start_EVENTNAME and end_EVENTNAME where EVENTNAME is the name of the event.

For example, a FASTA-formatted sequence scanner may generate the following events:
EVENT NAME      ORIGINAL INPUT
begin_sequence 
title           >gi|132871|sp|P19947|RL30_BACSU 50S RIBOSOMAL PROTEIN L30 (BL27
sequence        MAKLEITLKRSVIGRPEDQRVTVRTLGLKKTNQTVVHEDNAAIRGMINKVSHLVSVKEQ
end_sequence
begin_sequence
title           >gi|132679|sp|P19946|RL15_BACSU 50S RIBOSOMAL PROTEIN L15
sequence        MKLHELKPSEGSRKTRNRVGRGIGSGNGKTAGKGHKGQNARSGGGVRPGFEGGQMPLFQRLPK
sequence        RKEYAVVNLDKLNGFAEGTEVTPELLLETGVISKLNAGVKILGNGKLEKKLTVKANKFSASAK
sequence        GTAEVI
end_sequence
[...]
(I cut the lines shorter so they'd look nicer in my editor).

The FASTA scanner generated the following events: title, sequence, begin_sequence, and end_sequence. Note that the begin_sequence and end_sequence events are not associated with any line in the original input. They are used to delineate separate sequences within the file.

The events a scanner can send must be specifically defined for each data format.

9.3.3  `noevent' EVENT

A data file can contain lines that have no meaningful information, such as blank lines. By convention, a scanner should generate the "noevent" event for these lines.

9.3.4  Scanners

class Scanner:
    def feed(self, handle, consumer):
        # Implementation
Scanners should implement a method named 'feed' that takes a file handle and a consumer. The scanner should read data from the file handle and generate appropriate events for the consumer.

9.3.5  Consumers

class Consumer:
    # event handlers
Consumers contain methods that handle events. The name of the method is the event that it handles. Info events are passed the line of the data containing the information, and section events are passed nothing.

You are free to ignore events that are not interesting for your application. You should just not implement methods for those events.

All consumers should be derived from the base Consumer class.

An example:
class FASTAConsumer(Consumer):
    def title(self, line):
        # do something with the title
    def sequence(self, line):
        # do something with the sequence
    def begin_sequence(self):
        # a new sequence starts
    def end_sequence(self):
        # a sequence ends

9.3.6  BLAST

BLAST Scanners produce the following events:
header
    version
    reference
    query_info
    database_info

descriptions
    description_header
    round                         psi blast
    model_sequences               psi blast
    nonmodel_sequences            psi blast
    converged                     psi blast
    description
    no_hits

alignment
    multalign                     master-slave
    title                         pairwise
    length                        pairwise
  hsp
    score                         pairwise
    identities                    pairwise
    strand                        pairwise, blastn
    frame                         pairwise, blastx, tblastn, tblastx
    query                         pairwise
    align                         pairwise
    sbjct                         pairwise

database_report
    database
    posted_date
    num_letters_in_database
    num_sequences_in_database
    num_letters_searched          RESERVED.  Currently unused.  I've never
    num_sequences_searched        RESERVED.  seen it, but it's in blastool.c..
    ka_params
    gapped                        not blastp
    ka_params_gap                 gapped mode (not tblastx)

parameters
    matrix
    gap_penalties                 gapped mode (not tblastx)
    num_hits                     
    num_sequences                
    num_extends                  
    num_good_extends             
    num_seqs_better_e
    hsps_no_gap                   gapped (not tblastx) and not blastn
    hsps_prelim_gapped            gapped (not tblastx) and not blastn
    hsps_prelim_gap_attempted     gapped (not tblastx) and not blastn
    hsps_gapped                   gapped (not tblastx) and not blastn
    query_length
    database_length
    effective_hsp_length
    effective_query_length
    effective_database_length
    effective_search_space
    effective_search_space_used
    frameshift                    blastx or tblastn or tblastx
    threshold
    window_size
    dropoff_1st_pass
    gap_x_dropoff
    gap_x_dropoff_final           gapped (not tblastx) and not blastn
    gap_trigger
    blast_cutoff

9.3.7  Enzyme

The Enzyme.py module works with the enzyme.dat file included with the Enzyme distribution. The Enzyme Scanner produces the following events:
record
    identification
    description
    alternate_name
    catalytic_activity
    cofactor
    comment
    disease
    prosite_reference
    databank_reference
    terminator

9.3.8  KEGG

9.3.8.1  Bio.KEGG.Enzyme

The Bio.KEGG.Enzyme module works with the 'enzyme' file from the Ligand database, which can be obtained from the KEGG project. (http://www.genome.ad.jp/kegg).

The Bio.KEGG.Enzyme.Record contains all the information stored in a KEGG/Enzyme record. Its string representation also is a valid KEGG record, but it is NOT guaranteed to be equivalent to the record from which it was produced.

The Bio.KEGG.Enzyme.Scanner produces the following events:
entry
name
classname
sysname
reaction
substrate
product
inhibitor
cofactor
effector
comment
pathway_db
pathway_id
pathway_desc
organism
gene_id
disease_db
disease_id
disease_desc
motif_db
motif_id
motif
structure_db
structure_id
dblinks_db
dblinks_id
record_end

9.3.8.2  Bio.KEGG.Compound

The Bio.KEGG.Compound module works with the 'compound' file from the Ligand database, which can be obtained from the KEGG project. (http://www.genome.ad.jp/kegg).

The Bio.KEGG.Compound.Record contains all the information stored in a KEGG/Compound record. Its string representation also is a valid KEGG record, but it is NOT guaranteed to be equivalent to the record from which it was produced.

The Bio.KEGG.Enzyme.Scanner produces the following events:
entry
name
formula
pathway_db
pathway_id
pathway_desc
enzyme_id
enzyme_role
structure_db
structure_id
dblinks_db
dblinks_id
record_end

9.3.9  Fasta

The Fasta.py module works with FASTA-formatted sequence data. The Fasta Scanner produces the following events:
sequence
    title
    sequence

9.3.10  Medline

The Online Services Reference Manual documents the MEDLINE format at: http://www.nlm.nih.gov/pubs/osrm_nlm.html

The Medline scanner produces the following events:
record
    undefined
    abstract_author
    abstract
    address
    author
    call_number
    comments
    class_update_date
    country
    entry_date
    publication_date
    english_abstract
    entry_month
    gene_symbol
    identification
    issue_part_supplement
    issn
    journal_title_code
    language
    special_list
    last_revision_date
    mesh_heading
    mesh_tree_number
    major_revision_date
    no_author
    substance_name
    pagination
    personal_name_as_subject
    publication_type
    number_of_references
    cas_registry_number
    record_originator
    journal_subset
    subheadings
    secondary_source_id
    source
    title_abbreviation
    title
    transliterated_title
    unique_identifier
    volume_issue
    year
    pubmed_id
undefined is a special event that is called for every line with a qualifier not defined in the specification.

9.3.11  Prosite

The Prosite scanner produces the following events:
copyrights
    copyright
record
    identification
    accession
    date
    description
    pattern
    matrix
    rule
    numerical_results
    comment
    database_reference
    pdb_reference
    documentation
    terminator
The PRODOC scanner produces the following events:
record
    accession
    prosite_reference
    text
    reference

9.3.12  SWISS-PROT

The SProt.py module works with the sprotXX.dat file included with the SwissProt distribution. The SProt Scanner produces the following events:
record
    identification
    accession
    date
    description
    gene_name
    organism_species
    organelle
    organism_classification
    reference_number
    reference_position
    reference_comment
    reference_cross_reference
    reference_author
    reference_title
    reference_location
    comment
    database_cross_reference
    keyword
    feature_table
    sequence_header
    sequence_data
    terminator
The KeyWList.py modules works with the keywlist.txt file included with the SwissProt distribution. The KeyWList scanner produces the following events:
header
keywords
    keyword
footer
    copyright

9.3.13  NBRF

The NBRF module works with NBRF-formatted sequence data. Data is available at: http://www-nbrf.georgetown.edu/pirwww/pirhome.shtml.

The NBRF Scanner produces the following events:
    sequence_type
    sequence_name
    comment
    sequence

9.3.14  Ndb

The Ndb module works with Ndb-formatted sequence data. Data is available at: http://ndbserver.rutgers.edu/NDB/NDBATLAS/index.html.

The Ndb record contains the following items:
        Id
        Features
        Name
        Sequence
        Citation
        Space Group
        Cell Constants
        Crystallization Conditions
        Refinement
        Coordinates
Sequence is an instance of Crystal which is dictionary of Chain objects. Each chain is a sequence of PDB hetero items. Citation is a list of Reference objects. Crystal, Reference, Chain and Hetero are part of the biopython distribution.

9.3.15  MetaTool

The MetaTool parser works with MetaTool output files. MetaTool implements algorithms to decompose a biochemical pathway into a combination of simpler networks that are more accessible to analysis.

The MetaTool web page is http://pinguin.biologie.uni-jena.de/bioinformatik/networks/.

The MetaTool parser requires Numeric Python. Information is available at http://numpy.scipy.org/#older_array.

The Bio.MetaTool.Scanner produces the following events:
input_file_name
num_int_metabolites
num_reactions
metabolite_line
unbalanced_metabolite
num_rows
num_cols
irreversible_vector
branch_metabolite
non_branch_metabolite
stoichiometric_tag
kernel_tag
subsets_tag
reduced_system_tag
convex_basis_tag
conservation_relations_tag
elementary_modes_tag
reaction
enzyme
matrix_row
sum_is_constant_line
end_stochiometric
end_kernel
end_subsets
end_reduced_system
end_convex_basis
end_conservation_relations
end_elementary_modes

9.4  Substitution Matrices

9.4.1  SubsMat

This module provides a class and a few routines for generating substitution matrices, similar to BLOSUM or PAM matrices, but based on user-provided data.

Additionally, you may select a matrix from MatrixInfo.py, a collection of established substitution matrices.
class SeqMat(UserDict.UserDict)
  1. Attributes
    1. self.data: a dictionary in the form of {(i1,j1):n1, (i1,j2):n2,...,(ik,jk):nk} where i, j are alphabet letters, and n is a value.

    2. self.alphabet: a class as defined in Bio.Alphabet

    3. self.ab_list: a list of the alphabet's letters, sorted. Needed mainly for internal purposes

    4. self.sum_letters: a dictionary. {i1: s1, i2: s2,...,in:sn} where:
      1. i: an alphabet letter;
      2. s: sum of all values in a half-matrix for that letter;
      3. n: number of letters in alphabet.


  2. Methods
    1. __init__(self,data=None,alphabet=None,
               mat_type=NOTYPE,mat_name='',build_later=0):
      
      1. data: can be either a dictionary, or another SeqMat instance.
      2. alphabet: a Bio.Alphabet instance. If not provided, construct an alphabet from data.

      3. mat_type: type of matrix generated. One of the following:
        NOTYPE
        No type defined
        ACCREP
        Accepted Replacements Matrix
        OBSFREQ
        Observed Frequency Matrix
        EXPFREQ
        Expsected Frequency Matrix
        SUBS
        Substitution Matrix
        LO
        Log Odds Matrix


        mat_type is provided automatically by some of SubsMat's functions.

      4. mat_name: matrix name, such as "BLOSUM62" or "PAM250"

      5. build_later: default false. If true, user may supply only alphabet and empty dictionary, if intending to build the matrix later. this skips the sanity check of alphabet size vs. matrix size.


    2. entropy(self,obs_freq_mat)
      
      1. obs_freq_mat: an observed frequency matrix. Returns the matrix's entropy, based on the frequency in obs_freq_mat. The matrix instance should be LO or SUBS.


    3. letter_sum(self,letter)
      
      Returns the sum of all values in the matrix, for the provided letter

    4. all_letters_sum(self)
      
      Fills the dictionary attribute self.sum_letters with the sum of values for each letter in the matrix's alphabet.

    5. print_mat(self,f,format="%4d",bottomformat="%4s",alphabet=None)
      
      prints the matrix to file handle f. format is the format field for the matrix values; bottomformat is the format field for the bottom row, containing matrix letters. Example output for a 3-letter alphabet matrix:
      A 23
      B 12 34
      C 7  22  27
        A   B   C
      
      The alphabet optional argument is a string of all characters in the alphabet. If supplied, the order of letters along the axes is taken from the string, rather than by alphabetical order.


  3. Usage

    The following section is layed out in the order by which most people wish to generate a log-odds matrix. Of course, interim matrices can be generated and investigated. Most people just want a log-odds matrix, that's all.
    1. Generating an Accepted Replacement Matrix

      Initially, you should generate an accepted replacement matrix (ARM) from your data. The values in ARM are the counted number of replacements according to your data. The data could be a set of pairs or multiple alignments. So for instance if Alanine was replaced by Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding ARM entries would be:
      ('A','C'): 10, ('C','A'): 12
      
      as order doesn't matter, user can already provide only one entry:
      ('A','C'): 22
      
      A SeqMat instance may be initialized with either a full (first method of counting: 10, 12) or half (the latter method, 22) matrices. A full protein alphabet matrix would be of the size 20x20 = 400. A half matrix of that alphabet would be 20x20/2 + 20/2 = 210. That is because same-letter entries don't change. (The matrix diagonal). Given an alphabet size of N:
      1. Full matrix size:N*N

      2. Half matrix size: N(N+1)/2

      The SeqMat constructor automatically generates a half-matrix, if a full matrix is passed. If a half matrix is passed, letters in the key should be provided in alphabetical order: ('A','C') and not ('C',A').

      At this point, if all you wish to do is generate a log-odds matrix, please go to the section titled Example of Use. The following text describes the nitty-gritty of internal functions, to be used by people who wish to investigate their nucleotide/amino-acid frequency data more thoroughly.

    2. Generating the observed frequency matrix (OFM)

      Use:
      OFM = SubsMat._build_obs_freq_mat(ARM)
      
      The OFM is generated from the ARM, only instead of replacement counts, it contains replacement frequencies.

    3. Generating an expected frequency matrix (EFM)

      Use:
      EFM = SubsMat._build_exp_freq_mat(OFM,exp_freq_table)
      
      1. exp_freq_table: should be a FreqTable instance. See section 9.4.2 for detailed information on FreqTable. Briefly, the expected frequency table has the frequencies of appearance for each member of the alphabet. It is implemented as a dictionary with the alphabet letters as keys, and each letter's frequency as a value. Values sum to 1.

      The expected frequency table can (and generally should) be generated from the observed frequency matrix. So in most cases you will generate exp_freq_table using:
      >>> exp_freq_table = SubsMat._exp_freq_table_from_obs_freq(OFM)
      >>> EFM = SubsMat._build_exp_freq_mat(OFM,exp_freq_table)
      
      But you can supply your own exp_freq_table, if you wish

    4. Generating a substitution frequency matrix (SFM)

      Use:
      SFM = SubsMat._build_subs_mat(OFM,EFM)
      
      Accepts an OFM, EFM. Provides the division product of the corresponding values.

    5. Generating a log-odds matrix (LOM)

      Use:
      LOM=SubsMat._build_log_odds_mat(SFM[,logbase=10,factor=10.0,round_digit=1])
      
      1. Accepts an SFM.

      2. logbase: base of the logarithm used to generate the log-odds values.

      3. factor: factor used to multiply the log-odds values. Each entry is generated by log(LOM[key])*factor And rounded to the round_digit place after the decimal point, if required.


  4. Example of use

    As most people would want to generate a log-odds matrix, with minimum hassle, SubsMat provides one function which does it all:
    make_log_odds_matrix(acc_rep_mat,exp_freq_table=None,logbase=10,
                          factor=10.0,round_digit=0):
    
    1. acc_rep_mat: user provided accepted replacements matrix
    2. exp_freq_table: expected frequencies table. Used if provided, if not, generated from the acc_rep_mat.
    3. logbase: base of logarithm for the log-odds matrix. Default base 10.
    4. round_digit: number after decimal digit to which result should be rounded. Default zero.

9.4.2  FreqTable

FreqTable.FreqTable(UserDict.UserDict)
  1. Attributes:
    1. alphabet: A Bio.Alphabet instance.
    2. data: frequency dictionary
    3. count: count dictionary (in case counts are provided).


  2. Functions:
    1. read_count(f): read a count file from stream f. Then convert to frequencies
    2. read_freq(f): read a frequency data file from stream f. Of course, we then don't have the counts, but it is usually the letter frquencies which are interesting.


  3. Example of use: The expected count of the residues in the database is sitting in a file, whitespace delimited, in the following format (example given for a 3-letter alphabet):
    A   35
    B   65
    C   100
    
    And will be read using the FreqTable.read_count(file_handle) function.

    An equivalent frequency file:
    A  0.175
    B  0.325
    C  0.5
    
    Conversely, the residue frequencies or counts can be passed as a dictionary. Example of a count dictionary (3-letter alphabet):
    {'A': 35, 'B': 65, 'C': 100}
    
    Which means that an expected data count would give a 0.5 frequency for 'C', a 0.325 probability of 'B' and a 0.175 probability of 'A' out of 200 total, sum of A, B and C)

    A frequency dictionary for the same data would be:
    {'A': 0.175, 'B': 0.325, 'C': 0.5}
    
    Summing up to 1.

    When passing a dictionary as an argument, you should indicate whether it is a count or a frequency dictionary. Therefore the FreqTable class constructor requires two arguments: the dictionary itself, and FreqTable.COUNT or FreqTable.FREQ indicating counts or frequencies, respectively.

    Read expected counts. readCount will already generate the frequencies Any one of the following may be done to geerate the frequency table (ftab):
    >>> from SubsMat import *
    >>> ftab = FreqTable.FreqTable(my_frequency_dictionary,FreqTable.FREQ)
    >>> ftab = FreqTable.FreqTable(my_count_dictionary,FreqTable.COUNT)
    >>> ftab = FreqTable.read_count(open('myCountFile'))
    >>> ftab = FreqTable.read_frequency(open('myFrequencyFile'))
    

Previous Up Next