This tutorial is an introduction to the EnsEMBL Compara API. Knowledge of the EnsEMBL Core API and of the concepts and conventions in the EnsEMBL Core API tutorial is assumed. Documentation about the Compara database schema is available in ensembl-compara/docs/ from the EnsEMBL CVS repository. It is not necessary for this tutorial, but an understanding of the database tables may help, as many of the adaptor modules are table-specific.
Installing and updating the API works the same way as for the core API.
Starting from release 48, EnsEMBL has been running two public MySQL servers on host=ensembldb.ensembl.org. The server accessible on port=3306 and port=4306 hosts all databases prior to release 48, and the server on port=5306 hosts all newer databases starting from release 48.
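The port rule above can be captured in a small helper. This is only a sketch: the function name public_server_args is hypothetical, and the anonymous user is assumed to be the read-only account of the public server. The keys follow the -host, -user, -port naming used by the Registry loader.

```perl
use strict;
use warnings;

# Hypothetical helper: choose the public server port for a given release,
# following the rule described above (release 48 and later are on 5306).
sub public_server_args {
    my ($release) = @_;
    die "release must be a positive integer\n"
        unless defined $release && $release =~ /^[1-9]\d*$/;

    my $port = $release >= 48 ? 5306 : 3306;
    return (
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',    # assumption: read-only public account
        -port => $port,
    );
}

# Show the connection parameters for a release hosted on the newer server.
my %args = public_server_args(48);
print "$_ => $args{$_}\n" for sort keys %args;
```

The returned list can be passed straight to the Registry loader as named arguments once the API is installed.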
To use the auto-configuration feature, you first need to supply the connection parameters to the Registry loader. For instance, to connect to the public EnsEMBL databases, you can use the following command in your scripts:
[[INCLUDE::/info/docs/api/compara/tut_registry1.inc]]
This will initialize the Registry, from which you will be able to create object-specific adaptors later. Alternatively, you can use a shorter version based on a URL:
[[INCLUDE::/info/docs/api/compara/tut_registry2.inc]]
You will need to have a registry configuration file set up. By default, the Registry takes the file defined by the ENSEMBL_REGISTRY environment variable or, if that is not found, the file named .ensembl_init in your home directory. You can also use a specific file (see perldoc Bio::EnsEMBL::Registry or later in this document for some examples of how to use a different file). Please refer to the EnsEMBL registry documentation for details about this option.
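The default lookup order can be mirrored in a few lines of plain Perl, which is handy for checking which file a script would pick up. This is a sketch under stated assumptions: default_registry_file is a hypothetical name, and "not found" is taken to mean either that the variable is unset or that the file does not exist.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the default lookup described above:
#   1. the file named by the ENSEMBL_REGISTRY environment variable, if present;
#   2. otherwise .ensembl_init in the home directory, if present.
# Returns undef when neither file exists.
sub default_registry_file {
    my $from_env = $ENV{ENSEMBL_REGISTRY};
    return $from_env if defined $from_env && -e $from_env;

    my $home = $ENV{HOME};
    return undef unless defined $home;

    my $fallback = "$home/.ensembl_init";
    return -e $fallback ? $fallback : undef;
}

my $file = default_registry_file();
print defined $file ? "Registry file: $file\n" : "No registry file found\n";

# With the EnsEMBL API installed, the file can then be loaded with:
#   Bio::EnsEMBL::Registry->load_all($file);
```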
EnsEMBL Compara data, like core data, is stored in a MySQL relational database. To access a Compara database, you need to connect to it. This is done in exactly the same way as connecting to an EnsEMBL core database, but using a Compara-specific DBAdaptor. In addition to the parameters needed by the Registry, you have to supply the -dbname, which by convention contains the release number:
[[INCLUDE::/info/docs/api/compara/tut_registry3.inc]]
EnsEMBL Compara adaptors are used to fetch data from the database. Data are returned as EnsEMBL objects. For instance, the GenomeDBAdaptor returns Bio::EnsEMBL::Compara::GenomeDB objects.
Below is a non-exhaustive list of EnsEMBL Compara adaptors that are most often used:
Only some of these adaptors are illustrated in this tutorial, through commented Perl scripts.
You can get the adaptors from the Registry with the get_adaptor command. You need to specify three arguments: the species name, the type of database and the type of object. Therefore, in order to get the GenomeDBAdaptor for the Compara database, you will need the following command:
[[INCLUDE::/info/docs/api/compara/tut_genomedb1.inc]]
NB: As the EnsEMBL Compara DB is a multi-species database, the standard species name is 'Multi'. The type of the database is 'compara'.
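Because the species and database type are fixed for Compara, the three-argument call can be wrapped once and reused. A minimal sketch follows; compara_adaptor is a hypothetical helper, not part of the API, and it only relies on the get_adaptor call described above.

```perl
use strict;
use warnings;

# Hypothetical convenience wrapper around get_adaptor for Compara objects.
# $registry is the Registry class name (or any object offering get_adaptor);
# the species is always 'Multi' and the database type always 'compara'.
sub compara_adaptor {
    my ($registry, $object_type) = @_;
    my $adaptor = $registry->get_adaptor('Multi', 'compara', $object_type);
    die "No Compara adaptor found for '$object_type'\n" unless defined $adaptor;
    return $adaptor;
}

# Typical use once the Registry has been loaded (requires the EnsEMBL API):
#   my $genome_db_adaptor = compara_adaptor('Bio::EnsEMBL::Registry', 'GenomeDB');
#   my $homology_adaptor  = compara_adaptor('Bio::EnsEMBL::Registry', 'Homology');
```

Failing early when no adaptor comes back makes a wrong object type or an unloaded Registry easier to spot than a later "method called on undefined value" error.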
Refer to the EnsEMBL core tutorial for a good description of the coding conventions normally used in EnsEMBL.
We can divide the fetching methods of the ObjectAdaptors into two categories: the fetch_by and the fetch_all_by methods. The former return a single object, while the latter return a reference to an array of objects.
[[INCLUDE::/info/docs/api/compara/tut_genomedb2.inc]]
[[INCLUDE::/info/docs/api/compara/tut_genomedb3.inc]]
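The practical difference between the two categories is how the result is handled: a single object is used directly, while an array reference must be dereferenced before looping. The sketch below assumes $genome_db_adaptor is a GenomeDBAdaptor obtained from the Registry; describe_genomes is a hypothetical name, and fetch_all is used here as an example of a method returning an array reference.

```perl
use strict;
use warnings;

# Sketch: returns a reference to an array of text lines describing the
# genomes known to a GenomeDBAdaptor, using both fetching styles.
sub describe_genomes {
    my ($genome_db_adaptor, $species_name, $assembly) = @_;
    my @lines;

    # fetch_by_*: the result is one object, used directly
    my $genome_db = $genome_db_adaptor->fetch_by_name_assembly($species_name, $assembly);
    push @lines, 'Query genome: ' . $genome_db->name . ' (' . $genome_db->assembly . ')'
        if defined $genome_db;

    # fetch_all*: the result is a reference to an array, so dereference it
    my $all_genome_dbs = $genome_db_adaptor->fetch_all;
    foreach my $this_genome_db (@{$all_genome_dbs}) {
        push @lines, $this_genome_db->name . "\t" . $this_genome_db->assembly;
    }
    return \@lines;
}

# Typical use (requires the EnsEMBL API and a loaded Registry):
#   print "$_\n" for @{ describe_genomes($genome_db_adaptor, 'homo_sapiens') };
```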
The Compara database contains a number of different types of whole genome alignments. A list of these types can be found in the method_link section of the ensembl-compara/docs/schema_doc.html document.
GenomicAlignBlocks are the preferred way to store and fetch genomic alignments. A GenomicAlignBlock contains several GenomicAlign objects. Every GenomicAlign object corresponds to a piece of genomic sequence aligned with the other GenomicAlign objects in the same GenomicAlignBlock. A GenomicAlign object is always related to other GenomicAlign objects, and this relation is defined through the GenomicAlignBlock object. Therefore the usual way to fetch genomic alignments is by fetching GenomicAlignBlock objects. We have to start by getting the corresponding adaptor:
[[INCLUDE::/info/docs/api/compara/tut_align1.inc]]
In order to fetch the right alignments we need to specify two things: the type of alignment and the piece of genomic sequence in which we are looking for alignments. The type of alignment is a little more tricky: you need to specify both the alignment method and the set of genomes. To simplify this task, you can use the Bio::EnsEMBL::Compara::MethodLinkSpeciesSet object. The best way to use these objects is to fetch them from the database:
[[INCLUDE::/info/docs/api/compara/tut_align2.inc]]
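A MethodLinkSpeciesSet combines the two parts of the alignment type, the method and the set of genomes, in one object. The sketch below assumes $mlss_adaptor is a MethodLinkSpeciesSetAdaptor obtained from the Registry and uses its fetch_by_method_link_type_registry_aliases method; fetch_mlss is a hypothetical wrapper, and the 'LASTZ_NET' type and the species aliases in the usage comment are illustrative values that may not exist in every release.

```perl
use strict;
use warnings;

# Sketch: fetch the MethodLinkSpeciesSet for one alignment method and a set
# of genomes given by their Registry aliases, failing loudly if it is absent.
sub fetch_mlss {
    my ($mlss_adaptor, $method_link_type, @species_aliases) = @_;
    my $mlss = $mlss_adaptor->fetch_by_method_link_type_registry_aliases(
        $method_link_type, \@species_aliases);
    die "No $method_link_type data for: @species_aliases\n" unless defined $mlss;
    return $mlss;
}

# Typical use (requires the EnsEMBL API and a loaded Registry):
#   my $mlss = fetch_mlss($mlss_adaptor, 'LASTZ_NET', 'human', 'mouse');
```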
There are two ways to fetch GenomicAlignBlocks. One uses Bio::EnsEMBL::Slice objects while the second one is based on Bio::EnsEMBL::Compara::DnaFrag objects for specifying the piece of genomic sequence in which we are looking for alignments.
[[INCLUDE::/info/docs/api/compara/tut_align3.inc]]
Here is an example script with all of this:
[[INCLUDE::/info/docs/api/compara/tut_align4.inc]]
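The Slice-based route can be summarised in one function that walks from the blocks down to the individual aligned pieces. This is a hedged sketch rather than the tutorial's own script: summarise_alignments is a hypothetical name, and $gab_adaptor, $mlss and $slice are assumed to be a GenomicAlignBlockAdaptor, a MethodLinkSpeciesSet and a Bio::EnsEMBL::Slice fetched beforehand.

```perl
use strict;
use warnings;

# Sketch: one text line per GenomicAlignBlock, listing the genome, sequence
# region and coordinates of every GenomicAlign it contains.
sub summarise_alignments {
    my ($gab_adaptor, $mlss, $slice) = @_;
    my @summary;

    my $blocks = $gab_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice);
    foreach my $block (@{$blocks}) {
        my @pieces;
        foreach my $genomic_align (@{ $block->get_all_GenomicAligns }) {
            push @pieces, sprintf('%s:%s:%d-%d',
                $genomic_align->genome_db->name,
                $genomic_align->dnafrag->name,
                $genomic_align->dnafrag_start,
                $genomic_align->dnafrag_end);
        }
        push @summary, join(' <=> ', @pieces);
    }
    return \@summary;
}

# Typical use (requires the EnsEMBL API and a loaded Registry):
#   print "$_\n" for @{ summarise_alignments($gab_adaptor, $mlss, $slice) };
```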
All the homologies and families refer to GeneMembers and SeqMembers. Homology objects store orthologous and paralogous relationships between members and Family objects are clusters of members.
A member represents either a gene (GeneMember) or a sequence-bearing locus, e.g. a protein or a transcript (SeqMember). Most of them are defined in the corresponding EnsEMBL core database. For instance, the sequence for the human gene ENSG00000004059 is stored in the human core database.
The fetch_by_source_stable_id method of the corresponding *MemberAdaptor takes two arguments. The first one is the source of the member and can be:
The second argument is the identifier for the member. Here is a simple example:
[[INCLUDE::/info/docs/api/compara/tut_member1.inc]]
The *Member objects have several attributes:
source_name and stable_id define this member.
chr_name, chr_start, chr_end, chr_strand locate this member on the genome but are only available for ENSEMBLGENE and ENSEMBLPEP.
taxon_id corresponds to the NCBI taxonomy identifier (see http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/ for more details).
taxon returns a Bio::EnsEMBL::Compara::NCBITaxon object. From this object you can get additional information about the species.
[[INCLUDE::/info/docs/api/compara/tut_member2.inc]]
In our example the species is human, so the output will look like this:
common_name: human
genus: Homo
species: sapiens
binomial: Homo sapiens
classification: sapiens Homo Hominidae Catarrhini Haplorrhini Primates Euarchontoglires Eutheria Mammalia Euteleostomi Vertebrata Craniata Chordata Metazoa Eukaryota
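The member attributes and the taxon information can be gathered into a short report. The sketch below assumes $member is a GeneMember or SeqMember fetched as shown earlier, and that the NCBITaxon accessors carry the same names as the labels in the output above (common_name, binomial); describe_member is a hypothetical name.

```perl
use strict;
use warnings;

# Sketch: build a small text report from a member and its taxon.
# The taxon accessor names are assumed to match the output labels above.
sub describe_member {
    my ($member) = @_;
    my $taxon = $member->taxon;
    return join("\n",
        'source_name: ' . $member->source_name,
        'stable_id: '   . $member->stable_id,
        'taxon_id: '    . $member->taxon_id,
        'common_name: ' . $taxon->common_name,
        'binomial: '    . $taxon->binomial,
    ) . "\n";
}

# Typical use (requires the EnsEMBL API and a loaded Registry):
#   print describe_member($member);
```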
A Homology object represents either an orthologous or a paralogous relationship between two members.
Typically you want to get homologies for a given gene. The HomologyAdaptor has a fetching method called fetch_all_by_Member(). You will need the GeneMember object for your query gene, so fetch the GeneMember first, as in this example:
[[INCLUDE::/info/docs/api/compara/tut_homology1.inc]]
Each homology relation has exactly two members, one of which is the member used as the query. The get_all_Members method returns an array of SeqMember objects. The SeqMember is actually an AlignedMember (for the underlying protein) and contains information about how this SeqMember has been aligned.
[[INCLUDE::/info/docs/api/compara/tut_homology2.inc]]
You can get the original alignment used to define a homology:
[[INCLUDE::/info/docs/api/compara/tut_homology3.inc]]
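Putting the homology steps together, the loop below lists every member of every homology found for a query gene. It is a sketch under assumptions: list_homologues is a hypothetical name, $homology_adaptor and $gene_member are assumed to have been fetched as described above, and description and perc_id are taken to be the accessors for the homology type and the percentage identity of the aligned member.

```perl
use strict;
use warnings;

# Sketch: one tab-separated row per member of each homology, giving the
# homology type, the member's stable ID and its percentage identity.
sub list_homologues {
    my ($homology_adaptor, $gene_member) = @_;
    my @rows;

    my $homologies = $homology_adaptor->fetch_all_by_Member($gene_member);
    foreach my $homology (@{$homologies}) {
        # Each homology holds two aligned members, the query included
        foreach my $member (@{ $homology->get_all_Members }) {
            push @rows, join("\t",
                $homology->description,
                $member->stable_id,
                $member->perc_id);
        }
    }
    return \@rows;
}

# Typical use (requires the EnsEMBL API and a loaded Registry):
#   print "$_\n" for @{ list_homologues($homology_adaptor, $gene_member) };
```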
Families are clusters of proteins including all the EnsEMBL proteins plus all the metazoan SwissProt and SP-Trembl entries. The Family object and its adaptor are very similar to the Homology ones.
[[INCLUDE::/info/docs/api/compara/tut_family1.inc]]
For additional information or help, mail the ensembl dev mailing list. You will need to subscribe to this mailing list to use it.