This document describes the tables that make up the Ensembl Compara schema. The tables are grouped into categories, and the purpose of each table is explained. Several examples are also given to help readers familiarise themselves with the schema.
This table contains all taxa used in this database, mirroring the data and tree structure of the NCBI Taxonomy database (for more details, see ensembl-compara/script/taxonomy/README-taxonomy, which explains our import process).
Column | Type | Default value | Description | Index
taxon_id | int(10) | - | The NCBI Taxonomy ID | primary key
parent_id | int(10) | - | The parent taxonomy ID for this node (refers to ncbi_taxa_node.taxon_id) | key: parent_id
rank | char(32) | '' | E.g. kingdom, family, genus, etc. | key: rank
genbank_hidden_flag | tinyint(1) | 0 | Boolean value which defines whether this rank is used or not in the abbreviated lineage | -
left_index | int(10) | 0 | Sub-set left index. All sub-nodes have left_index and right_index values larger than this left_index | key: left_index
right_index | int(10) | 0 | Sub-set right index. All sub-nodes have left_index and right_index values smaller than this right_index | key: right_index
root_id | int(10) | 1 | The root taxonomy ID for this node (refers to ncbi_taxa_node.taxon_id) | -
Example:
This example shows how to get the lineage for Homo sapiens:
SELECT * FROM ncbi_taxa_node WHERE left_index <= 339687 AND right_index >= 339690 ORDER BY left_index;
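The lineage query above relies on the nested-set property of left_index and right_index: every ancestor's interval encloses the interval of the node. The following is a minimal sketch of that lookup using Python's built-in sqlite3 module against a toy in-memory table. The taxon IDs, ranks and index values are invented for illustration, and demo_connection and lineage are hypothetical helper names, not part of the Compara API.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory stand-in for ncbi_taxa_node (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ncbi_taxa_node ("
        "taxon_id INTEGER PRIMARY KEY, parent_id INTEGER, rank TEXT, "
        "left_index INTEGER, right_index INTEGER)"
    )
    conn.executemany(
        "INSERT INTO ncbi_taxa_node VALUES (?, ?, ?, ?, ?)",
        [
            (1, 0, "root", 1, 10),
            (2, 1, "family", 2, 7),
            (3, 2, "genus", 3, 6),
            (4, 3, "species", 4, 5),
            (5, 1, "family", 8, 9),
        ],
    )
    return conn


def lineage(conn, taxon_id):
    """Return (taxon_id, rank) rows from the root down to taxon_id."""
    row = conn.execute(
        "SELECT left_index, right_index FROM ncbi_taxa_node WHERE taxon_id = ?",
        (taxon_id,),
    ).fetchone()
    if row is None:
        return []
    left, right = row
    # Ancestors (and the node itself) are the rows whose interval
    # encloses [left, right]; ordering by left_index lists the root first.
    return conn.execute(
        "SELECT taxon_id, rank FROM ncbi_taxa_node "
        "WHERE left_index <= ? AND right_index >= ? ORDER BY left_index",
        (left, right),
    ).fetchall()


if __name__ == "__main__":
    print(lineage(demo_connection(), 4))
```

Looking up the node's own indices first avoids hard-coding them in the query, which is the only difference from the SQL example above.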
This table contains groups (sets) of species, which are used in the method_link_species_set table. Each species_set is a set of genome_db objects.
Column | Type | Default value | Description | Index
species_set_id | int(10) | - | Internal (non-unique) ID for the table | unique: key
genome_db_id | int(10) | NULL | External reference to genome_db_id in the genome_db table | unique: key
Example:
This query shows the first 10 species_sets that contain human:
SELECT species_set_id, GROUP_CONCAT(name) AS species FROM species_set JOIN genome_db USING(genome_db_id) GROUP BY species_set_id HAVING species LIKE '%homo_sapiens%' ORDER BY species_set_id LIMIT 10;
This table contains descriptive tags for the species_set_ids in the species_set table. It is used to store options on clades and groups of species. It was initially developed for the gene tree view.
Column | Type | Default value | Description | Index
species_set_id | int(10) | - | External reference to species_set_id in the species_set table | unique key: tag_species_set_id
tag | varchar(50) | - | Tag name | unique key: tag_species_set_id
value | mediumtext | - | Tag value | -
Example:
This query retrieves all the species_sets tagged as 'primates' and links to the genome_db table to retrieve the species names
SELECT species_set_id, name, tag, value FROM species_set JOIN species_set_tag USING(species_set_id) JOIN genome_db USING(genome_db_id) WHERE value = 'primates';
This table specifies which kinds of link can exist between entities in Compara (DNA/DNA alignments, synteny regions, homologous gene pairs, etc.). NOTE: We use method_link_ids between 1 and 100 for DNA-DNA alignments, between 101 and 200 for genomic syntenies, between 201 and 300 for protein homologies, between 301 and 400 for protein families, and between 401 and 500 for protein and ncRNA trees. Each category corresponds to data stored in different tables.
Column | Type | Default value | Description | Index
method_link_id | int(10) | - | Internal unique ID | primary key
type | varchar(50) | '' | The common name of the linking method between species | unique key: type
class | varchar(50) | '' | Description of type of data associated with the "type" field and the main table to find these data | -
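The method_link_id ranges given in the note for this table can be turned into a small lookup. This is only a sketch of the numbering convention as stated in this document; the function name and category labels are illustrative and do not come from the Compara code base.

```python
# Ranges as stated in the NOTE for the method_link table (inclusive bounds).
METHOD_LINK_RANGES = [
    (1, 100, "DNA-DNA alignment"),
    (101, 200, "genomic synteny"),
    (201, 300, "protein homology"),
    (301, 400, "protein family"),
    (401, 500, "protein or ncRNA tree"),
]


def method_link_category(method_link_id):
    """Return the category label for a method_link_id, or None if out of range."""
    for low, high, label in METHOD_LINK_RANGES:
        if low <= method_link_id <= high:
            return label
    return None


if __name__ == "__main__":
    for example_id in (13, 150, 201, 450):
        print(example_id, method_link_category(example_id))
```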
This table contains information about the comparisons stored in the database. A given method_link_species_set_id exists for each comparison made and relates a method_link_id in method_link to a set of species (species_set_id) in the species_set table.
Column | Type | Default value | Description | Index
method_link_species_set_id | int(10) | - | Internal unique ID | primary key
method_link_id | int(10) | - | External reference to method_link_id in the method_link table | unique key: method_link_id
species_set_id | int(10) | 0 | External reference to species_set_id in the species_set table | unique key: method_link_id
name | varchar(255) | '' | Human-readable description for this method_link_species_set | -
source | varchar(255) | 'ensembl' | Source of the data. Currently either "ensembl" or "ucsc" if data were imported from UCSC | -
url | varchar(255) | '' | A URL where you can find the original data if they were imported | -
Example:
This query shows all the EPO alignments in this database:
SELECT * FROM method_link_species_set WHERE method_link_id = 13;
This table defines the genomic sequences used in the comparative genomics analysis. It is used by the genomic_align_block table to define aligned sequences. It is also used by the dnafrag_region table to define syntenic regions. NOTE: The index has genome_db_id in first place because, unless fetching all dnafrags or fetching by dnafrag_id, genome_db_id always appears in the WHERE clause. The unique key ensures that Bio::EnsEMBL::Compara::DBSQL::DnaFragAdaptor->fetch_by_GenomeDB_and_name always fetches a single row. This is possible in the EnsEMBL Compara DB because we store top-level dnafrags only.
Column | Type | Default value | Description | Index
dnafrag_id | bigint | - | Internal unique ID | primary key
length | int(11) | 0 | The total length of the dnafrag | -
name | varchar(40) | '' | Name of the DNA sequence (e.g., the name of the chromosome) | unique: name
genome_db_id | int(10) | - | External reference to genome_db_id in the genome_db table | unique: name
coord_system_name | varchar(40) | NULL | Refers to the coord system in which this dnafrag has been defined | -
is_reference | tinyint(1) | 1 | Boolean, whether the dnafrag is reference (1) or non-reference (0), e.g. a haplotype | -
Example:
This query shows chromosome 14 of the human genome (genome_db.genome_db_id = 90 refers to the human genome in this example), which is 107349540 nucleotides long.
SELECT dnafrag.* FROM dnafrag LEFT JOIN genome_db USING (genome_db_id) WHERE dnafrag.name = "14" AND genome_db.name = "homo_sapiens";
This table contains the genomic regions corresponding to every synteny relationship found. There are two genomic regions for every synteny relationship.
Column | Type | Default value | Description | Index
synteny_region_id | int(10) | 0 | External reference to synteny_region_id in the synteny_region table | key: synteny; key: synteny_reversed
dnafrag_id | bigint | 0 | External reference to dnafrag_id in the dnafrag table | key: synteny; key: synteny_reversed
dnafrag_start | int(10) | 0 | Position of the first nucleotide from this dnafrag which is in synteny | -
dnafrag_end | int(10) | 0 | Position of the last nucleotide from this dnafrag which is in synteny | -
dnafrag_strand | tinyint(4) | 0 | Strand of this region | -
Example 1:
Example of dnafrag_region query
SELECT * FROM dnafrag_region WHERE synteny_region_id = 34965;
When joining to dnafrag and genome_db tables we get more comprehensive information:
SELECT genome_db.name, dnafrag.name, dnafrag_start, dnafrag_end, dnafrag_strand FROM dnafrag_region LEFT JOIN dnafrag USING (dnafrag_id) LEFT JOIN genome_db USING (genome_db_id) WHERE synteny_region_id = 34965;
This table is the key table for the genomic alignments. The software used to align the genomic blocks is referenced through an external key to the method_link table. The actual aligned sequences, however, are defined in the genomic_align table. Tree alignments (EPO alignments) are best accessed through the genomic_align_tree table, although the alignments are also indexed in this table. This allows the user to also access the tree alignments as normal multiple alignments. NOTE: All queries in the API use the primary key, as rows are always fetched using the genomic_align_block_id. The key 'method_link_species_set_id' is used by MART when fetching all the genomic_align_blocks corresponding to a given method_link_species_set_id.
Used for pairwise comparison. Defines the percentage of identity between both sequences
length | int(10) | - | Total length of the alignment | -
group_id | bigint | NULL | Used to group alignments | -
level_id | tinyint(2) | 0 | Level of orthologous layer. 1 corresponds to the first layer of orthologous sequences found; 2 and over are additional layers. Used for building the syntenies (based on level_id = 1 only) | -
Example:
The following query refers to a primates EPO alignment:
SELECT * FROM genomic_align_block WHERE genomic_align_block_id = 5480000000010;
This table is used to index tree alignments, e.g. EPO alignments. These alignments include inferred ancestral sequences. The tree required to index these sequences is stored in this table. This table stores the structure of the tree. Each node links to an entry in the genomic_align_group table, which links to one or several entries in the genomic_align table. NOTE: left_index and right_index are used to speed up fetching trees from the database. Any given node has a left_index larger than the left_index of its parent node and a right_index smaller than the right_index of its parent node. In other words, all descendant nodes of a given node can be obtained by fetching all the nodes with a left_index (or right_index, or both) between the left_index and the right_index of that node.
Column | Type | Default value | Description | Index
node_id | bigint(20) | - | Internal unique ID | primary key
parent_id | bigint(20) | 0 | Link to the parent node | key: parent_id
root_id | bigint(20) | 0 | Link to root node | key: root_id; key: left_index
left_index | int(10) | 0 | Internal index. See above | key: left_index
right_index | int(10) | 0 | Internal index. See above | -
left_node_id | bigint(10) | 0 | Link to the node on the left side of this node | -
right_node_id | bigint(10) | 0 | Link to the node on the right side of this node | -
distance_to_parent | double | 1 | Phylogenetic distance between this node and its parent | -
Example 1:
The following query corresponds to the root of a tree, because parent_id = 0 and root_id = node_id
SELECT * FROM genomic_align_tree WHERE node_id = root_id LIMIT 1;
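To make the left_index/right_index property described for this table concrete, the sketch below numbers a small tree with a depth-first traversal and then finds descendants with a pure interval test. The traversal is a common way to produce such indices and is an assumption here; this document does not say how Compara assigns them. The node names and both function names are made up for illustration.

```python
def assign_indices(children, root):
    """Number nodes depth-first; return {node: (left_index, right_index)}."""
    indices = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter  # assigned on the way down
        for child in children.get(node, []):
            visit(child)
        counter += 1  # right index assigned after all children are done
        indices[node] = (left, counter)

    visit(root)
    return indices


def descendants(indices, node):
    """Return all nodes whose interval lies strictly inside node's interval."""
    left, right = indices[node]
    inside = [n for n, (l, r) in indices.items() if left < l and r < right]
    return sorted(inside, key=lambda n: indices[n][0])


if __name__ == "__main__":
    # Hypothetical tree: root -> (anc1 -> (A, B), C)
    tree = {"root": ["anc1", "C"], "anc1": ["A", "B"]}
    idx = assign_indices(tree, "root")
    print(idx)
    print(descendants(idx, "anc1"))
```

The point of the design is that fetching a whole subtree needs no recursion at query time: one range comparison on the indices is enough.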
This table contains the coordinates and all the information needed to rebuild genomic alignments. Every entry corresponds to one of the aligned sequences. It also contains an external key to the method_link_species_set, which refers to the software and set of species used to obtain the corresponding alignment. The aligned sequence is defined by an external reference to the dnafrag table, the starting and ending position within this dnafrag, the strand and a cigar_line. The original aligned sequence is not stored, but it can be retrieved using the cigar_line field and the original sequence. The cigar line defines the sequence of matches/mismatches and deletions (or gaps). For example, the cigar line 2MD3M2D2M means that the alignment contains 2 matches/mismatches, 1 deletion (the number 1 is omitted to save space), 3 matches/mismatches, 2 deletions and 2 matches/mismatches. If the original sequence is:
Original sequence: AACGCTT
The aligned sequence will be:
cigar line: 2MD3M2D2M
MMDMMMDDMM
AA-CGC--TT
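The reconstruction described above can be sketched in a few lines of Python. This handles only the two operations this document describes (M for match/mismatch, D for a gap) and treats an omitted count as 1; apply_cigar is a hypothetical helper name, not a function of the Compara API.

```python
import re


def apply_cigar(sequence, cigar_line):
    """Insert gaps into an ungapped sequence as described by an M/D cigar line."""
    if not re.fullmatch(r"(\d*[MD])*", cigar_line):
        raise ValueError("unsupported cigar line: %r" % cigar_line)
    aligned = []
    pos = 0
    for count, op in re.findall(r"(\d*)([MD])", cigar_line):
        n = int(count) if count else 1  # an omitted count means 1
        if op == "M":
            # Match/mismatch: copy n residues from the original sequence.
            aligned.append(sequence[pos:pos + n])
            pos += n
        else:
            # Deletion: emit n gap characters, consuming no residues.
            aligned.append("-" * n)
    if pos != len(sequence):
        raise ValueError("cigar line does not match the sequence length")
    return "".join(aligned)


if __name__ == "__main__":
    print(apply_cigar("AACGCTT", "2MD3M2D2M"))  # prints AA-CGC--TT
```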
Column | Type | Default value | Description | Index
genomic_align_id | bigint | - | Unique internal ID | primary key
genomic_align_block_id | bigint | - | External reference to genomic_align_block_id in the genomic_align_block table | key: genomic_align_block_id
method_link_species_set_id | int(10) | 0 | External reference to method_link_species_set_id in the method_link_species_set table. This information is redundant because it also appears in the genomic_align_block table but it is used to speed up the queries | key: method_link_species_set_id; key: dnafrag
dnafrag_id | bigint | 0 | External reference to dnafrag_id in the dnafrag table | key: dnafrag
dnafrag_start | int(10) | 0 | Starting position within the dnafrag defined by dnafrag_id | key: dnafrag
dnafrag_end | int(10) | 0 | Ending position within the dnafrag defined by dnafrag_id | key: dnafrag
dnafrag_strand | tinyint(4) | 0 | Strand in the dnafrag defined by dnafrag_id | -
cigar_line | mediumtext | - | Internal description of the aligned sequence | -
visible | tinyint(2) | 1 | Used in self alignments to ensure only one Bio::EnsEMBL::Compara::GenomicAlignBlock is visible when you have more than 1 block covering the same region | -
Here is a better way to get this by joining the dnafrag and genome_db tables:
SELECT genome_db.name, dnafrag.name, dnafrag_start, dnafrag_end, dnafrag_strand str, cigar_line FROM genomic_align LEFT JOIN dnafrag USING (dnafrag_id) LEFT JOIN genome_db USING (genome_db_id) WHERE genomic_align_block_id = 5480000000010;
This table contains conservation scores calculated from the whole-genome multiple alignments stored in the genomic_align_block table. Several scores are stored per row. expected_score and diff_score are binary columns and you need to use the Perl API to access these data.
Column | Type | Default value | Description | Index
genomic_align_block_id | bigint | - | External reference to genomic_align_block_id in the genomic_align_block table | key: genomic_align_block_id, window_size
window_size | smallint | - | The scores are stored at different resolution levels. This column defines the window size used to calculate the average score | key: genomic_align_block_id, window_size
position | int | - | Position of the first score (in alignment coordinates) | -
expected_score | blob | - | Expected score. The observed score can be determined using the diff_score and the expected_score | -
diff_score | blob | - | The difference between the expected and observed variation, i.e. the conservation score | -
There are 2 other elements in the same constrained_element:
SELECT constrained_element_id, genome_db.name, dnafrag.name FROM constrained_element JOIN dnafrag USING (dnafrag_id) JOIN genome_db USING (genome_db_id) WHERE constrained_element_id = 5290000000001;
This table stores cross-references for member sequences derived from the core databases. It is used by Bio::EnsEMBL::Compara::DBSQL::XrefMemberAdaptor and provides the data used in highlighting gene trees by GO and InterPro annotation.
Column | Type | Default value | Description | Index
member_id | int(10) | - | External reference to member_id in the member table. Indicates the member to which the xref applies. | primary key
dbprimary_acc | varchar(10) | - | Accession of xref (e.g. GO term, InterPro accession) | primary key
external_db_id | int(10) | - | External reference to external_db_id in the external_db table. Indicates to which external database the xref belongs. | -
This table stores the raw local alignment results of peptide to peptide alignments returned by a BLAST run. The hits are actually stored in species-specific tables rather than in a single table. For example, human has the genome_db_id 90, and all the hits that have a human gene as a query are stored in peptide_align_feature_90.
Column | Type | Default value | Description | Index
peptide_align_feature_id | bigint | - | Internal unique ID | primary key
qmember_id | int(10) | - | External reference to member_id in the member table for the query peptide | -
hmember_id | int(10) | - | External reference to member_id in the member table for the hit peptide | -
qgenome_db_id | int(10) | - | External reference to genome_db_id in the genome_db table for the query peptide (for query optimization) | -
hgenome_db_id | int(10) | - | External reference to genome_db_id in the genome_db table for the hit peptide (for query optimization) | -
qstart | int(10) | 0 | Starting position in the query peptide sequence | -
qend | int(10) | 0 | Ending position in the query peptide sequence | -
hstart | int(11) | 0 | Starting position in the hit peptide sequence | -
hend | int(11) | 0 | Ending position in the hit peptide sequence | -
score | double(16,4) | 0.0000 | Blast score for this HSP | -
evalue | double | - | Blast evalue for this HSP | -
align_length | int(10) | - | Alignment length of HSP | -
identical_matches | int(10) | - | Blast HSP match score | -
perc_ident | int(10) | - | Percent identical matches in the HSP length | -
positive_matches | int(10) | - | Blast HSP positive score | -
perc_pos | int(10) | - | Percent positive matches in the HSP length | -
hit_rank | int(10) | - | Rank in blast result | -
cigar_line | mediumtext | - | Cigar string coding the actual alignment | -
Example 1:
Example of peptide_align_feature entry:
SELECT * FROM peptide_align_feature_90 WHERE peptide_align_feature_id = 9000000001;
The following query corresponds to a particular hit found between a Homo sapiens protein and an Anolis carolinensis protein:
SELECT g1.name as qgenome, m1.stable_id as qstable_id, g2.name as hgenome, m2.stable_id as hstable_id, score, evalue FROM peptide_align_feature_90 LEFT JOIN member m1 ON (qmember_id = m1.member_id) LEFT JOIN member m2 ON (hmember_id = m2.member_id) LEFT JOIN genome_db g1 ON (qgenome_db_id = g1.genome_db_id) LEFT JOIN genome_db g2 ON (hgenome_db_id = g2.genome_db_id) WHERE peptide_align_feature_id = 9000000001;
This table contains the proteins corresponding to each protein family relationship found. There are several family_member entries for each family entry.
Column | Type | Default value | Description | Index
family_id | int(10) | - | External reference to family_id in the family table | primary key; key: family_id
member_id | int(10) | - | External reference to the member_id in the member table | primary key; key: member_id
cigar_line | mediumtext | - | Internal description of the multiple alignment (see the description in the homology_member table) | -
Example:
The following query retrieves the members of one protein family (family_id 29739 in this example). The proteins can be retrieved using the member_ids. The multiple alignment can be restored using the cigar_lines.
SELECT * FROM family_member WHERE family_id = 29739;
This table holds the gene tree data structure, such as the root, the relations between parents and children, the leaves, etc. In our data structure, all the trees of a given clusterset are arbitrarily connected to the same root. This makes it easier to store and query, in the same database, the data from independent tree-building analyses. Hence the "biological roots" of the trees are the children nodes of the main clusterset root. See the examples below.
Column | Type | Default value | Description | Index
node_id | int(10) | - | Internal unique ID | primary key
parent_id | int(10) | - | Link to the parent node | key: parent_id
root_id | int(10) | - | Link to the root node | key: root_id; key: root_id_left_index
left_index | int(10) | 0 | Internal index. See above | key: root_id_left_index
right_index | int(10) | 0 | Internal index. See above | -
distance_to_parent | double | 1.0 | Phylogenetic distance between this node and its parent | -
member_id | int(10) | - | External reference to member_id in the member table to allow linkage from trees to peptides/transcripts. | key: member_id
Example:
The following query returns the root nodes of the independent protein trees stored in the database:
SELECT gtn.node_id FROM gene_tree_node gtn LEFT JOIN gene_tree_root gtr ON (gtn.parent_id = gtr.root_id) WHERE gtr.tree_type = 'clusterset' AND gtr.member_type = 'protein' LIMIT 10;
Header table for gene_trees. The database can contain several sets of trees computed on the same genes. We call these analyses "clustersets" and they can be distinguished with the clusterset_id field. Traditionally, the Compara databases contained only one clusterset (clusterset_id=1), but since release 66 we have at least 2 (one for protein trees and one for ncRNA trees). See the examples below.
Column | Type | Default value | Description | Index
root_id | INT(10) | - | Internal unique ID | primary key
member_type | ENUM('protein', 'ncrna') | - | The type of members used in the tree | -
tree_type | ENUM('clusterset', 'supertree', 'tree') | - | The type of the tree | key: tree_type
clusterset_id | VARCHAR(20) | 'default' | Name for the set of clusters/trees | -
method_link_species_set_id | INT(10) | - | External reference to method_link_species_set_id in the method_link_species_set table | -
gene_align_id | INT(10) | - | External reference to gene_align_id in the gene_align table | -
ref_root_id | INT(10) | - | External reference to default (merged) root_id for this tree | key: ref_root_id
stable_id | VARCHAR(40) | - | Unique, stable ID for the tree (follows the pattern: label(5).release_introduced(4).unique_id(10)) | unique: key
version | INT | - | Version of the stable ID (changes only when members move to/from existing trees) | -
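The stable_id pattern given for this table, label(5).release_introduced(4).unique_id(10), can be illustrated with a small parser. The sketch assumes the numbers in parentheses are fixed character widths and that the dots in the pattern are only separators in the description, not characters in the ID; both are assumptions, not confirmed by this document. The example ID is invented, and split_tree_stable_id is a hypothetical name.

```python
def split_tree_stable_id(stable_id):
    """Split a tree stable_id into (label, release_introduced, unique_id).

    Assumes fixed widths of 5, 4 and 10 characters (19 in total).
    """
    if len(stable_id) != 19:
        raise ValueError("expected 19 characters, got %d" % len(stable_id))
    label = stable_id[:5]
    release = stable_id[5:9]
    unique = stable_id[9:]
    if not (release.isdigit() and unique.isdigit()):
        raise ValueError("release and unique_id parts must be numeric")
    return label, int(release), int(unique)


if __name__ == "__main__":
    # Made-up identifier, used only to show the slicing.
    print(split_tree_stable_id("ENSGT00660000000123"))
```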
Example 1:
The following query retrieves the root entries of all the current clustersets:
SELECT * FROM gene_tree_root WHERE tree_type = 'clusterset';
This table contains all the genomic homologies. There are two homology_member entries for each homology entry for now, but both the schema and the API can handle more than just pairwise relationships. dN, dS, N, S and lnL are statistical values given by the codeml program of the Phylogenetic Analysis by Maximum Likelihood (PAML) package.
The following query lists the names of the species that participate in this particular homology entry:
SELECT homology_id, description, GROUP_CONCAT(genome_db.name) AS species FROM homology LEFT JOIN method_link_species_set USING (method_link_species_set_id) LEFT JOIN species_set USING (species_set_id) LEFT JOIN genome_db USING(genome_db_id) WHERE homology_id = 100000001 GROUP BY homology_id;
This table contains the sequences corresponding to every genomic homology relationship found. There are two homology_member entries for each pairwise homology entry. As written in the homology table section, both schema and API can deal with more than pairwise relationships. The original alignment is not stored but it can be retrieved using the cigar_line field and the original sequences. The cigar line defines the sequence of matches or mismatches and deletions in the alignment.
First peptide sequence: SERCQVVVISIGPISVLSMILDFY
Second peptide sequence: SDRCQVLVISILSMIGLDFY
First corresponding cigar line: 20MD4M
Second corresponding cigar line: 11M5D9M
The alignment will be:
Example of alignment reconstruction
First peptide cigar line:  MMMMMMMMMMMMMMMMMMMMDMMMM
First aligned peptide:     SERCQVVVISIGPISVLSMI-LDFY
Second aligned peptide:    SDRCQVLVISI-----LSMIGLDFY
Second peptide cigar line: MMMMMMMMMMMDDDDDMMMMMMMMM
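Going the other way, a cigar line can be derived from an aligned (gapped) peptide by run-length encoding its residue and gap columns. The sketch below follows the convention used in this document (M for residues, D for gaps, a count of 1 omitted); cigar_from_aligned is a hypothetical helper name and is not part of the Compara API.

```python
from itertools import groupby


def cigar_from_aligned(aligned):
    """Run-length encode an aligned sequence into an M/D cigar line."""
    parts = []
    for is_gap, run in groupby(aligned, key=lambda ch: ch == "-"):
        n = len(list(run))
        op = "D" if is_gap else "M"
        # A run of length 1 is written without its count.
        parts.append(op if n == 1 else "%d%s" % (n, op))
    return "".join(parts)


if __name__ == "__main__":
    print(cigar_from_aligned("SERCQVVVISIGPISVLSMI-LDFY"))  # prints 20MD4M
    print(cigar_from_aligned("SDRCQVLVISI-----LSMIGLDFY"))  # prints 11M5D9M
```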
Column | Type | Default value | Description | Index
homology_id | int(10) | - | External reference to homology_id in the homology table | primary key; key: homology_id
member_id | int(10) | - | External reference to member_id in the member table. Refers to the corresponding "ENSEMBLGENE" entry | primary key; key: member_id
peptide_member_id | int(10) | - | External reference to member_id in the member table. Refers to the corresponding "ENSEMBLPEP" entry | key: peptide_member_id
cigar_line | mediumtext | - | An internal description of the alignment. It contains matches/mismatches (M) and deletions (D) and refers to the corresponding peptide_member_id sequence | -
perc_cov | int(10) | - | Defines the percentage of the peptide which has been aligned | -
perc_id | int(10) | - | Defines the percentage of identity between both homologues | -
perc_pos | int(10) | - | Defines the percentage of positivity (similarity) between both homologues | -
Example:
The following query refers to the two homologue sequences defined by the homology.homology_id 100000001. The gene and peptide sequence of the second homologue can be retrieved in the same way.
SELECT * FROM homology_member WHERE homology_id = 100000001;
This table contains one entry per stable_id mapping session (either for Families or for Protein Trees), which contains the type, the date of the mapping, and which releases were linked together. A single mapping_session is the event in which the mapping between two given releases for a particular class type ('family' or 'tree') is loaded. The whole event is treated as happening instantaneously at 'when_mapped' (used for sorting in historical order).
Column | Type | Default value | Description | Index
mapping_session_id | INT | - | Internal unique ID | primary key
type | ENUM('family', 'tree') | - | Type of stable_ids that were mapped during this session | unique: key
when_mapped | TIMESTAMP | CURRENT_TIMESTAMP | Normally, we use the date of creation of the mapping file being loaded. This prevents the date from changing even if we accidentally remove the entry and have to re-load it. | -
rel_from | INT | - | rel.number from which the stable_ids were mapped during this session. rel_from < rel_to | unique: key
rel_to | INT | - | rel.number to which the stable_ids were mapped during this session. rel_from < rel_to | -
This table keeps the history of stable_id changes from one release to another. The primary key 'object' describes a set of members migrating from stable_id_from to stable_id_to. Their volume (related to the 'shared_size' of the new class) is reflected by the fractional 'contribution' field. Since both stable_ids are listed in the primary key, they are not allowed to be NULL, so empty strings are treated as NULLs. If stable_id_from is empty, these members are newcomers in the new release. If stable_id_to is empty, these previously known members are disappearing in the new release. If neither stable_id_from nor stable_id_to is empty, these members are truly migrating.
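The three cases described for stable_id_from and stable_id_to can be written as a small classifier. This is only a sketch of the rule as stated in this document: the function name and the returned labels are illustrative, and rejecting the case where both values are empty is an assumption, since the text does not say what such a row would mean.

```python
def classify_history(stable_id_from, stable_id_to):
    """Classify a stable_id history row; empty strings stand in for NULL."""
    if not stable_id_from and not stable_id_to:
        # Not covered by the description above; treated as invalid here.
        raise ValueError("at least one stable_id must be non-empty")
    if not stable_id_from:
        return "new"  # members are newcomers in the new release
    if not stable_id_to:
        return "disappearing"  # members vanish in the new release
    return "migrating"  # members move between two known stable_ids


if __name__ == "__main__":
    # The identifiers below are placeholders, not real stable_ids.
    print(classify_history("", "ID_B"))
    print(classify_history("ID_A", ""))
    print(classify_history("ID_A", "ID_B"))
```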
This table contains site-wise omega values found in the multiple alignments underlying the protein trees.
Column | Type | Default value | Description | Index
sitewise_id | int(10) | - | Internal unique ID | primary key
aln_position | int(10) | - | The position in the whole GeneTree alignment, even if it is all_gaps in the subtree | unique: aln_position_node_id_ds
node_id | int(10) | - | The root of the subtree for which the sitewise is calculated | unique: aln_position_node_id_ds; key: node_id
tree_node_id | int(10) | - | The root of the tree. It will be equal to node_id if we are calculating sitewise for the whole tree | key: tree_node_id
omega | float(10,5) | - | The estimated omega value at the position | -
omega_lower | float(10,5) | - | The lower bound of the confidence interval | -
omega_upper | float(10,5) | - | The upper bound of the confidence interval | -
optimal | float(10,5) | - | optimal | -
ncod | int(10) | - | ncod | -
threshold_on_branch_ds | float(10,5) | - | The threshold used to break a tree into subtrees when the dS value of a given branch is too big. This is defined in the configuration file for the genetree pipeline | -