This is a re-implementation of an existing pipeline developed originally by the core and web teams. The new version uses eHive, so familiarity with this system is essential, and it has been written to use as little memory as possible.
This is the way we retrieve the database connections to work with. The registry file should specify the connection details for the databases to dump from.
Here is an example of a registry file for v67 of Ensembl. Note the use of the Registry object within the registry file and the scoping of the package. If you omit the `-db_version` parameter and only use HEAD checkouts of Ensembl, then the latest version of the API will be selected automatically. Any change to the version here must be reflected in the configuration file.
```perl
package Reg;

use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::Registry->no_version_check(1);
Bio::EnsEMBL::Registry->no_cache_warnings(1);

{
  my $version = 67;
  Bio::EnsEMBL::Registry->load_registry_from_multiple_dbs(
    {
      -host       => "mydb-1",
      -port       => 3306,
      -db_version => $version,
      -user       => "user",
      -NO_CACHE   => 1,
    },
    {
      -host       => "mydb-2",
      -port       => 3306,
      -db_version => $version,
      -user       => "user",
      -NO_CACHE   => 1,
    },
  );
}

1;
```
You give the registry file to the `init_pipeline.pl` script via the `-registry` option.
If you have a number of parameters which do not change between releases, we recommend creating a configuration file which inherits from the root config file, e.g.
```perl
package MyCnf;

use base qw/Bio::EnsEMBL::Pipeline::PipeConfig::Flatfile_conf/;

sub default_options {
  my ($self) = @_;
  return {
    %{ $self->SUPER::default_options() },
    # Override options here
  };
}

1;
```
If you do override the config, use your package name in place of the config module in the upcoming example commands.
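For instance, assuming your override is the hypothetical MyCnf package sketched above and it is visible on PERL5LIB, initialisation would look something like:

```bash
init_pipeline.pl MyCnf \
  -pipeline_db -host=my-db-host -base_path /path/to/dumps -registry reg.pm
```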
ENSEMBL_CVS_ROOT_DIR should be set to the base checkout of Ensembl. We should be able to append ensembl-hive/sql onto this path to find the SQL directory for hive, e.g.
```bash
export ENSEMBL_CVS_ROOT_DIR=$HOME/work/ensembl-checkouts
```
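As a quick sanity check (a bash sketch, assuming the checkout layout described above), confirm that the hive SQL directory resolves under this root:

```bash
# Lists the hive schema files if the checkout root is set correctly
ls "$ENSEMBL_CVS_ROOT_DIR/ensembl-hive/sql"
```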
ENSADMIN_PSW gives the password to use to log into a database server, e.g.
```bash
export ENSADMIN_PSW=wibble
```
Where *Multiple Supported* is Yes, the user may specify the parameter more than once on the command line. For example, `-species` is one of these options, e.g.
```
-species human -species cele -species yeast
```
| Name | Type | Multiple Supported | Description | Default | Required |
| --- | --- | --- | --- | --- | --- |
| `-registry` | String | No | Location of the Ensembl registry to use with this pipeline | - | YES |
| `-base_path` | String | No | Location of the dumps | - | YES |
| `-pipeline_db -host=` | String | No | Specify a host for the hive database e.g. `-pipeline_db -host=myserver.mysql` | See hive generic config | YES |
| `-pipeline_db -dbname=` | String | No | Specify a different database to use as the hive DB e.g. `-pipeline_db -dbname=my_dumps_test` | Uses pipeline name by default | NO |
| `-species` | String | Yes | Specify one or more species to process. The pipeline will only consider these species | - | NO |
| `-types` | String | Yes | Specify each type of dump you want to produce. Supported values are `embl` and `genbank` | All | NO |
| `-db_types` | String | Yes | The database types to use. Supports the normal db adaptor groups e.g. `core`, `otherfeatures` etc. | `core` | NO |
| `-pipeline_name` | String | No | Name to use for the pipeline | flatfile_dump_$release | NO |
| `-email` | String | No | Email to send pipeline summaries to upon its successful completion | $USER@sanger.ac.uk | NO |
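To initialise the pipeline for all species and all dump types: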
```bash
init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::Flatfile_conf \
  -pipeline_db -host=my-db-host -base_path /path/to/dumps -registry reg.pm
```
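To restrict the run to a subset of species: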
```bash
init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::Flatfile_conf \
  -pipeline_db -host=my-db-host -species anolis -species celegans -species human \
  -base_path /path/to/dumps -registry reg.pm
```
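To produce only one type of dump (here embl):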
```bash
init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::Flatfile_conf \
  -pipeline_db -host=my-db-host -type embl \
  -base_path /path/to/dumps -registry reg.pm
```
Run the beekeeper under nohup (or inside a screen session). Use an `init_pipeline.pl` configuration from above, making sure to give it the `-base_path` parameter, and sync the database using one of the commands displayed by `init_pipeline.pl`. Then run the pipeline in a loop with a good sleep between submissions, redirecting the log output (the following assumes you are using bash):

* `> my_run.log` sends the output to this file; use `tail -f` to track the pipeline
* `2>&1` is important as this clobbers STDERR into STDOUT, and it must come after the file redirection so errors also end up in the log

```bash
nohup beekeeper.pl -url mysql://usr:pass@server:port/db -reg_conf reg.pm -loop -sleep 5 > my_run.log 2>&1 &
```
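For reference, the sync step mentioned above is the standard hive sync; a sketch, substituting your own hive URL:

```bash
# Refresh the hive's internal job and worker counts before looping the beekeeper
beekeeper.pl -url mysql://usr:pass@server:port/db -reg_conf reg.pm -sync
```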
Hive gives us the ability to run any Process outside of a database pipeline using `standaloneJob.pl`. We will list some useful commands to run below.
```bash
standaloneJob.pl Bio::EnsEMBL::Pipeline::Flatfile::DumpFile \
  -reg_conf reg.pm -debug 2 \
  -release 67 -species homo_sapiens -type embl \
  -base_path /path/to/dumps
```
Another pipeline is provided which can verify the files produced by this pipeline. Nothing more than a basic prodding of file contents is attempted.
The code works with a SQLite database, so you do not need a MySQL database to schedule these jobs. You will have to schedule two pipelines: one to work with embl and another to work with genbank.
The pipeline searches for all files matching the pattern `*.dat.gz`.
```bash
init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FlatfileChecker_conf \
  -base_path /path/to/embl/dumps -type embl

init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FlatfileChecker_conf \
  -base_path /path/to/genbank/dumps -type genbank
```
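Because the checker can run on SQLite, the URL handed to `beekeeper.pl` is a file path rather than a server. A hypothetical example (the database file name depends on your pipeline name):

```bash
beekeeper.pl -url sqlite:///path/to/flatfile_checker_hive.db -loop
```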
You can run this module without a pipeline if you need to check a single
file.
```bash
standaloneJob.pl Bio::EnsEMBL::Pipeline::Flatfile::CheckFlatfile \
  -file /path/to/embl/dumps/homo_sapiens/Homo_sapiens.chromosome.1.dat.gz \
  -type embl
```