h1. FASTA Pipeline
This is a re-implementation of an existing pipeline originally developed by the core and web teams. The new version runs on eHive, so familiarity with that system is essential. It has also been written to use as little memory as possible.
h2. The Registry File
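The registry file is how the pipeline retrieves the database connections it works with. As the example below shows, it should specify the database servers to load from, plus adaptors for the website and production databases.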
Here is an example of a file for v67 of Ensembl. Note the use of the Registry object within a registry file and the scoping of the package. If you omit the @-db_version@ parameter and only use HEAD checkouts of Ensembl, the registry will automatically select the latest version of the API. Any change to the version here must be reflected in the configuration file.
bc.
package Reg;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
Bio::EnsEMBL::Registry->no_version_check(1);
Bio::EnsEMBL::Registry->no_cache_warnings(1);
{
  my $version = 67;
  Bio::EnsEMBL::Registry->load_registry_from_multiple_dbs(
    {
      -host => "mydb-1",
      -port => 3306,
      -db_version => $version,
      -user => "user",
      -NO_CACHE => 1,
    },
    {
      -host => "mydb-2",
      -port => 3306,
      -db_version => $version,
      -user => "user",
      -NO_CACHE => 1,
    },
  );
  Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -HOST => 'mydb-2',
    -PORT => 3306,
    -USER => 'user',
    -DBNAME => 'ensembl_website',
    -SPECIES => 'multi',
    -GROUP => 'web'
  );
  Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -HOST => 'mydb-2',
    -PORT => 3306,
    -USER => 'user',
    -DBNAME => 'ensembl_production',
    -SPECIES => 'multi',
    -GROUP => 'production'
  );
}
1;
Pass the registry file to the @init_pipeline.pl@ script via the @-registry@ option.
h2. Overriding Defaults Using a New Config File
If a number of your parameters do not change between releases, we recommend creating a configuration file which inherits from the root config file, e.g.
bc. package MyCnf;
use base qw/Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf/;
sub default_options {
  my ($self) = @_;
  return {
    %{ $self->SUPER::default_options() },
    # Override options here
  };
}
1;
If you do override the config then you should use the package name for your overridden config in the upcoming example commands.
h2. Environment
h3. PERL5LIB
h3. PATH
h3. ENSEMBL_CVS_ROOT_DIR
Set this to the base checkout of Ensembl. Appending @ensembl-hive/sql@ to this path must give the SQL directory for hive, e.g.
bc. export ENSEMBL_CVS_ROOT_DIR=$HOME/work/ensembl-checkouts
h3. ENSADMIN_PSW
Set this to the password used to log into the database server, e.g.
bc. export ENSADMIN_PSW=wibble
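The two variables above can be checked before initialising the pipeline. The following is a minimal sketch in POSIX shell. The function name @check_fasta_env@ is hypothetical and not part of the pipeline, and the sketch assumes that a missing @ensembl-hive/sql@ directory or an empty password should be treated as an error.

```shell
# Hypothetical helper (not part of the pipeline): returns 0 when the
# environment described above looks usable, 1 otherwise.
check_fasta_env() {
  rc=0
  if [ -z "${ENSEMBL_CVS_ROOT_DIR:-}" ]; then
    echo "ENSEMBL_CVS_ROOT_DIR is not set" >&2
    rc=1
  elif [ ! -d "$ENSEMBL_CVS_ROOT_DIR/ensembl-hive/sql" ]; then
    # The hive SQL directory must sit under the base checkout
    echo "No ensembl-hive/sql directory under $ENSEMBL_CVS_ROOT_DIR" >&2
    rc=1
  fi
  if [ -z "${ENSADMIN_PSW:-}" ]; then
    echo "ENSADMIN_PSW is not set" >&2
    rc=1
  fi
  return "$rc"
}
```

Calling @check_fasta_env || exit 1@ at the top of a wrapper script stops early with a message on standard error instead of failing part-way through pipeline setup.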
h2. Command Line Arguments
Where the Multiple Supported column says Yes, the parameter may be given more than once on the command line. @-species@ is one such option, e.g.
bc. -species human -species cele -species yeast
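When the list of values comes from a script, the repeated flags can be generated rather than typed. This is a small sketch in POSIX shell. The helper name @repeat_flag@ is hypothetical, and it assumes the values contain no whitespace, which holds for species names and aliases such as those above.

```shell
# Hypothetical helper: print "<flag> <value>" once per value, space separated.
# $1 is the flag name, the remaining arguments are the values.
repeat_flag() {
  flag=$1
  shift
  out=""
  for value in "$@"; do
    # Add a separating space only once something has been collected
    out="${out:+$out }$flag $value"
  done
  printf '%s\n' "$out"
}

repeat_flag -species human cele yeast
# prints: -species human -species cele -species yeast
```

The result can be spliced into a command line with an unquoted command substitution, for example @$(repeat_flag -species human cele yeast)@.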
|_. Name|_. Type|_. Multiple Supported|_. Description|_. Default|_. Required|
|@-registry@|String|No|Location of the Ensembl registry to use with this pipeline|-|YES|
|@-base_path@|String|No|Location of the dumps|-|YES|
|@-pipeline_db -host=@|String|No|Specify a host for the hive database e.g. @-pipeline_db -host=myserver.mysql@|See hive generic config|YES|
|@-pipeline_db -dbname=@|String|No|Specify a different database to use as the hive DB e.g. @-pipeline_db -dbname=my_dumps_test@|Uses pipeline name by default|NO|
|@-ftp_dir@|String|No|Location of the current FTP directory with the previous release’s files. We will use this to copy DNA files from one release to another. If not given we do not do any reuse|-|NO|
|@-run_all@|Boolean|No|Ignores any kind of reuse and forces the dump of all DNAs|-|NO|
|@-species@|String|Yes|Specify one or more species to process. Pipeline will only consider these species. Use @-force_species@ if you want to force a species run|-|NO|
|@-force_species@|String|Yes|Specify one or more species to force through the pipeline. This is useful to force a dump when you know reuse will not do the “right thing”|-|NO|
|@-dump_types@|String|Yes|Specify each type of dump you want to produce. Supported values are dna, cdna and ncrna|All|NO|
|@-db_types@|String|Yes|The database types to use. Supports the normal db adaptor groups e.g. core, otherfeatures etc.|core|NO|
|@-process_logic_names@|String|Yes|Provide a set of logic names whose models should be dumped|-|NO|
|@-skip_logic_names@|String|Yes|Provide a set of logic names to skip when creating dumps. These are evaluated after @-process_logic_names@|core|NO|
|@-release@|Integer|No|The release to dump|Software version|NO|
|@-previous_release@|Integer|No|The previous release to use. Used to calculate reuse|Software version minus 1|NO|
|@-blast_servers@|String|Yes|The servers to copy blast indexes to|-|NO|
|@-blast_genomic_dir@|String|No|Location to copy the DNA blast indexes to|-|NO|
|@-blast_genes_dir@|String|No|Location to copy the DNA gene (cdna, ncrna and protein) indexes to|-|NO|
|@-scp_user@|String|No|User to perform the SCP as. Defaults to the current user|Current user|NO|
|@-scp_identity@|String|No|The SSH identity file to use when performing SCPs. Normally used in conjunction with @-scp_user@|-|NO|
|@-no_scp@|Boolean|No|Skip SCP altogether|0|NO|
|@-pipeline_name@|String|No|Name to use for the pipeline|fasta_dump_$release|NO|
|@-wublast_exe@|String|No|Location of the WUBlast indexing binary|xdformat|NO|
|@-blat_exe@|String|No|Location of the Blat indexing binary|faToTwoBit|NO|
|@-port_offset@|Integer|No|The offset of the ports to use when generating blat indexes. This figure is added onto the web database species ID|30000|NO|
|@-email@|String|No|Email to send pipeline summaries to upon its successful completion|$USER@sanger.ac.uk|NO|
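Several defaults in the table are derived from other values: the previous release is the release minus 1, the pipeline name embeds the release, and the blat port is the port offset plus the web database species ID. The arithmetic is sketched below for release 67. The species ID of 1 is a made-up value used purely for illustration, and the variable names are not pipeline parameters.

```shell
# Worked example of the derived defaults described in the table.
release=67
previous_release=$((release - 1))      # "Software version minus 1"
pipeline_name="fasta_dump_${release}"  # default for -pipeline_name

port_offset=30000                      # default for -port_offset
species_id=1                           # hypothetical web database species ID
blat_port=$((port_offset + species_id))

echo "previous release: $previous_release"
echo "pipeline name:    $pipeline_name"
echo "blat port:        $blat_port"
```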
h2. Example Commands
h3. Normal usage:
bc. init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
  -pipeline_db -host=my-db-host -base_path /path/to/dumps -registry reg.pm
h3. Run a subset of species (no forcing & supports registry aliases):
bc. init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
  -pipeline_db -host=my-db-host -species anolis -species celegans -species human \
  -base_path /path/to/dumps -registry reg.pm
h3. Specifying species to force (supports all registry aliases):
bc. init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
  -pipeline_db -host=my-db-host -force_species anolis -force_species celegans -force_species human \
  -base_path /path/to/dumps -registry reg.pm
h3. Running & forcing a species:
bc. init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
  -pipeline_db -host=my-db-host -species celegans -force_species celegans \
  -base_path /path/to/dumps -registry reg.pm
h3. Running everything:
bc. init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
  -pipeline_db -host=my-db-host -run_all 1 \
  -base_path /path/to/dumps -registry reg.pm
h3. Dumping just gene data (no DNA or ncRNA):
bc. init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
  -pipeline_db -host=my-db-host -dump_types cdna \
  -base_path /path/to/dumps -registry reg.pm
h3. Using a different SCP user & identity:
bc. init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
  -pipeline_db -host=my-db-host -scp_user anotherusr -scp_identity /users/anotherusr/.pri/identity \
  -base_path /path/to/dumps -registry reg.pm
h2. Running the Pipeline
h2. But I Don’t Want a Pipeline
Hive gives us the ability to run any Process outside of a database pipeline run using @standaloneJob.pl@. Some useful commands are listed below.
h3. Running DNA Dumping
bc. standaloneJob.pl Bio::EnsEMBL::Pipeline::FASTA::DumpFile \
  -reg_conf reg.pm -debug 2 \
  -release 67 -species homo_sapiens -sequence_type_list '["dna"]' \
  -base_path /path/to/dumps
h3. Running Gene Dumping
bc. standaloneJob.pl Bio::EnsEMBL::Pipeline::FASTA::DumpFile \
  -reg_conf reg.pm -debug 2 \
  -release 67 -species homo_sapiens -sequence_type_list '["cdna"]' \
  -base_path /path/to/dumps
h3. Running ncRNA Dumping
bc. standaloneJob.pl Bio::EnsEMBL::Pipeline::FASTA::DumpFile \
  -reg_conf reg.pm -debug 2 \
  -release 67 -species homo_sapiens -sequence_type_list '["ncrna"]' \
  -base_path /path/to/dumps
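The standalone commands in this section differ only in the value given to @-sequence_type_list@, so they can be generated in a loop. The sketch below only prints the commands (a dry run) so that the quoting can be checked before anything is executed. The helper name @standalone_dump_cmd@ is hypothetical, and the registry file and base path are the placeholder values used throughout this document.

```shell
# Hypothetical helper: print one standaloneJob.pl command line.
# $1 = sequence type, $2 = species, $3 = release
standalone_dump_cmd() {
  printf "standaloneJob.pl Bio::EnsEMBL::Pipeline::FASTA::DumpFile -reg_conf reg.pm -debug 2 -release %s -species %s -sequence_type_list '[\"%s\"]' -base_path /path/to/dumps\n" "$3" "$2" "$1"
}

# Dry run: one command per sequence type
for seq_type in dna cdna ncrna; do
  standalone_dump_cmd "$seq_type" homo_sapiens 67
done
```

Each printed line can be pasted into a shell as it stands. The single quotes around the list keep the shell from stripping the inner double quotes.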