==========================================================================================
 COMMANDLINE BLAST
==========================================================================================

First! We need some data. Let’s grab the mouse and zebrafish RefSeq protein data sets from NCBI, and put them in our home directory. If you’ve just logged in, you should be there already, but if you’re unsure, just run cd and hit enter. Now, we’ll use curl to download the files:


curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.1.protein.faa.gz
curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.2.protein.faa.gz
curl -O ftp://ftp.ncbi.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.1.protein.faa.gz

All four of the files are FASTA protein files (that’s what the .faa suggests) that are compressed with gzip (that’s what the .gz means).

Uncompress them:

gunzip *.faa.gz

and let’s look at the first few sequences in the file:

head mouse.1.protein.faa 

These are protein sequences in FASTA format. FASTA format is something many of you have probably seen in one form or another – it’s pretty ubiquitous. It’s a text file, containing records; each record starts with a line beginning with a ‘>’, and then contains one or more lines of sequence text.

Let’s take those first two sequences and save them to a file. We’ll do this using output redirection with ‘>’, which says “take all the output and put it into this file here.”


head -11 mouse.1.protein.faa > mm-first.fa


So now, for example, you can do cat mm-first.fa to see the contents of that file (or less mm-first.fa).
Now let’s BLAST these two sequences against the entire zebrafish protein data set. First, we need to tell BLAST that the zebrafish sequences are (a) a database, and (b) a protein database. That’s done by calling ‘makeblastdb’:


makeblastdb -in zebrafish.1.protein.faa -dbtype prot


Next, we call BLAST to do the search:


blastp -query mm-first.fa -db zebrafish.1.protein.faa


This should run pretty quickly, but you’re going to get a lot of output!! To save it to a file instead of watching it go past on the screen, ask BLAST to save the output to a file that we’ll name mm-first.x.zebrafish.txt:


blastp -query mm-first.fa -db zebrafish.1.protein.faa -out mm-first.x.zebrafish.txt


and then you can ‘page’ through this file at your leisure by typing:


less mm-first.x.zebrafish.txt
(Type spacebar to move down, and ‘q’ to get out of paging mode.)


Let’s do some more sequences (this one will take a little longer to run):


head -500 mouse.1.protein.faa > mm-second.fa
blastp -query mm-second.fa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.txt

will compare the first 83 sequences. You can look at the output file with:

less mm-second.x.zebrafish.txt


Output in tabular format:


blastp -query mm-first.fa -db zebrafish.1.protein.faa -outfmt "7 qacc sacc evalue qstart qend sstart send" -out mm-first.x.zebrafish.TBL.txt

Now, try to run:
perl -e '$name_col=0;$score_col=1; while(<>) {s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) {push @names, $n}; if (! exists($max{$n}) || $s > $max{$n}) {$max{$n} = $s; $best{$n} = ()}; if ($s == $max{$n}) {$best{$n} .= "$_\n"};} for $n (@names) {print $best{$n}}' mm-first.x.zebrafish.TBL.txt

Answer the following questions:

- What is doing the above command?
- It is correct or you see some logical problem (mtivate your answer)? In this case how can we fix it?