Running command-line BLAST
Installing BLAST on Ubuntu
2014-02-09 10:45:03 | Category: Linux/Ubuntu
Very easy!
sudo apt-get install blast2
This automatically installs the following programs:
bl2seq blast2 blastall blastall_old blastcl3 blastclust blastpgp
To see the parameters for a specific program, type its name in a terminal.
http://angus.readthedocs.io/en/2014/running-command-line-blast.html
Running command-line BLAST
The goal of this tutorial is to run you through a demonstration of the command line, which you may not have seen or used much before.
Prepare for this tutorial by working through “Start up an EC2 instance”, but follow the instructions in “Starting up a custom operating system” instead; use AMI ami-7606d01e.
All of the commands below can and should be copy/pasted rather than re-typed.
Note: on Windows using TeraTerm, you can select the commands in the Web browser, then go to TeraTerm and click your right mouse button to paste. On Mac OS X using Terminal, you can select the commands in the Web browser, use Command-C to copy, and then go to the terminal and use Command-V to paste.
Switching to root
Start by making sure you’re the superuser, root:
sudo bash
Updating the software on the machine
Copy and paste the following two commands:
apt-get update
apt-get -y install screen git curl gcc make g++ python-dev unzip \
default-jre pkg-config libncurses5-dev r-base-core \
r-cran-gplots python-matplotlib sysstat
(make sure to hit enter after the paste – sometimes the last line doesn’t paste completely.)
If you started up a custom operating system, then this should finish quickly; if instead you started up Ubuntu 14.04 blank, then this will take a minute or two.
Install BLAST
Here, we’re using curl to download the BLAST distribution from NCBI; then we’re using ‘tar’ to unpack it into the current directory; and then we’re copying the program files into the directory /usr/local/bin, where we can run them from anywhere.
cd /root
curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.26/blast-2.2.26-x64-linux.tar.gz
tar xzf blast-2.2.26-x64-linux.tar.gz
cp blast-2.2.26/bin/* /usr/local/bin
cp -r blast-2.2.26/data /usr/local/blast-data
OK – now you can run BLAST from anywhere!
Again, this is basically what “installing software” means – it just means copying around files so that they can be run, and (in some cases) setting up resources so that the software knows where specific data files are.
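A quick way to confirm that the copy step worked is to ask where each program would be found on your PATH. The following is a minimal sketch using Python's standard `shutil.which`; the helper names `find_programs` and `report` are made up for this illustration, and the program names you pass in are up to you (blastall and formatdb are the two used later in this tutorial).

```python
import shutil


def find_programs(names):
    """Map each program name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


def report(names):
    """Print one line per program showing where (or whether) it was found."""
    for name, path in find_programs(names).items():
        print(f"{name}: {path if path else 'not found on PATH'}")
```

Calling `report(["blastall", "formatdb"])` after the install should print a path under /usr/local/bin for each program if the files were copied as described.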
Running BLAST
Try typing:
blastall
You’ll get a long laundry list of output, with all sorts of options and arguments. Let’s play with some of them.
First! We need some data. Let’s grab the mouse and zebrafish RefSeq protein data sets from NCBI, and put them in /mnt, which is the scratch disk space for Amazon machines:
cd /mnt
curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.protein.faa.gz
curl -O ftp://ftp.ncbi.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.protein.faa.gz
If you look at the files in the current directory, you should see both files, along with a directory called lost+found which is for system information:
ls -l
should show you:
drwx------ 2 root root 16384 2013-01-08 00:14 lost+found
-rw-r--r-- 1 root root 9454271 2013-06-11 02:29 mouse.protein.faa.gz
-rw-r--r-- 1 root root 8958096 2013-06-11 02:29 zebrafish.protein.faa.gz
Both of these files are FASTA protein files (that’s what the .faa suggests) that are compressed by gzip (that’s what the .gz suggests).
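If you are curious what is inside a .gz file before uncompressing it, Python's standard `gzip` module can read it directly. This is only a sketch; the function name `peek_gzip` is an invention for this example, not part of the tutorial's tooling.

```python
import gzip
import itertools


def peek_gzip(path, n=5):
    """Return the first n lines of a gzip-compressed text file."""
    # "rt" opens the compressed file in text mode, decompressing on the fly
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]
```

For example, `peek_gzip("mouse.protein.faa.gz", 3)` would return the first three lines of the mouse file without changing the file on disk.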
Uncompress them
gunzip *.faa.gz
and let’s look at the first few sequences:
head -11 mouse.protein.faa
These are protein sequences in FASTA format. FASTA format is something many of you have probably seen in one form or another – it’s pretty ubiquitous. It’s just a text file, containing records; each record starts with a line beginning with a ‘>’, and then contains one or more lines of sequence text.
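The record structure just described (a ‘>’ header line followed by one or more sequence lines) can be read with a few lines of Python. This is a minimal sketch, not a full-featured parser; `parse_fasta` is a name chosen for this illustration.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # emit the final record once the input is exhausted
    if header is not None:
        yield header, "".join(chunks)
```

Because an open file is an iterable of lines, something like `for name, seq in parse_fasta(open("mouse.protein.faa")): ...` would walk through every record in the mouse file.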
Let’s take those first two sequences and save them to a file. We’ll do this using output redirection with ‘>’, which says “take all the output and put it into this file here.”
head -11 mouse.protein.faa > mm-first.fa
So now, for example, you can do ‘cat mm-first.fa’ to see the contents of that file (or ‘less mm-first.fa’).
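Using ‘head -11’ works here because the first two records happen to span eleven lines. If you would rather select by record count than by line count, a sketch along these lines does the same job; the function names `first_records` and `save_first_records` are hypothetical and exist only for this example.

```python
def first_records(lines, n):
    """Return the lines that make up the first n FASTA records."""
    kept, seen = [], 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break  # stop at the header of record n + 1
        if seen:  # ignore anything before the first header
            kept.append(line)
    return kept


def save_first_records(src, dst, n=2):
    """Copy the first n records of FASTA file src into dst."""
    with open(src) as infile, open(dst, "w") as outfile:
        outfile.writelines(first_records(infile, n))
```

`save_first_records("mouse.protein.faa", "mm-first.fa", n=2)` would then play the same role as the ‘head’ command with redirection.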
Now let’s BLAST these two sequences against the entire zebrafish protein data set. First, we need to tell BLAST that the zebrafish sequences are (a) a database, and (b) a protein database. That’s done by calling ‘formatdb’:
formatdb -i zebrafish.protein.faa -o T -p T
Next, we call BLAST to do the search:
blastall -i mm-first.fa -d zebrafish.protein.faa -p blastp
This should run pretty quickly, but you’re going to get a LOT of output!! What’s going on? A few things –
- if you BLAST a sequence against a large database, odds are it will turn up a lot of spurious matches. By default, blastall uses an e-value cutoff of 10, which is very relaxed.
- by default, blastall also reports up to 500 one-line descriptions and 250 alignments, which is usually more than you want.
- a lot of proteins also have trace similarity to other proteins!
For all of these reasons, generally you only want the first few BLAST matches, and/or the ones with a “good” e-value. We do that by adding ‘-b 2 -v 2’ (which says, report only two matches and alignments), and by adding ‘-e 1e-6’, which says, report only matches with an e-value of 1e-6 or better:
blastall -i mm-first.fa -d zebrafish.protein.faa -p blastp -b 2 -v 2 -e 1e-6
Now you should get a lot less text! (And indeed you do...) Let’s put it in an output file, ‘out.txt’:
blastall -i mm-first.fa -d zebrafish.protein.faa -p blastp -b 2 -v 2 -e 1e-6 -o out.txt
The contents of the output file should look exactly like the output before you saved it into the file – check it out:
cat out.txt
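If you end up running many searches, it can help to assemble the blastall command in a script instead of retyping it. The sketch below builds the argument list from only the flags used in this tutorial (-i, -d, -p, -b, -v, -e, -o); the function names and keyword arguments are assumptions made for this example, and `run_blast` requires blastall to be installed as described above.

```python
import subprocess


def blastall_command(query, database, program="blastp",
                     max_hits=None, evalue=None, output=None):
    """Build a blastall argument list from the options used in this tutorial."""
    cmd = ["blastall", "-i", query, "-d", database, "-p", program]
    if max_hits is not None:
        # -b limits the alignments shown, -v limits the one-line descriptions
        cmd += ["-b", str(max_hits), "-v", str(max_hits)]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    if output is not None:
        cmd += ["-o", output]
    return cmd


def run_blast(**options):
    """Run blastall and raise CalledProcessError if it exits with an error."""
    subprocess.run(blastall_command(**options), check=True)
```

For instance, `run_blast(query="mm-first.fa", database="zebrafish.protein.faa", max_hits=2, evalue="1e-6", output="out.txt")` would run a search limited to two matches with an e-value cutoff of 1e-6 and write the result to out.txt.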
Converting BLAST output into CSV
Suppose we wanted to do something with all this BLAST output. Generally, that’s the case - you want to retrieve all matches, or do a reciprocal BLAST, or something.
As with most programs that run on UNIX, the text output is in some specific format. If the program is popular enough, there will be one or more parsers written for that format – these are just utilities written to help you retrieve whatever information you are interested in from the output.
Let’s conclude this tutorial by converting the BLAST output in out.txt into a spreadsheet format, using a Python script. (We’re not doing this just to confuse you; this is really how we do things around here.)
First, we need to get the script. We’ll do that using the ‘git’ program:
git clone https://github.com/ngs-docs/ngs-scripts.git /root/ngs-scripts
We’ll discuss ‘git’ more later; for now, just think of it as a way to get ahold of a particular set of files. In this case, we’ve placed the files in /root/ngs-scripts/, and you’re looking to run the script blast/blast-to-csv.py using Python:
python /root/ngs-scripts/blast/blast-to-csv.py out.txt
This outputs a spreadsheet-like list of names and e-values. To save this to a file, do:
python /root/ngs-scripts/blast/blast-to-csv.py out.txt > /root/Dropbox/out.csv
The end file, ‘out.csv’, should soon be in your Dropbox on your local computer. If you have Excel installed, try double-clicking on it.
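Once the results are in CSV form, you can also post-process them in Python rather than in a spreadsheet. The sketch below keeps only rows at or below an e-value cutoff. It ASSUMES the e-value is the last column of each row; check the actual layout of out.csv (or the script's source) before relying on that. The function names are made up for this example.

```python
import csv


def filter_by_evalue(rows, cutoff=1e-6, evalue_column=-1):
    """Keep rows whose e-value (assumed to be in evalue_column) is <= cutoff."""
    kept = []
    for row in rows:
        try:
            evalue = float(row[evalue_column])
        except (ValueError, IndexError):
            continue  # skip header lines, empty rows, or malformed rows
        if evalue <= cutoff:
            kept.append(row)
    return kept


def load_hits(path, cutoff=1e-6):
    """Read a CSV file and return only the rows passing the cutoff."""
    with open(path, newline="") as handle:
        return filter_by_evalue(csv.reader(handle), cutoff)
```

With that assumption in place, `load_hits("out.csv")` would return the matches with an e-value of 1e-6 or better as lists of strings.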
And that’s the kind of basic workflow we’ll be teaching you:
- Download program
- Download data
- Run program on data
- Look at results
...but in many cases more complicated :).
Note that there’s no limit on the number of sequences you BLAST, etc. It’s just sheer compute speed and disk space that you need to worry about, and if you look at the files, it turns out they’re not that big – so it’s mostly your time and energy.
This will also maybe help you understand why UNIX programs are so powerful – each program comes with several, or several dozen, little command line “flags” (parameters), that help control how it does its work; then the output is fed into another such program, etc. The possibilities are literally combinatorial.
We’re running a Python program ‘blast-to-csv.py’ above – if you’re interested in what the Python program does, take a look at the source code:
https://github.com/ngs-docs/ngs-scripts/blob/master/blast/blast-to-csv.py
Summing up
Command-line BLAST lets you do BLAST searches of any sequences you have, quickly and easily. It’s probably the single most useful skill a biologist can learn if they’re doing anything genomics-y ;).
Its main computational drawback is that it’s not fast enough to deal with some of the truly massive databases we now have, but that’s generally not a problem for individual users. That’s because they just run it and “walk away” until it’s done!
The main practical issues you will confront in making use of BLAST:
- getting your sequence(s) into the right place.
- formatting the database.
- configuring the BLAST parameters properly.
- doing what you want after BLAST!
Other questions to ponder:
- if we’re using a pre-configured operating system, why did we have to install BLAST?
LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.