SPAdes
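Impressions after use:
It is fine for assembling a small genome, but for a very large genome with many libraries it is better not to use it. Even a server with 768 GB of RAM was not enough.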
Home page:
http://bioinf.spbau.ru/spades
Manual:
http://spades.bioinf.spbau.ru/release3.6.1/manual.html
Note, that SPAdes was initially designed for small genomes. It was tested on single-cell and standard bacterial and fungal data sets. SPAdes is not intended for larger genomes (e.g. mammalian size genomes) and metagenomic projects. For such purposes you can use it at your own risk.
SPAdes also has separate modules for assembling highly polymorphic diploid genomes and for TruSeq barcode assembly. For more information see the dipSPAdes manual and the truSPAdes manual.
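Read error correction is on by default. If you correct reads with your own tool, you can turn off SPAdes's correction (--only-assembler).
Supported input data:
Illumina: either FASTA or FASTQ format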
paired end
mate pairs
single(unpaired)reads
IonTorrent:
fastq
bam
Sanger, Oxford Nanopore and PacBio reads can be provided in both formats since SPAdes does not run error correction for these types of data.
SPAdes should not be used if only PacBio, Oxford Nanopore, Sanger reads or additional contigs are available.
Download and unpack; no installation is needed. The version used here is 3.6.1.
wget http://spades.bioinf.spbau.ru/release3.6.1/SPAdes-3.6.1-Linux.tar.gz
tar -xzf SPAdes-3.6.1-Linux.tar.gz
cd SPAdes-3.6.1-Linux/bin/
Location on the server:
~/spades/SPAdes-3.6.1-Linux/bin
It is recommended to add the SPAdes installation directory to the PATH variable.
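A minimal sketch of the PATH suggestion, assuming the unpack location noted above and a POSIX-style shell; the variable name SPADES_BIN is only for illustration.

```shell
# Directory taken from the server location noted above; adjust to your own unpack path.
SPADES_BIN="$HOME/spades/SPAdes-3.6.1-Linux/bin"

# Prepend it so that spades.py can be called without the full path.
export PATH="$SPADES_BIN:$PATH"

# To keep this across sessions, the export line can be appended to ~/.bashrc (assumes bash).
```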
Test that it works:
./spades.py --test
Version 3.6.1 of SPAdes supports paired-end reads, mate-pairs and unpaired reads. SPAdes can take as input several paired-end and mate-pair libraries simultaneously.
Example command:
nohup ~/spades/SPAdes-3.6.1-Linux/bin/spades.py --careful \
--pe1-1 SHT-6K_rmpcr-1_paired.fq --pe1-2 SHT-6K_rmpcr-2_paired.fq \
--pe2-1 NHD1114_L1_1_paired.fq --pe2-2 NHD1114_L1_2_paired.fq \
--pe3-1 NHD1114_L2_1_paired.fq --pe3-2 NHD1114_L2_2_paired.fq \
--pe4-1 NHD1115_L1_1_paired.fq.gz --pe4-2 NHD1115_L1_2_paired.fq.gz \
--pe5-1 NHD1115_L2_1_paired.fq.gz --pe5-2 NHD1115_L2_2_paired.fq.gz \
--mp1-1 SHT-3K-1_rmpcr-1_paired.fq --mp1-2 SHT-3K-1_rmpcr-2_paired.fq \
--mp2-1 SHT-3K-2_rmpcr-1_paired.fq --mp2-2 SHT-3K-2_rmpcr-2_paired.fq \
--mp3-1 5k_rmpcr-1_paired.fastq --mp3-2 5k_rmpcr-2_paired.fastq \
--mp4-1 SHT-6K_rmpcr-1_paired.fq --mp4-2 SHT-6K_rmpcr-2_paired.fq \
--s1 SHT_500_1.fq --s2 SHT_500_2.fq \
--pacbio SHT_filtered_subreads20150723.fastq \
-t 10 -k 21,33,55,77,99,127 -o SPAdes_results &
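When the data are specified on the command line, at most 5 paired-end and 5 mate-pair libraries can be given. Paired-end libraries default to fr orientation and mate-pair libraries to rf.
For single-end (unpaired) data use --s1 test1.fq --s2 test2.fq. Plain -s does not work here; it has to be the numbered --s form.
If you have third-generation (PacBio), Nanopore or contig data: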
Specifying data for hybrid assembly
--pacbio <file_name>
File with PacBio reads. More information on PacBio reads is provided in section 3.1.
--nanopore <file_name>
File with Oxford Nanopore reads.
--sanger <file_name>
File with Sanger reads
--trusted-contigs <file_name>
Reliable contigs of the same genome, which are likely to have no misassemblies and small rate of other errors (e.g. mismatches and indels). This option is not intended for contigs of the related species.
--untrusted-contigs <file_name>
Contigs of the same genome, quality of which is average or unknown. Contigs of poor quality can be used but may introduce errors in the assembly. This option is also not intended for contigs of the related species.
The directory given to -o must already exist: create it first, then start the run.
-t <int> (or --threads <int>)
Number of threads. The default value is 16.
-m <int> (or --memory <int>)
Set memory limit in Gb. SPAdes terminates if it reaches this limit. The default value is 250 Gb. Actual amount of consumed RAM will be below this limit. Make sure this value is correct for the given machine. SPAdes uses the limit value to automatically determine the sizes of various buffers, etc.
--tmp-dir <dir_name>
Set directory for temporary files from read error correction. The default value is <output_dir>/corrected/tmp
--sc Use this option for single-cell genome data (small genomes).
-k <int,int,...>
Comma-separated list of k-mer sizes to be used (all values must be odd, less than 128 and listed in ascending order). If --sc is set the default values are 21,33,55. For multicell data sets K values are automatically selected using maximum read length (see note for assembling long Illumina paired reads for details). To properly select K values for IonTorrent data read section 3.3.
--careful
Tries to reduce the number of mismatches and short indels. Also runs MismatchCorrector – a post processing tool, which uses BWA tool (comes with SPAdes). This option is recommended.
--continue
Continues SPAdes run from the specified output folder starting from the last available check-point. Check-points are made after:
error correction module is finished
iteration for each specified K value of assembly module is finished
mismatch correction is finished for contigs or scaffolds
For example, if specified K values are 21, 33 and 55 and SPAdes was stopped or crashed during assembly stage with K = 55, you can run SPAdes with the --continue option specifying the same output directory. SPAdes will continue the run starting from the assembly stage with K = 55. Error correction module and iterations for K equal to 21 and 33 will not be run again. Note that all options except -o <output_dir> are ignored if --continue is set.
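About k-mers: if you have single-cell data, add --sc. If you have multicell data, add neither --sc nor -k; the software picks the k-mer sizes from your read length (the longer the reads, the larger the k-mers). -k sets the k-mer values you want explicitly, and they must all be odd.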
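The constraints on -k quoted above (odd, less than 128, ascending) are easy to get wrong in a long list. The following is a small sketch of a pre-flight check; check_kmers is a hypothetical helper written for this note, not part of SPAdes, and it assumes bash.

```shell
# check_kmers LIST: succeed only if LIST is a comma-separated list of
# odd integers, each below 128, in strictly ascending order.
# Hypothetical helper for illustration; SPAdes does its own validation.
check_kmers() {
  local prev=0 k
  local IFS=','
  for k in $1; do
    # Reject empty fields, leading zeros and anything that is not a plain number.
    case "$k" in
      ''|0*|*[!0-9]*) return 1 ;;
    esac
    [ $((k % 2)) -eq 1 ] || return 1   # must be odd
    [ "$k" -lt 128 ] || return 1       # must be less than 128
    [ "$k" -gt "$prev" ] || return 1   # must be ascending
    prev=$k
  done
  # An empty list is not valid either.
  [ "$prev" -gt 0 ]
}

if check_kmers 21,33,55,77,99,127; then
  echo "k-mer list looks valid"
fi
```

The list from the example command passes; a list such as 21,33,128 or 33,21 would be rejected before a long run is started.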
--restart-from <check_point>
Restart SPAdes run from the specified output folder starting from the specified check-point. Check-points are:
ec – start from error correction
as – restart assembly module from the first iteration
k<int> – restart from the iteration with specified k values, e.g. k55
mc – restart mismatch correction
In comparison to the --continue option, you can change some of the options when using --restart-from. You can change any option except: all basic options, all options for specifying input data (including --dataset), --only-error-correction option and --only-assembler option. For example, if you ran assembler with k values 21,33,55 without mismatch correction, you can add one more iteration with k=77 and run mismatch correction step by running SPAdes with following options:
--restart-from k55 -k 21,33,55,77 --mismatch-correction -o <previous_output_dir>.
Since all files will be overwritten, do not forget to copy your assembly from the previous run if you need it.
These two options are really handy: there is no need to rerun everything from scratch.
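As a sketch of how the two resume modes described above might be invoked, the snippet below prints the commands instead of running them unless RUN=1 is set. The run wrapper and the OUT variable are illustrative assumptions; the SPAdes options themselves are the ones quoted from the manual above, and spades.py is assumed to be on the PATH.

```shell
# Dry-run wrapper: print the command unless RUN=1 is set in the environment.
run() {
  if [ "${RUN:-0}" = 1 ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Output directory of the earlier run (same name as in the example command).
OUT=SPAdes_results

# Resume from the last check-point; all options except -o are ignored.
run spades.py --continue -o "$OUT"

# Add one more iteration with k=77 and run mismatch correction,
# restarting from the k=55 check-point. Copy the old assembly first
# if it is still needed, since files are overwritten.
run spades.py --restart-from k55 -k 21,33,55,77 --mismatch-correction -o "$OUT"
```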
If you have more libraries than that, you need to use a YAML file.
By using a YAML file you can provide an unlimited number of paired-end, mate-pair and unpaired libraries. Basically, YAML data set file is a text file, in which input libraries are provided as a comma-separated list in square brackets. Each library is provided in braces as a comma-separated list of attributes. The following attributes are available:
- orientation ("fr", "rf", "ff")
- type ("paired-end", "mate-pairs", "hq-mate-pairs", "single", "pacbio", "nanopore", "sanger", "trusted-contigs", "untrusted-contigs")
- interlaced reads (comma-separated list of files with interlaced reads)
- left reads (comma-separated list of files with left reads)
- right reads (comma-separated list of files with right reads)
- single reads (comma-separated list of files with single reads)
To properly specify a library you should provide its type and at least one file with reads. Orientation is an optional attribute. Its default value is "fr" (forward-reverse) for paired-end libraries and "rf" (reverse-forward) for mate-pair libraries.
The value for each attribute is given after a colon. Comma-separated lists of files should be given in square brackets. For each file you should provide its full path in double quotes. Make sure that files with right reads are given in the same order as corresponding files with left reads.
The YAML file in this example specifies two paired-end libraries, one mate-pair library and one PacBio data set.
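The example file itself is not reproduced in these notes. Following the attribute rules quoted above, a file with two paired-end libraries, one mate-pair library and one PacBio data set might look like the sketch below. Every path and file name in it is a hypothetical placeholder, and the name dataset.yaml is likewise only an assumption.

```shell
# Write a YAML data set file; all paths below are placeholders, not real data.
cat > dataset.yaml <<'EOF'
[
  {
    orientation: "fr",
    type: "paired-end",
    left reads: ["/path/to/data/pe1_left.fastq"],
    right reads: ["/path/to/data/pe1_right.fastq"]
  },
  {
    orientation: "fr",
    type: "paired-end",
    left reads: ["/path/to/data/pe2_left.fastq"],
    right reads: ["/path/to/data/pe2_right.fastq"]
  },
  {
    orientation: "rf",
    type: "mate-pairs",
    left reads: ["/path/to/data/mp1_left.fastq"],
    right reads: ["/path/to/data/mp1_right.fastq"]
  },
  {
    type: "pacbio",
    single reads: ["/path/to/data/pacbio_subreads.fastq"]
  }
]
EOF
```

Such a file would then be passed with --dataset dataset.yaml together with -o <output_dir>, in place of the --pe1-1 style options.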
Notes:
- If the data are specified with --dataset, the --pe1-1 style of options cannot be used as well.
- We recommend to nest all files with long reads of the same data type in a single library block.
Additional contigs
In case you have contigs of the same genome generated by other assembler(s) and you wish to merge them into SPAdes assembly, you can specify additional contigs using --trusted-contigs or --untrusted-contigs. First option is used when high quality contigs are available. These contigs will be used for graph construction, gap closure and repeat resolution. Second option is used for less reliable contigs that may have more errors or contigs of unknown quality. These contigs will be used only for gap closure and repeat resolution. The number of additional contigs is unlimited.
Note, that SPAdes does not perform assembly using genomes of closely-related species. Only contigs of the same genome should be specified.
SPAdes output
SPAdes stores all output files in <output_dir> , which is set by the user.
<output_dir>/corrected/ directory contains reads corrected by BayesHammer in *.fastq.gz files; if compression is disabled, reads are stored in uncompressed *.fastq files
<output_dir>/contigs.fasta contains resulting contigs
<output_dir>/scaffolds.fasta contains resulting scaffolds
<output_dir>/assembly_graph.fastg contains SPAdes assembly graph in FASTG format
To view FASTG files we recommend to use Bandage tool. Note that sequences stored in assembly_graph.fastg correspond to contigs before repeat resolution (edges of the assembly graph). Contigs after repeat resolution (scaffolding) are stored in contigs.fasta (scaffolds.fasta) and can be represented as paths in the assembly graph.
The full list of <output_dir> content is presented below:
contigs.fasta – resulting contigs
scaffolds.fasta – resulting scaffolds
before_rr.fasta – contigs before repeat resolution
assembly_graph.fastg – assembly graph
corrected/ – files from read error correction
  configs/ – configuration files for read error correction
  corrected.yaml – internal configuration file
  Output files with corrected reads
params.txt – information about SPAdes parameters in this run
spades.log – SPAdes log
dataset.info – internal configuration file
input_dataset.yaml – internal YAML data set file
K<##>/ – directory containing files from the run with K=<##>
SPAdes will overwrite these files and directories if they exist in the specified <output_dir>.
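A quick sanity check on the result files listed above is to count the sequences in them. The sketch below assumes the output directory from the example command (SPAdes_results); count_seqs is a hypothetical helper, not a SPAdes tool.

```shell
# count_seqs FILE: print the number of FASTA records in FILE.
# Each record starts with a '>' header line.
count_seqs() {
  grep -c '^>' "$1"
}

# Report contig and scaffold counts if the files are present.
for f in SPAdes_results/contigs.fasta SPAdes_results/scaffolds.fasta; do
  if [ -f "$f" ]; then
    echo "$f: $(count_seqs "$f") sequences"
  fi
done
```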
freemao
FAFU