Canu Tutorial
Canu assembles reads from PacBio RS II or Oxford Nanopore MinION instruments into uniquely-assemblable contigs, unitigs. Canu owes much of its design and code to celera-assembler.
Canu can be run using hardware of nearly any shape or size, anywhere from laptops to computational grids with thousands of nodes. Obviously, larger assemblies will take a long time to compute on laptops, and smaller assemblies can’t take advantage of hundreds of nodes, so what is being assembled plays some part in determining what hardware can be effectively used.
Most algorithms in canu have been multi-threaded (to use all the cores on a single node), parallelized (to use all the nodes in a grid), or both (all the cores on all the nodes).
Canu, the command
The canu command is the ‘executive’ program that runs all modules of the assembler. It oversees each of the three top-level tasks (correction, trimming, unitig construction), each of which consists of many steps. Canu ensures that input files for each step exist, that each step successfully finished, and that the output for each step exists. It does minor bits of processing, such as reformatting files, but generally just executes other programs.
canu \
  -d <working-directory> \
  -p <file-prefix> \
  [-s specifications] \
  [-correct | -trim | -assemble] \
  errorRate=<fraction-error> \
  genomeSize=<genome-size> \
  [parameters] \
  [-pacbio-raw <read-file>] \
  [-pacbio-corrected <read-file>] \
  [-nanopore-raw <read-file>] \
  [-nanopore-corrected <read-file>]
Two options are mandatory: -d, to set the working directory, and -p, to set the file name prefix. All work is performed and output appears in the working directory. The directory need not exist before starting. Most files in this directory have file names beginning with the file name prefix; even so, running two canu commands in the same directory will probably lead to confusion.
The -s option will import a list of parameters from the supplied specification (‘spec’) file. These parameters will be applied before any from the command line are used, providing a method for setting commonly used parameters, but overriding them for specific assemblies.
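The precedence described above (spec file first, command line second) can be sketched in a few lines of Python. This is only an illustration, not Canu's own code: the spec format assumed here (one key=value pair per line, with ‘#’ starting a comment) and the helper names parse_spec and effective_parameters are assumptions made for the example.

```python
def parse_spec(text):
    """Parse 'key=value' lines; blank lines and '#' comments are skipped (assumed format)."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value pair: {raw!r}")
        params[key.strip()] = value.strip()
    return params


def effective_parameters(spec_text, command_line):
    """Spec-file values are applied first; command-line values override them."""
    params = parse_spec(spec_text)
    params.update(command_line)
    return params


# Hypothetical spec file contents holding commonly used settings.
spec = """
# shared defaults
useGrid=0
minReadLength=1000
"""
merged = effective_parameters(spec, {"minReadLength": "2000", "genomeSize": "4.7m"})
print(merged)
```

Here minReadLength from the command line replaces the value from the spec file, while useGrid is inherited unchanged.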
By default, all three top-level tasks are performed. It is possible to run exactly one task by using the -correct, -trim or -assemble options. These options can be useful if you want to correct reads once and try many different assemblies. We do exactly that in the Canu Quick Start.
Parameters are key=value pairs that configure the assembler. They set run time parameters (e.g., memory, threads, grid), algorithmic parameters (e.g., error rates, trimming aggressiveness), and enable or disable entire processing steps (e.g., don’t correct errors, don’t search for subreads). They are described later. One parameter is required: the genomeSize (in bases, with common SI prefixes allowed, for example, 4.7m or 2.8g; see Global Options). Parameters are listed in the Canu Parameter Reference, but the common ones are described in this document.
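To make the genomeSize notation concrete, here is a minimal Python sketch that converts strings such as 4.7m or 2.8g into a number of bases. The function name parse_genome_size and the set of accepted suffixes (k, m, g) are assumptions for illustration; Canu's own parsing may differ.

```python
from decimal import Decimal

# Assumed suffixes: k, m, g (thousand, million, billion bases).
MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9}


def parse_genome_size(text):
    """Convert a string such as '4.7m' or '2.8g' to a number of bases."""
    text = text.strip().lower()
    if not text:
        raise ValueError("empty genome size")
    if text[-1] in MULTIPLIERS:
        digits, multiplier = text[:-1], MULTIPLIERS[text[-1]]
    else:
        digits, multiplier = text, 1
    # Decimal avoids floating-point rounding surprises for values like 4.7.
    return int(Decimal(digits) * multiplier)


for value in ("4.7m", "2.8g", "5000000"):
    print(value, parse_genome_size(value))
```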
Reads are supplied to canu by options that describe how the reads were generated and what level of quality they are. For example, -pacbio-raw indicates the reads were generated on a PacBio RS II instrument, and have had no processing done to them. Each file of reads supplied this way becomes a ‘library’ of reads. The reads should have been (physically) generated all at the same time using the same steps, but perhaps sequenced in multiple batches. In canu, each library has a set of options setting various algorithmic parameters, for example, how aggressively to trim. To explicitly set library parameters, a text ‘gkp’ file describing the library and the input files must be created. Don’t worry too much about this yet; it’s an advanced feature, fully described in Section gkp-files.
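As a rough illustration of how the pieces of the synopsis fit together, the following Python sketch assembles a canu argument list from the options named above. It only builds and prints the command; it does not run canu. The function build_canu_command and the directory, prefix and file names are hypothetical.

```python
import shlex

READ_OPTIONS = {"-pacbio-raw", "-pacbio-corrected", "-nanopore-raw", "-nanopore-corrected"}
TASK_OPTIONS = {"-correct", "-trim", "-assemble"}


def build_canu_command(workdir, prefix, genome_size, read_option, read_file,
                       task=None, parameters=None):
    """Assemble an argument list using only options from the synopsis."""
    if read_option not in READ_OPTIONS:
        raise ValueError(f"unknown read option: {read_option}")
    if task is not None and task not in TASK_OPTIONS:
        raise ValueError(f"unknown task option: {task}")
    args = ["canu", "-d", workdir, "-p", prefix]
    if task:
        args.append(task)          # omit to run all three tasks
    args.append(f"genomeSize={genome_size}")
    for key, value in (parameters or {}).items():
        args.append(f"{key}={value}")
    args.extend([read_option, read_file])
    return args


# Hypothetical run: correct raw PacBio reads only, on the local machine.
cmd = build_canu_command("ecoli-run", "ecoli", "4.7m", "-pacbio-raw",
                         "reads.fastq.gz", task="-correct",
                         parameters={"useGrid": 0})
print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to hand to subprocess.run later without worrying about shell quoting.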
The read-files contain sequence data in either FASTA or FASTQ format (or both! A quirk of the implementation allows files that contain both FASTA and FASTQ format reads). The files can be uncompressed, or compressed with gzip, bzip2 or xz. We’ve found that “gzip -1” provides good compression that is fast to both compress and decompress. For ‘archival’ purposes, we use “xz -9”.
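If you want to inspect such read files yourself before handing them to canu, a small Python reader along the following lines may help. It is a sketch under stated assumptions, not Canu's loader: compression is guessed from the file extension, FASTQ records are assumed to occupy exactly four lines, and the names open_reads and read_records are made up for this example.

```python
import bz2
import gzip
import lzma
import os
import tempfile

OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_reads(path):
    """Open a read file as text, decompressing based on the file extension."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt")
    return open(path, "rt")


def read_records(handle):
    """Yield (name, sequence) from a stream mixing FASTA and 4-line FASTQ records."""
    line = handle.readline()
    while line:
        line = line.rstrip("\n")
        if line.startswith("@"):
            sequence = handle.readline().strip()
            handle.readline()              # '+' separator line
            handle.readline()              # quality line, ignored here
            yield line[1:], sequence
            line = handle.readline()
        elif line.startswith(">"):
            name, parts = line[1:], []
            line = handle.readline()
            while line and not line.startswith((">", "@")):
                parts.append(line.strip())
                line = handle.readline()
            yield name, "".join(parts)
        else:
            line = handle.readline()       # skip blank or unexpected lines


# Demonstration on a small temporary gzip file with one FASTA and one FASTQ read.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.fa.gz")
    with gzip.open(path, "wt") as out:
        out.write(">r1\nACGT\nAC\n@r2\nGGCC\n+\n@!!!\n")
    with open_reads(path) as handle:
        records = list(read_records(handle))
print(records)
```

Reading the FASTQ record as a fixed block of four lines matters because a quality line may itself begin with ‘@’.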
Canu, the pipeline
The canu pipeline, that is, what it actually computes, consists of computing overlaps and processing the overlaps to some result. Each of the three tasks (read correction, read trimming and unitig construction) follows the same pattern:
- Load reads into the read database, gkpStore.
- Compute k-mer counts in preparation for the overlap computation.
- Compute overlaps.
- Load overlaps into the overlap database, ovlStore.
- Do something interesting with the reads and overlaps.
  - The read correction task will replace the original noisy read sequences with consensus sequences computed from overlapping reads.
  - The read trimming task will use overlapping reads to decide what regions of each read are high-quality sequence, and what regions should be trimmed. After trimming, the single largest high-quality chunk of sequence is retained.
  - The unitig construction task finds sets of overlaps that are consistent, and uses those to place reads into a multialignment layout (the ‘L’ in overlap-layout-consensus). The layout is then used to generate a consensus sequence (the ‘C’) for the unitig. In essence, this is a problem of building a graph and finding the best path through it.
Module Tags
Because each of the three tasks shares common algorithms (all compute overlaps, two compute consensus sequences, etc), parameters are differentiated by a short prefix ‘tag’ string. This lets canu have one generic parameter that can be set to different values for each stage in each task.
For example, “corOvlMemory” will set memory usage for overlaps being generated for read correction; “obtOvlMemory” for overlaps generated for Overlap Based Trimming; “utgOvlMemory” for overlaps generated for unitig construction.
The tags are:
We’ll get to the details eventually.
Execution Configuration
There are two modes that canu runs in: locally, using just one machine, or grid-enabled, using multiple hosts managed by a grid engine. At present, only Sun Grid Engine / Open Grid Engine / Univa Grid Engine / SGE / whatever it’s called now, PBS, and Slurm support is available. LSF support is also included but has limited testing. Section Grid Engine Configuration has a few hints on how to set up a new grid engine.
To enable execution of the parallel steps on the grid, set useGrid=1. The canu pipeline will immediately submit itself to the grid, and run entirely under grid control. This is the default if a grid-engine is detected on your system. If you prefer to run locally, set useGrid=0.
In both cases, local or grid, Canu will auto-detect available resources and scale the jobs to run, based on the resources and genome size you’re assembling. Thus, most users should be able to run the command without modifying the defaults. Some advanced options are outlined below. Each stage has the same five configuration options, and tags are used to specialize the option to a specific stage. The options are:
useGrid<tag>=boolean
  Run this stage on the grid, usually in parallel.
gridOptions<tag>=string
  Supply this string to the grid submit command.
<tag>Memory=integer
  Use this many gigabytes of memory, per process.
<tag>Threads
  Use this many compute threads per process.
<tag>Concurrency
  If not on the grid, run this many jobs at the same time.
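To get a feel for how the memory, threads and concurrency settings interact on a single machine, consider the following back-of-the-envelope Python sketch. It is not Canu's resource detection logic, only an assumed rule of thumb: a machine can run as many jobs at once as fit within both its memory and its cores. The function name local_concurrency and the machine sizes are hypothetical.

```python
def local_concurrency(total_memory_gb, total_cores, memory_gb, threads):
    """How many jobs of a given size fit on one machine at the same time."""
    if memory_gb <= 0 or threads <= 0:
        raise ValueError("memory and threads must be positive")
    by_memory = total_memory_gb // memory_gb   # limit imposed by RAM
    by_cores = total_cores // threads          # limit imposed by CPU cores
    return min(by_memory, by_cores)


# Hypothetical machine: 64 GB of memory and 16 cores; jobs need 8 GB and 4 threads.
print(local_concurrency(64, 16, memory_gb=8, threads=4))
```

A result of zero means a single job of that size does not fit on the machine at all.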
Global grid options, applied to every job submitted to the grid, can be set with ‘gridOptions’. This can be used to add accounting information or access credentials.
A name can be associated with this compute using ‘gridOptionsJobName’. Canu will work just fine with no name set, but if multiple canu assemblies are running at the same time, they will tend to wait for each other’s jobs to finish. For example, if two assemblies are running, at some point both will have overlap jobs running. Each assembly will be waiting for all jobs named ‘ovl_asm’ to finish. Had the assemblies specified job names, gridOptionsJobName=apple and gridOptionsJobName=orange, then one would be waiting for jobs named ‘ovl_asm_apple’, and the other would be waiting for jobs named ‘ovl_asm_orange’.
Error Rates
Canu expects all error rates to be reported as fraction error (0.01, for example), not as percent error (1%). We’re not sure exactly why this is so. Previously, it used a mix of fraction error and percent error (or both!), and was a little confusing. Here’s a handy table you can print out that converts between fraction error and percent error. Not all values are shown (it’d be quite a large table) but we have every confidence you can figure out the missing values:
Fraction Error Percent Error
0.01 1%
0.02 2%
0.03 3%
. .
. .
0.12 12%
. .
. .
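For the missing rows, the conversion is a division or multiplication by 100. A tiny Python sketch (the function names are made up for this example):

```python
def percent_to_fraction(percent):
    """Convert a percent error (e.g. 12 for 12%) to a fraction error (0.12)."""
    return percent / 100.0


def fraction_to_percent(fraction):
    """Convert a fraction error (e.g. 0.035) to a percent error (3.5)."""
    return fraction * 100.0


# Print a few rows in the same layout as the table.
for percent in (1, 2, 3, 12):
    print(f"{percent_to_fraction(percent):.2f}  {percent}%")
```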
Eventually, we want to have Canu take a single error rate, the error rate of a single input read, and derive all other rates from there. This is the parameter errorRate. Currently, the defaults are 0.025 for PacBio sequences and 0.05 for Oxford Nanopore sequences. Typically, you should not need to modify this setting. However, the error rate does affect runtime and lowering it can significantly speed up your assembly. Thus, for low coverage datasets (<=30X) we recommend increasing the error rate slightly (by 1%, so errorRate=0.035 for PacBio) and for high-coverage (>=60X) datasets, we recommend decreasing it (by 1%, so errorRate=0.015 for PacBio).
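The coverage-based recommendation can be written down as a small Python helper. This is a sketch of the rule of thumb stated above, not something canu does automatically; the function name suggested_error_rate is hypothetical, and applying the same 0.01 adjustment to Nanopore data is an assumption, since the text only gives PacBio numbers.

```python
# Defaults quoted in the text, expressed as fraction error.
DEFAULT_ERROR_RATE = {"pacbio": 0.025, "nanopore": 0.05}


def suggested_error_rate(technology, coverage):
    """Apply the +/- 0.01 rule of thumb for low and high coverage datasets."""
    rate = DEFAULT_ERROR_RATE[technology]
    if coverage <= 30:
        rate += 0.01     # low coverage: tolerate more error
    elif coverage >= 60:
        rate -= 0.01     # high coverage: be stricter and run faster
    return round(rate, 3)


for coverage in (25, 45, 80):
    print(coverage, suggested_error_rate("pacbio", coverage))
```

The returned value can be passed to canu as errorRate=<value>.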
The following error rates are defined:
- errorRate
  The expected error rate in a corrected single read. This will set the remaining error rates implicitly.
Recall there are three sets of overlaps generated: one for read correction, one for read trimming, and one for unitig construction. The read correction overlaps are typically generated by the ‘mhap’ overlapper, which doesn’t need an error rate as it doesn’t generate alignment-based overlaps. The other two sets of overlaps are generated by overlapInCore, which is alignment-based, and so requires a maximum error rate. Thus, out of the six possible tags, only two are currently useful for overlap generation:
- obtOvlErrorRate
  Do not compute overlaps used for trimming above this error rate. Applies to the standard overlapper, and realigning mhap overlaps.
- utgOvlErrorRate
  Do not compute overlaps used for unitig construction above this error rate. Applies to the standard overlapper, and realigning mhap overlaps.
(ADVANCED) It is possible to convert the mhap or minimap overlaps to alignment-based overlaps using obtReAlign=true or ovlReAlign=true. If so, the overlaps will be computed using either obtOvlErrorRate or utgOvlErrorRate, depending on which overlaps are being generated.
Be sure not to confuse obtOvlErrorRate with obtErrorRate:
- obtErrorRate
  Filter overlaps during OBT’s read trimming and read splitting.
In celera-assembler, consensus generation required an overlap. In canu, this is no longer used, but the cnsErrorRate option still exists.
Minimum Lengths
Two minimum sizes are known:
- minReadLength
  Discard reads shorter than this when loading into the assembler, and when trimming reads.
- minOverlapLength
  Do not save overlaps shorter than this.
Overlap configuration
The largest compute of the assembler is also the most complicated to configure. As shown in the ‘module tags’ section, there are up to eight (!) different overlapper configurations. For each overlapper (‘ovl’ or ‘mhap’) there is a global configuration, and three specializations that apply to each stage in the pipeline (correction, trimming or assembly).
Like with ‘grid configuration’, overlap configuration uses a ‘tag’ prefix applied to each option. The tags in this instance are ‘cor’, ‘obt’ and ‘utg’.
For example:
- To change the k-mer size for all instances of the ovl overlapper, ‘merSize=23’ would be used.
- To change the k-mer size for just the ovl overlapper used during correction, ‘corMerSize=16’ would be used.
- To change the mhap k-mer size for all instances, ‘mhapMerSize=18’ would be used.
- To change the mhap k-mer size just during correction, ‘corMhapMerSize=15’ would be used.
- To use minimap for overlap computation just during correction, ‘corOverlapper=minimap’ would be used.
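The relationship between a global option and its stage-specific specializations can be modelled as a simple lookup: use the tagged value if one was given, otherwise fall back to the global value. The Python sketch below illustrates that idea with the option names from the examples above; the functions tagged_name and resolve are hypothetical and do not reflect how canu stores its parameters internally.

```python
def tagged_name(tag, option):
    """Build the stage-specific name, e.g. ('cor', 'merSize') -> 'corMerSize'."""
    return tag + option[0].upper() + option[1:]


def resolve(params, tag, option, default=None):
    """Prefer the stage-specific value, then the global one, then the default."""
    for name in (tagged_name(tag, option), option):
        if name in params:
            return params[name]
    return default


# Settings taken from the examples in the list above.
params = {"merSize": 23, "corMerSize": 16, "mhapMerSize": 18}
for tag in ("cor", "obt", "utg"):
    print(tag, resolve(params, tag, "merSize"), resolve(params, tag, "mhapMerSize"))
```

With these settings only the correction stage sees the smaller ovl k-mer size; trimming and unitig construction keep the global value.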
Ovl Overlapper Configuration
- <tag>Overlapper
  Select the overlap algorithm to use: ‘ovl’, ‘mhap’ or, as in the example above, ‘minimap’.
Ovl Overlapper Parameters
<tag>ovlHashBlockLength
  how many bases of reads to include in the hash table; directly controls process size
<tag>ovlRefBlockSize
  how many reads to compute overlaps for in one process; directly controls process time
<tag>ovlRefBlockLength
  same, but use ‘bases in reads’ instead of ‘number of reads’
<tag>ovlHashBits
  size of the hash table (SHOULD BE REMOVED AND COMPUTED, MAYBE TWO PASS)
<tag>ovlHashLoad
  how much to fill the hash table before computing overlaps (SHOULD BE REMOVED)
<tag>ovlMerSize
  size of kmer seed; smaller is more sensitive, but slower
The overlapper will not use frequent kmers to seed overlaps. These are computed by the ‘meryl’ program, and can be selected in one of three ways.
Terminology. A k-mer is a contiguous sequence of k bases. The read ‘ACTTA’ has two 4-mers: ACTT and CTTA. To account for reverse-complement sequence, a ‘canonical kmer’ is the lexicographically smaller of the forward and reverse-complemented kmer sequence. Kmer ACTT, with reverse complement AAGT, has a canonical kmer AAGT. Kmer CTTA, reverse-complement TAAG, has canonical kmer CTTA.
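The terminology can be checked with a few lines of Python. This is a minimal sketch for illustration only (it assumes uppercase A, C, G, T input, and the function names are made up); it is not how meryl is implemented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmers(sequence, k):
    """All contiguous substrings of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """The lexicographically smaller of the k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


for kmer in kmers("ACTTA", 4):
    print(kmer, reverse_complement(kmer), canonical(kmer))
```

Running it on the read ACTTA reproduces the two 4-mers and their canonical forms described in the paragraph above.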
A ‘distinct’ kmer is the kmer sequence with no count associated with it. A ‘total’ kmer (for lack of a better term) is the kmer with its count. The sequence TCGTTTTTTTCGTCG has (forward) 12 ‘total’ 4-mers and 8 ‘distinct’ kmers.
TCGTTTTTTTCGTCG  count
TCGT             2  distinct-1
 CGTT            1  distinct-2
  GTTT           1  distinct-3
   TTTT          4  distinct-4
    TTTT         4  copy of distinct-4
     TTTT        4  copy of distinct-4
      TTTT       4  copy of distinct-4
       TTTC      1  distinct-5
        TTCG     1  distinct-6
         TCGT    2  copy of distinct-1
          CGTC   1  distinct-7
           GTCG  1  distinct-8
<tag>MerThreshold
  any kmer with count higher than N is not used
<tag>MerDistinct
  pick a threshold so as to seed overlaps using this fraction of all distinct kmers in the input. In the example above, fraction 0.875 of the k-mers (7/8) will be at or below threshold 2.
<tag>MerTotal
  pick a threshold so as to seed overlaps using this fraction of all kmers in the input. In the example above, fraction 0.6667 of the k-mers (8/12) will be at or below threshold 2.
<tag>FrequentMers
  don’t compute frequent kmers, use those listed in this fasta file
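The distinct and total fractions can be recomputed directly from k-mer counts. The Python sketch below counts the forward 4-mers of the example sequence TCGTTTTTTTCGTCG and reports both fractions for a given threshold. It is an illustration only: the function names are made up, only forward k-mers are counted (no canonical form), and the way a threshold is picked in threshold_for_distinct is an assumption, not meryl's actual procedure.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every forward k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def fraction_distinct_at_or_below(counts, threshold):
    """Fraction of distinct k-mers whose count does not exceed the threshold."""
    kept = sum(1 for c in counts.values() if c <= threshold)
    return kept / len(counts)


def fraction_total_at_or_below(counts, threshold):
    """Fraction of all k-mer occurrences whose count does not exceed the threshold."""
    kept = sum(c for c in counts.values() if c <= threshold)
    return kept / sum(counts.values())


def threshold_for_distinct(counts, fraction):
    """Smallest count threshold keeping at least `fraction` of distinct k-mers."""
    for threshold in sorted(set(counts.values())):
        if fraction_distinct_at_or_below(counts, threshold) >= fraction:
            return threshold
    return max(counts.values())


counts = count_kmers("TCGTTTTTTTCGTCG", 4)
print("distinct:", len(counts), "total:", sum(counts.values()))
print("distinct fraction at threshold 2:", fraction_distinct_at_or_below(counts, 2))
print("total fraction at threshold 2:", fraction_total_at_or_below(counts, 2))
```

Asking for a fraction instead of a fixed count lets the threshold adapt to the depth of the dataset, which is presumably why both styles of option exist.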
Mhap Overlapper Parameters
<tag>MhapBlockSize
  Chunk of reads that can fit into 1GB of memory. Combined with memory to compute the size of chunk the reads are split into.
<tag>MhapMerSize
  Use k-mers of this size for detecting overlaps.
<tag>ReAlign
  After computing overlaps with mhap, compute a sequence alignment for each overlap.
<tag>MhapSensitivity
  Either ‘normal’, ‘high’, or ‘fast’.
Mhap also will down-weight frequent kmers (using tf-idf), but its selection of frequent is not exposed.
Minimap Overlapper Parameters
<tag>MMapBlockSize
  Chunk of reads that can fit into 1GB of memory. Combined with memory to compute the size of chunk the reads are split into.
<tag>MMapMerSize
  Use k-mers of this size for detecting overlaps.
Minimap also will ignore high-frequency minimizers, but its selection of frequent is not exposed.