Canu Quick Start
Canu specializes in assembling PacBio or Oxford Nanopore sequences. Canu will correct the reads, then trim suspicious regions (such as remaining SMRTbell adapter), then assemble the corrected and cleaned reads into unitigs. (In short, Canu is built for assembling third-generation long reads, in three steps: correct, trim, assemble.)
Brief Introduction
Canu has been designed to auto-detect your resources and scale itself to fit. Two parameters let you restrict the resources used.
maxMemory=XX
maxThreads=XX
Memory is specified in gigabytes. On a single machine, this restricts Canu to at most the given limit; on the grid, no single job will try to use more than the specified resources.
The input sequences can be in FASTA or FASTQ format, uncompressed or compressed with gz, bz2 or xz.
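As a rough sketch, the two limits are simply appended to an ordinary canu command line. The values used here (16 GB, 8 threads) and the input file name reads.fastq.gz are illustrative assumptions, and the helper canu_limited_cmd is hypothetical: it only prints the command so it can be checked before anything is run.

```shell
# Print (do not run) a canu command line with resource caps.
# Arguments: prefix, output dir, genome size, memory in GB, threads, reads file.
# All values passed below are placeholders, not recommendations.
canu_limited_cmd() {
    printf 'canu -p %s -d %s genomeSize=%s maxMemory=%s maxThreads=%s -pacbio-raw %s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

canu_limited_cmd ecoli ecoli-auto 4.8m 16 8 reads.fastq.gz
```

Printing first makes it easy to confirm that the caps fit the machine or the queue before committing to a long run.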
Running on the grid
Canu is designed to run on grid environments (LSF/PBS/Torque/Slurm/SGE are supported). Currently, Canu will submit itself to the default queue with default time options. (Note: if you have not configured these options by hand, run with useGrid=false; otherwise the job may fail to submit on the cluster.) You can override this behavior by providing any specific parameters you want to be used for submission as an option. Users should also specify a job name to use on the grid:
gridOptionsJobName=myassembly
"gridOptions=--partition quick --time 2:00"
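The choice between local and grid execution can be sketched as a small helper that emits the extra arguments for each mode. The function name canu_mode_args is hypothetical; useGrid=false is Canu's switch for keeping everything on the local machine, and the Slurm-style flags are the ones shown above, not values to copy blindly.

```shell
# Emit the extra canu arguments for a run mode, one argument per line.
# "local" disables grid submission; "grid" names the job and passes
# scheduler flags through unchanged.
canu_mode_args() {
    case "$1" in
        local) printf '%s\n' 'useGrid=false' ;;
        grid)  printf '%s\n' 'gridOptionsJobName=myassembly' \
                             'gridOptions=--partition quick --time 2:00' ;;
        *)     echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

canu_mode_args grid
```

Each output line is a single argument. The gridOptions line contains spaces, so it has to be quoted as a whole when it is passed to canu.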
Assembling PacBio data
Pacific Biosciences released P6-C4 chemistry reads. You can download them directly (7 GB) or from the original page. You must have the PacBio SMRTpipe software installed to extract the reads as FASTQ.
We made a 25X subset FASTQ available here
or use the following curl command:
curl -L -o p6.25x.fastq http://gembox.cbcb.umd.edu/mhap/raw/ecoli_p6_25x.filtered.fastq
Correct, Trim and Assemble
By default, canu will correct the reads, then trim the reads, then assemble the reads to unitigs.
In other words, the default run is a one-stop pipeline: every stage completes automatically.
canu \
-p ecoli -d ecoli-auto \
genomeSize=4.8m \
-pacbio-raw p6.25x.fastq
The same run, submitted as a PBS job script:
#PBS -N R498_CANU
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l mem=30gb
#PBS -q low
cd "$PBS_O_WORKDIR"
date
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/public/software/gcc-4.9.2/lib64
/public/software/canu-master/Linux-amd64/bin/canu \
-p ecoli -d ecoli-auto \
genomeSize=4.8m \
-pacbio-raw ecoli_p6_25x.filtered.fastq
This will use the prefix ‘ecoli’ to name files, compute the correction task in directory ‘ecoli-auto/correction’, the trimming task in directory ‘ecoli-auto/trimming’, and the unitig construction stage in ‘ecoli-auto’ itself. Output files are described in the next section.
Find the Output
The canu progress chatter records statistics such as an input read histogram, corrected read histogram, and overlap types. Outputs from the assembly tasks are in:
- ecoli*/ecoli.correctedReads.fasta.gz (the corrected reads; can be paged through directly with less)
- The sequences after correction, trimmed and split based on consensus evidence. Identity is typically >99% for PacBio and >98% for Nanopore, but it can vary with the quality of your input sequencing.
- ecoli*/ecoli.trimmedReads.fasta.gz
- The sequences after correction and final trimming. The corrected sequences above are overlapped again to identify any missed hairpin adapters or bad sequence that could not be detected in the raw sequences.
- ecoli*/ecoli.layout
- The layout provides information on where each read ended up in the final assembly, including contig and positions. It also includes the consensus sequence for each contig.
- ecoli*/ecoli.gfa
- The GFA is the assembly graph generated by Canu. Currently this includes the contigs, associated bubbles, and any overlaps which were not used by the assembly.
The fasta output is split into three types:
- 1. ecoli*/asm.contigs.fasta (the final contig file; each header carries several flags)
- Everything which could be assembled and is part of the primary assembly, including both unique and repetitive elements. Each contig has several flags included on the fasta def line:
>tig######## len=<integer> reads=<integer> covStat=<float> gappedBases=<yes|no> class=<contig|bubble|unassm> suggestRepeat=<yes|no> suggestCircular=<yes|no>
For example:
>tig00000000 len=110432 reads=231 covStat=181.52 gappedBases=no class=contig suggestRepeat=no suggestCircular=no
TGAAAACACCAGTCGGTGGCAGACAAGGCGTCGGTCGGTGGAAGTGTAGACGCCCAACAACGGCAGCATAATAGGTCAGCCGTGCAGGCGGAGACACCAG
The flags above are explained in detail below:
- len
- Length of the sequence, in bp.
- reads
- Number of reads used to form the contig. (组装这条contig用了多少个reads)
- covStat
- The log of the ratio of the contig being unique versus being two-copy, based on the read arrival rate. Positive values indicate the contig is more likely to be unique, while negative values indicate it is more likely to be repetitive. See Footnote 24 in Myers et al., A Whole-Genome Assembly of Drosophila.
- gappedBases
- If yes, the sequence includes all gaps in the multialignment.
- class
- Type of sequence. Unassembled sequences are primarily low-coverage sequences spanned by a single read.
- suggestRepeat
- If yes, sequence was detected as a repeat based on graph topology or read overlaps to other sequences.
- suggestCircular
- If yes, sequence is likely circular. Not implemented.
- 2. ecoli*/asm.bubbles.fasta
- Alternate paths in the graph which could not be merged into the primary assembly.
- 3. ecoli*/asm.unassembled.fasta (neither assembled nor part of a bubble)
- Reads which could not be incorporated into the primary or bubble assemblies.
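Because the def-line flags are plain key=value pairs, they can be pulled out with standard tools. The helper below is a minimal sketch, not part of Canu: fasta_flag is a hypothetical name, and it assumes headers follow the layout shown above, with space-separated fields after the contig name.

```shell
# Print "name<TAB>value" for one key=value flag on every FASTA header line.
# Usage: fasta_flag KEY < contigs.fasta
fasta_flag() {
    awk -v key="$1" '
        /^>/ {
            name = substr($1, 2)          # contig name without the leading ">"
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")        # kv[1] is the flag name, kv[2] its value
                if (kv[1] == key) print name "\t" kv[2]
            }
        }'
}

# Example on the header shown above (sequence shortened).
printf '%s\n' \
    '>tig00000000 len=110432 reads=231 covStat=181.52 gappedBases=no class=contig suggestRepeat=no suggestCircular=no' \
    'TGAAAACACC' |
    fasta_flag len
```

Swapping len for suggestRepeat or covStat gives a quick per-contig listing of the other flags.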
Correct, Trim and Assemble, Manually
Sometimes, however, it makes sense to do the three top-level tasks by hand. This would allow trying multiple unitig construction parameters on the same set of corrected and trimmed reads.
First, correct the raw reads:
canu -correct \
-p ecoli -d ecoli \
genomeSize=4.8m \
-pacbio-raw p6.25x.fastq
Then, trim the output of the correction:
canu -trim \
-p ecoli -d ecoli \
genomeSize=4.8m \
-pacbio-corrected ecoli/correction/ecoli.correctedReads.fasta.gz
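And finally, assemble the output of trimming, twice, once for each of two error rates (the point of running the stages by hand is to try several unitig construction parameters on the same reads):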
canu -assemble \
-p ecoli -d ecoli-erate-0.013 \
genomeSize=4.8m \
errorRate=0.013 \
-pacbio-corrected ecoli/trimming/ecoli.trimmedReads.fasta.gz
canu -assemble \
-p ecoli -d ecoli-erate-0.025 \
genomeSize=4.8m \
errorRate=0.025 \
-pacbio-corrected ecoli/trimming/ecoli.trimmedReads.fasta.gz
The directory layout for correction and trimming is exactly the same as when we ran all tasks in the same command. Each unitig construction task needs its own private work space, and in there the ‘correction’ and ‘trimming’ directories are empty. The error rate always specifies the error in the corrected reads, which is typically <1% for PacBio data and <2% for Nanopore data (<1% on newest chemistries).
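When more than two error rates are worth trying, the assemble step can be generated in a loop. The sketch below only prints the commands, one per error rate, each with its own output directory as required above. The function name assemble_cmds is hypothetical, and the paths and genome size are the ones from this tutorial.

```shell
# Print one "canu -assemble" command per error rate given as an argument.
# Each command gets a private output directory named after its error rate.
assemble_cmds() {
    for erate in "$@"; do
        printf 'canu -assemble -p ecoli -d ecoli-erate-%s genomeSize=4.8m errorRate=%s -pacbio-corrected ecoli/trimming/ecoli.trimmedReads.fasta.gz\n' \
            "$erate" "$erate"
    done
}

assemble_cmds 0.013 0.025
```

The printed lines can be reviewed and then run one at a time, or piped to sh once they look right.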
Assembling Oxford Nanopore data
A set of E. coli runs was released by the Loman lab. You can download one directly or any of them from the original page.
or use the following curl command:
curl -L -o oxford.fasta http://nanopore.s3.climb.ac.uk/MAP006-PCR-1_2D_pass.fasta
Canu assembles any of the four available datasets into a single contig but we picked one dataset to use in this tutorial. Then, assemble the data as before:
canu \
-p ecoli -d ecoli-oxford \
genomeSize=4.8m \
-nanopore-raw oxford.fasta
The assembled identity is >99% before polishing.
Assembling With Multiple Technologies/Files
Canu takes an arbitrary number of input files/formats. We made a mixed dataset of about 10X of a PacBio P6 and 10X of an Oxford Nanopore run available here
or use the following curl command:
curl -L -o mix.tar.gz http://gembox.cbcb.umd.edu/mhap/raw/ecoliP6Oxford.tar.gz
tar xvzf mix.tar.gz
Now you can assemble all the data:
canu \
-p ecoli -d ecoli-mix \
genomeSize=4.8m \
-pacbio-raw pacbio*fastq.gz \
-nanopore-raw oxford.fasta.gz
Assembling Low Coverage Datasets
When you have 30X or less coverage, it helps to adjust the Canu assembly parameters. Typically, assembling 20X of single-molecule data outperforms hybrid methods with higher coverage. You can download a 20X subset of S. cerevisiae
or use the following curl command:
curl -L -o yeast.20x.fastq.gz http://gembox.cbcb.umd.edu/mhap/raw/yeast_filtered.20x.fastq.gz
and run the assembler adding sensitive parameters (errorRate=0.035):
canu \
-p asm -d yeast \
genomeSize=12.1m \
errorRate=0.035 \
-pacbio-raw yeast.20x.fastq.gz
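Whether a dataset falls under the 30X threshold mentioned above can be estimated as total bases divided by genome size. The helper below is a rough sketch under two assumptions: the FASTQ uses exactly four lines per record (no wrapped sequences), and the genome size is given in base pairs. The name fastq_coverage is hypothetical.

```shell
# Estimate sequencing depth as total bases / genome size.
# Reads an uncompressed 4-line-per-record FASTQ on stdin; $1 is genome size in bp.
fastq_coverage() {
    awk -v g="$1" '
        NR % 4 == 2 { bases += length($0) }   # line 2 of each record is the sequence
        END { printf "%.1f\n", bases / g }'
}

# Tiny made-up example: reads of 10 bp and 5 bp against a 5 bp "genome".
printf '%s\n' '@r1' 'ACGTACGTAC' '+' 'IIIIIIIIII' '@r2' 'ACGTA' '+' 'IIIII' |
    fastq_coverage 5
```

For the yeast subset, the equivalent call would be gzip -dc yeast.20x.fastq.gz | fastq_coverage 12100000.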
After the run completes, we can check the assembly statistics:
tgStoreDump -sizes -s 12100000 -T yeast/unitigging/asm.tigStore 2 -G yeast/unitigging/asm.gkpStore
lenSuggestRepeat sum 160297 (genomeSize 12100000)
lenSuggestRepeat num 12
lenSuggestRepeat ave 13358
lenUnassembled ng10 13491 bp lg10 77 sum 1214310 bp
lenUnassembled ng20 11230 bp lg20 176 sum 2424556 bp
lenUnassembled ng30 9960 bp lg30 290 sum 3632411 bp
lenUnassembled ng40 8986 bp lg40 418 sum 4841978 bp
lenUnassembled ng50 8018 bp lg50 561 sum 6054460 bp
lenUnassembled ng60 7040 bp lg60 723 sum 7266816 bp
lenUnassembled ng70 6169 bp lg70 906 sum 8474192 bp
lenUnassembled ng80 5479 bp lg80 1114 sum 9684981 bp
lenUnassembled ng90 4787 bp lg90 1348 sum 10890099 bp
lenUnassembled ng100 4043 bp lg100 1624 sum 12103239 bp
lenUnassembled ng110 3323 bp lg110 1952 sum 13310167 bp
lenUnassembled ng120 2499 bp lg120 2370 sum 14520362 bp
lenUnassembled ng130 1435 bp lg130 2997 sum 15731198 bp
lenUnassembled sum 16139888 (genomeSize 12100000)
lenUnassembled num 3332
lenUnassembled ave 4843
lenContig ng10 770772 bp lg10 2 sum 1566457 bp
lenContig ng20 710140 bp lg20 4 sum 3000257 bp
lenContig ng30 669248 bp lg30 5 sum 3669505 bp
lenContig ng40 604859 bp lg40 7 sum 4884914 bp
lenContig ng50 552911 bp lg50 10 sum 6571204 bp
lenContig ng60 390415 bp lg60 12 sum 7407061 bp
lenContig ng70 236725 bp lg70 16 sum 8521520 bp
lenContig ng80 142854 bp lg80 23 sum 9768299 bp
lenContig ng90 94308 bp lg90 33 sum 10927790 bp
lenContig sum 12059140 (genomeSize 12100000)
lenContig num 56
lenContig ave 215341
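In the table above, ngXX appears to be the contig length at which the running total of lengths, taken from longest to shortest, first reaches XX% of genomeSize, and lgXX the number of contigs needed to get there. The sum column is consistent with this reading. A minimal sketch of the NG50 calculation, with a hypothetical function name and made-up lengths:

```shell
# Compute NG50: the length L such that contigs of length >= L together
# cover at least half of the genome size ($1, in bp).
# Contig lengths are read from stdin, one per line, in any order.
ng50() {
    sort -rn | awk -v g="$1" '
        { sum += $1 }
        sum >= g / 2 { print $1; found = 1; exit }
        END { if (!found) print "NA" }'   # contigs never reach half the genome
}

# Made-up lengths against a 100 bp "genome": 40 + 30 reaches 50, so NG50 is 30.
printf '%s\n' 10 40 20 30 | ng50 100
```

Real input would be the len= values from the contig headers, with 12100000 as the genome size for this yeast run.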
Consensus Accuracy
While Canu corrects sequences and has 99% identity or greater with PacBio or Nanopore sequences, for the best accuracy we recommend polishing with a sequence-specific tool. We recommend Quiver for PacBio and Nanopolish for Oxford Nanopore data.
If you have Illumina sequences available, Pilon can also be used to polish either PacBio or Oxford Nanopore assemblies.
Further Reading
See the FAQ page for commonly-asked questions and the release notes page for information on what’s changed and known issues.