Software download and documentation: http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=12

The raw data should reach a sequencing depth (coverage) of at least 100x.
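
Depth here is the total number of sequenced bases divided by the genome size. A minimal Python sketch for estimating it before starting an assembly; the read count, read length, and genome size are made-up illustration values, not figures from this project:

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Estimated sequencing depth: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical example: 30 million 100 bp reads for a 25 Mb genome.
depth = coverage(num_reads=30_000_000, read_length=100, genome_size=25_000_000)
print(f"{depth:.0f}x")  # 3e9 bases / 2.5e7 bases = 120
```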

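At least two libraries are required: a long-insert library and a short-insert library.

The two reads of each pair in the short library must overlap, and the standard deviation of the short library's insert size distribution should be less than 20%.

The long library should have an insert size of about 3000 bases and may have a wider size distribution.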

ALLPATHS‐LG requires a minimum of 2 paired‐end libraries – one short and one long. The short library average separation size must be slightly less than twice the read size, such that the reads from a pair will likely overlap – for example, for 100 base reads the insert size should be 180 bases. The distribution of sizes should be as small as possible, with a standard deviation of less than 20%. The long library insert size should be approximately 3000 bases long and can have a larger size distribution. Additional optional longer insert libraries can be used to help disambiguate larger repeat structures and may be generated at lower coverage.

A fragment library is a library with a short insert separation, less than twice the read length, so that the reads may overlap (e.g., 100bp Illumina reads taken from 180bp inserts.) A jumping library has a longer separation, typically in the 3kbp‐10kbp range, and may include sheared or EcoP15I libraries or other jumping‐library construction; ALLPATHS can handle read chimerism in jumping libraries. Note that fragment reads should be long enough to ensure the overlap.
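
The two requirements on the short (fragment) library can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and it assumes that "a standard deviation of less than 20%" means less than 20% of the mean insert size.

```python
def check_fragment_library(read_length, insert_size, insert_stddev):
    """Return a list of problems with a fragment library; empty if none found."""
    problems = []
    # Reads of a pair can only overlap if the insert is shorter than two reads.
    if insert_size >= 2 * read_length:
        problems.append("insert size is not less than twice the read length")
    # Assumption: the 20% limit is relative to the mean insert size.
    if insert_stddev >= 0.20 * insert_size:
        problems.append("standard deviation is not below 20% of the insert size")
    return problems


print(check_fragment_library(100, 180, 18))  # 100 bp reads, 180 bp inserts
print(check_fragment_library(100, 300, 90))  # too long and too variable
```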

PacBio data can now be included as well, but only for fungal genomes.

If you have a reference genome, provide it.

Input files required by ALLPATHS:

1. Base, quality score, and pairing information files in the DATA directory, like this:

<REF>/<DATA>/frag_reads_orig.fastb
<REF>/<DATA>/frag_reads_orig.qualb
<REF>/<DATA>/frag_reads_orig.pairs
<REF>/<DATA>/jump_reads_orig.fastb
<REF>/<DATA>/jump_reads_orig.qualb
<REF>/<DATA>/jump_reads_orig.pairs

2. A ploidy file must also be present. The ploidy file is a single-line file containing a number. The specific file name is:

<REF>/<DATA>/ploidy
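
Before launching a long assembly it can be worth confirming that all seven files are in place. A small Python sketch, assuming only the file names listed above; the helper name and the temporary demo directory are hypothetical:

```python
import tempfile
from pathlib import Path

# File names taken from the list above.
REQUIRED_INPUTS = [
    "frag_reads_orig.fastb", "frag_reads_orig.qualb", "frag_reads_orig.pairs",
    "jump_reads_orig.fastb", "jump_reads_orig.qualb", "jump_reads_orig.pairs",
    "ploidy",
]


def missing_inputs(data_dir):
    """Return the required input files that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_INPUTS if not (data_dir / name).is_file()]


# Demo on a throwaway directory that contains only a ploidy file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ploidy").write_text("2\n")
    demo_missing = missing_inputs(tmp)
print(demo_missing)  # the six read files are reported as missing
```

In real use, pass the path of your own DATA directory instead of the temporary one.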

How to generate these input files:

Use the bundled Perl script PrepareAllPathsInputs.pl. The script needs two configuration files:

in_groups.csv and in_libs.csv.

CSV stands for comma-separated values.

First, the in_groups.csv file:

group_name: a UNIQUE nickname for this specific data set.
library_name: the library to which the data set belongs.
file_name: the absolute path to the data file.
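
As an illustration, the three columns can be written with Python's csv module. This is a sketch only: the group names, library names, and BAM paths are hypothetical, and it assumes a header row followed by plain comma-separated values; check the format against the ALLPATHS-LG manual before use.

```python
import csv
import io

GROUP_FIELDS = ["group_name", "library_name", "file_name"]


def groups_csv(rows):
    """Render rows (dicts keyed by GROUP_FIELDS) as in_groups.csv text."""
    names = [row["group_name"] for row in rows]
    if len(names) != len(set(names)):
        raise ValueError("group_name values must be unique")
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=GROUP_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical data sets: one fragment lane and one jumping lane.
rows = [
    {"group_name": "frag_lane1", "library_name": "frag_180",
     "file_name": "/data/example/frag_lane1.bam"},
    {"group_name": "jump_lane1", "library_name": "jump_3k",
     "file_name": "/data/example/jump_lane1.bam"},
]
groups_text = groups_csv(rows)
print(groups_text)
```

Saving groups_text to a file named in_groups.csv gives the first configuration file.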

Next, the in_libs.csv file, which describes your libraries:

library_name: matches the same field in in_groups.csv.
project_name: a string naming the project.
organism_name: the organism.
type: fragment, jumping, EcoP15, etc. This field is only informative.
paired: 0: Unpaired reads; 1: paired reads.
frag_size: average number of bases in the fragments (only defined for FRAGMENT libraries).
frag_stddev: estimated standard deviation of the fragments sizes (only defined for FRAGMENT libraries).
insert_size: average number of bases in the inserts (only defined for JUMPING libraries; if larger than 20 kb, the library is considered to be a LONG JUMPING library).
insert_stddev: estimated standard deviation of the inserts sizes (only defined for JUMPING libraries).
read_orientation: inward or outward. Outward oriented reads will be reversed.
genomic_start: index of the FIRST genomic base in the reads. If non‐zero, all the bases before genomic_start will be trimmed out.
genomic_end: index of the LAST genomic base in the reads. If non‐zero, all the bases after genomic_end will be trimmed out.
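
A sketch of how these fields fit together for one fragment and one jumping library, again using Python's csv module. Every value is hypothetical (library names, sizes, orientations), and fields that are "only defined" for the other library type are simply left empty, which is an assumption about the expected format.

```python
import csv
import io

# Column names in the order listed above.
LIB_FIELDS = [
    "library_name", "project_name", "organism_name", "type", "paired",
    "frag_size", "frag_stddev", "insert_size", "insert_stddev",
    "read_orientation", "genomic_start", "genomic_end",
]


def libs_csv(rows):
    """Render library descriptions as in_libs.csv text; missing fields stay empty."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=LIB_FIELDS, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


common = {"project_name": "demo_project", "organism_name": "demo_organism",
          "paired": 1, "genomic_start": 0, "genomic_end": 0}
rows = [
    # Fragment library: only frag_size / frag_stddev are filled in.
    {**common, "library_name": "frag_180", "type": "fragment",
     "frag_size": 180, "frag_stddev": 18, "read_orientation": "inward"},
    # Jumping library: only insert_size / insert_stddev are filled in.
    {**common, "library_name": "jump_3k", "type": "jumping",
     "insert_size": 3000, "insert_stddev": 300, "read_orientation": "outward"},
]
libs_text = libs_csv(rows)
print(libs_text)
```

The library_name values must match those used in in_groups.csv.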

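Once these two files are ready, you can run the Perl script:

mkdir sht_genome
cd sht_genome
mkdir mydata

PrepareAllPathsInputs.pl \
    DATA_DIR=<full path to REFERENCE dir>/mydata \
    PICARD_TOOLS_DIR=<path to Picard tools> \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    INCLUDE_NON_PF_READS=1 \
    PHRED_64=0 \
    PLOIDY=2

IN_GROUPS_CSV and IN_LIBS_CSV may be omitted when the files are in the current directory. INCLUDE_NON_PF_READS=1 and PHRED_64=0 are the default values. Do not put spaces around the "=" signs.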

DATA_DIR: is the location of the ALLPATHS DATA directory where the converted reads will be placed.

PICARD_TOOLS_DIR: is the path to the Picard tools needed for data conversion, if your data is in BAM format.

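IN_GROUPS_CSV: location of the in_groups.csv file you prepared; may be omitted if the file is in the current directory.

IN_LIBS_CSV: location of the in_libs.csv file you prepared; may be omitted if the file is in the current directory.

INCLUDE_NON_PF_READS: whether to include reads that did not pass the signal purity filter (PF). 1 (the default) includes them; 0 keeps only reads that passed the filter.

PHRED_64: 0 (the default) means base qualities are encoded as Phred+33; 1 means Phred+64.

PLOIDY: generates the ploidy file; use 1 for a haploid genome and 2 for a diploid genome.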
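
When the preparation step is scripted, the NAME=VALUE arguments can be assembled programmatically. The Python sketch below only builds the argument list and prints it. The function name and example path are hypothetical, only the parameters described above are used, and the script name should be adjusted if it is spelled differently in your installation.

```python
import shlex


def prepare_inputs_command(data_dir, ploidy, picard_tools_dir=None,
                           in_groups_csv="in_groups.csv",
                           in_libs_csv="in_libs.csv",
                           include_non_pf_reads=1, phred_64=0):
    """Build the argument list for the input preparation script."""
    if ploidy not in (1, 2):
        raise ValueError("ploidy must be 1 (haploid) or 2 (diploid)")
    if phred_64 not in (0, 1):
        raise ValueError("phred_64 must be 0 (Phred+33) or 1 (Phred+64)")
    args = [
        "PrepareAllPathsInputs.pl",
        f"DATA_DIR={data_dir}",
        f"IN_GROUPS_CSV={in_groups_csv}",
        f"IN_LIBS_CSV={in_libs_csv}",
        f"INCLUDE_NON_PF_READS={include_non_pf_reads}",
        f"PHRED_64={phred_64}",
        f"PLOIDY={ploidy}",
    ]
    # Picard is only needed when the input data is in BAM format.
    if picard_tools_dir is not None:
        args.append(f"PICARD_TOOLS_DIR={picard_tools_dir}")
    return args


cmd = prepare_inputs_command("/path/to/sht_genome/mydata", ploidy=2)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the paths point at real data.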

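Once the script finishes, all the input files ALLPATHS needs are ready.

Now for the directory structure:

My working directory: /share/bioinfo/miaochenyong/ZLS_HEGU_genome_assembly/try_assemble_allpaths

This is where I will run the ALLPATHS assembly.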

REFERENCE/DATA/RUN/ASSEMBLIES/SUBDIR

REFERENCE: the name of your organism. If you have a reference genome, convert it with Fasta2Fastb (Fasta2Fastb IN=genome.fasta) and place it in this directory; the reference files must be named genome.fasta and genome.fastb. Create this directory in advance.

DATA: The DATA directory contains the original read data used in a particular assembly attempt. (This data is stored in internal ALLPATHS formats: fastb, qualb, pairs.) It also contains intermediate files derived from the original data that are independent of the particular assembly attempt, typically files used in evaluation. Create this directory in advance.

Each DATA directory may contain many RUN directories, each representing a particular attempt to assemble the original data using a different set of parameters.

RUN: The RUN directory contains all the non‐localized assembly files, that is, those intermediate files generated from the original read data in preparation for the final assembly stage (LocalizeReadsLG and beyond). It may also contain intermediate files used in evaluation that are dependent on the assembly parameters chosen. It is specified with RUN= when running RunAllPathsLG.

ASSEMBLIES: The ASSEMBLIES directory contains the actual assembly (or assemblies). There is no argument for naming this directory; it is always named ASSEMBLIES.

SUBDIR: The SUBDIR directory is where the localized assembly is generated, along with some assembly intermediate and evaluation files. It is specified with SUBDIR= when running RunAllPathsLG.
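
To make the nesting concrete, the Python sketch below composes the five directory levels from a root directory (the PRE argument of RunAllPathsLG) and the names chosen for each level. The function name and the root path are hypothetical; only the nesting order REFERENCE/DATA/RUN/ASSEMBLIES/SUBDIR is taken from the description above.

```python
from pathlib import PurePosixPath


def pipeline_dirs(pre, reference_name, data_subdir, run, subdir):
    """Map each directory level to its full path under the PRE root."""
    reference = PurePosixPath(pre) / reference_name
    data = reference / data_subdir
    run_dir = data / run
    assemblies = run_dir / "ASSEMBLIES"  # fixed name, cannot be changed
    return {
        "REFERENCE": reference,
        "DATA": data,
        "RUN": run_dir,
        "ASSEMBLIES": assemblies,
        "SUBDIR": assemblies / subdir,
    }


dirs = pipeline_dirs("/path/to/pre", "sht_genome", "mydata", "myrun", "MCY_ASB")
for level, path in dirs.items():
    print(f"{level:<10} {path}")
```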

Next, run ALLPATHS.

Once the read data has been imported you may run the ALLPATHS pipeline as often as desired, each time with different assembly parameters. Each time you run the ALLPATHS pipeline it will determine which modules need to run (or re‐run) depending on the parameters you have chosen. Unless you want to overwrite your previous assembly, specify a new RUN directory each time.

RunAllPathsLG \
    PRE=<user pre> \
    DATA_SUBDIR=mydata \
    RUN=myrun \
    REFERENCE_NAME=staph \
    TARGETS=standard

The value of the TARGETS parameter determines the operations performed by the pipeline. TARGETS=full_eval runs a version of the pipeline that includes additional evaluation modules. TARGETS=standard runs a streamlined version of the pipeline that skips many of the evaluation modules.

This will create (if it doesn’t already exist) the following pipeline directory structure:
<user pre>/staph/mydata/myrun

Where staph is the REFERENCE directory, mydata is the DATA directory containing the imported data, and myrun is the RUN directory.

The command I actually used:

nohup RunAllPathsLG \
    PRE=/share/bioinfo/miaochenyong/ZLS_HEGU_genome_assembly/try_assemble_allpaths \
    REFERENCE_NAME=sht_genome \
    DATA_SUBDIR=mydata \
    RUN=myrun \
    EVALUATION=STANDARD \
    TARGETS=full_eval \
    SUBDIR=MCY_ASB \
    THREADS=40 \
    OVERWRITE=True &

freemao

FAFU
