




ALLPATHS‐LG requires a minimum of 2 paired‐end libraries – one short and one long. The short library average separation size must be slightly less than twice the read size, such that the reads from a pair will likely overlap – for example, for 100 base reads the insert size should be 180 bases. The distribution of sizes should be as small as possible, with a standard deviation of less than 20%. The long library insert size should be approximately 3000 bases long and can have a larger size distribution. Additional optional longer insert libraries can be used to help disambiguate larger repeat structures and may be generated at lower coverage.

A fragment library is a library with a short insert separation, less than twice the read length, so that the reads may overlap (e.g., 100bp Illumina reads taken from 180bp inserts). A jumping library has a longer separation, typically in the 3kbp‐10kbp range, and may include sheared or EcoP15I libraries or other jumping‐library construction; ALLPATHS can handle read chimerism in jumping libraries. Note that fragment reads should be long enough to ensure the overlap.
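The fragment-library constraints above come down to simple arithmetic, which can be sketched as a small shell check. The helper name check_fragment_library is made up for illustration and is not part of ALLPATHS‐LG. It assumes integer sizes in bases and reads "a standard deviation of less than 20%" as 20% of the mean insert size.

```shell
# Hypothetical helper, not part of ALLPATHS-LG.
# Usage: check_fragment_library READ_LEN INSERT_MEAN INSERT_SD
check_fragment_library() {
    read_len=$1; insert_mean=$2; insert_sd=$3
    # Expected overlap of the two reads of a pair, in bases.
    overlap=$(( 2 * read_len - insert_mean ))
    if [ "$overlap" -le 0 ]; then
        echo "no overlap: insert ($insert_mean) >= 2 x read length ($read_len)"
        return 1
    fi
    # Assumption: "less than 20%" means sd < 0.2 * mean, i.e. sd * 5 < mean.
    if [ $(( insert_sd * 5 )) -ge "$insert_mean" ]; then
        echo "insert size spread too wide: sd=$insert_sd, mean=$insert_mean"
        return 1
    fi
    echo "ok: expected overlap of $overlap bases"
}

check_fragment_library 100 180 18
# prints: ok: expected overlap of 20 bases
```

For the 100 base reads and 180 base inserts mentioned above, the two reads of a pair are expected to overlap by 20 bases.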

If you have a reference genome, provide it.


1. The base, quality score, and pairing information files in the DATA directory, like this:


2. A ploidy file must also be present. The ploidy file is a single-line file containing a number. The specific file name is:



Use the bundled Perl script PrepareAllPathsInputs.pl. This script needs two configuration files:

in_groups.csv and in_libs.csv.

CSV stands for comma-separated values.


Fields of in_groups.csv:
group_name: a UNIQUE nickname for this specific data set.
library_name: the library to which the data set belongs.
file_name: the absolute path to the data file.
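
As a sketch, an in_groups.csv for one fragment and one jumping library could be written as below. The group names, library names, and paths are placeholders made up for illustration, and the header row naming the three fields is my assumption about what the script expects.

```shell
# Hypothetical example: names and paths are placeholders, not real data.
cat > in_groups.csv <<'EOF'
group_name,library_name,file_name
frag_1,lib_frag_180,/path/to/reads/frag_180.bam
jump_1,lib_jump_3k,/path/to/reads/jump_3k.bam
EOF
```

Each row ties one data file to a library; the library_name values are the ones that in_libs.csv has to describe.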


Fields of in_libs.csv:
library_name: matches the same field in in_groups.csv.
project_name: a string naming the project.
organism_name: the organism.
type: fragment, jumping, EcoP15, etc. This field is only informative.
paired: 0: Unpaired reads; 1: paired reads.
frag_size: average number of bases in the fragments (only defined for FRAGMENT libraries).
frag_stddev: estimated standard deviation of the fragment sizes (only defined for FRAGMENT libraries).
insert_size: average number of bases in the inserts (only defined for JUMPING libraries; if larger than 20 kb, the library is considered to be a LONG JUMPING library).
insert_stddev: estimated standard deviation of the insert sizes (only defined for JUMPING libraries).
read_orientation: inward or outward. Outward oriented reads will be reversed.
genomic_start: index of the FIRST genomic base in the reads. If non‐zero, all the bases before genomic_start will be trimmed out.
genomic_end: index of the LAST genomic base in the reads. If non‐zero, all the bases after genomic_end will be trimmed out.
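
A matching in_libs.csv could look like the sketch below, with the twelve fields in the order listed above. All values are illustrative assumptions: the project and organism names are placeholders, the fragment library uses the 180 base example size, and the jumping library is assumed to be a 3000 base library with outward-facing reads. Fields that are not defined for a library type are left empty.

```shell
# Hypothetical example: every value here is a placeholder for illustration.
cat > in_libs.csv <<'EOF'
library_name,project_name,organism_name,type,paired,frag_size,frag_stddev,insert_size,insert_stddev,read_orientation,genomic_start,genomic_end
lib_frag_180,my_project,my_organism,fragment,1,180,18,,,inward,0,0
lib_jump_3k,my_project,my_organism,jumping,1,,,3000,300,outward,0,0
EOF
```

The fragment row fills frag_size and frag_stddev and leaves the insert fields empty; the jumping row does the opposite.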

Once these two files are ready, you can run the Perl script.
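
Because library_name has to match between the two files, a quick pre-flight check can catch mismatches before the script is run. This is only a sketch: check_library_names is a made-up helper, and it assumes both files have a header row, library_name in column 2 of in_groups.csv and column 1 of in_libs.csv, and no quoted commas.

```shell
# Hypothetical helper, not part of ALLPATHS-LG.
# Usage: check_library_names IN_GROUPS_CSV IN_LIBS_CSV
# Prints each library_name used in in_groups.csv that in_libs.csv does not define,
# and returns non-zero if any are missing.
check_library_names() {
    awk -F',' '
        # Trim surrounding whitespace from the first two fields.
        { gsub(/^[ \t]+|[ \t]+$/, "", $1); gsub(/^[ \t]+|[ \t]+$/, "", $2) }
        # First file (in_libs.csv): remember every library_name, skipping the header.
        NR == FNR { if (FNR > 1) known[$1] = 1; next }
        # Second file (in_groups.csv): report unknown library names.
        FNR > 1 && !($2 in known) { print $2; missing = 1 }
        END { exit missing + 0 }
    ' "$2" "$1"
}
```

It can be called as check_library_names in_groups.csv in_libs.csv; no output and a zero exit status mean every group refers to a described library.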

mkdir sht_genome

cd sht_genome

mkdir mydata

PrepareAllPathsInputs.pl \
DATA_DIR='full path to REFERENCE DIR'/mydata \
PICARD_TOOLS_DIR='path to picard tools' \
IN_GROUPS_CSV=in_groups.csv \
IN_LIBS_CSV=in_libs.csv \
PHRED_64=0 \
PLOIDY=2

IN_GROUPS_CSV and IN_LIBS_CSV can be omitted when the files are in the current directory. PHRED_64=0 is the default.

DATA_DIR: is the location of the ALLPATHS DATA directory where the converted reads will be placed.

PICARD_TOOLS_DIR: is the path to the Picard tools needed for data conversion, if your data is in BAM format.

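IN_GROUPS_CSV: the location of the in_groups.csv file you prepared. It can be omitted if the file is in the current directory.

IN_LIBS_CSV: the location of the in_libs.csv file you prepared. It can be omitted if the file is in the current directory.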

INCLUDE_NON_PF_READS: whether to include reads that did not pass the signal purity filter (PF): 0 keeps only reads that passed the filter, 1 also includes reads that did not.

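PHRED_64: 0 means the base qualities are encoded as Phred+33; 1 means they are encoded as Phred+64.

PLOIDY: generates the ploidy file. Use 1 for a haploid genome and 2 for a diploid genome.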
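
If it is unclear which value PHRED_64 should take, the quality lines of a FASTQ file can give a hint. The helper below is a rough sketch and not part of ALLPATHS‐LG: guess_phred_64 is a made-up name, and it assumes an uncompressed FASTQ file with exactly four lines per record. Characters with ASCII codes below 59 can only occur with the Phred+33 encoding, so seeing one suggests PHRED_64=0. The reverse is weaker: a file containing only high Phred+33 scores would be reported as 1, so treat that answer as a guess.

```shell
# Hypothetical helper: guess the PHRED_64 value for an uncompressed FASTQ file.
# Usage: guess_phred_64 reads.fastq
guess_phred_64() {
    # Quality strings are every 4th line of a 4-line-per-record FASTQ.
    # The bracket range covers ASCII 33 ('!') through 58 (':') in the C locale.
    if awk 'NR % 4 == 0' "$1" | LC_ALL=C grep -q '[!-:]'; then
        echo 0   # low characters only occur with Phred+33
    else
        echo 1   # no low characters seen: consistent with Phred+64
    fi
}
```

On a large file it is enough to pass the first few thousand records, for example through head, since low quality characters usually show up early.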



My working directory: /share/bioinfo/miaochenyong/ZLS_HEGU_genome_assembly/try_assemble_allpaths



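REFERENCE: the name of your organism. If you have a reference genome, convert it with Fasta2Fastb and put it in this directory; the reference files are named genome.fasta and genome.fastb. Create this directory in advance. Conversion command: Fasta2Fastb IN=genome.fasta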

DATA: The DATA directory contains the original read data used in a particular assembly attempt. (This data is stored in internal ALLPATHS formats: fastb, qualb, pairs.) It also contains intermediate files derived from the original data that are independent of the particular assembly attempt – typically files used in evaluation. Create this directory in advance.

Each DATA directory may contain many RUN directories, each representing a particular attempt to assemble the original data using a different set of parameters.

RUN: The RUN directory contains all the non‐localized assembly files, that is, those intermediate files generated from the original read data in preparation for the final assembly stage (LocalizeReadsLG and beyond). It may also contain intermediate files used in evaluation that are dependent on the assembly parameters chosen.


ASSEMBLIES: The ASSEMBLIES directory contains the actual assembly (or assemblies). There is no argument for naming this directory; it is always named ASSEMBLIES.

SUBDIR: The SUBDIR directory is where the localized assembly is generated, along with some assembly intermediate and evaluation files. When running RunAllPathsLG, specify it with SUBDIR=.


Once the read data has been imported you may run the ALLPATHS pipeline as often as desired, each time with different assembly parameters. Each time you run the ALLPATHS pipeline it will determine which modules need to run (or re‐run) depending on the parameters you have chosen. Unless you want to overwrite your previous assembly, specify a new RUN directory each time.

PRE=<user pre>
TARGETS=standard

The value of the TARGETS parameter determines the operations performed by the pipeline:
TARGETS=full_eval: runs a version of the pipeline that includes additional evaluation modules.
TARGETS=standard: runs a streamlined version of the pipeline that skips many of the evaluation modules.

This will create (if it doesn’t already exist) the following pipeline directory structure:
<user pre>/staph/mydata/myrun

Where staph is the REFERENCE directory, mydata is the DATA directory containing the imported data, and myrun is the RUN directory.
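
Putting the directory descriptions together, the location of the assembly follows from the RunAllPathsLG arguments. The sketch below assumes that ASSEMBLIES sits inside the RUN directory and SUBDIR inside ASSEMBLIES; assembly_dir is a hypothetical helper and mysubdir is a placeholder SUBDIR value.

```shell
# Hypothetical helper: compose the assembly directory from the pipeline arguments.
# Usage: assembly_dir PRE REFERENCE_NAME DATA_SUBDIR RUN SUBDIR
assembly_dir() {
    # ASSEMBLIES is a fixed name; every other component comes from an argument.
    printf '%s/%s/%s/%s/ASSEMBLIES/%s\n' "$1" "$2" "$3" "$4" "$5"
}

assembly_dir '<user pre>' staph mydata myrun mysubdir
# prints: <user pre>/staph/mydata/myrun/ASSEMBLIES/mysubdir
```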


nohup RunAllPathsLG \
PRE=/share/bioinfo/miaochenyong/ZLS_HEGU_genome_assembly/try_assemble_allpaths \
REFERENCE_NAME=sht_genome \
DATA_SUBDIR=mydata \
RUN=myrun \
TARGETS=full_eval






