现在BWA大家基本上只用其mem算法了,无论是二代还是三代比对到参考基因组上,BWA应用得最多的就是在重测序方面。

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM - arXiv:1303.3997v2

摘要

BWA-MEM is a new alignment algorithm for aligning sequence reads or assembly contigs against a large reference
genome such as human.  MEM是bwa中最新的算法,基本取代了前两种,目的是将测序reads或组装的contig比对到Reference上。

It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment.   MEM自动选择局部或全局比对,支持paired-end和嵌合体比对。

The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases.   MEM算法很健壮,支持多种错误类型,可以应用到一系列长度的reads比对上,简单来说就是可以比对短reads,又可以比对长的contig,还可以比对长的PB reads,它们的错误率都是不同的。(但是最好是比对到Reference上,而且不能三代比三代,不能二代比三代,也不能二代比二代)

For mapping 100bp sequences, BWA-MEM shows better performance than several state-of-art read aligners to date.

引言

Most short-reads mappers for next-generation sequencing (NGS) data were developed when sequence reads were about 36bp in length. For 36bp reads, it is reasonable to require end-to-end alignment (i.e. every read base to be aligned to the reference) and to only report hits within certain hamming or edit distance.  2013年,大多数的NGS的比对工具都是为了36bp开发的,36bp基本就都是全局比对了,因此只输出特定编辑距离的hits(现在不是了,最少也有50bp吧,长一点的都300bp了)

However, with emerging technologies and improved chemistry, NGS reads are not short any more, which poses new challenges to read alignment. 和现在的PacBio一样,reads的长度一变,比对的工具也要大革新

For 100bp or longer reads, it becomes more important to allow long gaps under the affine-gap penalty and to report multiple nonoverlapping local hits potentially caused by structural variations or misassemblies in the reference genome. 重点来了:对于长reads,在affine-gap penalty(防射空位罚分,即区分了gap open和gap extend) 允许长 gaps 非常重要,而且需要输出SV和Ref错误组装引起的多个不重叠的局部hits。参见:防射空位罚分

之前的许多算法不支持长reads比对,有些成熟的算法支持sanger序列比对,但他们很慢,而且不适合分析大规模的NGS数据,所以急需开发针对NGS的长reads比对。

目前,已有一些long-read alignment algorithms:BWA-SW、Bowtie2、Cushaw2、GEM。但都有不足:

  1. BWA-SW is slower than bowtie2 for 100bp reads at a comparable accuracy and less accurate then Cushaw2 at a comparable speed.
  2. Bowtie2 and Cushaw2 are slower for 600bp reads (see Results).
  3. GEM is both fast and accurate for up to approximately 1000bp reads, it mandates end-to-end alignment and does not perform affine-gap alignment, which limits its uses for long-read alignment.

Even for typical sequence reads ranged from 100bp to 1000bp in length, no mappers work well across the full spectrum.  这就是开发BWA-MEM的目的,全面适应100bp to 1000bp 的reads的比对。

方法

Aligning a single query sequence

1.Seeding and re-seeding

BWA-MEM follows the canonical seed-and-extend paradigm. It initially seeds an alignment with supermaximal exact matches (SMEMs) using an algorithm we found previously (Li, 2012, Algorithm 5), which essentially finds at each query position the longest exact match covering the position.  BWA依旧沿用了之前的 seed-and-extend 策略,它使用之前的算法,用 supermaximal exact matches (SMEMs)来seeds一个比对,找到每个查询位点的 longest exact match 来覆盖该位点。

Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly (SMEMs)

However, occasionally the true alignment may not contain any SMEMs. To reduce mismappings caused by missing seeds, we introduce re-seeding. 然而,真实的比对不一定包含任何SMEMs,为了减少丢失seeds引起的错误比对,我们引入了re-seeding过程。

Suppose we have a SMEM of length l with k occurrences in the reference genome. If l is too large (over 28bp by default), we re-seed with the longest exact matches that cover the middle base of the SMEM and occur at least k + 1 times in the genome. Such seeds can be found by requiring a minimum occurrence in the original SMEM algorithm.  假设我们有一个长l、出现k次的SMEM ,如果l太长,我们会re-seed longest exact matches,使其包含SMEM中间的碱基且最少出现k+1次。这样,seeds就会需要一个最小的出现次数在原来的SMEM算法中。(完全不知所云)

2.Chaining and chain filtering

We call a group of seeds that are colinear and close to each other as a chain. We greedily chain the seeds while seeding and then filter out short chains that are largely contained in a long chain and are much worse than the long chain (by default, both 50% and 38bp shorter than the long chain). Chain filtering aims to reduce unsuccessful seed extension at a later step. Each chain may not always correspond to a final hit. Chains detected here do not need to be accurate.  我们称一群共线性且彼此接近的seeds为一个chain,我们贪婪的将seeds链在一起,在seeding的时候,然后过滤掉短的chains。 过滤chain的目的在于减少下一步中的不成功的seed extension。每一个chain不一定总是对应一个最终的hit。这个时候的chains的检测不一定要非常准确。

3.Seed extension

We rank a seed by the length of the chain it belongs to and then by the seed length. For each seed in the ranked list,
from the best to the worst, we drop the seed if it is already contained in an alignment found before, or extend the seed with a banded affine-gap-penalty dynamic programming (DP) if it potentially leads to a new alignment.

BWA-MEM’s seed extension differs from the standard seed extension in two aspects.

Firstly, suppose at a certain extension step we come to reference position x with the best extension score achieved at query position y. BWAMEM will stop extension if the difference between the best extension score and the score at (x, y) is larger than Z+|x−y|×pgapExt, where pgapExt is the gap extension penalty and Z is an arbitrary cutoff. This heuristic avoids extension through a poorly aligned region with good flanking alignment. It is similar to the X-dropoff heuristic in BLAST (Altschul et al., 1997), but does not penalize long gaps in one of the reference or the query sequences.

Secondly, while extending a seed, BWA-MEM tries to keep track of the best extension score reaching the end of the query sequence. If the difference between the best score reaching the end and the best local alignment score
is below a threshold, the local alignment will be rejected even if it has a higher score. BWA-MEMuses this strategy to automatically choose between local and end-to-end alignments. It may align through true variations towards
the end of a read and thus reduces reference bias, while avoids introducing excessive mismatches and gaps which may happen when we force a chimeric read into an end-to-end alignment.

Paired-end mapping

1.Rescuing missing hits

2.Pairing

结果和讨论

BWA MEM算法的更多相关文章

  1. Secondary ,Supplementary alignment 和bwa mem的-M -Y参数

    1.supplementary alignment supplementary alignment是指一条read的一部分和参考区域1比对成功,另一部分和参考区域2比对成功,参考区域1和参考区域2没有 ...

  2. 比对工具之 BWA 使用方法

    BWA算法简介: BWA-bactrack BWA-SW BWA-MEM BWA安装: # installing BWA .tar.bz2 -C /opt/biosoft/ cd /opt/bioso ...

  3. BWA/BWT 比对软件

    名称    bwa –   Burrows-Wheeler  Alignment Tool 内容摘要描述命令行与选项SAM 比对格式短序列比对注意事项  比对精确性  估计插入大小分布  内存需求  ...

  4. bwa比对软件的使用以及其结果文件(sam)格式说明

    一.bwa比对软件的使用 1.对参考基因组构建索引 bwa index -a bwtsw hg19.fa   #  -a 参数:is[默认] or bwtsw,即bwa构建索引的两种算法,两种算法都是 ...

  5. Michael Schatz - 序列比对课程

    Michael Schatz - Cold Spring Harbor Laboratory 最近在研究 BWA mem 序列比对算法,直接去看论文,看不懂,论文就3页,太精简了,好多背景知识都不了解 ...

  6. 全面解读第四代基因测序技术Oxford Nanopore--转载

    纳米孔测序技术(又称第四代测序技术)是最近几年兴起的新一代测序技术.目前测序长度可以达到150kb.这项技术开始于90年代,经历了三个主要的技术革新:一.单分子DNA从纳米孔通过:二.纳米孔上的酶对于 ...

  7. samtools 工具

    软件地址: http://www.htslib.org/ 功能三大版块 : Samtools Reading/writing/editing/indexing/viewing SAM/BAM/CRAM ...

  8. GATK--数据预处理,质控,检测变异

    版权声明:本文源自 解螺旋的矿工, 由 XP 整理发表,共 13781 字. 转载请注明:从零开始完整学习全基因组测序(WGS)数据分析:第4节 构建WGS主流程 | Public Library o ...

  9. NGS中的一些软件功能介绍

    1.bowtie 短序列比对工具,blast也是短序列比对工具,速度快,结果易理解. 输入可以是fastq或者fasta文件. 生成比对结果文件sam格式的吧. 2.bwa 转自:https://ww ...

随机推荐

  1. Http TCP/IP 协议的关系

    转自:http://www.cnblogs.com/ymy124/archive/2012/03/18/2404958.html 项目要求Web服务是高安全级别,在选择.net remoting,we ...

  2. mac 10.11 cocopods注意的地方

    最近安装cocoapods,遇到些新问题,安装过程纠结了一天,先是ruby版本的问题,解决掉了,后来又是ruby下载cocoapods慢的问题,尝试了好几遍都下载不成功.最后也是不断尝试和查询,算是安 ...

  3. newtonsoft.json 序列化,反序列化

    public class Book { public string BookID { get; set; } public DateTime PublishDate { get; set; } pub ...

  4. 终端更新ubuntu系统

    1.sudo apt-get update 2. sudo apt-get dist-upgrade

  5. 刷新各ifream当前页,下拉项改变触发事件js,给选中项加背景色js

    <script type="text/javascript" language="javascript"> //刷新框架各页面 function r ...

  6. MiniUI中DataGrid数据的载入

    本文将介绍一下,在ASP.NET MVC环境下,如何用Jquery MiniUI中的Datagrid控件载入数据. 1.效果展示: 2.具体步骤: 1>  既然是在MVC里,那我们的界面自然选择 ...

  7. 电脑安装Android4.0虚拟机的做法

    在开始教程之前,先给大家展示一下成功运行Android 4.0虚拟机的界面,经过笔者测试,体验很流畅,喜欢DIY和对开发感兴趣的朋友们可以猛击下一页,继续浏览教程. 准备工作 在体验前我们首先要下载J ...

  8. HTML和CSS的复习总结

    HTML(Hypertext Markup Language)超文本标记语言:其核心就是各种标记!<html> HTML页面中的所有内容,都在该标签之内:它主要含<head>和 ...

  9. 161229、SpringMVC的各种参数绑定方式

    1. 基本数据类型(以int为例,其他类似): Controller代码: @RequestMapping("saysth.do") public void test(int co ...

  10. angularJS和jQuery的区别

    问题: 假如我熟悉利用jQuery去开发客户端应用,那么我怎么上手angularjs,能否描述下所需要的模式转变,下面这些问题能够帮助你给出一个回答: 1.在设计客户端web应用的时候有什么区别,最大 ...