bcftools

bcftools is fairly complex: it has about 20 commands, and each command has many options of its own.

annotate .. edit VCF files, add or remove annotations
call .. SNP/indel calling (former "view")
cnv .. Copy Number Variation caller
concat .. concatenate VCF/BCF files from the same set of samples
consensus .. create consensus sequence by applying VCF variants
convert .. convert VCF/BCF to other formats and back
csq .. haplotype aware consequence caller
filter .. filter VCF/BCF files using fixed thresholds
gtcheck .. check sample concordance, detect sample swaps and contamination
index .. index VCF/BCF
isec .. intersections of VCF/BCF files
merge .. merge VCF/BCF files from non-overlapping sample sets
mpileup .. multi-way pileup producing genotype likelihoods
norm .. normalize indels
plugin .. run user-defined plugin
polysomy .. detect contaminations and whole-chromosome aberrations
query .. transform VCF/BCF into user-defined formats
reheader .. modify VCF/BCF header, change sample names
roh .. identify runs of homo/auto-zygosity
sort .. sort VCF/BCF files
stats .. produce VCF/BCF stats (former vcfcheck)
view .. subset, filter and convert VCF and BCF files

The filtering options are described below.

1、bcftools filter (using the -i option as an example)

-i, --include EXPRESSION ：
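include only sites for which EXPRESSION is true. For valid expressions see EXPRESSIONS. (Sites matching the filter expression are kept; note that this is a filter expression, not a regular expression.)
Valid expressions may contain: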
1.1、numerical constants, string constants, file names (this is currently supported only to filter by the ID column)

1, 1.0, 1e-4
"String"
@file_name

1.2、arithmetic operators

+,*,-,/

1.3、comparison operators

== (same as =), >, >=, <=, <, !=

1.4、regex operator "~" and its negation "!~". The expressions are case sensitive unless "/i" is added.

INFO/HAYSTACK ~ "needle"
INFO/HAYSTACK ~ "NEEDless/i"

1.5、parentheses

(, )

1.6、logical operators. See the examples below and the filtering tutorial on the distinction between "&&" vs "&" and "||" vs "|".

&&,  &, ||,  |

1.7、INFO tags, FORMAT tags, column names

INFO/DP or DP
FORMAT/DV, FMT/DV, or DV
FILTER, QUAL, ID, CHROM, POS, REF, ALT[0]

1.8、1 (or 0) to test the presence (or absence) of a flag

FlagA=1 && FlagB=0

1.9、"." to test missing values

DP=".", DP!=".", ALT="."

2.0、missing genotypes can be matched regardless of phase and ploidy (".|.", "./.", ".") using these expressions

GT~"\.", GT!~"\."

2.1、sample genotype: reference (haploid or diploid), alternate (hom or het, haploid or diploid), missing genotype, homozygous, heterozygous, haploid, ref-ref hom, alt-alt hom, ref-alt het, alt-alt het, haploid ref, haploid alt (case-insensitive)

GT="ref"
GT="alt"
GT="mis"
GT="hom"
GT="het"
GT="hap"
GT="RR"
GT="AA"
GT="RA" or GT="AR"
GT="Aa" or GT="aA"
GT="R"
GT="A"

2.2、TYPE for variant type in REF,ALT columns (indel,snp,mnp,ref,bnd,other,overlap). Use the regex operator "~" to require at least one allele of the given type or the equal sign "=" to require that all alleles are of the given type. Compare

TYPE="snp"
TYPE~"snp"
TYPE!="snp"
TYPE!~"snp"
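The difference between "=" and "~" for TYPE can be pictured with a small Python sketch. This is a toy model, not bcftools code: the function names are made up for illustration, and each allele is assumed to be already labelled with its type. The negated forms are not modelled here.

```python
# Toy model of TYPE="x" versus TYPE~"x"; names are hypothetical, not bcftools API.

def type_equals(allele_types, wanted):
    """TYPE="wanted": every allele must be of the given type."""
    return len(allele_types) > 0 and all(t == wanted for t in allele_types)

def type_matches(allele_types, wanted):
    """TYPE~"wanted": at least one allele must be of the given type."""
    return any(t == wanted for t in allele_types)

# Hypothetical multiallelic site with one SNP allele and one indel allele.
mixed_site = ["snp", "indel"]
print(type_equals(mixed_site, "snp"))   # False
print(type_matches(mixed_site, "snp"))  # True
```

A site whose alleles are all SNPs satisfies both forms; only mixed sites separate the two.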

2.3、array subscripts (0-based), "*" for any element, "-" to indicate a range. Note that for querying FORMAT vectors, the colon ":" can be used to select a sample and an element of the vector, as shown in the examples below

INFO/AF[0] > 0.3             .. first AF value bigger than 0.3
FORMAT/AD[0:0] > 30          .. first AD value of the first sample bigger than 30
FORMAT/AD[0:1]               .. first sample, second AD value
FORMAT/AD[1:0]               .. second sample, first AD value
DP4[*] == 0                  .. any DP4 value
FORMAT/DP[0]   > 30          .. DP of the first sample bigger than 30
FORMAT/DP[1-3] > 10          .. samples 2-4
FORMAT/DP[1-]  < 7           .. all samples but the first
FORMAT/DP[0,2-4] > 20        .. samples 1, 3-5
FORMAT/AD[0:1]               .. first sample, second AD field
FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field
FORMAT/AD[*:1] or AD[:1]        .. any sample, second AD field
(DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
CSQ[*] ~ "missense_variant.*deleterious"
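As a rough illustration of how the subscript notation above maps to 0-based indices, here is a minimal Python sketch. The helper `expand_subscript` is hypothetical and only covers the forms listed above ("*", an empty subscript, single indices, comma lists, closed and open-ended ranges); the actual bcftools parser may behave differently in edge cases.

```python
# Hypothetical helper: expand a subscript string into 0-based indices below n.

def expand_subscript(spec, n):
    """Expand e.g. "0,2-4", "1-" or "*" for a vector (or sample list) of length n."""
    if spec in ("*", ""):
        return list(range(n))          # any element
    picked = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            start = int(lo)
            stop = int(hi) if hi else n - 1   # "1-" means "from 1 to the end"
            picked.extend(range(start, min(stop, n - 1) + 1))
        else:
            picked.append(int(part))
    return [i for i in picked if 0 <= i < n]

print(expand_subscript("0,2-4", 6))  # [0, 2, 3, 4], i.e. samples 1, 3-5
print(expand_subscript("1-", 4))     # [1, 2, 3], i.e. all samples but the first
```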

2.4、with many samples it can be more practical to provide a file with sample names, one sample name per line

GT[@samples.txt]="het" & binom(AD)<0.01

2.5、function on FORMAT tags (over samples) and INFO tags (over vector fields): maximum; minimum; arithmetic mean (AVG is synonymous with MEAN); median; standard deviation; sum; string length; absolute value; number of elements (matching columns for FORMAT tags or number of fields for INFO tags).

MAX, MIN, AVG, MEAN, MEDIAN, STDEV, SUM, STRLEN, ABS, COUNT

2.6、two-tailed binomial test. Note that for N=0 the test evaluates to a missing value and when FORMAT/GT is used to determine the vector indices, it evaluates to 1 for homozygous genotypes.

binom(FMT/AD)                .. GT can be used to determine the correct index
binom(AD[0],AD[1])           .. or the fields can be given explicitly
phred(binom())               .. the same as binom but phred-scaled
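To get a feel for the numbers, the following Python sketch implements a textbook two-tailed exact binomial test and a phred conversion. It is an assumption that this matches what bcftools computes: the success probability of 0.5 and the "sum all outcomes no more likely than the observed one" definition are choices made here for illustration, and the function names are hypothetical.

```python
from math import comb, log10

def binom_two_tailed(a, b, p=0.5):
    """Two-tailed exact binomial p-value for counts a and b (assumed p = 0.5)."""
    n = a + b
    if n == 0:
        return None                      # missing value for N=0, as noted above
    probs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    observed = probs[a]
    # Sum the probabilities of all outcomes at most as likely as the observed one.
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)

def phred(pvalue):
    """Phred-scale a p-value; assumes pvalue > 0."""
    return -10 * log10(pvalue)

print(binom_two_tailed(0, 10))   # 0.001953125
print(binom_two_tailed(0, 0))    # None
```

With balanced counts such as 5 and 5 the p-value is close to 1, so a threshold like binom(AD)<0.01 flags only strongly skewed allele depths.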

2.7、variables calculated on the fly if not present: number of alternate alleles; number of samples; count of alternate alleles; minor allele count (like AC, but counting the less frequent allele, so that MAF never exceeds 0.5); frequency of alternate alleles (AF=AC/AN); frequency of minor alleles (MAF=MAC/AN); number of alleles in called genotypes; number of samples with missing genotype; fraction of samples with missing genotype; indel length (deletions negative, insertions positive)

N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING, ILEN
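The relationships between these variables (AF=AC/AN, MAF=MAC/AN, and so on) can be checked by hand with a small Python sketch. It is only a model under stated assumptions: the site is biallelic, a sample counts as missing only when all of its alleles are ".", and `site_counts` is a hypothetical helper, not part of bcftools.

```python
import re

def site_counts(genotypes):
    """Derive AN, AC, MAC, AF, MAF, N_MISSING, F_MISSING from GT strings.

    Assumes a biallelic site; genotypes look like "0/1", "1|1" or "./.".
    """
    alleles = []
    n_missing = 0
    for gt in genotypes:
        calls = re.split(r"[/|]", gt)
        if all(c == "." for c in calls):
            n_missing += 1               # assumption: fully missing genotype only
        alleles.extend(c for c in calls if c != ".")
    an = len(alleles)                    # alleles in called genotypes
    ac = sum(1 for c in alleles if c != "0")
    mac = min(ac, an - ac)               # count of the less frequent allele
    return {
        "N_SAMPLES": len(genotypes),
        "AN": an,
        "AC": ac,
        "MAC": mac,
        "AF": ac / an if an else None,
        "MAF": mac / an if an else None,
        "N_MISSING": n_missing,
        "F_MISSING": n_missing / len(genotypes) if genotypes else None,
    }

counts = site_counts(["0/1", "1|1", "./.", "1/1"])
print(counts["AN"], counts["AC"], counts["MAC"])  # 6 5 1
```

In this example AF is 5/6 while MAF is 1/6, which shows why MAF rather than AF is the natural variable for rare-variant cutoffs.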

2.8、the number (N_PASS) or fraction (F_PASS) of samples which pass the expression

N_PASS(GQ>90 & GT!="mis") > 90
F_PASS(GQ>90 & GT!="mis") > 0.9

2.9、custom perl filtering. Note that this command is not compiled in by default, see the section Optional Compilation with Perl in the INSTALL file for help and misc/demo-flt.pl for a working example. The demo defined the perl subroutine "severity" which can be invoked from the command line as follows:

perl:path/to/script.pl; perl.severity(INFO/CSQ) > 3

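Notes:

String comparisons and regular expressions are case-insensitive. Variable and function names are case-insensitive, but tag names are not. For example, "qual" can be used instead of "QUAL" and "strlen()" instead of "STRLEN()", but not "dp" instead of "DP". When querying multiple values, all elements are tested and the OR logic is applied to the result. For example, when querying "TAG=1,2,3,4", the expressions evaluate as follows: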

-i 'TAG[*]=1'   .. true, the record will be printed
-i 'TAG[*]!=1'  .. true
-e 'TAG[*]=1'   .. false, the record will be discarded
-e 'TAG[*]!=1'  .. false
-i 'TAG[0]=1'   .. true
-i 'TAG[0]!=1'  .. false
-e 'TAG[0]=1'   .. false
-e 'TAG[0]!=1'  .. true
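The table above can be reproduced with a few lines of Python. This is a sketch of the stated OR logic only; `record_printed` is a hypothetical name, and the return value means "the record is printed", matching the true/false column of the table.

```python
# Model of the OR logic over multiple values for TAG=1,2,3,4 (not bcftools code).

def record_printed(values, op, target, mode, index=None):
    """mode "i" models -i (include), mode "e" models -e (exclude).

    index=None models TAG[*]; an integer models TAG[index].
    """
    picked = values if index is None else [values[index]]
    if op == "=":
        hits = [v == target for v in picked]
    else:                                # op == "!="
        hits = [v != target for v in picked]
    expr_true = any(hits)                # OR over all tested elements
    return expr_true if mode == "i" else not expr_true

TAG = [1, 2, 3, 4]
print(record_printed(TAG, "=", 1, "i"))    # True
print(record_printed(TAG, "!=", 1, "i"))   # True
print(record_printed(TAG, "!=", 1, "e"))   # False
```

The point to notice is that -i 'TAG[*]!=1' and -e 'TAG[*]=1' are not equivalent: the first keeps the record, the second discards it.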

Examples:

MIN(DV)>5
MIN(DV/DP)>0.3
MIN(DP)>10 & MIN(DV)>3
FMT/DP>10  & FMT/GQ>10 .. both conditions must be satisfied within one sample
FMT/DP>10 && FMT/GQ>10 .. the conditions can be satisfied in different samples
QUAL>10 |  FMT/GQ>10   .. true for sites with QUAL>10 or a sample with GQ>10, but selects only samples with GQ>10
QUAL>10 || FMT/GQ>10   .. true for sites with QUAL>10 or a sample with GQ>10, plus selects all samples at such sites
TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)
COUNT(GT="hom")=0      .. no homozygous genotypes at the site
AVG(GQ)>50             .. average (arithmetic mean) of genotype qualities bigger than 50
ID=@file       .. selects lines with ID present in the file
ID!=@~/file    .. skip lines with ID present in the ~/file
MAF[0]<0.05    .. select rare variants at 5% cutoff
POS>=100   .. restrict your range query, e.g. 20:100-200 to strictly sites with POS in that range.

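Shell expansion:

Note that expressions must often be quoted, because some characters have a special meaning in the shell. Here is an example of an expression enclosed in single quotes, which makes the shell pass the whole expression to the program as intended: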

bcftools view -i '%ID!="." & MAF[0]<0.01'

------------------------Filtering-------------

1、Filtering on fixed columns

Fixed columns such as QUAL, FILTER and INFO can be filtered directly. For example:

bcftools query -e'FILTER="."' -f'%CHROM %POS %FILTER\n' file.bcf   # exclude records whose FILTER field is "."

bcftools query -i'QUAL>20 && DP>10' -f'%CHROM %POS %QUAL %DP\n' file.bcf | head -2  # keep only sites with QUAL above 20 and depth (DP) above 10

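2、FORMAT columns

When filtering on FORMAT tags, the OR logic is applied across all samples rather than to each sample separately. For example, if we want to remove sites with a missing genotype in any sample, the expression -i 'GT!="."' will not work as expected; the reverse logic `-e 'GT ="."'` must be used instead:

bcftools query -i 'GT!="."' # does not work: a site is kept as soon as one sample has a called genotype
bcftools query -e 'GT ="."' # the reverse logic works: a site is dropped as soon as one sample is missing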
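Why the two commands differ becomes clearer when the OR logic across samples is written out. The Python sketch below is a simplified model, assuming the per-sample results are simply combined with "any"; `site_kept` is a hypothetical name.

```python
# Model of -i 'GT!="."' versus -e 'GT="."' with OR logic across samples.

def is_missing(gt):
    """True for ".", "./." or ".|." style genotypes."""
    return all(call == "." for call in gt.replace("|", "/").split("/"))

def site_kept(gts, mode):
    missing = [is_missing(gt) for gt in gts]
    if mode == "i":
        # -i 'GT!="."': true if ANY sample has a called genotype
        return any(not m for m in missing)
    # -e 'GT="."': excluded if ANY sample has a missing genotype
    return not any(missing)

site = ["0/1", "./."]            # hypothetical site with one missing sample
print(site_kept(site, "i"))      # True  (site survives despite the missing sample)
print(site_kept(site, "e"))      # False (site is removed)
```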

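3、FORMAT columns and boolean operators (`&&` vs `&` and `||` vs `|`)

Suppose we want SNP sites where one or more samples have both sufficient coverage (DP>10) and sufficient genotype quality (GQ>20):

bcftools query -i'FMT/DP>10 & FMT/GQ>20' -f'%POS[\t%SAMPLE:DP=%DP GQ=%GQ]\n' file.bcf       ## -i 'FMT/DP>10 & FMT/GQ>20' selects sites where both conditions are satisfied within the same sample

On the other hand, if we want any site where at least one sample has DP>10 and at least one sample has GQ>20, but not necessarily the same sample, we use the && operator instead of &: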

bcftools query -i'FMT/DP>10 && FMT/GQ>20' -f'%POS[\t%SAMPLE:DP=%DP GQ=%GQ]\n' file.bcf
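The distinction can be sketched in Python as follows. This models only the site-level decision described above, with hypothetical function names and made-up sample values.

```python
# Model of 'FMT/DP>10 & FMT/GQ>20' versus 'FMT/DP>10 && FMT/GQ>20'.

def single_amp(samples):
    """&: both conditions must hold within one and the same sample."""
    return any(s["DP"] > 10 and s["GQ"] > 20 for s in samples)

def double_amp(samples):
    """&&: each condition may be satisfied by a different sample."""
    return (any(s["DP"] > 10 for s in samples)
            and any(s["GQ"] > 20 for s in samples))

# Hypothetical site: one sample has the depth, the other has the quality.
site = [{"DP": 15, "GQ": 5}, {"DP": 3, "GQ": 40}]
print(single_amp(site))  # False
print(double_amp(site))  # True
```

Any site selected by & is also selected by &&, so && is the more permissive of the two.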

The | operator can be used to select only the matching samples:

bcftools query -f'[%POS %SAMPLE %DP\n]\n' -i 'FMT/DP=19 | FMT/DP="."' test/view.filter.vcf

When || is used, the records of all samples are printed (that is, if at least one sample matches at a site, every sample at that site is shown):

bcftools query -f '[%POS %SAMPLE %DP\n]\n' -i 'FMT/DP=19 || FMT/DP="."' test/view.filter.vcf
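Which samples end up in the output under | and || can be modelled like this. The sketch assumes the condition from the two commands above (DP equal to 19, or DP missing) and uses None to stand for "."; `printed_samples` is a hypothetical helper.

```python
# Model of sample selection for 'FMT/DP=19 | FMT/DP="."' versus the || form.

def printed_samples(dps, op):
    """Return the indices of samples printed at one site; None stands for "."."""
    match = [dp == 19 or dp is None for dp in dps]
    if not any(match):
        return []                        # site does not pass at all
    if op == "|":
        return [i for i, m in enumerate(match) if m]   # matching samples only
    return list(range(len(dps)))         # "||": all samples at a passing site

site = [19, 7, None]                     # hypothetical per-sample DP values
print(printed_samples(site, "|"))        # [0, 2]
print(printed_samples(site, "||"))       # [0, 1, 2]
```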

Filtering example:

bcftools filter -i 'ABS(SVLEN)>50 & ABS(SVLEN)<100000 & DV>10' -Oz --threads 8 -o B1952.filter.clean.vcf.gz B1952.ngmlr_sniffle.vcf   # e.g. keep records with 50 < |SVLEN| < 100000 and DV>10

Useful link:

http://samtools.github.io/bcftools/howtos/filtering.html
