bcftools
bcftools is quite complex: it has roughly 20 commands, and each command has many options of its own.
annotate .. edit VCF files, add or remove annotations
call .. SNP/indel calling (former "view")
cnv .. Copy Number Variation caller
concat .. concatenate VCF/BCF files from the same set of samples
consensus .. create consensus sequence by applying VCF variants
convert .. convert VCF/BCF to other formats and back
csq .. haplotype aware consequence caller
filter .. filter VCF/BCF files using fixed thresholds
gtcheck .. check sample concordance, detect sample swaps and contamination
index .. index VCF/BCF
isec .. intersections of VCF/BCF files
merge .. merge VCF/BCF files from non-overlapping sample sets
mpileup .. multi-way pileup producing genotype likelihoods
norm .. normalize indels
plugin .. run user-defined plugin
polysomy .. detect contaminations and whole-chromosome aberrations
query .. transform VCF/BCF into user-defined formats
reheader .. modify VCF/BCF header, change sample names
roh .. identify runs of homo/auto-zygosity
sort .. sort VCF/BCF files
stats .. produce VCF/BCF stats (former vcfcheck)
view .. subset, filter and convert VCF and BCF files
The filtering options are described below.
1. bcftools filter (using the -i option as an example)
- -i, --include EXPRESSION :
- include only sites for which EXPRESSION is true. For valid expressions see EXPRESSIONS. (Sites matching the expression are kept.)
- Expressions can contain the following:
1.1 numerical constants, string constants, file names (this is currently supported only to filter by the ID column)
1, 1.0, 1e-4
"String"
@file_name
1.2 arithmetic operators
+,*,-,/
1.3 comparison operators
== (same as =), >, >=, <=, <, !=
1.4 regex operators "~" and its negation "!~". The expressions are case sensitive unless "/i" is added.
INFO/HAYSTACK ~ "needle"
INFO/HAYSTACK ~ "NEEDless/i"
1.5 parentheses
(, )
1.6 logical operators. See the examples below and the filtering tutorial for the distinction between "&&" vs "&" and "||" vs "|".
&&, &, ||, |
1.7 INFO tags, FORMAT tags, column names
INFO/DP or DP
FORMAT/DV, FMT/DV, or DV
FILTER, QUAL, ID, CHROM, POS, REF, ALT[0]
1.8 1 (or 0) to test the presence (or absence) of a flag
FlagA=1 && FlagB=0
1.9 "." to test missing values
DP=".", DP!=".", ALT="."
2.0 missing genotypes can be matched regardless of phase and ploidy (".|.", "./.", ".") using these expressions
GT~"\.", GT!~"\."
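Why the single pattern "\." covers all three spellings of a missing genotype can be checked with plain regex matching. A minimal sketch in Python, assuming the genotype is matched as the text it has in the VCF; `gt_is_missing` is a hypothetical name used only here.

```python
import re

def gt_is_missing(gt):
    """Model of GT~"\\.": true when the genotype string contains a literal dot."""
    return re.search(r"\.", gt) is not None

for gt in [".|.", "./.", ".", "0/1", "1|1", "0"]:
    print(gt, gt_is_missing(gt))
```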
2.1 sample genotype: reference (haploid or diploid), alternate (hom or het, haploid or diploid), missing genotype, homozygous, heterozygous, haploid, ref-ref hom, alt-alt hom, ref-alt het, alt-alt het, haploid ref, haploid alt (case-insensitive)
GT="ref"
GT="alt"
GT="mis"
GT="hom"
GT="het"
GT="hap"
GT="RR"
GT="AA"
GT="RA" or GT="AR"
GT="Aa" or GT="aA"
GT="R"
GT="A"
2.2 TYPE for variant type in REF,ALT columns (indel,snp,mnp,ref,bnd,other,overlap). Use the regex operator "~" to require at least one allele of the given type or the equal sign "=" to require that all alleles are of the given type. Compare
TYPE="snp"
TYPE~"snp"
TYPE!="snp"
TYPE!~"snp"
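The difference between "=" (all alleles) and "~" (at least one allele) can be sketched as `all` versus `any` over the ALT alleles. The classifier below is a deliberately crude assumption (single-base substitutions are "snp", length changes are "indel", everything else "other"); bcftools itself distinguishes more types, and all function names are hypothetical.

```python
def allele_type(ref, alt):
    """Very simplified allele classifier, for illustration only."""
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) != len(alt):
        return "indel"
    return "other"

def type_equals(ref, alts, wanted):
    """Model of TYPE="wanted": every ALT allele must be of that type."""
    return all(allele_type(ref, alt) == wanted for alt in alts)

def type_matches(ref, alts, wanted):
    """Model of TYPE~"wanted": at least one ALT allele must be of that type."""
    return any(allele_type(ref, alt) == wanted for alt in alts)

# A site with one SNP allele and one insertion allele
print(type_equals("A", ["G", "AT"], "snp"))   # not all alleles are SNPs
print(type_matches("A", ["G", "AT"], "snp"))  # one allele is a SNP
```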
2.3 array subscripts (0-based), "*" for any element, "-" to indicate a range. Note that for querying FORMAT vectors, the colon ":" can be used to select a sample and an element of the vector, as shown in the examples below
INFO/AF[0] > 0.3 .. first AF value bigger than 0.3
FORMAT/AD[0:0] > 30 .. first AD value of the first sample bigger than 30
FORMAT/AD[0:1] .. first sample, second AD value
FORMAT/AD[1:0] .. second sample, first AD value
DP4[*] == 0 .. any DP4 value
FORMAT/DP[0] > 30 .. DP of the first sample bigger than 30
FORMAT/DP[1-3] > 10 .. samples 2-4
FORMAT/DP[1-] < 7 .. all samples but the first
FORMAT/DP[0,2-4] > 20 .. samples 1, 3-5
FORMAT/AD[0:1] .. first sample, second AD field
FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field
FORMAT/AD[*:1] or AD[:1] .. any sample, second AD field
(DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
CSQ[*] ~ "missense_variant.*deleterious"
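To make the subscript notation concrete, here is a small Python sketch that expands a single-level subscript such as "0,2-4" or "1-" into 0-based indices. `expand_subscript` is a hypothetical helper, it ignores the "sample:element" colon form, and it does no bounds checking.

```python
def expand_subscript(spec, n):
    """Return the 0-based indices selected by a subscript, given n elements."""
    if spec in ("*", ""):
        return list(range(n))                  # "*" means any element
    picked = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            first = int(start)
            last = int(end) if end else n - 1  # open range such as "1-"
            picked.extend(range(first, last + 1))
        else:
            picked.append(int(part))
    return picked

dp = [25, 8, 31, 22, 40, 5]                    # made-up per-sample DP values
idx = expand_subscript("0,2-4", len(dp))       # samples 1, 3-5
print(idx, any(dp[i] > 20 for i in idx))
```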
2.4 with many samples it can be more practical to provide a file with sample names, one sample name per line
GT[@samples.txt]="het" & binom(AD)<0.01
2.5 function on FORMAT tags (over samples) and INFO tags (over vector fields): maximum; minimum; arithmetic mean (AVG is synonymous with MEAN); median; standard deviation; sum; string length; absolute value; number of elements (matching columns for FORMAT tags or number of fields for INFO tags).
MAX, MIN, AVG, MEAN, MEDIAN, STDEV, SUM, STRLEN, ABS, COUNT
2.6 two-tailed binomial test. Note that for N=0 the test evaluates to a missing value and when FORMAT/GT is used to determine the vector indices, it evaluates to 1 for homozygous genotypes.
binom(FMT/AD) .. GT can be used to determine the correct index
binom(AD[0],AD[1]) .. or the fields can be given explicitly
phred(binom()) .. the same as binom but phred-scaled
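For intuition, a two-tailed binomial test on two allele depths can be written in a few lines of Python. This is a sketch under the assumption that the null hypothesis is a 50/50 split between the two counts; it is not bcftools' own numerical routine, and it does not model the GT-based index selection or the "1 for homozygous genotypes" rule.

```python
from math import comb, log10

def binom_two_tailed(a, b):
    """Two-tailed exact binomial p-value for counts a and b, assuming p = 0.5."""
    n = a + b
    if n == 0:
        return None                 # N=0: the result is a missing value
    k = min(a, b)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)  # symmetric distribution: double one tail

def phred(p):
    """Phred-scale a non-zero probability."""
    return -10 * log10(p)

print(binom_two_tailed(5, 5))       # balanced depths
print(binom_two_tailed(0, 10))      # strongly skewed depths
print(phred(binom_two_tailed(0, 10)))
```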
2.7 variables calculated on the fly if not present: number of alternate alleles; number of samples; count of alternate alleles; minor allele count (similar to AC but is always smaller than 0.5); frequency of alternate alleles (AF=AC/AN); frequency of minor alleles (MAF=MAC/AN); number of alleles in called genotypes; number of samples with missing genotype; fraction of samples with missing genotype; indel length (deletions negative, insertions positive)
N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING, ILEN
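The relationships AF=AC/AN and MAF=MAC/AN can be sketched for a biallelic site as follows. This is an illustrative model only: `site_stats` is a hypothetical function, a sample is assumed to count as missing only when all of its alleles are ".", and multiallelic sites are not handled.

```python
def site_stats(genotypes):
    """Derive AN, AC, MAC, AF, MAF, N_MISSING and F_MISSING from GT strings."""
    alleles = []
    n_missing = 0
    for gt in genotypes:
        calls = gt.replace("|", "/").split("/")
        if all(c == "." for c in calls):
            n_missing += 1                       # fully missing genotype
        alleles.extend(c for c in calls if c != ".")
    an = len(alleles)                            # alleles in called genotypes
    ac = sum(1 for a in alleles if a != "0")     # alternate allele count
    mac = min(ac, an - ac)                       # minor allele count
    return {
        "AN": an,
        "AC": ac,
        "MAC": mac,
        "AF": ac / an if an else None,
        "MAF": mac / an if an else None,
        "N_MISSING": n_missing,
        "F_MISSING": n_missing / len(genotypes) if genotypes else None,
    }

print(site_stats(["0/0", "0/1", "0/0", "./."]))
```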
2.8 the number (N_PASS) or fraction (F_PASS) of samples which pass the expression
N_PASS(GQ>90 & GT!="mis") > 90
F_PASS(GQ>90 & GT!="mis") > 0.9
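N_PASS and F_PASS amount to counting the samples for which the inner expression holds. A rough Python equivalent, with made-up sample data and hypothetical helper names:

```python
def n_pass(samples, predicate):
    """Number of samples for which the per-sample predicate is true."""
    return sum(1 for s in samples if predicate(s))

def f_pass(samples, predicate):
    """Fraction of samples for which the per-sample predicate is true."""
    return n_pass(samples, predicate) / len(samples)

def good_call(sample):
    """Model of GQ>90 & GT!="mis" evaluated within one sample."""
    return sample["GQ"] is not None and sample["GQ"] > 90 and "." not in sample["GT"]

samples = [
    {"GT": "0/1", "GQ": 95},
    {"GT": "1/1", "GQ": 99},
    {"GT": "0/0", "GQ": 50},
    {"GT": "./.", "GQ": None},
]
print(n_pass(samples, good_call), f_pass(samples, good_call))
```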
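2.9 custom perl filtering. Note that this command is not compiled in by default, see the section Optional Compilation with Perl in the INSTALL file for help and misc/demo-flt.pl for a working example. The demo defined the perl subroutine "severity" which can be invoked from the command line as follows:
perl:path/to/script.pl; perl.severity(INFO/CSQ) > 3
Notes:
String comparisons and regular expressions are described here as case-insensitive; note that 1.4 above says regular expressions are case sensitive unless "/i" is added, so adding "/i" explicitly is the safer choice when case should be ignored. Variables and function names are case-insensitive, but tag names are not. For example, "qual" can be used instead of "QUAL" and "strlen()" instead of "STRLEN()", but not "dp" instead of "DP". When querying multiple values, all elements are tested and the OR logic is applied to the result. For example, when querying "TAG=1,2,3,4", it is evaluated as follows:
-i 'TAG[*]=1' .. true, the record will be printed
-i 'TAG[*]!=1' .. true
-e 'TAG[*]=1' .. false, the record will be discarded
-e 'TAG[*]!=1' .. false
-i 'TAG[0]=1' .. true
-i 'TAG[0]!=1' .. false
-e 'TAG[0]=1' .. false
-e 'TAG[0]!=1' .. true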
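The table above follows directly from "test every element, then OR the results". A small Python model reproduces it; here `printed_with_include` and `printed_with_exclude` are hypothetical names, and the return value means "the record is printed".

```python
def expression_true(values, predicate):
    """OR logic: the expression is true if any tested element satisfies it."""
    return any(predicate(v) for v in values)

def printed_with_include(values, predicate):
    """-i prints the record when the expression is true."""
    return expression_true(values, predicate)

def printed_with_exclude(values, predicate):
    """-e prints the record when the expression is false."""
    return not expression_true(values, predicate)

tag = [1, 2, 3, 4]                  # TAG=1,2,3,4
eq1 = lambda v: v == 1
ne1 = lambda v: v != 1

for label, values in (("TAG[*]", tag), ("TAG[0]", tag[:1])):
    print(label, "-i =1 ", printed_with_include(values, eq1))
    print(label, "-i !=1", printed_with_include(values, ne1))
    print(label, "-e =1 ", printed_with_exclude(values, eq1))
    print(label, "-e !=1", printed_with_exclude(values, ne1))
```

Note that with `TAG[*]` both `=1` and `!=1` are true at the same time, which is why `-e 'TAG[*]!=1'` is not the complement of `-i 'TAG[*]=1'`.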
Examples:
MIN(DV)>5
MIN(DV/DP)>0.3
MIN(DP)>10 & MIN(DV)>3
FMT/DP>10 & FMT/GQ>10 .. both conditions must be satisfied within one sample
FMT/DP>10 && FMT/GQ>10 .. the conditions can be satisfied in different samples
QUAL>10 | FMT/GQ>10 .. true for sites with QUAL>10 or a sample with GQ>10, but selects only samples with GQ>10
QUAL>10 || FMT/GQ>10 .. true for sites with QUAL>10 or a sample with GQ>10, plus selects all samples at such sites
TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)
COUNT(GT="hom")=0 .. no homozygous genotypes at the site
AVG(GQ)>50 .. average (arithmetic mean) of genotype qualities bigger than 50
ID=@file .. selects lines with ID present in the file
ID!=@~/file .. skip lines with ID present in the ~/file
MAF[0]<0.05 .. select rare variants at 5% cutoff
POS>=100 .. restrict your range query, e.g. 20:100-200 to strictly sites with POS in that range.
Shell expansion:
Note that expressions must often be quoted because some characters have special meaning in the shell. An example of an expression enclosed in single quotes, which causes the whole expression to be passed to the program as intended:
bcftools view -i '%ID!="." & MAF[0]<0.01'
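What the single quotes achieve can be checked without running bcftools at all: Python's standard `shlex` module splits a command line using POSIX shell quoting rules, so it shows which arguments the program would receive. This only illustrates quoting; `shlex` does not model shell operators such as `&` or `<`, which a real shell would additionally interpret in the unquoted case.

```python
import shlex

quoted = """bcftools view -i '%ID!="." & MAF[0]<0.01'"""
argv = shlex.split(quoted)
print(argv)            # the expression arrives as one intact argument

unquoted = """bcftools view -i %ID!="." & MAF[0]<0.01"""
loose = shlex.split(unquoted)
print(loose)           # the expression is split up and its double quotes are lost
```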
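------------------------Filtering-------------
1. Filtering on fixed columns
Fixed columns such as QUAL, FILTER and INFO can be filtered directly. For example:
bcftools query -e'FILTER="."' -f'%CHROM %POS %FILTER\n' file.bcf # exclude records whose FILTER field is "."
bcftools query -i'QUAL>20 && DP>10' -f'%CHROM %POS %QUAL %DP\n' file.bcf | head -2 # keep only sites with QUAL above 20 and depth above 10, and show the first two
2. FORMAT columns
When filtering FORMAT tags, the OR logic is applied across samples: the expression is true for a site if it is true for any one sample, not for each sample separately. For example, to remove sites where any sample has a missing genotype, the expression -i 'GT!="."' does not work, because it keeps every site that has at least one called genotype. The inverse logic, -e 'GT="."', must be used instead:
bcftools query -i 'GT!="."' # does not work
bcftools query -e 'GT="."' # the inverse logic works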
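The reason the two expressions differ becomes obvious when the per-site OR over samples is written out. A minimal Python model, with hypothetical helper names and made-up genotypes:

```python
def site_expression_true(genotypes, predicate):
    """FORMAT expressions are true for a site if any sample satisfies them."""
    return any(predicate(gt) for gt in genotypes)

def is_missing(gt):
    return gt in (".", "./.", ".|.")

def kept_by_include_not_missing(genotypes):
    """Model of -i 'GT!="."': kept if ANY sample is non-missing."""
    return site_expression_true(genotypes, lambda gt: not is_missing(gt))

def kept_by_exclude_missing(genotypes):
    """Model of -e 'GT="."': dropped if ANY sample is missing."""
    return not site_expression_true(genotypes, is_missing)

site = ["0/1", "./.", "1/1"]              # one sample has a missing genotype
print(kept_by_include_not_missing(site))  # site is kept, which is not what we want
print(kept_by_exclude_missing(site))      # site is dropped as intended
```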
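3. FORMAT columns and boolean operators (&& vs & and || vs |)
Suppose we want sites where one or more samples have sufficiently high coverage (DP>10) and genotype quality (GQ>20):
bcftools query -i'FMT/DP>10 & FMT/GQ>20' -f'%POS[\t%SAMPLE:DP=%DP GQ=%GQ]\n' file.bcf # -i 'FMT/DP>10 & FMT/GQ>20' selects sites where both conditions are satisfied within the same sample
On the other hand, if both conditions must be met at the site but not necessarily in the same sample, use the && operator instead of &:
bcftools query -i'FMT/DP>10 && FMT/GQ>20' -f'%POS[\t%SAMPLE:DP=%DP GQ=%GQ]\n' file.bcf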
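The distinction can be modelled as "AND inside each sample, then OR over samples" versus "OR over samples for each condition, then AND". A short Python sketch with made-up values; the function names are hypothetical.

```python
def single_amp(samples):
    """Model of FMT/DP>10 & FMT/GQ>20: both conditions within the same sample."""
    return any(s["DP"] > 10 and s["GQ"] > 20 for s in samples)

def double_amp(samples):
    """Model of FMT/DP>10 && FMT/GQ>20: each condition in some sample, not necessarily the same one."""
    return any(s["DP"] > 10 for s in samples) and any(s["GQ"] > 20 for s in samples)

# Sample A has the depth, sample B has the genotype quality
site = [{"DP": 15, "GQ": 5}, {"DP": 3, "GQ": 40}]
print(single_amp(site))   # no single sample satisfies both
print(double_amp(site))   # the two conditions are met by different samples
```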
The | operator selects only the matching samples:
bcftools query -f'[%POS %SAMPLE %DP\n]\n' -i 'FMT/DP=19 | FMT/DP="."' test/view.filter.vcf
When || is used, whole records are selected: if any one sample matches at a site, all samples at that site are printed:
bcftools query -f '[%POS %SAMPLE %DP\n]\n' -i 'FMT/DP=19 || FMT/DP="."' test/view.filter.vcf
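Which samples end up in the output under | and || can be sketched as follows. This models only the sample selection described above for the expression FMT/DP=19 or FMT/DP="."; `selected_samples` is a hypothetical helper, and `None` stands for a missing DP value.

```python
def selected_samples(dps, whole_site):
    """Return indices of samples printed for FMT/DP=19 | FMT/DP="." (or || when whole_site)."""
    matches = [dp == 19 or dp is None for dp in dps]
    if not any(matches):
        return []                          # the site does not pass at all
    if whole_site:
        return list(range(len(dps)))       # ||: every sample at a passing site
    return [i for i, m in enumerate(matches) if m]   # |: matching samples only

dps = [19, 7, None, 30]
print(selected_samples(dps, whole_site=False))   # samples matched by |
print(selected_samples(dps, whole_site=True))    # samples printed with ||
```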
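Filtering example:
bcftools filter -i 'SVLEN<100000 | SVLEN< -50 & DV>10' -Oz --threads 8 -o B1952.filter.clean.vcf.gz B1952.ngmlr_sniffle.vcf # e.g. filtering for -50 < absolute SVLEN < 10000 (note that the expression as written uses 100000 as the upper bound)
Useful link: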
http://samtools.github.io/bcftools/howtos/filtering.html