vcf格式示例

##fileformat=VCFv4.1

##FILTER=<ID=LowQual,Description=”Low quality”>

##FORMAT=<ID=AD,Number=.,Type=Integer,Description=”Allelic depths for the ref and alt alleles in the order listed”>

##FORMAT=<ID=DP,Number=1,Type=Integer,Description=”Approximate read depth (reads with MQ=255 or with bad mates are filtered)”>

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=”Genotype Quality”>

##FORMAT=<ID=GT,Number=1,Type=String,Description=”Genotype”>

##FORMAT=<ID=PL,Number=G,Type=Integer,Description=”Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification”>

##INFO=<ID=AC,Number=A,Type=Integer,Description=”Allele count in genotypes, for each ALT allele, in the same order as listed”>

##INFO=<ID=AF,Number=A,Type=Float,Description=”Allele Frequency, for each ALT allele, in the same order as listed”>

##INFO=<ID=AN,Number=1,Type=Integer,Description=”Total number of alleles in called genotypes”>

##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description=”Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities”>

##INFO=<ID=DP,Number=1,Type=Integer,Description=”Approximate read depth; some reads may have been filtered”>

##INFO=<ID=DS,Number=0,Type=Flag,Description=”Were any of the samples downsampled?”>

##INFO=<ID=Dels,Number=1,Type=Float,Description=”Fraction of Reads Containing Spanning Deletions”>

##INFO=<ID=FS,Number=1,Type=Float,Description=”Phred-scaled p-value using Fisher’s exact test to detect strand bias”>

##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description=”Consistency of the site with at most two segregating haplotypes”>

##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description=”Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation”>

##INFO=<ID=MLEAC,Number=A,Type=Integer,Description=”Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed”>

##INFO=<ID=MLEAF,Number=A,Type=Float,Description=”Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed”>

##INFO=<ID=MQ,Number=1,Type=Float,Description=”RMS Mapping Quality”>

##INFO=<ID=MQ0,Number=1,Type=Integer,Description=”Total Mapping Quality Zero Reads”>

##INFO=<ID=MQRankSum,Number=1,Type=Float,Description=”Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities”>

##INFO=<ID=QD,Number=1,Type=Float,Description=”Variant Confidence/Quality by Depth”>

##INFO=<ID=RPA,Number=.,Type=Integer,Description=”Number of times tandem repeat unit is repeated, for each allele (including reference)”>

##INFO=<ID=RU,Number=1,Type=String,Description=”Tandem repeat unit (bases)”>

##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description=”Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias”>

##INFO=<ID=SOR,Number=1,Type=Float,Description=”Symmetric Odds Ratio of 2x2 contingency table to detect strand bias”>

##INFO=<ID=STR,Number=0,Type=Flag,Description=”Variant is a short tandem repeat”>

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0

0:48:1:51,51 1

0:48:8:51,51 1/1:43:5:.,.

vcf字段详解

CHROM： 表示变异位点是在哪个contig 里call出来的，如果是人类全基因组的话那就是chr1…chr22，chrX,Y,M。

POS： 变异位点相对于参考基因组所在的位置，如果是indel，就是第一个碱基所在的位置。

ID： 如果call出来的SNP存在于dbSNP数据库里，就会显示相应的dbSNP里的rs编号。

REF和REF： 在这个变异位点处，参考基因组中所对应的碱基和研究对象基因组中所对应的碱基。

QUAL： 可以理解为所call出来的变异位点的质量值。Q=-10lgP，Q表示质量值；P表示这个位点发生错误的概率。因此，如果想把错误率从控制在90%以上，P的阈值就是1/10，那lg（1/10）=-1，Q=（-10）*（-1）=10。同理，当Q=20时，错误率就控制在了0.01。

FILTER： 理想情况下，QUAL这个值应该是用所有的错误模型算出来的，这个值就可以代表正确的变异位点了，但是事实是做不到的。因此，还需要对原始变异位点做进一步的过滤。无论你用什么方法对变异位点进行过滤，过滤完了之后，在FILTER一栏都会留下过滤记录，如果是通过了过滤标准，那么这些通过标准的好的变异位点的FILTER一栏就会注释一个PASS，如果没有通过过滤，就会在FILTER这一栏提示除了PASS的其他信息。如果这一栏是一个“.”的话，就说明没有进行过任何过滤。

FORMAT标签详解

GT: 表示这个样本的基因型，对于一个二倍体生物，GT值表示的是这个样本在这个位点所携带的两个等位基因。0表示跟REF一样；1表示表示跟ALT一样；2表示第二个ALT。当只有一个ALT 等位基因的时候，0/0表示纯和且跟REF一致；0/1表示杂合，两个allele一个是ALT一个是REF；1/1表示纯和且都为ALT； The most common format subfield is GT (genotype) data. If the GT subfield is present, it must be the first subfield. In the sample data, genotype alleles are numeric: the REF allele is 0, the first ALT allele is 1, and so on. The allele separator is ‘/’ for unphased genotypes and ‘

’ for phased genotypes.

0 - reference call

1 - alternative call 1

2 - alternative call 2

AD: 对应两个以逗号隔开的值，这两个值分别表示覆盖到REF和ALT碱基的reads数，相当于支持REF和支持ALT的测序深度。

DP: 覆盖到这个位点的总的reads数量，相当于这个位点的深度（并不是多有的reads数量，而是大概一定质量值要求的reads数）。

PL: 对应3个以逗号隔开的值，这三个值分别表示该位点基因型是0/0，0/1，1/1的没经过先验的标准化Phred-scaled似然值（L）。如果转换成支持该基因型概率（P）的话，由于L=-10lgP，那么P=10^（-L/10），因此，当L值为0时，P=10^0=1。因此，这个值越小，支持概率就越大，也就是说是这个基因型的可能性越大。

GQ: 表示最可能的基因型的质量值。表示的意义同QUAL。

#CHROM  POS ID      REF ALT QUAL    FILTER  INFO    FORMAT  NA12878

chr1    873762  .       T   G   5231.78 PASS    AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL   0/1:173,141:282:99:255,0,255

chr1    877664  rs3828047   A   G   3931.66 PASS    AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0

chr1    899282  rs28548431  C   T   71.77   PASS    AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26

chr1    974165  rs9442391   T   C   29.84   LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

我们就可以解释上面的例子：

chr1：873762 是一个新发现的T/G变异，并且有很高的可信度（qual=5231.78）。

chr1：877664 是一个已知的变异为A/G 的SNP位点，名字rs3828047，并且具有很高的可信度（qual=3931.66）。

chr1：899282 是一个已知的变异为C/T的SNP位点，名字rs28548431，但可信度较低（qual=71.77）。在这个位点，GT=0/1，也就是说这个位点的基因型是C/T；GQ=25.92，质量值并不算太高，可能是因为cover到这个位点的reads数太少，DP=4，也就是说只有4条reads支持这个地方的变异；AD=1,3，也就是说支持REF的read有一条，支持ALT的有3条；在PL里，这个位点基因型的不确定性就表现的更突出了，0/1的PL值为0，虽然支持0/1的概率很高；但是1/1的PL值只有26，也就是说还有10^(-2.6)=0.25%的可能性是1/1；但几乎不可能是0/0，因为支持0/0的概率只有10^(-10.3)=5*10^-11。

chr1：974165 是一个已知的变异为T/C的SNP位点，名字rs9442391，但是这个位点的质量值很低，被标成了“LowQual”，在后续分析中可以被过滤掉。

VCF格式文件操作工具PyVCF

1.安装

$ pip install pyvcf

2.使用

a.作为python模块使用

>>> import vcf

>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))

>>> for record in vcf_reader:

...     print record

Record(CHROM=20, POS=14370, REF=G, ALT=[A])

Record(CHROM=20, POS=17330, REF=T, ALT=[A])

Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])

Record(CHROM=20, POS=1230237, REF=T, ALT=[None])

Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])

b.Shell命令行使用

pyvcf自带两个两个命令脚本vcf_filter.py和vcf_melt,前一个是过滤脚本，后面一个使vcf格式转换为tab分割字段的文件的脚本

$ vcf_melt < vcf/test/gatk.vcf

SAMPLE      AD      DP      GQ      GT      PL      FILTER  CHROM   POS     REF     ALT     ID      info.AC info.AF info.AN info.BaseQRankSum       info.DB info.DP info.DS info.Dels       info.FS info.HRun       info.HaplotypeScore     info.InbreedingCoeff    info.MQ info.MQ0        info.MQRankSum  info.QD info.ReadPosRankSum

BLANK       6,0     6       18.04   0/0     0,18,211        .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59

NA12878     138,107 250     99.0    0/1     1961,0,3049     .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59

NA12891     169,77  250     99.0    0/1     1038,0,3533     .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59

NA12892     249,0   250     99.0    0/0     0,600,5732      .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59

NA19238     248,1   250     99.0    0/0     0,627,6191      .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59

NA19239     250,0   250     99.0    0/0     0,615,5899      .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59

NA19240     250,0   250     99.0    0/0     0,579,5674      .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59

BLANK       13,4    17      62.64   0/1     63,0,296        .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731

NA12878     118,127 246     99.0    0/1     2396,0,1719     .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731

NA12891     241,0   244     99.0    0/0     0,459,4476      .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731

NA12892     161,85  246     99.0    0/1     1489,0,2353     .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731

NA19238     110,132 242     99.0    0/1     2561,0,1488     .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731

NA19239     106,135 242     99.0    0/1     2613,0,1389     .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731

NA19240     116,126 243     99.0    0/1     2489,0,1537     .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731

参考文章

http://blog.sina.com.cn/s/blog_12d5e3d3c0101qv1u.html

PyVCF

VCF文件处理工具PyVCF的更多相关文章

基因组与Python --PyVCF 好用的vcf文件处理器
vcf文件的全称是variant call file,即突变识别文件,它是基因组工作流程中产生的一种文件,保存的是基因组上的突变信息.通过对vcf文件进行分析,可以得到个体的变异信息.嗯,总之,这是很 ...
文件类型工具类：FileTypeUtil
个人学习,仅供参考! package com.example.administrator.filemanager.utils;import java.io.File;/** * 文件类型工具类 * * ...
bcftools或vcftools提取指定区段的vcf文件（extract specified position ）
下载安装bcftools 见如下命令: bcftools filter 1000Genomes.vcf.gz --regions 9:4700000-4800000 > 4700000-4800 ...
利用SHAPEIT将vcf文件进行基因型（genotype）定相（phasing）：查看两个突变是否来源于同一条链（染色体或父本或母本），two mutations carried by the same read
首先,下载SHAPEIT. 按照里面的步骤安装完后,将vcf文件进行基因型定相,分四步走. 第一步,将vcf文件转化为plink二进制文件(.bed, .bim, .fam). 这一步需要用到GATK ...
VCF文件导入导出
参考资料通讯录导入导出vcf格式文件方法可参考: https://qiaodahai.com/android-iphone-mobile-phones-contacts-import-and-exp ...
putty提供的两个文件传输工具PSCP、PSFTP详细介绍
用 SSH 来传输文件 PuTTY 提供了两个文件传输工具 PSCP (PuTTY Secure Copy client) PSFTP (PuTTY SFTP client) PSCP 通过 SSH ...
[软件推荐]快速文件复制工具(Limit Copy) V4.0 绿色版
快速文件复制工具(Limit Copy)绿色版是一款智能变频超快复制绿色软件. 快速文件复制工具(Limit Copy)功能比较完善,除了文件复制还可以智能变频,直接把要复制的文件拖入窗口即可,无需手 ...
日志文件清理工具V1.1
上次做完日志文件清理工具V1.0 的版本后,确实给自己的工作带来不少的方便.虽然只是一个小工具,代码也比较简单,但有用就是好东西.上次开发比较匆忙,有些细节没来得及完善,今天吃完晚饭,边看亚冠比赛边把 ...
软件打包为exe NSIS单文件封包工具V2.3
NSIS单文件封包工具V2.3 这是一款基于NSIS模块的封包制作工具,lzma算法最大压缩率,支持制作单文件,以及NSIS自定义解压封包. 支持注册dll,exe,reg,bat文件默认提取设置程 ...

随机推荐

【leetcode刷题笔记】Longest Substring Without Repeating Characters
Given a string, find the length of the longest substring without repeating characters. For example, ...
webpack打包笔记
optimist是一个node库,将webpack.config.js与shell参数整合成options对象 options对象包含之后构建的重要信息,类似于webpack.config.js we ...
sdut oj 操作系统实验--SSTF磁盘调度算法【操作系统算法】
操作系统实验--SSTF磁盘调度算法 Time Limit: 1000ms Memory limit: 65536K 有疑问?点这里^_^ 题目描述磁盘调度在多道程序设计的计算机系统中,各个进 ...
在webBrowser中取Cookie的方法
在很多情况下我们会使用间进程的webBrowser去实现一些网页的请求和抓去,这个时候有部分网页是取不到Cookie的,那怎么办呢?下面我提供一个方法,应该99%的都能取到, //取当前webBrow ...
TortoiseGit做push时提示Disconnected: No supported authentication methods available (server sent: publickey)错误
通过Git从远程服务器上获得到自己的项目,但是通过TortoiseGit做push时提示Disconnected: No supported authentication methods availa ...
Spark- Checkpoint原理剖析
Checkpoint,是Spark 提供的一个比较高级的功能.有的时候,比如说,我们的 Spark 应用程序,特别的复杂,然后从初始的RDD开始,到最后拯个应用程序完成,有非常多的步骤,比如超过20个 ...
python3 字符串属性(三）
maketrans 和 translate的用法(配合使用) 下面是python的英文用法解释 maketrans(x, y=None, z=None, /) Return a translation ...
Delpih - Format
Format是一个很常用,却又似乎很烦的方法,本人试图对这个方法的帮助进行一些翻译,让它有一个完整的概貌,以供大家查询之用: 首先看它的声明:function Format(const Format: ...
PS 滤镜——漩涡 vortex
%%% Vortex %%% 漩涡效果 clc; clear all; close all; addpath('E:\PhotoShop Algortihm\Image Processing\PS A ...
cocos2d-x CSV文件读取（Excel生成csv文件）
实现类 CCSVParse.h #ifndef __C_CSV_PARSE__ #define __C_CSV_PARSE__ #include "cocos2d.h" #incl ...

VCF文件处理工具PyVCF