The C++ executable module examples

This page provides usage examples for the executable module. Extended documentation for all of the options can be found on the manual page.

  • Running the program
  • Getting basic file statistics
  • Applying a filter
  • Writing to a new VCF file
  • Writing out to screen
  • Converting a VCF file to BCF
  • Comparing two VCF files
  • Getting allele frequency
  • Getting sequencing depth information
  • Getting linkage disequilibrium statistics
  • Getting Fst population statistics
  • Converting VCF files to PLINK format

Running the program

By default the executable can be found in the bin/ subdirectory. To run the program, type:

./vcftools

The program will return information regarding the version number.

Getting basic file statistics

The executable can be run with only an input VCF file and no other options, in which case it returns basic information about the contents of the file. To specify an input file you must use one of the input options ( --vcf, --gzvcf, or --bcf ), depending on the type of file. For example, for a VCF file called input_data.vcf the following command could be run:

./vcftools --vcf input_data.vcf

It will return information about the file such as the number of variants and the number of individuals in the file.
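As a rough illustration of what these two counts mean, the sketch below tallies them from VCF text in Python. It is not how VCFtools itself is implemented. The function name basic_vcf_stats and the sample data are made up for the example, and the code assumes an uncompressed, tab-separated VCF with the usual nine fixed columns before the sample columns.

```python
import io


def basic_vcf_stats(handle):
    """Return (n_individuals, n_variants) for an uncompressed VCF stream."""
    n_individuals = 0
    n_variants = 0
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or meta-information line
        if line.startswith("#CHROM"):
            # Sample names follow the nine fixed columns (CHROM .. FORMAT).
            n_individuals = max(len(line.split("\t")) - 9, 0)
            continue
        n_variants += 1  # every remaining line is one variant record
    return n_individuals, n_variants


# Hypothetical two-sample, two-variant file used only for illustration.
example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\n"
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\n"
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t1/1\t0/1\n"
)
print(basic_vcf_stats(example))
```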

Beginning with VCFtools v0.1.12, the program can also read input from standard input (stdin). To do this, use any of the normal file type input options followed by the dash - character.

zcat input_data.vcf.gz | ./vcftools --vcf -

Applying a filter

You can use VCFtools to filter out variants or individuals based on the values within the file. For example, to filter the sites within a file based upon their location in the genome, use the options --chr, --from-bp, and --to-bp to specify the region.

./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000

After running this line, the program will return the number of sites in the file that fall within the chromosomal region chr1:1000000-2000000. The option values can be changed to work with any desired region.

Writing to a new VCF file

VCFtools can perform analyses on the variants that pass through the filters or simply write those variants out to a new file. This function is helpful for creating subsets of VCF files or just removing unwanted variants from VCF files. To write out the variants that pass through filters use the --recode option. In addition, use --recode-INFO-all to include all data from the INFO fields in the output. By default INFO fields are not written because many filters will alter the variants in a file, rendering the INFO values incorrect.

./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode --recode-INFO-all

In this example, VCFtools will create a new VCF file containing only variants within the specified chromosomal region while keeping all INFO fields included in the original file.

Any files written out by VCFtools will be in the current working directory and have the prefix ./out.SUFFIX by default. To change the path, specify the new path using the option --out followed by the desired path. The program will add a suffix to that path based on the chosen output function.

./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode --out subset

Writing out to screen

Beginning with VCFtools v0.1.12, the program can also write out to screen instead of having the program write to a specified path. Using the options --stdout or -c will redirect all output to standard out. The output can then be piped into other programs or written out to a specified file name.

./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode --stdout | more

The above example will page the resulting output to the screen, one screenful at a time, for quick inspection of the results.

./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode -c > /home/usr/data/subset.vcf

The above example will redirect the output and write it to the specified file name.

./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode -c | gzip -c > /home/usr/data/subset.vcf.gz

The above example will redirect the output into gzip (assuming it is installed) for compression, and then gzip will write the file to the specified destination.

Converting a VCF file to BCF

Beginning with VCFtools v0.1.11, the program has the ability to read and write BCF files. This means that the program can also convert files between the two formats. This is accomplished in a similar way to the above example, using the --recode-bcf option instead. All output BCF files are automatically compressed using BGZF.

./vcftools --vcf input_data.vcf --recode-bcf --recode-INFO-all --out converted_output

Comparing two VCF files

Using VCFtools, two VCF files can be compared to determine which sites and individuals are shared between them. The first file is declared using the input file options just like any other output function. The second file must be specified using --diff, --gzdiff, or --diff-bcf. There are also advanced options to determine additional discordance between the two files.

./vcftools --vcf input_data.vcf --diff other_data.vcf --out compare

Getting allele frequency

To determine the frequency of each allele over all individuals in a VCF file, the --freq argument is used.

./vcftools --vcf input_data.vcf --freq --out output

The output file will be written to output.frq.

Getting sequencing depth information

Another useful output function summarizes sequencing depth for each individual or for each site. It follows the same basic model as the allele frequency example above.

./vcftools --vcf input_data.vcf --depth -c > depth_summary.txt

With VCFtools, you can use many combinations of filters and an output function. For example, to write out site-wise sequence depths only at sites that have no missing data, include the --max-missing argument.

./vcftools --vcf input_data.vcf --site-depth --max-missing 1.0 --out site_depth_summary

Getting linkage disequilibrium statistics

Linkage disequilibrium between sites can be determined as well. This is accomplished using the --hap-r2, --geno-r2, or --geno-chisq arguments. Since the program must do pairwise site comparisons, this analysis can be time-consuming, so it is recommended to filter the sites first or use one of the other options (--ld-window, --ld-window-bp or --min-r2) to reduce the number of comparisons. In this example, VCFtools will only compare sites within 50,000 base pairs of one another.

./vcftools --vcf input_data.vcf --hap-r2 --ld-window-bp 50000 --out ld_window_50000

Getting Fst population statistics

VCFtools can also calculate Fst statistics between individuals of different populations. It is an estimate calculated in accordance with Weir and Cockerham’s 1984 paper. The user must supply text files that contain lists of individuals (one per line) that are members of each population. The function will work with multiple populations if multiple --weir-fst-pop arguments are used. The following example shows how to perform a per-site Fst calculation with two populations. Other arguments can be used in conjunction with this function, such as --fst-window-size and --fst-window-step.

./vcftools --vcf input_data.vcf --weir-fst-pop population_1.txt --weir-fst-pop population_2.txt --out pop1_vs_pop2

Converting VCF files to PLINK format

VCFtools can convert VCF files into formats convenient for use in other programs. One such example is the ability to convert into PLINK format. The following function will output the variants in .ped and .map files.

./vcftools --vcf input_data.vcf --plink --chr 1 --out output_in_plink


NAME

SYNOPSIS

DESCRIPTION

EXAMPLES

BASIC OPTIONS

SITE FILTERING OPTIONS

INDIVIDUAL FILTERING OPTIONS

GENOTYPE FILTERING OPTIONS

OUTPUT OPTIONS

COMPARISON OPTIONS

AUTHOR

Detailed parameters

NAME

vcftools v0.1.12b − Utilities for the variant call format (VCF) and binary variant call format (BCF)

SYNOPSIS

vcftools [ --vcf FILE | --gzvcf FILE | --bcf FILE ] [ --out OUTPUT PREFIX ] [ FILTERING OPTIONS ] [ OUTPUT OPTIONS ]

DESCRIPTION

vcftools is a suite of
functions for use on genetic variation data in the form of VCF and
BCF files. The tools provided will be used mainly to summarize
data, run calculations on data, filter out data, and convert data
into other useful file formats.

EXAMPLES

Output allele frequency for all sites in the input vcf file from chromosome 1

vcftools --gzvcf input_file.vcf.gz --freq --chr 1 --out chr1_analysis

Output a new vcf file from the input vcf file that removes any indel sites

vcftools --vcf input_file.vcf --remove-indels --recode --recode-INFO-all --out SNPs_only

Output files comparing and summarizing the individuals and sites in two vcf files

vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz --out in1_v_in2

Output a new vcf file to standard out without any sites that have a filter tag, then compress it with gzip

vcftools --gzvcf input_file.vcf.gz --remove-filtered-all --recode --stdout | gzip -c > output_PASS_only.vcf.gz

Output a Hardy-Weinberg p-value for every site in the bcf file that does not have any missing genotypes

vcftools --bcf input_file.bcf --hardy --max-missing 1.0 --out output_noMissing

Output nucleotide diversity at a list of positions

zcat input_file.vcf.gz | vcftools --vcf - --site-pi --positions SNP_list.txt --out nucleotide_diversity

BASIC OPTIONS

These options are used
to specify the input and output files.

INPUT FILE OPTIONS

--vcf

This option defines the VCF file to be
processed. VCFtools expects files in VCF format v4.0, v4.1 or v4.2.
The latter two are supported with some small limitations. If the
user provides a dash character ’-’ as a file name, the program
expects a VCF file to be piped in through standard in.

--gzvcf

This option can be used in place of the
--vcf option to read compressed (gzipped) VCF files directly.

--bcf

This option can be used in place of the
--vcf option to read BCF2 files directly. You do not need to
specify if this file is compressed with BGZF encoding. If the user
provides a dash character ’-’ as a file name, the program expects a
BCF2 file to be piped in through standard in.

OUTPUT FILE OPTIONS

--out

This option defines the output filename
prefix for all files generated by vcftools. For example, if the prefix
is set to output_filename, then all output files will be of the form
output_filename.*** . If this option is omitted, all output files
will have the prefix "out." in the current working directory.

--stdout
-c

These options direct the vcftools
output to standard out so it can be piped into another program or
written directly to a filename of choice. However, a select few
output functions cannot be written to standard out.

--temp

This option can be used to redirect any
temporary files that vcftools creates into a specified
directory.

SITE FILTERING OPTIONS

These options are used
to include or exclude certain sites from any analysis being
performed by the program.

POSITION FILTERING

--chr
--not-chr

Includes or excludes sites with
chromosome identifiers matching the given value. These options may be
used multiple times to include or exclude more than one chromosome.

--from-bp

--to-bp

These options specify a lower bound and
upper bound for a range of sites to be processed. Sites with
positions less than or greater than these values will be excluded.
These options can only be used in conjunction with a single usage
of --chr. Using one of these does not require use of the other.
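The region logic described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not VCFtools code: the function name in_region is hypothetical, and the bounds are treated as inclusive because the text says only positions less than or greater than the given values are excluded.

```python
def in_region(chrom, pos, target_chrom, from_bp=None, to_bp=None):
    """Return True if a site lies on target_chrom within [from_bp, to_bp].

    Either bound may be omitted (None), mirroring the note that one of
    the two options can be used without the other.
    """
    if chrom != target_chrom:
        return False
    if from_bp is not None and pos < from_bp:
        return False
    if to_bp is not None and pos > to_bp:
        return False
    return True


# Hypothetical sites given as (chromosome, position) pairs.
sites = [("1", 999999), ("1", 1000000), ("1", 1500000), ("2", 1500000)]
kept = [s for s in sites if in_region(s[0], s[1], "1", 1000000, 2000000)]
print(kept)
```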

--positions

--exclude-positions

Include or exclude a set of sites on
the basis of a list of positions in a file. Each line of the input
file should contain a (tab-separated) chromosome and position. The
file can have comment lines that start with a "#"; these will be
ignored.

--positions-overlap
--exclude-positions-overlap

Include or exclude a set of sites on
the basis of the reference allele overlapping with a list of
positions in a file. Each line of the input file should contain a
(tab-separated) chromosome and position. The file can have comment
lines that start with a "#"; these will be ignored.

--bed
--exclude-bed

Include or exclude a set of sites on
the basis of a BED file. Only the first three columns (chrom,
chromStart and chromEnd) are required. The BED file is expected to
have a header line.

--thin

Thin sites so that no two sites are
within the specified distance from one another.
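One way to picture thinning is a greedy left-to-right pass over the sorted positions of a single chromosome, as in the Python sketch below. This is an assumption about the procedure, not a description of the VCFtools implementation. The name thin_positions is hypothetical, and the sketch treats two sites as "within" the distance when they are closer than min_distance.

```python
def thin_positions(positions, min_distance):
    """Greedily keep positions so that kept neighbours are at least
    min_distance apart (single chromosome assumed)."""
    kept = []
    for pos in sorted(positions):
        # Keep the first site, then any site far enough from the last kept one.
        if not kept or pos - kept[-1] >= min_distance:
            kept.append(pos)
    return kept


print(thin_positions([100, 150, 260, 300, 400], 100))
```

A greedy pass depends on where it starts, so a different traversal order could retain a different, equally valid subset of sites.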

--mask

--invert-mask

--mask-min

These options are used to specify a
FASTA-like mask file to filter with. The mask file contains a
sequence of integer digits (between 0 and 9) for each position on a
chromosome that specify if a site at that position should be
filtered or not.
An example mask file would look like:

>1
0000011111222...
>2
2222211111000...

In this example, sites in the VCF file
located within the first 5 bases of the start of chromosome 1 would
be kept, whereas sites at position 6 onwards would be filtered out.
Sites in the first 10 positions of chromosome 2 would be filtered
out as well, whereas sites at positions 11 to 13 (mask value 0)
would be kept.
The "--invert-mask" option takes the same format mask file as the
"--mask" option, however it inverts the mask file before filtering
with it.
The "--mask-min" option specifies a threshold mask value
between 0 and 9 to filter positions by. The default threshold is 0,
meaning only sites with that value or lower will be kept.
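The threshold rule (keep a site when its mask digit is at or below the threshold) can be sketched in Python as follows. The helper names parse_mask and keep_site are hypothetical, the trailing "..." of the example mask is left out so each chromosome has only 13 masked positions, and dropping positions that have no mask digit is an assumption of this sketch. The "--invert-mask" behaviour is not modelled.

```python
def parse_mask(text):
    """Map chromosome name -> string of mask digits from FASTA-like text."""
    masks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            masks[name] = ""
        elif name is not None:
            masks[name] += line
    return masks


def keep_site(masks, chrom, pos, mask_min=0):
    """True if the 1-based position has a mask digit <= mask_min."""
    digits = masks.get(chrom, "")
    if pos < 1 or pos > len(digits):
        return False  # assumption: positions without a mask digit are dropped
    return int(digits[pos - 1]) <= mask_min


masks = parse_mask(">1\n0000011111222\n>2\n2222211111000\n")
print(keep_site(masks, "1", 5), keep_site(masks, "1", 6))
```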

SITE ID FILTERING

--snp

Include SNP(s) with matching ID (e.g. a
dbSNP rsID). This command can be used multiple times in order to
include more than one SNP.

--snps

--exclude

Include or exclude a list of SNPs given
in a file. The file should contain a list of SNP IDs (e.g. dbSNP
rsIDs), with one ID per line. No header line is expected.

VARIANT TYPE FILTERING

--keep-only-indels
--remove-indels

Include or exclude sites that contain
an indel. For these options "indel" means any variant that alters
the length of the REF allele.
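The definition of "indel" given here translates directly into a length comparison. The Python sketch below is illustrative only: is_indel is a hypothetical helper, and symbolic or special ALT values (such as "<DEL>" or "*") are not handled.

```python
def is_indel(ref, alts):
    """True if any ALT allele differs in length from the REF allele."""
    return any(len(alt) != len(ref) for alt in alts)


print(is_indel("A", ["G"]))        # SNP: same length as REF
print(is_indel("AT", ["A"]))       # deletion: shorter than REF
print(is_indel("A", ["G", "AT"]))  # one ALT is an insertion
```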

FILTER FLAG FILTERING

--remove-filtered-all

Removes all sites with a FILTER flag
other than PASS.

--keep-filtered

--remove-filtered

Includes or excludes all sites marked
with a specific FILTER flag. These options may be used more than
once to specify multiple FILTER flags.

INFO FIELD FILTERING

--keep-INFO
--remove-INFO

Includes or excludes all sites with a
specific INFO flag. These options only filter on the presence of
the flag and not its value. These options can be used multiple
times to specify multiple INFO flags.

ALLELE FILTERING

--maf
--max-maf

Include only sites with a Minor Allele
Frequency greater than or equal to the "--maf" value and less than
or equal to the "--max-maf" value. One of these options may be used
without the other. Allele frequency is defined as the number of
times an allele appears over all individuals at that site, divided
by the total number of non-missing alleles at that site.
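The allele frequency definition above (allele count divided by the number of non-missing alleles) can be made concrete with a small Python sketch. The names allele_frequencies and passes_maf are hypothetical, the input is assumed to be plain GT strings such as "0/1" or "1|1", and the minor allele is taken to be the rarer of REF ("0") and ALT ("1") at a bi-allelic site. How VCFtools treats multi-allelic or fully missing sites is not covered here.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Frequency of each allele index over all non-missing alleles."""
    counts = Counter()
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":  # "." marks a missing allele
                counts[allele] += 1
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()} if total else {}


def passes_maf(genotypes, maf=0.0, max_maf=1.0):
    """Check maf <= minor allele frequency <= max_maf (bi-allelic site)."""
    freqs = allele_frequencies(genotypes)
    if not freqs:
        return False  # assumption: a site with no called alleles fails
    minor = min(freqs.get("0", 0.0), freqs.get("1", 0.0))
    return maf <= minor <= max_maf


gts = ["0/0", "0/1", "./.", "0|0"]  # five REF alleles, one ALT allele
print(allele_frequencies(gts))
```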

--non-ref-af

--max-non-ref-af

Include only sites with all
Non-Reference (ALT) Allele Frequencies greater than or equal to the
"--non-ref-af" value and less than or equal to the
"--max-non-ref-af" value. One of these options may be used without
the other. Allele frequency is defined as the number of times an
allele appears over all individuals at that site, divided by the
total number of non-missing alleles at that site.

--mac
--max-mac

Include only sites with Minor Allele
Count greater than or equal to the "--mac" value and less than or
equal to the "--max-mac" value. One of these options may be used
without the other. Allele count is simply the number of times that
allele appears over all individuals at that site.

--non-ref-ac

--max-non-ref-ac

Include only sites with all
Non-Reference (ALT) Allele Counts greater than or equal to the
"--non-ref-ac" value and less than or equal to the
"--max-non-ref-ac" value. One of these options may be used without
the other. Allele count is simply the number of times that allele
appears over all individuals at that site.

--min-alleles

--max-alleles

Include only sites with a number of
alleles greater than or equal to the "--min-alleles" value and less
than or equal to the "--max-alleles" value. One of these options
may be used without the other.
For example, to include only bi-allelic sites, one could use:

vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

GENOTYPE VALUE FILTERING

--min-meanDP
--max-meanDP

Includes only sites with mean depth
values (over all included individuals) greater than or equal to the
"--min-meanDP" value and less than or equal to the "--max-meanDP"
value. One of these options may be used without the other. These
options require that the "DP" FORMAT tag is included for each
site.

--hwe

Assesses sites for Hardy-Weinberg
Equilibrium using an exact test, as defined by Wigginton, Cutler
and Abecasis (2005). Sites with a p-value below the threshold
defined by this option are taken to be out of HWE, and therefore
excluded.
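For intuition, an exact Hardy-Weinberg p-value for a bi-allelic site can be computed by brute-force enumeration, as in the Python sketch below. It enumerates every possible heterozygote count given the observed allele counts and sums the probabilities of the configurations that are no more likely than the observed one, which is the usual definition of the exact test's p-value. This is not the algorithm from the cited paper's implementation nor VCFtools code, and the names hwe_exact_p and passes_hwe are hypothetical. Exact fractions are used for clarity, so it is slow for large samples.

```python
from fractions import Fraction
from math import factorial


def hwe_exact_p(n_het, n_hom1, n_hom2):
    """Exact HWE p-value from genotype counts at a bi-allelic site."""
    n = n_het + n_hom1 + n_hom2   # genotyped individuals
    n_a = 2 * n_hom1 + n_het      # copies of the first allele
    n_b = 2 * n_hom2 + n_het      # copies of the second allele
    if n == 0:
        return Fraction(1)

    def weight(het):
        # Proportional to P(het | n, n_a); the shared constant cancels
        # when the weights are normalised below.
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return Fraction(2 ** het * factorial(n),
                        factorial(hom_a) * factorial(het) * factorial(hom_b))

    # The heterozygote count has the same parity as n_a and cannot
    # exceed the count of the rarer allele.
    candidates = range(n_a % 2, min(n_a, n_b) + 1, 2)
    weights = {het: weight(het) for het in candidates}
    total = sum(weights.values())
    observed = weights[n_het]
    return sum(w for w in weights.values() if w <= observed) / total


def passes_hwe(p_value, threshold):
    """A site is excluded when its p-value is below the threshold."""
    return p_value >= threshold


print(hwe_exact_p(0, 1, 1))
```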

--max-missing

Exclude sites on the basis of the
proportion of missing data (defined to be between 0 and 1, where 0
allows sites that are completely missing and 1 indicates no missing
data allowed).
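Because a value of 1 means "no missing data allowed", the threshold behaves like a minimum call rate. The Python sketch below shows one reading of the description; passes_max_missing is a hypothetical helper, and treating a genotype as missing when any of its alleles is "." is an assumption that may differ from what VCFtools does.

```python
def passes_max_missing(genotypes, threshold):
    """Keep a site when the fraction of called genotypes >= threshold."""
    if not genotypes:
        return False
    called = sum(1 for gt in genotypes if "." not in gt)
    return called / len(genotypes) >= threshold


gts = ["0/0", "./.", "0/1", "1/1"]  # three of four genotypes are called
print(passes_max_missing(gts, 0.5), passes_max_missing(gts, 1.0))
```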

--max-missing-count

Exclude sites with more than this
number of missing genotypes over all individuals.

--phased

Excludes all sites that contain
unphased genotypes.

MISCELLANEOUS FILTERING

--minQ

Includes only sites with Quality value
above this threshold.

INDIVIDUAL FILTERING OPTIONS

These options are used
to include or exclude certain individuals from any analysis being
performed by the program.

--indv
--remove-indv

Specify an individual to be kept or
removed from the analysis. This option can be used multiple times
to specify multiple individuals. If both options are specified,
then the "--indv" option is executed before the "--remove-indv"
option.

--keep

--remove

Provide a file containing a list of
individuals to either include or exclude in subsequent analysis.
Each individual ID (as defined in the VCF header line) should be
included on a separate line. If both options are used, then the
"--keep" option is executed before the "--remove" option. No header
line is expected.

--max-indv

Randomly thins individuals so that only
the specified number are retained.

GENOTYPE FILTERING OPTIONS

These options are used
to exclude genotypes from any analysis being performed by the
program. If excluded, these values will be treated as missing.

--remove-filtered-geno-all

Excludes all genotypes with a FILTER
flag not equal to "." (a missing value) or PASS.

--remove-filtered-geno

Excludes genotypes with a specific
FILTER flag.

--minGQ

Exclude all genotypes with a quality
below the threshold specified. This option requires that the "GQ"
FORMAT tag is specified for all sites.

--minDP

--maxDP

Includes only genotypes with a depth greater than or
equal to the "--minDP" value and less than or equal to the
"--maxDP" value. These options require that the "DP" FORMAT tag is
specified for all sites.

OUTPUT OPTIONS

These options specify
which analyses or conversions to perform on the data that passed
through all specified filters.

OUTPUT ALLELE STATISTICS

--freq
--freq2

Outputs the allele frequency for each
site in a file with the suffix ".frq". The second option is used to
suppress output of any information about the alleles.

--counts
--counts2

Outputs the raw allele counts for each
site in a file with the suffix ".frq.count". The second option is
used to suppress output of any information about the alleles.

--derived

For use with the previous four
frequency and count options only. Re-orders the output file columns
so that the ancestral allele appears first. This option relies on
the ancestral allele being specified in the VCF file using the AA
tag in the INFO field.

OUTPUT DEPTH STATISTICS

--depth

Generates a file containing the mean
depth per individual. This file has the suffix ".idepth".

--site-depth

Generates a file containing the depth
per site summed across all individuals. This output file has the
suffix ".ldepth".

--site-mean-depth

Generates a file containing the mean
depth per site averaged across all individuals. This output file
has the suffix ".ldepth.mean".

--geno-depth

Generates a (possibly very large) file
containing the depth for each genotype in the VCF file. Missing
entries are given the value -1. The file has the suffix
".gdepth".

OUTPUT LD STATISTICS

--hap-r2

Outputs a file reporting the r2, D, and
D’ statistics using phased haplotypes. These are the traditional
measures of LD often reported in the population genetics
literature. The output file has the suffix ".hap.ld". This option
assumes that the VCF input file has phased haplotypes.

--geno-r2

Calculates the squared correlation
coefficient between genotypes encoded as 0, 1 and 2 to represent
the number of non-reference alleles in each individual. This is the
same as the LD measure reported by PLINK. The D and D’ statistics
are only available for phased genotypes. The output file has the
suffix ".geno.ld".
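The squared correlation between 0/1/2 genotype codes is an ordinary squared Pearson correlation. The Python sketch below shows the arithmetic; geno_r2 is a hypothetical helper, missing genotypes are assumed to be coded as -1 and dropped pairwise, and returning None for monomorphic sites is a choice made for this sketch rather than documented VCFtools behaviour.

```python
def geno_r2(x, y):
    """Squared Pearson correlation between two lists of 0/1/2 genotype
    codes; pairs where either value is -1 (missing) are skipped."""
    pairs = [(a, b) for a, b in zip(x, y) if a >= 0 and b >= 0]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # no variation at one site: correlation undefined
    return sxy * sxy / (sxx * syy)


print(geno_r2([0, 1, 2, 1], [0, 1, 2, 1]))
print(geno_r2([0, 2, 0, 2], [0, 0, 2, 2]))
```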

--geno-chisq

If your data contains sites with more
than two alleles, then this option can be used to test for genotype
independence via the chi-squared statistic. The output file has the
suffix ".geno.chisq".

--ld-window

This optional parameter defines the
maximum number of SNPs between the SNPs being tested for LD in the
"--hap-r2", "--geno-r2", and "--geno-chisq" functions.

--ld-window-bp

This optional parameter defines the
maximum number of physical bases between the SNPs being tested for
LD in the "--hap-r2", "--geno-r2", and "--geno-chisq"
functions.

--ld-window-min

This optional parameter defines the
minimum number of SNPs between the SNPs being tested for LD in the
"--hap-r2", "--geno-r2", and "--geno-chisq" functions.

--ld-window-bp-min

This optional parameter defines the
minimum number of physical bases between the SNPs being tested for
LD in the "--hap-r2", "--geno-r2", and "--geno-chisq"
functions.

--min-r2

This optional parameter sets a minimum
value for r2, below which the LD statistic is not reported by the
"--hap-r2", "--geno-r2", and "--geno-chisq" functions.

--interchrom-hap-r2

Outputs a file reporting the r2
statistics using phased haplotypes only with sites on different
chromosomes. The output file has the suffix ".interchrom.hap.ld".
This option assumes that the VCF input file has phased
haplotypes.

--interchrom-geno-r2

Calculates the squared correlation
coefficient between genotypes encoded as 0, 1 and 2 to represent
the number of non-reference alleles in each individual but only for
sites on differing chromosomes. The output file has the suffix
".interchrom.geno.ld".

--hap-r2-positions

Outputs a file reporting the r2
statistics using phased haplotypes only at the sites contained in
the provided BED file. The output file has the suffix
".list.hap.ld". This option assumes that the VCF input file has
phased haplotypes.

--geno-r2-positions

Calculates the squared correlation
coefficient between genotypes only at the sites contained in the
provided BED file. The output file has the suffix
".list.geno.ld".

OUTPUT TRANSITION/TRANSVERSION STATISTICS

--TsTv

Calculates the Transition /
Transversion ratio in bins of size defined by this option. Only
uses bi-allelic SNPs. The resulting output file has the suffix
".TsTv".
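A transition is a purine-to-purine (A, G) or pyrimidine-to-pyrimidine (C, T) change; any other single-base change is a transversion. The Python sketch below computes the ratio for a list of bi-allelic SNPs. The helper names are hypothetical, the inputs are assumed to be single-base REF/ALT pairs with REF different from ALT, and the binning by position that "--TsTv" performs is left out.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def ts_tv_ratio(snps):
    """Ts/Tv ratio for (ref, alt) pairs; None if there are no transversions."""
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    return ts / tv if tv else None


print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```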

--TsTv-summary

Calculates a simple summary of all
Transitions and Transversions. The output file has the suffix
".TsTv.summary".

--TsTv-by-count

Calculates the Transition /
Transversion ratio as a function of alternative allele count. Only
uses bi-allelic SNPs. The resulting output file has the suffix
".TsTv.count".

--TsTv-by-qual

Calculates the Transition /
Transversion ratio as a function of SNP quality threshold. Only
uses bi-allelic SNPs. The resulting output file has the suffix
".TsTv.qual".

--FILTER-summary

Generates a summary of the number of
SNPs and Ts/Tv ratio for each FILTER category. The output file has
the suffix ".FILTER.summary".

OUTPUT NUCLEOTIDE DIVERSITY STATISTICS

--site-pi

Measures nucleotide diversity on a
per-site basis. The output file has the suffix ".sites.pi".
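A common textbook definition of this per-site statistic is the proportion of pairs of sampled chromosomes that carry different alleles at the site. The Python sketch below computes it from allele counts. Whether VCFtools uses exactly this formula is an assumption, and site_pi is a hypothetical helper name.

```python
from fractions import Fraction


def site_pi(allele_counts):
    """Fraction of chromosome pairs that differ at a site.

    allele_counts lists how many sampled chromosomes carry each allele,
    for example [6, 2] for six REF and two ALT copies.
    """
    n = sum(allele_counts)
    if n < 2:
        return None  # at least two chromosomes are needed to form a pair
    same_pairs = sum(c * (c - 1) // 2 for c in allele_counts)
    total_pairs = n * (n - 1) // 2
    return Fraction(total_pairs - same_pairs, total_pairs)


print(site_pi([2, 2]))
```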

--window-pi

--window-pi-step

Measures the nucleotide diversity in
windows, with the number provided as the window size. The output
file has the suffix ".windowed.pi". The latter is an optional
argument used to specify the step size in between windows.

OUTPUT FST STATISTICS

--weir-fst-pop

This option is used to calculate an Fst
estimate from Weir and Cockerham’s 1984 paper. This is the
preferred calculation of Fst. The provided file must contain a list
of individuals (one individual per line) from the VCF file that
correspond to one population. This option can be used multiple
times to calculate Fst for more than two populations. By default,
calculations are done on a per-site basis. The output file has the
suffix ".weir.fst".

--fst-window-size
--fst-window-step

These options can be used with
"--weir-fst-pop" to do the Fst calculations on a windowed basis
instead of a per-site basis. These arguments specify the desired
window size and the desired step size between windows.

OUTPUT OTHER STATISTICS

--het

Calculates a measure of heterozygosity
on a per-individual basis. Specifically, the inbreeding coefficient,
F, is estimated for each individual using a method of moments. The
resulting file has the suffix ".het".

--hardy

Reports a p-value for each site from a
Hardy-Weinberg Equilibrium test (as defined by Wigginton, Cutler
and Abecasis (2005)). The resulting file (with suffix ".hwe") also
contains the Observed numbers of Homozygotes and Heterozygotes and
the corresponding Expected numbers under HWE.

--TajimaD

Outputs Tajima’s D statistic in bins
with size of the specified number. The output file has the suffix
".Tajima.D".

--indv-freq-burden

This option calculates the number of
variants within each individual of a specific frequency. The
resulting file has the suffix ".ifreqburden".

--LROH

This option will identify and output
Long Runs of Homozygosity. The output file has the suffix
".LROH".

--relatedness

This option is used to calculate and
output a relatedness statistic based on the method of Yang et al,
Nature Genetics 2010 (doi:10.1038/ng.608). Specifically, calculate
the unadjusted Ajk statistic. Expectation of Ajk is zero for
individuals within a population, and one for an individual with
themselves. The output file has the suffix ".relatedness".

--relatedness2

This option is used to calculate and
output a relatedness statistic based on the method of Manichaikul
et al., BIOINFORMATICS 2010 (doi:10.1093/bioinformatics/btq559).
The output file has the suffix ".relatedness2".

--site-quality

Generates a file containing the
per-site SNP quality, as found in the QUAL column of the VCF file.
This file has the suffix ".lqual".

--missing-indv

Generates a file reporting the
missingness on a per-individual basis. The file has the suffix
".imiss".

--missing-site

Generates a file reporting the
missingness on a per-site basis. The file has the suffix
".lmiss".

--SNPdensity

Calculates the number and density of
SNPs in bins of size defined by this option. The resulting output
file has the suffix ".snpden".

--kept-sites

Creates a file listing all sites that
have been kept after filtering. The file has the suffix
".kept.sites".

--removed-sites

Creates a file listing all sites that
have been removed after filtering. The file has the suffix
".removed.sites".

--singletons

This option will generate a file
detailing the location of singletons, and the individual they occur
in. The file reports both true singletons, and private doubletons
(i.e. SNPs where the minor allele only occurs in a single
individual and that individual is homozygous for that allele). The
output file has the suffix ".singletons".

--hist-indel-len

This option will generate a histogram
file of the length of all indels (including SNPs). It shows both
the count and the percentage of all indels for indel lengths that
occur at least once in the input file. SNPs are considered indels
with length zero. The output file has the suffix ".indel.hist".

--extract-FORMAT-info

Extract information from the genotype
fields in the VCF file relating to a specified FORMAT identifier.
The resulting output file has a suffix of the form ".ID.FORMAT",
where ID is the requested FORMAT identifier. For example,
the following command would extract all of the GT (i.e.
Genotype) entries:

vcftools --vcf file1.vcf --extract-FORMAT-info GT

--get-INFO

This option is used to extract
information from the INFO field in the VCF file. The argument
specifies the INFO tag to be extracted, and the option can be used
multiple times in order to extract multiple INFO entries. The
resulting file, with suffix ".INFO", contains the required INFO
information in a tab-separated table. For example, to extract the
NS and DB flags, one would use the command:

vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB

OUTPUT VCF FORMAT

--recode
--recode-bcf

These options are used to generate a
new file in either VCF or BCF from the input VCF or BCF file after
applying the filtering options specified by the user. The output
file has the suffix ".recode.vcf" or ".recode.bcf". By default, the
INFO fields are removed from the output file, as the INFO values
may be invalidated by the recoding (e.g. the total depth may need
to be recalculated if individuals are removed). This behavior may
be overridden by the following options. By default, BCF files are
written out as BGZF compressed files.

--recode-INFO

--recode-INFO-all

These options can be used with the
above recode options to define an INFO key name to keep in the
output file. This option can be used multiple times to keep more of
the INFO fields. The second option is used to keep all INFO values
in the original file.

--contigs

This option can be used in conjunction
with the --recode-bcf when the input file does not have any contig
declarations. This option expects a file name with one contig
header per line. These lines are included in the output file.

OUTPUT OTHER FORMATS

--012

This option outputs the genotypes as a
large matrix. Three files are produced. The first, with suffix
".012", contains the genotypes of each individual on a separate
line. Genotypes are represented as 0, 1 and 2, where the number
represents the number of non-reference alleles. Missing genotypes
are represented by -1. The second file, with suffix ".012.indv"
details the individuals included in the main file. The third file,
with suffix ".012.pos" details the site locations included in the
main file.
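The 0/1/2 coding can be illustrated with a short Python sketch that counts non-reference alleles in a GT string. The helper names are hypothetical, diploid GT strings are assumed, and treating a partially missing genotype (such as "0/.") as missing is an assumption of this sketch.

```python
def encode_012(gt):
    """Number of non-reference alleles in a GT string, or -1 if missing."""
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return -1
    return sum(1 for a in alleles if a != "0")


def encode_individual(genotypes):
    """One row of the matrix: all sites for a single individual."""
    return [encode_012(gt) for gt in genotypes]


print(encode_individual(["0/0", "0|1", "1/1", "./."]))
```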

--IMPUTE

This option outputs phased haplotypes
in IMPUTE reference-panel format. As IMPUTE requires phased data,
using this option also implies --phased. Unphased individuals and
genotypes are therefore excluded. Only bi-allelic sites are
included in the output. Using this option generates three files.
The IMPUTE haplotype file has the suffix ".impute.hap", and the
IMPUTE legend file has the suffix ".impute.hap.legend". The third
file, with suffix ".impute.hap.indv", details the individuals
included in the haplotype file, although this file is not needed by
IMPUTE.

--ldhat
--ldhat-geno

These options output data in LDhat
format. These options require the "--chr" filter option to also be
used. The first option outputs phased data only, and therefore also
implies "--phased" be used, leading to unphased individuals and
genotypes being excluded. The second option treats all of the data
as unphased, and therefore outputs LDhat files in genotype/unphased
format. Two output files are generated with the suffixes
".ldhat.sites" and ".ldhat.locs", which correspond to the LDhat
"sites" and "locs" input files respectively.

--BEAGLE-GL
--BEAGLE-PL

These options output genotype
likelihood information for input into the BEAGLE program. The VCF
file is required to contain FORMAT fields with "GL" or "PL" tags,
which can generally be output by SNP callers such as the GATK. Use
of this option requires a chromosome to be specified via the
"--chr" option. The resulting output file has the suffix
".BEAGLE.GL" or ".BEAGLE.PL" and contains genotype likelihoods for
biallelic sites. This file is suitable for input into BEAGLE via
the "like=" argument.

--plink
--plink-tped

These options output the genotype data
in PLINK PED format. With the first option, two files are
generated, with suffixes ".ped" and ".map". Note that only
bi-allelic loci will be output. Further details of these files can
be found in the PLINK documentation.
Note: The first option can be very slow on large datasets. Using
the --chr option to divide up the dataset is advised, or
alternatively use the --plink-tped option which outputs the files
in the PLINK transposed format with suffixes ".tped" and
".tfam".

COMPARISON OPTIONS

These options are used
to compare the original variant file to another variant file and
output the results. None of the diff functions can write to
standard out.

DIFF VCF FILE

--diff
--gzdiff
--diff-bcf

These options compare the original
input file to the specified VCF, gzipped VCF, or BCF file. They
output two files describing the sites and individuals
common / unique to each file. These files have the suffixes
".diff.sites_in_files" and ".diff.indv_in_files"
respectively.
See examples section for usage help.

DIFF OPTIONS

--diff-site-discordance

This option can be used in conjunction
with any of the above "--diff" options to calculate discordance on
a site by site basis. The resulting output file has the suffix
".diff.sites".

--diff-indv-discordance

This option can be used in conjunction
with any of the above "--diff" options to calculate discordance on
a per-individual basis. The resulting output file has the suffix
".diff.indv".

--diff-indv-map

This option can be used in conjunction
with any of the above "--diff" options to specify a mapping of
individual IDs in the second file to those in the first file.

--diff-discordance-matrix

This option can be used in conjunction
with any of the above "--diff" options to calculate a discordance
matrix. This option only works with bi-allelic loci with matching
alleles that are present in both files. The resulting output file
has the suffix ".diff.discordance.matrix".

--diff-switch-error

Used in conjunction with the --diff
option to calculate phasing errors (specifically "switch errors").
This option generates two output files describing switch errors
found between sites, and the average switch error per individual.
These two files have the suffixes ".diff.switch" and
".diff.indv.switch" respectively.

AUTHOR

Adam Auton (adam.auton@einstein.yu.edu)
Anthony Marcketta (anthony.marcketta@einstein.yu.edu)

The Perl modules and scripts

VCFtools contains a Perl API (Vcf.pm) and a number of Perl scripts that can be used to perform common tasks with VCF files such as file validation, file merging, intersecting, complements, etc. The Perl tools support all versions of the VCF specification (3.2, 3.3, 4.0, 4.1 and 4.2); nevertheless, users are encouraged to use the latest versions, VCFv4.1 or VCFv4.2. VCFtools has in general been used mainly with diploid data, but the Perl tools aim to support polyploid data as well. Run any of the Perl scripts with the --help switch to obtain more help.

Many of the Perl scripts require that the VCF files are compressed by bgzip and indexed by tabix (both tools are part of the tabix package). A VCF file can be compressed with bgzip and then indexed with tabix -p vcf.

The tools

  • fill-aa
  • fill-an-ac
  • fill-fs
  • fill-ref-md5
  • fill-rsIDs
  • vcf-annotate
  • vcf-compare
  • vcf-concat
  • vcf-consensus
  • vcf-contrast
  • vcf-convert
  • vcf-filter
  • vcf-fix-newlines
  • vcf-fix-ploidy
  • vcf-indel-stats
  • vcf-isec
  • vcf-merge
  • vcf-phased-join
  • vcf-query
  • vcf-shuffle-cols
  • vcf-sort
  • vcf-stats
  • vcf-subset
  • vcf-to-tab
  • vcf-tstv
  • vcf-validator
  • Vcf.pm

fill-an-ac

Fill or recalculate AN and AC INFO fields.

fill-fs

Annotates the VCF file with flanking sequence (INFO/FS tag) masking known variants with N's. Useful for designing primers.

fill-ref-md5

Fill missing reference info and sequence MD5s into VCF header.

fill-rsIDs

Fill missing rsIDs. This script has been discontinued; please use vcf-annotate instead.

vcf-annotate

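The script adds or removes filters and custom annotations to VCF files. To add custom annotations, create a TAB-delimited file with the annotations, compress it (using bgzip annotations), index it (using tabix -s 1 -b 2 -e 3 annotations.gz, which treats column 1 as the chromosome, column 2 as the start and column 3 as the end position), and run vcf-annotate on the VCF with that file as the annotation source (see vcf-annotate -h for the exact options).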

The script is also routinely used to apply filters. There are a number of predefined filters, and custom filters can be easily added; see vcf-annotate -h for examples. Some of the predefined filters take advantage of tags added by bcftools. Descriptions of the ones most frequently asked about follow:

Strand Bias .. Tests if variant bases tend to come from one strand. Fisher's exact test for a 2x2 contingency table where the row variable is whether the base is the reference allele or not and the column variable is the strand. The two-tailed P-value is used.
End Distance Bias .. Tests if variant bases tend to occur at a fixed distance from the end of reads, which is usually an indication of misalignment. (T-test)
Base Quality Bias .. Tests if variant bases tend to occur with a quality bias (T-test). This filter is by default effectively disabled as it is set to 0.
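
The Strand Bias test above is a two-tailed Fisher's exact test on a 2x2 table. A minimal sketch in Python follows, assuming the four counts are reference and alternate bases observed on the forward and reverse strands. The function name is hypothetical, and the two-tailed P-value is taken here as the total probability of all tables that are no more likely than the observed one; bcftools may compute it differently.

```python
from math import comb


def fisher_exact_two_tailed(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-tailed Fisher's exact test P-value for a 2x2 strand/allele table."""
    row_ref = ref_fwd + ref_rev          # all reference bases
    row_alt = alt_fwd + alt_rev          # all variant bases
    col_fwd = ref_fwd + alt_fwd          # all forward-strand bases
    total = row_ref + row_alt
    denom = comb(total, col_fwd)

    def prob(a):
        # Hypergeometric probability of seeing `a` reference bases on the
        # forward strand with all row and column totals held fixed.
        return comb(row_ref, a) * comb(row_alt, col_fwd - a) / denom

    lowest = max(0, col_fwd - row_alt)
    highest = min(row_ref, col_fwd)
    p_observed = prob(ref_fwd)
    # Sum the probabilities of every table at least as extreme as the observed
    # one; the small tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (prob(a) for a in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts: variant bases seen almost only on the reverse strand.
    print(fisher_exact_two_tailed(20, 18, 1, 15))
```

A small P-value suggests that the variant bases are not spread evenly across the two strands.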

Note: A fast htslib C version of this tool is now available (see bcftools annotate).


vcf-compare

Compares positions in two or more VCF files and outputs the numbers of positions contained in one file but not the others, in two files but not the others, etc., which comes in handy when generating Venn diagrams. The script also computes numbers such as nonreference discordance rates (including multiallelic sites), compares the actual sequence (useful when comparing indels), etc.

Note: A fast htslib C version of this tool is now available (see bcftools stats).

vcf-concat

Merges multiple VCF files, for example chr1.vcf and chr2.vcf, into a single file. Use it when each VCF file holds one chromosome.

Concatenates VCF files (for example split by chromosome). Note
that the input and output VCFs will have the same number of
columns, the script does not merge VCFs by position (see also
vcf-merge).

In the basic mode it does not do anything fancy except for a
sanity check that all files have the same columns. When run with
the -s option, it will perform a partial merge sort, looking at
a limited number of open files simultaneously.

e.g.

./vcf-concat chr1.vcf chr2.vcf > merged.vcf
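
The basic mode described above (a column sanity check followed by plain concatenation) can be sketched in Python. This is an illustration under simplifying assumptions, not the script's actual implementation: the function name concat_vcf is hypothetical, inputs are whole VCF files held as strings, and only the meta lines of the first file are kept.

```python
def concat_vcf(vcf_texts):
    """Concatenate VCF texts after checking that their #CHROM lines match."""
    if not vcf_texts:
        return ""
    out_lines = []
    columns = None
    for i, text in enumerate(vcf_texts):
        for line in text.splitlines():
            if line.startswith("##"):
                if i == 0:
                    out_lines.append(line)   # meta lines from the first file only
            elif line.startswith("#CHROM"):
                if columns is None:
                    columns = line           # remember the reference column header
                    out_lines.append(line)
                elif line != columns:
                    raise ValueError(f"file {i + 1} has different columns")
            elif line:
                out_lines.append(line)       # data record, copied unchanged
    return "\n".join(out_lines) + "\n"


if __name__ == "__main__":
    header = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n"
    chr1 = header + "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\n"
    chr2 = header + "2\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1/1\n"
    print(concat_vcf([chr1, chr2]), end="")
```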

vcf-consensus

Apply VCF variants to a FASTA file to create a consensus sequence.

vcf-convert

Convert between VCF versions, currently from VCFv3.3 to VCFv4.0.

vcf-contrast

A tool for finding differences between groups of samples, useful in trio analyses, cancer genomes etc.

A typical run considers only variants with an average mapping quality of 30 (-f MinMQ=30) and a minimum depth of 10 (-d 10), and reports only novel alleles (-n). vcf-query can then be used to extract the INFO/NOVEL* annotations into a table, and sorting that table (sort -k5,5nr) ranks the sites by the confidence that the site is different in the child.

vcf-filter

Please take a look at vcf-annotate and bcftools view, which do what you are looking for. Apologies for the non-intuitive naming.
Note: A fast HTSlib C version of a filtering tool is now available (see bcftools filter and bcftools view).

vcf-fix-newlines

Reads a VCF file with any commonly used newline representation and writes it with the current system's newline representation.

vcf-fix-ploidy

Fixes diploid vs haploid genotypes on sex chromosomes, including the pseudoautosomal regions.

vcf-indel-stats

Calculates the in-frame ratio.

Note: A fast htslib C version of this tool is now available (see bcftools stats).

vcf-isec

Creates intersections and complements of two or more VCF files. Given multiple VCF files, it can output the list of positions which are shared by at least N files, at most N files, exactly N files, etc. For example, it can output the positions shared by at least two files, or the positions present in file A but absent from files B and C.

Note: A fast htslib C version of this tool is now available (see bcftools isec).
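
The two kinds of query described above can be sketched with plain set operations in Python. This is only an illustration of the idea, not how vcf-isec is implemented: the function names are hypothetical, and each file is assumed to have been reduced to a set of (chromosome, position) pairs.

```python
from collections import Counter


def shared_by_at_least(position_sets, n):
    """Positions that occur in at least n of the given position sets."""
    # Count each position once per file, however often it occurs within a file.
    counts = Counter(pos for s in position_sets for pos in set(s))
    return sorted(pos for pos, c in counts.items() if c >= n)


def complement(first, *others):
    """Positions present in the first set but absent from all the others."""
    excluded = set().union(*others)
    return sorted(set(first) - excluded)


if __name__ == "__main__":
    # Hypothetical positions from three files A, B and C.
    a = {("1", 100), ("1", 200), ("2", 50)}
    b = {("1", 200), ("2", 50)}
    c = {("2", 50), ("3", 7)}
    print(shared_by_at_least([a, b, c], 2))
    print(complement(a, b, c))
```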

 

vcf-merge

Merges multiple VCF files. Use it when each VCF file holds one sample.

Merges two or more VCF files into one so that, for example, if two source files had one column each, the output file will have two columns. See also vcf-concat for concatenating VCFs split by chromosome.

Note that this script is not intended for concatenating VCF files. For this, use vcf-concat instead.
Note: A fast htslib C version of this tool is now available (see bcftools merge).

vcf-phased-join

Concatenates multiple overlapping VCFs preserving phasing.

vcf-query

A powerful tool for converting VCF files into a user-defined format. Supports retrieval of subsets of positions, columns and fields.

Note: A fast htslib C version of this tool is now available (see bcftools query).

vcf-shuffle-cols

Reorders columns.

vcf-sort

Sort a VCF file.

vcf-stats

Outputs some basic statistics: the number of SNPs, indels, etc.

Note: A fast htslib C version of this tool is now available (see bcftools stats).

vcf-subset

Removes some columns from the VCF file.

Note: A fast HTSlib C version of this tool is now available (see bcftools view).

e.g.

./vcf-subset -c sample1,sample2 file1.vcf > file2.vcf

vcf-tstv

A lightweight script for quick calculation of Ts/Tv ratio.

Note: A fast htslib C version of this tool is now available (see bcftools stats).
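
For reference, the Ts/Tv ratio is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other single-base substitutions). A minimal Python sketch of that calculation follows; it is not the vcf-tstv script itself, and the function name and the (REF, ALT) input format are assumptions for illustration.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv(snps):
    """Return (transitions, transversions, ratio) for (REF, ALT) pairs."""
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        # Skip anything that is not a single-base substitution (indels, N, etc.).
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    # The ratio is undefined when there are no transversions.
    ratio = ts / tv if tv else None
    return ts, tv, ratio


if __name__ == "__main__":
    # Hypothetical variants: two transitions, one transversion, one deletion.
    print(ts_tv([("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]))
```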

 

vcf-to-tab

A simple script which converts the VCF file into a tab-delimited text file listing the actual variants instead of ALT indexes.

vcf-validator

Vcf.pm

For examples of how to use the Perl API, it is best to look at some of the simpler scripts, for example vcf-to-tab. The detailed documentation can be obtained by running perldoc Vcf.pm.

REF:

http://samtools.github.io/hts-specs/VCFv4.2.pdf
