General workflow
VCF.Filter generates variant hitlists from next-generation sequencing data. Filters are applied to textual and numerical custom annotations provided in VCF (Variant Call Format) files.
VCF format primer
Although VCF files are text files that can be opened and manipulated with a text editor, the complexity of the VCF file format is often underestimated.
Consider the example from the VCFv4.2 format specification document shown below.
There are two main parts:
1. A set of header lines starting with '##' characters.
2. A set of tab-separated columns holding the variant data, starting from the row beginning with '#CHROM'.
Variant annotations are stored in the INFO field as key=value pairs separated by semicolons. The value can be textual, numerical, or an array holding values applying to different alleles at a given position.
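The key=value layout described above can be illustrated with a minimal Python sketch. It is not part of VCF.Filter; the function name `parse_info` is hypothetical, and the input string is the INFO field of the first data record in the VCFv4.2 specification example. All values are deliberately left as strings, because their data types are only known from the header.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict.

    Keys without '=' are flags and map to True; comma-separated values
    become lists (one entry per allele); everything stays a string.
    """
    result = {}
    if info == ".":                      # "." marks an empty INFO field
        return result
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        if not sep:                      # no "=": a flag such as DB
            result[key] = True
        elif "," in value:               # array of values
            result[key] = value.split(",")
        else:
            result[key] = value
    return result


print(parse_info("NS=3;DP=14;AF=0.5;DB;H2"))
# {'NS': '3', 'DP': '14', 'AF': '0.5', 'DB': True, 'H2': True}
```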
The exact layout of an annotation is specified in a VCF header line. INFO header lines have four required attributes: ID, Number, Type, and Description. ID holds the annotation name and Description summarises its purpose.
The Number attribute specifies whether the value of the key=value pair is a flag (0), a single value (1), or an array of values (A, R). A-arrays hold values only for the alternative alleles. R-arrays hold values for all alleles including the reference allele.
The Type attribute spells out the data type of the value (String for text, Integer for whole numbers, Float for real numbers, Flag for boolean presence/absence).
The following table shows the header line attributes for the annotations in the VCF example shown above.
ID | Number | Type | Description |
NS | 1 | Integer | Number of Samples With Data |
DP | 1 | Integer | Total Depth |
AF | A | Float | Allele Frequency |
AA | 1 | String | Ancestral Allele |
DB | 0 | Flag | dbSNP membership, build 129 |
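To show how the header attributes drive the interpretation of a value, here is a hedged Python sketch that reads an `##INFO` header line and casts a raw INFO value accordingly. The names `parse_info_header` and `cast_value` are hypothetical, the regular expression only covers the four attributes discussed here, and this is not how VCF.Filter itself is implemented.

```python
import re

# Matches the ID, Number, Type and Description attributes of an INFO header line.
INFO_HEADER = re.compile(
    r'##INFO=<ID=(?P<ID>[^,]+),Number=(?P<Number>[^,]+),'
    r'Type=(?P<Type>[^,]+),Description="(?P<Description>[^"]*)"'
)

# Python constructors for the non-flag VCF value types.
CASTS = {"Integer": int, "Float": float, "String": str, "Character": str}


def parse_info_header(line):
    """Return the header attributes as a dict, or None if the line does not match."""
    match = INFO_HEADER.match(line)
    return match.groupdict() if match else None


def cast_value(raw, header):
    """Cast a raw INFO value according to the Type and Number of its header."""
    if header["Type"] == "Flag":
        return True                      # presence of the key is the value
    cast = CASTS[header["Type"]]
    values = [None if v == "." else cast(v) for v in raw.split(",")]
    # Number=1 is a single value; anything else (A, R, ...) stays a list.
    return values[0] if header["Number"] == "1" else values


af = parse_info_header('##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">')
print(cast_value("0.333,0.667", af))     # [0.333, 0.667]
```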
What can VCF.Filter do for me?
VCF.Filter is a standalone Java application for viewing and filtering the contents of VCF files aimed at an audience that doesn't feel comfortable using command line tools or web-based tools with their proprietary data.
VCF.Filter builds fully customizable filter chains for fields listed in VCF header lines,
intersects variants with runs of homozygosity provided as bed files,
calculates variant recurrence values in your cohort and filters on cohort recurrence,
analyzes pedigrees for the presence of variants following a known type of disease inheritance pattern,
prints your variants on a Hilbert curve,
and allows you to find a variant reported in the literature in your cohort of VCF files.
Download and unzip to a directory of your choice.
Double-click the jar file or the start-up-script that is suitable for your operating system.
The software requires Java 1.8 to run.
Download VCFFilter and sample data
Launch VCF.Filter via Java web start if you always want to be sure to work with the latest version and if you do not need example data.
Use the launch button that reserves a suitable amount of memory on your computer.
VCF file size
VCF file sizes can range from a few kilobytes to tens of gigabytes, depending mainly on the number of variants, the number of samples, and the INFO field annotations provided.
The Biomedical Sequencing Facility has sequenced human genomes of several individuals, whose VCF files are publicly available at the Genom Austria website.
The size of these VCF files is just below 400 MB in compressed (vcf.gz) format.
The typical VCF file size for different resequencing protocols at the Biomedical Sequencing Facility is shown below.
Filtering Genom Austria VCF files
You are welcome to download the Genom Austria VCF files to see how VCF.Filter processes large VCF files.
Download the PGA1_variants.vcf.gz and the corresponding index file PGA1_variants.vcf.gz.tbi to the same directory.
Load the PGA1_variants.vcf.gz file as an example VCF file in File -> Preferences.
To see immediate output, set the output limit in File -> Preferences to 1000.
You can now load the Genom Austria VCF files into VCF.Filter and filter them according to your interests. Load the PGA.fsc filter scenario file that ships with VCF.Filter to identify homozygous variants with a coverage above 50 reads falling into an annotated gene.
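The kind of filter chain described here (homozygous, coverage above 50 reads, inside an annotated gene) can be sketched in a few lines of Python. This is only an illustration: the contents of PGA.fsc are not reproduced, the record layout, the `GENE` key, and the function names are assumptions, and the records are invented toy data.

```python
def is_homozygous_alt(gt: str) -> bool:
    """True for genotypes such as 1/1 or 2|2 (same non-reference allele twice)."""
    alleles = gt.replace("|", "/").split("/")
    return (
        len(alleles) == 2
        and alleles[0] == alleles[1]
        and alleles[0] not in ("0", ".")
    )


def passes(record: dict, min_depth: int = 50, gene_key: str = "GENE") -> bool:
    """Chain three filters: zygosity, read depth, and presence of a gene annotation."""
    return (
        is_homozygous_alt(record["GT"])
        and record["DP"] > min_depth
        and bool(record["INFO"].get(gene_key))   # hypothetical annotation key
    )


# Invented toy records, not taken from the Genom Austria files.
records = [
    {"GT": "1/1", "DP": 62, "INFO": {"GENE": "GENE_A"}},   # kept
    {"GT": "0/1", "DP": 80, "INFO": {"GENE": "GENE_B"}},   # heterozygous
    {"GT": "1/1", "DP": 35, "INFO": {"GENE": "GENE_C"}},   # coverage too low
    {"GT": "1|1", "DP": 90, "INFO": {}},                   # not in an annotated gene
]
hits = [r for r in records if passes(r)]
print([r["INFO"]["GENE"] for r in hits])   # ['GENE_A']
```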
Filtering whole genome sequence trios
The National Institute of Standards and Technology (NIST)-hosted Genome in a Bottle Consortium (GIAB) is developing reference materials from well-characterized genomic DNA, including whole genome sequences of family trios (mother, father, child). Use the following table to download the curated compressed VCF files (.gz) along with their index files (.tbi) for the individuals composing a trio of Ashkenazi ancestry, and save them to the same directory. The file names are extremely long. Feel free to shorten them, but make sure that each VCF file and its index file keep the same base name (the index file name is the VCF file name plus .tbi).
Load one of the vcf.gz files as an example VCF file in File -> Preferences. To see immediate output, set the output limit in File -> Preferences to 1000.
Important: Load the GIAB.fsc filter scenario file that ships with VCF.Filter to restrict the number of filtered variants to a manageable amount that doesn't make your computer run out of memory.
You can now load the VCF files into VCF.Filter on the Family analysis tab, define family relations and filter criteria, and identify variants corresponding to different patterns of inheritance.
VCF.Filter source code
The VCF.Filter source code is available as a NetBeans project on GitHub and is released under the GNU GPLv3 open-source license.
Source code
What is the purpose of the example VCF file?
The example VCF file is used to extract the annotations the user may want to filter on, as well as to define the output fields. The example VCF file must have a valid VCF header and must be indexed. After changing the example VCF file, a restart of the software is required.
What VCF format is supported by VCF.Filter?
VCF.Filter can handle VCF files in format 4.2 and lower.
VCF.Filter takes a long time to respond and eventually freezes. Why?
VCF.Filter can be used to filter VCF files of any size. However, VCF.Filter cannot display an arbitrarily large number of variants, because the capacity to display variants is limited by the graphics capabilities of your computer. When VCF.Filter freezes, lower the output limit in the VCF.Filter preferences (File -> Preferences -> Output limit).
Can I filter whole genome VCF files with millions of variants?
Yes, but make sure that the output limit isn't set so high that the graphics capabilities of your computer are overwhelmed.
What are pass lists?
Pass lists are lists containing regions of the genome in .bed or PLINK .hom format that the user wants to focus on. These can be chromosomes, genes that are part of a certain pathway, etc. Using pass lists can speed up your analysis quite significantly.
What are non-pass lists?
Non-pass lists are lists containing regions of the genome in .bed or PLINK .hom format that the user wants to ignore. These can be chromosomes, genes that are part of a certain pathway, etc.
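As a rough illustration of how pass and non-pass lists act on a variant, the following Python sketch tests a VCF position against BED regions. It assumes nothing about VCF.Filter's internals, the function names are hypothetical, and PLINK .hom files are not covered. The one detail worth noting is the coordinate convention: BED intervals are 0-based and half-open, whereas VCF positions are 1-based.

```python
def parse_bed(text):
    """Read BED text into {chrom: [(start, end), ...]} using the first three columns."""
    regions = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def in_regions(chrom, pos, regions):
    """True if the 1-based VCF position lies in any 0-based, half-open BED interval."""
    return any(start < pos <= end for start, end in regions.get(chrom, []))


def keep_variant(chrom, pos, pass_list=None, non_pass_list=None):
    """Keep a variant only if it is inside the pass list and outside the non-pass list."""
    if pass_list is not None and not in_regions(chrom, pos, pass_list):
        return False
    if non_pass_list is not None and in_regions(chrom, pos, non_pass_list):
        return False
    return True


focus = parse_bed("20\t14000\t15000\n")
ignore = parse_bed("20\t14369\t14370\n")
print(keep_variant("20", 14370, pass_list=focus))                        # True
print(keep_variant("20", 14370, pass_list=focus, non_pass_list=ignore))  # False
```

A linear scan over the regions is enough for a sketch; for large lists, sorted intervals with binary search or an interval tree would be the usual choice.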
What is a recurrence file?
A recurrence file lists the frequencies of variants in your cohort of VCF files. Filtering on variant recurrence is useful, for example, to eliminate ethnicity-related variants and/or variants that are unreasonably frequent for technical reasons.
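The idea behind recurrence filtering can be sketched as follows. This Python example is an assumption-laden illustration, not the format or algorithm VCF.Filter uses: variants are keyed by a (chrom, pos, ref, alt) tuple, recurrence is the fraction of cohort samples carrying the variant, and the function names and toy variants are invented.

```python
from collections import Counter


def recurrence(cohort):
    """cohort: list of per-sample collections of (chrom, pos, ref, alt) keys.

    Returns the fraction of samples in which each variant was seen.
    """
    counts = Counter()
    for variants in cohort:
        counts.update(set(variants))     # count each variant once per sample
    n = len(cohort)
    return {key: count / n for key, count in counts.items()}


def filter_by_recurrence(variants, freqs, max_freq):
    """Drop variants whose cohort recurrence exceeds max_freq."""
    return [v for v in variants if freqs.get(v, 0.0) <= max_freq]


common = ("1", 1000, "A", "G")           # invented: present in every sample
rare = ("2", 2000, "C", "T")             # invented: present in one sample
cohort = [[common, rare], [common], [common], [common]]

freqs = recurrence(cohort)
print(freqs[common], freqs[rare])                        # 1.0 0.25
print(filter_by_recurrence([common, rare], freqs, 0.5))  # [('2', 2000, 'C', 'T')]
```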
How does VCF.Filter analyze family pedigrees?
VCF.Filter considers affected and unaffected individuals of a family pedigree separately. VCF.Filter then identifies recessive variants, dominant variants, X-linked variants, de novo variants, and compound heterozygous variants. VCF.Filter does not consider incomplete (partial) penetrance. Therefore, unaffected individuals who carry the pathogenic genotype may mask causative variants.
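The separation of affected and unaffected individuals can be made concrete with a simplified Python sketch. It is a textbook-style approximation under full penetrance, not VCF.Filter's actual implementation; genotypes are encoded as the number of alternative alleles (0, 1, or 2), and the function names are hypothetical. X-linked inheritance is left out for brevity.

```python
def recessive(affected, unaffected):
    """All affected are homozygous alternative; no unaffected individual is."""
    return all(g == 2 for g in affected) and all(g < 2 for g in unaffected)


def dominant(affected, unaffected):
    """All affected carry the variant; no unaffected individual does."""
    return all(g >= 1 for g in affected) and all(g == 0 for g in unaffected)


def de_novo(child, mother, father):
    """The child carries a variant that neither parent has."""
    return child >= 1 and mother == 0 and father == 0


# Affected child, unaffected heterozygous parents: fits a recessive model.
print(recessive(affected=[2], unaffected=[1, 1]))        # True
# An unaffected sibling with the same homozygous genotype masks the variant.
print(recessive(affected=[2], unaffected=[1, 1, 2]))     # False
print(de_novo(child=1, mother=0, father=0))              # True
```

The second call shows the penetrance caveat from the text: one unaffected individual with the pathogenic genotype is enough to exclude the variant.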
Wrong gene symbol field defined in Preferences -> Annotations. Compound heterozygous variant search isn't possible and won't be performed.
VCF.Filter needs a valid annotation for gene symbols in order to perform filtering for compound heterozygous variants. Since the gene symbol field needs to be defined by the user, VCF.Filter initializes it to the CHROM field, which is clearly wrong. To set the gene symbol annotation field, go to File -> Preferences -> Annotations and choose the correct field from the pull-down list. If there is no gene symbol annotation, family analysis can nevertheless be performed but the list of compound heterozygous variants will be empty.
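Why a gene symbol is needed becomes clear from a minimal sketch of a compound heterozygous search in a trio. The Python code below is a simplified illustration, not VCF.Filter's algorithm: it assumes one affected child with both parents genotyped, encodes genotypes as alternative allele counts, and uses invented gene names. Variants must be grouped by gene before the test can be applied, so grouping by CHROM instead would produce meaningless results.

```python
from collections import defaultdict


def compound_het_genes(variants):
    """variants: iterable of (gene, child, mother, father) alt-allele counts.

    Returns genes in which the child is heterozygous for at least one variant
    inherited only from the mother and at least one inherited only from the father.
    """
    by_gene = defaultdict(lambda: {"maternal": 0, "paternal": 0})
    for gene, child, mother, father in variants:
        if child != 1:
            continue                      # only heterozygous calls in the child count
        if mother == 1 and father == 0:
            by_gene[gene]["maternal"] += 1
        elif father == 1 and mother == 0:
            by_gene[gene]["paternal"] += 1
    return sorted(g for g, c in by_gene.items() if c["maternal"] and c["paternal"])


trio = [
    ("GENE_A", 1, 1, 0),   # inherited from the mother
    ("GENE_A", 1, 0, 1),   # inherited from the father
    ("GENE_B", 1, 1, 0),   # both GENE_B variants come from the mother
    ("GENE_B", 1, 1, 0),
]
print(compound_het_genes(trio))   # ['GENE_A']
```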
The Run button is red but I can run the analysis anyway. Can I trust it?
This happens during family analysis when the gene symbol annotation field is not defined. You can trust the analysis but you should keep in mind that compound heterozygous variants will not be reported. It also happens during variant recurrence calculation when the gene symbol field is not defined. The output can be trusted but may not be very useful in this case.
How can I calculate variant recurrence in my cohort?
Load the VCF files of your cohort and tick the 'Calculate recurrence' field in the 'Output and controls' area. If the Run button is red, you need to define the gene symbol annotation field. Click Run and wait until VCF.Filter has processed your entire cohort. Save the output, without the waiting message, to a tab-separated text file with a .tsv extension. You can now load your recurrence.tsv file and use it for filtering.
On my Mac the application won't load because it is from an unidentified developer.
This is a Mac security setting. After trying to launch the software, go to System Preferences -> Security and Privacy -> Open Anyway.
On my Mac Copy and Paste don't work.
You have to use the Ctrl-C and the Ctrl-V key combinations as if you were using a Windows computer.