A quick review of the VCF file format.
The format includes several straightforward columns like CHROM, POS, and ID, where the one-word heading describes the one type of value that appears in that column.
The format also includes three rather complicated columns, each of which essentially has a file format unto itself.
__________________
INFO - in the file header, several INFO fields are described. You could think of each entry in the INFO column as its own little object, with the INFO headers serving as the class description for that object.
For example:
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
Can be read as
The object shall have a member called NS, which will have a length of 1, be of type Integer, and represent the Number of Samples With Data.
So the header:
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
Tells you how to interpret the entry:
NS=3;DP=14;AF=0.5;DB;H2
or the entry:
NS=2;DP=10;AF=0.333,0.667;AA=T;DB
Notice that AF has two values in the second case. Number is not always a number. Sometimes a member of the INFO field has one value per alternate allele, so instead of a number the value of "Number" is "A". There are some other special cases denoted by letters or symbols, such as "G" (one value per possible genotype) and "." (unknown or variable number of values).
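To make the "class description" analogy concrete, here is a minimal Python sketch that reads the INFO headers above and uses them to interpret the two example entries. The function names (parse_info_headers, parse_info_entry) are hypothetical and chosen only for illustration; this is not how IGB's VCF.java does it, and it assumes well-formed header lines with the keys in the order ID, Number, Type, Description.

```python
import re

# Assumes the key order ID, Number, Type, Description (as in the headers above).
INFO_HEADER = re.compile(
    r'##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="([^"]*)">'
)

# Map the VCF type names to Python conversions; Flag has no value to convert.
CASTS = {"Integer": int, "Float": float, "String": str, "Character": str}

HEADERS = [
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">',
    '##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">',
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">',
    '##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">',
]


def parse_info_headers(lines):
    """Build the 'class description': ID -> Number, Type, Description."""
    defs = {}
    for line in lines:
        match = INFO_HEADER.match(line)
        if match:
            ident, number, vtype, desc = match.groups()
            defs[ident] = {"Number": number, "Type": vtype, "Description": desc}
    return defs


def parse_info_entry(entry, defs):
    """Interpret one INFO column value, e.g. 'NS=3;DP=14;AF=0.5;DB;H2'."""
    result = {}
    for item in entry.split(";"):
        key, sep, raw = item.partition("=")
        if not sep:
            result[key] = True  # a Flag: present means true
            continue
        meta = defs.get(key)
        cast = CASTS.get(meta["Type"], str) if meta else str
        # "." marks a missing value, so it is kept as None rather than converted.
        values = [None if v == "." else cast(v) for v in raw.split(",")]
        # Number=1 becomes a scalar; "A", "G", "." and counts > 1 stay as lists.
        result[key] = values[0] if meta and meta["Number"] == "1" else values
    return result


defs = parse_info_headers(HEADERS)
first = parse_info_entry("NS=3;DP=14;AF=0.5;DB;H2", defs)
second = parse_info_entry("NS=2;DP=10;AF=0.333,0.667;AA=T;DB", defs)
```

Keeping AF as a list even when it holds a single value is a design choice in this sketch: a field declared with Number=A can always grow to several values, so callers never have to check whether they got a scalar or a list.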
__________________
FILTER - in the header, several filters are listed with an ID and a description. The ID is the string that appears in the FILTER field for a row (a SNP) that fails to pass that filter. The description is the human-readable explanation for SNPs that fail to pass that filter. The FILTER field value for any SNP that passes all filters is "PASS".
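As a rough illustration of the FILTER lookup, the Python sketch below maps a row's FILTER value back to the header descriptions. The two header lines (q10 and s50) are illustrative values in the style of the specification's example, not taken from any IGB file, and the function names are hypothetical. It relies on two details from the specification: multiple failed filters are separated by ";", and "." means no filters were applied.

```python
import re

FILTER_HEADER = re.compile(r'##FILTER=<ID=([^,]+),Description="([^"]*)">')

# Illustrative header lines; real files declare their own filter IDs.
EXAMPLE_FILTER_HEADERS = [
    '##FILTER=<ID=q10,Description="Quality below 10">',
    '##FILTER=<ID=s50,Description="Less than 50% of samples have data">',
]


def parse_filter_headers(lines):
    """Map each filter ID to its human-readable description."""
    filters = {}
    for line in lines:
        match = FILTER_HEADER.match(line)
        if match:
            filters[match.group(1)] = match.group(2)
    return filters


def explain_filter(value, filters):
    """Return the descriptions of the filters a row failed.

    An empty list means the row passed every filter ("PASS");
    None means filtering was not applied at all (".").
    """
    if value == "PASS":
        return []
    if value == ".":
        return None
    # An ID missing from the header is reported rather than raising an error.
    return [filters.get(code, "(undeclared filter: %s)" % code)
            for code in value.split(";")]


filters = parse_filter_headers(EXAMPLE_FILTER_HEADERS)
failed = explain_filter("q10;s50", filters)
```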
__________________
FORMAT - is like its own little file format describing how to parse the fields for each sample. Each sample is a column. For a given sample (column) in a given row (SNP), there are several values separated by ":". The FORMAT headers describe the ID, number of values, value type, and description for each possible member (just like INFO).
Often the FORMAT field is the same for many or all rows, so it is tempting to think there should just be one "format" used for the whole file. But the various attributes are not always relevant, or not always present, for all entries. I don't remember if there is even any guarantee that the values that are present will appear in the same order. So each individual row gets its own mini format summary in the FORMAT field, the FORMAT headers elaborate on what those attributes mean, and the sample columns are where the actual values are.
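The per-row relationship between the FORMAT field and the sample columns can be sketched in a few lines of Python. The function name parse_samples and the sample values are illustrative only. Because each row carries its own FORMAT string, the keys are read from the row itself rather than assumed for the whole file. Padding a short sample column with None is an assumption about how a parser might handle dropped trailing values, and the values are left as strings rather than typed from the FORMAT headers.

```python
def parse_samples(format_field, sample_names, sample_fields):
    """Pair one row's FORMAT keys with each sample column's values.

    format_field  -- the row's FORMAT column, e.g. "GT:GQ:DP:HQ"
    sample_names  -- sample IDs from the #CHROM header line
    sample_fields -- that row's sample columns, in the same order
    """
    keys = format_field.split(":")
    parsed = {}
    for name, field in zip(sample_names, sample_fields):
        # "." marks a missing value; keep everything else as the raw string.
        values = [None if v == "." else v for v in field.split(":")]
        # Assumption: a sample column may have fewer values than FORMAT has keys.
        values += [None] * (len(keys) - len(values))
        parsed[name] = dict(zip(keys, values))
    return parsed


# Illustrative row: the second sample omits its last value.
row = parse_samples(
    "GT:GQ:DP:HQ",
    ["NA00001", "NA00003"],
    ["0|0:48:1:51,51", "1/1:43:5"],
)
```

A fuller parser would look up each key in the FORMAT headers to convert "48" to an integer or split "51,51" into two values, in the same way the INFO headers drive the interpretation of the INFO column.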
I think this is the file I will need to change, possibly the only file.
core/genometry/src/main/java/com/affymetrix/genometry/symloader/VCF.java
This post is helpful:
https://bioinformatics.stackexchange.com/questions/344/whats-the-difference-between-vcf-spec-versions-4-1-and-4-2
For reference, the Samtools PDF specifications for VCF 4.1 (currently supported in IGB), 4.2 (not yet supported), and 4.3 (not yet supported):
https://samtools.github.io/hts-specs/VCFv4.1.pdf
https://samtools.github.io/hts-specs/VCFv4.2.pdf
https://samtools.github.io/hts-specs/VCFv4.3.pdf