Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Done
- Affects Version/s: None
- Fix Version/s: 9.0.2 Minor Release
- Labels:
- Story Points: 2
- Sprint: Fall 2018 1, Fall 2018 Sprint 2
Description
Support newer VCF formats in IGB.
(User request)
Attachments
- Diffs4.1To4.2 (7 kB)
- Diffs4.2To4.3 (126 kB)
Issue Links
- relates to HELP-303 VCFv4 compatibility (Closed)
Activity
To highlight differences between versions, download this repo:
https://github.com/samtools/hts-specs
That includes these files:
VCFv4.1.tex
VCFv4.2.tex
VCFv4.3.tex
Then do
$ diff VCFv4.1.tex VCFv4.2.tex > Diffs4.1To4.2
In Sublime (and probably other programs) the differences are color-coded, with pink for the 4.1 file and green for the 4.2 file, and with the diff change command (the line numbers that changed, such as 27c27) in gray:
27c27
< ##fileformat=VCFv4.1
---
> ##fileformat=VCFv4.2
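The same comparison can be scripted. This is only a rough sketch using Python's standard difflib; the function name changed_lines is made up for illustration, and the output is in unified diff style rather than the classic "27c27" style shown above.

```python
import difflib

def changed_lines(old_lines, new_lines, old_name="VCFv4.1.tex", new_name="VCFv4.2.tex"):
    """Return unified-diff hunks (no context lines) between two lists of lines."""
    diff = difflib.unified_diff(
        old_lines, new_lines,
        fromfile=old_name, tofile=new_name,
        n=0,          # no surrounding context, only the changed lines
        lineterm="",  # the input lines carry no trailing newline here
    )
    # Drop the two "---"/"+++" file header lines; keep hunk markers and changes.
    return list(diff)[2:]

# Tiny in-memory stand-ins for the two spec files.
old = ["\\section{Example}", "##fileformat=VCFv4.1", "##fileDate=20090805"]
new = ["\\section{Example}", "##fileformat=VCFv4.2", "##fileDate=20090805"]
result = changed_lines(old, new)
```

For the real files, read each one with open(path).read().splitlines() and pass the two lists in.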
A quick review of VCF file format....
The format includes several straightforward columns like CHROM, POS, ID, where the one-word heading describes the one type of value that appears in that column.
The format also includes 3 rather complicated columns that each essentially have a file format unto themselves.
__________________
INFO - in the file header, several INFO fields are described. You could think of each entry in the INFO column as its own little object, and the INFO headers are the class description for that object.
For example:
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
Can be read as
The object shall have a member called NS, which will have a length of 1, be of type Integer, and represent the Number of Samples With Data
So the header:
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
Tells you how to interpret the entry:
NS=3;DP=14;AF=0.5;DB;H2
or the entry:
NS=2;DP=10;AF=0.333,0.667;AA=T;DB
Notice that AF has two values in the second case. Number is not always a number. Sometimes a member in the INFO field has a value for each alternate allele, so instead of a number the value of "Number" is "A". There are some other special cases denoted by letters.
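To make the "class description" idea concrete, here is a minimal sketch in Python (not IGB's Java parser). It assumes the four standard keys appear in the order shown in the headers above; the names parse_info_headers and parse_info_entry are made up for illustration.

```python
import re

# Matches the four standard keys of an ##INFO header line, in the usual order.
INFO_HEADER = re.compile(
    r'##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<description>[^"]*)"'
)

# How to convert the text of a value for each declared Type.
CASTS = {"Integer": int, "Float": float, "String": str, "Character": str}

def parse_info_headers(lines):
    """Map each INFO ID to its declared (Number, Type)."""
    defs = {}
    for line in lines:
        m = INFO_HEADER.match(line)
        if m:
            defs[m["id"]] = (m["number"], m["type"])
    return defs

def parse_info_entry(entry, defs):
    """Decode one INFO column value using the header definitions."""
    result = {}
    for item in entry.split(";"):
        key, _, raw = item.partition("=")
        # Undeclared keys are treated as free-form strings (an assumption).
        number, type_ = defs.get(key, (".", "String"))
        if type_ == "Flag":
            result[key] = True  # a flag has no value; presence means true
            continue
        cast = CASTS.get(type_, str)
        values = [None if v == "." else cast(v) for v in raw.split(",")]
        # Number=1 yields a scalar; anything else (A, R, G, ., 2, ...) a list.
        result[key] = values[0] if number == "1" else values
    return result

HEADERS = [
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">',
    '##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">',
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">',
    '##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">',
]
DEFS = parse_info_headers(HEADERS)
second_entry = parse_info_entry("NS=2;DP=10;AF=0.333,0.667;AA=T;DB", DEFS)
```

Here second_entry keeps AF as a two-element list because its Number is "A", while NS and DP come back as plain integers.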
__________________
FILTER - in the header, several filters are listed with an ID and a description. The ID is the string that appears in the FILTER field for a row (a SNP) that fails to pass that filter. The description is the human-readable description for SNPs that fail to pass that filter. The FILTER field value for any SNP that passes all filters is "PASS".
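A small sketch of how a reader might interpret the FILTER column, in Python for illustration only. It relies on two details from the specification that are not spelled out above: several failed filters are separated by semicolons, and "." means no filters were applied. The function names are hypothetical, and the two filter definitions are the ones used in the specification's example header.

```python
def failed_filters(filter_field):
    """Return the IDs of the filters a record failed (empty if none)."""
    # "PASS" means all filters passed; "." means filters were not applied.
    if filter_field in ("PASS", "."):
        return []
    return filter_field.split(";")

def describe_failures(filter_field, filter_defs):
    """Look up the header description for each failed filter ID."""
    return [
        filter_defs.get(fid, "(no description in header)")
        for fid in failed_filters(filter_field)
    ]

# IDs and descriptions as they would be read from ##FILTER header lines.
FILTER_DEFS = {
    "q10": "Quality below 10",
    "s50": "Less than 50% of samples have data",
}
```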
__________________
FORMAT - is like its own little file format describing how to parse the fields for each sample. Each sample is a column. For a given sample (column) and a given row (SNP), there are several values separated by ":". The FORMAT headers describe the ID, number of values, value type, and description for each possible member (just like INFO).
Often the FORMAT field is the same for many or all rows, so it is tempting to think there should just be one "format" used for the whole file. But the various attributes are not always relevant, or not always present for all entries. I don't remember if there is even any guarantee that the values that are present will appear in the same order. So each individual row gets its own mini format summary in the FORMAT field, the FORMAT headers elaborate on what those attributes mean, and the sample columns are where the actual values are.
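As a sketch of the per-row idea (Python, for illustration; parse_samples is a made-up name): each row's own FORMAT string is used to label that row's sample values, so rows with different keys or a different key order are handled the same way. The padding step assumes, as the specification allows, that trailing fields may be dropped from a sample.

```python
def parse_samples(format_field, sample_fields, sample_names):
    """Pair each sample's ':'-separated values with the row's FORMAT keys."""
    keys = format_field.split(":")
    parsed = {}
    for name, field in zip(sample_names, sample_fields):
        values = field.split(":")
        # Trailing fields may be omitted, so pad the missing ones with None.
        values += [None] * (len(keys) - len(values))
        parsed[name] = dict(zip(keys, values))
    return parsed

# One row with four FORMAT keys; the second sample drops its trailing fields.
row = parse_samples(
    "GT:GQ:DP:HQ",
    ["0|0:48:1:51,51", "1|0:48"],
    ["NA00001", "NA00002"],
)
```

The values are left as strings here; converting them would use the Type from the matching ##FORMAT header, the same way as for INFO.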
Changes to the v4.2 compared to v4.1:
- INFO header line format: Source and Version added as recommended fields.
- INFO field can have one value for each possible allele (code R).
- For all of the ##INFO, ##FORMAT, ##FILTER, and ##ALT metainformation, extra fields can be included after the default fields.
- Alternate base (ALT) can include *: missing due to an upstream deletion.
- Quality scores, a sentence removed: High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired.
- Examples changed a bit.
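To illustrate the new R code next to the existing special values of Number, here is a short Python sketch (the function name is hypothetical). It assumes the usual meanings: A is one value per alternate allele, R is one value per allele including the reference, G is one value per genotype, and "." is unknown or variable.

```python
def expected_value_count(number, n_alt):
    """How many values a field should carry, given its header Number.

    Returns None when the count cannot be known from the ALT column alone.
    """
    if number == "A":
        return n_alt        # one per alternate allele
    if number == "R":
        return n_alt + 1    # one per allele, reference included (new in v4.2)
    if number in ("G", "."):
        return None         # G depends on ploidy; "." is unbounded or unknown
    return int(number)      # a plain integer such as 0, 1, or 2

# A record with ALT "G,T" has two alternate alleles.
n_alt = len("G,T".split(","))
```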
The htsjdk library contains vcf code:
https://github.com/samtools/htsjdk/tree/master/src/main/java/htsjdk/variant/vcf
Investigate this code and how we can use it. We recently upgraded to using this library, but have not yet investigated its VCF-related packages. It can probably help.
from: https://samtools.github.io/hts-specs/VCFv4.3.pdf
List of changes
7.1 Changes to VCFv4.3
• More strict language: "should" replaced with "must" where appropriate
• Tables with Type and Number definitions for INFO and FORMAT reserved keys
7.2 Changes between VCFv4.2 and VCFv4.3
• VCF compliant implementations must support both LF and CR+LF newline conventions
• INFO and FORMAT tag names must match the regular expression ^[A-Za-z_][0-9A-Za-z_.]*$
• Spaces are allowed in INFO field values
• Characters with special meaning (such as ';' in INFO, ':' in FORMAT, and '%' in both) can be encoded using the percent encoding (see Section 1.2)
• The character encoding of VCF files is UTF-8.
• The SAMPLE field can contain optional DOI URL for the source data file
• Introduced ##META header lines for defining phenotype metadata
• New reserved tag "CNP" analogous to "GP" was added. Both CNP and GP use 0 to 1 encoding, which is a change from previous phred-scaled GP.
• In order for VCF and BCF to have the same expressive power, we state explicitly that Integers and Floats are 32-bit numbers. Integers are signed.
• We state explicitly that zero length strings are not allowed, this includes the CHROM and ID column, INFO IDs, FILTER IDs and FORMAT IDs. Meta-information lines can be in any order, with the exception of ##fileformat which must come first.
• All header lines of the form ##key=<ID=xxx,...> must have an ID value that is unique for a given value of "key". All header lines whose value starts with "<" must have an ID field. Therefore, also ##PEDIGREE newly requires a unique ID.
• We state explicitly that duplicate IDs, FILTER, INFO or FORMAT keys are not valid.
• A section about gVCF was added, introduced the <*> symbolic allele.
• A section about tag naming conventions was added.
• New reserved AD, ADF, and ADR INFO and FORMAT fields added.
• Removed unused and ill-defined GLE FORMAT tag.
• Chromosome names cannot use reserved symbolic alleles and contain characters used by breakpoints (Section 1.4.7).
• IUPAC ambiguity codes should be converted to a concrete base.
• Symbolic ALTs for IUPAC codes.
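Two of the 4.3 rules above are easy to show in code. This is a Python sketch with made-up function names. The tag-name pattern is written with underscores, as in the published specification. The decoder uses urllib's general percent-decoding, which handles a superset of the handful of escapes the specification defines, so treat it as an approximation.

```python
import re
from urllib.parse import unquote

# Tag-name rule for INFO and FORMAT keys in VCFv4.3.
TAG_NAME = re.compile(r"^[A-Za-z_][0-9A-Za-z_.]*$")

def is_valid_tag(name):
    """True if the name satisfies the VCFv4.3 tag-name pattern."""
    return TAG_NAME.match(name) is not None

def decode_value(value):
    """Undo percent-encoding in an INFO or FORMAT value (e.g. %3B -> ';')."""
    return unquote(value)

decoded = decode_value("note=a%3Bb%3Dc 100%25")
```

A reader that only displays data, like IGB, would probably only need the decoding half; the name check matters mostly when writing files.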
7.3 Changes between BCFv2.1 and BCFv2.2
• BCF header lines can include optional IDX field
• We introduce end-of-vector byte and reserve 8 values for future use
• Clarified that except the end-of-vector byte, no other negative values are allowed in the GT array
• String vectors in BCF do not need to start with comma, as the number of values is indicated already in the definition of the tag in the header.
• The implicit filter PASS was described inconsistently throughout BCFv2.1: It is encoded as the first entry in the dictionary, not the last.
Looking at the list of changes between version 4.2 and 4.3...
Most of these changes are relevant when writing VCF files.
The information that IGB uses to display the entry in the context of the genome is unchanged. The file format is more explicit, and a bit stricter, but there is no need to make our parser reflect these rules. Our parser should allow IGB to show the data that appears in the file; it's not our parser's job to enforce specific format rules.
Aside from allowing format=4.3, I don't see any further changes that should be needed to read VCFv4.3 compared to VCFv4.2.
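The version check itself could look something like this. It is a Python sketch, not the actual change in VCF.java, and the set of accepted versions is an assumption chosen for illustration.

```python
import re

FILEFORMAT = re.compile(r"^##fileformat=VCFv(\d+)\.(\d+)$")

def vcf_version(first_line):
    """Extract (major, minor) from the ##fileformat line, or raise ValueError."""
    m = FILEFORMAT.match(first_line.strip())
    if m is None:
        raise ValueError("not a ##fileformat line: " + first_line.strip())
    return int(m[1]), int(m[2])

# Hypothetical list of versions the reader is willing to open.
SUPPORTED_VERSIONS = {(4, 1), (4, 2), (4, 3)}

def is_supported(first_line, supported=SUPPORTED_VERSIONS):
    """True if the file declares a version in the accepted set."""
    return vcf_version(first_line) in supported
```

Keeping the accepted versions in one set means that allowing a later release is a one-line change, provided nothing else in the reader depends on the version.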
The changes for 4.2 (compared to 4.1)
- INFO field can have one value for each possible allele (code R).
This allows one more possible special value to the number value for an INFO field. IGB does not currently read/use this field. No change needed.
- For all of the ##INFO, ##FORMAT, ##FILTER, and ##ALT meta information, extra fields can be included after the default fields.
As part of a general improvement to the parser, we should include a map for each entry that would be able to hold these open-form values. These values would be part of the selection info for an entry, but they will not affect how the entry is displayed. This can be handled as part of issue IGBF-543.
- Alternate base (ALT) can include *: missing due to an upstream deletion.
I am still figuring out if/how the code might need to change to accommodate this. None of my example v4.2 files use this option.
As for allowing '*' as a symbol in ALT:
None of my test files use that option. I manually edited a file to have <*> as the alt allele, and it looks like the string "<*>" was passed along in place of a letter, or in place of "<X>", without any issue.
So, as far as I can tell, we don't need to make any changes to accommodate this option.
This notation is explained in section 5.5 of the format specification, but I am still unclear about how this type should be represented. I don't know if the END tag is mandatory when the <*> is used. I think using the END tag when it is present should be part of our general improvement to the parser.
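For reference, here is one way the different kinds of ALT values could be told apart, sketched in Python with made-up names. The END handling rests on an assumption that would need checking against the specification: when the INFO column has no END tag, the record is taken to end at POS plus the length of REF minus 1.

```python
import re

BASES = re.compile(r"^[ACGTNacgtn]+$")

def classify_alt(allele):
    """Label a single ALT allele by the kind of notation it uses."""
    if allele == ".":
        return "missing"
    if allele == "*":
        return "upstream_deletion"      # v4.2: allele missing due to upstream deletion
    if allele == "<*>":
        return "unspecified_symbolic"   # v4.3 gVCF: any allele not otherwise listed
    if allele.startswith("<") and allele.endswith(">"):
        return "symbolic"               # e.g. <DEL>, <X>
    if "[" in allele or "]" in allele:
        return "breakend"
    if BASES.match(allele):
        return "bases"
    return "unknown"

def classify_alts(alt_field):
    """Classify every allele in a comma-separated ALT column."""
    return [classify_alt(a) for a in alt_field.split(",")]

def record_end(pos, ref, info):
    """End position of a record: the END tag if present, else POS + len(REF) - 1."""
    if "END" in info:
        return int(info["END"])
    return pos + len(ref) - 1
```

Note that '*' and '<*>' are different notations: the first came in with v4.2, the second with the gVCF section in v4.3.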
The goal of this issue is to make IGB accept the VCF 4.2 (and 4.3) format. That much has been accomplished, see this branch:
https://bitbucket.org/IvoryBlak/integrated-genome-browser/branch/IGBF-1380_Support_VCF4.2
v4.2 introduces new feature options that we should incorporate into the info for each feature. That improvement, and other improvements/corrections to the existing parser will be handled later as part of issue IGBF-543.
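One possible shape for the "map of extra header fields" idea, sketched in Python rather than in the Java parser. The names are hypothetical and escaped quotes inside values are not handled. The example header line is invented to show Source and Version.

```python
STANDARD_KEYS = {"ID", "Number", "Type", "Description"}

def parse_meta_line(line):
    """Split '##KEY=<k1=v1,k2="v,2",...>' into (KEY, {k1: v1, ...}).

    Unstructured lines such as ##fileformat=VCFv4.2 return (KEY, value).
    """
    key, _, body = line[2:].partition("=")
    if not (body.startswith("<") and body.endswith(">")):
        return key, body
    fields = {}
    name, value, in_quotes, reading_value = "", "", False, False
    for ch in body[1:-1] + ",":          # the trailing comma flushes the last field
        if ch == '"':
            in_quotes = not in_quotes    # quotes delimit values; they are not kept
        elif ch == "," and not in_quotes:
            if name:
                fields[name] = value
            name, value, reading_value = "", "", False
        elif ch == "=" and not in_quotes and not reading_value:
            reading_value = True         # first '=' separates name from value
        elif reading_value:
            value += ch
        else:
            name += ch
    return key, fields

def extra_fields(fields):
    """Keep only the fields beyond the standard four."""
    return {k: v for k, v in fields.items() if k not in STANDARD_KEYS}

kind, parsed = parse_meta_line(
    '##INFO=<ID=DB,Number=0,Type=Flag,'
    'Description="dbSNP membership, build 129",Source="dbSNP",Version="129">'
)
```

Here extra_fields(parsed) holds only Source and Version, which is the part that would go into the selection info without affecting display.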
I am moving this to Needs first level review.
Issue IGBF-930 has a good test file.
You can look for more in jira using
attachments is not EMPTY AND text ~ VCF
under Issues > Search for issues
I also have a couple of files I have acquired from researchers that I cannot include here (they are big, and not public), but I can share them directly with the tester.
The more test files the better. This format allows for a lot of options, and the files I have do not represent all options.
To test, drag/drop VCF files into IGB.
On master, VCF v4.2 files will not open; you'll get an error message.
On my branch, they should open and display data.
Ivory Blakley : Tested on Windows with the basic VCF file from issue IGBF-930. It works as expected in your branch, but I have one concern.
Once we drag and drop the VCF file onto IGB, it shows the message "Zoom in to display data", but the data has not loaded yet. We need to click the "Load data" button before the data starts appearing, so that zoom message is misleading!
Apart from that, it works as per the test case described above. Detailed testing will be done by Nowlan/Ahn, I suppose?
For Code review:
Completed reviewing the code. I have left a few comments and some points that need clarification. Please look into those.
Thanks!
Moving to To-do column and assigning to Ivory Blakley.
I think it will be best to talk through the code comments in person. Sneha and I can talk through this on Tuesday afternoon.
Notes from in-person review:
remove the code that was commented out in core/genometry/src/main/java/com/affymetrix/genometry/symloader/VCF.java
line 328
core/genometry/src/main/java/com/affymetrix/genometry/symloader/VCF.java
line 520
Otherwise, comments were addressed in discussion. No need for further testing. Remove the commented code and create pull request.
I made those changes, rebased onto upstream master, and created a pull request.
ann.loraine commented on pull request #623: IGBF-1380 Support VCF4.2
How is this more efficient? Does it reduce:
- processing time (less work per line of data)
- memory usage (how much?)
It seems to me that the main value of this change is it reduces complexity of the code, making it easier to understand and maintain.
_________________
You are right. "more efficient" was not the right phrase. "simpler" is better.
I reduced the comments that I added for the INFO and FORMAT sub-class definitions in VCF.java.
Merged into master
My testing verifies that all VCF versions are now compatible with IGB and load as expected. There was no exception noted in the console during my testing and all other file formats are loading as expected, so this story will now be closed.
I think this is the file I will need to change, possibly the only file.
core/genometry/src/main/java/com/affymetrix/genometry/symloader/VCF.java
This post is helpful:
https://bioinformatics.stackexchange.com/questions/344/whats-the-difference-between-vcf-spec-versions-4-1-and-4-2
For reference, the PDF Samtools specification for VCF 4.1 (currently supported in IGB), 4.2 (not yet supported), and 4.3 (not yet supported)
https://samtools.github.io/hts-specs/VCFv4.1.pdf
https://samtools.github.io/hts-specs/VCFv4.2.pdf
https://samtools.github.io/hts-specs/VCFv4.3.pdf