[IGBF-4082] Investigate: How is IGB parsing and displaying VCF files? - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
2
Epic Link:
Improve VCF in IGB
Sprint:
Spring 1, Spring 2

Description

Task: Write a high-level overview of how IGB is parsing and displaying a VCF file. This will include naming key classes in IGB that are important for parsing and viewing the VCF file, as well as the overall flow.

There are two types of VCF files we want to look at for this task: one with population-level information ("ALL.apol1.sample.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf" from the igvData.zip file) and one with individual-specific information ("NG1LRQNESI.hard-filtered.vcf.gz"). These files can be found on Google Drive in the VCF project folder: https://drive.google.com/drive/folders/1PTa3rcCp59BRUcd2mB8BfemK2d3Kn3n5?usp=drive_link

Attachments

Issue Links

is blocked by

IGBF-4106 Integrate htsjdk library for parsing VCF in IGB

Closed

relates to

IGBF-4083 Investigate: How is IGV parsing and displaying VCF files?

Closed

Activity

saideepthi jagarapu (Inactive) added a comment - 13/Feb/25 11:30 AM - edited

IGB VCF parsing flow

Key classes used:
The classes used for parsing and visualisation are all custom IGB classes; no external VCF library (such as htsjdk) is used.

1. VCF

Responsible for reading, parsing, and storing variants from the VCF file.
Uses manual string splitting and regex-based parsing to extract metadata and variant information.

2. SeqSymmetry

Stores the standard metadata for each variant.

3. BAMSym

Represents insertions, deletions, and other structural changes.

4. GraphIntervalSym

Stores numeric data from the VCF file.
Used for graphing info/format fields in IGB.

No parsing library is used; all parsing is done by hand.

VCF (parsing and processing)

1. Loading the file:

The VCF file is opened as an InputStream for processing.

2. Reading and extracting metadata:

The InputStream is wrapped in a BufferedReader so the text can be read and processed line by line.

3. Parsing metadata and header

If a line starts with "##", the parser extracts the INFO, FILTER, and FORMAT metadata and stores it in maps (infoMap, filterMap, formatMap).
The header line (#CHROM POS ... FORMAT SAMPLE1 SAMPLE2 ...) is parsed manually, and the sample names are stored in an array.
Many regex patterns are used to extract the metadata.
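
To illustrate steps 1 to 3, here is a minimal sketch in Java. It is not the IGB code: the class name VcfHeaderSketch, its methods, and the two regex patterns are made up for illustration, and only the map names (infoMap, filterMap, formatMap) come from the notes above. It assumes tab-separated header columns and keeps only the Description of each metadata entry.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; not part of IGB.
class VcfHeaderSketch {
    // Matches e.g. ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">
    private static final Pattern META_ID =
            Pattern.compile("^##(INFO|FILTER|FORMAT)=<ID=([^,>]+)");
    private static final Pattern DESCRIPTION =
            Pattern.compile("Description=\"([^\"]*)\"");

    final Map<String, String> infoMap = new LinkedHashMap<>();
    final Map<String, String> filterMap = new LinkedHashMap<>();
    final Map<String, String> formatMap = new LinkedHashMap<>();
    String[] sampleNames = new String[0];

    // Reads "##" metadata lines and the "#CHROM" header line, then stops.
    static VcfHeaderSketch read(InputStream in) throws IOException {
        VcfHeaderSketch header = new VcfHeaderSketch();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("##")) {
                header.storeMetaLine(line);
            } else if (line.startsWith("#CHROM")) {
                String[] cols = line.split("\t");
                // Columns 0-8 are fixed (CHROM .. FORMAT); samples start at index 9.
                if (cols.length > 9) {
                    header.sampleNames = Arrays.copyOfRange(cols, 9, cols.length);
                }
                break;
            } else {
                break; // first data line reached without a header line
            }
        }
        return header;
    }

    // Stores ID -> Description in the map that matches the line type.
    private void storeMetaLine(String line) {
        Matcher id = META_ID.matcher(line);
        if (!id.find()) {
            return; // other metadata such as ##fileformat is ignored here
        }
        Matcher desc = DESCRIPTION.matcher(line);
        String description = desc.find() ? desc.group(1) : "";
        switch (id.group(1)) {
            case "INFO":
                infoMap.put(id.group(2), description);
                break;
            case "FILTER":
                filterMap.put(id.group(2), description);
                break;
            default:
                formatMap.put(id.group(2), description);
                break;
        }
    }

    // Convenience wrapper for trying the sketch on an in-memory string.
    static VcfHeaderSketch readFromString(String text) {
        try {
            return read(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling VcfHeaderSketch.readFromString(...) on a small header fills the three maps and the sample name array, which is roughly the state the notes describe before variant lines are parsed.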

4. Parsing variant information

Converts POS from 1-based (VCF) to 0-based (IGB) coordinates.
Extracts information such as INFO fields and sample-specific genotypes.
Creates a SeqSymmetry object for each variant (location, REF, ALT, quality, genotype, etc.).
Extracts genotypes manually from the FORMAT fields.
Converts structural variants (insertions/deletions) to BAMSym objects, which use CIGAR notation (e.g., M for match, I for insertion, D for deletion) so that the alignment is displayed correctly.
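
The coordinate conversion and the CIGAR idea in step 4 can be sketched as follows. This is an illustration only, not how IGB builds its BAMSym objects: the class VariantCoordsSketch and its methods are hypothetical, the interval is assumed to be 0-based with an exclusive end, and only explicit base alleles are handled (symbolic alleles such as <DEL> are out of scope).

```java
// Hypothetical helper for illustration; not part of IGB.
class VariantCoordsSketch {

    // VCF POS is 1-based; a 0-based start is one less.
    static int zeroBasedStart(int vcfPos) {
        return vcfPos - 1;
    }

    // Exclusive end of the reference span covered by REF.
    static int zeroBasedEndExclusive(int vcfPos, String ref) {
        return vcfPos - 1 + ref.length();
    }

    // Builds a simple CIGAR-style string describing ALT relative to REF.
    // Shared leading bases become M, extra REF bases D, extra ALT bases I.
    static String simpleCigar(String ref, String alt) {
        if (ref.length() == alt.length()) {
            return ref.length() + "M"; // SNP or same-length substitution
        }
        int shared = 0;
        int max = Math.min(ref.length(), alt.length());
        while (shared < max && ref.charAt(shared) == alt.charAt(shared)) {
            shared++;
        }
        StringBuilder cigar = new StringBuilder();
        if (shared > 0) {
            cigar.append(shared).append('M');
        }
        int deleted = ref.length() - shared;
        if (deleted > 0) {
            cigar.append(deleted).append('D');
        }
        int inserted = alt.length() - shared;
        if (inserted > 0) {
            cigar.append(inserted).append('I');
        }
        return cigar.toString();
    }
}
```

For example, a record with POS=100, REF=ATG, ALT=A covers the 0-based interval [99, 102), and simpleCigar("ATG", "A") gives "1M2D" (one anchor base followed by a two-base deletion).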

5. Data visualisation in IGB

Variants are displayed as annotations on genomic tracks.
Numeric INFO fields (like depth, allele frequency) are plotted as graphs.
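
Before a numeric INFO field can be plotted, its value has to be pulled out of the INFO column as a number. A minimal sketch of that step, assuming the usual key=value;key=value layout of the INFO column; InfoFieldSketch and its methods are hypothetical names, and how IGB actually fills GraphIntervalSym is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.OptionalDouble;

// Hypothetical helper for illustration; not part of IGB.
class InfoFieldSketch {

    // Splits an INFO column such as "DP=14;AF=0.25,0.1;DB" into key/value pairs.
    // Flags without a value (DB) are stored with an empty string.
    static Map<String, String> parseInfo(String info) {
        Map<String, String> out = new LinkedHashMap<>();
        if (info == null || info.isEmpty() || info.equals(".")) {
            return out; // "." means no INFO data
        }
        for (String entry : info.split(";")) {
            if (entry.isEmpty()) {
                continue;
            }
            int eq = entry.indexOf('=');
            if (eq < 0) {
                out.put(entry, "");
            } else {
                out.put(entry.substring(0, eq), entry.substring(eq + 1));
            }
        }
        return out;
    }

    // Returns the first value of a field as a double, or empty if the field
    // is absent, a flag, missing ("."), or not numeric.
    static OptionalDouble numericValue(String info, String key) {
        String raw = parseInfo(info).get(key);
        if (raw == null || raw.isEmpty()) {
            return OptionalDouble.empty();
        }
        int comma = raw.indexOf(',');
        String first = comma < 0 ? raw : raw.substring(0, comma);
        try {
            return OptionalDouble.of(Double.parseDouble(first));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }
}
```

Pairing each variant's 0-based position with numericValue(info, "DP") or numericValue(info, "AF") yields the (position, value) points that a graph track needs. Taking only the first value of a multi-value field is a simplification made for this sketch.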

saideepthi jagarapu (Inactive) added a comment - 13/Feb/25 11:41 AM

Issues with current VCF parsing logic in IGB:

1. Parsing approach

IGB goes through the file manually, line by line, and uses many regular expressions to split lines and extract metadata.
Extracting genotype data and fields takes multiple loop iterations.

2. Data handling

Metadata and genotype information are stored in separate maps and lists instead of in a single object, such as VariantContext, that holds everything.
Accessing different parts of the data takes too many iterations.

3. Error handling

Error handling is weak; parsing may fail when data is missing or malformed.
The code relies on many manual checks.
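
As an illustration of the kind of check that guards against missing or malformed data, here is a small sketch that validates one variant line before it is parsed further. It is not taken from IGB; VcfLineCheckSketch and validate are hypothetical, and the rules checked (at least 8 tab-separated columns, a positive integer POS, a QUAL that is "." or a number) are only a subset of what a full parser would verify.

```java
import java.util.Optional;

// Hypothetical helper for illustration; not part of IGB.
class VcfLineCheckSketch {

    // Returns an error message for a malformed line, or empty if the basic
    // checks pass.
    static Optional<String> validate(String line) {
        // -1 keeps trailing empty columns so they are counted.
        String[] cols = line.split("\t", -1);
        if (cols.length < 8) {
            return Optional.of("expected at least 8 columns, found " + cols.length);
        }
        try {
            if (Integer.parseInt(cols[1]) < 1) {
                return Optional.of("POS must be >= 1: " + cols[1]);
            }
        } catch (NumberFormatException e) {
            return Optional.of("POS is not an integer: " + cols[1]);
        }
        String qual = cols[5];
        if (!qual.equals(".")) { // "." marks a missing QUAL
            try {
                Double.parseDouble(qual);
            } catch (NumberFormatException e) {
                return Optional.of("QUAL is not a number: " + qual);
            }
        }
        return Optional.empty();
    }
}
```

Collecting these messages per line, rather than failing on the first bad record, would let the loader skip or report problem lines instead of aborting the whole file.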

saideepthi jagarapu (Inactive) added a comment - 13/Feb/25 3:16 PM

Loading population-level VCF:

IGB loads the entire file and uses complex logic to build per-sample tracks and allele frequency graphs. Unlike IGV, it does not support region-based loading.

Loading individual-specific VCF:

IGB loads the entire file and uses simpler logic to handle single-sample data.
