  1. IGB
  2. IGBF-4083

Investigate: How is IGV parsing and displaying VCF files?

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Task: Write a high-level overview of how IGV is parsing and displaying a VCF file. This will include naming key classes in IGV that are important for parsing and viewing the VCF file, as well as the overall flow.

      There are two types of VCF files we want to look at for this task: one with population-level information ("ALL.apol1.sample.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf" from the igvData.zip file) and one with individual-specific information ("NG1LRQNESI.hard-filtered.vcf.gz"). These files can be found on Google Drive in the VCF project folder: https://drive.google.com/drive/folders/1PTa3rcCp59BRUcd2mB8BfemK2d3Kn3n5?usp=drive_link

        Attachments

          Issue Links

            Activity

            pkulzer Paige Kulzer (Inactive) created issue -
            pkulzer Paige Kulzer (Inactive) made changes -
            Field Original Value New Value
            Epic Link IGBF-3836 [ 23135 ]
            pkulzer Paige Kulzer (Inactive) made changes -
            Link This issue relates to IGBF-4082 [ IGBF-4082 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            ann.loraine Ann Loraine made changes -
            Sprint Spring 1 [ 210 ] Spring 1, Spring 2 [ 210, 211 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            sjagarap saideepthi jagarapu (Inactive) added a comment - - edited

            VCF parsing: IGV uses the htsjdk library to parse and structure VCF data.

            Key classes used in VCF parsing:

            1. Built-in htsjdk classes:

            • VCFFileReader - reads the VCF file
            • VCFCodec - htsjdk codec that reads and parses VCF files
              • extracts the INFO, FORMAT, and FILTER fields from the file
              • converts each VCF record into a VariantContext object
            • VariantContext
              • represents a single variant record, with details such as CHROM, POS, REF, ALT, and genotypes
              • stores the parsed data in a structured way
            • Genotype - represents the genotype information for one sample

            2. Custom IGV classes - built on top of the htsjdk classes to prepare the data for visualisation.

            • VCFVariant - wrapper around htsjdk's VariantContext
              • mostly formats data
              • calculates allele frequency and other metadata shown in the UI
            • VCFGenotype - formats genotype data
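
            To illustrate the wrapper idea, here is a minimal Python sketch (not IGV's Java code). The names Record and VariantView are hypothetical stand-ins for VariantContext and VCFVariant, and the fallback of counting alleles when INFO has no AF value is an assumption about one reasonable approach, not a description of what IGV does.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a parsed variant record (not htsjdk's VariantContext)."""
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    info: Dict[str, str]
    genotypes: Dict[str, str]  # sample name -> GT string such as "0/1"


class VariantView:
    """Hypothetical display wrapper, playing the role the ticket describes for VCFVariant."""

    def __init__(self, record: Record):
        self.record = record

    def allele_frequency(self) -> Optional[float]:
        # Prefer the AF value from INFO when present (first ALT allele only).
        if "AF" in self.record.info:
            return float(self.record.info["AF"].split(",")[0])
        # Assumed fallback: count non-reference alleles across called genotypes.
        called = alt = 0
        for gt in self.record.genotypes.values():
            for allele in gt.replace("|", "/").split("/"):
                if allele == ".":
                    continue  # skip no-calls
                called += 1
                if allele != "0":
                    alt += 1
        return alt / called if called else None
```

            The wrapper keeps the parsed record untouched and adds only display-oriented helpers, which is the split between parsing and formatting described above.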

            Classes for displaying and rendering:

            • VariantTrack - renders VCF variants as a track
            • VariantRenderer - handles the graphical representation of variants
            sjagarap saideepthi jagarapu (Inactive) added a comment - - edited

            Overall flow:

            1. Loading the VCF file:

            • IGV detects .vcf, .vcf.gz, and .bcf files and selects the appropriate parser based on the file extension, using the AbstractFeatureReader class from htsjdk.
            • For VCF files, it selects VCFCodec from htsjdk.
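
            As a rough picture of extension-based selection, here is a small Python sketch. The function name select_parser and the returned labels are hypothetical; the actual selection happens in IGV and htsjdk Java code and may consider more than the file name.

```python
def select_parser(path: str) -> str:
    """Pick a parser label from the file extension (labels are illustrative only)."""
    name = path.lower()
    if name.endswith(".bcf"):
        return "bcf"
    if name.endswith(".vcf.gz"):
        return "vcf-compressed"
    if name.endswith(".vcf"):
        return "vcf"
    raise ValueError(f"not a recognised variant file: {path}")
```

            With the two files named in the task, "NG1LRQNESI.hard-filtered.vcf.gz" would take the compressed VCF path and the ALL.apol1 sample file the plain VCF path.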

            2. Parsing the VCF file:

            • VCFCodec reads the header (the ## meta lines), which holds the INFO, FORMAT, and FILTER definitions.
            • The header and its metadata definitions are stored in a VCFHeader object.
            • Each variant line is converted into a structured VariantContext object holding the position, reference allele, and other fields.
            • Each sample's genotype is stored in a VCFGenotype object.
            • Genotypes are extracted with variantContext.getGenotype().
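
            The parsing step can be pictured with a simplified Python sketch. It is not htsjdk's VCFCodec: parse_header and parse_record are hypothetical helpers that handle only tab-separated text, skip validation, and return plain dictionaries instead of VCFHeader and VariantContext objects.

```python
import re
from typing import Dict, List, Tuple

# Matches meta lines such as ##INFO=<ID=AF,...> and captures the kind and the ID.
META = re.compile(r"##(INFO|FORMAT|FILTER)=<ID=([^,>]+)")


def parse_header(lines: List[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Collect INFO/FORMAT/FILTER IDs and the sample names from header lines."""
    defs: Dict[str, List[str]] = {"INFO": [], "FORMAT": [], "FILTER": []}
    samples: List[str] = []
    for line in lines:
        match = META.match(line)
        if match:
            defs[match.group(1)].append(match.group(2))
        elif line.startswith("#CHROM"):
            # Sample names follow the nine fixed columns.
            samples = line.rstrip("\n").split("\t")[9:]
    return defs, samples


def parse_record(line: str, samples: List[str]) -> dict:
    """Turn one data line into a structured record with per-sample genotype fields."""
    f = line.rstrip("\n").split("\t")
    info: Dict[str, str] = {}
    if f[7] != ".":
        for item in f[7].split(";"):
            key, _, value = item.partition("=")
            info[key] = value  # flag entries (no "=") get an empty value
    fmt_keys = f[8].split(":") if len(f) > 8 else []
    genotypes = {s: dict(zip(fmt_keys, col.split(":"))) for s, col in zip(samples, f[9:])}
    return {"chrom": f[0], "pos": int(f[1]), "ref": f[3],
            "alts": f[4].split(","), "info": info, "genotypes": genotypes}
```

            The per-sample dictionaries correspond loosely to what getGenotype(sampleName) returns for a sample in htsjdk.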

            3. Storing parsed data:

            • Each VariantContext (one row of the file) is wrapped in a VCFVariant object.
            • VCFVariant exposes values such as allele frequency (AF) and genotype calls (GT).
            • The VCFVariant objects are then held by VariantTrack for track-level storage.

            4. Rendering variants:

            • VariantTrack calls VariantRenderer to draw the variants.
            • Genotypes are color-coded (gray for 0/0, ...).
            • Hovering over a variant displays AF, DP, and GQ.
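
            A small Python sketch of the color-coding and hover text follows. Only "gray for 0/0" comes from the notes above; the other colors in COLORS, and the helper names classify_genotype and tooltip, are placeholders chosen for illustration and should be checked against VariantRenderer before relying on them.

```python
from typing import Dict


def classify_genotype(gt: str) -> str:
    """Classify a GT string such as "0/1" or "1|1" into a zygosity label."""
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return "no_call"
    if all(a == "0" for a in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_var"
    return "het"


# Only the hom_ref color is taken from the ticket; the rest are assumed placeholders.
COLORS: Dict[str, str] = {
    "hom_ref": "gray",
    "het": "blue",
    "hom_var": "cyan",
    "no_call": "white",
}


def tooltip(info: Dict[str, str], sample: Dict[str, str]) -> str:
    """Build hover text from AF (INFO field) and DP, GQ (per-sample fields)."""
    parts = []
    for key, source in (("AF", info), ("DP", sample), ("GQ", sample)):
        if key in source:
            parts.append(f"{key}={source[key]}")
    return ", ".join(parts)
```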
            sjagarap saideepthi jagarapu (Inactive) added a comment - - edited

            Loading a population-level VCF:

            • Variants are loaded in batches to limit RAM usage.
            • Indexed VCF files (.vcf.gz, .bcf) enable region-based loading: IGV preloads the variants in a specific genome region (a few thousand base pairs around the current view) without loading the entire file.
            • When the user zooms or scrolls, IGV loads the new region and discards previously loaded data.
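
            The region-based behaviour can be sketched in Python as a one-window cache. RegionCache, its query callback, and the default padding of 2000 bp are all assumptions made for illustration; in IGV the lookup would go through the file's index rather than a Python callable.

```python
from typing import Callable, List, Optional, Tuple

Variant = Tuple[str, int]  # (chromosome, position); a deliberately minimal record


class RegionCache:
    """Keep only the variants for one padded window around the current view."""

    def __init__(self, query: Callable[[str, int, int], List[Variant]], padding: int = 2000):
        self.query = query        # stands in for an indexed region lookup
        self.padding = padding    # extra bases loaded on each side of the view
        self.loaded: Optional[Tuple[str, int, int]] = None
        self.variants: List[Variant] = []

    def _covers(self, chrom: str, start: int, end: int) -> bool:
        if self.loaded is None:
            return False
        c, lo, hi = self.loaded
        return c == chrom and lo <= start and end <= hi

    def view(self, chrom: str, start: int, end: int) -> List[Variant]:
        """Return variants in the view, reloading only when it leaves the cached window."""
        if not self._covers(chrom, start, end):
            lo = max(1, start - self.padding)
            hi = end + self.padding
            # Replacing the list discards the previously loaded window.
            self.variants = self.query(chrom, lo, hi)
            self.loaded = (chrom, lo, hi)
        return [v for v in self.variants if start <= v[1] <= end]
```

            Small scrolls inside the padded window reuse the cached variants, while a jump outside it triggers one new query and drops the old data.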

            Loading an individual-specific VCF:

            • Loads all genotype data, since the file covers a single sample.
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Assignee saideepthi jagarapu [ sjagarap ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Assignee saideepthi jagarapu [ sjagarap ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Link This issue is blocked by IGBF-4106 [ IGBF-4106 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Resolution Done [ 10000 ]
            Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]

              People

              • Assignee:
                sjagarap saideepthi jagarapu (Inactive)
                Reporter:
                pkulzer Paige Kulzer (Inactive)
              • Votes:
                0
                Watchers:
                2

                Dates

                • Created:
                  Updated:
                  Resolved: