  1. IGB
  2. IGBF-4082

Investigate: How is IGB parsing and displaying VCF files?

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Task: Write a high-level overview of how IGB is parsing and displaying a VCF file. This will include naming key classes in IGB that are important for parsing and viewing the VCF file, as well as the overall flow.

      There are two types of VCF files we want to look at for this task: one with population-level information ("ALL.apol1.sample.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf" from the igvData.zip file) and one with individual-specific information ("NG1LRQNESI.hard-filtered.vcf.gz"). These files can be found on Google Drive in the VCF project folder: https://drive.google.com/drive/folders/1PTa3rcCp59BRUcd2mB8BfemK2d3Kn3n5?usp=drive_link

            Activity

            pkulzer Paige Kulzer (Inactive) created issue -
            pkulzer Paige Kulzer (Inactive) made changes -
            Field Original Value New Value
            Epic Link IGBF-3836 [ 23135 ]
            pkulzer Paige Kulzer (Inactive) made changes -
            Link This issue relates to IGBF-4083 [ IGBF-4083 ]
            ann.loraine Ann Loraine made changes -
            Sprint Spring 1 [ 210 ] Spring 1, Spring 2 [ 210, 211 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            sjagarap saideepthi jagarapu (Inactive) added a comment - - edited

            IGB VCF parsing flow

            Key classes used:
            The classes used for parsing and visualisation are all custom IGB classes; no ready-made VCF library classes are used.

            1. VCF

            • Responsible for reading, parsing, and storing variants from the VCF file.
            • Uses manual string splitting and regex-based parsing to extract metadata and variant information.

            2. SeqSymmetry

            • Stores standard variant metadata

            3. BAMSym

            • Represents insertions, deletions, and other structural changes.

            4. GraphIntervalSym

            • Stores numeric data from the VCF file.
            • Used for graphing info/format fields in IGB.

            No parsing library is used; all parsing is done by hand.

            VCF (parsing and processing)

            1. Loading the file:

            • The VCF file is opened as an InputStream for processing.

            2. Reading and extracting metadata:

            • The InputStream is wrapped in a BufferedReader so that the text can be read and processed line by line.

            3. Parsing metadata and header:

            • If a line starts with "##", the parser extracts the INFO, FILTER, and FORMAT metadata and stores it in maps (infoMap, filterMap, formatMap).
            • The header line (#CHROM POS ... FORMAT SAMPLE1 SAMPLE2 ...) is parsed manually and the sample names are stored in an array.
            • Many regex patterns are used to extract the metadata.
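
The header handling described in steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not IGB's actual VCF class: the class name VcfHeaderSketch, the META pattern, and the choice to store the raw metadata line as the map value are all assumptions. Only the map names (infoMap, filterMap, formatMap) come from the description above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not IGB's VCF class. All names are hypothetical. */
final class VcfHeaderSketch {
    // Matches lines such as ##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth">
    private static final Pattern META =
            Pattern.compile("^##(INFO|FILTER|FORMAT)=<ID=([^,>]+)[,>]");

    final Map<String, String> infoMap = new LinkedHashMap<>();
    final Map<String, String> filterMap = new LinkedHashMap<>();
    final Map<String, String> formatMap = new LinkedHashMap<>();
    String[] samples = new String[0];

    /** Consumes lines up to and including the #CHROM header line. */
    static VcfHeaderSketch read(BufferedReader reader) {
        VcfHeaderSketch header = new VcfHeaderSketch();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("##")) {
                    Matcher m = META.matcher(line);
                    if (m.find()) {
                        // Key: the declared ID; value: the raw metadata line (an assumption).
                        header.mapFor(m.group(1)).put(m.group(2), line);
                    }
                } else if (line.startsWith("#CHROM")) {
                    String[] cols = line.split("\t");
                    // Columns 0-8 are CHROM..FORMAT; sample names start at index 9.
                    if (cols.length > 9) {
                        header.samples = Arrays.copyOfRange(cols, 9, cols.length);
                    }
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return header;
    }

    private Map<String, String> mapFor(String kind) {
        switch (kind) {
            case "INFO": return infoMap;
            case "FILTER": return filterMap;
            default: return formatMap;
        }
    }
}
```

After read returns, the reader is positioned at the first data line, so the same reader can be handed on to whatever parses the variant records.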

            4. Parsing variant information:

            • Converts POS from 1-based (VCF) to 0-based (IGB).
            • Extracts the INFO fields and the sample-specific genotypes.
            • Creates a SeqSymmetry object for each variant (location, REF, ALT, quality, genotype, etc.).
            • Extracts genotypes manually from the FORMAT fields.
            • Converts structural variants (insertions/deletions) to BAMSym objects, which use CIGAR notation (e.g., M for match, I for insertion, D for deletion) so that the alignment is drawn correctly.
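
As a rough illustration of step 4, the sketch below parses one data line, converts POS to a 0-based start, pulls the GT value for each sample, and derives a simple CIGAR string for an indel. It rests on several assumptions: the class VcfRecordSketch and its fields are hypothetical, the interval is treated as 0-based and half-open, only the first ALT allele is used, and the CIGAR rule (shared leading bases as M, the length difference as I or D) is a simplification that may differ from how IGB builds its BAMSym objects. Symbolic alleles such as <DEL> are not handled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; not IGB's SeqSymmetry or BAMSym. All names are hypothetical. */
final class VcfRecordSketch {
    final String chrom;
    final int start;   // 0-based, inclusive
    final int end;     // 0-based, exclusive (assumed half-open interval)
    final String ref;
    final String alt;  // first ALT allele only
    final String cigar;
    final Map<String, String> genotypes; // sample name -> GT string

    private VcfRecordSketch(String chrom, int start, int end, String ref, String alt,
                            String cigar, Map<String, String> genotypes) {
        this.chrom = chrom;
        this.start = start;
        this.end = end;
        this.ref = ref;
        this.alt = alt;
        this.cigar = cigar;
        this.genotypes = genotypes;
    }

    static VcfRecordSketch parse(String line, String[] samples) {
        String[] f = line.split("\t");
        if (f.length < 8) {
            throw new IllegalArgumentException("expected at least 8 columns, found " + f.length);
        }
        int start = Integer.parseInt(f[1]) - 1; // VCF POS is 1-based
        String ref = f[3];
        String alt = f[4].split(",")[0];
        Map<String, String> genotypes = new LinkedHashMap<>();
        if (f.length > 9) {
            // FORMAT (column 8) names the colon-separated keys used in each sample column.
            int gtIndex = Arrays.asList(f[8].split(":")).indexOf("GT");
            for (int i = 0; i < samples.length && 9 + i < f.length; i++) {
                String[] values = f[9 + i].split(":");
                boolean present = gtIndex >= 0 && gtIndex < values.length;
                genotypes.put(samples[i], present ? values[gtIndex] : ".");
            }
        }
        return new VcfRecordSketch(f[0], start, start + ref.length(), ref, alt,
                cigarFor(ref, alt), genotypes);
    }

    /** Simplified rule: shared leading bases are M, the length difference is I or D. */
    static String cigarFor(String ref, String alt) {
        int shared = Math.min(ref.length(), alt.length());
        if (alt.length() > ref.length()) {
            return shared + "M" + (alt.length() - ref.length()) + "I";
        }
        if (ref.length() > alt.length()) {
            return shared + "M" + (ref.length() - alt.length()) + "D";
        }
        return shared + "M";
    }
}
```

For example, under these assumptions a record with POS 100, REF A, and ALT ATG gets start 99, end 100, and CIGAR 1M2I.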

            5. Data visualisation in IGB:

            • Variants are displayed as annotations on genomic tracks.
            • Numeric INFO fields (like depth, allele frequency) are plotted as graphs.
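
To make the graphing step more concrete, here is a small sketch of how a numeric INFO field could be turned into (position, value) points before being handed to a graph structure. It does not use GraphIntervalSym; the class InfoGraphSketch and both method names are hypothetical, and the sketch assumes that only the first value of a comma-separated field is plotted and that POS is well formed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalDouble;

/** Illustrative sketch only; not IGB's graphing code. All names are hypothetical. */
final class InfoGraphSketch {

    /** Returns the first numeric value of the given INFO key, or empty if absent or non-numeric. */
    static OptionalDouble numericInfo(String infoColumn, String key) {
        for (String entry : infoColumn.split(";")) {
            String[] kv = entry.split("=", 2);
            if (kv.length == 2 && kv[0].equals(key)) {
                try {
                    // Multi-valued fields such as AF=0.5,0.25 contribute only their first value.
                    return OptionalDouble.of(Double.parseDouble(kv[1].split(",", 2)[0]));
                } catch (NumberFormatException e) {
                    return OptionalDouble.empty();
                }
            }
        }
        return OptionalDouble.empty();
    }

    /** Builds {x, y} pairs: x is the 0-based start, y is the value of the INFO key. */
    static List<double[]> points(List<String> dataLines, String key) {
        List<double[]> pts = new ArrayList<>();
        for (String line : dataLines) {
            String[] f = line.split("\t");
            if (f.length < 8) {
                continue; // INFO is column 7; skip lines that are too short
            }
            OptionalDouble y = numericInfo(f[7], key);
            if (y.isPresent()) {
                pts.add(new double[] {Integer.parseInt(f[1]) - 1, y.getAsDouble()});
            }
        }
        return pts;
    }
}
```

Records that lack the requested key, or carry it as a flag without a value, simply contribute no point.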
            sjagarap saideepthi jagarapu (Inactive) added a comment -

            Issues with current VCF parsing logic in IGB:

            1. Parsing approach

            • IGB reads the file line by line and uses many regular expressions to split lines and extract metadata.
            • Extracting genotype data and fields takes multiple loop iterations.

            2. Data handling

            • Metadata and genotype information are stored in separate maps and lists instead of in a single object (like VariantContext) that holds everything.
            • Accessing different parts of the data takes too many iterations.

            3. Error handling

            • Error handling is weak; parsing may fail when data is missing or malformed.
            • It relies on many manual checks.
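
One way to picture the alternative hinted at in points 2 and 3 is a parser that keeps everything about a variant in one object and records malformed lines instead of failing. The sketch below is only an illustration of that idea under stated assumptions: it is neither IGB code nor htsjdk's VariantContext, and the names TolerantVcfSketch, Variant, and problems are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not IGB code and not htsjdk's VariantContext. */
final class TolerantVcfSketch {

    /** One object per variant, instead of separate maps and lists. */
    static final class Variant {
        final String chrom;
        final int start;                 // 0-based
        final Map<String, String> info;  // INFO key -> raw value ("" for flags)

        Variant(String chrom, int start, Map<String, String> info) {
            this.chrom = chrom;
            this.start = start;
            this.info = info;
        }
    }

    final List<Variant> variants = new ArrayList<>();
    final List<String> problems = new ArrayList<>(); // one message per rejected line

    static TolerantVcfSketch parse(List<String> lines) {
        TolerantVcfSketch out = new TolerantVcfSketch();
        int lineNo = 0;
        for (String line : lines) {
            lineNo++;
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // header and blank lines are not variants
            }
            try {
                out.variants.add(parseLine(line));
            } catch (IllegalArgumentException e) {
                // Record the problem and keep going rather than aborting the whole load.
                out.problems.add("line " + lineNo + ": " + e.getMessage());
            }
        }
        return out;
    }

    private static Variant parseLine(String line) {
        String[] f = line.split("\t");
        if (f.length < 8) {
            throw new IllegalArgumentException("expected at least 8 columns, found " + f.length);
        }
        int pos;
        try {
            pos = Integer.parseInt(f[1]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("POS is not an integer: " + f[1]);
        }
        Map<String, String> info = new LinkedHashMap<>();
        if (!f[7].equals(".")) {
            for (String entry : f[7].split(";")) {
                String[] kv = entry.split("=", 2);
                info.put(kv[0], kv.length == 2 ? kv[1] : "");
            }
        }
        return new Variant(f[0], pos - 1, info);
    }
}
```

With this shape, a caller can show the well-formed variants and report the contents of problems to the user, instead of relying on scattered manual checks.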
            sjagarap saideepthi jagarapu (Inactive) added a comment -

            Loading population-level VCF:

            • IGB loads the entire file, uses complex logic for per-sample tracks and allele frequency graphs, and does not support region-based loading, unlike IGV.

            Loading individual-specific VCF:

            • IGB loads the entire file and uses simple logic to handle the single-sample data.
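
For contrast with loading the whole file, the sketch below keeps only the records that overlap a requested region while streaming through the input. It is a simplified stand-in, not true region-based loading: it still scans every line and only limits what is held in memory, whereas an index-backed approach would avoid the scan. The class RegionFilterSketch and the method linesInRegion are hypothetical, the region is assumed to be 0-based and half-open, and POS is assumed to be well formed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; a streaming filter, not an index-backed region query. */
final class RegionFilterSketch {

    /** Keeps data lines on chrom whose REF span overlaps the 0-based half-open range [start, end). */
    static List<String> linesInRegion(BufferedReader reader, String chrom, int start, int end) {
        List<String> kept = new ArrayList<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip header and blank lines
                }
                // Only CHROM, POS, ID, and REF are needed; the rest stays in the last piece.
                String[] f = line.split("\t", 5);
                if (f.length < 5 || !f[0].equals(chrom)) {
                    continue;
                }
                int recStart = Integer.parseInt(f[1]) - 1; // VCF POS is 1-based
                int recEnd = recStart + f[3].length();
                if (recStart < end && recEnd > start) {
                    kept.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept;
    }
}
```

Lines outside the region are discarded as soon as they are read, so memory use follows the size of the region rather than the size of the file.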
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Assignee saideepthi jagarapu [ sjagarap ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Link This issue is blocked by IGBF-4106 [ IGBF-4106 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Assignee saideepthi jagarapu [ sjagarap ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
            sjagarap saideepthi jagarapu (Inactive) made changes -
            Resolution Done [ 10000 ]
            Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]

              People

              • Assignee:
                sjagarap saideepthi jagarapu (Inactive)
                Reporter:
                pkulzer Paige Kulzer (Inactive)