  1. IGB
  2. IGBF-4082

Investigate: How is IGB parsing and displaying VCF files?

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Task: Write a high-level overview of how IGB is parsing and displaying a VCF file. This will include naming key classes in IGB that are important for parsing and viewing the VCF file, as well as the overall flow.

      There are two types of VCF files we want to look at for this task: one with population-level information ("ALL.apol1.sample.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf" from the igvData.zip file) and one with individual-specific information ("NG1LRQNESI.hard-filtered.vcf.gz"). These files can be found on Google Drive in the VCF project folder: https://drive.google.com/drive/folders/1PTa3rcCp59BRUcd2mB8BfemK2d3Kn3n5?usp=drive_link

            Activity

            sjagarap saideepthi jagarapu (Inactive) added a comment -

            Loading population-level VCF:

            • IGB loads the entire file, uses complex logic to build per-sample tracks and allele frequency graphs, and, unlike IGV, does not support region-based loading.

            Loading individual-specific VCF:

            • IGB loads the entire file and uses simple logic to handle the single-sample data.
            sjagarap saideepthi jagarapu (Inactive) added a comment -

            Issues with current VCF parsing logic in IGB:

            1. Parsing approach

            • IGB manually reads the file line by line and relies on many regular expressions to split lines and extract metadata.
            • Extracting genotype data and fields takes multiple loop iterations.

            2. Data handling

            • Metadata and genotype information are stored in separate maps and lists instead of in a single object, such as VariantContext, that holds everything.
            • Accessing different parts of the data requires too many iterations.

            3. Error handling

            • Error handling is weak; parsing may fail when data is missing or malformed.
            • It relies on many manual checks.
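
            To make points 2 and 3 concrete, here is a minimal sketch of what a single per-variant object with explicit error reporting could look like. It is only an illustration: VariantRecordSketch, its fields, and its methods are hypothetical names, it does not reproduce the VariantContext API, and it is not how IGB currently stores variants. It assumes tab-separated columns in the standard VCF order and that trailing per-sample values may be omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.OptionalDouble;

/** Hypothetical single-record holder; inspired by the idea of VariantContext, not its API. */
class VariantRecordSketch {
    final String chrom;
    final int pos;              // 1-based, as written in the file
    final String ref;
    final String alt;
    final OptionalDouble qual;  // empty when QUAL is "."
    // sample name -> (FORMAT key -> value), filled in a single pass over the line
    final Map<String, Map<String, String>> formatBySample = new LinkedHashMap<>();

    private VariantRecordSketch(String chrom, int pos, String ref, String alt, OptionalDouble qual) {
        this.chrom = chrom;
        this.pos = pos;
        this.ref = ref;
        this.alt = alt;
        this.qual = qual;
    }

    /** Parses one data line; every failure names the offending line number. */
    static VariantRecordSketch parse(String line, String[] sampleNames, int lineNumber) {
        String[] cols = line.split("\t");
        int expected = sampleNames.length == 0 ? 8 : 9 + sampleNames.length;
        if (cols.length < expected) {
            throw new IllegalArgumentException("line " + lineNumber + ": expected "
                    + expected + " columns, found " + cols.length);
        }
        VariantRecordSketch rec = new VariantRecordSketch(
                cols[0], parsePos(cols[1], lineNumber), cols[3], cols[4],
                parseQual(cols[5], lineNumber));
        if (sampleNames.length > 0) {
            String[] keys = cols[8].split(":");
            for (int s = 0; s < sampleNames.length; s++) {
                String[] values = cols[9 + s].split(":");
                Map<String, String> fields = new LinkedHashMap<>();
                for (int k = 0; k < keys.length; k++) {
                    // Omitted trailing values are recorded as missing (".")
                    fields.put(keys[k], k < values.length ? values[k] : ".");
                }
                rec.formatBySample.put(sampleNames[s], fields);
            }
        }
        return rec;
    }

    /** Returns the GT value for a sample, or null if the sample is unknown. */
    String genotype(String sample) {
        Map<String, String> fields = formatBySample.get(sample);
        return fields == null ? null : fields.get("GT");
    }

    private static int parsePos(String raw, int lineNumber) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "line " + lineNumber + ": POS is not an integer: " + raw, e);
        }
    }

    private static OptionalDouble parseQual(String raw, int lineNumber) {
        if (raw.equals(".")) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(raw));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "line " + lineNumber + ": QUAL is not a number: " + raw, e);
        }
    }
}
```

            With one object per variant, a caller can ask for a sample's genotype in a single lookup, and a malformed line produces a message that points at the line rather than an unexplained failure.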
            sjagarap saideepthi jagarapu (Inactive) added a comment - - edited

            IGB VCF parsing flow

            Key classes used:
            The classes used for parsing and visualisation are all custom IGB classes; no built-in library classes are used.

            1. VCF

            • Responsible for reading, parsing, and storing variants from the VCF file.
            • Uses manual string splitting and regex-based parsing to extract metadata and variant information.

            2. SeqSymmetry

            • Stores standard variant metadata

            3. BAMSym

            • Represents insertions, deletions, and other structural changes.

            4. GraphIntervalSym

            • Stores numeric data from the VCF file.
            • Used for graphing INFO/FORMAT fields in IGB.

            No built-in libraries are used for parsing; it is all handled manually.

            VCF (parsing and processing)

            1. Loading the file:

            • The VCF file is opened as an InputStream for processing.

            2. Reading and extracting metadata:

            • The InputStream is wrapped in a BufferedReader so that the text can be read and processed line by line.

            3. Parsing metadata and header

            • If a line starts with "##", IGB extracts the INFO, FILTER, and FORMAT metadata and stores it in maps (infoMap, filterMap, formatMap).
            • The header line (#CHROM POS ... FORMAT SAMPLE1 SAMPLE2 ...) is split manually and the sample names are stored in an array.
            • Many regex patterns are used to extract the metadata.
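
            Steps 1 to 3 can be sketched roughly as follows. This is a simplified illustration, not IGB's actual VCF class: VcfHeaderSketch, its read method, and the regex are hypothetical, and only the map names (infoMap, filterMap, formatMap) come from the description above. Each map here simply stores the raw metadata line under its ID.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the header pass; not IGB's actual VCF class. */
class VcfHeaderSketch {
    // Captures the record kind and its ID, e.g. ##INFO=<ID=DP,Number=1,...>
    private static final Pattern META =
            Pattern.compile("^##(INFO|FILTER|FORMAT)=<ID=([^,>]+)");

    final Map<String, String> infoMap = new LinkedHashMap<>();
    final Map<String, String> filterMap = new LinkedHashMap<>();
    final Map<String, String> formatMap = new LinkedHashMap<>();
    String[] sampleNames = new String[0];

    /** Reads lines until the first one that does not start with '#'. */
    static VcfHeaderSketch read(InputStream in) {
        VcfHeaderSketch header = new VcfHeaderSketch();
        // Wrap the stream so the text can be consumed one line at a time
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null && line.startsWith("#")) {
                if (line.startsWith("##")) {
                    Matcher m = META.matcher(line);
                    if (m.find()) {
                        // Keep the raw metadata line, keyed by its ID
                        header.mapFor(m.group(1)).put(m.group(2), line);
                    }
                } else {
                    // Column header line: the first nine columns are fixed
                    // (CHROM..FORMAT), sample names start at index 9
                    String[] cols = line.split("\t");
                    if (cols.length > 9) {
                        header.sampleNames = Arrays.copyOfRange(cols, 9, cols.length);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return header;
    }

    private Map<String, String> mapFor(String kind) {
        switch (kind) {
            case "INFO":
                return infoMap;
            case "FILTER":
                return filterMap;
            default:
                return formatMap;
        }
    }
}
```

            Because the sketch closes the reader once the header ends, a real parser would keep reading the data lines from the same BufferedReader instead of returning early.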

            4. Parsing variant information

            • Converts POS from 1-based (VCF) to 0-based (IGB) coordinates.
            • Extracts information such as INFO fields and sample-specific genotypes.
            • Creates a SeqSymmetry object for each variant (location, REF, ALT, quality, genotype, etc.).
            • Extracts genotypes manually from the FORMAT fields.
            • Converts structural variants (insertions/deletions) to BAMSym objects, which use CIGAR notation (e.g., M for match, I for insertion, D for deletion) so that alignments are displayed correctly.
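
            The coordinate conversion, INFO extraction, and CIGAR idea from step 4 can be illustrated with a small sketch. All names here (VariantCoordsSketch, simpleCigar, and so on) are hypothetical, and the CIGAR rule is an assumption for illustration only: it covers just the simple case where one allele is a prefix of the other, which may differ from what IGB actually does when it builds a BAMSym.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helpers for step 4; names and rules are illustrative, not IGB's. */
class VariantCoordsSketch {

    /** VCF POS is 1-based; the 0-based start is one less. */
    static int toZeroBasedStart(int vcfPos) {
        return vcfPos - 1;
    }

    /** Exclusive 0-based end of the reference allele. */
    static int zeroBasedEnd(int vcfPos, String ref) {
        return vcfPos - 1 + ref.length();
    }

    /** Splits "DP=14;AF=0.5;DB" into key/value pairs; flags map to "true". */
    static Map<String, String> parseInfo(String infoColumn) {
        Map<String, String> info = new LinkedHashMap<>();
        if (infoColumn.equals(".")) {
            return info; // "." means no INFO entries
        }
        for (String entry : infoColumn.split(";")) {
            if (entry.isEmpty()) {
                continue;
            }
            int eq = entry.indexOf('=');
            if (eq < 0) {
                info.put(entry, "true");
            } else {
                info.put(entry.substring(0, eq), entry.substring(eq + 1));
            }
        }
        return info;
    }

    /**
     * Builds a CIGAR-like string for a single ALT allele, or returns null
     * when the case is outside this sketch (symbolic, multi-allelic, complex).
     */
    static String simpleCigar(String ref, String alt) {
        if (ref.isEmpty() || alt.isEmpty() || alt.equals(".")
                || alt.startsWith("<") || alt.contains(",")) {
            return null;
        }
        if (ref.length() == alt.length()) {
            return ref.length() + "M";                                       // substitution
        }
        if (alt.startsWith(ref)) {
            return ref.length() + "M" + (alt.length() - ref.length()) + "I"; // insertion
        }
        if (ref.startsWith(alt)) {
            return alt.length() + "M" + (ref.length() - alt.length()) + "D"; // deletion
        }
        return null;
    }
}
```

            For example, under these assumptions a record with POS 100, REF A, and ALT ATG starts at 0-based position 99 and is described as 1M2I.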

            5. Data visualisation in IGB

            • Variants are displayed as annotations on genomic tracks.
            • Numeric INFO fields (like depth, allele frequency) are plotted as graphs.
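
            As a rough illustration of how a numeric INFO field could be turned into graph data, the sketch below collects one (x, y) point per variant that carries the field. InfoGraphSketch and fromInfo are hypothetical names; this does not show GraphIntervalSym's real constructor or fields. Taking only the first value of a comma-separated entry is a simplification chosen for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical holder for graph points; not IGB's GraphIntervalSym. */
class InfoGraphSketch {
    final int[] x;    // 0-based variant start positions
    final float[] y;  // numeric value of the chosen INFO key

    private InfoGraphSketch(int[] x, float[] y) {
        this.x = x;
        this.y = y;
    }

    /** Keeps only the variants whose INFO map has a numeric value for key. */
    static InfoGraphSketch fromInfo(int[] starts, List<Map<String, String>> infos, String key) {
        if (starts.length != infos.size()) {
            throw new IllegalArgumentException("starts and infos must have the same length");
        }
        int[] xs = new int[starts.length];
        float[] ys = new float[starts.length];
        int n = 0;
        for (int i = 0; i < starts.length; i++) {
            String raw = infos.get(i).get(key);
            if (raw == null) {
                continue; // this variant does not carry the field
            }
            // Simplification: for multi-valued entries (e.g. AF=0.25,0.1) use the first value
            int comma = raw.indexOf(',');
            String first = comma < 0 ? raw : raw.substring(0, comma);
            try {
                float value = Float.parseFloat(first);
                xs[n] = starts[i];
                ys[n] = value;
                n++;
            } catch (NumberFormatException e) {
                // Missing (".") or non-numeric value: leave it out of the graph
            }
        }
        return new InfoGraphSketch(Arrays.copyOf(xs, n), Arrays.copyOf(ys, n));
    }
}
```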

              People

              • Assignee:
                sjagarap saideepthi jagarapu (Inactive)
                Reporter:
                pkulzer Paige Kulzer (Inactive)