IGBF-4316

Use index files of VCF for faster loading time

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Story Points:
      4
    • Sprint:
      Fall 1, Spring 3, Spring 5, Spring 6, Spring 8, Summer 2, Summer 3, Summer 4, Summer 6, Fall 2, Fall 3, Fall 4, Fall 5

      Description

      Identified two related issues in IGB 10.2.0 involving the loading of VCF files, both of which are regressions from version 10.1.0. These issues negatively affect performance and functionality when working with large or indexed VCF datasets.

      1. Performance and Memory Issue When Loading VCF Files

      When loading larger VCF files (e.g., 1KG.chr22.anno.infocol.vcf.gz), IGB 10.2.0 exhibits significantly increased memory usage and may crash or freeze when navigating to a gene.

      In IGB 10.1.0, loading this file and accessing gene-level data works as expected.

      In IGB 10.2.0, it appears that the entire file is loaded into memory when "Load Data" is clicked, rather than just the genomic region currently in view (as was the behavior in 10.1.0).

      This results in performance degradation and potential out-of-memory errors, especially with large datasets.
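To illustrate the difference between holding a whole file and holding only one region, here is a minimal sketch in plain Java. `VcfRegionFilter` and its `filter` method are hypothetical names, not IGB or HTSJDK code. The sketch assumes uncompressed, tab-separated VCF text and matches on the POS column only, ignoring variant length. Without an index the reader still has to scan every line, but only in-region records are kept in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not IGB code: keeps only records inside one region. */
class VcfRegionFilter {

    /** Returns data lines whose CHROM matches and whose POS lies in [start, end] (1-based). */
    static List<String> filter(Reader source, String chrom, int start, int end) {
        List<String> hits = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and header lines
                }
                String[] fields = line.split("\t", 3); // CHROM, POS, rest of record
                if (fields.length < 2 || !fields[0].equals(chrom)) {
                    continue; // different chromosome or too few columns
                }
                int pos = Integer.parseInt(fields[1]);
                if (pos >= start && pos <= end) {
                    hits.add(line); // only in-region records are retained
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        String vcf = "##fileformat=VCFv4.2\n"
                + "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
                + "chr22\t150\t.\tA\tG\t.\t.\t.\n"
                + "chr22\t900\t.\tC\tT\t.\t.\t.\n"
                + "chr21\t150\t.\tG\tA\t.\t.\t.\n";
        // Print the records that fall inside chr22:100-200.
        System.out.println(filter(new StringReader(vcf), "chr22", 100, 200));
    }
}
```

An index (such as a Tabix .tbi) goes one step further: it lets the reader seek directly to the region rather than scanning the whole file.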

      2. Failure to Load Tabix-Indexed VCF Files

      When attempting to load a VCF file with an associated Tabix index (e.g., Genome in a Bottle VCF), IGB 10.2.0 throws the following error:

      ClassCastException: VCFSymLoaderTabix cannot be cast to QuickLoadSymLoader

      These same files load without issue in IGB 10.1.0.

      This appears to be a class loading or module registration issue introduced in the newer version, likely related to recent changes in VCF parsing logic.
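Whatever the root cause, a ClassCastException of this kind means that code expecting one loader type received an instance of a different type. The sketch below shows how an unchecked downcast fails and how an `instanceof` guard avoids it. `BaseLoader`, `TabixLoader`, `QuickLoadLoader`, and `LoaderCastDemo` are hypothetical stand-ins, and the sketch makes no claim about IGB's actual class hierarchy or about the fix that was applied.

```java
/** Hypothetical stand-ins; these are not IGB's real classes. */
class BaseLoader {
    String describe() { return "base"; }
}

class TabixLoader extends BaseLoader {
    @Override String describe() { return "tabix"; }
}

class QuickLoadLoader extends BaseLoader {
    @Override String describe() { return "quickload"; }
    String quickLoadOnly() { return "quickload-specific"; }
}

class LoaderCastDemo {

    /** Unchecked downcast: throws ClassCastException for any loader that is not a QuickLoadLoader. */
    static String unsafe(BaseLoader loader) {
        return ((QuickLoadLoader) loader).quickLoadOnly();
    }

    /** Guarded version: uses the subtype API only when the runtime type allows it. */
    static String safe(BaseLoader loader) {
        if (loader instanceof QuickLoadLoader) {
            return ((QuickLoadLoader) loader).quickLoadOnly();
        }
        return loader.describe(); // fall back to the shared base API
    }

    public static void main(String[] args) {
        BaseLoader loader = new TabixLoader();
        System.out.println(safe(loader));
        try {
            unsafe(loader);
        } catch (ClassCastException e) {
            System.out.println("cast failed: " + e.getMessage());
        }
    }
}
```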

            Activity

            sjagarap saideepthi jagarapu (Inactive) added a comment -

            Branch changes: https://bitbucket.org/lorainelab-deepthi/integrated-genome-browser/branch/IGBF-4106

            Testing guide:
            Please refer to Ticket IGBF-4219 for specific files to test and issues to address while testing.

            pkulzer Paige Kulzer added a comment - - edited

            Tested these changes on Mac by fetching Deepthi's branch and following the testing instructions I outlined in IGBF-4219.

            Issue #1 (Performance and Memory Issue When Loading VCF Files) - This appears to have been fixed! Loading this VCF file (1KG.chr22.anno.infocol.vcf.gz) no longer crashes IGB 10.2.0. I'm also seeing a new message in the Log noting that IGB didn't detect an index file to use when loading the file:

            INFO: Query method failed (likely no index file), falling back to iterator: Index not found for: file:/Users/paigekulzer/Documents/VCF%20PROJECT/1KG.chr22.anno.infocol.vcf.gz
            

            When I create a .tbi index for this VCF file, the time IGB takes to add the file and to load data is noticeably reduced. Both actions are nearly instantaneous, whereas without an index, adding this large VCF file takes about 1 minute and loading data takes about 20 seconds.

            Issue #2 (Failure to Load Tabix-Indexed VCF Files) - This appears to have been fixed! I can now load the Genome in a Bottle VCF file to IGB via URL as well as locally without any errors being thrown. I confirmed that I could Load Data and Load Sequence without issue, too.

            Issue #3 (Malformed VCF Header Error Handling) - I also tested the "Malformed VCF Header issue" that was outlined in IGBF-4219 by adding the Smoke Testing QL to IGB and attempting to add those VCF files. That issue of repeated error messages has been fixed with a minor caveat that I wanted to note here.

            The error now appears just once to warn the user that the VCF header is not formatted correctly which is behaving as expected. However, if I unload then load the file back into IGB, the error does not appear as a pop-up again. This seems a bit odd, but the error still appears in the Log which I think is good enough for debugging!

            As for the error message itself, it mentions that, "More information about what went wrong may be available in the Console." However, IGB now uses the terminology "Log" instead of "Console". I will make a ticket for changing all user-facing instances of Console to Log for consistency.

            Passing this ticket back to Nowlan Freese for a code review. saideepthi jagarapu, make sure to rebase your changes before submitting a PR!

            sjagarap saideepthi jagarapu (Inactive) added a comment - - edited

            Thanks for testing this Paige Kulzer

            INFO: Query method failed (likely no index file), falling back to iterator: Index not found for: file:/Users/paigekulzer/Documents/VCF%20PROJECT/1KG.chr22.anno.infocol.vcf.gz
            This is just a log message I added for the case where no index is available; it is not a blocking error. I can remove that line and it should be fine.

            Regarding Issue #3,
            To remove the duplicated pop-ups I used a HashMap, which is why unloading and reloading the file does not show the pop-up again. I will look into this and think of a better solution.
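One possible shape for such a solution, as a sketch only: track which files have already triggered a pop-up, and forget a file when it is unloaded so that the next load warns again. `HeaderWarningTracker` and its methods are hypothetical names, not IGB code, and the sketch uses a concurrent Set in place of the HashMap mentioned above.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper, not IGB code: shows each file's header warning once per load. */
class HeaderWarningTracker {

    // Files that have already triggered a pop-up since they were last loaded.
    private final Set<String> warned = ConcurrentHashMap.newKeySet();

    /** Returns true only the first time a file is reported since its last load. */
    boolean shouldShowPopup(String fileUri) {
        return warned.add(fileUri); // Set.add returns false if the entry is already present
    }

    /** Call when the file is unloaded so that the next load can warn again. */
    void onUnload(String fileUri) {
        warned.remove(fileUri);
    }

    public static void main(String[] args) {
        HeaderWarningTracker tracker = new HeaderWarningTracker();
        String uri = "file:/tmp/example.vcf"; // illustrative path only
        System.out.println(tracker.shouldShowPopup(uri)); // first report after load
        System.out.println(tracker.shouldShowPopup(uri)); // repeated report, same load
        tracker.onUnload(uri);
        System.out.println(tracker.shouldShowPopup(uri)); // first report after reload
    }
}
```

Keying the set by file URI keeps repeated parse errors from one load down to a single pop-up, while the `onUnload` hook restores the warning for the next load.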

            nfreese Nowlan Freese added a comment - - edited

            Testing on Mac:
            I ran into an error when trying to load a file that was NAME.vcf.gz and an index that was NAME.vcf.gz.tbi. When I switched the index name to NAME.vcf.tbi it worked fine. I think we should also let NAME.vcf.gz.tbi work, however, as I think that is the default output when tabix indexing a gzipped file.

            Two code changes I was curious about:
            UnindexedSymLoader.java
            // strategyList.add(LoadStrategy.CHROMOSOME);

            VCF.java
            line 415 return null;

            sjagarap saideepthi jagarapu (Inactive) added a comment -

            Code changes:

            UnindexedSymLoader.java
            // strategyList.add(LoadStrategy.CHROMOSOME);
            I just removed the commented-out line in the PR.

            VCF.java
            line 415 return null;
            Since we have removed the LineProcessor logic, the method now just returns null instead of creating a LineProcessor.

            @Override
            protected LineProcessor createLineProcessor(String featureName) {
                // before: return this;
                return null; // after
            }

            attention Nowlan Freese

            sjagarap saideepthi jagarapu (Inactive) added a comment -

            Testing on Mac:
            I ran into an error when trying to load a file that was NAME.vcf.gz and an index that was NAME.vcf.gz.tbi. When I switched the index name to NAME.vcf.tbi it worked fine. I think we should also let NAME.vcf.gz.tbi work, however, as I think that is the default output when tabix indexing a gzipped file.

            Retested this with Nowlan Freese and found it to be working fine, so no changes were made for this bug.
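For reference, a lookup that accepts both index naming conventions discussed above could look like the following sketch. `TabixIndexLocator` and its methods are hypothetical and do not describe how IGB or HTSJDK actually resolves index files. The sketch checks NAME.vcf.gz.tbi first, since that is what tabix writes by default for a gzipped file, and then NAME.vcf.tbi.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper, not IGB code: finds a Tabix index next to a VCF file. */
class TabixIndexLocator {

    /** Candidate index paths in priority order: NAME.vcf.gz.tbi, then NAME.vcf.tbi. */
    static List<Path> candidates(Path vcf) {
        String name = vcf.getFileName().toString();
        List<Path> result = new ArrayList<>();
        result.add(vcf.resolveSibling(name + ".tbi")); // tabix default: append .tbi
        if (name.endsWith(".gz")) {
            String stem = name.substring(0, name.length() - ".gz".length());
            result.add(vcf.resolveSibling(stem + ".tbi")); // alternative: drop .gz first
        }
        return result;
    }

    /** Returns the first candidate that exists as a regular file, if any. */
    static Optional<Path> findIndex(Path vcf) {
        for (Path candidate : candidates(vcf)) {
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Path vcf = Paths.get("data", "NAME.vcf.gz"); // illustrative path only
        System.out.println(candidates(vcf));
        System.out.println(findIndex(vcf).isPresent());
    }
}
```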

            sjagarap saideepthi jagarapu (Inactive) added a comment -

            All existing issues were resolved, and most of the bugs raised by Nowlan Freese and Paige Kulzer during thorough testing have been fixed. The new VCF implementation now looks good.

            pkulzer Paige Kulzer added a comment -

            Retested the scenarios from my above comment - all is still working well. I also confirmed that there is no error popping up when adding a VCF file with a NAME.vcf.gz.tbi index file. All code changes have been squashed into one commit. Ready for PR!

            sjagarap saideepthi jagarapu (Inactive) added a comment -

            Thanks for testing, Paige Kulzer!

            PR: https://bitbucket.org/lorainelab/integrated-genome-browser/pull-requests/1082

            nfreese Nowlan Freese added a comment - - edited

            saideepthi jagarapu - two requests:
            1) Can you please add back the following line to UnindexedSymLoader.java:

            // strategyList.add(LoadStrategy.CHROMOSOME);

            It's a minor thing but I would like to leave it in as it is out of scope for the VCF work.

            2) Can you change your commit message to something like:

            Update VCF parsing logic to use HTSJDK

            sjagarap saideepthi jagarapu (Inactive) added a comment - - edited

            Sure, Nowlan Freese! Done with the final requested changes: https://bitbucket.org/lorainelab/integrated-genome-browser/pull-requests/1082

            ann.loraine Ann Loraine added a comment -

            Branch is merged into main.

            ann.loraine Ann Loraine added a comment -

            New installers are built and deployed to early access section of bioviz.org website.
            Ready for final testing.

            pkulzer Paige Kulzer added a comment -

            Downloaded the early release version of IGB from bioviz.org and tested on my Mac. VCF files with space-separated headings are being handled with appropriate warning messages. VCF files with an index in the same directory are loading much faster than they do without an index. VCF files can be added via URL or locally and seem to be working consistently.

            Closing ticket!


              People

              • Assignee:
                sjagarap saideepthi jagarapu (Inactive)
                Reporter:
                sjagarap saideepthi jagarapu (Inactive)
              • Votes:
                0
                Watchers:
                4
