  1. IGB
  2. IGBF-4344

Investigate speed/memory issues when Color by/Filter is used

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 10.2.0
    • Labels:
      None

      Description

      Situation: When loading single-cell BAM files in IGB and then coloring and filtering the data, I have noticed that memory usage is quite high and IGB slows down.

      Task: Investigate why the high memory usage and slowdown occur.

      Important files: SamtagsFilter, SamtagsColor, FilterSymmetry

        Attachments

          Activity

          Karthik Raveendran added a comment (edited)

          Background:
          For both Color by and Filter for SAMtags, users select or enter a SAMtag, such as CB, CR, UB, or MI, and the corresponding values, e.g. TGTACTCAGA, that they want to see or color. At any one time, users can select only one SAMtag, but they can add multiple values of interest. When the user clicks OK, this information is packaged and handed to SAMtagsFilter or SAMtagsColor, which makes the corresponding changes in the track.

          Each read in the track is compared with the user values one by one, and the read is then either filtered or not, or colored or not. For SAMtagsFilter and SAMtagsColor, this conditional step happens in a particular function, and that function is called for every read alignment:

          • For SAMtagsFilter (extends SymmetryFilter), the function is filterSymmetry(BioSeq seq, SeqSymmetry sym), and
          • For SAMtagsColor (extends ColorProvider), the function is getColor(SeqSymmetry sym).
            Each time the function is called, the read alignment information is passed in.

          To begin the investigation, I suggest starting with the algorithms in these functions. For single-cell data, the number of reads in a given track can be very large, and processing each of them takes computing power. An earlier issue, in which filterSymmetry was called twice for each read, was fixed in an earlier ticket. Given the volume of data to be handled, even a small change in the efficiency of the algorithm could make a large difference.
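
          As an illustration of why the per-read check matters, here is a minimal sketch (not the actual SAMtagsFilter or SAMtagsColor code) contrasting a one-by-one comparison with a hashed lookup. The class TagValueMatcher and its methods are hypothetical, and plain strings stand in for the tag value that would be read from each alignment. Whether this step is the real bottleneck would still need to be confirmed by profiling.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the per-read check; names are not taken from IGB. */
final class TagValueMatcher {
    // User-entered values, hashed once when the user clicks OK.
    private final Set<String> wanted;

    TagValueMatcher(List<String> userValues) {
        this.wanted = new HashSet<>(userValues);
    }

    /** Called once per read: true if the read's tag value is one the user entered. */
    boolean matches(String readTagValue) {
        // A read that lacks the tag is represented here by null and never matches.
        return readTagValue != null && wanted.contains(readTagValue);
    }

    /** Baseline for comparison: scans every user value for every read. */
    static boolean matchesLinear(List<String> userValues, String readTagValue) {
        for (String value : userValues) {
            if (value.equals(readTagValue)) {
                return true;
            }
        }
        return false;
    }
}
```

          With many user values and many reads, the linear version does work proportional to (reads × values), while the hashed version does roughly constant work per read after a one-time setup.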

          Nowlan Freese added a comment (edited)

          Files needed are located in Google Drive.

          Steps to reproduce the issue:

          1. Load the data
            1. Load the hg38 human genome
            2. Go to chr16:28,931,601-28,939,711
            3. Add the 10k_PBMC_chr16.bam file
              1. Download the file from Google Drive
            4. Click Load Data
          2. Install the Monster Alignment Filter
            1. Go to Plug-ins, Launch App Manager
            2. Select the Monster Alignment Filter
            3. Click Install
            4. Close the App Manager
            5. Right-click the track label, select Filter...
            6. Click Add > Reads without monster gaps, set Max intron size to 4000, click OK
          3. Color reads by cluster
            1. Right-click the track label, select Color by...
            2. Select SAMtags > Edit Tags and Color
            3. Click Import... and use the file Graph-Based_10K_PBMC.csv and click Open
            4. Select SAMtag: CB
            5. Click Save and Apply > OK
            6. Wait for the colors to change
          4. Duplicate the BAM file track
            1. Right-click the track label, select Track Operations > Copy
          5. Color the copied track by cluster
            1. Right-click the new Copied track label, select Color by...
            2. Select SAMtags > Edit Tags and Color
            3. Click Import... and use the file Graph-Based_10K_PBMC.csv and click Open
              1. Note: the color values may already be set, it is not necessary to import the file again
            4. Select SAMtag: CB
            5. Click Save and Apply > OK
            6. Wait for the colors to change
          6. Filter the original BAM file by Cluster
            1. Right-click the original BAM track label, select Filter...
            2. Click Add > SAMtags
            3. Set SAMtag to CB
            4. Set the comparator to =
            5. For the value copy/paste the contents of 10k_PBMC_cluster10.txt
            6. Click OK
            7. The reads should all be brownish/red
          7. Filter the Copied BAM file by Cluster
            1. Right-click the Copied track label, select Filter...
            2. Click Add > SAMtags
            3. Set SAMtag to CB
            4. Set the comparator to =
            5. For the value copy/paste the contents of 10k_PBMC_cluster11.txt
            6. Click OK
            7. The reads should all be green
          8. Load data for a different gene
            1. Go to chr16:28,950,223-28,967,238
            2. Click Load Data
            3. Wait for the data to load and colors to change
              1. You may need to click Load Data again to get the data to load for the Copied track
          Udaya Chinta (Inactive) added a comment (edited)

          Pulled the latest commits in IGBF-4295 and created a new branch IGBF-4344 from IGBF-4295.

          Changes:
          Created a single HashMap that stores tag values and colors when the color file is loaded, and this HashMap is rebuilt only when a tag value or color value changes. This improves performance because it is no longer rebuilt every time the getColor method is called.
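
          The idea behind this change can be sketched as follows. This is an illustrative outline under stated assumptions, not the code from the commit: the class TagColorCache, its method names, and the hex-string color format are all hypothetical, and a plain string stands in for the tag value read from each alignment.

```java
import java.awt.Color;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the caching idea; names are not taken from IGB. */
final class TagColorCache {
    private final Map<String, Color> colorByTagValue = new HashMap<>();
    // Counter kept only to make the rebuild behavior observable in this sketch.
    private int rebuildCount = 0;

    /** Rebuilds the table; meant to run when the color file is loaded or a tag value/color is edited. */
    void rebuild(Map<String, String> tagValueToHex) {
        colorByTagValue.clear();
        for (Map.Entry<String, String> entry : tagValueToHex.entrySet()) {
            // Assumed color format: hex strings such as "#FF0000".
            colorByTagValue.put(entry.getKey(), Color.decode(entry.getValue()));
        }
        rebuildCount++;
    }

    /** Per-read lookup: a single hash probe, with no map construction. */
    Color getColor(String readTagValue, Color fallback) {
        return colorByTagValue.getOrDefault(readTagValue, fallback);
    }

    int rebuildCount() {
        return rebuildCount;
    }
}
```

          The cost of building the map is then paid once per edit of the color table rather than once per read alignment.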

          Commit Id: https://bitbucket.org/lorainelab_udaya/integrated-genome-browser/commits/572202179e1a05e50346f9c44c01f4eb33c5a3b4

          Cc: Karthik Raveendran, Dr. Nowlan Freese

          Nowlan Freese added a comment

          Tested Udaya's branch on Mac.

          Following the testing instructions above, the speed of loading data is vastly improved. IGB no longer hangs when loading data. With sequence and data loaded for the view chr16:28,799,070-29,019,552, the memory usage is equivalent to that of the same data loaded in the IGB 10.1.0 release (around 950 MB).

          For the PR, I believe Karthik Raveendran will include this commit on top of his most recent changes for IGBF-4295.

          Udaya Chinta (Inactive) added a comment

          Made changes to improve the performance of the SAMtools Tag Color Mapping table.

          Commit Id : https://bitbucket.org/lorainelab_udaya/integrated-genome-browser/commits/6af2c02b4df51b853ca1af9209a8ee4110449976

          Cc: Dr. Nowlan Freese, Karthik Raveendran

          Udaya Chinta (Inactive) added a comment

          Fixed the issue caused when samtag value is selected.

          Commit Id: https://bitbucket.org/lorainelab_udaya/integrated-genome-browser/commits/f9fee8c4876d5552deb9eef095a94bbd70e7ef1e

          Cc: Dr. Nowlan Freese, Karthik Raveendran

          Nowlan Freese added a comment

          Tested Udaya's branch on Mac. Working as expected.

          Karthik Raveendran - please add Udaya's commits to your 4295 branch.

          Nowlan Freese added a comment

          Udaya's commits are included within Karthik's pull request: https://bitbucket.org/lorainelab/integrated-genome-browser/pull-requests/1088

          Ann Loraine added a comment

          PR is merged.

          Nowlan Freese added a comment

          Closing ticket.


            People

            • Assignee: Udaya Chinta (Inactive)
            • Reporter: Nowlan Freese
            • Votes: 0
            • Watchers: 4
