Uploaded image for project: 'IGB'
  1. IGB
  2. IGBF-2354

Refactor Index Hacking Data Structure

    Details

    • Type: Task
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Story Points:
      3
    • Sprint:
      Spring 8 : 24 Apr to 8 May, Spring 8 : 11 May to 25 May, Spring 9 : 25 May to 8 Jun, Summer 1: 8 Jun - 19 Jun, Summer 2: 22 Jun - 3 Jul, Summer 3: 6 Jul - 17 Jul

      Description

      Situation: Currently, the index hacking data structure is a StringBuilder object (called <output>). As the file is looped through, a new row of data (chromosome, start, stop, value) are appended to the StringBuilder object.

      Problem: Each data point in the fourth column (value) needs to be divided by the average value across the entire file. To determine the average, a second loop through the entire file is needed before building the StringBuilder object. This doubles the amount of time to display the file to the user.

      Task: Refactor the code to avoid using a StringBuilder object to construct the data row by row. Instead, create an Array/ArrayList for each column of data (chromosome, start, stop, value). Loop through the file appending to each Array. Then determine the average of the value Array, and divide the entire value Array contents by the average value. Then use the four Arrays to create the Sym object.

      Note This is an example of how this task could be accomplished. Other data structures may be more efficient. The main objective is to only loop through the file a single time.

      For example:

      //How file is currently constructed row by row using StringBuilder
      chr1 0 16384 14567
      chr1 16384 32768 10576

      //How file should be constructed using Array/ArrayList for each column (chromosome, start, stop, value)
      [chr1, chr1, ...]
      [0, 16384, ...]
      [16384, 32768, ...]
      [14567, 10576, ...] length = 2, average = (14567 + 10576)/length

      //Divide the 4th Array values by the average
      [14567/average, 10576/average, ...]

        Attachments

          Issue Links

            Activity

            Hide
            pooja.nikhare Pooja Nikhare (Inactive) added a comment -

            Please review my branch: https://bitbucket.org/pnikhare/integrated-genome-browser/branch/IGBF-2354#diff

            It includes the changes of IGBF-2329.
            Overview:
            In loop storing the intermediate data (id, chromosome, start, stop, value) into an array of class object.
            Storing the id to mean map
            Iterate over the each class object, recalculate the chunk length using id to mean map, and form the String builder object.

            Show
            pooja.nikhare Pooja Nikhare (Inactive) added a comment - Please review my branch: https://bitbucket.org/pnikhare/integrated-genome-browser/branch/IGBF-2354#diff It includes the changes of IGBF-2329 . Overview: In loop storing the intermediate data (id, chromosome, start, stop, value) into an array of class object. Storing the id to mean map Iterate over the each class object, recalculate the chunk length using id to mean map, and form the String builder object.
            Hide
            ann.loraine Ann Loraine added a comment -

            Please proceed with PR at your convenience Pooja Nikhare.

            Show
            ann.loraine Ann Loraine added a comment - Please proceed with PR at your convenience Pooja Nikhare .
            Hide
            pooja.nikhare Pooja Nikhare (Inactive) added a comment -
            • Previous code
              While Loading the file: Output was stored in String builder - Converted into Input stream - Stored as (SeqId, File) where the file contains (Chrom, start, end, length)
              Loading the chr data for seqId: File was parsed to wiggle Format data and then created Syms.
            • Change Log for new code
              While loading the bai File: Output is stored in the Wiggle Format data
              Loading the chr data for seqId: Wiggle Format data is directly parsed for creating syms.

            Please review changes: https://bitbucket.org/pnikhare/integrated-genome-browser/branch/IGBF-2354#diff

            Show
            pooja.nikhare Pooja Nikhare (Inactive) added a comment - Previous code While Loading the file: Output was stored in String builder - Converted into Input stream - Stored as (SeqId, File) where the file contains (Chrom, start, end, length) Loading the chr data for seqId: File was parsed to wiggle Format data and then created Syms. Change Log for new code While loading the bai File: Output is stored in the Wiggle Format data Loading the chr data for seqId: Wiggle Format data is directly parsed for creating syms. Please review changes: https://bitbucket.org/pnikhare/integrated-genome-browser/branch/IGBF-2354#diff
            Hide
            ann.loraine Ann Loraine added a comment -

            I am a bit nervous about modifying Wiggle.java in the genometry library because this will require a lot of testing to ensure that the non-bai wiggle files will still load properly.

            Aslo, that file is already a bit large and complicated.

            Instead, could you create a new SymLoader class within the bai package that will use your new logic?

            This would be a kind of second draft of BaiSymLoader and should co-exist alongside it in the package.

            To use it, we will simply change the line in BaiFileHandler.java which decides which SymLoader to create when the user tries to open a bai file.

            In addition, we ought to change the type of "baiSymLoader" variable:

            This line:

            BaiSymLoader baiSymLoader = new BaiSymLoader(uri, indexUri, featureName, genomeVersion);

            could become:

            SymLoader symLoader = new BaiSymLoader2(uri, indexUri, featureName, genomeVersion);

            Note: We have a bit of a "code smell" here in that the class name is "BaiSymLoader2" - you should pick a better, more descriptive name

            Note the proposed type of symLoader variable above. It is declared "SymLoader", the type of the parent class.

            If I recall correctly, the IGB data loading infrastructure is actually expecting a more general type than "BaiSymLoader". I don't think we need to be so specific about the type of the returned SymLoader object. Which means: you should be able to substitute your new class by changing this one line of code as in the above.

            There's a lot here, so I wonder if we want to talk about it next week? It would be fine it you want to wait until then!

            Show
            ann.loraine Ann Loraine added a comment - I am a bit nervous about modifying Wiggle.java in the genometry library because this will require a lot of testing to ensure that the non-bai wiggle files will still load properly. Aslo, that file is already a bit large and complicated. Instead, could you create a new SymLoader class within the bai package that will use your new logic? This would be a kind of second draft of BaiSymLoader and should co-exist alongside it in the package. To use it, we will simply change the line in BaiFileHandler.java which decides which SymLoader to create when the user tries to open a bai file. In addition, we ought to change the type of "baiSymLoader" variable: This line: BaiSymLoader baiSymLoader = new BaiSymLoader(uri, indexUri, featureName, genomeVersion); could become: SymLoader symLoader = new BaiSymLoader2(uri, indexUri, featureName, genomeVersion); Note: We have a bit of a "code smell" here in that the class name is "BaiSymLoader2" - you should pick a better, more descriptive name Note the proposed type of symLoader variable above. It is declared "SymLoader", the type of the parent class. If I recall correctly, the IGB data loading infrastructure is actually expecting a more general type than "BaiSymLoader". I don't think we need to be so specific about the type of the returned SymLoader object. Which means: you should be able to substitute your new class by changing this one line of code as in the above. There's a lot here, so I wonder if we want to talk about it next week? It would be fine it you want to wait until then!
            Hide
            pooja.nikhare Pooja Nikhare (Inactive) added a comment -

            Okay, professor. I will make a separate file. I will ensure that there is no dependency with wiggle file. We can have a short code walkthrough also if possible. As there are many changes.

            Show
            pooja.nikhare Pooja Nikhare (Inactive) added a comment - Okay, professor. I will make a separate file. I will ensure that there is no dependency with wiggle file. We can have a short code walkthrough also if possible. As there are many changes.
            Hide
            ann.loraine Ann Loraine added a comment -

            Yes, thank you, a walk-through would be great.
            Let's do it the next time you are on the schedule.

            Show
            ann.loraine Ann Loraine added a comment - Yes, thank you, a walk-through would be great. Let's do it the next time you are on the schedule.
            Hide
            pooja.nikhare Pooja Nikhare (Inactive) added a comment -
            Show
            pooja.nikhare Pooja Nikhare (Inactive) added a comment - Changed the type of Sym Loader to generic type. Two implementations BaiSymLoader and BaiFileSymLoader class co-exist in the package. Please review the changes on branch : https://bitbucket.org/pnikhare/integrated-genome-browser/branch/IGBF-2354#diff
            Hide
            ann.loraine Ann Loraine added a comment -

            Pooja Nikhare - please submit PR. (I hope all is going well for you!!)

            Show
            ann.loraine Ann Loraine added a comment - Pooja Nikhare - please submit PR. (I hope all is going well for you!!)
            Hide
            pooja.nikhare Pooja Nikhare (Inactive) added a comment -

            Hello professor, thanks for asking. Everything is going well.

            Submitted a PR.
            https://bitbucket.org/lorainelab/integrated-genome-browser/pull-requests/793/igbf-2354-refactor-index-hacking-data

            Show
            pooja.nikhare Pooja Nikhare (Inactive) added a comment - Hello professor, thanks for asking. Everything is going well. Submitted a PR. https://bitbucket.org/lorainelab/integrated-genome-browser/pull-requests/793/igbf-2354-refactor-index-hacking-data

              People

              • Assignee:
                nfreese Nowlan Freese
                Reporter:
                nfreese Nowlan Freese
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: