Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Story Points:
      3
    • Sprint:
      Summer 6: 17 Aug - 28 Aug, Summer 7: 31 Aug - 11 Sep, Fall 3: Oct 12 - Oct 23

      Description

      Situation: Significant changes were made to how the bai sym loader works (IGBF-2329), along with experimental changes to how bin sizes are calculated and applied to chromosomes (IGBF-2221).

      Task: Need to review the code changes and refactor to simplify. For example, it looks like there are now two Java class files that contain the same code (related to IGBF-2329). Need to determine why there are two classes and whether one can be removed.

      Need to clean up unnecessary code related to comparing bin sizes to chromosomes (IGBF-2221). After significant testing we determined this could not be accurately implemented. Need to remove any code that refers to it.

      Need to double-check the code used to determine the average bin size. The average was originally determined by looping through the data twice. This was fixed in IGBF-2329, but we need to make sure that the now-unnecessary loop was actually removed.

            Activity

            pbadzuh Philip Badzuh (Inactive) added a comment -

            Some notes on investigating the organization of bai contents:

            • Chunks from different bins overlap, suggesting that the same read data applies to both bins (e.g. 1 and 3). This, in turn, suggests that reads span the bins, implying that 5th-level bin chunks include data from higher-level bins.
            • Based on the above, it seems that the first chunk holds the primary data for a bin and subsequent chunks come from higher-level bins. However, I don’t understand how a read can span distant bins, e.g. 1 and 3. This would imply a read length > 16,384 bp (the span of a single bin), which doesn’t make sense for Illumina, the sequencing method used in the alignment I analyzed.
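            For reference, the bin numbering described in the SAM specification can be sketched as below. This is a minimal illustration that follows the specification's reg2bin function; the class name BaiBinning is made up for this example and is not taken from IGB. Under this numbering, bins 1 through 8 are the 64 Mbp bins and the 16,384 bp bins start at 4681, and a read is assigned to the smallest bin that fully contains it, so a short read that crosses a 16,384 bp boundary is stored in a larger parent bin.

```java
// Sketch of the binning scheme used by BAI indexes, following the reg2bin
// function in the SAM specification. The class name is hypothetical.
final class BaiBinning {

    /** Returns the smallest bin that fully contains the zero-based, half-open interval [beg, end). */
    static int reg2bin(int beg, int end) {
        end--; // switch to the last base covered by the interval
        if (beg >> 14 == end >> 14) return 4681 + (beg >> 14); // level 5: 16,384 bp bins
        if (beg >> 17 == end >> 17) return 585 + (beg >> 17);  // level 4: 131,072 bp bins
        if (beg >> 20 == end >> 20) return 73 + (beg >> 20);   // level 3: 1,048,576 bp bins
        if (beg >> 23 == end >> 23) return 9 + (beg >> 23);    // level 2: 8,388,608 bp bins
        if (beg >> 26 == end >> 26) return 1 + (beg >> 26);    // level 1: 67,108,864 bp bins
        return 0;                                              // level 0: the whole reference
    }

    public static void main(String[] args) {
        // A 150 bp read inside the first 16,384 bp window stays in a level 5 bin.
        System.out.println(reg2bin(100, 250));     // 4681
        // A 150 bp read crossing position 16,384 moves up to a level 4 bin.
        System.out.println(reg2bin(16300, 16450)); // 585
    }
}
```

            If the index that was analyzed follows this numbering, a read does not need to be longer than 16,384 bp to appear outside a 5th-level bin; it only needs to cross a bin boundary.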
            pbadzuh Philip Badzuh (Inactive) added a comment -

            Please see my changes in the commit below:
            https://bitbucket.org/pbadzuh/igb_pbdev/commits/cf9c32ec1c2465dac6c03d6b9e46244a1dea2647?at=IGBF-2482

            These:

            • Make normalization based on the genome level rather than chromosome level mean bin size.
            • Remove unused code/imports.
            • Add/modify comments for clarification.

            I was unable to find a way to circumvent the use of a dummy bam file in order to access index information using htsjdk, so the dependency as modified by Charan has been maintained.
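            The genome-level normalization described in this comment could look roughly like the sketch below. It is an illustration only: the class GenomeLevelNormalization, its method names, and the map-of-arrays layout are assumptions made for this example and do not reflect the code in the commit. The mean is accumulated in a single pass over all chromosomes, and each bin size is then divided by that one genome-wide mean.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: normalize bin sizes by a genome-wide mean instead of
// a per-chromosome mean. Names and data layout are illustrative assumptions.
final class GenomeLevelNormalization {

    /** Mean bin size across every chromosome, computed in one pass over the data. */
    static double genomeMean(Map<String, long[]> binSizesByChromosome) {
        long sum = 0;
        long count = 0;
        for (long[] bins : binSizesByChromosome.values()) {
            for (long size : bins) {
                sum += size;
                count++;
            }
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }

    /** Divides every bin size by the genome-wide mean; returns zeros if the mean is zero. */
    static Map<String, double[]> normalize(Map<String, long[]> binSizesByChromosome) {
        double mean = genomeMean(binSizesByChromosome);
        Map<String, double[]> result = new LinkedHashMap<>();
        for (Map.Entry<String, long[]> entry : binSizesByChromosome.entrySet()) {
            long[] bins = entry.getValue();
            double[] scores = new double[bins.length];
            for (int i = 0; i < bins.length; i++) {
                scores[i] = mean == 0.0 ? 0.0 : bins[i] / mean;
            }
            result.put(entry.getKey(), scores);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, long[]> sizes = new LinkedHashMap<>();
        sizes.put("chr1", new long[] {100, 300});
        sizes.put("chr2", new long[] {200});
        // Genome-wide mean is 200, so chr1 scores are 0.5 and 1.5, chr2 is 1.0.
        System.out.println(normalize(sizes).get("chr1")[1]); // 1.5
    }
}
```

            Using one mean for the whole genome keeps scores comparable across chromosomes, whereas a per-chromosome mean would rescale each chromosome independently.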

            nfreese Nowlan Freese added a comment - edited

            After testing with additional files, I noticed odd discrepancies in the IGB output when compared to IndexCov and BioVizBase. When I compared the bedgraph output from IGB, IndexCov, and BioVizBase against the bam file itself, BioVizBase was the only tool whose output perfectly matched the bam data.

            I checked the output generated by IGB when calculating the scores using just the offsets, the chunk sizes, or the combined chunk sizes. None of these approaches appeared to match the BioVizBase output exactly. After examining the BioVizBase R code (estimateCoverage), it appears that the process of calculating the bin sizes is more complex than we initially thought.
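            As a rough illustration of what a chunk-size score can mean, the sketch below derives a size from the virtual file offsets stored in a BAI index. In the BAM format a virtual offset packs the compressed block address into the upper 48 bits and the offset within the uncompressed block into the lower 16 bits. The class ChunkSizeSketch and its methods are hypothetical, and treating the difference in block addresses as the chunk size is an assumption made for this example, not the calculation used by IGB, IndexCov, or BioVizBase.

```java
// Hypothetical sketch: estimate a bin's size from the virtual file offsets of
// its chunks. Only the bit layout of virtual offsets comes from the BAM format;
// the size estimate itself is an illustrative assumption.
final class ChunkSizeSketch {

    /** Address of the compressed block in the file (upper 48 bits of the virtual offset). */
    static long blockAddress(long virtualOffset) {
        return virtualOffset >>> 16;
    }

    /** Offset within the uncompressed block (lower 16 bits of the virtual offset). */
    static int blockOffset(long virtualOffset) {
        return (int) (virtualOffset & 0xFFFFL);
    }

    /** Compressed bytes between the block where a chunk begins and the block where it ends. */
    static long compressedSpan(long chunkBegin, long chunkEnd) {
        return blockAddress(chunkEnd) - blockAddress(chunkBegin);
    }

    /** Sums the compressed span of every {begin, end} chunk belonging to one bin. */
    static long binSize(long[][] chunks) {
        long total = 0;
        for (long[] chunk : chunks) {
            total += compressedSpan(chunk[0], chunk[1]);
        }
        return total;
    }

    public static void main(String[] args) {
        long begin = (1000L << 16) | 5;   // block at byte 1000, offset 5 within the block
        long end = (3500L << 16) | 120;   // block at byte 3500, offset 120 within the block
        System.out.println(compressedSpan(begin, end)); // 2500
    }
}
```

            One limitation of this estimate is that a chunk beginning and ending in the same compressed block has a span of zero even though it contains reads, which is one way an offset-based score can diverge from the bam data.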

            The next step is to debug the BioVizBase R code to fully understand how each bin size is calculated: https://github.com/jorainer/biovizBase/blob/master/R/coverage.R

            Note that debugging the coverage.R code also requires working with C code.

            ann.loraine Ann Loraine added a comment - edited

            Moving out of the current sprint. Also closing, as the majority of the index hacking work is done.
            Creating a new ticket for the next steps mentioned in the previous comment.


              People

              • Assignee:
                nfreese Nowlan Freese
              • Reporter:
                nfreese Nowlan Freese
              • Votes:
                0
              • Watchers:
                3

                Dates

                • Created:
                  Updated:
                  Resolved: