[IGBF-2482] Refactor Index Hacking code - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels: None
Story Points: 3
Epic Link: Push the Boundaries
Sprint: Summer 6: 17 Aug - 28 Aug, Summer 7: 31 Aug - 11 Sep, Fall 3: Oct 12 - Oct 23

Description

Situation: Significant changes were made to how the bai sym loader works (~~IGBF-2329~~), along with experimental changes to how bin sizes are calculated and applied to chromosomes (~~IGBF-2221~~).

Task: Review the code changes and refactor to simplify. For example, there now appear to be two Java class files that contain the same code (related to ~~IGBF-2329~~). Determine why there are two classes and whether one can be removed.

Clean up unnecessary code related to comparing bin sizes to chromosome sizes (~~IGBF-2221~~). After significant testing we determined that this comparison could not be implemented accurately, so any code that refers to it should be removed.

Double-check the code that determines the average bin size. The average was originally computed by looping through the data twice. This was fixed in ~~IGBF-2329~~, but we need to confirm that the now-unnecessary loop was actually removed.
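
As an illustration of the single-pass approach, here is a minimal sketch that accumulates the sum and count in the same loop that already visits each bin, so no second loop is needed. The class and method names are hypothetical and are not taken from the IGB code base.

```java
/** Hypothetical accumulator: tracks sum and count so the mean needs no second loop. */
final class BinSizeAccumulator {
    private long sum;
    private long count;

    /** Call once per bin, inside the loop that already reads the bin sizes. */
    void add(long binSize) {
        sum += binSize;
        count++;
    }

    /** Mean bin size seen so far, or 0.0 if no bins were added. */
    double mean() {
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```

When reviewing the refactored loader, the thing to look for is any remaining loop whose only purpose is to recompute a sum or count that the first loop could have tracked.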

Attachments

Issue Links

relates to

IGBF-2329 Create GraphSyms from bai file directly (Closed)

IGBF-2221 Compare bin sizes to chromosome sizes (Closed)

Activity

Philip Badzuh (Inactive) added a comment - 18/Aug/20 12:05 PM

Some notes on investigating the organization of bai contents:

Chunks from different bins overlap, suggesting that the same read data applies to both bins (e.g. 1 and 3). This, in turn, suggests that reads span the bins, implying that 5th level bin chunks include data from higher level bins.

Based on the above, it seems that the first chunk is the primary data for a bin and the subsequent chunks come from higher level bins. However, I don’t understand how a read can span distant bins, e.g. 1 and 3. This would imply a read length > 16,384 bp (the span of a single bin), which doesn’t make sense for Illumina, the sequencing method used for the alignment I analyzed.
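
For reference when reasoning about bin levels, here is a Java port of the `reg2bin` function from the SAM format specification, assuming the bai file uses the standard binning scheme (the class name `BaiBins` is hypothetical). It assigns an alignment to the smallest bin that fully contains it, so a short read that crosses a 16,384 bp boundary is stored in a larger bin. In this scheme the 16,384 bp bins are numbered 4681 to 37448, and bins 1 to 8 each span 64 Mbp, which may be relevant if the bin numbers above follow the same numbering.

```java
/** Hypothetical helper holding a port of reg2bin from the SAM specification. */
final class BaiBins {
    /**
     * Returns the bin for a zero-based, half-open interval [beg, end).
     * The smallest bins cover 2^14 = 16,384 bp; each level up is 8 times wider.
     */
    static int reg2bin(int beg, int end) {
        --end; // make the end coordinate inclusive
        if (beg >> 14 == end >> 14) return ((1 << 15) - 1) / 7 + (beg >> 14); // 16 kbp bins
        if (beg >> 17 == end >> 17) return ((1 << 12) - 1) / 7 + (beg >> 17); // 128 kbp bins
        if (beg >> 20 == end >> 20) return ((1 << 9) - 1) / 7 + (beg >> 20);  // 1 Mbp bins
        if (beg >> 23 == end >> 23) return ((1 << 6) - 1) / 7 + (beg >> 23);  // 8 Mbp bins
        if (beg >> 26 == end >> 26) return ((1 << 3) - 1) / 7 + (beg >> 26);  // 64 Mbp bins
        return 0; // the single 512 Mbp bin
    }
}
```

For example, a 100 bp read at positions 0 to 100 falls in bin 4681, while a 100 bp read at 16,300 to 16,400 crosses a 16,384 bp boundary and falls in bin 585, a 128 kbp bin.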

Philip Badzuh (Inactive) added a comment - 20/Aug/20 6:02 PM

Please see my changes in the commit below:
https://bitbucket.org/pbadzuh/igb_pbdev/commits/cf9c32ec1c2465dac6c03d6b9e46244a1dea2647?at=IGBF-2482

These changes:

Base normalization on the genome-level mean bin size rather than the chromosome-level mean.
Remove unused code/imports.
Add/modify comments for clarification.

I could not find a way to access index information through htsjdk without a dummy bam file, so the dependency as modified by Charan has been maintained.
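
To illustrate the difference between genome-level and chromosome-level normalization, here is a minimal sketch. It assumes bin sizes are available as one array per chromosome; the class name, method names, and data layout are hypothetical and do not reflect the code in the commit.

```java
/** Hypothetical sketch: scale every bin by one mean taken over the whole genome. */
final class GenomeLevelNormalizer {
    /** Mean over all bins on all chromosomes, or 0.0 if there are no bins. */
    static double genomeMean(double[][] binSizesByChromosome) {
        double sum = 0.0;
        long count = 0;
        for (double[] chromosome : binSizesByChromosome) {
            for (double size : chromosome) {
                sum += size;
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }

    /** Returns a new array with each bin divided by the genome-level mean. */
    static double[][] normalize(double[][] binSizesByChromosome) {
        double mean = genomeMean(binSizesByChromosome);
        double[][] scaled = new double[binSizesByChromosome.length][];
        for (int c = 0; c < binSizesByChromosome.length; c++) {
            double[] source = binSizesByChromosome[c];
            scaled[c] = new double[source.length];
            for (int i = 0; i < source.length; i++) {
                // Guard against division by zero when every bin is empty.
                scaled[c][i] = mean == 0.0 ? 0.0 : source[i] / mean;
            }
        }
        return scaled;
    }
}
```

With bins {2, 4} on one chromosome and {6} on another, the genome-level mean is 4, so the lone bin on the second chromosome scales to 1.5. A chromosome-level mean would scale it to 1.0 and hide the fact that this chromosome has larger bins than the rest of the genome.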

Nowlan Freese added a comment - 23/Sep/20 10:54 AM - edited

After testing with additional files I noticed that there are odd discrepancies in the output when compared to IndexCov and BioVizBase. When I compared the bedgraph output between IGB, IndexCov, and BioVizBase to the bam file itself, it was clear that BioVizBase produced the only output that perfectly matched the bam data.

I checked the output generated by IGB when calculating the scores using just the offsets, the chunk sizes, or the combined chunk sizes. It did not appear that any of these approaches matched the BioVizBase output exactly. After examining the BioVizBase R code (estimateCoverage), it appears the process of calculating the bin sizes is more complex than we initially thought.
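
For context on the offsets and chunk sizes mentioned above: per the SAM specification, each chunk in a bai file is a pair of 64-bit virtual file offsets, where the upper 48 bits give the byte address of a BGZF block in the compressed file and the lower 16 bits give a position within the uncompressed block. The sketch below shows one way a chunk size in compressed bytes could be estimated from those offsets. The names are hypothetical, and this is not necessarily the calculation used by IGB or by BioVizBase.

```java
/** Hypothetical helpers for decoding BGZF virtual file offsets. */
final class VirtualOffsets {
    /** Byte address of the BGZF block within the compressed file (upper 48 bits). */
    static long blockAddress(long virtualOffset) {
        return virtualOffset >>> 16;
    }

    /** Position within the uncompressed block (lower 16 bits). */
    static int blockOffset(long virtualOffset) {
        return (int) (virtualOffset & 0xFFFF);
    }

    /** Approximate chunk size: compressed bytes between the start and end blocks. */
    static long compressedSpan(long chunkBegin, long chunkEnd) {
        return blockAddress(chunkEnd) - blockAddress(chunkBegin);
    }
}
```

Because this estimate ignores the within-block offsets and depends on how well each block compresses, it is only a rough proxy for the number of reads in a chunk, which may be one source of the discrepancies described above.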

The next step is to debug the BioVizBase R code to fully understand how each bin size is calculated: https://github.com/jorainer/biovizBase/blob/master/R/coverage.R

Note that debugging the coverage.R code requires working with C code as well.

Ann Loraine added a comment - 14/Oct/20 5:03 PM - edited

Moving out of the current sprint. Also closing, as the majority of the index hacking work is done.
Creating a new ticket for the next steps mentioned in the previous comment.
