[IGBF-2443] Make first draft slide deck or technical document showing problem domain - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
2
Epic Link:
Push the Boundaries
Sprint:
Summer 2: 22 Jun - 3 Jul, Summer 3: 6 Jul - 17 Jul, Summer 4: 14 Jul - 28 Jul, Summer 5: 3 Aug - 14 Aug

Issue Links

blocks

IGBF-2354 Refactor Index Hacking Data Structure

Closed

Activity

Nowlan Freese added a comment - 12/Aug/20 3:13 PM - edited

I created an additional figure showing how the chunks in different bins overlap. I was also able to determine the offset values from the chunks. It appears that each bin's chunks overlap one another slightly. The offset value effectively takes the overlap into account so that the data are not double-counted in the overlapping section. The figure also makes clear that the first chunk is the most important one, and that the remaining chunks are smaller and overlap with several other bins.

I added an additional description of how the actual byte values can be calculated from the virtual file offset.
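
As a minimal sketch of that calculation: a BGZF virtual file offset is a 64-bit integer whose upper 48 bits give the byte offset of the compressed block within the file and whose lower 16 bits give the byte offset within that block once uncompressed. The function names below are hypothetical and are not taken from the document or from IGB.

```python
from typing import Tuple


def split_virtual_offset(voffset: int) -> Tuple[int, int]:
    """Split a 64-bit virtual file offset into (coffset, uoffset).

    coffset: byte offset of the start of the BGZF block in the compressed file.
    uoffset: byte offset inside that block after decompression.
    """
    coffset = voffset >> 16     # upper 48 bits
    uoffset = voffset & 0xFFFF  # lower 16 bits
    return coffset, uoffset


def make_virtual_offset(coffset: int, uoffset: int) -> int:
    """Inverse of split_virtual_offset (hypothetical helper for illustration)."""
    return (coffset << 16) | uoffset
```

For example, a virtual offset built from block offset 1000 and in-block offset 42 splits back into those same two values.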

Philip Badzuh (Inactive) added a comment - 11/Aug/20 11:36 AM - edited

I think the document did a good job of introducing the idea of index hacking and provided a nice overview of all the associated concepts.

I am a little confused, however, on how the bin 'value' is currently determined. "Instead of calculating the bin values from the offsets, use the first chunk_beg and chunk_end to calculate the relative size of the bin." Is there a reason that only the first chunk is used?

To me, it would make the most sense to sum the differences between each chunk's data start and data end. Also, it seems that the start/end integers would need to first be converted into their two-part bit constituents (48 | 16) to then determine the chunk's actual length/size of compressed data rather than using the raw integer values.

I may just be confused, however. Please let me know what you think.
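
The alternative suggested in this comment could be sketched roughly as follows. This is only an illustration of the idea, not the method used in the document: it assumes each chunk is a (chunk_beg, chunk_end) pair of virtual file offsets, ignores the 16-bit within-block part, and uses hypothetical function names.

```python
from typing import Iterable, Tuple


def chunk_compressed_span(chunk_beg: int, chunk_end: int) -> int:
    """Approximate compressed bytes spanned by one chunk.

    Only the upper 48 bits (the compressed block offsets) are compared,
    so the result is the distance between the starts of the two blocks.
    """
    return (chunk_end >> 16) - (chunk_beg >> 16)


def bin_compressed_size(chunks: Iterable[Tuple[int, int]]) -> int:
    """Sum the approximate compressed spans of every chunk in a bin."""
    return sum(chunk_compressed_span(beg, end) for beg, end in chunks)


# Two made-up chunks, written as (block offset << 16) | in-block offset.
example_chunks = [
    ((100 << 16) | 5, (300 << 16) | 10),
    ((300 << 16) | 10, (450 << 16) | 0),
]
total = bin_compressed_size(example_chunks)  # 200 + 150 = 350
```

Comparing raw 64-bit integers instead would inflate each difference by a factor of roughly 65,536, which is why the split into the two parts matters when an actual byte count is wanted.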

Nowlan Freese added a comment - 05/Aug/20 10:42 AM

Link to document: https://docs.google.com/document/d/1BymvLsAEJF3wmj7r6_wTRhSvoLfkpSmslAkQvBJDN_g/edit?usp=sharing

Nowlan Freese added a comment - 30/Jul/20 3:54 PM - edited

I have finished the first draft of the index hacking technical documentation. The goal of the documentation is to introduce the overall idea as well as our current understanding of how bai index files work.

I have moved this issue to needs 1st level review. I think it would be good for Philip Badzuh to do the initial review as he has not worked on this issue directly, but I have described it to him before.

To test this issue, read through the document Google Drive > IGB Project Documentation and Plans > Index Hacking. Make note of things that are unclear or need further explanation.

Ann Loraine added a comment - 13/Jul/20 10:53 AM

First paragraph:

Index Hacking
Summary
BAM files have an index (bai) that is used to speed up loading of bam files. The bai index contains a rough estimate of how much data is present in 16,384 base pair increments along the genome. Therefore the bai file can be used to create a rough coverage graph quickly. The goal of this document is to compile information regarding the composition of the bai index and how it can be used by IGB to create rough coverage graphs.
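
To illustrate how the 16,384 base pair increments map onto the genome, here is a minimal sketch based on the standard BAI binning scheme, in which the smallest bins are numbered 4681 to 37448 and each covers one 16,384 bp window. The function names and the bin_sizes input are hypothetical; how IGB actually derives a value per bin is not shown here.

```python
from typing import Dict, List, Tuple

WINDOW = 1 << 14                       # 16,384 bp per smallest bin
FIRST_LEAF_BIN = ((1 << 15) - 1) // 7  # 4681, first 16 kbp bin
LAST_LEAF_BIN = 37448                  # last 16 kbp bin


def leaf_bin_for_position(pos: int) -> int:
    """Return the 16 kbp bin number containing a 0-based position."""
    return FIRST_LEAF_BIN + (pos >> 14)


def leaf_bin_interval(bin_id: int) -> Tuple[int, int]:
    """Return the 0-based, half-open [start, end) interval of a 16 kbp bin."""
    idx = bin_id - FIRST_LEAF_BIN
    return idx * WINDOW, (idx + 1) * WINDOW


def rough_coverage(bin_sizes: Dict[int, int]) -> List[Tuple[int, int, int]]:
    """Turn {bin number: estimated size} into (start, end, size) rows.

    Bins above the 16 kbp level are skipped; the size values are
    whatever estimate the caller computed from the index.
    """
    rows = []
    for bin_id in sorted(bin_sizes):
        if FIRST_LEAF_BIN <= bin_id <= LAST_LEAF_BIN:
            start, end = leaf_bin_interval(bin_id)
            rows.append((start, end, bin_sizes[bin_id]))
    return rows
```

Each returned row can be drawn as one bar of a rough coverage graph, with the bar height taken from the estimated size.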

Nowlan Freese added a comment - 10/Jul/20 10:29 AM

The working slide deck and document are in Google Drive > IGB Project Documentation and Plans > Index Hacking

Currently creating a document using figures from the slides.

People

Assignee:

Nowlan Freese

Reporter:

Ann Loraine

Votes:

0

Watchers:

3

Dates

Created:

19/Jun/20 3:12 PM

Updated:

14/Aug/20 10:01 AM

Resolved:

14/Aug/20 10:01 AM