  1. IGB
  2. IGBF-2443

Make first draft slide deck or technical document showing problem domain

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Story Points:
      2
    • Sprint:
      Summer 2: 22 Jun - 3 Jul, Summer 3: 6 Jul - 17 Jul, Summer 4: 14 Jul - 28 Jul, Summer 5: 3 Aug - 14 Aug

          Activity

          Nowlan Freese (nfreese) added a comment (edited)

          I created an additional figure showing how the chunks of different bins overlap. I was also able to determine the offset values from the chunks. It appears that each bin's chunks overlap each other slightly. The offset value effectively takes the overlap into account so that the data are not double-counted in the overlapping section. The figure also makes clear that the first chunk is the most important one, and that the remaining chunks are smaller and overlap with several other bins.

          I added an additional description of how the actual byte values can be calculated from the virtual file offset.
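
          As a minimal sketch of that calculation, assuming the virtual file offset layout given in the SAM/BAM specification (upper 48 bits hold the byte offset of a BGZF block in the compressed file, lower 16 bits hold the byte offset inside that block once decompressed). The function names are hypothetical and are not taken from the document or from IGB.

```python
from typing import Tuple


def split_virtual_offset(voffset: int) -> Tuple[int, int]:
    """Split a 64-bit BGZF virtual file offset into (coffset, uoffset)."""
    coffset = voffset >> 16      # upper 48 bits: byte offset of the BGZF block in the .bam file
    uoffset = voffset & 0xFFFF   # lower 16 bits: byte offset inside the decompressed block
    return coffset, uoffset


def make_virtual_offset(coffset: int, uoffset: int) -> int:
    """Inverse of split_virtual_offset, useful for checking the arithmetic."""
    return (coffset << 16) | uoffset


voffset = make_virtual_offset(1000, 42)
print(split_virtual_offset(voffset))  # (1000, 42)
```

          Only the first value (coffset) is a position in the file on disk; the second value is meaningful only after the block has been decompressed.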

          Philip Badzuh (pbadzuh, Inactive) added a comment (edited)

          I think the document did a good job of introducing the idea of index hacking and provided a nice overview of all the associated concepts.

          I am a little confused, however, on how the bin 'value' is currently determined. "Instead of calculating the bin values from the offsets, use the first chunk_beg and chunk_end to calculate the relative size of the bin." Is there a reason that only the first chunk is used?

          To me, it would make the most sense to sum the differences between each chunk's data start and data end. Also, it seems that the start/end integers would first need to be split into their two bit fields (upper 48 bits | lower 16 bits) to determine each chunk's actual compressed size, rather than using the raw integer values.
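
          A minimal sketch of this suggestion, assuming chunks are available as (chunk_beg, chunk_end) pairs of virtual file offsets as laid out in the SAM/BAM specification. The function names are hypothetical, and this is not the method currently used in the document.

```python
from typing import Iterable, Tuple


def compressed_span(chunk_beg: int, chunk_end: int) -> int:
    """Approximate compressed size of one chunk, in bytes of the .bam file.

    Only the upper 48 bits (the BGZF block start) are compared, so a chunk
    that begins and ends inside the same block contributes 0.
    """
    return (chunk_end >> 16) - (chunk_beg >> 16)


def bin_compressed_size(chunks: Iterable[Tuple[int, int]]) -> int:
    """Sum the approximate compressed sizes of all chunks in one bin."""
    return sum(compressed_span(beg, end) for beg, end in chunks)


# Two made-up chunks: compressed blocks 100..300 and 300..450.
chunks = [((100 << 16) | 5, (300 << 16) | 10),
          ((300 << 16) | 10, (450 << 16) | 0)]
print(bin_compressed_size(chunks))   # 350
print(chunks[0][1] - chunks[0][0])   # 13107205, the raw integer difference
```

          The last line shows why the raw integers are misleading: subtracting them directly mixes the two fields and inflates the result by a factor of about 65,536.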

          I may just be confused, however. Please let me know what you think.

          Nowlan Freese (nfreese) added a comment

          Link to document: https://docs.google.com/document/d/1BymvLsAEJF3wmj7r6_wTRhSvoLfkpSmslAkQvBJDN_g/edit?usp=sharing
          Nowlan Freese (nfreese) added a comment (edited)

          I have finished the first draft of the index hacking technical documentation. The goal of the documentation is to introduce the overall idea as well as our current understanding of how bai index files work.

          I have moved this issue to needs 1st level review. I think it would be good for Philip Badzuh to do the initial review as he has not worked on this issue directly, but I have described it to him before.

          To test this issue, read through the document in Google Drive > IGB Project Documentation and Plans > Index Hacking. Make note of anything that is unclear or needs further explanation.

          Ann Loraine (ann.loraine) added a comment

          First paragraph:

          Index Hacking
          Summary
          BAM files have an index (bai) that is used to speed up loading of BAM files. The bai index contains a rough estimate of how much data is present in 16,384 base pair increments along the genome. Therefore, the bai file can be used to create a rough coverage graph quickly. The goal of this document is to compile information regarding the composition of the bai index and how it can be used by IGB to create rough coverage graphs.
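
          To illustrate the 16,384 base pair increments, here is a minimal sketch that maps a 0-based genomic position to its window and to the smallest (leaf) bin covering it, assuming the standard binning scheme from the SAM/BAM specification. The function and constant names are hypothetical.

```python
from typing import Tuple

WINDOW_SHIFT = 14                       # 2**14 = 16,384 bp per window
FIRST_LEAF_BIN = ((1 << 15) - 1) // 7   # 4681, first 16 kb bin in the standard scheme


def window_index(pos0: int) -> int:
    """Index of the 16,384 bp window containing a 0-based position."""
    return pos0 >> WINDOW_SHIFT


def leaf_bin(pos0: int) -> int:
    """Number of the smallest (16 kb) bin containing a 0-based position."""
    return FIRST_LEAF_BIN + window_index(pos0)


def window_range(index: int) -> Tuple[int, int]:
    """Half-open [start, end) coordinates covered by a window."""
    return index << WINDOW_SHIFT, (index + 1) << WINDOW_SHIFT


print(window_index(100000), leaf_bin(100000), window_range(6))
# 6 4687 (98304, 114688)
```

          A rough coverage graph would then have one value per window, plotted over the range returned by window_range.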

          Nowlan Freese (nfreese) added a comment

          The working slide deck and document are in Google Drive > IGB Project Documentation and Plans > Index Hacking

          I am currently creating a document using figures from the slides.


            People

            • Assignee:
              nfreese Nowlan Freese
              Reporter:
              ann.loraine Ann Loraine
            • Votes:
              0
              Watchers:
              3
