Details

    • Type: Story
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Story Points:
      12
    • Sprint:
      Summer 2019 Sprint 12, Fall 2019 Sprint 1, Fall 2019 Sprint 2, Fall 2019 Sprint 3, Fall 4 : 30 Sep to 11 Oct, Fall 5 : 14 Oct to 25 Oct, Fall 6 : 28 Oct to 8 Nov

      Description

      In genomics we deal with very large files that associate numbers or features with genomic ranges.

      These files are often 10 GB or larger, with many millions of features.

      Getting a useful overview of all these data is challenging. The current state of the art is the so-called "coverage graph," which plots the number of features per genomic base and is displayed as a graph in a genome browser track. Some genome browsers (like IGB) can calculate these graphs "on the fly" after loading alignments (BAM files) into memory.

      Large genome-based files are sorted by sequence name and genomic position, then indexed so that genome browsers can efficiently load data from specific regions of the genome. For example, BAM files have indexes called BAI files, which map locations in the genome to locations in the file. Tab-delimited file formats like BED and GFF can have indexes called TBI files. BigWig and BigBed files carry their indexes inside the file itself rather than in a separate index file.
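
      To make "map locations in the genome to locations in the file" concrete, here is a minimal Python sketch based on the BAI layout described in the SAM/BAM specification. The linear index keeps one virtual file offset per 16,384-base window, and each virtual offset packs a compressed-block position and a within-block position into one 64-bit integer. The function names are illustrative assumptions, not taken from IGB or any existing tool.

```python
from typing import Tuple

TILE_SHIFT = 14  # linear-index windows are 2**14 = 16,384 bases wide


def split_virtual_offset(voffset: int) -> Tuple[int, int]:
    """Split a 64-bit BGZF virtual offset into its two parts.

    Returns (byte offset of the compressed block within the BAM file,
             byte offset within that block once it is uncompressed).
    """
    return voffset >> 16, voffset & 0xFFFF


def linear_window(pos: int) -> int:
    """Return the index of the 16 kb window containing a 0-based position."""
    return pos >> TILE_SHIFT


if __name__ == "__main__":
    # Hypothetical virtual offset: block starts at byte 123456, record at byte 789.
    print(split_virtual_offset((123456 << 16) | 789))
    # Position 50,000 falls in the fourth window (index 3).
    print(linear_window(50_000))
```

      A browser that wants reads overlapping a given position looks up that position's window and seeks to the stored offset instead of scanning the whole BAM file.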

      Back in 2011, Dr. Loraine had the idea to use the index files themselves as a summary of the data file, giving users an overview of how the data are distributed across the genome.

      At the time, we were collaborating with Michael Lawrence of Genentech. We did a little work on it, but eventually dropped it in favor of other things.

      Recently, Aaron Quinlan's group published a paper that developed this idea further. Dr. Loraine saw a tweet about the article, recalled that she had thought of something similar many years before, and contacted Michael about it.

      He then pointed her to some code he and his Bioconductor collaborators had implemented that also explored the idea.

      I think it's time to revisit this idea because, in the interim, whole genome sequencing has become a much more practical way to diagnose genetic problems. Let's jump back on it and see where it takes us!

      References:

      • Emails, notes: https://www.dropbox.com/sh/2sdv6k7k4s2lso5/AAAO1zcMb-pRqOz5iEMh7rX4a
      • New paper: https://academic.oup.com/gigascience/article/6/11/gix090/4160383
      • R implementation: https://github.com/jorainer/biovizBase/blob/master/R/coverage.R

      Goal:

      • Using one of the above tools, create a BED or bigWig file from a BAI file and visualize it in IGB
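
      As a rough illustration of what such a conversion might involve, the Python sketch below turns the linear-index offsets for one sequence into bedGraph lines. It assumes the offsets have already been read from the BAI file, and it scores each 16,384-base window by the number of compressed bytes between consecutive offsets. This follows the spirit of the index-based approach but is not the exact algorithm used by any of the referenced tools; the function name and its parameters are hypothetical.

```python
from typing import Iterator, List

TILE = 16384  # width in bases of one linear-index window


def estimate_bedgraph(chrom: str, ioffsets: List[int], chrom_len: int) -> Iterator[str]:
    """Yield bedGraph lines scored by compressed bytes per 16 kb window.

    ioffsets holds one 64-bit virtual offset per window, in genomic order.
    The last window has no following offset, so it is not reported.
    """
    # Keep only the compressed-file byte position (upper 48 bits).
    coffsets = [v >> 16 for v in ioffsets]
    for i in range(len(coffsets) - 1):
        start = i * TILE
        if start >= chrom_len:
            break
        end = min(start + TILE, chrom_len)
        # Assumption: clamp negative differences (e.g. unset offsets) to zero.
        score = max(coffsets[i + 1] - coffsets[i], 0)
        yield f"{chrom}\t{start}\t{end}\t{score}"


if __name__ == "__main__":
    # Made-up offsets for a 40,000-base sequence, for demonstration only.
    demo_offsets = [100 << 16, 300 << 16, 300 << 16, 1000 << 16]
    for line in estimate_bedgraph("chr1", demo_offsets, 40_000):
        print(line)
```

      The scores are byte counts, not read depth, so they are only comparable within one file; a real comparison against a coverage graph would need some normalization.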

      Compare to ordinary coverage graph:

      • Performance: is it faster to load and less memory-intensive?
      • Do we notice any new patterns not previously apparent?

      Investigate:

      • Can we implement a new file parser for BAI index files in IGB?
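
      To give a sense of how small such a parser could be, here is a hedged Python sketch that reads the linear index of every reference sequence from a BAI stream, following the binary layout in the SAM/BAM specification (magic bytes, per-reference bins and chunks, then the linear-index offsets). IGB itself is written in Java, so this only illustrates the file layout; the function name is an assumption, and the sketch does not reflect how the parser was actually implemented.

```python
import io
import struct
from typing import BinaryIO, List


def read_linear_indexes(stream: BinaryIO) -> List[List[int]]:
    """Return, for each reference sequence, its linear-index virtual offsets."""
    if stream.read(4) != b"BAI\x01":
        raise ValueError("not a BAI file (bad magic bytes)")
    (n_ref,) = struct.unpack("<i", stream.read(4))
    result = []
    for _ in range(n_ref):
        # The binning index is not needed for a coverage estimate, so skip it.
        (n_bin,) = struct.unpack("<i", stream.read(4))
        for _ in range(n_bin):
            _bin_id, n_chunk = struct.unpack("<Ii", stream.read(8))
            stream.seek(16 * n_chunk, io.SEEK_CUR)  # each chunk is two uint64 values
        # Linear index: one 64-bit virtual offset per 16,384-base window.
        (n_intv,) = struct.unpack("<i", stream.read(4))
        offsets = struct.unpack(f"<{n_intv}Q", stream.read(8 * n_intv))
        result.append(list(offsets))
    return result


if __name__ == "__main__":
    # Tiny hand-built index: one reference, one bin with one chunk, two windows.
    demo = (
        b"BAI\x01"
        + struct.pack("<i", 1)
        + struct.pack("<i", 1)
        + struct.pack("<Ii", 4681, 1)
        + struct.pack("<QQ", 10 << 16, 20 << 16)
        + struct.pack("<i", 2)
        + struct.pack("<QQ", 10 << 16, 15 << 16)
    )
    print(read_linear_indexes(io.BytesIO(demo)))
```

      Because only the index is read, the input stays small even when the BAM file itself is many gigabytes; the attached HG003.GRCh38.2x250.bam.bai, for example, is 8.96 MB.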

        Attachments

        1. BS-seq_Chr1.bam.bai
          88 kB
        2. BS-seq_Chr1-HEADERONLY.bam
          0.3 kB
        3. comparison.png
          115 kB
        4. empty.bam
          0.1 kB
        5. empty.bam.bai
          0.0 kB
        6. estimateCoverage.bedgraph
          182 kB
        7. estimateCoverageScript.R
          1 kB
        8. galaxy.bai
          88 kB
        9. galaxy.bam
          0.3 kB
        10. galaxy.zip
          79 kB
        11. HG003.GRCh38.2x250.bam.bai
          8.96 MB
        12. indexcov.bedgraph
          50 kB
        13. indexcov.png
          15 kB
        14. jvarkit.bedgraph
          52 kB

          Issue Links

            Activity

            ann.loraine Ann Loraine created issue -
            ann.loraine Ann Loraine made changes -
            Field Original Value New Value
            Epic Link IGBF-1919 [ 18010 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Sprint Summer 2019 Sprint 12 [ 71 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked lower
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            svallapu Sai Charan Reddy Vallapureddy (Inactive) made changes -
            Assignee Sai Charan Reddy Vallapureddy [ svallapu ]
            svallapu Sai Charan Reddy Vallapureddy (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            nfreese Nowlan Freese made changes -
            Attachment indexcov.png [ 14388 ]
            svallapu Sai Charan Reddy Vallapureddy (Inactive) made changes -
            Attachment galaxy.zip [ 14390 ]
            ann.loraine Ann Loraine made changes -
            Link This issue relates to IGBC-194 [ IGBC-194 ]
            ann.loraine Ann Loraine made changes -
            Link This issue relates to IGBC-433 [ IGBC-433 ]
            ann.loraine Ann Loraine made changes -
            Link This issue relates to IGBC-328 [ IGBC-328 ]
            nfreese Nowlan Freese made changes -
            Attachment estimateCoverageScript.R [ 14396 ]
            Attachment estimateCoverage.bedgraph [ 14397 ]
            Attachment indexcov.bedgraph [ 14398 ]
            ann.loraine Ann Loraine made changes -
            Sprint Summer 2019 Sprint 12 [ 71 ] Summer 2019 Sprint 12, Fall 2019 Sprint 1 [ 71, 72 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Description In genomics we deal with very large files that associate numbers or features with genomic ranges.

            These files are often 10 GB or larger, with many millions of features.

            Getting a useful overview of all these data is challenging. The current state of the art is the so-called "coverage graph," which plots the number of features per genomic base and is displayed as a graph in a genome browser track. Some genome browsers (like IGB) can calculate these graphs "on the fly" after loading the data.

            These large files are sorted by sequence name and genomic position, then indexed so that genome browsers can efficiently load data from specific regions of the genome. For example, BAM files have indexes called BAI files, which map locations in the genome to locations in the file. Tab-delimited file formats like BED and GFF can have indexes called TBI files. BigWig and BigBed files carry their indexes inside the file itself rather than in a separate index file.

            Back in 2011, Dr. Loraine had the idea to use the index files themselves as a summary of the larger file to give users an overview of the distribution of data in the larger file.

            At the time, we were collaborating with Michael Lawrence of Genentech. We did a little work on it, but eventually dropped it in favor of other things.

            Recently, a new paper was published from Aaron Quinlan's group that developed this idea further. Dr. Loraine saw a tweet about the article, recalled that she had thought of something similar many years before, and contacted Michael about it.

            He then pointed her to some code he and his Bioconductor collaborators had implemented that also explored the idea.

            I think it's time to revisit this idea because, in the interim, whole genome sequencing has become a much more practical way to diagnose genetic problems. Also, the paper from Dr. Quinlan is likely to get a lot of attention, which means the idea will finally reach a wide audience. Let's jump back on it and see where it takes us!

            References:

            * Emails, notes: https://www.dropbox.com/sh/2sdv6k7k4s2lso5/AAAO1zcMb-pRqOz5iEMh7rX4a
            * New paper: https://academic.oup.com/gigascience/article/6/11/gix090/4160383
            * R implementation https://github.com/jorainer/biovizBase/blob/master/R/coverage.R

            Goal:

            * Using one of the above tools, create BED or bigwig file from a BAI file and visualize in IGB

            Compare to ordinary coverage graph:

            * Performance - is it faster to load, less memory-intensive?
            * Do we notice any new patterns not previously apparent?

            Investigate:

            * Can we implement a new file parser for bai index files in IGB?

            svallapu Sai Charan Reddy Vallapureddy (Inactive) made changes -
            Status In Progress [ 3 ] Open [ 1 ]
            svallapu Sai Charan Reddy Vallapureddy (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            ann.loraine Ann Loraine made changes -
            Workflow Loraine Lab Workflow [ 18644 ] Fall 2019 Workflow Update [ 19076 ]
            nfreese Nowlan Freese made changes -
            Attachment comparison.png [ 14400 ]
            ann.loraine Ann Loraine made changes -
            Comment [ Commit message - "bored during holidays" :-) ]
            nfreese Nowlan Freese made changes -
            Attachment jvarkit.bedgraph [ 14401 ]
            ann.loraine Ann Loraine made changes -
            Sprint Summer 2019 Sprint 12, Fall 2019 Sprint 1 [ 71, 72 ] Summer 2019 Sprint 12, Fall 2019 Sprint 1, Fall 2019 Sprint 2 [ 71, 72, 73 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Sprint Summer 2019 Sprint 12, Fall 2019 Sprint 1, Fall 2019 Sprint 2 [ 71, 72, 73 ] Summer 2019 Sprint 12, Fall 2019 Sprint 1, Fall 2019 Sprint 2, Fall 2019 Sprint 4 [ 71, 72, 73, 74 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            svallapu Sai Charan Reddy Vallapureddy (Inactive) made changes -
            Attachment galaxy.bai [ 14406 ]
            Attachment galaxy.bam [ 14407 ]
            ann.loraine Ann Loraine made changes -
            Sprint Summer 2019 Sprint 12, Fall 2019 Sprint 1, Fall 2019 Sprint 2, Fall 2019 Sprint 3 [ 71, 72, 73, 74 ] Summer 2019 Sprint 12, Fall 2019 Sprint 1, Fall 2019 Sprint 2, Fall 2019 Sprint 3, Fall 2019 Sprint 4 [ 71, 72, 73, 74, 75 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Sprint Summer 2019 Sprint 12, Fall 2019 Sprint 1, Fall 2019 Sprint 2, Fall 2019 Sprint 3, Fall 4 : 30 Sep to 11 Oct [ 71, 72, 73, 74, 75 ] Summer 2019 Sprint 12, Fall 2019 Sprint 1, Fall 2019 Sprint 2, Fall 2019 Sprint 3, Fall 4 : 30 Sep to 11 Oct, Fall 5 : 14 Oct to 25 Oct [ 71, 72, 73, 74, 75, 76 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            nfreese Nowlan Freese made changes -
            Link This issue relates to IGBF-2067 [ IGBF-2067 ]
            ann.loraine Ann Loraine made changes -
            Workflow Fall 2019 Workflow Update [ 19076 ] Revised Fall 2019 Workflow Update [ 21191 ]
            nfreese Nowlan Freese made changes -
            Attachment empty.bam [ 14468 ]
            Attachment empty.bam.bai [ 14469 ]
            nfreese Nowlan Freese made changes -
            Attachment HG003.GRCh38.2x250.bam.bai [ 14470 ]
            ann.loraine Ann Loraine made changes -
            Sprint Summer 2019 Sprint 12, Fall 2019 Sprint 1, Fall 2019 Sprint 2, Fall 2019 Sprint 3, Fall 4 : 30 Sep to 11 Oct, Fall 5 : 14 Oct to 25 Oct [ 71, 72, 73, 74, 75, 76 ] Summer 2019 Sprint 12, Fall 2019 Sprint 1, Fall 2019 Sprint 2, Fall 2019 Sprint 3, Fall 4 : 30 Sep to 11 Oct, Fall 5 : 14 Oct to 25 Oct, Fall 6 : 28 Oct to 8 Nov [ 71, 72, 73, 74, 75, 76, 77 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            nfreese Nowlan Freese made changes -
            Link This issue relates to IGBF-2100 [ IGBF-2100 ]
            nfreese Nowlan Freese made changes -
            Link This issue relates to IGBF-2101 [ IGBF-2101 ]
            nfreese Nowlan Freese made changes -
            Link This issue relates to IGBF-2104 [ IGBF-2104 ]
            svallapu Sai Charan Reddy Vallapureddy (Inactive) made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            svallapu Sai Charan Reddy Vallapureddy (Inactive) made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            svallapu Sai Charan Reddy Vallapureddy (Inactive) made changes -
            Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
            svallapu Sai Charan Reddy Vallapureddy (Inactive) made changes -
            Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
            ann.loraine Ann Loraine made changes -
            Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
            ann.loraine Ann Loraine made changes -
            Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
            ann.loraine Ann Loraine made changes -
            Assignee Sai Charan Reddy Vallapureddy [ svallapu ]
            nfreese Nowlan Freese made changes -
            Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
            nfreese Nowlan Freese made changes -
            Assignee Nowlan Freese [ nfreese ]
            nfreese Nowlan Freese made changes -
            Story Points 4 12
            nfreese Nowlan Freese made changes -
            Assignee Nowlan Freese [ nfreese ] Sai Charan Reddy Vallapureddy [ svallapu ]
            nfreese Nowlan Freese made changes -
            Resolution Done [ 10000 ]
            Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]
            nfreese Nowlan Freese made changes -
            Attachment BS-seq_Chr1-HEADERONLY.bam [ 14786 ]
            Attachment BS-seq_Chr1.bam.bai [ 14787 ]
            nfreese Nowlan Freese made changes -
            Link This issue relates to IGBF-990 [ IGBF-990 ]
            ann.loraine Ann Loraine made changes -
            Summary Investigate index hacking Visualizing the index

              People

              • Assignee:
                svallapu Sai Charan Reddy Vallapureddy (Inactive)
              • Reporter:
                ann.loraine Ann Loraine
              • Votes:
                0
              • Watchers:
                3

                Dates

                • Created:
                  Updated:
                  Resolved: