Details

    • Type: Story
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Story Points:
      12
    • Sprint:
      Summer 2019 Sprint 12, Fall 2019 Sprint 1, Fall 2019 Sprint 2, Fall 2019 Sprint 3, Fall 4 : 30 Sep to 11 Oct, Fall 5 : 14 Oct to 25 Oct, Fall 6 : 28 Oct to 8 Nov

      Description

      In genomics we deal with very large files that associate numbers or features with genomic ranges.

      Often these files get large - 10 Gb or bigger - with many millions of features.

      Getting useful overviews of all these data is challenging. Current state of the art are so-called "coverage graphs" that plot the number of features per genomic base and display them as graphs in genome browser tracks. Some genome browsers (like IGB) can calculate these graphs "on the fly" after loading alignments (BAM files) into memory.

      Large genome-based files are sorted by sequence name and genomic position and then indexed to facilitate efficient loading of data from specific regions of the genome into genome browsers. For example BAM files have indexes called BAI files, which map locations in the genome to locations in the file. Tab-delimited file formats like BED and GFF can have indexes called TBI files. Bigwig and Bigbed files formats have indexes written at the top of the file.

      Back in 2011, Dr. Loraine had the idea to use the index files themselves as a summary of the larger file to give users an overview of the distribution of data in the larger file.

      At the time, we were collaborating with Michael Lawrence of Genentech. We did a little work on it, but eventually dropped it in favor of other things.

      Recently, a new paper was published from Aaron Quinlan's group that developed this idea further. Dr. Loraine saw a tweet about the article, recalled that she had thought of something similar many years before, and contacted Michael about it.

      He then pointed her to some code he and his Bioconductor collaborators had implemented that also explored the idea.

      I think it's time to re-visit this idea because in the interim, whole genome sequencing as a way to diagnose genetic problems has become much more practical. Let's jump back on it and see where it takes us!

      References:

      Goal:

      • Using one of the above tools, create BED or bigwig file from a BAI file and visualize in IGB

      Compare to ordinary coverage graph:

      • Performance - is it faster to load, less memory-intensive?
      • Do we notice any new patterns not previously apparent?

      Investigate:

      • Can we implement a new file parser for bai index files in IGB?

        Attachments

        1. BS-seq_Chr1.bam.bai
          88 kB
        2. BS-seq_Chr1-HEADERONLY.bam
          0.3 kB
        3. comparison.png
          comparison.png
          115 kB
        4. empty.bam
          0.1 kB
        5. empty.bam.bai
          0.0 kB
        6. estimateCoverage.bedgraph
          182 kB
        7. estimateCoverageScript.R
          1 kB
        8. galaxy.bai
          88 kB
        9. galaxy.bam
          0.3 kB
        10. galaxy.zip
          79 kB
        11. HG003.GRCh38.2x250.bam.bai
          8.96 MB
        12. indexcov.bedgraph
          50 kB
        13. indexcov.png
          indexcov.png
          15 kB
        14. jvarkit.bedgraph
          52 kB

          Issue Links

            Activity

            Hide
            svallapu Sai Charan Reddy Vallapureddy added a comment - - edited

            Nowlan Freese Ann Loraine
            BAM file is no longer needed to load BAI file with the below changes. To remove the dependency changes are made in htsjdk project.

            Branches:
            https://bitbucket.org/svallapu/charan_igb/branch/IGBF-1920-BAI

            https://github.com/VallapuCharan/htsjdk-igb/commits/igb-2.16.3
            (Don't use master, use igb-2.16.2)

            Steps to build with new changes.
            1. Clone my htsjdk branch.
            2. Gradle clean and install.
            3. Replace old build(2.16.2) with the new one (2.16.3) in .m2 repository(htsjdk).
            4. Clone my IGB branch.
            5. Maven clean and install.
            6. Run and test it

            Test steps:
            1. Select A. Thaliana (flower)
            2. Drag and drop attached galaxy.bai file into IGB
            3. Click the Load Sequence.
            4. You should see a graph.

            Todo:
            1. Bedgraph file is created to get the graph. (Discussion required)

            Show
            svallapu Sai Charan Reddy Vallapureddy added a comment - - edited Nowlan Freese Ann Loraine BAM file is no longer needed to load BAI file with the below changes. To remove the dependency changes are made in htsjdk project. Branches: https://bitbucket.org/svallapu/charan_igb/branch/IGBF-1920-BAI https://github.com/VallapuCharan/htsjdk-igb/commits/igb-2.16.3 (Don't use master, use igb-2.16.2) Steps to build with new changes. 1. Clone my htsjdk branch. 2. Gradle clean and install. 3. Replace old build(2.16.2) with the new one (2.16.3) in .m2 repository(htsjdk). 4. Clone my IGB branch. 5. Maven clean and install. 6. Run and test it Test steps: 1. Select A. Thaliana (flower) 2. Drag and drop attached galaxy.bai file into IGB 3. Click the Load Sequence. 4. You should see a graph. Todo: 1. Bedgraph file is created to get the graph. (Discussion required)
            Hide
            svallapu Sai Charan Reddy Vallapureddy added a comment - - edited

            Ann Loraine Nowlan Freese

            Branches:
            IGB:
            IGB (Sync with the master. I squashed all my commits to single commit)
            https://bitbucket.org/svallapu/charan_igb/branch/IGBF-1920-BAI

            Htsjsk:
            htsjdk-igb (2.16.3) (It is now in sync with upstream igb-2.16.3. Pull request is merged.)
            https://github.com/VallapuCharan/htsjdk-igb/commits/igb-2.16.3

            Show
            svallapu Sai Charan Reddy Vallapureddy added a comment - - edited Ann Loraine Nowlan Freese Branches: IGB: IGB (Sync with the master. I squashed all my commits to single commit) https://bitbucket.org/svallapu/charan_igb/branch/IGBF-1920-BAI Htsjsk: htsjdk-igb (2.16.3) (It is now in sync with upstream igb-2.16.3. Pull request is merged.) https://github.com/VallapuCharan/htsjdk-igb/commits/igb-2.16.3
            Hide
            svallapu Sai Charan Reddy Vallapureddy added a comment -

            Ann Loraine
            Pull Request is submitted. (Created a new branch with new comments and new method names. All your comments are addressed.)

            Branch: https://bitbucket.org/svallapu/charan_igb/branch/IGBF-1920-BAI

            (Just FYI: your comments are in different branch: https://bitbucket.org/svallapu/charan_igb/commits/2f02b6e76790a66bc438f4649c8ea4a9aad061bd?at=IGBF-1920-Final)

            Show
            svallapu Sai Charan Reddy Vallapureddy added a comment - Ann Loraine Pull Request is submitted. (Created a new branch with new comments and new method names. All your comments are addressed.) Branch: https://bitbucket.org/svallapu/charan_igb/branch/IGBF-1920-BAI (Just FYI: your comments are in different branch: https://bitbucket.org/svallapu/charan_igb/commits/2f02b6e76790a66bc438f4649c8ea4a9aad061bd?at=IGBF-1920-Final )
            Hide
            nfreese Nowlan Freese added a comment -

            Tested using [jar | HG002.GRCh38.2x250.bam.bai] from loraine lab bitbucket.

            Able to load HG002.GRCh38.2x250.bam.bai. Produced a bedgraph file that was then automatically loaded into IGB.

            Closing issue

            Note: Additional issues have been created to fix the scaling of the data and loading directly from the bai (instead of bedgraph).

            Show
            nfreese Nowlan Freese added a comment - Tested using [jar | HG002.GRCh38.2x250.bam.bai] from loraine lab bitbucket. Able to load HG002.GRCh38.2x250.bam.bai. Produced a bedgraph file that was then automatically loaded into IGB. Closing issue Note: Additional issues have been created to fix the scaling of the data and loading directly from the bai (instead of bedgraph).
            Hide
            nfreese Nowlan Freese added a comment -

            Note: I was interested in why we were getting negative values from some of the offsets, so I looked at the raw offset data generated by jvarkit. The negative values are occurring when an offset is less than the offset before it. It's unclear why an offset would ever be negative (and not zero), however, looking at the bam file, it appears that negative and zero value offsets tend to occur in areas where there is no coverage. This would explain why any time we see negative values in our data, they are zeros in the same locations in indexcov or estimateCoverage.

            Show
            nfreese Nowlan Freese added a comment - Note: I was interested in why we were getting negative values from some of the offsets, so I looked at the raw offset data generated by jvarkit. The negative values are occurring when an offset is less than the offset before it. It's unclear why an offset would ever be negative (and not zero), however, looking at the bam file, it appears that negative and zero value offsets tend to occur in areas where there is no coverage. This would explain why any time we see negative values in our data, they are zeros in the same locations in indexcov or estimateCoverage.

              People

              • Assignee:
                svallapu Sai Charan Reddy Vallapureddy
                Reporter:
                aloraine Ann Loraine
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: