Details

    • Story Points:
      5
    • Sprint:
      Fall 4 2023 Oct 16, Fall 6, Fall 7, Fall 8, Spring 1, Spring 2

      Description

      Task: Add CRAM file support to IGB.

      Task list:

      • Implement CRAM reference source
      • Implement CRAM file type handler
      • Implement CRAM methods(parse(), getSAMReader())
      • Load 2-bit data from code
      • Check the possibility of unit tests
      • Review the code
      • Test code with different input samples

      Working Branch

        Attachments

          Issue Links

            Activity

            Hide
            kgopu Kaushik Gopu added a comment - - edited

            Notes for myself

            • BioSeq object holds all the residues of the current chromosome
            • What is SamSequenceRecord?
            • GenometryModel : contains information of currently loaded sequence
            • GenometryModel.getInstance().getSelectedSeq().orElse(null): To retrieve the loaded
              chromosome sequence, use this specific method.
            Show
            kgopu Kaushik Gopu added a comment - - edited Notes for myself BioSeq object holds all the residues of the current chromosome What is SamSequenceRecord? GenometryModel : contains information of currently loaded sequence GenometryModel.getInstance().getSelectedSeq().orElse(null): To retrieve the loaded chromosome sequence, use this specific method.
            Hide
            kgopu Kaushik Gopu added a comment - - edited
            • Implemented CRAMReferenceSource to supply a reference source (2-bit data) when reading CRAM files.
            • The Implementation of CRAMReferenceSource in IGB is IGBReferenceSource.
            • This reference source is passed as a parameter to SamReaderFactory for mapping CRAM tracks to the 2-bit sequence (see below).
              // Some comments here
              final SamReaderFactory factory = SamReaderFactory.makeDefault().referenceSource(new IGBReferenceSource()).validationStringency(ValidationStringency.SILENT);
              
            • Below is the implementation of IGBReferenceSource that reads the 2-bit sequence information from IGB
            • package org.lorainelab.igb.bam;
              
              import com.affymetrix.genometry.BioSeq;
              import com.affymetrix.genometry.GenometryModel;
              import htsjdk.samtools.SAMSequenceRecord;
              import htsjdk.samtools.cram.ref.CRAMReferenceSource;
              
              import java.nio.charset.StandardCharsets;
              
              public class IGBReferenceSource implements CRAMReferenceSource {
                  @Override
                  public byte[] getReferenceBases(SAMSequenceRecord sequenceRecord, boolean tryNameVariants) {
                      return getReferenceBasesByRegion(sequenceRecord,0, sequenceRecord.getSequenceLength());
                  }
              
                  @Override
                  public byte[] getReferenceBasesByRegion(SAMSequenceRecord sequenceRecord, int zeroBasedStart, int requestedRegionLength) {
                      final BioSeq bioSeq = GenometryModel.getInstance().getSelectedSeq().orElse(null);
                      if(bioSeq!=null){
                          byte[] bytes = bioSeq.
                                  getResidues(zeroBasedStart,zeroBasedStart+requestedRegionLength).
                                  getBytes(StandardCharsets.US_ASCII);
                          return bytes;
                      }
                      return new byte[]{};
                  }
              }
              
              
            Show
            kgopu Kaushik Gopu added a comment - - edited Implemented CRAMReferenceSource to supply a reference source (2-bit data) when reading CRAM files. The Implementation of CRAMReferenceSource in IGB is IGBReferenceSource . This reference source is passed as a parameter to SamReaderFactory for mapping CRAM tracks to the 2-bit sequence (see below). // Some comments here final SamReaderFactory factory = SamReaderFactory.makeDefault().referenceSource( new IGBReferenceSource()).validationStringency(ValidationStringency.SILENT); Below is the implementation of IGBReferenceSource that reads the 2-bit sequence information from IGB package org.lorainelab.igb.bam; import com.affymetrix.genometry.BioSeq; import com.affymetrix.genometry.GenometryModel; import htsjdk.samtools.SAMSequenceRecord; import htsjdk.samtools.cram.ref.CRAMReferenceSource; import java.nio.charset.StandardCharsets; public class IGBReferenceSource implements CRAMReferenceSource { @Override public byte [] getReferenceBases(SAMSequenceRecord sequenceRecord, boolean tryNameVariants) { return getReferenceBasesByRegion(sequenceRecord,0, sequenceRecord.getSequenceLength()); } @Override public byte [] getReferenceBasesByRegion(SAMSequenceRecord sequenceRecord, int zeroBasedStart, int requestedRegionLength) { final BioSeq bioSeq = GenometryModel.getInstance().getSelectedSeq().orElse( null ); if (bioSeq!= null ){ byte [] bytes = bioSeq. getResidues(zeroBasedStart,zeroBasedStart+requestedRegionLength). getBytes(StandardCharsets.US_ASCII); return bytes; } return new byte []{}; } }
            Hide
            kgopu Kaushik Gopu added a comment - - edited

            Sample Data For testing:

            https://drive.google.com/drive/folders/1bbw3eqvmjUvgYfWqv7Wmkl-o9SlQUeFM?usp=sharing

            Chr1:6,550-8,753
            Chr2:73,092-75,624
            Chr3:590,284-593,368
            Chr4:1,376,133-1,379,919
            Chr5:1,378,604-1,382,638

            Show
            kgopu Kaushik Gopu added a comment - - edited Sample Data For testing: https://drive.google.com/drive/folders/1bbw3eqvmjUvgYfWqv7Wmkl-o9SlQUeFM?usp=sharing Chr1:6,550-8,753 Chr2:73,092-75,624 Chr3:590,284-593,368 Chr4:1,376,133-1,379,919 Chr5:1,378,604-1,382,638
            Hide
            kgopu Kaushik Gopu added a comment - - edited

            updates so far

            • The link to the working branch can be found in the description
            • Progress of the task can be found in the description
            • Recent commit include file handler implementation, cram reference source, and core cram logic
            Show
            kgopu Kaushik Gopu added a comment - - edited updates so far The link to the working branch can be found in the description Progress of the task can be found in the description Recent commit include file handler implementation, cram reference source, and core cram logic
            Hide
            nfreese Nowlan Freese added a comment - - edited

            Nowlan's comments after discussing with Kaushik:

            Currently with Kaushik's code, when a user adds a cram file and then clicks load data, the sequence data are loaded first for the region in view. However, this creates a potential problem if the user has zoomed in on the data. If part of a read is outside of the bounds of the user's view, it will need the sequence for the region outside of the user's view. However, since we are only loading the sequence data within the user's view, the logic will break and we see an error. Instead, we need to load the sequence for any reads that overlap the user's view. For example, if a read starts outside of the user's view to the left then we need to also load the sequence data for that region.

            Show
            nfreese Nowlan Freese added a comment - - edited Nowlan's comments after discussing with Kaushik: Currently with Kaushik's code, when a user adds a cram file and then clicks load data, the sequence data are loaded first for the region in view. However, this creates a potential problem if the user has zoomed in on the data. If part of a read is outside of the bounds of the user's view, it will need the sequence for the region outside of the user's view. However, since we are only loading the sequence data within the user's view, the logic will break and we see an error. Instead, we need to load the sequence for any reads that overlap the user's view. For example, if a read starts outside of the user's view to the left then we need to also load the sequence data for that region.
            Hide
            nfreese Nowlan Freese added a comment -

            Nowlan's comments after discussing with Dr. Loraine:

            The easiest solution would be to load the sequence for the entire chromosome when a user loads data from a cram file. This would be much easier to implement than determining the range of the data that is being loaded. However, it would come at the cost of increased time to load the data initially and a larger memory footprint.

            If loading the sequence based on the range of the data being loaded by the user is overly difficult/complex, we can move forward with loading the sequence data for the entire chromosome when a user loads data from a cram file.

            Show
            nfreese Nowlan Freese added a comment - Nowlan's comments after discussing with Dr. Loraine: The easiest solution would be to load the sequence for the entire chromosome when a user loads data from a cram file. This would be much easier to implement than determining the range of the data that is being loaded. However, it would come at the cost of increased time to load the data initially and a larger memory footprint. If loading the sequence based on the range of the data being loaded by the user is overly difficult/complex, we can move forward with loading the sequence data for the entire chromosome when a user loads data from a cram file.
            Hide
            nfreese Nowlan Freese added a comment -

            See the code for how Mismatch Pileup Graph works on a bam file in IGB. You can test this using the instructions on the Track Operators Annotation Track testing page - find the section "single track - output a graph track from a bam file (part 2)".

            What is happening in IGB is that the Bam file is loaded first. If a bam read falls partially in the user's view then the bam read both in and out of the user's view is loaded. The mismatch pileup graph requires the sequencing data to work. So when the user clicks load data again to load the mismatch pileup graph, it requests the sequence data that overlaps the user's view and the the sequence data that overlaps the bam read that is outside of the user's view. This is effectively the same as what we need to happen with cram data.

            Show
            nfreese Nowlan Freese added a comment - See the code for how Mismatch Pileup Graph works on a bam file in IGB. You can test this using the instructions on the Track Operators Annotation Track testing page - find the section "single track - output a graph track from a bam file (part 2)". What is happening in IGB is that the Bam file is loaded first. If a bam read falls partially in the user's view then the bam read both in and out of the user's view is loaded. The mismatch pileup graph requires the sequencing data to work. So when the user clicks load data again to load the mismatch pileup graph, it requests the sequence data that overlaps the user's view and the the sequence data that overlaps the bam read that is outside of the user's view. This is effectively the same as what we need to happen with cram data.
            Hide
            kgopu Kaushik Gopu added a comment -
            • Updated branch with new code changes to load sequence based on range
            • Nowlan Freese, pull the branch (link available in the description), and give chr range as input, move the scroll bar towards the right, and hit load data
            • Only the sequence from the given range will be loaded
            • Test it with different cases and let me know if it's not working as intended.
            Show
            kgopu Kaushik Gopu added a comment - Updated branch with new code changes to load sequence based on range Nowlan Freese , pull the branch (link available in the description), and give chr range as input, move the scroll bar towards the right, and hit load data Only the sequence from the given range will be loaded Test it with different cases and let me know if it's not working as intended.
            Hide
            nfreese Nowlan Freese added a comment - - edited

            I have created new bam/cram files that contain multiple regions of reads on the same chromosome for the Arabidopsis thaliana genome.
            Data are located here: https://drive.google.com/drive/folders/1Q4Wm8clzg6PC5GOs-ywJK2AclzpsSlFU?usp=sharing

            General location of the data
            Chr1:6689-8835
            Chr1:9400956-9404045
            Chr1:1:23635652-23637669

            NOTE: Creating a cram file requires the sequence file (fasta). When viewing the cram file in a genome browser, the same sequence file must be used by the genome browser that was used to create the cram file. For example, if I use the sequence file for the A_thaliana_Jun_2009 genome version in IGB to make the cram file, that cram file will not work in IGV. It's unclear to me at this time what the difference between IGB and IGV's versions of the genomes is (chromosome names, lengths, the sequence itself), but the error will show as an MD5 validation failure.

            I have included two different cram files, one that is valid in IGV and one that is valid in IGB. You can use the IGV cram file to view the MD5 error in IGB by attempting to load the file.

            Show
            nfreese Nowlan Freese added a comment - - edited I have created new bam/cram files that contain multiple regions of reads on the same chromosome for the Arabidopsis thaliana genome. Data are located here: https://drive.google.com/drive/folders/1Q4Wm8clzg6PC5GOs-ywJK2AclzpsSlFU?usp=sharing General location of the data Chr1:6689-8835 Chr1:9400956-9404045 Chr1:1:23635652-23637669 NOTE: Creating a cram file requires the sequence file (fasta). When viewing the cram file in a genome browser, the same sequence file must be used by the genome browser that was used to create the cram file. For example, if I use the sequence file for the A_thaliana_Jun_2009 genome version in IGB to make the cram file, that cram file will not work in IGV. It's unclear to me at this time what the difference between IGB and IGV's versions of the genomes is (chromosome names, lengths, the sequence itself), but the error will show as an MD5 validation failure. I have included two different cram files, one that is valid in IGV and one that is valid in IGB. You can use the IGV cram file to view the MD5 error in IGB by attempting to load the file.
            Hide
            kgopu Kaushik Gopu added a comment -

            I came across an explanation in the IGV help group for why the complete sequence is necessary for CRAM. This is because the htsjdk software needs to download the entire sequence before decoding the records. The performance would be much better when a reference file is loaded first, and then the CRAM file is loaded.

            Show
            kgopu Kaushik Gopu added a comment - I came across an explanation in the IGV help group for why the complete sequence is necessary for CRAM. This is because the htsjdk software needs to download the entire sequence before decoding the records. The performance would be much better when a reference file is loaded first, and then the CRAM file is loaded.
            Hide
            kgopu Kaushik Gopu added a comment -

            Notes on discussion with Nowlan Freese:

            • HTSJDK downloads the entire sequence before decoding the records. This sequence is loaded once and will be available for the entire session.
            • The HTSJDK team does not yet implement a loading sequence based on the chromosome range.
            • Tested loading strategy for CRAM in IGV. Since IGV uses the HTSJDK library, the complete sequence is loaded for CRAM.
            Show
            kgopu Kaushik Gopu added a comment - Notes on discussion with Nowlan Freese : HTSJDK downloads the entire sequence before decoding the records. This sequence is loaded once and will be available for the entire session. The HTSJDK team does not yet implement a loading sequence based on the chromosome range. Tested loading strategy for CRAM in IGV. Since IGV uses the HTSJDK library, the complete sequence is loaded for CRAM.
            Hide
            nfreese Nowlan Freese added a comment - - edited

            Testing on Mac:

            • Tested loading various cram files locally for the A thaliana genome (see Google Drive).
            • Tested loading a cram file via URL.
            • Tested loading an incompatible cram file and seeing the MD5 error (A_thaliana_Jun_2009_Chr1_3regions_IGV-works.cram).

            Everything appears to be working as expected. The sequence is being loaded for the entire length of data on the viewed chromosome, regardless of how zoomed in the user is (this is what we decided would work best for now). Caching the genome appears to be working correctly.

            Show
            nfreese Nowlan Freese added a comment - - edited Testing on Mac: Tested loading various cram files locally for the A thaliana genome (see Google Drive ). Tested loading a cram file via URL . Tested loading an incompatible cram file and seeing the MD5 error (A_thaliana_Jun_2009_Chr1_3regions_IGV-works.cram). Everything appears to be working as expected. The sequence is being loaded for the entire length of data on the viewed chromosome, regardless of how zoomed in the user is (this is what we decided would work best for now). Caching the genome appears to be working correctly.
            Hide
            kgopu Kaushik Gopu added a comment - - edited

            main-JDK8 - Pull Request has been submitted.

            Show
            kgopu Kaushik Gopu added a comment - - edited main-JDK8 - Pull Request has been submitted.
            Hide
            ann.loraine Ann Loraine added a comment - - edited

            PR is merged. Branch installers are built and ready for download and testing at:

            https://bitbucket.org/lorainelab/integrated-genome-browser/downloads/

            Note: Branch installer file names contain prefix "main-JDK8" as the merge target was the main-JDK8 branch.

            Show
            ann.loraine Ann Loraine added a comment - - edited PR is merged. Branch installers are built and ready for download and testing at: https://bitbucket.org/lorainelab/integrated-genome-browser/downloads/ Note: Branch installer file names contain prefix "main-JDK8" as the merge target was the main-JDK8 branch.
            Hide
            nfreese Nowlan Freese added a comment -

            Tested installer on Mac.
            Able to load cram files locally and via URL, and able to see the MD5 error when there is an issue with the cram.

            Closing ticket.

            Show
            nfreese Nowlan Freese added a comment - Tested installer on Mac. Able to load cram files locally and via URL, and able to see the MD5 error when there is an issue with the cram. Closing ticket.

              People

              • Assignee:
                kgopu Kaushik Gopu
                Reporter:
                nfreese Nowlan Freese
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: