  1. IGB
  2. IGBF-3330

Add mm39 (GRCm39) mouse genome to IGB

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Story Points:
      2
    • Sprint:
      Summer 2 2023 May 29, Summer 3 2023 June 12, Summer 4 2023 June 26

      Description

      Situation: A user has asked if we have the GRCm39 (mm39) mouse genome in IGB. We do not support it through Quickloads, but it is available through the UCSC DAS data source.

      Task: Add the GRCm39 (mm39) genome and annotation to IGB.

        Attachments

          Issue Links

            Activity

            nfreese Nowlan Freese created issue -
            nfreese Nowlan Freese made changes -
            Field Original Value New Value
            Epic Link IGBF-1765 [ 17855 ]
            nfreese Nowlan Freese made changes -
            Epic Link IGBF-1765 [ 17855 ] IGBF-1478 [ 17563 ]
            nfreese Nowlan Freese made changes -
            Link This issue relates to IGBF-753 [ IGBF-753 ]
            nfreese Nowlan Freese made changes -
            Epic Link IGBF-1478 [ 17563 ] IGBF-1395 [ 17470 ]
            nfreese Nowlan Freese made changes -
            Epic Link IGBF-1395 [ 17470 ] IGBF-1478 [ 17563 ]
            nfreese Nowlan Freese made changes -
            Link This issue relates to IGBF-2952 [ IGBF-2952 ]
            ann.loraine Ann Loraine added a comment -

            For this, add a new genome "setup" to our subversion (like git) data repository.

            Mdavis4290 Molly Davis made changes -
            Assignee Molly Davis [ molly ]
            Mdavis4290 Molly Davis made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            Mdavis4290 Molly Davis made changes -
            Status In Progress [ 3 ] To-Do [ 10305 ]
            ann.loraine Ann Loraine made changes -
            Sprint Summer 1 2023 May 15 [ 170 ]
            ann.loraine Ann Loraine made changes -
            Assignee Molly Davis [ molly ]
            ann.loraine Ann Loraine made changes -
            Epic Link IGBF-1478 [ 17563 ] IGBF-1765 [ 17855 ]
            ann.loraine Ann Loraine made changes -
            Sprint Summer 2 2023 May 29 [ 171 ]
            nfreese Nowlan Freese made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            nfreese Nowlan Freese made changes -
            Assignee Nowlan Freese [ nfreese ]
            nfreese Nowlan Freese added a comment - - edited

            Downloaded the Jun. 2020 GRCm39/mm39 mm39.2bit file from https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/

            I had to use ftp to get the gene2accession.gz file. There was some issue with running gunzip -c directly on the ftp file, so I just pulled the whole file. I was able to download the Mus_musculus.gene_info file with no issues, but it is small.

            ftp ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
            get gene2accession.gz

            NCBI:txid10090 for mouse:

            gunzip -c gene2accession.gz | grep '^10090\t' > ~/Desktop/jiraIssues/3330/10090.gene2accession.txt
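Not every grep treats `\t` inside a quoted pattern as a tab, so the filter above may behave differently between systems. A minimal sketch of a portable alternative, assuming the tax_id is the first tab-separated column; the sample rows, file names, and the `filter_taxid` helper are made up for illustration:

```shell
# Hypothetical sample standing in for gene2accession (tab-separated, tax_id in column 1).
tmp=$(mktemp -d)
printf '#tax_id\tGeneID\tstatus\n10090\t11287\tREVIEWED\n100901\t999\tMODEL\n9606\t1\tREVIEWED\n' > "$tmp/sample.txt"

# Exact match on column 1 keeps only rows for the requested tax_id and
# cannot match longer IDs that merely start with the same digits (100901).
filter_taxid() {
  awk -F'\t' -v id="$1" '$1 == id' "$2"
}

filter_taxid 10090 "$tmp/sample.txt" > "$tmp/10090.sample.txt"
cat "$tmp/10090.sample.txt"
```

For the real file, the same awk filter could be fed from `gunzip -c gene2accession.gz` instead of the sample.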

            I pulled the refGene and ncbiRefSeq files as bed files from the UCSC Table Browser.
            I then created bed14 files from the refGene and ncbiRefSeq files.

            ucscToBedDetail.py -a 10090.gene2accession.txt -g Mus_musculus.gene_info mm39_refGene.bed.gz M_musculus_Jun_2020_refGene.bed
            ucscToBedDetail.py -a 10090.gene2accession.txt -g Mus_musculus.gene_info mm39_ncbiRefSeq.bed.gz M_musculus_Jun_2020_ncbiRefSeq.bed

            I then sorted the two bed files and indexed them with tabix. tabix (version 1.17) reported many errors about coordinates being zero and asked whether I had forgotten the -0 flag for zero-based coordinates. I don't remember this being an issue before, but since bed files are 0-based I added the flag.

            sort -k1,1 -k2,2n M_musculus_Jun_2020_refGene.bed | bgzip > M_musculus_Jun_2020_refGene.bed.gz
            tabix -0 -s 1 -b 2 -e 3 M_musculus_Jun_2020_refGene.bed.gz
            
            sort -k1,1 -k2,2n M_musculus_Jun_2020_ncbiRefSeq.bed | bgzip > M_musculus_Jun_2020_ncbiRefSeq.bed.gz
            tabix -0 -s 1 -b 2 -e 3 M_musculus_Jun_2020_ncbiRefSeq.bed.gz
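Since tabix expects position-sorted input and bed starts can legitimately be 0, a quick sanity check before compressing can catch problems early. This is only a sketch: `check_bed` and the sample rows are hypothetical, and it checks the same sort keys used above.

```shell
# Tiny made-up bed file; a start of 0 is valid because bed is 0-based.
tmp=$(mktemp -d)
printf 'chr1\t0\t100\tgeneA\nchr1\t50\t200\tgeneB\nchr2\t10\t20\tgeneC\n' > "$tmp/sample.bed"

check_bed() {
  # Fail if any row has a negative start or an end smaller than its start.
  awk -F'\t' '$2 < 0 || $3 < $2 { bad = 1 } END { exit bad + 0 }' "$1" || return 1
  # sort -c exits non-zero when the file is not already in the given order.
  sort -c -k1,1 -k2,2n "$1"
}

check_bed "$tmp/sample.bed" && echo "sample.bed looks ready for bgzip and tabix"
```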

            I pulled the all_mrna and all_est files as psl files by setting the output format to "all fields from selected table". Once downloaded, I used the following code:

            gunzip -c M_musculus_Jun_2020_all_est.psl.gz | grep -v bin | cut -f2- > M_musculus_Jun_2020_all_est.psl
            sort -k14,14 -k16,16n M_musculus_Jun_2020_all_est.psl > sorted.psl
            mv sorted.psl M_musculus_Jun_2020_all_est.psl
            bgzip M_musculus_Jun_2020_all_est.psl
            tabix -s 14 -b 16 -0 M_musculus_Jun_2020_all_est.psl.gz
            
            gunzip -c M_musculus_Jun_2020_all_mrna.psl.gz | grep -v bin | cut -f2- > M_musculus_Jun_2020_all_mrna.psl
            sort -k14,14 -k16,16n M_musculus_Jun_2020_all_mrna.psl > sorted.psl
            mv sorted.psl M_musculus_Jun_2020_all_mrna.psl
            bgzip M_musculus_Jun_2020_all_mrna.psl
            tabix -s 14 -b 16 -0 M_musculus_Jun_2020_all_mrna.psl.gz
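The EST and mRNA steps are the same apart from the file name, so they could be wrapped in one helper. A rough sketch, assuming the Table Browser header line starts with `#` (so `grep -v '^#'` drops only the header, whereas `grep -v bin` would also drop any record containing "bin"). `prep_psl`, `row`, and the sample values are hypothetical, and bgzip/tabix are left out so the sketch runs with standard tools only:

```shell
# prep_psl IN.psl.gz OUT.psl: drop header lines, remove the leading "bin"
# column, then sort by tName (column 14) and tStart (column 16).
prep_psl() {
  gunzip -c "$1" | grep -v '^#' | cut -f2- | sort -k14,14 -k16,16n > "$2"
}

# row TNAME TSTART TEND: print one made-up table row (bin + 21 psl columns).
row() {
  printf '585\t50\t0\t0\t0\t0\t0\t0\t0\t+\tq1\t50\t0\t50\t%s\t1000\t%s\t%s\t1\t50,\t0,\t%s,\n' "$1" "$2" "$3" "$2"
}

tmp=$(mktemp -d)
{ printf '#bin\tmatches\tmisMatches\n'; row chr2 300 350; row chr1 900 950; row chr1 100 150; } | gzip > "$tmp/sample.psl.gz"

prep_psl "$tmp/sample.psl.gz" "$tmp/sample.psl"
cut -f14,16 "$tmp/sample.psl"   # tName and tStart, now in sorted order
```

After this, bgzip and tabix -s 14 -b 16 -0 would be run on the output as in the commands above.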
            
            nfreese Nowlan Freese added a comment - - edited

            I have placed the M_musculus_Jun_2020 (mm39) quickload in CyVerse for testing.

            To test:
            In IGB,

            1. Add https://data.cyverse.org/dav-anon/iplant/home/nowlanf/Mouse_2020/quickload as a new Data Source in IGB (IntegratedGenomeBrowser > Settings > Data Sources > Add...).
            2. Select the Mus musculus Species and the M_musculus_Jun_2020 Genome Version.
            3. The RefGene track should appear and should automatically load data.
            4. Navigate to: chr1:36,405,677-36,408,789
            5. Click Load Sequence.
            6. Residues (ATCG) should load (may take a little while, CyVerse can be slow).
            7. Check that there are no errors in the log.
            8. Under Available Data in the Data Access tab, click the checkboxes for mRNA, EST, and NCBI RefSeq.
            9. Click Load Data.
            10. Gene annotations for each track should load.
            11. Check that there are no errors in the log.
            12. Check that the UCSC (DAS) folder appears under Available Data in the Data Access tab.
            13. Click the checkbox for ncbiRefSeqCurated.
            14. Click Load Data.
            15. Check that there are no errors in the log.
            nfreese Nowlan Freese made changes -
            Link This issue relates to IGBF-1617 [ IGBF-1617 ]
            nfreese Nowlan Freese added a comment - - edited

            Other genomes that should be updated based on UCSC:
            rat - IGBF-3362
            chicken - IGBF-3361

            nfreese Nowlan Freese made changes -
            Link This issue relates to IGBF-3361 [ IGBF-3361 ]
            nfreese Nowlan Freese made changes -
            Link This issue relates to IGBF-3362 [ IGBF-3362 ]
            nfreese Nowlan Freese added a comment - - edited

            I noticed that the psl files (EST and mRNA) show an off-by-one discrepancy in IGB. For positive-strand alignments, the listed end position is +1 relative to where the annotation is drawn. For example, if an mRNA on the positive strand is listed as starting at 100 and ending at 150, IGB draws the annotation over bases 100 - 149. For negative-strand alignments, the start reported by IGB is +1 relative to where the annotation is drawn. This indicates that the tEnd column (https://genome.ucsc.edu/FAQ/FAQformat.html#format2) is effectively +1 compared to what is drawn in IGB. Note that the annotations appear to be drawn in the correct location, i.e. intron/exon boundaries line up correctly. The only issue is that the start/end reported by IGB is off by one, depending on strand.
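For reference, psl tStart/tEnd are 0-based with an exclusive end, which gives the arithmetic below for the 100/150 example. This only illustrates the coordinate convention; whether it accounts for what IGB reports is not established here.

```shell
tStart=100   # 0-based, inclusive (example values from the comment above)
tEnd=150     # exclusive: the first base after the alignment

length=$((tEnd - tStart))      # number of target bases covered
last_base=$((tEnd - 1))        # last covered base, 0-based
first_1based=$((tStart + 1))   # the same start in 1-based coordinates

echo "covers 0-based bases ${tStart}..${last_base} (${length} bases); 1-based ${first_1based}..${tEnd}"
```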

            This issue is present in older psl files (M_musculus_Dec_2011) and on older versions of IGB (tested 9.0.2).

            As this issue is not new or specific to this ticket, I have created IGBF-3363 to address it.

            nfreese Nowlan Freese made changes -
            Link This issue relates to IGBF-3363 [ IGBF-3363 ]
            nfreese Nowlan Freese made changes -
            Assignee Nowlan Freese [ nfreese ]
            nfreese Nowlan Freese made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            ann.loraine Ann Loraine made changes -
            Sprint Summer 2 2023 May 29 [ 171 ] Summer 2 2023 May 29, Summer 3 2023 June 12 [ 171, 172 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            Mdavis4290 Molly Davis made changes -
            Assignee Molly Davis [ molly ]
            ann.loraine Ann Loraine made changes -
            Sprint Summer 2 2023 May 29, Summer 3 2023 June 12 [ 171, 172 ] Summer 2 2023 May 29, Summer 3 2023 June 12, Summer 4 2023 June 26 [ 171, 172, 173 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            Mdavis4290 Molly Davis made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            Mdavis4290 Molly Davis made changes -
            Attachment IGB_Mus-musculus.pdf [ 17908 ]
            Mdavis4290 Molly Davis added a comment - - edited

            Review: I was able to get through each step of the instructions and ran into no errors. The following is an image of my screen after finishing.

            IGB_Mus-musculus.pdf

            If the image is correct, then the ticket is ready to move forward!

            Mdavis4290 Molly Davis made changes -
            Assignee Molly Davis [ molly ] Nowlan Freese [ nfreese ]
            Mdavis4290 Molly Davis made changes -
            Status First Level Review in Progress [ 10301 ] Needs 1st Level Review [ 10005 ]
            nfreese Nowlan Freese made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            nfreese Nowlan Freese made changes -
            Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
            nfreese Nowlan Freese added a comment - - edited

            Notes from pushing changes to subversion repository: https://svn.bioviz.org/viewvc/genomes/quickload/M_musculus_Jun_2020/

            Comment for subversion update
            IGBF-3330: Add Mus musculus Jun 2020 genome;Jun 29, 2023

            Need to update the following files:
            contents.txt -> M_musculus_Jun_2020 Mus musculus (Jun 2020) mouse (GRCm39/mm39)
            .htaccess -> AddDescription "Mus musculus (Jun 2020) mouse (GRCm39/mm39)" M_musculus_Jun_2020

            Basic steps for checking in files and adding the new mouse folder with files:

            1. Modify the contents.txt and .htaccess files
            2. Drag the mouse folder into the quickload folder
              then:
              svn status
              svn ci -m "IGBF-3330: Add Mus musculus Jun 2020 genome;Jun 29, 2023" quickload/.htaccess
              svn ci -m "IGBF-3330: Add Mus musculus Jun 2020 genome;Jun 29, 2023" quickload/contents.txt
              svn add quickload/M_musculus_Jun_2020
              svn ci -m "IGBF-3330: Add Mus musculus Jun 2020 genome;Jun 29, 2023"
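The contents.txt edit in step 1 could also be scripted so that re-running it does not add a duplicate line. A minimal sketch, assuming contents.txt is tab-separated with the genome directory in the first column (that separator is an assumption, as are the `add_genome_entry` helper and the placeholder description for the existing entry):

```shell
# add_genome_entry CONTENTS_FILE GENOME_DIR DESCRIPTION
# Appends "GENOME_DIR<TAB>DESCRIPTION" unless GENOME_DIR is already listed.
add_genome_entry() {
  if awk -F'\t' -v g="$2" '$1 == g { found = 1 } END { exit !found }' "$1"; then
    return 0   # already listed; leave the file untouched
  fi
  printf '%s\t%s\n' "$2" "$3" >> "$1"
}

# Scratch copy with one pre-existing entry (placeholder description).
tmp=$(mktemp -d)
printf 'M_musculus_Dec_2011\tplaceholder description\n' > "$tmp/contents.txt"

add_genome_entry "$tmp/contents.txt" M_musculus_Jun_2020 "Mus musculus (Jun 2020) mouse (GRCm39/mm39)"
add_genome_entry "$tmp/contents.txt" M_musculus_Jun_2020 "Mus musculus (Jun 2020) mouse (GRCm39/mm39)"   # no-op
cat "$tmp/contents.txt"
```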
              
            nfreese Nowlan Freese added a comment -

            New mouse genome has been added to subversion repository.

            Next step, deploy to quickload site.

            nfreese Nowlan Freese made changes -
            Assignee Nowlan Freese [ nfreese ]
            nfreese Nowlan Freese made changes -
            Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
            nfreese Nowlan Freese made changes -
            Assignee Ann Loraine [ aloraine ]
            ann.loraine Ann Loraine made changes -
            Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
            ann.loraine Ann Loraine made changes -
            Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
            ann.loraine Ann Loraine added a comment - - edited

            Data are deployed. For details on where the data are deployed, and how to test, see
            linked ticket titled "Add rn7 rat genome to IGB".

            ann.loraine Ann Loraine made changes -
            Assignee Ann Loraine [ aloraine ] Nowlan Freese [ nfreese ]
            nfreese Nowlan Freese added a comment -

            Tested on Mac on IGB release (9.1.10).
            Able to load all data (RefGene, NCBI RefSeq, mRNA, EST, sequence file) on scidas and igbquickload.org.

            Closing ticket.

            nfreese Nowlan Freese made changes -
            Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
            nfreese Nowlan Freese made changes -
            Resolution Done [ 10000 ]
            Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]

              People

              • Assignee:
                nfreese Nowlan Freese
                Reporter:
                nfreese Nowlan Freese