IGBF-2952 (project: IGB)

Add UCSC dm6 fruit fly genome and annotation to IGB

    Details

    • Story Points:
      1.5
    • Sprint:
      Fall 3 2021 Sep 13 - Sep 24, Fall 6 2021 Oct 25 - Nov 5, Fall 7 2021 Nov 8 - Nov 24, Fall 8 2021 Nov 29 - Dec 10

      Description

      Situation: A discrepancy was identified between the currently released version of the dm6 genome in IGB and the genome available through UCSC. This may confuse users who attempt to load data using UCSC DAS.

      Task: Add the UCSC dm6 genome and annotation to IGB.

        Attachments

        1. screenshot-1.png (151 kB)
        2. screenshot-2.png (144 kB)
        3. screenshot-3.png (131 kB)
        4. screenshot-4.png (106 kB)

            Activity

            nfreese Nowlan Freese added a comment -

            Once this issue is complete, the dm6 blacklist can be pushed to the SVN repo.

            nfreese Nowlan Freese added a comment -

            This ticket is still in progress.

            A couple of notes:
            The UCSC RefSeq (refGene) annotation uses RefSeq transcript identifiers (e.g., NM_132191) instead of FlyBase transcript identifiers (e.g., FBtr0071134). This may be an issue, as many researchers may rely on the FlyBase transcript identifiers as well as the FlyBase gene IDs (e.g., FBgn0029974). As a side note, the FlyBase gene ID is the same for every transcript of a gene, but each transcript has its own unique transcript ID (in contrast to Arabidopsis, where the transcripts of a gene share the same base ID and are distinguished by the suffixes .1, .2, etc.).

            It would be nice to add the FlyBase transcript and gene identifiers to the 14th column of the UCSC RefSeq (refGene) BED file. However, I have been unable to find a tool that takes a list of RefSeq identifiers and outputs the FlyBase transcript and gene identifiers. FlyBase has a tool to convert between various IDs, as well as an ID validator, but it only outputs the gene identifiers, and many of them are missing.

            The NCBI annotation for dm6 includes the RefSeq ID and the FlyBase transcript and gene identifiers, so I may need to write something to pull out those values and map them to the UCSC RefSeq (refGene) annotation.

            nfreese Nowlan Freese added a comment - - edited

            I have placed the DM6 quickload in CyVerse for testing.

            To test:
            In IGB,
            Add https://data.cyverse.org/dav-anon/iplant/home/nfreese/2952_testing/quickload as a new Data Source.
            Select the Drosophila melanogaster Species and the D_melanogaster_Aug_2014 Genome Version.
            The RefGene track should appear and should automatically load data.
            Navigate to: chrX:6,206,883-6,206,926
            Click Load Sequence.
            Residues (ATCG) should load (may take a little while, CyVerse can be slow).
            Check that there are no errors in the log.
            Enable the UCSC DAS data provider as a Data Source.
            The UCSC (DAS) folder should appear under Available Data in the Data Access tab.
            Select the ensGene.
            Click Load Data.
            Check that there are no errors in the log.

            nfreese Nowlan Freese added a comment - - edited

            Some notes on creating the D_melanogaster_Aug_2014_refGene.bed.gz file.

            The bed12 file is from UCSC Table Browser
            assembly: Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6)
            track: NCBI RefSeq
            table: UCSC RefSeq (refGene)

            I downloaded the updated gene2accession and gene_info files from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ and selected the rows for taxid 7227 (Drosophila melanogaster).
            I then ran ucscToBedDetail.py to create the bed14 file.

            To include the FlyBase transcript and gene IDs, I downloaded the NCBI annotation for dm6 (GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gff.gz). This file contains the NM/NR NCBI RefSeq transcript accessions and their corresponding FlyBase gene and transcript IDs.

            I then grepped the decompressed file for lines where the feature type was one of: mRNA, lnc_RNA, snRNA, miRNA, antisense_RNA, ncRNA, rRNA, snoRNA, RNase_MRP_RNA, RNase_P_RNA, SRP_RNA, primary_transcript. These features contain the NM/NR RefSeq identifier and the FlyBase gene and transcript IDs.

            For example:

            grep "$(printf '\tSRP_RNA\t')" GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gff >> output.txt

            I then removed any lines that had no "Name=" field. This resulted in 34,450 lines, which is the same as the number of unique NM/NR accessions in the UCSC RefSeq (refGene) file.
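The two filtering steps above (feature type, then presence of "Name=") could also be done in one pass. The following is only a sketch, not the script that was actually used: it assumes standard nine-column, tab-separated GFF3 lines, and the function name filter_transcript_lines is hypothetical.

```python
# Sketch only: keep transcript-level GFF3 rows that carry a Name= attribute.
# Assumes tab-separated GFF3 with the feature type in column 3 and the
# attributes in column 9. The function name is illustrative.

FEATURE_TYPES = {
    "mRNA", "lnc_RNA", "snRNA", "miRNA", "antisense_RNA", "ncRNA", "rRNA",
    "snoRNA", "RNase_MRP_RNA", "RNase_P_RNA", "SRP_RNA", "primary_transcript",
}


def filter_transcript_lines(lines):
    """Yield lines whose type is in FEATURE_TYPES and that have a Name= field."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and directive lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature row
        if fields[2] not in FEATURE_TYPES:
            continue
        attributes = fields[8].split(";")
        if any(attr.startswith("Name=") for attr in attributes):
            yield line
```

To read the compressed download directly, the lines could come from gzip.open(path, "rt") in the Python standard library.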

            Next, I extracted just the NM/NR values and their corresponding FlyBase gene and transcript identifiers.
            The resulting file looks like this:

            NM_001103384	FBgn0025837	FBtr0112921
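A rough sketch of how one such row might be pulled out of a GFF3 line is shown below. It rests on assumptions that are not confirmed by the notes above: that the RefSeq accession is in the Name= attribute (possibly with a version suffix such as ".3", which is dropped here to match the unversioned example), and that the FBgn and FBtr identifiers appear somewhere in the attribute column. The function name extract_ids is hypothetical.

```python
import re

# Sketch only. Assumed attribute layout: Name=<RefSeq accession>, with the
# FlyBase IDs present somewhere in column 9 (for example in Dbxref).
NAME = re.compile(r"(?:^|;)Name=([^;]+)")
FBGN = re.compile(r"FBgn\d+")
FBTR = re.compile(r"FBtr\d+")


def extract_ids(gff_line):
    """Return (refseq_id, fbgn, fbtr) for one GFF3 line, or None if any is missing."""
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None
    attributes = fields[8]
    name = NAME.search(attributes)
    gene = FBGN.search(attributes)
    transcript = FBTR.search(attributes)
    if not (name and gene and transcript):
        return None
    # Assumption: drop any version suffix so the key matches the UCSC name column.
    refseq_id = name.group(1).split(".")[0]
    return refseq_id, gene.group(0), transcript.group(0)
```

Joining the returned tuple with tabs gives a line in the three-column layout shown above.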

            The NM/NR code was then used as a key to sort the data and add the FlyBase gene and transcript identifiers to the 14th column of the D_melanogaster_Aug_2014_refGene.bed.gz file.
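The join itself could be sketched as follows. This is not the procedure actually used: it swaps the sort-based approach for a dictionary lookup, and it assumes that the RefSeq accession is in the BED name column (column 4) and that the FlyBase IDs are appended to the existing 14th-column text with a space separator. The separator and the function name add_flybase_ids are illustrative guesses.

```python
# Sketch only: append FlyBase IDs to column 14 of bed14 lines using a
# dictionary lookup on the name column. The separator is an assumption.

def add_flybase_ids(bed_lines, id_map, sep=" "):
    """Yield bed14 lines with FlyBase gene/transcript IDs added to column 14.

    id_map maps a RefSeq accession (e.g. "NM_001103384") to (FBgn, FBtr).
    Lines with fewer than 14 columns or no mapping are passed through unchanged.
    """
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 14 and fields[3] in id_map:
            fbgn, fbtr = id_map[fields[3]]
            extra = f"{fbgn}{sep}{fbtr}"
            # Keep any existing column-14 text and append the FlyBase IDs.
            fields[13] = f"{fields[13]}{sep}{extra}" if fields[13] else extra
        yield "\t".join(fields)
```

A dictionary lookup avoids having to sort both files on the same key, at the cost of holding the whole mapping in memory, which is small for roughly 34,000 entries.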

            omarne Omkar Marne (Inactive) added a comment - - edited

            Test results:

            • I added https://data.cyverse.org/dav-anon/iplant/home/nfreese/2952_testing/quickload as a new Data Source and named it DM6 quickload.
            • The RefGene track appeared after selecting the Drosophila melanogaster Species and the D_melanogaster_Aug_2014 Genome Version.
            • Residues appear after loading sequence. There are no errors in the log.
            • After enabling the UCSC DAS provider as a Data Source and selecting ensGene, the residues appear and there are no errors in the log.

            Please see screenshots 1, 2, and 3.

            nfreese Nowlan Freese added a comment - - edited

            I have committed the files to the subversion repository: https://svn.bioviz.org/viewvc/genomes/quickload/D_melanogaster_Aug_2014/

            ann.loraine Ann Loraine added a comment -

            Loraine needs to deploy these to the various sites.

            ann.loraine Ann Loraine added a comment -

            Deployed to:

            • Quickload hosted at UNC Charlotte: http://igbquickload.org/quickload/
            • Quickload hosted at RENCI: http://lorainelab-quickload.scidas.org/quickload
            • Quickload hosted on Amazon EC2: http://quickload.bioviz.org/quickload/

            and ready for testing.

            attn: Nowlan Freese, Omkar Marne

            nfreese Nowlan Freese added a comment -

            To test:
            In IGB,

            1. Disable all Data Sources
            2. Add one of the above data sources
            3. Select the Drosophila melanogaster Species and the D_melanogaster_Aug_2014 Genome Version.
            4. The RefGene track should appear and should automatically load data.
            5. Navigate to: chrX:6,206,883-6,206,926
            6. Click Load Sequence.
            7. Residues (ATCG) should load.
            8. Check that there are no errors in the log.
            9. Select the ENCODE blacklist under Available Data.
            10. Navigate to: chr2R:737,475-885,425
            11. Click Load Data.
            12. Data should load in view (should appear as a black bar titled Low Mappability).
            13. Check that there are no errors in the log.
            14. Enable the UCSC DAS data provider as a Data Source.
            15. The UCSC (DAS) folder should appear under Available Data in the Data Access tab.
            16. Select the ensGene.
            17. Click Load Data.
            18. Check that there are no errors in the log.

            Repeat the above steps for all three Quickload hosts.

            omarne Omkar Marne (Inactive) added a comment - - edited

            I added the Quickload hosted on Amazon EC2 (http://quickload.bioviz.org/quickload/) as a data source and followed the steps above. All the steps performed as expected.

            I couldn't add the data sources below because they were already present in the Data Sources tab:

            • Quickload hosted at UNC Charlotte: http://igbquickload.org/quickload/
            • Quickload hosted at RENCI: http://lorainelab-quickload.scidas.org/quickload

            Please refer to screenshot 4.

            ann.loraine Ann Loraine added a comment -

            Omkar Marne - if you disable all data sources (step 1) you should be able to add all three, even though they are already in IGB. Please try again.

            omarne Omkar Marne (Inactive) added a comment -

            I disabled all the data sources and added the three above. I am getting the expected output.

            Closing the ticket.


              People

              • Assignee:
                nfreese Nowlan Freese
                Reporter:
                nfreese Nowlan Freese
              • Votes:
                0
                Watchers:
                3
