  1. IGB
  2. IGBF-1189

Add S. Cerevisiae June 2008 genome to IGB Quickload - User request

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
    • Story Points:
      0.5
    • Sprint:
      Fall 2018 1, Fall 2018 Sprint 2, Fall 2018 Sprint 3

      Description

      A user from HELP-279 has contacted us about the June 2008 S. Cerevisiae genome missing from IGB Quickload.

      To aid in the task of adding this genome, I have attached the following:

      1) 2bit reference genome, obtained here:
      http://hgdownload.soe.ucsc.edu/goldenPath/sacCer2/bigZips/

      2) Reference Annotations in BED format, obtained from the UCSC Table Browser using the "SGD Gene" track

        Attachments

          Issue Links

            Activity

            mason Mason Meyer (Inactive) created issue -
            mason Mason Meyer (Inactive) made changes -
            Field Original Value New Value
            Link This issue relates to HELP-279 [ HELP-279 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Rank Ranked lower
            ann.loraine Ann Loraine made changes -
            Rank Ranked lower
            ann.loraine Ann Loraine made changes -
            Sprint Early Fall 2017 [ 47 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Assignee Ann Loraine [ aloraine ]
            ann.loraine Ann Loraine made changes -
            Summary Add S. Cerevisiae June 2008 genome to IGB Quickload Add S. Cerevisiae June 2008 genome to IGB Quickload - User request
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Labels Beginner
            ann.loraine Ann Loraine added a comment -

            Create and test Quickload containing just these data and notify Dr. Loraine when it is ready to deploy.
            Ensure appropriate synonyms are added to synonyms.txt and species.txt bundled with IGB.

            ann.loraine Ann Loraine made changes -
            Story Points 0.25
            ann.loraine Ann Loraine made changes -
            Fix Version/s 9.0.2 Minor Release [ 10600 ]
            Story Points 0.25 0.5
            Labels Beginner Intermediate
            Assignee Ivory Blakley [ ieclabau ]
            ann.loraine Ann Loraine made changes -
            Sprint Fall 2018 1 [ 51 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked lower
            ieclabau Ivory Blakley (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            ieclabau Ivory Blakley (Inactive) added a comment -

            I used the files attached here.
            I changed the name of the .2bit file to match the folder. The name of the bed file can stay as-is.

            In order to create an index for the bed file I had to unzip it and sort it.
            sort -k 1,1 -k2,2n S_Cerevisiae_Jun2008_sgdgenes.bed > S_Cerevisiae_Jun2008_sgdgenes.sorted.bed
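
For reference, the same ordering can be sketched in Python. This is a minimal illustration, not the command that was run: the function name `sort_bed_lines` is hypothetical, and it assumes a tab-delimited BED file whose second column is an integer start position. Indexing tools generally expect this chromosome-then-start order. Python compares chromosome names by code point, which can differ from `sort` under some locales, and ties keep their input order here.

```python
def sort_bed_lines(lines):
    """Return BED lines ordered by chromosome name, then numeric start."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        records.append((fields[0], int(fields[1]), line))
    # Sort on (chromosome, start); the sort is stable, so ties keep input order.
    records.sort(key=lambda record: (record[0], record[1]))
    return [record[2] for record in records]


if __name__ == "__main__":
    import sys

    # Usage: python sort_bed.py < input.bed > input.sorted.bed
    for sorted_line in sort_bed_lines(sys.stdin):
        print(sorted_line)
```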

            I made the genome.txt file as suggested in the User's Guide:
            twoBitInfo S_cerevisiae_Jun_2008.2bit genome.txt

            The genome.txt file and the bed file contain the same set of chromosome names. Both of the commands below produce the list that follows:
            $ gunzip -c S_Cerevisiae_Jun2008_sgdgenes.bed.gz | cut -f 1 | uniq | sort
            $ cat genome.txt | cut -f 1 | sort

            2micron
            chrI
            chrII
            chrIII
            chrIV
            chrIX
            chrM
            chrV
            chrVI
            chrVII
            chrVIII
            chrX
            chrXI
            chrXII
            chrXIII
            chrXIV
            chrXV
            chrXVI

            I don't know what the 2micron sequence is supposed to be.
            Since the annotations and the genome use the same chromosome names, I don't need to add a chromosomes.txt file.
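
The comparison above was done by eye on two command outputs. A small Python sketch of the same check follows, assuming both files are tab-delimited with the sequence name in the first column. The function names are hypothetical and are not part of any IGB tooling.

```python
def first_column_names(lines):
    """Collect the distinct values found in the first tab-delimited column."""
    names = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line.split("\t")[0])
    return names


def compare_chromosomes(genome_lines, bed_lines):
    """Return (names only in the BED file, names only in genome.txt)."""
    genome = first_column_names(genome_lines)
    bed = first_column_names(bed_lines)
    return sorted(bed - genome), sorted(genome - bed)
```

If both returned lists are empty, the annotations and the genome use the same names, which is the condition under which no chromosomes.txt file was needed here.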

            ieclabau Ivory Blakley (Inactive) added a comment -

            synonyms.txt

            To find other terms that might be used for this genome version, I looked at the UCSC ftp downloads site.

            In this document, I found the following terms for the 2008 genome: sacCer2, SGD June 2008
            ftp://hgdownload.soe.ucsc.edu/goldenPath/sacCer2/chromosomes/README.txt

            In this document, I found that the 2011 version (already supported in IGB) should have these synonyms: SacCer_Apr2011, sacCer3
            ftp://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/chromosomes/README.txt

            One of the most likely times to need genome synonyms is when loading data from Galaxy.
            I went to useGalaxy.org and selected NGS: RNA Analysis > TopHat.
            I chose to use a built-in genome and entered "Cerevisiae" to search within the pulldown menu for genomes. The options are:
            Yeast (Saccharomyces cerevisiae): sacCer2
            Yeast (Saccharomyces cerevisiae): sacCer3

            So I think I have all the synonyms I need.
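
As an illustration only, the collected names could be assembled into one synonyms.txt entry like this. The sketch assumes that synonyms.txt holds one genome per line with tab-separated names; that layout is an assumption and should be checked against the Quickload instructions. The helper name `synonyms_line` is hypothetical.

```python
def synonyms_line(igb_name, synonyms):
    """Join a genome name and its synonyms into one tab-delimited line.

    Assumption: one genome per line, names separated by tabs.
    Duplicate names are dropped while keeping first-seen order.
    """
    seen = []
    for name in [igb_name, *synonyms]:
        if name not in seen:
            seen.append(name)
    return "\t".join(seen)


if __name__ == "__main__":
    # Names taken from the findings above.
    print(synonyms_line("S_cerevisiae_Jun_2008", ["sacCer2", "SGD June 2008"]))
```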
            _________________________

            species.txt

            I don't know when we need this. It's in the instructions for creating a Quickload site, so I made it. I supplied sacCer as an alternative genome prefix.

            ieclabau Ivory Blakley (Inactive) added a comment - - edited

            HEADER.md

            I stole the HEADER.md file from the S_cerevisiae_Apr_2011 directory in the SVN repo.
            Modified it to reflect the current genome version and data sources.

            _______________

            annots.xml

            I stole the annots.xml from the S_cerevisiae_Apr_2011 directory in the igbquickload site.
            Modified it to use the bed file name for this genome.
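
A minimal sketch of what producing such a file programmatically might look like, using Python's standard `xml.etree.ElementTree`. The element and attribute names (`files`, `file`, `name`, `title`, `description`, `url`) are assumptions about the Quickload annots.xml layout, and the title and description strings are placeholders. The real values should be copied from an existing annots.xml, as was done here.

```python
import xml.etree.ElementTree as ET


def build_annots_xml(file_name, title, description, url=None):
    """Build a one-entry annots.xml document as a string.

    Assumption: a <files> root holding one <file> element per data set.
    """
    root = ET.Element("files")
    attrs = {"name": file_name, "title": title, "description": description}
    if url is not None:
        attrs["url"] = url  # optional link to more information
    ET.SubElement(root, "file", attrs)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder title and description, for illustration only.
    print(build_annots_xml(
        "S_Cerevisiae_Jun2008_sgdgenes.sorted.bed.gz",
        "SGD Genes",
        "Gene models from the UCSC SGD Gene track",
    ))
```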

            ieclabau Ivory Blakley (Inactive) added a comment -

            I set up an AWS EC2 instance and followed the instructions on this page to set up Apache.
            https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Tutorials.WebServerDB.CreateWebServer.html
            Section: Install an Apache Web Server with PHP
            Followed these steps, except in step 3, I just used "sudo yum install -y httpd" and ignored the rest.

            You can now access the Quickload site for the 2008 S. cerevisiae genome assembly here:
            http://18.222.191.240/Quickload_IGBF-1189_S.Cerevisiae/

            I tested this as a data provider in IGB and it looks great.
            I have not tested the species.txt or synonyms.txt file.
            The only thing that's not working is the HEADER.md file. Apache seems to be completely ignoring it and doesn't even list it as a file in the directory.

            ieclabau Ivory Blakley (Inactive) added a comment - - edited

            I opened the HEADER.md file in RStudio, made a HEADER.html file from it, and uploaded that to the EC2 instance.
            Now when I go to that location in my web browser I see the HTML page. But the overall style is very different from the main IGB Quickload site.

            I think this is enough for Ann to be able to test the files. There's no reason to spend time adjusting my web server settings, since the purpose of this site is to test the Quickload files. It might be nice to know what the right settings are (for general reference), but there's no reason to hold up this issue.
            Ann, for future reference, how could I set up my server to have the general style/appearance of the main IGBquickload site?
            _______________________

            I am assigning this to Ann to review.

            You can now access my Quickload site for the 2008 S. cerevisiae genome assembly here:
            http://18.222.191.240/Quickload_IGBF-1189_S.Cerevisiae/

            If no corrections are needed, then I think this is the point where I just hand it off to Ann completely.

            ieclabau Ivory Blakley (Inactive) made changes -
            Assignee Ivory Blakley [ ieclabau ] Ann Loraine [ aloraine ]
            ieclabau Ivory Blakley (Inactive) made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            ieclabau Ivory Blakley (Inactive) added a comment -

            Do let me know when you are done using the test site. I will shut down the EC2 instance.

            ann.loraine Ann Loraine made changes -
            Status Needs 1st Level Review [ 10005 ] Reviewing [ 10301 ]
            ann.loraine Ann Loraine added a comment -

            Add URL attribute to annots.xml; use relative link as in human genome annots.xml

            ann.loraine Ann Loraine added a comment -

            Gene models do not appear to have functional information. Please add description and gene name info into fields 13 and 14, similar to what we do for the human genome. The descriptive text for each gene should be available from one of the Entrez gene files. Search the IGB wiki for documentation describing how and where to get functional descriptions for a genome. The documentation is now a few years old and so some of the file names and locations may have changed. However, the basic idea should still hold, which is: we use the Entrez gene databases from NCBI to obtain succinct descriptions of genes and show them to users when they select gene models.

            ann.loraine Ann Loraine made changes -
            Status Reviewing [ 10301 ] Open [ 1 ]
            ann.loraine Ann Loraine made changes -
            Assignee Ann Loraine [ aloraine ] Ivory Blakley [ ieclabau ]
            ieclabau Ivory Blakley (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Sprint Fall 2018 1 [ 51 ] Fall 2018 1, Fall 2018 Sprint 2 [ 51, 52 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ieclabau Ivory Blakley (Inactive) added a comment -

            Added the URL attribute to annots.xml.

            I'm trying to find a file with the gene descriptions.

            This file:
            https://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff
            has gene symbols (which do not match our bed file), aliases (which do seem to match), and descriptions. The locations given in this file are based on genome version R64-2-1, which is sacCer3. Since we have a bed file for sacCer2, and the descriptions should be the same if the name is the same, we should be OK to use this. It would be nice to find annotations that were designed for the sacCer2 assembly, or a file that has matching names as the primary name (not buried in a list of aliases).
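
To make the alias idea concrete, here is a rough Python sketch that builds an alias-to-description lookup from GFF3 feature lines. It is not the approach finally used on this ticket. The attribute keys `Alias` and `Note` are assumptions about how this particular file stores aliases and descriptions, so they are exposed as parameters, and the function names are hypothetical.

```python
from urllib.parse import unquote


def parse_gff_attributes(column9):
    """Parse a GFF3 attribute column into {key: [decoded values]}."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 separates multiple values with commas and percent-encodes
            # reserved characters inside each value.
            attrs[key] = [unquote(v) for v in value.split(",")]
    return attrs


def descriptions_by_alias(gff_lines, alias_key="Alias", description_key="Note"):
    """Map each alias to the description of the feature that lists it."""
    lookup = {}
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more feature lines
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        attrs = parse_gff_attributes(fields[8])
        description = "; ".join(attrs.get(description_key, []))
        if not description:
            continue
        for alias in attrs.get(alias_key, []):
            lookup.setdefault(alias, description)  # keep the first match
    return lookup
```

The resulting dictionary could then be queried with the names from column 4 of the bed file to see how many of them are covered.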

            ieclabau Ivory Blakley (Inactive) added a comment - - edited

            The gene descriptions were added following the instructions here:
            https://wiki.transvar.org/display/igbdevelopers/Updating+RefGene+UCSC+data+set+for+an+existing+genome+in+IGB+QuickLoad
            With some modification.

            I encountered a problem with the ucscToBedDetail.py script from the GenomeSource repo:
            https://bitbucket.org/lorainelab/genomesource
            That issue has been resolved.

            The authoritative body for this genome is SGD. UCSC provides a bed table of SGD genes. The identifiers used in this bed file are locus IDs, which are available as a field in the gene_info file from NCBI. I modified the ucscToBedDetail.py script to filter by taxID and to use the locus ID rather than the gene ID to get the information from the gene_info table. I copied the modified script (ucscToBedDetail_SacCer_locusID.py) to the test Quickload site so Ann can include it as documentation if appropriate.

            Out of 6,717 models in the SGD genes bed file, 698 were not found in the gene_info list (so they have NA for the gene description).
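
The lookup described above can be sketched roughly as follows. This is not the modified ucscToBedDetail_SacCer_locusID.py script, which is not reproduced on this ticket. It assumes the usual tab-delimited gene_info layout (tax_id, GeneID, Symbol, LocusTag, ..., description in the ninth column), a 12-column bed file with the locus ID in column 4, and that the gene symbol goes in field 13 and the description in field 14. The function names and the use of NA for a missing symbol are illustrative choices.

```python
def load_gene_info(lines, tax_id):
    """Map LocusTag -> (Symbol, description) for one taxonomy ID."""
    lookup = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[0] != str(tax_id):
            continue  # other organisms are filtered out
        symbol, locus_tag, description = fields[2], fields[3], fields[8]
        if locus_tag != "-":  # "-" marks an empty field in gene_info
            lookup[locus_tag] = (symbol, description)
    return lookup


def add_details(bed_lines, lookup, missing="NA"):
    """Append symbol and description as fields 13 and 14 of each BED line.

    Returns the new lines and the number of models with no match.
    """
    out = []
    not_found = 0
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        locus_id = fields[3]  # BED name column
        if locus_id in lookup:
            symbol, description = lookup[locus_id]
        else:
            symbol, description = missing, missing
            not_found += 1
        out.append("\t".join(fields[:12] + [symbol, description]))
    return out, not_found
```

The second return value corresponds to the count of unmatched models reported above.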

            ieclabau Ivory Blakley (Inactive) added a comment -

            This is ready for review.

            Remember, you can access the Quickload site for the 2008 S. cerevisiae genome assembly here:
            http://18.222.191.240/Quickload_IGBF-1189_S.Cerevisiae/

            I am assigning this to Ann for review.
            Please see my last comment.
            If this passes review then Ann can take over.

            ieclabau Ivory Blakley (Inactive) made changes -
            Assignee Ivory Blakley [ ieclabau ] Ann Loraine [ aloraine ]
            ieclabau Ivory Blakley (Inactive) made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            ann.loraine Ann Loraine made changes -
            Status Needs 1st Level Review [ 10005 ] Reviewing [ 10301 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Sprint Fall 2018 1, Fall 2018 Sprint 2 [ 51, 52 ] Fall 2018 1, Fall 2018 Sprint 2, Fall 2018 Sprint 3 [ 51, 52, 53 ]
            ann.loraine Ann Loraine made changes -
            Fix Version/s 9.0.2 Minor Release [ 10600 ]
            ann.loraine Ann Loraine made changes -
            Status Reviewing [ 10301 ] Needs Testing [ 10002 ]
            ann.loraine Ann Loraine added a comment - - edited

            Added to new https://svn.bioviz.org/repos/genomes subversion repository.
            To browse, visit https://svn.bioviz.org/viewvc (visit on a non-UNCC network due to issues with internal DNS, to be fixed soon).
            However, did not add S_cerevisiae_Jun_2008_refGene.bed.gz because it is not mentioned in the annots.xml file.

            ann.loraine Ann Loraine made changes -
            Link This issue relates to IGBF-1414 [ IGBF-1414 ]
            ann.loraine Ann Loraine made changes -
            Assignee Ann Loraine [ aloraine ]
            ann.loraine Ann Loraine added a comment - - edited

            Deployed as part of current (newly migrated) quickload site.

            To test:

            • Test using master branch installer for 9.0.2
            • Inspect Species menu of Current Genome tab; check for new yeast genome related menu options (IGB has a bug whereby newly added genomes are shown incorrectly using their UCSC-style or other synonyms)
            • Select Current Genome tab, select Saccharomyces cerevisiae species (this is budding yeast)
            • Check that 2008 genome is listed in genome version menu
            • Select 2008 genome
            • Make sure that reference gene model data set is loaded upon visiting the genome
            • Make sure that sequence data can be loaded
            • Make sure that the menu item in Current Genome lists the species and genome correctly, with correct tooltips

            Report any problems here and if there are problems, return to To-Do column.

            ann.loraine Ann Loraine made changes -
            Status Needs Testing [ 10002 ] Needs Testing [ 10002 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Assignee Sneha Ramesh Watharkar [ jdaly ]
            ann.loraine Ann Loraine made changes -
            Assignee Sneha Ramesh Watharkar [ jdaly ] Pranav Sanjay Tambvekar [ ptambvek ]
            ptambvek Pranav Sanjay Tambvekar (Inactive) made changes -
            Status Needs Testing [ 10002 ] Testing In Progress [ 10003 ]
            ptambvek Pranav Sanjay Tambvekar (Inactive) added a comment -

            Tested as described, the application behaves as expected.

            ptambvek Pranav Sanjay Tambvekar (Inactive) made changes -
            Resolution Done [ 10000 ]
            Status Testing In Progress [ 10003 ] Closed [ 6 ]
            ptambvek Pranav Sanjay Tambvekar (Inactive) made changes -
            Assignee Pranav Sanjay Tambvekar [ ptambvek ] Mason Meyer [ mason ]
            ann.loraine Ann Loraine made changes -
            Assignee Mason Meyer [ mason ] Ann Loraine [ aloraine ]
            ann.loraine Ann Loraine made changes -
            Assignee Ann Loraine [ aloraine ] Ivory Blakley [ ieclabau ]
            ann.loraine Ann Loraine made changes -
            Workflow Loraine Lab Workflow [ 17855 ] Fall 2019 Workflow Update [ 19820 ]
            ann.loraine Ann Loraine made changes -
            Workflow Fall 2019 Workflow Update [ 19820 ] Revised Fall 2019 Workflow Update [ 21939 ]

              People

              • Assignee:
                ieclabau Ivory Blakley (Inactive)
                Reporter:
                mason Mason Meyer (Inactive)
              • Votes:
                0
                Watchers:

                Dates

                • Created:
                  Updated:
                  Resolved: