  1. IGB
  2. IGBF-1879

Ensure Galaxy default dbkey values are in synonyms.txt

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
    • Story Points:
      2
    • Sprint:
      Summer 2019 Sprint 12, Fall 2019 Sprint 1, Fall 2019 Sprint 2, Fall 2019 Sprint 3, Fall 4 : 30 Sep to 11 Oct, Fall 5 : 14 Oct to 25 Oct, Fall 6 : 28 Oct to 8 Nov, Fall 7 : 11 Nov to 22 Nov, Fall 8 : 25 Nov to 6 Dec, Spring 5 2021 May 17 - May 28

      Description

      This ticket was first worked on in 2014 and needs to be revisited because the Galaxy code base has progressed and many new genome versions are available.

      The Galaxy software relies heavily on the UCSC Genome Browser informatics system to support many different genome versions. There's a "cron job" that Galaxy admins periodically run to update genome version data in Galaxy. We need to understand how that works to make sure that IGB's synonyms system stays up to date with Galaxy.

      The script(s) run during the cron job reside in:

      The script in this directory that "kicks off" the update is updateucsc.sh.sample.

      It also retrieves the length files, which list chromosome names and sizes. Note that this script also manages the import of non-UCSC genome names; look at the script to see how that part works.

      The above is actually a legacy system but has been used for a long time and seems to work quite well. In addition, there's a "data manager" system that allows an admin to use the Galaxy UI to define new genomes.

      Each genome version is identified by a "dbkey" that, in the case of UCSC genomes, is identical to the UCSC genome version name.

      An example dbkey is "apiMel2", which, in IGB, is called "A_mellifera_Jan_2005".
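      The dbkey-to-IGB-name relationship can be sketched as a lookup over synonym rows. This is only a minimal illustration, assuming synonyms.txt is tab-delimited with one genome version per line and all of that version's names on the same line. The helper names `parse_synonyms` and `synonyms_of` are hypothetical, and the sample row is built only from the example pair given in this ticket.

```python
def parse_synonyms(text):
    """Map every name to the set of all names found on its row.

    Assumption: one genome version per line, names separated by tabs.
    """
    lookup = {}
    for line in text.splitlines():
        names = [n.strip() for n in line.split("\t") if n.strip()]
        if not names:
            continue  # skip blank lines
        group = set(names)
        for name in names:
            lookup.setdefault(name, set()).update(group)
    return lookup


def synonyms_of(lookup, name):
    """Return the other names recorded for `name` (empty list if unknown)."""
    return sorted(lookup.get(name, set()) - {name})


# Sample row built from the example in this ticket.
SAMPLE = "A_mellifera_Jan_2005\tapiMel2\n"
lookup = parse_synonyms(SAMPLE)
print(synonyms_of(lookup, "apiMel2"))
```

      Because every name on a row points at the whole row, the lookup works in either direction (Galaxy dbkey to IGB name, or IGB name to dbkey).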

      Dan Blankenberg (https://galaxyproject.org/people/dan/) notes that there may be a REST endpoint that reports the dbkeys for all genome versions supported in a given Galaxy instance. If there is not, he recommends that we open a ticket in the Galaxy GitHub repository requesting one.

      Galaxy also keeps track of a user-friendly name that is displayed to users. This is equivalent to column 2 in IGB Quickload's contents.txt file – see http://igbquickload.org/quickload/contents.txt.

      For this task, check that the synonyms.txt file includes all genome versions supported by Galaxy. If not, update it.
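      The check described in this task could be scripted roughly as follows. This sketch assumes that a Galaxy instance exposes its genome list at /api/genomes as a JSON array of [display name, dbkey] pairs, and that synonyms.txt is tab-delimited; both assumptions should be verified against the instance and file actually in use. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_galaxy_genomes(base_url):
    """Fetch the genome list from a Galaxy instance (needs network access).

    Assumption: GET <base_url>/api/genomes returns [[display_name, dbkey], ...].
    """
    with urlopen(base_url.rstrip("/") + "/api/genomes") as response:
        return json.load(response)


def known_names(synonyms_text):
    """Collect every name that appears anywhere in synonyms.txt."""
    names = set()
    for line in synonyms_text.splitlines():
        names.update(n.strip() for n in line.split("\t") if n.strip())
    return names


def missing_dbkeys(genomes, synonyms_text):
    """Return (dbkey, display_name) pairs absent from synonyms.txt, sorted by dbkey."""
    known = known_names(synonyms_text)
    missing = {}
    for display_name, dbkey in genomes:
        # "?" marks an unspecified or custom genome, so it is not a real key.
        if dbkey != "?" and dbkey not in known:
            missing[dbkey] = display_name
    return sorted(missing.items())


# Intended use (not run here because it needs network and a local file):
#   genomes = fetch_galaxy_genomes("https://usegalaxy.org")
#   with open("synonyms.txt") as handle:
#       for dbkey, label in missing_dbkeys(genomes, handle.read()):
#           print(dbkey, label, sep="\t")
```

      Keeping the comparison separate from the download makes it possible to run the same check against more than one Galaxy instance and to test it without network access.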

            Activity

            ann.loraine Ann Loraine added a comment - - edited

            This looks right.

            However, I have a question about Arabidopsis_thaliana_TAIR10.

            The output from this endpoint - https://usegalaxy.org/api/genomes/Arabidopsis_thaliana_TAIR10 - reports chromosome names as "chr1", "chr2", etc with lower-case "c".

            The IGB Quickload site also used lower-case "c" many years ago. I changed it to upper-case "C" when I realized that The Arabidopsis Information Resource and the entire Arabidopsis community were using upper-case "C". I'm not sure where Galaxy got Arabidopsis_thaliana_TAIR10 from, but they might have got it from me. If so, they may have used the lower-case "c" because that is what I was using at the time. However, I also have a vague memory of someone at the last Galaxy Community Conference saying that Galaxy might be making all chromosome names lower-case by default.

            Could you ask them about this?

            I also checked a few other plant genomes that I think might be using upper-case letters and they were all lower-case in Galaxy also.

            Also, could you find the part of their code base (it's Python) that supports this "genomes" endpoint?

            (By the way I'm really super-glad it is there!!!)
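            The capitalization question could be checked across several genomes with a small script. This sketch assumes that the genomes endpoint returns JSON containing a "chrom_info" list whose entries carry the chromosome name under "chrom"; that shape should be confirmed against a real response. The function names are hypothetical, and the records in the offline check are hand-written rather than actual Galaxy output.

```python
import json
from urllib.request import urlopen


def fetch_genome(base_url, dbkey):
    """Fetch one genome record (needs network access).

    Assumption: the JSON body has a "chrom_info" list of {"chrom": ...} entries.
    """
    url = f"{base_url.rstrip('/')}/api/genomes/{dbkey}"
    with urlopen(url) as response:
        return json.load(response)


def chr_prefix_case(genome):
    """Report how chromosome names capitalize the "chr" prefix.

    Returns "lower", "upper", "mixed", or "none" (no chr/Chr-prefixed names).
    """
    names = [entry["chrom"] for entry in genome.get("chrom_info", [])]
    has_lower = any(name.startswith("chr") for name in names)
    has_upper = any(name.startswith("Chr") for name in names)
    if has_lower and has_upper:
        return "mixed"
    if has_lower:
        return "lower"
    if has_upper:
        return "upper"
    return "none"


# Offline check with hand-written records shaped like the assumed response.
print(chr_prefix_case({"chrom_info": [{"chrom": "chr1"}, {"chrom": "chr2"}]}))
print(chr_prefix_case({"chrom_info": [{"chrom": "Chr1"}, {"chrom": "Chr2"}]}))
```

            Running `chr_prefix_case(fetch_genome("https://usegalaxy.org", dbkey))` over a list of plant dbkeys would show whether lower-case names are the rule or the exception.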

            shamika Shamika Gajanan Kulkarni (Inactive) added a comment -

            Yes sure, I will ask them and let you know about the lower-case and I'll try to find the part of their code-base too. Will let you know soon. Thank you Prof. [~aloraine].

            ann.loraine Ann Loraine added a comment -

            dbkey is "?" if a user is working with a custom genome in Galaxy.
            Galaxy has a manually defined static map that maps dbkeys.

            nfreese Nowlan Freese added a comment -

            I have included some updates from Galaxy as part of IGBF-781.

            Some notes:
            The dbkeys for genomes that are also found in UCSC appear to have the same key names. For example, Takifugu rubripes is fr1 in both Galaxy and UCSC. However, as UCSC does not provide plant genomes through its API, Galaxy uses a somewhat different set of keys. For example, the Galaxy EU key for Solanum lycopersicum is Solanum_lycopersicum_Sol_Genomics_itag2.4.

            Since the Galaxy keys for UCSC genomes are the same as the UCSC names, I focused on the non-UCSC genomes that are available in IGB. This list is mostly plant genomes, as many of the non-plant genomes in IGB are provided by UCSC. While the Galaxy main API does not include many plant genomes, the Galaxy EU API does. This news article discusses the newly added plant genomes and where the data came from.

            I manually added as many of the Galaxy EU keys to synonyms.txt as I could identify.

            nfreese Nowlan Freese added a comment -

            I have created IGBF-2862 regarding an edge case I found while testing loading Galaxy data in IGB.


              People

              • Assignee:
                shamika Shamika Gajanan Kulkarni (Inactive)
                Reporter:
                dcnorris David Norris (Inactive)
              • Votes:
                0
                Watchers:
                3

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  4h
                  Remaining:
                  4h
                  Logged:
                  Not Specified