IGB / IGBF-4010

Create documentation for adding NCBI genomes to IGB Quickload

    Details

    • Type: Documentation
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels: None

      Description

      Situation: We currently have documentation for adding new genomes to IGB Quickload using UCSC as a data source. However, following the work on the IGB-UCSC integration, future requests to add new genomes to IGB Quickload will have to be completed using other data sources such as NCBI. Since NCBI is an entirely new data source, and the process for downloading data and naming files will be different, we need to create new documentation.

      Task:

      • Read over current documentation for adding new genomes to IGB Quickload (https://docs.google.com/document/d/1WQO_HWhpfUBntsNSaQ-jdrVoR6jJcQ7ewJ_HFvpYsYA/edit?usp=sharing).
      • Using our current documentation as a guide, create a new document in Google Drive, create section outlines, and transfer over as much relevant documentation as possible.
      • Include a section that discusses how to name files depending on the assembly type (e.g., "ncbiRefSeq" when dealing with a RefSeq assembly, "genBank" when dealing with a GenBank (GCA) assembly).
      • Include a section with an example annots.xml file that discusses how to format the description attribute depending on the assembly type (e.g., "NCBI GenBank [GenBank (GCA) assembly] [Assembly] ([Assembly date in MMM. DD, YYYY format])" when dealing with a GenBank (GCA) assembly), as well as the title attribute.
      • Include an example HEADER.md file that has been remade for genomes coming from NCBI rather than UCSC.
      • Include a section for creating species.txt and synonyms.txt.
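
      The naming and description rules in the task list above could be sketched roughly as follows. This is only an illustration, not the wording of the final documentation. It assumes that RefSeq accessions begin with "GCF_" and GenBank accessions begin with "GCA_", and that the bracketed placeholders in the description template stand for the assembly accession and the assembly name, in that order. The function names and the example accession and assembly name are hypothetical.

      ```python
      from datetime import date

      # Month abbreviations are spelled out so the result does not depend on locale.
      _MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


      def assembly_label(accession: str) -> str:
          """Return the file-naming label for an NCBI assembly accession.

          Assumption: "GCF_" marks a RefSeq assembly, "GCA_" a GenBank assembly.
          How the label is combined into a full file name is not covered here.
          """
          if accession.startswith("GCF_"):
              return "ncbiRefSeq"
          if accession.startswith("GCA_"):
              return "genBank"
          raise ValueError(f"Unrecognized assembly accession: {accession!r}")


      def genbank_description(accession: str, assembly_name: str,
                              assembly_date: date) -> str:
          """Build an annots.xml description for a GenBank (GCA) assembly.

          Follows the template from the ticket:
          NCBI GenBank [GenBank (GCA) assembly] [Assembly] ([MMM. DD, YYYY])
          Assumption: the two placeholders are the accession and the assembly name.
          """
          month = _MONTHS[assembly_date.month - 1]
          formatted = f"{month}. {assembly_date.day:02d}, {assembly_date.year}"
          return f"NCBI GenBank {accession} {assembly_name} ({formatted})"


      # Hypothetical accession and assembly name, for illustration only.
      example = genbank_description("GCA_000000000.1", "ExampleAsm_1.0",
                                    date(2021, 1, 5))
      ```

      The ticket only spells out the description format for GenBank assemblies, so the RefSeq equivalent is left out of this sketch.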

            Activity

            pkulzer Paige Kulzer added a comment -

            These are great suggestions!

            I've added some more instructions to the Tabix-index gene model file section for manually checking the BED file for any erroneous text. I also removed mention of archiving files from that section. Finally, in the Deploy to IGB Quickload section, I specified that ALL modified files (e.g., species.txt, synonyms.txt) should be included in the zipped Quickload folder for a reviewer to test.

            Closing this ticket!

            nfreese Nowlan Freese added a comment -

            Overall looks good. I had a couple of thoughts. Up to Paige Kulzer to decide if changes should be made. Otherwise I think the ticket can be closed.

            • Convert annotation features -> just noting that we don't seem to need the steps for adding the 13th/14th columns to the BED file, like we did when we were pulling annotations from UCSC. I checked the GFF for human hg38 from NCBI, and it looks like the data for the 13th/14th columns is present in the file. This is great and saves us a good amount of time.
            • Tabix-index gene model file -> I don't archive the original BED12 file. I don't think there's a reason to save it?
            • Based on testing of IGBF-4018, I'm wondering if we should include some text on checking the converted BED file for leftover encoded text like %2C.
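
            A check along the lines of the last point might look like the sketch below. It assumes the BED file is tab-separated plain text and treats any "%" followed by two hex digits (such as %2C, the percent-encoded comma) as suspect. The function name and sample lines are hypothetical, and this is not the procedure from the documentation itself.

            ```python
            import re
            from urllib.parse import unquote

            # A "%" followed by two hex digits, e.g. %2C (comma) or %3B (semicolon).
            PERCENT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


            def find_percent_encoded(lines):
                """Return (line_number, column_number, field) for every tab-separated
                field that still contains a percent-encoded sequence."""
                hits = []
                for line_no, line in enumerate(lines, start=1):
                    fields = line.rstrip("\n").split("\t")
                    for col_no, field in enumerate(fields, start=1):
                        if PERCENT_ENCODED.search(field):
                            hits.append((line_no, col_no, field))
                return hits


            # Hypothetical sample lines; a real file would be read with open(...).
            sample = [
                "chr1\t0\t100\tgeneA\n",
                "chr1\t200\t300\tgeneB\tprotein%2C putative\n",
            ]
            hits = find_percent_encoded(sample)

            # unquote() shows what the flagged text decodes to, which helps decide
            # how the field should be corrected.
            decoded = [unquote(field) for _, _, field in hits]
            ```

            Reporting the line and column keeps the check manual-friendly: a reviewer can jump straight to the flagged field instead of scanning the whole file.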
            pkulzer Paige Kulzer added a comment - - edited

            To test the new version of the documentation I made, I went ahead and followed this documentation to add the Dama dama genome to IGB Quickload as part of IGBF-4018.

            For review, please look over the documentation I've made and check the following:

            • The order of steps makes sense
            • The images are clear and easy to read, and add to the reader's comprehension of the task
            • The naming system for genomes with RefSeq vs GenBank annotations is clearly defined and consistent
            • Code is all formatted consistently
            • The new Dama dama quickload contains all of the necessary files and works as expected when added to IGB (i.e., was created successfully by following this documentation)
            pkulzer Paige Kulzer added a comment - edited

            Here's a link to the new documentation: How we add new genomes to IGB Quickload using NCBI as a data source

              People

              • Assignee: pkulzer Paige Kulzer
              • Reporter: pkulzer Paige Kulzer