  1. IGB
  2. IGBF-3889

Add Capitella teleta genome to IGB

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Task: Add the Capitella teleta genome and annotation to IGB.

      Capitella teleta (Capca1) - https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_000328365.1/
      NCBI:txid283909

        Attachments

          Activity

          nfreese Nowlan Freese added a comment -

          Tested following the instructions above. Everything looks good.

          Closing ticket.

          ann.loraine Ann Loraine added a comment -

          I have deployed the latest copy of the quickload repository to:

          • RENCI hosting - http://igbquickload-main.bioviz.org/quickload/ (primary)
          • UNC Charlotte hosting - http://igbquickload.org/quickload/ (backup)

          To test:

          • launch IGB and visit each new genome version (see above)
          • visit the subdirectories for each genome (by following the links above) and check that text describing the genome and datasets is visible in IGB itself
          • in IGB's Available Data section, click each "linkout" icon and make sure a Web page opens and that it describes the dataset in some way
          • check that when the datasets load, they look OK - gene models should be boxes with lines connecting them, for instance, and the track labels should be readable and should make sense ("making sense" is subjective, of course! Mainly we're looking for problems that could trip up a user and cause confusion.)
          pkulzer Paige Kulzer (Inactive) added a comment -

          The Capitella teleta genome has been pushed to the SVN repo. To do this, I had to use the username and password flags when checking in my changes, like so:

          svn ci -m "IGBF-3889: Add Capitella teleta (Jan 2013) genome to IGB" --username pkulzer --password [insert password here]
          

          Ready for final review!

          nfreese Nowlan Freese added a comment -

          Tested locally on Mac, looks good.

          Ready to push to SVN repo

          nfreese Nowlan Freese added a comment - - edited

          Note on nomenclature: https://support.nlm.nih.gov/kbArticle/?pn=KA-03451

          The format for GenBank (primary) assembly accessions is: [ GCA ][ _ ][nine digits][.][version number]
          The format for RefSeq (NCBI-derived) assembly accessions is: [ GCF ][ _ ][nine digits][.][version number]
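
The accession formats above can be checked mechanically. A minimal sketch, assuming a POSIX shell with grep; is_assembly_accession is a hypothetical helper name, not part of IGB or any NCBI tool:

```shell
# Hypothetical helper: succeed if the argument looks like an assembly
# accession: GCA or GCF, an underscore, nine digits, a dot, and a version.
is_assembly_accession() {
    printf '%s\n' "$1" | grep -Eq '^GC[AF]_[0-9]{9}\.[0-9]+$'
}

# Example: classify the accession used in this ticket.
accession="GCA_000328365.1"
if is_assembly_accession "$accession"; then
    case "$accession" in
        GCA_*) echo "$accession: GenBank assembly" ;;
        GCF_*) echo "$accession: RefSeq assembly" ;;
    esac
fi
```

The pattern only checks the shape of the string; it does not confirm that the accession exists at NCBI.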

          pkulzer Paige Kulzer (Inactive) added a comment -

          I've addressed the above change requests:

          • I edited the header to make it specific to NCBI rather than UCSC. I also tried to be as detailed as possible by including a link to the paper that published the genome and annotations.
          • I added species.txt and synonyms.txt to the zip file. I also cut down contents.txt to only include this genome.
          • I edited annots.xml to remove mention of refGene and refSeq, which were UCSC-specific terms.

          Let me know if I've updated all of these files correctly!

          nfreese Nowlan Freese added a comment - - edited

          1st level review:
          Downloaded zip file and unpacked and added to IGB as new quickload data source.
          Annotation loads for whole genome and I am able to load the sequence (no errors in log).

          Couple of small things:

          • The annotation file has refGene in the name, and is also referred to as refGene in the annots.xml. I'm not sure that this is correct. I think refGene is a specific annotation created by the UCSC genome browser, which is what we use if we pull the annotation from the UCSC table browser. However, since this annotation is coming from NCBI I don't think it is refGene. I think refSeq is NCBI, but that might refer to a specific "level" of annotation (as in, this is the NCBI official version of the annotation). It looks like the annotation was originally deposited in GenBank, so maybe that would be more correct? Not sure, would need to dig into it more, but we probably shouldn't call it refGene.
          • Need to include the C teleta info in species.txt and synonyms.txt. These can exist in two locations, both in IGB itself and in the SVN repository. We should probably add them to IGB (maybe make a new ticket that has the species and synonyms ready for the 4 new genomes), but since that requires a new version of IGB we usually also add them to the SVN repo. I've attached an example species.txt and synonyms.txt for C teleta (I think they are working, but good to double check the IGB user's guide on creating species and synonyms).
          • Need to update the HEADER.md as it refers to pulling the genome from UCSC genome browser instead of NCBI.
          pkulzer Paige Kulzer (Inactive) added a comment - - edited

          Before I "check-in" these changes via svn, please find a zipped copy of the new Quickload at the following location on Google Drive:
          Path: research-big-lorainelab > IGB Project Documentation and Plans > IGB Genomes > C_teleta.zip
          Link: https://drive.google.com/drive/folders/1bFRx4PqldxNf400n7Vr9SD_dNeNmtpvk?usp=drive_link

          For this initial review, please download this .zip file to your computer and check that the genome and its annotations can be added to IGB without error. Also, please look over the steps I've outlined above and let me know if I missed any steps. After this review, I will plan to commit my changes via svn.

          pkulzer Paige Kulzer (Inactive) added a comment - - edited

          Below is an outline of the steps I followed to create the Capitella teleta Quickload:

          1. Convert genome .fasta to .2bit

          rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/macOSX.arm64/faToTwoBit ./ 
          ./faToTwoBit GCA_000328365.1_Capca1_genomic.fna C_teleta_Jan_2013.2bit
          

          2. Create genome.txt

          rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/macOSX.arm64/twoBitInfo ./  
          ./twoBitInfo C_teleta_Jan_2013.2bit genome.txt
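
A quick consistency check after steps 1 and 2 might look like the sketch below. It assumes genome.txt holds one "name<TAB>length" row per sequence (the twoBitInfo output format) and that the FASTA has one ">" header per sequence; check_genome_txt is a hypothetical helper name:

```shell
# Hypothetical helper: compare the number of FASTA records with the number
# of rows in genome.txt, and confirm every row is "name<TAB>length".
check_genome_txt() {
    fasta="$1"
    genome_txt="$2"
    # Count header lines in the FASTA file.
    n_fasta=$(grep -c '^>' "$fasta")
    # Count rows with exactly two tab-separated fields and a numeric length.
    n_rows=$(awk -F'\t' 'NF == 2 && $2 ~ /^[0-9]+$/ { n++ } END { print n + 0 }' "$genome_txt")
    # Count all rows, valid or not.
    n_lines=$(awk 'END { print NR }' "$genome_txt")
    if [ "$n_fasta" -eq "$n_rows" ] && [ "$n_rows" -eq "$n_lines" ]; then
        echo "OK: $n_rows sequences"
    else
        echo "MISMATCH: $n_fasta FASTA records, $n_rows valid rows, $n_lines lines"
        return 1
    fi
}

# Usage with the file names from steps 1 and 2:
#   check_genome_txt GCA_000328365.1_Capca1_genomic.fna genome.txt
```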
          

          3. Get gene models from NCBI (.gff), then convert .gff to .bed

          git clone git@bitbucket.org:lorainelab/genomesource.git
          
          path+=("$HOME/Documents/Repos/genomesource")
          export PYTHONPATH="${PYTHONPATH}:$HOME/Documents/Repos/genomesource"
          
          ./gff3ToBedDetail.py -g ~/Downloads/genomic.gff -b ~/Downloads/C_teleta_Jan_2013_refGene.bed
          

          4. Check if NCBI has any information for this genome using its txid (NCBI:txid283909) (Conclusion: it didn't)

          brew install tnftp
          ftp ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
          
          get gene2accession.gz
          quit
          
          # match the tax_id column exactly; grep does not portably treat \t as a tab
          gunzip -c gene2accession.gz | awk -F'\t' '$1 == "283909"' > 283909.gene2accession.txt
          

          5. Sort, bgzip, and tabix the .bed file made in step 3

          sort -k1,1 -k2,2n C_teleta_Jan_2013_refGene.bed | bgzip > C_teleta_Jan_2013_refGene.bed.gz
          tabix -0 -s 1 -b 2 -e 3 C_teleta_Jan_2013_refGene.bed.gz
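
To confirm the compressed file kept the order produced in step 5, something like the following could be run. This is a sketch that assumes gunzip can read the bgzip output (bgzip writes gzip-compatible files); bed_gz_is_sorted is a hypothetical helper name:

```shell
# Hypothetical helper: succeed only if a (b)gzipped BED file is ordered by
# sequence name, then by numeric start, the same keys used in step 5.
# -c checks order without sorting; -s ignores ties between equal keys.
bed_gz_is_sorted() {
    gunzip -c "$1" | sort -c -s -k1,1 -k2,2n
}

# Usage:
#   bed_gz_is_sorted C_teleta_Jan_2013_refGene.bed.gz && echo "sorted"
```

When the check fails, sort reports the first out-of-order line on standard error.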
          

          6. Sanity check the .bed and .2bit files - Add the .2bit file as a reference, then drag/drop the .bed file into IGB. Confirm that gene models are present, labeled correctly, and the chromosomes listed are in a logical order.
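
Before opening IGB for the sanity check in step 6, a command-line check can catch sequence names in the .bed file that are absent from genome.txt. A minimal sketch, assuming both files are tab-delimited with the sequence name in column 1 and that genome.txt is not empty; bed_names_missing_from_genome is a hypothetical helper name:

```shell
# Hypothetical helper: print sequence names used in a BED file that are
# missing from genome.txt. No output means every name matched.
bed_names_missing_from_genome() {
    genome_txt="$1"
    bed="$2"
    # First file: remember known names. Second file: report unknown names.
    awk -F'\t' 'NR == FNR { known[$1] = 1; next }
                !($1 in known) { print $1 }' "$genome_txt" "$bed" | sort -u
}

# Usage with the uncompressed file from step 3:
#   bed_names_missing_from_genome genome.txt C_teleta_Jan_2013_refGene.bed
```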

          7. Create annots.xml

          brew install svn
          svn checkout --username=guest --password=guest https://svn.bioviz.org/repos/genomes/quickload
          cd quickload
          svn mkdir C_teleta_Jan_2013
          svn cp A_gambiae_Oct_2006/annots.xml C_teleta_Jan_2013/
          

          8. Add the new genome to contents.txt and .htaccess
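
The contents.txt edit could be scripted along these lines. This is only a sketch: it assumes contents.txt is tab-delimited with the genome version directory in column 1 and a display title in column 2, and the title shown is illustrative. add_contents_entry is a hypothetical helper name, and the .htaccess change is not covered here.

```shell
# Hypothetical helper: append "genome_version<TAB>title" to contents.txt
# unless a row for that genome version is already present.
add_contents_entry() {
    contents="$1"
    version="$2"
    title="$3"
    # awk exits 0 only if column 1 of some row equals the genome version.
    if awk -F'\t' -v v="$version" '$1 == v { found = 1 } END { exit !found }' "$contents" 2>/dev/null; then
        echo "already listed: $version"
    else
        printf '%s\t%s\n' "$version" "$title" >> "$contents"
        echo "added: $version"
    fi
}

# Usage (title is an assumed example):
#   add_contents_entry contents.txt C_teleta_Jan_2013 "Capitella teleta (Jan 2013)"
```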


            People

            • Assignee:
              pkulzer Paige Kulzer (Inactive)
              Reporter:
              pkulzer Paige Kulzer (Inactive)