IGB / IGBF-3889

Add Capitella teleta genome to IGB

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Task: Add the Capitella teleta genome and annotation to IGB.

      Capitella teleta (Capca1) - https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_000328365.1/
      NCBI:txid283909

        Attachments

          Activity

          pkulzer Paige Kulzer (Inactive) created issue -
          pkulzer Paige Kulzer (Inactive) made changes -
          Field Original Value New Value
          Epic Link IGBF-3823 [ 23122 ]
          pkulzer Paige Kulzer (Inactive) made changes -
          Status To-Do [ 10305 ] In Progress [ 3 ]
          pkulzer Paige Kulzer (Inactive) made changes -
          Description
          Original value:
          Task: Add the Capitella teleta genome and annotation to IGB.

          Capitella teleta (Capca1) - https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_000328365.1/
          New value:
          Task: Add the Capitella teleta genome and annotation to IGB.

          Capitella teleta (Capca1) - https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_000328365.1/
          NCBI:txid283909
          pkulzer Paige Kulzer (Inactive) added a comment - - edited

          Below is an outline of the steps I followed to create the Capitella teleta Quickload:

          1. Convert genome .fasta to .2bit

          rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/macOSX.arm64/faToTwoBit ./ 
          ./faToTwoBit GCA_000328365.1_Capca1_genomic.fna C_teleta_Jan_2013.2bit
          

          2. Create genome.txt

          rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/macOSX.arm64/twoBitInfo ./  
          ./twoBitInfo C_teleta_Jan_2013.2bit genome.txt
          

          3. Get gene models from NCBI (.gff), then convert .gff to .bed

          git clone git@bitbucket.org:lorainelab/genomesource.git
          
          # Use $HOME here: the shell does not expand ~ inside quotes
          path+=("$HOME/Documents/Repos/genomesource")
          export PYTHONPATH="${PYTHONPATH}:$HOME/Documents/Repos/genomesource/"
          
          ./gff3ToBedDetail.py -g ~/Downloads/genomic.gff -b ~/Downloads/C_teleta_Jan_2013_refGene.bed
          

          4. Check if NCBI has any information for this genome using its txid (NCBI:txid283909) (Conclusion: it didn't)

          brew install tnftp
          ftp ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
          
          get gene2accession.gz
          quit
          
          # Compare the first (tax_id) column exactly; grep does not portably treat \t as a tab
          gunzip -c gene2accession.gz | awk -F'\t' '$1 == "283909"' > 283909.gene2accession.txt
          

          5. Sort, bgzip, and tabix-index the .bed file made in step 3

          sort -k1,1 -k2,2n C_teleta_Jan_2013_refGene.bed | bgzip > C_teleta_Jan_2013_refGene.bed.gz
          tabix -0 -s 1 -b 2 -e 3 C_teleta_Jan_2013_refGene.bed.gz
          

          6. Sanity check the .bed and .2bit files - Add the .2bit file as a reference, then drag/drop the .bed file into IGB. Confirm that gene models are present, labeled correctly, and the chromosomes listed are in a logical order.

          7. Create annots.xml

          brew install svn
          svn checkout --username=guest --password=guest https://svn.bioviz.org/repos/genomes/quickload
          cd quickload
          svn mkdir C_teleta_Jan_2013
          svn cp A_gambiae_Oct_2006/annots.xml C_teleta_Jan_2013/
          

          8. Add the new genome to contents.txt and .htaccess
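
          The taxonomy lookup in step 4 comes down to asking whether any row of the gene2accession table has 283909 in its first column. A minimal Python sketch of that check follows, assuming the file is tab-delimited with the taxonomy ID in the first column (as the step 4 command implies). The function name and the demo rows are hypothetical and are not real NCBI data.

```python
import gzip
import io


def rows_for_taxid(handle, taxid):
    """Yield tab-split rows whose first column equals taxid exactly."""
    for line in handle:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0] == str(taxid):
            yield fields


# Tiny in-memory stand-in for gene2accession.gz (made-up rows, not real data).
demo = "#tax_id\tGeneID\tstatus\n283909\t111\t-\n2839090\t222\t-\n9606\t333\t-\n"
buf = io.BytesIO(gzip.compress(demo.encode()))
with gzip.open(buf, "rt") as handle:
    matches = list(rows_for_taxid(handle, 283909))
print(matches)
```

          Comparing the whole first field avoids prefix matches such as 2839090, which is the reason the shell version anchors the pattern on a trailing tab. For the real file, the in-memory buffer would be replaced by the path to gene2accession.gz.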

          pkulzer Paige Kulzer (Inactive) added a comment - - edited

          Before I "check-in" these changes via svn, please find a zipped copy of the new Quickload at the following location on Google Drive:
          Path: research-big-lorainelab > IGB Project Documentation and Plans > IGB Genomes > C_teleta.zip
          Link: https://drive.google.com/drive/folders/1bFRx4PqldxNf400n7Vr9SD_dNeNmtpvk?usp=drive_link

          For this initial review, please download this .zip file to your computer and check that the genome and its annotations can be added to IGB without error. Also, please look over the steps I've outlined above and let me know if I missed any. After this review, I plan to commit my changes via svn.
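
          Part of this review can be scripted before anything is loaded into IGB. The sketch below is a hypothetical helper, not part of the Quickload tooling: it assumes genome.txt holds one "name<TAB>size" line per sequence (the layout twoBitInfo writes) and that the first three BED columns are sequence name, start, and end. It reports features that sit on unknown sequences or fall outside the sequence bounds.

```python
def load_chrom_sizes(lines):
    """Parse 'name<TAB>size' lines into a dict of sequence sizes."""
    sizes = {}
    for line in lines:
        if line.strip():
            name, size = line.rstrip("\n").split("\t")[:2]
            sizes[name] = int(size)
    return sizes


def check_bed(bed_lines, sizes):
    """Return a list of problems: unknown sequences or out-of-range features."""
    problems = []
    for number, line in enumerate(bed_lines, start=1):
        if not line.strip():
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        if chrom not in sizes:
            problems.append(f"line {number}: unknown sequence {chrom}")
        elif not 0 <= int(start) <= int(end) <= sizes[chrom]:
            problems.append(f"line {number}: bad interval {start}-{end}")
    return problems


# Made-up example data; real input would be genome.txt and the .bed file.
sizes = load_chrom_sizes(["scaffold_1\t1000\n", "scaffold_2\t500\n"])
bed = [
    "scaffold_1\t10\t200\tgeneA\n",
    "scaffold_2\t450\t600\tgeneB\n",
    "scaffold_9\t0\t5\tgeneC\n",
]
problems = check_bed(bed, sizes)
print(problems)
```

          An empty list means every feature maps onto a sequence listed in genome.txt. It does not replace the visual check in IGB, which also covers labels and sequence ordering.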

          pkulzer Paige Kulzer (Inactive) made changes -
          Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
          pkulzer Paige Kulzer (Inactive) made changes -
          Assignee Paige Kulzer [ pkulzer ] Nowlan Freese [ nfreese ]
          pkulzer Paige Kulzer (Inactive) made changes -
          Sprint Fall 1 [ 202 ]
          ann.loraine Ann Loraine made changes -
          Sprint Fall 1 [ 202 ] Fall 1, Fall 2 [ 202, 203 ]
          ann.loraine Ann Loraine made changes -
          Rank Ranked higher
          pkulzer Paige Kulzer (Inactive) made changes -
          Rank Ranked higher
          pkulzer Paige Kulzer (Inactive) made changes -
          Rank Ranked lower
          nfreese Nowlan Freese made changes -
          Sprint Fall 1, Fall 2 [ 202, 203 ] Fall 1, Fall 3 [ 202, 204 ]
          nfreese Nowlan Freese made changes -
          Sprint Fall 1, Fall 3 [ 202, 204 ] Fall 1, Fall 4 [ 202, 205 ]
          nfreese Nowlan Freese made changes -
          Sprint Fall 1, Fall 4 [ 202, 205 ] Fall 1, Fall 5 [ 202, 206 ]
          nfreese Nowlan Freese made changes -
          Sprint Fall 1, Fall 5 [ 202, 206 ] Fall 1, Fall 6 [ 202, 207 ]
          nfreese Nowlan Freese made changes -
          Sprint Fall 1, Fall 6 [ 202, 207 ] Fall 1, Fall 7 [ 202, 208 ]
          pkulzer Paige Kulzer (Inactive) made changes -
          Rank Ranked higher
          nfreese Nowlan Freese made changes -
          Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
          nfreese Nowlan Freese made changes -
          Attachment species.txt [ 18577 ]
          Attachment synonyms.txt [ 18578 ]
          nfreese Nowlan Freese added a comment - - edited

          1st level review:
          Downloaded the zip file, unpacked it, and added it to IGB as a new Quickload data source.
          The annotation loads for the whole genome and I am able to load the sequence (no errors in the log).

          A couple of small things:

          • The annotation file has refGene in its name and is also referred to as refGene in annots.xml. I'm not sure this is correct. I think refGene is a specific annotation created by the UCSC Genome Browser, which is what we use if we pull the annotation from the UCSC Table Browser. However, since this annotation comes from NCBI, I don't think it is refGene. I think RefSeq is NCBI, but that might refer to a specific "level" of annotation (as in, the official NCBI version of the annotation). It looks like the annotation was originally deposited in GenBank, so maybe that would be more correct? I'm not sure and would need to dig into it more, but we probably shouldn't call it refGene.
          • Need to include the C. teleta info in species.txt and synonyms.txt. These can exist in two locations: in IGB itself and in the SVN repository. We should probably add them to IGB (maybe make a new ticket that has the species and synonyms ready for the 4 new genomes), but since that requires a new version of IGB, we usually also add them to the SVN repo. I've attached an example species.txt and synonyms.txt for C. teleta (I think they work, but it's good to double-check the IGB User's Guide on creating species and synonyms).
          • Need to update the HEADER.md as it refers to pulling the genome from UCSC genome browser instead of NCBI.
          nfreese Nowlan Freese made changes -
          Assignee Nowlan Freese [ nfreese ] Paige Kulzer [ pkulzer ]
          nfreese Nowlan Freese made changes -
          Status First Level Review in Progress [ 10301 ] To-Do [ 10305 ]
          pkulzer Paige Kulzer (Inactive) made changes -
          Status To-Do [ 10305 ] In Progress [ 3 ]
          pkulzer Paige Kulzer (Inactive) added a comment -

          I've addressed the above change requests:

          • I edited the header to make it specific to NCBI rather than UCSC. I also tried to be as detailed as possible by including a link to the paper that published the genome and annotations.
          • I added species.txt and synonyms.txt to the zip file. I also cut down contents.txt to only include this genome.
          • I edited annots.xml to remove mentions of refGene and RefSeq, which do not describe this GenBank-sourced annotation.

          Let me know if I've updated all of these files correctly!

          pkulzer Paige Kulzer (Inactive) made changes -
          Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
          pkulzer Paige Kulzer (Inactive) made changes -
          Assignee Paige Kulzer [ pkulzer ] Nowlan Freese [ nfreese ]
          nfreese Nowlan Freese made changes -
          Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
          nfreese Nowlan Freese added a comment - - edited

          Note on nomenclature: https://support.nlm.nih.gov/kbArticle/?pn=KA-03451

          The format for GenBank (primary) assembly accessions is: [ GCA ][ _ ][nine digits][.][version number]
          The format for RefSeq (NCBI-derived) assembly accessions is: [ GCF ][ _ ][nine digits][.][version number]
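
          The two formats above differ only in their prefix, so the source of an assembly can be read directly from its accession. A minimal Python sketch, assuming exactly the pattern quoted above (prefix, underscore, nine digits, dot, version number); the function name is made up for illustration:

```python
import re

# GCA or GCF prefix, underscore, nine digits, a dot, then a version number.
ACCESSION_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def classify_assembly_accession(accession):
    """Return 'GenBank', 'RefSeq', or None if the string does not match."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        return None
    return "GenBank" if match.group(1) == "GCA" else "RefSeq"


print(classify_assembly_accession("GCA_000328365.1"))
```

          For the accession used in this ticket, the function returns "GenBank", which is consistent with not labeling the annotation as refGene or RefSeq.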

          nfreese Nowlan Freese made changes -
          Status First Level Review in Progress [ 10301 ] Needs 1st Level Review [ 10005 ]
          nfreese Nowlan Freese made changes -
          Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
          nfreese Nowlan Freese added a comment -

          Tested locally on Mac, looks good.

          Ready to push to SVN repo

          nfreese Nowlan Freese made changes -
          Assignee Nowlan Freese [ nfreese ] Paige Kulzer [ pkulzer ]
          nfreese Nowlan Freese made changes -
          Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
          pkulzer Paige Kulzer (Inactive) added a comment -

          The Capitella teleta genome has been pushed to the SVN repo. To do this, I had to use the username and password flags when checking in my changes, like so:

          svn ci -m "IGBF-3889: Add Capitella teleta (Jan 2013) genome to IGB" --username pkulzer --password [insert password here]
          

          Ready for final review!

          pkulzer Paige Kulzer (Inactive) made changes -
          Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
          pkulzer Paige Kulzer (Inactive) made changes -
          Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
          pkulzer Paige Kulzer (Inactive) made changes -
          Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
          pkulzer Paige Kulzer (Inactive) made changes -
          Assignee Paige Kulzer [ pkulzer ] Nowlan Freese [ nfreese ]
          ann.loraine Ann Loraine added a comment -

          I have deployed the latest copy of the Quickload repository to:

          • RENCI hosting - http://igbquickload-main.bioviz.org/quickload/ (primary)
          • UNC Charlotte hosting - http://igbquickload.org/quickload/ (backup)

          To test:

          • Launch IGB and visit each new genome version (see above).
          • Visit the subdirectories for each genome (by following the links above) and check that there is text describing the genome and datasets visible in IGB itself.
          • Within the IGB Available Data section, click any "linkout" icons and make sure a Web page opens and that it goes to a place that describes the dataset somehow.
          • Check that when the datasets load, they look OK. Gene models should be boxes with lines connecting them, for instance, and the track labels should be readable and should make sense. ("Making sense" is subjective, of course! Mainly we're looking for problems that could trip up a user and cause confusion.)
          nfreese Nowlan Freese made changes -
          Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
          nfreese Nowlan Freese added a comment -

          Tested following the instructions above. Everything looks good.

          Closing ticket.

          nfreese Nowlan Freese made changes -
          Assignee Nowlan Freese [ nfreese ] Paige Kulzer [ pkulzer ]
          nfreese Nowlan Freese made changes -
          Resolution Done [ 10000 ]
          Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]
          nfreese Nowlan Freese made changes -
          Assignee Paige Kulzer [ pkulzer ] Nowlan Freese [ nfreese ]
          nfreese Nowlan Freese made changes -
          Assignee Nowlan Freese [ nfreese ] Paige Kulzer [ pkulzer ]

            People

            • Assignee: Paige Kulzer (Inactive)
            • Reporter: Paige Kulzer (Inactive)
            • Votes: 0
            • Watchers: 3