[IGBF-3889] Add Capitella teleta genome to IGB - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels: None
Story Points: 2
Epic Link: Add genomes requested during SDB
Sprint: Fall 1, Fall 7

Description

Task: Add the Capitella teleta genome and annotation to IGB.

Capitella teleta (Capca1) - https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_000328365.1/
NCBI:txid283909

Attachments

synonyms.txt (0.0 kB, 10/Dec/24 11:29 AM)
species.txt (0.0 kB, 10/Dec/24 11:29 AM)

Activity

Nowlan Freese added a comment - 20/Dec/24 9:39 AM

Tested following the instructions above. Everything looks good.

Closing ticket.

Ann Loraine added a comment - 19/Dec/24 2:38 PM

I have deployed the latest copy of the Quickload repository to:

RENCI hosting - http://igbquickload-main.bioviz.org/quickload/ (primary)
UNC Charlotte hosting - http://igbquickload.org/quickload/ (backup)

To test:

- Launch IGB and visit each new genome version (see above).
- Visit the subdirectory for each genome (by following the links above) and check that there is text describing the genome and the datasets visible in IGB itself.
- In IGB's Available Data section, click any "linkout" icons and make sure that a Web page opens and that it describes the dataset in some way.
- Check that the datasets look OK when they load. Gene models should be boxes with lines connecting them, for instance, and the track labels should be readable and should make sense. ("Making sense" is subjective, of course! Mainly we're looking for problems that could trip up a user and cause confusion.)

Paige Kulzer (Inactive) added a comment - 16/Dec/24 9:18 AM

The Capitella teleta genome has been pushed to the SVN repo. To do this, I had to use the username and password flags when checking in my changes, like so:

svn ci -m "IGBF-3889: Add Capitella teleta (Jan 2013) genome to IGB" --username pkulzer --password [insert password here]

Ready for final review!

Nowlan Freese added a comment - 12/Dec/24 2:47 PM

Tested locally on Mac, looks good.

Ready to push to SVN repo

Nowlan Freese added a comment - 12/Dec/24 11:20 AM - edited

Note on nomenclature: https://support.nlm.nih.gov/kbArticle/?pn=KA-03451

The format for GenBank (primary) assembly accessions is: [ GCA ][ _ ][nine digits][.][version number]
The format for RefSeq (NCBI-derived) assembly accessions is: [ GCF ][ _ ][nine digits][.][version number]
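
As a rough illustration of the two formats above, the following shell sketch classifies an accession string by its pattern alone. The function name `classify_accession` is made up for this example, and the check only looks at the shape of the string; it does not confirm that the accession actually exists at NCBI.

```shell
# Classify an NCBI assembly accession by its shape:
# GCA or GCF, an underscore, nine digits, a dot, then a version number.
# classify_accession is a hypothetical helper name, not part of any NCBI tool.
classify_accession() {
  if printf '%s\n' "$1" | grep -Eq '^GCA_[0-9]{9}\.[0-9]+$'; then
    echo "GenBank"
  elif printf '%s\n' "$1" | grep -Eq '^GCF_[0-9]{9}\.[0-9]+$'; then
    echo "RefSeq"
  else
    echo "invalid"
  fi
}

# The Capitella teleta assembly from this ticket:
classify_accession "GCA_000328365.1"   # prints GenBank
```

By this pattern, the assembly in this ticket is a GenBank accession, which fits the review comment below that the annotation appears to have been deposited in GenBank.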

Paige Kulzer (Inactive) added a comment - 11/Dec/24 3:50 PM

I've addressed the above change requests:

- I edited the header to make it specific to NCBI rather than UCSC. I also tried to be as detailed as possible by including a link to the paper that published the genome and annotations.
- I added species.txt and synonyms.txt to the zip file. I also cut down contents.txt to include only this genome.
- I edited annots.xml to remove mentions of refGene and refSeq, which were UCSC-specific terms.

Let me know if I've updated all of these files correctly!

Nowlan Freese added a comment - 10/Dec/24 11:30 AM - edited

1st level review:
Downloaded the zip file, unpacked it, and added it to IGB as a new Quickload data source.
The annotation loads for the whole genome, and I am able to load the sequence (no errors in the log).

A couple of small things:

- The annotation file has refGene in its name and is also referred to as refGene in annots.xml. I'm not sure that this is correct. I think refGene is a specific annotation created by the UCSC Genome Browser, which is what we use if we pull the annotation from the UCSC Table Browser. However, since this annotation comes from NCBI, I don't think it is refGene. I think refSeq is NCBI, but that might refer to a specific "level" of annotation (as in, this is the official NCBI version of the annotation). It looks like the annotation was originally deposited in GenBank, so maybe that would be more correct? I'm not sure and would need to dig into it more, but we probably shouldn't call it refGene.
- We need to include the C. teleta info in species.txt and synonyms.txt. These can exist in two locations: in IGB itself and in the SVN repository. We should probably add them to IGB (maybe make a new ticket that has the species and synonyms ready for the 4 new genomes), but since that requires a new version of IGB, we usually also add them to the SVN repo. I've attached an example species.txt and synonyms.txt for C. teleta (I think they work, but it would be good to double-check them against the IGB User's Guide section on creating species and synonyms).
- HEADER.md needs to be updated, as it refers to pulling the genome from the UCSC Genome Browser instead of NCBI.
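
Since this review turned up missing or outdated files, a small check like the one below could list what is absent before a zip is handed over. This is only a sketch: the function name `missing_quickload_files` is invented here, and the layout it checks (contents.txt, species.txt, and synonyms.txt at the top level; annots.xml, genome.txt, HEADER.md, and the .2bit file inside the genome directory) is an assumption pieced together from the files named in this ticket, not a documented requirement.

```shell
# Print the expected files that are missing from a Quickload directory.
# ASSUMED layout (not confirmed by this ticket): contents.txt, species.txt,
# and synonyms.txt sit at the root; annots.xml, genome.txt, HEADER.md, and
# <genome>.2bit sit inside the genome's subdirectory.
# $1 = Quickload root directory, $2 = genome directory name
missing_quickload_files() {
  for f in contents.txt species.txt synonyms.txt \
           "$2/annots.xml" "$2/genome.txt" \
           "$2/HEADER.md" "$2/$2.2bit"; do
    # Report the relative path of each file that does not exist.
    [ -e "$1/$f" ] || echo "$f"
  done
}

# Example call; "quickload" is a placeholder for the unpacked directory.
missing_quickload_files quickload C_teleta_Jan_2013
```

No output means every file in the assumed layout is present. The check says nothing about whether the contents are correct, so the manual review in IGB is still needed.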

Paige Kulzer (Inactive) added a comment - 10/Sep/24 3:46 PM - edited

Before I "check-in" these changes via svn, please find a zipped copy of the new Quickload at the following location on Google Drive:
Path: research-big-lorainelab > IGB Project Documentation and Plans > IGB Genomes > C_teleta.zip
Link: https://drive.google.com/drive/folders/1bFRx4PqldxNf400n7Vr9SD_dNeNmtpvk?usp=drive_link

For this initial review, please download this .zip file to your computer and check that the genome and its annotations can be added to IGB without error. Also, please look over the steps I've outlined above and let me know if I missed any steps. After this review, I will plan to commit my changes via svn.

Paige Kulzer (Inactive) added a comment - 10/Sep/24 3:41 PM - edited

Below is an outline of the steps I followed to create the Capitella teleta Quickload:

1. Convert genome .fasta to .2bit

rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/macOSX.arm64/faToTwoBit ./ 
./faToTwoBit GCA_000328365.1_Capca1_genomic.fna C_teleta_Jan_2013.2bit

2. Create genome.txt

rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/macOSX.arm64/twoBitInfo ./  
./twoBitInfo C_teleta_Jan_2013.2bit genome.txt

3. Get gene models from NCBI (.gff), then convert .gff to .bed

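git clone git@bitbucket.org:lorainelab/genomesource.git

# zsh: add the cloned repository to PATH and PYTHONPATH.
# Use $HOME here, because a quoted ~ is not expanded by the shell.
path+=("$HOME/Documents/Repos/genomesource")
export PYTHONPATH="${PYTHONPATH}:$HOME/Documents/Repos/genomesource"

./gff3ToBedDetail.py -g ~/Downloads/genomic.gff -b ~/Downloads/C_teleta_Jan_2013_refGene.bed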

4. Check if NCBI has any information for this genome using its txid (NCBI:txid283909) (Conclusion: it didn't)

brew install tnftp
ftp ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/

get gene2accession.gz
quit

gunzip -c gene2accession.gz | grep '^283909\t' > 283909.gene2accession.txt

5. Sort, bgzip, and tabix-index the .bed file made in step 3

sort -k1,1 -k2,2n C_teleta_Jan_2013_refGene.bed | bgzip > C_teleta_Jan_2013_refGene.bed.gz
tabix -0 -s 1 -b 2 -e 3 C_teleta_Jan_2013_refGene.bed.gz

6. Sanity check the .bed and .2bit files - Add the .2bit file as a reference, then drag/drop the .bed file into IGB. Confirm that gene models are present, labeled correctly, and the chromosomes listed are in a logical order.

7. Create annots.xml

brew install svn
svn checkout --username=guest --password=guest https://svn.bioviz.org/repos/genomes/quickload
# Work inside the checked-out working copy so the relative paths below resolve.
cd quickload
svn mkdir C_teleta_Jan_2013
svn cp A_gambiae_Oct_2006/annots.xml C_teleta_Jan_2013/

8. Add the new genome to contents.txt and .htaccess
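
A note on the tax ID filter in step 4: grep implementations differ in how they read `\t` inside a pattern, and some treat it as a plain `t` rather than a tab, so an empty result from that command may be worth double-checking. The sketch below does the same filtering with awk, which splits on a literal tab. The function name `filter_by_taxid` and the two sample rows are made up for illustration; the only assumption about gene2accession is that it is tab-delimited with the tax ID in the first column.

```shell
# Print rows of a tab-delimited stream whose first column equals the tax ID.
# awk splits fields on a tab, so no regex escape for the tab is needed.
# filter_by_taxid is a hypothetical helper name.
filter_by_taxid() {
  awk -F '\t' -v taxid="$1" '$1 == taxid'
}

# Demo with two made-up rows: only the first has tax ID 283909.
printf '283909\tgeneA\n2839091\tgeneB\n' | filter_by_taxid 283909

# With the real file from step 4, this would be:
# gunzip -c gene2accession.gz | filter_by_taxid 283909 > 283909.gene2accession.txt
```

Comparing the whole first field also avoids matching longer IDs that merely begin with the same digits.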

People

Assignee: Paige Kulzer (Inactive)
Reporter: Paige Kulzer (Inactive)
Votes: 0
Watchers: 3

Dates

Created: 05/Sep/24 1:50 PM
Updated: 21/Dec/24 6:57 AM
Resolved: 20/Dec/24 9:39 AM