[IGBF-3889] Add Capitella teleta genome to IGB - JIRA UNCC

Paige Kulzer (Inactive) created issue - 05/Sep/24 1:50 PM

Paige Kulzer (Inactive) made changes - 05/Sep/24 1:50 PM

Field	Original Value	New Value
Epic Link		IGBF-3823 [ 23122 ]

Paige Kulzer (Inactive) made changes - 10/Sep/24 11:50 AM

Status	To-Do [ 10305 ]	In Progress [ 3 ]

Paige Kulzer (Inactive) made changes - 10/Sep/24 11:50 AM

Description

Original Value:

Task: Add the Capitella teleta genome and annotation to IGB.

Capitella teleta (Capca1) - https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_000328365.1/

New Value:

Task: Add the Capitella teleta genome and annotation to IGB.

Capitella teleta (Capca1) - https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_000328365.1/
NCBI:txid283909

Paige Kulzer (Inactive) added a comment - 10/Sep/24 3:41 PM - edited

Below is an outline of the steps I followed to create the Capitella teleta Quickload:

1. Convert genome .fasta to .2bit

rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/macOSX.arm64/faToTwoBit ./ 
./faToTwoBit GCA_000328365.1_Capca1_genomic.fna C_teleta_Jan_2013.2bit

2. Create genome.txt

rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/macOSX.arm64/twoBitInfo ./  
./twoBitInfo C_teleta_Jan_2013.2bit genome.txt

3. Get gene models from NCBI (.gff), then convert .gff to .bed

git clone git@bitbucket.org:lorainelab/genomesource.git

# Use $HOME rather than ~ here: the shell does not expand a tilde inside quotes.
path+=("$HOME/Documents/Repos/genomesource/")
export PYTHONPATH="${PYTHONPATH}:$HOME/Documents/Repos/genomesource/"

./gff3ToBedDetail.py -g ~/Downloads/genomic.gff -b ~/Downloads/C_teleta_Jan_2013_refGene.bed

4. Check if NCBI has any information for this genome using its txid (NCBI:txid283909) (Conclusion: it didn't)

brew install tnftp
ftp ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/

get gene2accession.gz
quit

gunzip -c gene2accession.gz | awk -F'\t' '$1 == "283909"' > 283909.gene2accession.txt

5. Sort, gzip, and tabix the .bed file made in step 3

sort -k1,1 -k2,2n C_teleta_Jan_2013_refGene.bed | bgzip > C_teleta_Jan_2013_refGene.bed.gz
tabix -0 -s 1 -b 2 -e 3 C_teleta_Jan_2013_refGene.bed.gz

6. Sanity check the .bed and .2bit files - Add the .2bit file as a reference, then drag/drop the .bed file into IGB. Confirm that gene models are present, labeled correctly, and the chromosomes listed are in a logical order.

7. Create annots.xml

brew install svn
svn checkout --username=guest --password=guest https://svn.bioviz.org/repos/genomes/quickload
cd quickload
svn mkdir C_teleta_Jan_2013
svn cp A_gambiae_Oct_2006/annots.xml C_teleta_Jan_2013/

8. Add the new genome to contents.txt and .htaccess
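
Part of the sanity check in step 6 can be scripted before opening IGB. The following is a minimal sketch, not part of the original workflow. It assumes that genome.txt is the two-column (sequence name, length) output of twoBitInfo from step 2 and that the BED file is the uncompressed, tab-delimited output of step 3. The function names are hypothetical and do not come from genomesource or IGB.

```shell
#!/bin/sh
# Hypothetical helpers for checking a BED file against genome.txt.
# Assumption: both files are tab-delimited and genome.txt is non-empty.

# Print each sequence name used in the BED file that is absent from genome.txt.
# Usage: bed_unknown_chroms genome.txt annotations.bed
bed_unknown_chroms() {
    awk -F'\t' 'NR == FNR { known[$1] = 1; next }
                !($1 in known) && !seen[$1]++ { print $1 }' "$1" "$2"
}

# Print BED lines whose start exceeds their end, or whose end runs past
# the sequence length recorded in genome.txt.
# Usage: bed_out_of_range genome.txt annotations.bed
bed_out_of_range() {
    awk -F'\t' 'NR == FNR { len[$1] = $2 + 0; next }
                ($1 in len) && ($2 + 0 > $3 + 0 || $3 + 0 > len[$1])' "$1" "$2"
}
```

For example, `bed_unknown_chroms genome.txt C_teleta_Jan_2013_refGene.bed` should print nothing if every gene model sits on a sequence present in the .2bit file. Empty output from both functions does not replace the visual check in IGB; it only rules out two kinds of mismatch.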

Paige Kulzer (Inactive) added a comment - 10/Sep/24 3:46 PM - edited

Before I "check-in" these changes via svn, please find a zipped copy of the new Quickload at the following location on Google Drive:
Path: research-big-lorainelab > IGB Project Documentation and Plans > IGB Genomes > C_teleta.zip
Link: https://drive.google.com/drive/folders/1bFRx4PqldxNf400n7Vr9SD_dNeNmtpvk?usp=drive_link

For this initial review, please download this .zip file to your computer and check that the genome and its annotations can be added to IGB without error. Also, please look over the steps I've outlined above and let me know if I missed any steps. After this review, I will plan to commit my changes via svn.

Paige Kulzer (Inactive) made changes - 10/Sep/24 3:46 PM

Status	In Progress [ 3 ]	Needs 1st Level Review [ 10005 ]

Paige Kulzer (Inactive) made changes - 10/Sep/24 3:46 PM

Assignee	Paige Kulzer [ pkulzer ]	Nowlan Freese [ nfreese ]

Paige Kulzer (Inactive) made changes - 11/Sep/24 8:29 AM

Sprint		Fall 1 [ 202 ]

Ann Loraine made changes - 16/Sep/24 8:34 AM

Sprint	Fall 1 [ 202 ]	Fall 1, Fall 2 [ 202, 203 ]

Ann Loraine made changes - 16/Sep/24 8:34 AM

Rank		Ranked higher

Paige Kulzer (Inactive) made changes - 18/Sep/24 8:58 AM

Rank		Ranked higher

Paige Kulzer (Inactive) made changes - 20/Sep/24 9:07 AM

Rank		Ranked lower

Nowlan Freese made changes - 27/Sep/24 10:30 AM

Sprint	Fall 1, Fall 2 [ 202, 203 ]	Fall 1, Fall 3 [ 202, 204 ]

Nowlan Freese made changes - 14/Oct/24 4:48 PM

Sprint	Fall 1, Fall 3 [ 202, 204 ]	Fall 1, Fall 4 [ 202, 205 ]

Nowlan Freese made changes - 29/Oct/24 2:13 PM

Sprint	Fall 1, Fall 4 [ 202, 205 ]	Fall 1, Fall 5 [ 202, 206 ]

Nowlan Freese made changes - 14/Nov/24 10:14 AM

Sprint	Fall 1, Fall 5 [ 202, 206 ]	Fall 1, Fall 6 [ 202, 207 ]

Nowlan Freese made changes - 18/Nov/24 10:02 AM

Sprint	Fall 1, Fall 6 [ 202, 207 ]	Fall 1, Fall 7 [ 202, 208 ]

Paige Kulzer (Inactive) made changes - 10/Dec/24 9:49 AM

Rank		Ranked higher

Nowlan Freese made changes - 10/Dec/24 10:31 AM

Status	Needs 1st Level Review [ 10005 ]	First Level Review in Progress [ 10301 ]

Nowlan Freese made changes - 10/Dec/24 11:29 AM

Attachment		species.txt [ 18577 ]
Attachment		synonyms.txt [ 18578 ]

Nowlan Freese added a comment - 10/Dec/24 11:30 AM - edited

1st level review:
Downloaded zip file and unpacked and added to IGB as new quickload data source.
Annotation loads for whole genome and I am able to load the sequence (no errors in log).

Couple of small things:

The annotation file has refGene in the name, and is also referred to as refGene in the annots.xml. I'm not sure that this is correct. I think refGene is a specific annotation created by the UCSC genome browser, which is what we use if we pull the annotation from the UCSC table browser. However, since this annotation is coming from NCBI I don't think it is refGene. I think refSeq is NCBI, but that might refer to a specific "level" of annotation (as in, this is the NCBI official version of the annotation). It looks like the annotation was originally deposited in GenBank, so maybe that would be more correct? Not sure, would need to dig into it more, but we probably shouldn't call it refGene.
Need to include the C teleta info in species.txt and synonyms.txt. These can exist in two locations, both in IGB itself and in the SVN repository. We should probably add them to IGB (maybe make a new ticket that has the species and synonyms ready for the 4 new genomes), but since that requires a new version of IGB we usually also add them to the SVN repo. I've attached an example species.txt and synonyms.txt for C teleta (I think they are working, but good to double check the IGB user's guide on creating species and synonyms).
Need to update the HEADER.md as it refers to pulling the genome from UCSC genome browser instead of NCBI.

Nowlan Freese made changes - 10/Dec/24 11:30 AM

Assignee	Nowlan Freese [ nfreese ]	Paige Kulzer [ pkulzer ]

Nowlan Freese made changes - 10/Dec/24 11:30 AM

Status	First Level Review in Progress [ 10301 ]	To-Do [ 10305 ]

Paige Kulzer (Inactive) made changes - 11/Dec/24 11:23 AM

Status	To-Do [ 10305 ]	In Progress [ 3 ]

Paige Kulzer (Inactive) added a comment - 11/Dec/24 3:50 PM

I've addressed the above change requests:

I edited the header to make it specific to NCBI rather than UCSC. I also tried to be as detailed as possible by including a link to the paper that published the genome and annotations.
I added species.txt and synonyms.txt to the zip file. I also cut down contents.txt to only include this genome.
I edited annots.xml to remove mention of refGene and refSeq, which were UCSC-specific terms.

Let me know if I've updated all of these files correctly!

Paige Kulzer (Inactive) made changes - 11/Dec/24 3:50 PM

Status	In Progress [ 3 ]	Needs 1st Level Review [ 10005 ]

Paige Kulzer (Inactive) made changes - 11/Dec/24 3:50 PM

Assignee	Paige Kulzer [ pkulzer ]	Nowlan Freese [ nfreese ]

Nowlan Freese made changes - 12/Dec/24 10:40 AM

Status	Needs 1st Level Review [ 10005 ]	First Level Review in Progress [ 10301 ]

Nowlan Freese added a comment - 12/Dec/24 11:20 AM - edited

Note on nomenclature: https://support.nlm.nih.gov/kbArticle/?pn=KA-03451

The format for GenBank (primary) assembly accessions is: [ GCA ][ _ ][nine digits][.][version number]
The format for RefSeq (NCBI-derived) assembly accessions is: [ GCF ][ _ ][nine digits][.][version number]
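
The two accession formats above can be told apart mechanically. The following is a small sketch under the assumption that only the GCA/GCF prefix and the nine-digit-plus-version pattern described in the note matter; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: classify an NCBI assembly accession by its prefix.
# GCA_ = GenBank (primary) assembly, GCF_ = RefSeq (NCBI-derived) assembly.
classify_assembly_accession() {
    if printf '%s\n' "$1" | grep -Eq '^GCA_[0-9]{9}\.[0-9]+$'; then
        echo "GenBank"
    elif printf '%s\n' "$1" | grep -Eq '^GCF_[0-9]{9}\.[0-9]+$'; then
        echo "RefSeq"
    else
        # Anything else does not match either documented format.
        echo "unknown"
        return 1
    fi
}
```

Called with the accession from this ticket's description, `classify_assembly_accession GCA_000328365.1` reports a GenBank assembly, which is consistent with the earlier review comment that the annotation was originally deposited in GenBank.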

Nowlan Freese made changes - 12/Dec/24 11:32 AM

Status	First Level Review in Progress [ 10301 ]	Needs 1st Level Review [ 10005 ]

Nowlan Freese made changes - 12/Dec/24 1:50 PM

Status	Needs 1st Level Review [ 10005 ]	First Level Review in Progress [ 10301 ]

Nowlan Freese added a comment - 12/Dec/24 2:47 PM

Tested locally on Mac, looks good.

Ready to push to SVN repo

Nowlan Freese made changes - 12/Dec/24 2:48 PM

Assignee	Nowlan Freese [ nfreese ]	Paige Kulzer [ pkulzer ]

Nowlan Freese made changes - 12/Dec/24 2:48 PM

Status	First Level Review in Progress [ 10301 ]	Ready for Pull Request [ 10304 ]

Paige Kulzer (Inactive) added a comment - 16/Dec/24 9:18 AM

The Capitella teleta genome has been pushed to the SVN repo. To do this, I had to use the username and password flags when checking in my changes, like so:

svn ci -m "IGBF-3889: Add Capitella teleta (Jan 2013) genome to IGB" --username pkulzer --password [insert password here]

Ready for final review!

Paige Kulzer (Inactive) made changes - 16/Dec/24 9:18 AM

Status	Ready for Pull Request [ 10304 ]	Pull Request Submitted [ 10101 ]

Paige Kulzer (Inactive) made changes - 16/Dec/24 9:18 AM

Status	Pull Request Submitted [ 10101 ]	Reviewing Pull Request [ 10303 ]

Paige Kulzer (Inactive) made changes - 16/Dec/24 9:18 AM

Status	Reviewing Pull Request [ 10303 ]	Merged Needs Testing [ 10002 ]

Paige Kulzer (Inactive) made changes - 16/Dec/24 9:18 AM

Assignee	Paige Kulzer [ pkulzer ]	Nowlan Freese [ nfreese ]

Ann Loraine added a comment - 19/Dec/24 2:38 PM

I have deployed the latest copy of quickload repository to:

RENCI hosting - http://igbquickload-main.bioviz.org/quickload/ (primary)
UNC Charlotte hosting - http://igbquickload.org/quickload/ (backup)

To test:

launch IGB and visit each new genome version (see above)
visit the subdirectories for each genome (by following the links above) and check that there is text describing the genome and datasets visible in IGB itself
within IGB Available Data section, click any "linkout" icons and make sure a Web page opens and that it goes to a place that describes the dataset somehow
check that when the datasets load, they look OK - gene models should be boxes with lines connecting them, for instance, and the track labels should be readable and should make sense ("making sense" is subjective, of course! Mainly we're looking for problems that could trip up a user and cause confusion.)

Nowlan Freese made changes - 20/Dec/24 9:34 AM

Status	Merged Needs Testing [ 10002 ]	Post-merge Testing In Progress [ 10003 ]

Nowlan Freese added a comment - 20/Dec/24 9:39 AM

Tested following the instructions above. Everything looks good.

Closing ticket.

Nowlan Freese made changes - 20/Dec/24 9:39 AM

Assignee	Nowlan Freese [ nfreese ]	Paige Kulzer [ pkulzer ]

Nowlan Freese made changes - 20/Dec/24 9:39 AM

Resolution		Done [ 10000 ]
Status	Post-merge Testing In Progress [ 10003 ]	Closed [ 6 ]

Nowlan Freese made changes - 21/Dec/24 6:57 AM

Assignee	Paige Kulzer [ pkulzer ]	Nowlan Freese [ nfreese ]

Nowlan Freese made changes - 21/Dec/24 6:57 AM

Assignee	Nowlan Freese [ nfreese ]	Paige Kulzer [ pkulzer ]
