[IGBF-3890] Add Hydra vulgaris genome to IGB - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
2
Epic Link:
Add genomes requested during SDB
Sprint:
Fall 1, Fall 7

Description

Task: Add the Hydra vulgaris genome and annotation to IGB. The current Hydra vulgaris genome version provided by Ensembl is Hydra_105_v3 (Feb 2022).

Hydra vulgaris (HydraT2T_AEP)(Apr 2024) - https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_038396675.1/

Attachments

Activity

Paige Kulzer (Inactive) added a comment - 12/Sep/24 9:40 AM

Below is an outline of the steps I followed to create the Hydra vulgaris Quickload:
1. Convert genome .fna to .2bit

gunzip GCF_038396675.1_HydraT2T_AEP_genomic.fna.gz
./faToTwoBit GCF_038396675.1_HydraT2T_AEP_genomic.fna H_vulgaris_Apr_2024.2bit

2. Create genome.txt

./twoBitInfo H_vulgaris_Apr_2024.2bit genome.txt

3. Get gene models from NCBI (.gff), then convert .gff to .bed

cd ~/Documents/Repos/genomesource
./gff3ToBedDetail.py -g ~/Downloads/genomic.gff -b ~/Downloads/H_vulgaris_Apr_2024_refGene.bed

4. Check whether UCSC has any information for this genome using its txid (NCBI:txid6087) and, if it does, compare the gene names/IDs to those present in the .bed file created in the previous step

cd ~/Downloads
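# keep only rows whose first tab-separated column (tax_id) is exactly 6087;
# awk is used because grep does not reliably treat '\t' as a tab
gunzip -c gene2accession.gz | awk -F'\t' '$1 == "6087"' > 6087.gene2accession.txt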

5. Sort, bgzip, and tabix-index the .bed file

sort -k1,1 -k2,2n H_vulgaris_Apr_2024_refGene.bed | bgzip > H_vulgaris_Apr_2024_refGene.bed.gz
tabix -0 -s 1 -b 2 -e 3 H_vulgaris_Apr_2024_refGene.bed.gz

6. Sanity check the .bed and .2bit files - Add the .2bit file as a reference, then drag/drop the .bed file into IGB. Confirm that gene models are present, labeled correctly, and the chromosomes listed are in a logical order. Also check that no error messages are present in the Log.

7. Create annots.xml and add _H_vulgaris_ to contents.txt and .htaccess

cd ~/Documents/Repos/quickload
svn mkdir H_vulgaris_Apr_2024
svn cp A_gambiae_Oct_2006/annots.xml H_vulgaris_Apr_2024
nano H_vulgaris_Apr_2024/annots.xml
nano contents.txt
nano .htaccess
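
Before the drag-and-drop check in step 6, the sort order and the sequence names can also be checked from the command line. The following is a minimal sketch and not part of the workflow above: the function names `bed_is_sorted` and `bed_unknown_seqs` are hypothetical, the BED data is read from standard input, and genome.txt is assumed to be the two-column file (sequence name, length) written by twoBitInfo in step 2.

```shell
# Hypothetical helpers, not part of the original workflow.

# Succeeds if the BED lines on stdin are already ordered by sequence name,
# then numerically by start, using the same sort keys as step 5.
bed_is_sorted() {
    sort -c -k1,1 -k2,2n 2>/dev/null
}

# Reads BED lines on stdin and prints each sequence name that does not
# appear in the first column of genome.txt ($1). No output means every
# name in the BED file is known to the genome.
bed_unknown_seqs() {
    awk -F'\t' 'NR == FNR { known[$1] = 1; next }
                !($1 in known) && !seen[$1]++ { print $1 }' "$1" -
}

# Usage (file names from the steps above):
#   gunzip -c H_vulgaris_Apr_2024_refGene.bed.gz | bed_is_sorted || echo "not sorted"
#   gunzip -c H_vulgaris_Apr_2024_refGene.bed.gz | bed_unknown_seqs genome.txt
```

A name reported by `bed_unknown_seqs` would point to a mismatch between the annotation and the .2bit file, which is the kind of problem the IGB Log check in step 6 is meant to surface.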

Paige Kulzer (Inactive) added a comment - 12/Sep/24 9:44 AM - edited

I've placed a zipped version of the new quickload folder in Google Drive for the reviewer to take a look at:
Path: research-big-lorainelab > IGB Project Documentation and Plans > IGB Genomes > H_vulgaris.zip
Link: https://drive.google.com/drive/folders/1bFRx4PqldxNf400n7Vr9SD_dNeNmtpvk?usp=drive_link

Question for the reviewer: Do I need to manually edit the HEADER.md file for genomes like this one that were pulled from NCBI rather than UCSC?

Paige Kulzer (Inactive) added a comment - 12/Dec/24 9:31 AM

I've edited the header to make it specific to NCBI rather than UCSC, added species.txt and synonyms.txt to the zip file, cut down contents.txt to only include this genome, and edited annots.xml to replace UCSC's "refGene" with NCBI's "refSeq".

Let me know if I've updated all of these files correctly!

Nowlan Freese added a comment - 12/Dec/24 3:05 PM

Testing:

The synonyms.txt has the wrong date (has Feb_2022 instead of Apr_2024) so it's not working correctly.
The 14th column of the bed file has a bunch of %2C which should be commas. Not sure why, but would probably be good to replace.
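
A quick way to catch a stale name in synonyms.txt is to confirm that the Quickload directory name appears in it. This is a minimal sketch under a stated assumption: synonyms.txt is taken to hold one tab-delimited row of equivalent genome version names per genome. The helper name `has_synonym_entry` is hypothetical.

```shell
# Hypothetical helper: succeeds if $1 appears as a complete tab-separated
# field on some line of the synonyms file $2.
# Assumption: synonyms.txt holds one tab-delimited row of equivalent
# genome version names per genome.
has_synonym_entry() {
    awk -F'\t' -v name="$1" '
        { for (i = 1; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }' "$2"
}

# Usage:
#   has_synonym_entry H_vulgaris_Apr_2024 synonyms.txt \
#       || echo "H_vulgaris_Apr_2024 is missing from synonyms.txt"
```

Matching whole fields rather than substrings means a row that only contains H_vulgaris_Feb_2022 is not mistaken for the Apr_2024 entry.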

Paige Kulzer (Inactive) added a comment - 13/Dec/24 11:38 AM

Good catch! I updated the date in synonyms.txt and replaced all instances of "%2C" with a comma. It appears this weird syntax was already present in the GFF file I downloaded from NCBI.
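
For context, GFF3 percent-encodes a comma inside an attribute value as %2C, because a bare comma separates multiple values, which would explain why the sequence was already in the NCBI file. Below is a minimal sketch of the replacement, limited to the 14th column mentioned in the testing comment. The function name `decode_commas_col14` is hypothetical, and this is not necessarily the command that was used for this Quickload.

```shell
# Hypothetical helper: reads tab-delimited BED lines on stdin and writes them
# back with every "%2C" in column 14 replaced by a literal comma.
# Lines with fewer than 14 columns pass through unchanged.
decode_commas_col14() {
    awk 'BEGIN { FS = OFS = "\t" }
         NF >= 14 { gsub(/%2C/, ",", $14) }
         { print }'
}

# Usage, run before the sort/bgzip/tabix step:
#   decode_commas_col14 < H_vulgaris_Apr_2024_refGene.bed > fixed.bed
```

Restricting the substitution to one column avoids touching any other field that might legitimately contain the same three characters.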

We also decided to use the common name denoted by NCBI rather than what we find on Google, so I've updated the common name to "Swiftwater hydra" in HEADER.md, contents.txt, and species.txt.

Ready for review!

Nowlan Freese added a comment - 13/Dec/24 1:29 PM

Testing: everything looks good.

Paige Kulzer (Inactive) added a comment - 16/Dec/24 9:24 AM

The Subversion repository appears to be down, which is preventing me from pushing this Quickload to the SVN site. When I try to check in my changes, svn responds with

svn: E200029: Commit failed (details follow):
svn: E200029: could not begin a transaction

And when I try to update my working copy with "svn update", svn is responding with

svn: E200029: Couldn't perform atomic initialization

Ann Loraine, could you restart the svn site and reattach the virtual hard drive storing the data like you did for IGBF-3748?

Paige Kulzer (Inactive) added a comment - 17/Dec/24 10:24 AM

The SVN site is back up and running, and the Hydra vulgaris genome has been pushed to the SVN repo.

Ready for final review!

Ann Loraine added a comment - 19/Dec/24 2:39 PM - edited

I have deployed the latest copy of quickload repository to:

RENCI hosting - http://igbquickload-main.bioviz.org/quickload/ (primary)
UNC Charlotte hosting - http://igbquickload.org/quickload/ (backup)

To test:

launch IGB and visit each new genome version (see above)
visit the subdirectories for each genome (by following the links above) and check that there is text describing the genome and the datasets visible in IGB itself
within the IGB Available Data section, click any "linkout" icons and make sure a Web page opens and that it goes to a place that describes the dataset somehow
check that when the datasets load, they look OK - gene models should be boxes with lines connecting them, for instance, and the track labels should be readable and should make sense ("making sense" is subjective, of course! Mainly we're looking for problems that could trip up a user and cause confusion.)
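
Part of the directory check can be scripted ahead of the manual review in IGB. The sketch below is an illustration under assumptions, not an established test procedure: the helper names are hypothetical, and it assumes each genome directory contains annots.xml, genome.txt, and HEADER.md, as in the Quickload folder discussed in this ticket. It requires curl and network access.

```shell
# Hypothetical helpers for spot-checking a deployed Quickload site.
# Assumption: each genome directory holds annots.xml, genome.txt and
# HEADER.md, as in the folder described in this ticket.

# Prints one URL per expected file for genome $2 under base URL $1.
quickload_urls() {
    for f in annots.xml genome.txt HEADER.md; do
        printf '%s/%s/%s\n' "${1%/}" "$2" "$f"
    done
}

# Fetches each URL, discards the body, and reports the ones that fail.
# Needs curl and network access.
check_quickload() {
    quickload_urls "$1" "$2" | while read -r url; do
        curl -fsS -o /dev/null "$url" 2>/dev/null || echo "FAILED: $url"
    done
}

# Usage:
#   check_quickload http://igbquickload-main.bioviz.org/quickload/ H_vulgaris_Apr_2024
#   check_quickload http://igbquickload.org/quickload/ H_vulgaris_Apr_2024
```

This only confirms that the files are being served; whether the gene models and track labels look right still has to be judged in IGB.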

Nowlan Freese added a comment - 20/Dec/24 9:45 AM

Tested following instructions above. Everything looks good.

Closing ticket.

People

Assignee:

Paige Kulzer (Inactive)

Reporter:

Paige Kulzer (Inactive)

Votes:

0

Watchers:

3

Dates

Created:

05/Sep/24 1:51 PM

Updated:

20/Dec/24 9:45 AM

Resolved:

20/Dec/24 9:45 AM