[IGBF-3892] Add Vanessa cardui genome to IGB - JIRA UNCC

Paige Kulzer created issue - 05/Sep/24 1:52 PM

Paige Kulzer made changes - 05/Sep/24 1:52 PM

Field	Original Value	New Value
Epic Link		IGBF-3823 [ 23122 ]

Paige Kulzer made changes - 20/Sep/24 8:15 AM

Description	Task: Add the Vanessa cardui genome and annotation to IGB.	Task: Add the Vanessa cardui genome and annotation to IGB. Link to genome on NCBI: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_905220365.1/
Sprint		Fall 2 [ 203 ]

Paige Kulzer made changes - 20/Sep/24 8:15 AM

Status	To-Do [ 10305 ]	In Progress [ 3 ]

Paige Kulzer added a comment - 20/Sep/24 8:22 AM

Below is an outline of the steps I followed to create this Quickload:

1. Use wget to obtain the .2bit file from UCSC's track hub directory, then rename it

wget https://hgdownload.soe.ucsc.edu/hubs/GCF/905/220/365/GCF_905220365.1/GCF_905220365.1.2bit
mv GCF_905220365.1.2bit ilVanCard2.1.2bit

2. Create genome.txt, then check that the chromosomes are ordered logically (i.e., numerically)

./twoBitInfo ilVanCard2.1.2bit genome.txt
cat genome.txt

3. Use Vanessa cardui's taxID (171605) to get the information needed from gene2accession.gz and gene_info.gz to create the BED14 file in a later step

gunzip -c gene2accession.gz | awk -F'\t' '$1 == "171605"' > 171605.gene2accession.txt
gunzip -c gene_info.gz | awk -F'\t' '$1 == "171605"' > 171605.gene_info.txt

4. Download the RefSeqAll BED file from UCSC's table browser (Link: https://genome.ucsc.edu/cgi-bin/hgTables), then create the BED14 file using the following code:

cd ~/Documents/Repos/genomesource/
./ucscToBedDetail.py -a ~/Downloads/171605.gene2accession.txt -g ~/Downloads/171605.gene_info.txt ~/Downloads/V_cardui_ncbiRefSeq.bed.gz ~/Downloads/V_cardui_Feb_2021_ncbiRefSeq.bed

5. Sort, bgzip, and tabix-index the BED14 file

cd ~/Downloads/
sort -k1,1 -k2,2n V_cardui_Feb_2021_ncbiRefSeq.bed | bgzip > V_cardui_Feb_2021_ncbiRefSeq.bed.gz
tabix -0 -s 1 -b 2 -e 3 V_cardui_Feb_2021_ncbiRefSeq.bed.gz

6. Sanity check the 2bit and BED files - Add the 2bit file as a reference, then drag/drop the BED files into IGB. Confirm that gene models are present, labeled correctly, and that no error messages are present in the Log.

7. Create a new directory in the quickload repo, then create annots.xml

cd ~/Documents/Repos/quickload/
svn mkdir V_cardui_Feb_2021
svn cp A_gambiae_Feb_2003/annots.xml V_cardui_Feb_2021
nano V_cardui_Feb_2021/annots.xml

8. Add V_cardui_Feb_2021 to contents.txt and .htaccess

nano contents.txt

V_cardui_Feb_2021 Vanessa cardui (Feb 2021) painted lady (ilVanCard2.1)

nano .htaccess

AddDescription "Vanessa cardui (Feb 2021) painted lady (ilVanCard2.1)" V_cardui_Feb_2021

9. Create HEADER.md

../genomesource/writeQuickLoadHeaderUCSC.py V_cardui_Feb_2021 > V_cardui_Feb_2021/HEADER.md
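
Before the sanity check in step 6, it may help to confirm that every sequence name in the BED14 file also appears in genome.txt. The following is a minimal sketch, assuming genome.txt has the sequence name in its first tab-separated column (as written by twoBitInfo) and that the BED file is uncompressed and has no header lines. The function name bed_unknown_chroms is hypothetical and not part of the genomesource repo:

```shell
# Print each sequence name that occurs in the BED file but not in genome.txt,
# followed by the number of BED lines that use it. Prints nothing if all match.
# bed_unknown_chroms is a hypothetical helper name.
bed_unknown_chroms() {
    genome_file=$1
    bed_file=$2
    awk -F'\t' '
        NR == FNR { known[$1] = 1; next }   # first file: record known names
        !($1 in known) { missing[$1]++ }    # second file: count unknown names
        END { for (name in missing) print name "\t" missing[name] }
    ' "$genome_file" "$bed_file" | sort
}
```

Example call: bed_unknown_chroms genome.txt V_cardui_Feb_2021_ncbiRefSeq.bed. Empty output suggests the names agree; the sketch assumes genome.txt is not empty.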

Paige Kulzer added a comment - 20/Sep/24 8:25 AM - edited

For review: I have zipped up a folder containing all of the relevant files needed to create this Quickload and placed it in the shared Google Drive. Please take a look at these files and load the 2bit and BED files in IGB to double check that they're working properly.

Location of the Quickload on Google Drive:
Path: research-big-lorainelab > IGB Project Documentation and Plans > IGB Genomes > V_cardui.zip
Link: https://drive.google.com/drive/folders/1bFRx4PqldxNf400n7Vr9SD_dNeNmtpvk?usp=drive_link

Paige Kulzer made changes - 20/Sep/24 8:26 AM

Status	In Progress [ 3 ]	Needs 1st Level Review [ 10005 ]

Paige Kulzer made changes - 20/Sep/24 8:26 AM

Assignee	Paige Kulzer [ pkulzer ]	Nowlan Freese [ nfreese ]

Paige Kulzer made changes - 20/Sep/24 8:26 AM

Story Points	2	1

Paige Kulzer made changes - 20/Sep/24 9:07 AM

Rank		Ranked higher

Nowlan Freese made changes - 27/Sep/24 10:30 AM

Sprint	Fall 2 [ 203 ]	Fall 3 [ 204 ]

Nowlan Freese made changes - 14/Oct/24 4:47 PM

Sprint	Fall 3 [ 204 ]	Fall 4 [ 205 ]

Nowlan Freese made changes - 29/Oct/24 2:13 PM

Sprint	Fall 4 [ 205 ]	Fall 5 [ 206 ]

Nowlan Freese made changes - 14/Nov/24 10:13 AM

Sprint	Fall 5 [ 206 ]	Fall 6 [ 207 ]

Nowlan Freese made changes - 18/Nov/24 10:02 AM

Sprint	Fall 6 [ 207 ]	Fall 7 [ 208 ]

Nowlan Freese made changes - 10/Dec/24 2:38 PM

Assignee	Nowlan Freese [ nfreese ]	Paige Kulzer [ pkulzer ]

Nowlan Freese made changes - 10/Dec/24 2:38 PM

Status	Needs 1st Level Review [ 10005 ]	First Level Review in Progress [ 10301 ]

Nowlan Freese made changes - 10/Dec/24 2:38 PM

Status	First Level Review in Progress [ 10301 ]	To-Do [ 10305 ]

Paige Kulzer made changes - 12/Dec/24 9:43 AM

Status	To-Do [ 10305 ]	In Progress [ 3 ]

Paige Kulzer added a comment - 12/Dec/24 10:25 AM

I've edited the header to make it specific to NCBI rather than UCSC, added species.txt and synonyms.txt to the zip file, cut down contents.txt to only include this genome, and edited annots.xml to replace UCSC's "refGene" with NCBI's "refSeq".

Let me know if I've updated all of these files correctly!

Paige Kulzer made changes - 12/Dec/24 10:25 AM

Status	In Progress [ 3 ]	Needs 1st Level Review [ 10005 ]

Paige Kulzer made changes - 12/Dec/24 10:25 AM

Assignee	Paige Kulzer [ pkulzer ]	Nowlan Freese [ nfreese ]

Nowlan Freese made changes - 13/Dec/24 11:42 AM

Status	Needs 1st Level Review [ 10005 ]	First Level Review in Progress [ 10301 ]

Nowlan Freese added a comment - 13/Dec/24 12:04 PM

Testing: Everything looks good

Nowlan Freese made changes - 13/Dec/24 12:04 PM

Assignee	Nowlan Freese [ nfreese ]	Paige Kulzer [ pkulzer ]

Nowlan Freese made changes - 13/Dec/24 12:04 PM

Status	First Level Review in Progress [ 10301 ]	Ready for Pull Request [ 10304 ]

Paige Kulzer added a comment - 17/Dec/24 10:25 AM

The Vanessa cardui genome has been pushed to the SVN repo.

Ready for final review!

Paige Kulzer made changes - 17/Dec/24 10:25 AM

Status	Ready for Pull Request [ 10304 ]	Pull Request Submitted [ 10101 ]

Paige Kulzer made changes - 17/Dec/24 10:25 AM

Status	Pull Request Submitted [ 10101 ]	Reviewing Pull Request [ 10303 ]

Paige Kulzer made changes - 17/Dec/24 10:25 AM

Status	Reviewing Pull Request [ 10303 ]	Merged Needs Testing [ 10002 ]

Paige Kulzer made changes - 17/Dec/24 10:25 AM

Assignee	Paige Kulzer [ pkulzer ]	Nowlan Freese [ nfreese ]

Ann Loraine added a comment - 19/Dec/24 2:39 PM

I have deployed the latest copy of quickload repository to:

RENCI hosting - http://igbquickload-main.bioviz.org/quickload/ (primary)
UNC Charlotte hosting - http://igbquickload.org/quickload/ (backup)
To test:

launch IGB and visit each new genome version (see above)
visit the subdirectories for each genome (by following the links above) and check that there is text describing the genome and datasets visible in IGB itself
within IGB Available Data section, click any "linkout" icons and make sure a Web page opens and that it goes to a place that describes the dataset somehow
check that when the datasets load, they look OK - gene models should be boxes with lines connecting them, for instance, and the track labels should be readable and should make sense ("making sense" is subjective, of course! Mainly we're looking for problems that could trip up a user and cause confusion.)
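
The manual checks above could be preceded by a quick look at whether the expected files are reachable on both hosts. This is a minimal sketch, assuming the genome directory sits directly under each quickload root and contains genome.txt, annots.xml, HEADER.md, and a 2bit file named after the directory. The function names and the file list are assumptions for illustration:

```shell
# Hypothetical helpers; the two roots are the deployment targets listed above.
QUICKLOAD_ROOTS="http://igbquickload-main.bioviz.org/quickload http://igbquickload.org/quickload"

# Print one URL per host and file for the given genome directory name.
quickload_urls() {
    genome=$1
    for root in $QUICKLOAD_ROOTS; do
        for file in genome.txt annots.xml HEADER.md "$genome.2bit"; do
            printf '%s/%s/%s\n' "$root" "$genome" "$file"
        done
    done
}

# Request headers only for each URL; report OK or FAIL from curl's exit status.
check_quickload() {
    quickload_urls "$1" | while read -r url; do
        if curl --silent --fail --head --location "$url" > /dev/null; then
            echo "OK   $url"
        else
            echo "FAIL $url"
        fi
    done
}
```

Running check_quickload V_cardui_Feb_2021 needs network access. A FAIL line only shows that a file is unreachable, so the visual checks in IGB are still needed.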

Nowlan Freese made changes - 20/Dec/24 10:20 AM

Status	Merged Needs Testing [ 10002 ]	Post-merge Testing In Progress [ 10003 ]

Nowlan Freese added a comment - 20/Dec/24 10:29 AM

Paige Kulzer - the 2bit file has an unexpected name and isn't loading correctly. Can you rename it to V_cardui_Feb_2021.2bit, push it to the SVN repo, and then ask Dr. Loraine to deploy it?
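
A check like the following could catch this kind of naming mismatch before deployment. It is a minimal sketch that assumes the 2bit file is expected to be named after its genome directory, as the requested name V_cardui_Feb_2021.2bit suggests. The function name check_2bit_name is hypothetical:

```shell
# Succeed only if <dir>/<dirname>.2bit exists; otherwise report any other
# .2bit files found in the directory and return a non-zero status.
# check_2bit_name is a hypothetical helper, not part of the quickload repo.
check_2bit_name() {
    dir=${1%/}                                  # drop a trailing slash, if any
    expected="$dir/$(basename "$dir").2bit"
    if [ -f "$expected" ]; then
        echo "OK: $expected"
        return 0
    fi
    echo "Missing: $expected" >&2
    for found in "$dir"/*.2bit; do
        if [ -e "$found" ]; then
            echo "Found instead: $found" >&2
        fi
    done
    return 1
}
```

Example call from the quickload working copy: check_2bit_name V_cardui_Feb_2021.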

Nowlan Freese made changes - 20/Dec/24 10:30 AM

Assignee	Nowlan Freese [ nfreese ]	Paige Kulzer [ pkulzer ]

Nowlan Freese made changes - 20/Dec/24 10:30 AM

Status	Post-merge Testing In Progress [ 10003 ]	To-Do [ 10305 ]

Paige Kulzer added a comment - 20/Dec/24 10:42 AM

Ann Loraine, as per Nowlan's comment above, I've updated the 2bit file's name to "V_cardui_Feb_2021.2bit" and have pushed it to the SVN repo. It's ready for you to deploy!

Paige Kulzer made changes - 20/Dec/24 10:42 AM

Status	To-Do [ 10305 ]	In Progress [ 3 ]

Paige Kulzer made changes - 20/Dec/24 10:42 AM

Status	In Progress [ 3 ]	Needs 1st Level Review [ 10005 ]

Paige Kulzer made changes - 20/Dec/24 10:42 AM

Status	Needs 1st Level Review [ 10005 ]	First Level Review in Progress [ 10301 ]

Paige Kulzer made changes - 20/Dec/24 10:42 AM

Status	First Level Review in Progress [ 10301 ]	Ready for Pull Request [ 10304 ]

Paige Kulzer made changes - 20/Dec/24 10:42 AM

Status	Ready for Pull Request [ 10304 ]	Pull Request Submitted [ 10101 ]

Paige Kulzer made changes - 20/Dec/24 10:42 AM

Assignee	Paige Kulzer [ pkulzer ]	Ann Loraine [ aloraine ]

Ann Loraine added a comment - 21/Dec/24 9:08 AM

Update:

New file is deployed and old one is deleted - I updated quickloads hosted at RENCI and UNC Charlotte

Ann Loraine added a comment - 21/Dec/24 9:21 AM - edited

Testing:

It looks like the documentation describing the data files needs to be improved.

The description of the files and where they came from in the comments above doesn't match the documentation. Specifically, the origin of the files is different. It looks like we got the files from the UCSC Genome Bioinformatics Web site and then used our tools to convert them to our preferred formats.

Suggestion:

Since we are doing a pretty good job of describing our process in the comments above (thanks Paige Kulzer) could we simply link to this ticket from our header file?

We could provide some very generic – but accurate – descriptions of where the files came from and how we processed them. We could also then provide a direct link to this Jira ticket, which should be publicly accessible.

I recommend we create a new ticket, "Improve documentation for painted lady genome assembly quickload".

Once that documentation is more complete, we could then get in touch with researchers working with this assembly and get their input. Also, we should make sure to link to articles for this particular genome assembly so that users will be able to get more information.

Ann Loraine made changes - 21/Dec/24 9:21 AM

Status	Pull Request Submitted [ 10101 ]	Reviewing Pull Request [ 10303 ]

Ann Loraine made changes - 21/Dec/24 9:21 AM

Status	Reviewing Pull Request [ 10303 ]	To-Do [ 10305 ]

Ann Loraine made changes - 21/Dec/24 9:21 AM

Assignee	Ann Loraine [ aloraine ]	Paige Kulzer [ pkulzer ]

Ann Loraine made changes - 21/Dec/24 9:45 AM

Link		This issue relates to IGBF-4038 [ IGBF-4038 ]

Ann Loraine added a comment - 21/Dec/24 9:47 AM

I made the recommended ticket for improving the genome assembly documentation. Closing this one now.

Ann Loraine made changes - 21/Dec/24 9:47 AM

Status	To-Do [ 10305 ]	Pull Request Submitted [ 10101 ]

Ann Loraine made changes - 21/Dec/24 9:47 AM

Status	Pull Request Submitted [ 10101 ]	Reviewing Pull Request [ 10303 ]

Ann Loraine made changes - 21/Dec/24 9:47 AM

Status	Reviewing Pull Request [ 10303 ]	Merged Needs Testing [ 10002 ]

Ann Loraine made changes - 21/Dec/24 9:47 AM

Status	Merged Needs Testing [ 10002 ]	Post-merge Testing In Progress [ 10003 ]

Ann Loraine made changes - 21/Dec/24 9:47 AM

Resolution		Done [ 10000 ]
Status	Post-merge Testing In Progress [ 10003 ]	Closed [ 6 ]
