Details
- Type: Task
- Status: Closed
- Priority: Major
- Resolution: Done
- Affects Version/s: None
- Fix Version/s: None
- Labels: None
- Story Points: 2
- Epic Link:
- Sprint: Spring 1, Spring 2
Description
Situation: Dr. Camellia Okpodu of the University of Wyoming is studying the effects of herbicides on Euphorbia esula (leafy spurge), a noxious invasive weed. As of January 2026 there does not appear to be a good reference genome for leafy spurge (NCBI link), so Dr. Okpodu has aligned her data to Euphorbia lathyris (NCBI link).
Task: Add Euphorbia lathyris and its annotation to IGB Quickload. Also, we're increasing scope to include adding this genome to the IGB carousel.
Link: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_963576675.1/
Note: Only the NCBI RefSeq assembly (GCF) includes the annotation, so please use that one.
Issue Links
- relates to: IGBF-4408 Add Euphorbia lathyris image to IGB 10.3.0 home screen carousel (To-Do)
Activity
Below is an outline of the steps I followed to create the Euphorbia lathyris Quickload:
1. Convert genome .fna to .2bit
./faToTwoBit GCF_963576675.1_ddEupLath1.1_genomic.fna E_lathyris_Nov_2023.2bit
2. Create genome.txt
./twoBitInfo E_lathyris_Nov_2023.2bit genome.txt
3. Convert gene models from .gff to .bed
~/Documents/Repos/genomesource/gff3ToBedDetail.py -g genomic.gff -b E_lathyris_Nov_2023_ncbiRefSeq.bed
4. Convert to BED14
gunzip -c gene2accession.gz | grep -p '^212925\t' > 212925.gene2accession.txt
gunzip -c gene_info.gz | grep -p '^212925\t' > 212925.gene_info.txt
~/Documents/Repos/genomesource/ucscToBedDetail.py -a 212925.gene2accession.txt -g 212925.gene_info.txt E_lathyris_Nov_2023_ncbiRefSeq.bed.gz E_lathyris_Nov_2023_ncbiRefSeq.bed
gzip E_lathyris_Nov_2023_ncbiRefSeq.bed
5. Sort, gzip, and tabix the .bed file
sort -k1,1 -k2,2n E_lathyris_Nov_2023_ncbiRefSeq.bed | bgzip > E_lathyris_Nov_2023_ncbiRefSeq.bed.gz
tabix -0 -s 1 -b 2 -e 3 E_lathyris_Nov_2023_ncbiRefSeq.bed.gz
6. Sanity check the .bed and .2bit files - Add the .2bit file as a reference, then drag/drop the .bed file into IGB. Confirm that gene models are present, labeled correctly, and the chromosomes listed are in a logical order. Also check that no error messages are present in the Log.
7. Create annots.xml and add E_lathyris to contents.txt and .htaccess
mkdir E_lathyris_Nov_2023
cp V_cardui_Feb_2021/annots.xml S_chinensis_Apr_2019
nano annots.xml
cp H_vulgaris_Apr_2024/HEADER.md S_chinensis_Apr_2019
nano HEADER.md
nano contents.txt
nano .htaccess
nano synonyms.txt
nano species.txt
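Part of the step 6 sanity check could also be scripted before opening IGB. The sketch below is only an illustration, not part of the genomesource repository. It assumes genome.txt has the two-column layout (sequence name, length) that twoBitInfo writes and that the BED file is tab-delimited. The function names are hypothetical.

```python
import gzip


def read_genome_sizes(path):
    """Read a two-column genome.txt (sequence name, length) into a dict."""
    sizes = {}
    with open(path) as handle:
        for line in handle:
            if line.strip():
                name, length = line.split()[:2]
                sizes[name] = int(length)
    return sizes


def check_bed_against_genome(bed_path, sizes):
    """Return (line number, problem) pairs for BED rows that do not fit the genome."""
    opener = gzip.open if bed_path.endswith(".gz") else open
    problems = []
    with opener(bed_path, "rt") as handle:
        for number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                problems.append((number, "fewer than 3 fields"))
                continue
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            if chrom not in sizes:
                problems.append((number, f"unknown sequence {chrom}"))
            elif not 0 <= start < end <= sizes[chrom]:
                problems.append((number, f"coordinates outside {chrom}"))
    return problems
```

An empty list from check_bed_against_genome("E_lathyris_Nov_2023_ncbiRefSeq.bed.gz", read_genome_sizes("genome.txt")) would mean every gene model sits on a known sequence and within its length. It does not replace the visual check in IGB.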
I've placed a zipped version of the new quickload folder in Google Drive for the reviewer to take a look at:
Path: research-big-lorainelab > IGB Project Documentation and Plans > IGB Genomes > E_lathyris_Nov_2023.zip
Link: https://drive.google.com/file/d/1G6dGICTfN3-crs3mWXE6zwT1TpCQ5RSe/view?usp=drive_link
Ready for review!
Thanks for providing the detailed steps in the previous comment! It helped me remember the process and better understand where problems can occur.
Here are some change requests:
- Add a new "file" tag that points to the reference genome sequence. Otherwise, the user won't be able to load the genome sequence. This extra "file" tag refers to just the reference sequence and should not be part of the gene models file tag. See: https://wiki.bioviz.org/confluence/display/igbman/About+annots.xml and the "reference" attribute
- For the gene models file tag, change the "title" field to "NCBI". I think this will enable the NCBI Web links mechanism to work (see Tools > Configure Web Links in IGB). Since the gene models come directly from an NCBI file, we should do our level best to enable users to right-click on gene models and see the NCBI option. Also, once that is working again, check that the Web links mechanism forms the correct, up-to-date link into the corresponding records at NCBI. We might need to change the pattern or possibly even the track name.
- The bed-detail file format is buggy. There are some lines with 16 fields instead of 14. For example:
NC_088910.1 80231 83288 rna-XM_066002700.1 0 - 80231 83288 0 8 279,167,184,84,19,239,135,66, 0,374,755,1025,1888,1953,2288,2991, gene-LOC136211153 MACPF domain-containing protein NSL1-like NA NA
I guess there is a problem with the GFF parsing code somewhere?
- The accessions used to link back to records in RefSeq and NCBI all have prefixes like "gene-" and "rna-". These should be removed. If the prefixes are removed, then I think the Web links will work properly. For example, the entire string (e.g., gene-LOC136211153) is getting used as the query string, when really it should be LOC136211153.
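One way to find the lines with the wrong number of fields is to tally field counts across the whole file. This is a minimal sketch, assuming the file is tab-delimited and that 14 fields is the expected width, as stated above. The function names are hypothetical.

```python
import gzip
from collections import Counter

EXPECTED_FIELDS = 14  # expected width of a bed-detail line, per the review comment


def _open_text(path):
    """Open a plain or gzip/bgzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def field_count_histogram(bed_path):
    """Count how many non-blank lines have each number of tab-separated fields."""
    counts = Counter()
    with _open_text(bed_path) as handle:
        for line in handle:
            if line.strip():
                counts[len(line.rstrip("\n").split("\t"))] += 1
    return counts


def malformed_line_numbers(bed_path, expected=EXPECTED_FIELDS):
    """Return 1-based line numbers whose field count differs from `expected`."""
    bad = []
    with _open_text(bed_path) as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip() and len(line.rstrip("\n").split("\t")) != expected:
                bad.append(number)
    return bad
```

A histogram with a single key, 14, would suggest the file is uniformly formatted. Any other key points to lines worth inspecting by hand.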
I referenced the Aedes albopictus annots.xml file to add that file tag. I also changed the title field as requested. Finally, I re-ran the above commands but skipped step 4, which is where the extra columns were being added, and manually removed any erroneous text from the resulting BED file.
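The manual cleanup could probably be scripted if this comes up again. The sketch below is an assumption about what that cleanup involves, not a record of what was actually done. It strips the "gene-" and "rna-" prefixes from the name column (field 4) and the gene id column (field 13), the positions they occupy in the example line quoted earlier. The function names and column defaults are hypothetical.

```python
PREFIXES = ("gene-", "rna-")  # prefixes observed in the review comment


def strip_prefix(identifier):
    """Remove a leading 'gene-' or 'rna-' from an identifier, if present."""
    for prefix in PREFIXES:
        if identifier.startswith(prefix):
            return identifier[len(prefix):]
    return identifier


def clean_bed_line(line, id_columns=(3, 12)):
    """Strip id prefixes from the given 0-based columns of one tab-delimited BED line."""
    fields = line.rstrip("\n").split("\t")
    for index in id_columns:
        if index < len(fields):
            fields[index] = strip_prefix(fields[index])
    return "\t".join(fields)
```

For example, strip_prefix("gene-LOC136211153") returns "LOC136211153", the form the Web links query needs.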
An updated version of the quickload can be found here for review: https://drive.google.com/file/d/1JzK4eAujkOb9WiaMtReAx19ob5bjjbQL/view?usp=drive_link
One last request!
Please change the "url" attribute for the gene models file from:
url="E_lathyris_Nov_2023"
to
url="https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_963576675.1"
I have updated the annots.xml file with that requested change and re-uploaded E_lathyris_Nov_2023_updated.zip to Google Drive. Ready for review!
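A quick scripted check of annots.xml might catch a missing reference tag before review. This is a hedged sketch: it assumes "file" elements carry "name", "title", "url", and "reference" attributes, and that the reference sequence is marked with reference="true". The "files" root element, the "Genome sequence" title, and the function name are illustrative assumptions. The About annots.xml wiki page linked above remains the authority.

```python
import xml.etree.ElementTree as ET


def check_annots(xml_text):
    """Return a list of human-readable problems found in an annots.xml string."""
    root = ET.fromstring(xml_text)
    files = [element.attrib for element in root.iter("file")]
    problems = []
    # Assumption: the reference sequence file is flagged with reference="true".
    references = [f for f in files if f.get("reference", "").lower() == "true"]
    if len(references) != 1:
        problems.append(
            f"expected exactly one reference sequence file tag, found {len(references)}"
        )
    for attributes in files:
        label = attributes.get("name", "<unnamed>")
        if "name" not in attributes:
            problems.append("file tag without a name attribute")
        if "title" not in attributes:
            problems.append(f"{label}: missing title attribute")
    return problems


# Hypothetical example content; attribute values other than the file names,
# the "NCBI" title, and the NCBI url are made up for illustration.
EXAMPLE = """<files>
  <file name="E_lathyris_Nov_2023.2bit" title="Genome sequence" reference="true"/>
  <file name="E_lathyris_Nov_2023_ncbiRefSeq.bed.gz" title="NCBI"
        url="https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_963576675.1"/>
</files>"""
```

Calling check_annots(EXAMPLE) yields an empty list. Removing the first file tag would report the missing reference.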
I reached out to Dr. Okpodu to see if she had any pictures of the species that we might be able to use in the IGB carousel. She didn't have any, but she found one available under a Creative Commons license. Per Dr. Okpodu, we can use it as long as we give attribution (Frank Vincentz, CC BY-SA 3.0 <http://creativecommons.org/licenses/by-sa/3.0/>, via Wikimedia Commons). I'll attach the image to this ticket.
Next steps will be to remind myself how to add an image to the carousel and then do that as part of this ticket.
Request for Paige Kulzer:
- Please don't attach the image to this ticket but instead just add a link to it in a comment.
Testing:
- I downloaded the newly made Quickload zipfile, unpacked it, and added the resulting folder to IGB 10.2.0 as a new Quickload source
- I visited the Species and Genome Version menus and observed that the expected species with tooltip was present and that the expected genome version was present
- I then selected the new species Euphorbia lathyris and genome version E_lathyris_Nov_2023 and observed the gene models loaded for 33 sequences
- I checked that the Load Sequence button works as expected. It does: the genome sequence loads when I click the button.
- I checked that the Advanced Search tab shows the expected values and rows when I searched for regular expressions matching gene model ids.
- Here is a summary of the gene model ids:
35,425 ids match pattern XM_
3,746 ids match pattern XR_
TOTAL matches: 39,171
- I used the Terminal to check the number of lines in the gene models file
aloraine@Encantada-de-conocerte E_lathyris_Nov_2023 % gunzip -c E_lathyris_Nov_2023_ncbiRefSeq.bed.gz | wc -l
39171
The number of lines and ids are consistent with each other.
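The same consistency check can be done in one pass outside of IGB. This sketch assumes the gene model id is in the fourth tab-separated column and that the "rna-" prefixes have already been removed, so ids start with XM_ or XR_. The function name is hypothetical.

```python
import gzip
from collections import Counter


def count_id_prefixes(bed_path, prefixes=("XM_", "XR_")):
    """Return (total lines, Counter of id prefixes) for a gzip-compressed BED file."""
    counts = Counter()
    total = 0
    with gzip.open(bed_path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue
            total += 1
            name = line.rstrip("\n").split("\t")[3]
            for prefix in prefixes:
                if name.startswith(prefix):
                    counts[prefix] += 1
                    break
            else:
                # Ids matching none of the expected prefixes are tallied separately.
                counts["other"] += 1
    return total, counts
```

If the file matches the summary above, the XM_ and XR_ counts should add up to the total and "other" should be absent.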
Other observations:
- I noticed that the Description field in the Advanced Search tab contains the expected user-friendly descriptions of gene functions associated with gene models. This is great!
- I noticed that many loci (gene names look like LOC136233124) had multiple gene models corresponding to different transcript variants, with descriptions different from the other models. The descriptions sometimes specified the variant. For example, LOC135224288 had gene model "XM_066012574.1" with description "uncharacterized protein, transcript variant X1" and also had gene model "XR_010686360.1" with description "uncharacterized protein, transcript variant X5"
The sophistication of the annotation is quite good! This seems like a high-quality genome assembly, with only 33 well-assembled sequences and 39,171 assembled, well-annotated transcripts.
I have a question about the synonyms.txt and species.txt files included in the Quickload folder and currently "live" on Quickload main:
- Are these files identical to the corresponding files in IGB release 10.2.0, except for the addition of the Euphorbia lathyris species and genome version information?
- Are the species.txt and synonyms.txt files currently "live" in Quickload main the same as the species.txt and synonyms.txt files currently included in IGB release 10.2.0?
Investigating the above question.
While Ann investigates the above questions, I've gone ahead and added an image for the Euphorbia lathyris genome to the IGB carousel.
Branch: https://bitbucket.org/pkulzer-lorainelab/integrated-genome-browser/branch/IGBF-4386
Image: https://commons.wikimedia.org/wiki/Euphorbia_lathyris#/media/File:Euphorbia_lathyris_ies.jpg
Update:
- The IGB codebase synonyms.txt in "integrated-genome-browser/core/synonym-lookup/src/main/resources" lacks the new line for E_lathyris_Nov_2023 as expected.
- However, the IGB codebase synonyms.txt also lacks a line that is present in the new candidate Quickload's copy. The missing line is: "S_chinensis_Apr_2019 sc1"
- The copy of synonyms.txt present in the subversion repository also has the line "S_chinensis_Apr_2019 sc1"
Looks like the copy currently in subversion is more recent. Once we update the subversion version with Paige Kulzer's new synonyms.txt, we can update the IGB git repository using this same new file.
The IGB codebase's file species.txt also needs to be updated because it is missing this line:
"> Simmondsia chinensis Jojoba S_chinensis"
that is present in the subversion repository copy of species.txt.
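Comparing the copies of synonyms.txt and species.txt by hand is error-prone, so a small script may help for future genome additions. This is a sketch only. It treats each file as an unordered set of lines and collapses runs of whitespace, an assumption made so that tab-versus-space differences are not reported. The function names are hypothetical.

```python
def normalized_lines(path):
    """Read non-blank lines, collapsing each run of whitespace to a single space."""
    with open(path, encoding="utf-8") as handle:
        return [" ".join(line.split()) for line in handle if line.strip()]


def lines_missing_from(candidate_path, reference_path):
    """Return lines found in the reference file but absent from the candidate file."""
    present = set(normalized_lines(candidate_path))
    return [line for line in normalized_lines(reference_path) if line not in present]
```

Running lines_missing_from on the IGB codebase copy (candidate) against the subversion copy (reference) should list lines such as the S_chinensis_Apr_2019 entry noted above. Swapping the arguments shows differences in the other direction.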
I wanted to make a backup of the repository tonight prior to migrating to a new host, so I went ahead and committed Paige Kulzer's new files.
Repository is now at revision 233 but isn't yet deployed on RENCI and UNCC hosting.
FYI: Repository backups are here:
I updated the quickload sites available for IGB.
To test:
- Check that the new genome can be loaded from both the main site and the mirror site for IGB 10.2.0, the released version.
- Check that the Web interfaces look right and the new genome files can be viewed and/or downloaded in the web browser
I noticed the file permissions for the new content on the mirror site were wrong. I fixed them on site by logging in manually and running the chmod command. I'm not sure what caused that! The primary site (on RENCI) was fine, however.
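A permissions audit could be run on a Quickload directory before or after deployment to catch this earlier. The sketch below assumes the web server reads the content as "other", so files need the world-read bit and directories need world-read plus world-execute. That is an assumption about the hosting setup, not something confirmed in this ticket. The function name is hypothetical.

```python
import os
import stat


def paths_not_world_readable(root):
    """Return paths under `root` that lack the permission bits a web server would need."""
    directory_bits = stat.S_IROTH | stat.S_IXOTH  # list and traverse directories
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.stat(dirpath).st_mode & directory_bits != directory_bits:
            problems.append(dirpath)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.stat(path).st_mode & stat.S_IROTH:
                problems.append(path)
    return problems
```

An empty list for the E_lathyris_Nov_2023 folder would suggest no chmod fix is needed under that assumption.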
The Web interface shows the new directory, but there's no text next to the folder. Looks like the .htaccess file needs to be updated? See: http://igbquickload.org/quickload/
Even more folder names are missing at the main Quickload site: http://igbquickload-main.bioviz.org/quickload/
I will make any necessary edits to the .htaccess file later because right now I want to set up a new Quickload site using the r233 backups.
Apart from that, everything looks fantastic!
Testing shows no problems, apart from the .htaccess file maybe needing to be updated, which I am going to investigate later.
Moving to DONE!
Making a separate ticket for updating the IGB codebase with the image.
Request: