[IGBF-3682] Add both tardigrade genome assemblies to IGB Quickload Main - JIRA UNCC

Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: 10.0.0 Major Release
Labels:
None

Story Points:
5
Epic Link:
Support tardigrade genome assemblies and rna-seq in IGB
Sprint:
Spring 7, Spring 9, Summer 1, Summer 2, Summer 3, Summer 4, Summer 5

Description

Bob Goldstein, faculty from UNC Chapel Hill, is coming to Charlotte to give a seminar on Friday April 5 at 2:30.

His research Web site: https://goldsteinlab.weebly.com/

His tardigrades site: http://tardigrades.bio.unc.edu/

[~ann.loraine] signed up for a meeting with him at 1:30 in the third floor conference room. (We might use my office instead, depending on whether they finished repairing the walls yet - long story!)

Our goal is to talk with him about the tardigrade genome project and how we can represent tardigrade (water bears) in IGB.

Water bears are an emerging model system for studying how animals survive in extreme environments.

According to the UCSC Genome Browser, they now have three tardigrade genome assemblies available.

Bob Goldstein's tardigrades web site (http://tardigrades.bio.unc.edu/) mentions that they are "developing the water bear Hypsibius exemplaris as a new model for studying evo-devo and resistance to extremes."

Their tardigrade site has a "links" page that includes a blast search link hosted at NCBI. That page allows users to search genome assembly

"Hypsibius dujardini GenBank assembly GCA_002082055.1"

UCSC appears to also feature this same genome assembly on their site. They call it "nHd_3.1 Apr. 2017 H.dujardini (Z151 2017 tardigrades) (GCA_002082055.1)"

See:

https://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_2790993_GCA_002082055.1&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=MTYJ01000001.1%3A705325%2D715325&hgsid=2077288182_Hg3vuIwMpk5jGaQjGrnxuSsihSl4.

For this task:

Get reference 2bit genome sequence file for the above genome assembly
Investigate and get reference gene model annotations for the above assembly
Add reference genome assembly to IGB Quickload main
Add reference gene model annotations to IGB Quickload main
Review Goldstein lab publications to identify foundational RNA-Seq data for this organism

Let's aim to get the tardigrade data added to IGB QL before Friday so that we can:

show Dr. Goldstein an IGB demo with the tardigrade genome data
ask Dr. Goldstein which RNA-Seq data sets are most relevant to his research
tentatively schedule an on-line demo of IGB with Dr. Goldstein lab members

Attachments

Options
- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

Attachments

RvY_01679-GDV.png
748 kB
25/Jun/24 11:52 AM
RvY_01679-IGB.png
137 kB
25/Jun/24 11:56 AM
Screenshot 2024-06-28 at 2.32.58 PM.png
316 kB
28/Jun/24 2:33 PM

Issue Links

relates to

IGBF-3742 Prepare for 2024 SDB poster and booth presentations

Closed

IGBF-3685 Identify RNA-Seq data for Tardigrade

Closed

Activity

Ascending order - Click to sort in descending order

Ann Loraine created issue - 02/Apr/24 10:36 AM

Ann Loraine made changes - 02/Apr/24 10:36 AM

Field	Original Value	New Value
Epic Link		IGBF-2809 [ 19325 ]

Ann Loraine made changes - 02/Apr/24 10:36 AM

Rank

Ranked higher

Ann Loraine made changes - 02/Apr/24 10:36 AM

Summary

Investigate: Adding tardigrade to IGB Quickload Main

Investigate: Add tardigrade to IGB Quickload Main

Ann Loraine made changes - 02/Apr/24 10:39 AM

Epic Link

IGBF-2809 [ 19325 ]

~~IGBF-1478~~ [ 17563 ]

Nowlan Freese made changes - 02/Apr/24 10:39 AM

Link

This issue relates to IGBF-3681 [ IGBF-3681 ]

Ann Loraine made changes - 02/Apr/24 10:48 AM

Epic Link

~~IGBF-1478~~ [ 17563 ]

IGBF-1395 [ 17470 ]

Nowlan Freese made changes - 02/Apr/24 10:52 AM

Assignee

Nowlan Freese [ nfreese ]

Hide

Permalink

Nowlan Freese added a comment - 02/Apr/24 10:56 AM - edited

For the R. varieornatus tardigrade:

Paper: https://www.nature.com/articles/ncomms12808
Website: http://kumamushi.org/database.html
NCBI: https://www.ncbi.nlm.nih.gov/nuccore/BDGG00000000.1

Note: UCSC does have the R. varieornatus genome in the browser, but I am not seeing it in the Table Browser, UCSC API, or UCSC Genome Downloads.

UCSC Gene Models: https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=2077278096_b5zQo29khGhNnPNEOOamtx0cKoXg&db=hub_2790993_GCA_001949185.1&c=BDGG01000001.1&g=hub_2790993_ncbiGene

The NCBI FTP site referenced as being used by UCSC: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/949/185/GCA_001949185.1_Rvar_4.0/

I pulled the following files:
GCA_001949185.1_Rvar_4.0_genomic.gff.gz
GCA_001949185.1_Rvar_4.0_genomic.fna.gz

I was able to load the files in IGB and they look correct.

NCBI also has this page: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001949185.1/
The Genome sequences (FASTA) and Annotation features (GFF) files are the same as those from the ftp.

Show

Nowlan Freese added a comment - 02/Apr/24 10:56 AM - edited For the R. varieornatus tardigrade: Paper: https://www.nature.com/articles/ncomms12808 Website: http://kumamushi.org/database.html NCBI: https://www.ncbi.nlm.nih.gov/nuccore/BDGG00000000.1 Note: UCSC does have the R. varieornatus genome in the browser, but I am not seeing it in the Table Browser , UCSC API , or UCSC Genome Downloads . UCSC Gene Models: https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=2077278096_b5zQo29khGhNnPNEOOamtx0cKoXg&db=hub_2790993_GCA_001949185.1&c=BDGG01000001.1&g=hub_2790993_ncbiGene The NCBI FTP site referenced as being used by UCSC: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/949/185/GCA_001949185.1_Rvar_4.0/ I pulled the following files: GCA_001949185.1_Rvar_4.0_genomic.gff.gz GCA_001949185.1_Rvar_4.0_genomic.fna.gz I was able to load the files in IGB and they look correct. NCBI also has this page: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001949185.1/ The Genome sequences (FASTA) and Annotation features (GFF) files are the same as those from the ftp.

Hide

Permalink

Nowlan Freese added a comment - 02/Apr/24 2:46 PM - edited

Hypsibius tardigrade:

https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=58670

GCA_002082055.1 - Hypsibius exemplaris > https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_002082055.1/
GCA_001579985.1 - Hypsibius dujardini > https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001579985.1/
GCA_001455005.1 - Hypsibius dujardini > https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001455005.1/

Note that the submitter for GCA_001455005.1 is Bob Goldstein, but NCBI warns that this genome assembly has been contaminated.

The genome assembly contains sequences from other organisms, cloning vectors, linkers, adapters, or primers.

It looks like only Hypsibius exemplaris and Ramazzottius varieornatus have genome annotations available.

Based on this paper it looks like they are interested in using Hypsibius exemplaris as the model.

Show

Nowlan Freese added a comment - 02/Apr/24 2:46 PM - edited Hypsibius tardigrade: https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=58670 GCA_002082055.1 - Hypsibius exemplaris > https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_002082055.1/ GCA_001579985.1 - Hypsibius dujardini > https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001579985.1/ GCA_001455005.1 - Hypsibius dujardini > https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001455005.1/ Note that the submitter for GCA_001455005.1 is Bob Goldstein, but NCBI warns that this genome assembly has been contaminated. The genome assembly contains sequences from other organisms, cloning vectors, linkers, adapters, or primers. It looks like only Hypsibius exemplaris and Ramazzottius varieornatus have genome annotations available. Based on this paper it looks like they are interested in using Hypsibius exemplaris as the model.

Nowlan Freese made changes - 02/Apr/24 2:50 PM

Comment

[ Paper: https://www.nature.com/articles/ncomms12808
Website: http://kumamushi.org/database.html
NCBI: https://www.ncbi.nlm.nih.gov/nuccore/BDGG00000000.1

Note: UCSC does have the [R. varieornatus|https://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_2790993_GCA_001949185.1&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=BDGG01000001.1%3A3111028%2D3121028&hgsid=2077278096_b5zQo29khGhNnPNEOOamtx0cKoXg] genome in the browser, but I am not seeing it in the [Table Browser|https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=2062754252_TCzwwWpelErfii1tZNtOWAOU96T6&clade=insect&org=0&db=0&hgta_group=genes&hgta_track=augustusGene&hgta_table=augustusGene&hgta_regionType=genome&position=&hgta_outputType=primaryTable&hgta_outFileName=knownGene.bed], [UCSC API|https://api.genome.ucsc.edu/list/ucscGenomes], or [UCSC Genome Downloads|https://hgdownload.soe.ucsc.edu/downloads.html]. ]

Nowlan Freese made changes - 03/Apr/24 9:40 AM

Status

To-Do [ 10305 ]

In Progress [ 3 ]

Hide

Permalink

Nowlan Freese added a comment - 03/Apr/24 1:17 PM - edited

[~ann.loraine] - Here are the files for Hypsibius exemplaris: https://drive.google.com/drive/folders/1Yc5eNuJcMTW1b1Q7wOrPa4JQXR-XR4W0?usp=sharing

GCA_002082055.1_nHd_3.1_genomic.2bit
GCA_002082055.1_nHd_3.1_genomic.gff.gz
genome.txt

Show

Nowlan Freese added a comment - 03/Apr/24 1:17 PM - edited [~ann.loraine] - Here are the files for Hypsibius exemplaris: https://drive.google.com/drive/folders/1Yc5eNuJcMTW1b1Q7wOrPa4JQXR-XR4W0?usp=sharing GCA_002082055.1_nHd_3.1_genomic.2bit GCA_002082055.1_nHd_3.1_genomic.gff.gz genome.txt

Ann Loraine made changes - 03/Apr/24 4:05 PM

Status

In Progress [ 3 ]

Needs 1st Level Review [ 10005 ]

Hide

Permalink

Ann Loraine added a comment - 03/Apr/24 4:10 PM - edited

Thanks Nowlan Freese! Please delete that google folder because I have what I need!

I have added the genome to the svn repository and updated the quickload "main" content deployed at these locations:

To test:

Start IGB
Use Species and Genome menus (Current Genome tab) to select the new genome H_exemplaris_Z151_Apr_2017 for species Hypsibius exemplaris strain Z151
Check that the full species name appears in the species menu with tooltip "water bear"
All the gene models should load automatically. I have configured them so that the description field appears as the names.
Check that the Web site content makes sense. Visit the first URL above and scroll down the page to check that there is an entry for the new genome in a similar style to all the others
Enter the genome directory folder for "water bear" and check that the information there in the "header.md" file appearing above the files listing makes sense. If it doesn't, make a note here and suggest a fix.

Show

Ann Loraine added a comment - 03/Apr/24 4:10 PM - edited Thanks Nowlan Freese ! Please delete that google folder because I have what I need! I have added the genome to the svn repository and updated the quickload "main" content deployed at these locations: http://lorainelab-quickload.scidas.org/quickload http://www.igbquickload.org/quickload To test: Start IGB Use Species and Genome menus (Current Genome tab) to select the new genome H_exemplaris_Z151_Apr_2017 for species Hypsibius exemplaris strain Z151 Check that the full species name appears in the species menu with tooltip "water bear" All the gene models should load automatically. I have configured them so that the description field appears as the names. Check that the Web site content makes sense. Visit the first URL above and scroll down the page to check that there is an entry for the new genome in a similar style to all the others Enter the genome directory folder for "water bear" and check that the information there in the "header.md" file appearing above the files listing makes sense. If it doesn't, make a note here and suggest a fix.

Hide

Permalink

Ann Loraine added a comment - 03/Apr/24 4:13 PM

Note that I converted the GFF file from GenBank to bed-detail using a newly committed version of this script: gff3ToBedDetail.py on branch "master" in repository https://bitbucket.org/lorainelab/genomesource

Show

Ann Loraine added a comment - 03/Apr/24 4:13 PM Note that I converted the GFF file from GenBank to bed-detail using a newly committed version of this script: gff3ToBedDetail.py on branch "master" in repository https://bitbucket.org/lorainelab/genomesource

Nowlan Freese made changes - 03/Apr/24 4:26 PM

Link

This issue relates to ~~IGBF-3685~~ [ ~~IGBF-3685~~ ]

Hide

Permalink

Ann Loraine added a comment - 04/Apr/24 2:18 PM

How to update the genome dashboard with common name and image for the water bear?

Show

Ann Loraine added a comment - 04/Apr/24 2:18 PM How to update the genome dashboard with common name and image for the water bear?

Hide

Permalink

Nowlan Freese added a comment - 04/Apr/24 2:33 PM - edited

Looks like we would add the image in this folder with the name H_exemplaris_Z151.jpg

Based on this code from the genome dashboard, it looks like it is pulling the species.txt and using that to determine the common names.

Show

Nowlan Freese added a comment - 04/Apr/24 2:33 PM - edited Looks like we would add the image in this folder with the name H_exemplaris_Z151.jpg Based on this code from the genome dashboard , it looks like it is pulling the species.txt and using that to determine the common names.

Hide

Permalink

Ann Loraine added a comment - 04/Apr/24 2:54 PM - edited

To make sure it gets added to genome dashboard in a nice way:

Clone https://bitbucket.org/lorainelab/genome-dashboard/
Add image to public/images
Edit species.txt and synonyms.txt for IGB release branch (genome dashboard retrieves these files from the branch)
Make sure that the new genome is available from the primary Quickload main site used by that branch (see preferences json file for the release branch)

Note that genome dashboard, when deployed, will be configured to use the currently released IGB branch code base to coordinate between IGB and the quickload sites it knows about.

To do this for tardigrade, I made a branch with the required changes and did a PR from my fork and branch to the main branch on the team repository.

Show

Ann Loraine added a comment - 04/Apr/24 2:54 PM - edited To make sure it gets added to genome dashboard in a nice way: Clone https://bitbucket.org/lorainelab/genome-dashboard/ Add image to public/images Edit species.txt and synonyms.txt for IGB release branch (genome dashboard retrieves these files from the branch) Make sure that the new genome is available from the primary Quickload main site used by that branch (see preferences json file for the release branch) Note that genome dashboard, when deployed, will be configured to use the currently released IGB branch code base to coordinate between IGB and the quickload sites it knows about. To do this for tardigrade, I made a branch with the required changes and did a PR from my fork and branch to the main branch on the team repository.

Hide

Permalink

Nowlan Freese added a comment - 04/Apr/24 4:27 PM

PR to update genome dashboard util.js to point at the main IGB branch: https://bitbucket.org/lorainelab/genome-dashboard/pull-requests/29

Commit: https://bitbucket.org/nfreese/genome-dashboard/commits/d66c03a40668bb19466143edf3cc5462011e341f

Show

Nowlan Freese added a comment - 04/Apr/24 4:27 PM PR to update genome dashboard util.js to point at the main IGB branch: https://bitbucket.org/lorainelab/genome-dashboard/pull-requests/29 Commit: https://bitbucket.org/nfreese/genome-dashboard/commits/d66c03a40668bb19466143edf3cc5462011e341f

Hide

Permalink

Ann Loraine added a comment - 04/Apr/24 4:28 PM

Nowlan Freese has submitted PR to update the default IGB branch for util.js in genome-dashboard code to "main" instead of "master"

Show

Ann Loraine added a comment - 04/Apr/24 4:28 PM Nowlan Freese has submitted PR to update the default IGB branch for util.js in genome-dashboard code to "main" instead of "master"

Nowlan Freese made changes - 05/Apr/24 10:52 AM

Summary

Investigate: Add tardigrade to IGB Quickload Main

Add tardigrade to IGB Quickload Main

Nowlan Freese made changes - 05/Apr/24 10:52 AM

Story Points

0.5

Hide

Permalink

Nowlan Freese added a comment - 05/Apr/24 11:02 AM - edited

Reviewing:

Genome Dashboard is working correctly, shows common and scientific name, clicking on it opens the correct genome in IGB.
Hypsibius exemplaris showing correctly in IGB. Annotations are loading by default, annotations look the same when comparing them to UCSC.

Only question I have, the annots.xml lists the title as "Annotation submitted by Keio University (March 3, 2022)", which then shows as the track name. Was the intention to show it as the track name? I was also a little confused as the listed submission date is March 4, 2022.

Show

Nowlan Freese added a comment - 05/Apr/24 11:02 AM - edited Reviewing: Genome Dashboard is working correctly, shows common and scientific name, clicking on it opens the correct genome in IGB. Hypsibius exemplaris showing correctly in IGB. Annotations are loading by default, annotations look the same when comparing them to UCSC . Only question I have, the annots.xml lists the title as "Annotation submitted by Keio University (March 3, 2022)", which then shows as the track name. Was the intention to show it as the track name? I was also a little confused as the listed submission date is March 4, 2022 .

Nowlan Freese made changes - 05/Apr/24 11:04 AM

Status

Needs 1st Level Review [ 10005 ]

First Level Review in Progress [ 10301 ]

Nowlan Freese made changes - 05/Apr/24 11:04 AM

Status

First Level Review in Progress [ 10301 ]

To-Do [ 10305 ]

Nowlan Freese made changes - 05/Apr/24 11:05 AM

Assignee

Nowlan Freese [ nfreese ]

Ann Loraine [ aloraine ]

Nowlan Freese made changes - 12/Apr/24 11:03 AM

Story Points	2	4
Assignee	Ann Loraine [ aloraine ]

Ann Loraine made changes - 15/Apr/24 9:44 AM

Sprint

Spring 7 [ 191 ]

Spring 7, Spring 8 [ 191, 192 ]

Ann Loraine made changes - 15/Apr/24 9:44 AM

Rank

Ranked higher

Hide

Permalink

Paige Kulzer (Inactive) added a comment - 24/Apr/24 2:22 PM - edited

To do:

~~Fix typo in annots.xml - change March 3, 2022 -> March 4, 2022~~ PK committed this change via svn on 4/24/24
~~Add R. varieornatus tardigrade genome and annotation to IGB Quickload.~~
Update gff3ToBedDetail.py in GenomeSource repo so that it handles non protein coding genes.

Show

Paige Kulzer (Inactive) added a comment - 24/Apr/24 2:22 PM - edited To do: Fix typo in annots.xml - change March 3, 2022 -> March 4, 2022 PK committed this change via svn on 4/24/24 Add R. varieornatus tardigrade genome and annotation to IGB Quickload. Update gff3ToBedDetail.py in GenomeSource repo so that it handles non protein coding genes.

Ann Loraine made changes - 24/Apr/24 9:46 PM

Summary

Add tardigrade to IGB Quickload Main

Add tardigrade genome assemblies to IGB Quickload Main

Hide

Permalink

Ann Loraine added a comment - 24/Apr/24 9:46 PM

R. varieornatus : https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001949185.1/

Show

Ann Loraine added a comment - 24/Apr/24 9:46 PM R. varieornatus : https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001949185.1/

Hide

Permalink

Ann Loraine added a comment - 24/Apr/24 9:49 PM

All the taxon tardigrada genome assemblies:

https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=42241

Show

Ann Loraine added a comment - 24/Apr/24 9:49 PM All the taxon tardigrada genome assemblies: https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=42241

Paige Kulzer (Inactive) made changes - 25/Apr/24 10:44 AM

Sprint

Spring 7, Spring 8 [ 191, 192 ]

Spring 7, Spring 9 [ 191, 193 ]

Ann Loraine made changes - 12/May/24 4:33 PM

Sprint

Spring 7, Spring 9 [ 191, 193 ]

Spring 7, Spring 9, Spring 10 [ 191, 193, 194 ]

Ann Loraine made changes - 12/May/24 4:33 PM

Rank

Ranked higher

Paige Kulzer (Inactive) made changes - 23/May/24 2:16 PM

Sprint

Spring 7, Spring 9, Spring 10 [ 191, 193, 194 ]

Spring 7, Spring 9, Summer 1 [ 191, 193, 195 ]

Nowlan Freese made changes - 03/Jun/24 10:13 AM

Assignee

Nowlan Freese [ nfreese ]

Ann Loraine made changes - 09/Jun/24 4:44 PM

Sprint

Spring 7, Spring 9, Summer 1 [ 191, 193, 195 ]

Spring 7, Spring 9, Summer 1, Summer 2 [ 191, 193, 195, 196 ]

Ann Loraine made changes - 09/Jun/24 4:44 PM

Rank

Ranked higher

Ann Loraine made changes - 12/Jun/24 3:35 PM

Epic Link

IGBF-1395 [ 17470 ]

IGBF-3778 [ 22997 ]

Ann Loraine made changes - 12/Jun/24 3:36 PM

Summary

Add tardigrade genome assemblies to IGB Quickload Main

Add both tardigrade genome assemblies to IGB Quickload Main

Nowlan Freese made changes - 14/Jun/24 9:45 AM

Status

To-Do [ 10305 ]

In Progress [ 3 ]

Hide

Permalink

Nowlan Freese added a comment - 14/Jun/24 12:22 PM

[~ann.loraine] - I have added the files for Ramazzottius varieornatus to Google Drive: https://drive.google.com/drive/folders/1acW5w8dMGvxit3kw9i_jiW9ZOUqV6W8X?usp=drive_link

GCA_001949185.1_Rvar_4.0_genomic.2bit
GCA_001949185.1_Rvar.gff

Files are from NCBI: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001949185.1/

When I tried to run the latest version of gff3ToBedDetail.py I was seeing several warnings, so I was hoping you could take a look.

Show

Nowlan Freese added a comment - 14/Jun/24 12:22 PM [~ann.loraine] - I have added the files for Ramazzottius varieornatus to Google Drive: https://drive.google.com/drive/folders/1acW5w8dMGvxit3kw9i_jiW9ZOUqV6W8X?usp=drive_link GCA_001949185.1_Rvar_4.0_genomic.2bit GCA_001949185.1_Rvar.gff Files are from NCBI: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001949185.1/ When I tried to run the latest version of gff3ToBedDetail.py I was seeing several warnings, so I was hoping you could take a look.

Ann Loraine made changes - 21/Jun/24 8:25 PM

Sprint

Spring 7, Spring 9, Summer 1, Summer 2 [ 191, 193, 195, 196 ]

Spring 7, Spring 9, Summer 1, Summer 2, Summer 3 [ 191, 193, 195, 196, 197 ]

Ann Loraine made changes - 21/Jun/24 8:25 PM

Rank

Ranked higher

Nowlan Freese made changes - 24/Jun/24 2:59 PM

Status

In Progress [ 3 ]

To-Do [ 10305 ]

Nowlan Freese made changes - 24/Jun/24 2:59 PM

Assignee

Nowlan Freese [ nfreese ]

Ann Loraine [ aloraine ]

Ann Loraine made changes - 24/Jun/24 4:10 PM

Status

To-Do [ 10305 ]

In Progress [ 3 ]

Hide

Permalink

Ann Loraine added a comment - 24/Jun/24 4:21 PM - edited

Update:

Publication date for R. varieornatus assembly: 20 September 2016
Strain is YOKOZUNA-1 according to Genbank record at https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001949185.1
Therefore genome version should be added to IGB as: R_varieornatus_YOKOZUNA-1_Sep_2016
Genome version synonyms: Rvar_4.0, GCA_001949185.1

Show

Ann Loraine added a comment - 24/Jun/24 4:21 PM - edited Update: Publication date for R. varieornatus assembly: 20 September 2016 Strain is YOKOZUNA-1 according to Genbank record at https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001949185.1 Therefore genome version should be added to IGB as: R_varieornatus_YOKOZUNA-1_Sep_2016 Genome version synonyms: Rvar_4.0, GCA_001949185.1

Hide

Permalink

Ann Loraine added a comment - 24/Jun/24 4:31 PM - edited

Investigating GFF file from NCBI to try to understand errors noted by Nowlan Freese:

example line:

BDGG01000001.1  DDBJ    CDS     441     704     .       +       2       ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1

This feature type is "CDS" indicating that this line specifies a region of genomic sequence that is translated into protein. The "parent" identifier is "gene-RvY_00001". This means that anything with this same "parent" is part of the same gene model, usually. However, it's a bit weird because usually "gene" means: a collection of gene models that are theoretically produced by alternative transcription.

This means that the parser we write for this file needs to use the "parent" identifier to assemble components of a gene model.

The GFF "extra feature" section (the final column) also reports that the CDS has a "locus_tag" which seems like it could indicate a grouping that I normally think of as a "gene."

Searching for the parent identifier using grep brings up:

local aloraine$ grep RvY_00001-1 GCA_001949185.1_Rvar.gff 
BDGG01000001.1	DDBJ	gene	65	2382	.	+	.	ID=gene-RvY_00001;Name=RvY_00001-1;gbkey=Gene;gene=RvY_00001-1;gene_biotype=protein_coding;gene_synonym=RvY_00001.1;locus_tag=RvY_00001
BDGG01000001.1	DDBJ	CDS	65	242	.	+	0	ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1
BDGG01000001.1	DDBJ	CDS	441	704	.	+	2	ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1
BDGG01000001.1	DDBJ	CDS	934	1359	.	+	2	ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1
BDGG01000001.1	DDBJ	CDS	1853	2382	.	+	2	ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1

The search finds five lines of text: 4 CDS features and 1 gene feature.

What happens if we search this way?

local aloraine$ grep RvY_00001 GCA_001949185.1_Rvar.gff 
BDGG01000001.1	DDBJ	gene	65	2382	.	+	.	ID=gene-RvY_00001;Name=RvY_00001-1;gbkey=Gene;gene=RvY_00001-1;gene_biotype=protein_coding;gene_synonym=RvY_00001.1;locus_tag=RvY_00001
BDGG01000001.1	DDBJ	CDS	65	242	.	+	0	ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1
BDGG01000001.1	DDBJ	CDS	441	704	.	+	2	ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1
BDGG01000001.1	DDBJ	CDS	934	1359	.	+	2	ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1
BDGG01000001.1	DDBJ	CDS	1853	2382	.	+	2	ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1

Show

Ann Loraine added a comment - 24/Jun/24 4:31 PM - edited Investigating GFF file from NCBI to try to understand errors noted by Nowlan Freese : example line: BDGG01000001.1 DDBJ CDS 441 704 . + 2 ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1 This feature type is "CDS" indicating that this line specifies a region of genomic sequence that is translated into protein. The "parent" identifier is "gene-RvY_00001". This means that anything with this same "parent" is part of the same gene model, usually. However, it's a bit weird because usually "gene" means: a collection of gene models that are theoretically produced by alternative transcription. This means that the parser we write for this file needs to use the "parent" identifier to assemble components of a gene model. The GFF "extra feature" section (the final column) also reports that the CDS has a "locus_tag" which seems like it could indicate a grouping that I normally think of as a "gene." Searching for the parent identifier using grep brings up: local aloraine$ grep RvY_00001-1 GCA_001949185.1_Rvar.gff BDGG01000001.1 DDBJ gene 65 2382 . + . ID=gene-RvY_00001;Name=RvY_00001-1;gbkey=Gene;gene=RvY_00001-1;gene_biotype=protein_coding;gene_synonym=RvY_00001.1;locus_tag=RvY_00001 BDGG01000001.1 DDBJ CDS 65 242 . + 0 ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1 BDGG01000001.1 DDBJ CDS 441 704 . + 2 ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1 BDGG01000001.1 DDBJ CDS 934 1359 . + 2 ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1 BDGG01000001.1 DDBJ CDS 1853 2382 . + 2 ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1 The search finds five lines of text: 4 CDS features and 1 gene feature. What happens if we search this way? local aloraine$ grep RvY_00001 GCA_001949185.1_Rvar.gff BDGG01000001.1 DDBJ gene 65 2382 . + . ID=gene-RvY_00001;Name=RvY_00001-1;gbkey=Gene;gene=RvY_00001-1;gene_biotype=protein_coding;gene_synonym=RvY_00001.1;locus_tag=RvY_00001 BDGG01000001.1 DDBJ CDS 65 242 . + 0 ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1 BDGG01000001.1 DDBJ CDS 441 704 . + 2 ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1 BDGG01000001.1 DDBJ CDS 934 1359 . + 2 ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1 BDGG01000001.1 DDBJ CDS 1853 2382 . + 2 ID=cds-GAU87103.1;Parent=gene-RvY_00001;Dbxref=NCBI_GP:GAU87103.1;Name=GAU87103.1;gbkey=CDS;gene=RvY_00001-1;locus_tag=RvY_00001;product=hypothetical protein;protein_id=GAU87103.1

Hide

Permalink

Ann Loraine added a comment - 24/Jun/24 4:37 PM - edited

The GFF file's quote lines above suggest that for some gene models, there are no exon annotations. It's possible that this file uses "CDS" and only "CDS" to indicate the locations of translated exons and maybe uses "exons" to indicate the untranslated regions (UTRs) at the 5' and 3' ends of a transcript. There are probably not very many of these because the annotation was probably done using gene-finding software, which sucks at identifying UTRs.

Let's take a look:

local aloraine$ grep exon GCA_001949185.1_Rvar.gff  | head 
BDGG01000001.1	DDBJ	exon	443104	443176	.	+	.	ID=exon-RvY_t0001-1;Parent=rna-RvY_t0001;gbkey=tRNA;gene=tRNA-Lys(CUU);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0001;product=tRNA-Lys
BDGG01000001.1	DDBJ	exon	468837	468908	.	+	.	ID=exon-RvY_t0002-1;Parent=rna-RvY_t0002;gbkey=tRNA;gene=tRNA-Arg(CCG);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0002;product=tRNA-Arg
BDGG01000001.1	DDBJ	exon	1764652	1764688	.	+	.	ID=exon-RvY_t0003-1;Parent=rna-RvY_t0003;gbkey=tRNA;gene=tRNA-Tyr(GUA);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0003;product=tRNA-Tyr
BDGG01000001.1	DDBJ	exon	1764702	1764737	.	+	.	ID=exon-RvY_t0003-2;Parent=rna-RvY_t0003;gbkey=tRNA;gene=tRNA-Tyr(GUA);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0003;product=tRNA-Tyr
BDGG01000001.1	DDBJ	exon	1908331	1908451	.	-	.	ID=exon-RvY_r0001-1;Parent=rna-RvY_r0001;gbkey=rRNA;gene=5s_rRNA;inference=ab initio prediction:RNAmmer:1.2(score:27.1);locus_tag=RvY_r0001;product=5S ribosomal RNA
BDGG01000001.1	DDBJ	exon	2282898	2282979	.	-	.	ID=exon-RvY_t0004-1;Parent=rna-RvY_t0004;gbkey=tRNA;gene=tRNA-Leu(AAG);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0004;product=tRNA-Leu
BDGG01000001.1	DDBJ	exon	2356047	2356161	.	-	.	ID=exon-RvY_r0002-1;Parent=rna-RvY_r0002;gbkey=rRNA;gene=5s_rRNA;inference=ab initio prediction:RNAmmer:1.2(score:82.3);locus_tag=RvY_r0002;product=5S ribosomal RNA
BDGG01000001.1	DDBJ	exon	2410614	2410728	.	-	.	ID=exon-RvY_r0003-1;Parent=rna-RvY_r0003;gbkey=rRNA;gene=5s_rRNA;inference=ab initio prediction:RNAmmer:1.2(score:82.3);locus_tag=RvY_r0003;product=5S ribosomal RNA
BDGG01000001.1	DDBJ	exon	2418110	2418224	.	-	.	ID=exon-RvY_r0004-1;Parent=rna-RvY_r0004;gbkey=rRNA;gene=5s_rRNA;inference=ab initio prediction:RNAmmer:1.2(score:79.3);locus_tag=RvY_r0004;product=5S ribosomal RNA
BDGG01000001.1	DDBJ	exon	2530911	2531025	.	+	.	ID=exon-RvY_r0005-1;Parent=rna-RvY_r0005;gbkey=rRNA;gene=5s_rRNA;inference=ab initio prediction:RNAmmer:1.2(score:84.4);locus_tag=RvY_r0005;product=5S ribosomal RN

All of the above lines look like they are coming from an RNA-prediction program called "Aragorn". Cool cool cool.

Are there any "exons" associated with protein-coding genes?

Let's find out:

local aloraine$ grep -w exon GCA_001949185.1_Rvar.gff  | grep protein
local aloraine$

The preceding line finds all instances of the whole word "exon" in the target GFF file and then pipes the output to another grep command that searches for instances of the word "protein." Nothing is found. So it appears that all the protein coding genes, hypothetical or otherwise, have zero "exon" features in them.

So, to assemble gene models for protein-coding genes, we have to find all the CDS features and then assemble them into gene models in our code.

The RNA-genes will need to be assembled in a different way.

Show

Ann Loraine added a comment - 24/Jun/24 4:37 PM - edited The GFF file's quote lines above suggest that for some gene models, there are no exon annotations. It's possible that this file uses "CDS" and only "CDS" to indicate the locations of translated exons and maybe uses "exons" to indicate the untranslated regions (UTRs) at the 5' and 3' ends of a transcript. There are probably not very many of these because the annotation was probably done using gene-finding software, which sucks at identifying UTRs. Let's take a look: local aloraine$ grep exon GCA_001949185.1_Rvar.gff | head BDGG01000001.1 DDBJ exon 443104 443176 . + . ID=exon-RvY_t0001-1;Parent=rna-RvY_t0001;gbkey=tRNA;gene=tRNA-Lys(CUU);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0001;product=tRNA-Lys BDGG01000001.1 DDBJ exon 468837 468908 . + . ID=exon-RvY_t0002-1;Parent=rna-RvY_t0002;gbkey=tRNA;gene=tRNA-Arg(CCG);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0002;product=tRNA-Arg BDGG01000001.1 DDBJ exon 1764652 1764688 . + . ID=exon-RvY_t0003-1;Parent=rna-RvY_t0003;gbkey=tRNA;gene=tRNA-Tyr(GUA);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0003;product=tRNA-Tyr BDGG01000001.1 DDBJ exon 1764702 1764737 . + . ID=exon-RvY_t0003-2;Parent=rna-RvY_t0003;gbkey=tRNA;gene=tRNA-Tyr(GUA);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0003;product=tRNA-Tyr BDGG01000001.1 DDBJ exon 1908331 1908451 . - . ID=exon-RvY_r0001-1;Parent=rna-RvY_r0001;gbkey=rRNA;gene=5s_rRNA;inference=ab initio prediction:RNAmmer:1.2(score:27.1);locus_tag=RvY_r0001;product=5S ribosomal RNA BDGG01000001.1 DDBJ exon 2282898 2282979 . - . ID=exon-RvY_t0004-1;Parent=rna-RvY_t0004;gbkey=tRNA;gene=tRNA-Leu(AAG);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0004;product=tRNA-Leu BDGG01000001.1 DDBJ exon 2356047 2356161 . - . ID=exon-RvY_r0002-1;Parent=rna-RvY_r0002;gbkey=rRNA;gene=5s_rRNA;inference=ab initio prediction:RNAmmer:1.2(score:82.3);locus_tag=RvY_r0002;product=5S ribosomal RNA BDGG01000001.1 DDBJ exon 2410614 2410728 . - . ID=exon-RvY_r0003-1;Parent=rna-RvY_r0003;gbkey=rRNA;gene=5s_rRNA;inference=ab initio prediction:RNAmmer:1.2(score:82.3);locus_tag=RvY_r0003;product=5S ribosomal RNA BDGG01000001.1 DDBJ exon 2418110 2418224 . - . ID=exon-RvY_r0004-1;Parent=rna-RvY_r0004;gbkey=rRNA;gene=5s_rRNA;inference=ab initio prediction:RNAmmer:1.2(score:79.3);locus_tag=RvY_r0004;product=5S ribosomal RNA BDGG01000001.1 DDBJ exon 2530911 2531025 . + . ID=exon-RvY_r0005-1;Parent=rna-RvY_r0005;gbkey=rRNA;gene=5s_rRNA;inference=ab initio prediction:RNAmmer:1.2(score:84.4);locus_tag=RvY_r0005;product=5S ribosomal RN All of the above lines look like they are coming from an RNA-prediction program called "Aragorn". Cool cool cool. Are there any "exons" associated with protein-coding genes? Let's find out: local aloraine$ grep -w exon GCA_001949185.1_Rvar.gff | grep protein local aloraine$ The preceding line finds all instances of the whole word "exon" in the target GFF file and then pipes the output to another grep command that searches for instances of the word "protein." Nothing is found. So it appears that all the protein coding genes, hypothetical or otherwise, have zero "exon" features in them. So, to assemble gene models for protein-coding genes, we have to find all the CDS features and then assemble them into gene models in our code. The RNA-genes will need to be assembled in a different way.

Hide

Permalink

Ann Loraine added a comment - 24/Jun/24 5:15 PM - edited

Investigating an example RNA gene:

local aloraine$ grep rna-RvY_t0002 GCA_001949185.1_Rvar.gff 
BDGG01000001.1	DDBJ	tRNA	468837	468908	.	+	.	ID=rna-RvY_t0002;Parent=gene-RvY_t0002;gbkey=tRNA;gene=tRNA-Arg(CCG);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0002;product=tRNA-Arg
BDGG01000001.1	DDBJ	exon	468837	468908	.	+	.	ID=exon-RvY_t0002-1;Parent=rna-RvY_t0002;gbkey=tRNA;gene=tRNA-Arg(CCG);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0002;product=tRNA-Arg

Looks like non-coding genes are constructed from exons that get grouped using a "Parent" gene identifier in the same way that protein-coding genes do.

Show

Ann Loraine added a comment - 24/Jun/24 5:15 PM - edited Investigating an example RNA gene: local aloraine$ grep rna-RvY_t0002 GCA_001949185.1_Rvar.gff BDGG01000001.1 DDBJ tRNA 468837 468908 . + . ID=rna-RvY_t0002;Parent=gene-RvY_t0002;gbkey=tRNA;gene=tRNA-Arg(CCG);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0002;product=tRNA-Arg BDGG01000001.1 DDBJ exon 468837 468908 . + . ID=exon-RvY_t0002-1;Parent=rna-RvY_t0002;gbkey=tRNA;gene=tRNA-Arg(CCG);inference=ab initio prediction:Aragorn:1.2.2865%2CtRNAscan-SE:1.2366;locus_tag=RvY_t0002;product=tRNA-Arg Looks like non-coding genes are constructed from exons that get grouped using a "Parent" gene identifier in the same way that protein-coding genes do.

Hide

Permalink

Ann Loraine added a comment - 24/Jun/24 5:38 PM - edited

Examining errors and warnings emitted by gff3ToBedDetail.py:

  warnings.warn("Two lines with the same gene id: %s"%gene_id)
/Users/aloraine/src/genomesource/Mapping/Parser/Gff3.py:49: UserWarning: Two lines with the same gene id: gene-RvY_00053-2

The extra feature field attribute "gene id" is getting repeated in every line. I need to look into if this is going to be a problem or not.

Investigating:

BDGG01000001.1	DDBJ	gene	115830	115868	.	-	.	ID=gene-RvY_00053-2;Name=RvY_00053;gbkey=Gene;gene=RvY_00053;gene_biotype=protein_coding;gene_synonym=RvY_00053.2;is_ordered=true;locus_tag=RvY_00053-2
BDGG01000001.1	DDBJ	gene	115332	115577	.	-	.	ID=gene-RvY_00053-2;Name=RvY_00053;gbkey=Gene;gene=RvY_00053;gene_biotype=protein_coding;gene_synonym=RvY_00053.2;is_ordered=true;locus_tag=RvY_00053-2
... (more, not shown)

It appears the same entity with extra feature "ID=gene-RvY_00053-2" appears twice in the file, with different coordinates.

Show

Ann Loraine added a comment - 24/Jun/24 5:38 PM - edited Examining errors and warnings emitted by gff3ToBedDetail.py: warnings.warn( "Two lines with the same gene id: %s" %gene_id) /Users/aloraine/src/genomesource/Mapping/Parser/Gff3.py:49: UserWarning: Two lines with the same gene id: gene-RvY_00053-2 The extra feature field attribute "gene id" is getting repeated in every line. I need to look into if this is going to be a problem or not. Investigating: BDGG01000001.1 DDBJ gene 115830 115868 . - . ID=gene-RvY_00053-2;Name=RvY_00053;gbkey=Gene;gene=RvY_00053;gene_biotype=protein_coding;gene_synonym=RvY_00053.2;is_ordered= true ;locus_tag=RvY_00053-2 BDGG01000001.1 DDBJ gene 115332 115577 . - . ID=gene-RvY_00053-2;Name=RvY_00053;gbkey=Gene;gene=RvY_00053;gene_biotype=protein_coding;gene_synonym=RvY_00053.2;is_ordered= true ;locus_tag=RvY_00053-2 ... (more, not shown) It appears the same entity with extra feature "ID=gene-RvY_00053-2" appears twice in the file, with different coordinates.

Hide

Permalink

Ann Loraine added a comment - 25/Jun/24 11:08 AM

I don't think this GFF file is usable because it uses the same ID for entities with different coordinates. Looking into whether there is another option we can use instead.

Show

Ann Loraine added a comment - 25/Jun/24 11:08 AM I don't think this GFF file is usable because it uses the same ID for entities with different coordinates. Looking into whether there is another option we can use instead.

Hide

Permalink

Ann Loraine added a comment - 25/Jun/24 11:30 AM - edited

Contacting NCBI for help:

Hello! I need your help with the gene model annotation data file for this genome assembly: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001949185.1/

I am working with this annotation file: GCA_001949185.1_Rvar_4.0_genomic.gff.gz. (It is available from the "ftp" link.)

The file has a problem I hope you can help me with.

Many lines in the file contain duplicated "ID" values and nonsensical "Parent" values.

For example, there are two lines with ID value of gene-RvY_00053-2, but they have different coordinates.

My understanding of GFF is that all ID values need to be unique within a file. This is because sub-features (e.g., CDS, exon, etc) have to be assembled into coherent gene models by referencing the same Parent. If sub-features belonging to different gene models reference the same Parent identifier, then there is no way to create coherent gene models from them.

Could you help me understand the proper way to use this file?

Thank you!!!

Ann Loraine, PhD
Professor, Bioinformatics and Genomics
UNC Charlotte

Show

Ann Loraine added a comment - 25/Jun/24 11:30 AM - edited Contacting NCBI for help: Hello! I need your help with the gene model annotation data file for this genome assembly: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001949185.1/ I am working with this annotation file: GCA_001949185.1_Rvar_4.0_genomic.gff.gz. (It is available from the "ftp" link.) The file has a problem I hope you can help me with. Many lines in the file contain duplicated "ID" values and nonsensical "Parent" values. For example, there are two lines with ID value of gene-RvY_00053-2, but they have different coordinates. My understanding of GFF is that all ID values need to be unique within a file. This is because sub-features (e.g., CDS, exon, etc) have to be assembled into coherent gene models by referencing the same Parent. If sub-features belonging to different gene models reference the same Parent identifier, then there is no way to create coherent gene models from them. Could you help me understand the proper way to use this file? Thank you!!! Ann Loraine, PhD Professor, Bioinformatics and Genomics UNC Charlotte

Hide

Permalink

Ann Loraine added a comment - 25/Jun/24 11:47 AM - edited

Update:

Opened 2bit file as a new "custom" genome in IGB
Opened GFF file from Genbank in IGB, selected load model "whole genome"
Looked up a multi-exon gene model using NCBI's Genome Data Viewer page. Identified RvY_01679 as an example. It has two gene models and multiple exons per gene model.
Searched using RvY_01679 as the query in IGB using Advanced Search tab. IGB found the gene but when I went to its location in IGB by double-clicking the search results, the location appeared to be empty. So, it looks like the file is getting parsed without error and some aspects of the data are getting imported, but IGB cannot draw the gene model.
Images are attached

Note: This is an example of how seeing the same region in two different genome browsers helps science. In this case, it helps with trouble-shooting a nonsensical data file.

Show

Ann Loraine added a comment - 25/Jun/24 11:47 AM - edited Update: Opened 2bit file as a new "custom" genome in IGB Opened GFF file from Genbank in IGB, selected load model "whole genome" Looked up a multi-exon gene model using NCBI's Genome Data Viewer page. Identified RvY_01679 as an example. It has two gene models and multiple exons per gene model. Searched using RvY_01679 as the query in IGB using Advanced Search tab. IGB found the gene but when I went to its location in IGB by double-clicking the search results, the location appeared to be empty. So, it looks like the file is getting parsed without error and some aspects of the data are getting imported, but IGB cannot draw the gene model. Images are attached Note : This is an example of how seeing the same region in two different genome browsers helps science. In this case, it helps with trouble-shooting a nonsensical data file.

Ann Loraine made changes - 25/Jun/24 11:48 AM

Attachment

RvY_00053.png [ 18425 ]

Ann Loraine made changes - 25/Jun/24 11:52 AM

Attachment

RvY_01679-GDV.png [ 18426 ]

Ann Loraine made changes - 25/Jun/24 11:56 AM

Attachment

RvY_00053.png [ 18425 ]

Ann Loraine made changes - 25/Jun/24 11:56 AM

Attachment

RvY_01679-IGB.png [ 18427 ]

Ann Loraine made changes - 25/Jun/24 12:00 PM

Assignee

Ann Loraine [ aloraine ]

Hide

Permalink

Ann Loraine added a comment - 25/Jun/24 12:01 PM

My assessment of what to do next:

Wait for NCBI to get back to us.

Show

Ann Loraine added a comment - 25/Jun/24 12:01 PM My assessment of what to do next: Wait for NCBI to get back to us.

Ann Loraine made changes - 25/Jun/24 12:01 PM

Status

In Progress [ 3 ]

To-Do [ 10305 ]

Nowlan Freese made changes - 28/Jun/24 2:33 PM

Attachment

Screenshot 2024-06-28 at 2.32.58 PM.png [ 18436 ]

Hide

Permalink

Nowlan Freese added a comment - 28/Jun/24 2:40 PM - edited

[~ann.loraine] - I have added a bed file (R_varieornatus_YOKOZUNA-1_Nov_2016_ncbiGene.bed) of the R varieornatus annotation to the Google Drive: https://drive.google.com/drive/folders/1acW5w8dMGvxit3kw9i_jiW9ZOUqV6W8X?usp=drive_link

The file came from the UCSC Table Browser, see image below:

The table is listed as hub_3041413_ncbiGene, under track Gene Models
The description page for the R varieornatus gene models on the UCSC genome browser are listed as using Assembly: hub_3041413_Rvar_4.0 Nov. 2016 R.varieornatus (YOKOZUNA-1 2016 tardigrades) and

The Gene model track for the 09 Nov 2016 Ramazzottius varieornatus/GCA_001949185.1_Rvar_4.0 genome assembly is constructed from the gff file GCA_001949185.1_Rvar_4.0_genomic.gff.gz supplied with the genome assembly at the FTP location:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/949/185/GCA_001949185.1_Rvar_4.0/

I'm reasonably sure that the gff file it links to is the same one we were using from NCBI, and it appears that the UCSC bed file is created from it.

The bed file has a 13th and 14th field, but the 14th field is a repeat of the gene name (i.e. it is a placeholder). The gff file we were attempting to use before has some additional information, such as a product field that we might be able to use as the 14th field.

Show

Nowlan Freese added a comment - 28/Jun/24 2:40 PM - edited [~ann.loraine] - I have added a bed file (R_varieornatus_YOKOZUNA-1_Nov_2016_ncbiGene.bed) of the R varieornatus annotation to the Google Drive: https://drive.google.com/drive/folders/1acW5w8dMGvxit3kw9i_jiW9ZOUqV6W8X?usp=drive_link The file came from the UCSC Table Browser , see image below: The table is listed as hub_3041413_ncbiGene, under track Gene Models The description page for the R varieornatus gene models on the UCSC genome browser are listed as using Assembly: hub_3041413_Rvar_4.0 Nov. 2016 R.varieornatus (YOKOZUNA-1 2016 tardigrades) and The Gene model track for the 09 Nov 2016 Ramazzottius varieornatus/GCA_001949185.1_Rvar_4.0 genome assembly is constructed from the gff file GCA_001949185.1_Rvar_4.0_genomic.gff.gz supplied with the genome assembly at the FTP location: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/949/185/GCA_001949185.1_Rvar_4.0/ I'm reasonably sure that the gff file it links to is the same one we were using from NCBI, and it appears that the UCSC bed file is created from it. The bed file has a 13th and 14th field, but the 14th field is a repeat of the gene name (i.e. it is a placeholder). The gff file we were attempting to use before has some additional information, such as a product field that we might be able to use as the 14th field.

Ann Loraine made changes - 30/Jun/24 8:52 AM

Sprint

Spring 7, Spring 9, Summer 1, Summer 2, Summer 3 [ 191, 193, 195, 196, 197 ]

Spring 7, Spring 9, Summer 1, Summer 2, Summer 3, Summer 4 [ 191, 193, 195, 196, 197, 198 ]

Ann Loraine made changes - 30/Jun/24 8:52 AM

Rank

Ranked higher

Nowlan Freese made changes - 01/Jul/24 10:21 AM

Assignee

Nowlan Freese [ nfreese ]

Nowlan Freese made changes - 02/Jul/24 9:42 AM

Status

To-Do [ 10305 ]

In Progress [ 3 ]

Hide

Permalink

Nowlan Freese added a comment - 03/Jul/24 9:38 AM - edited

I used the Note and Product tags from the GCA_001949185.1_Rvar.gff file to create a 14th description column. The updated bed file has been added to the Google Drive as R_varieornatus_YOKOZUNA-1_Nov_2016_ncbiGene.bed.gz. The genome.txt, 2bit file, and tabix index are also present in the Google Drive.

As noted by Dr. Loraine there are many duplicated ID values. In order to work around this, I retrieved the cds, rrna, and trna rows from the GCA_001949185.1_Rvar.gff file. Note that ID=id-RvY_02531 was an extra row in the GFF file that was hard to find and needed to be removed. I sorted by seqid, then start, then end, and then attributes. I used diff on the ID tag of the sorted gff to compare to the name value of the bed file and they were the same. I then extracted the Note and Product tags and used those to create the 14th description column. Much of this was done manually.

Next step: Create new genome and push files to SVN repo.

Show

Nowlan Freese added a comment - 03/Jul/24 9:38 AM - edited I used the Note and Product tags from the GCA_001949185.1_Rvar.gff file to create a 14th description column. The updated bed file has been added to the Google Drive as R_varieornatus_YOKOZUNA-1_Nov_2016_ncbiGene.bed.gz. The genome.txt, 2bit file, and tabix index are also present in the Google Drive. As noted by Dr. Loraine there are many duplicated ID values. In order to work around this, I retrieved the cds, rrna, and trna rows from the GCA_001949185.1_Rvar.gff file. Note that ID=id-RvY_02531 was an extra row in the GFF file that was hard to find and needed to be removed. I sorted by seqid, then start, then end, and then attributes. I used diff on the ID tag of the sorted gff to compare to the name value of the bed file and they were the same. I then extracted the Note and Product tags and used those to create the 14th description column. Much of this was done manually. Next step: Create new genome and push files to SVN repo.

Nowlan Freese made changes - 03/Jul/24 10:08 AM

Story Points

Hide

Permalink

Nowlan Freese added a comment - 03/Jul/24 1:29 PM

I have updated the name to be R_varieornatus_YOKOZUNA-1_Nov_2016 based on NCBI and UCSC.

Show

Nowlan Freese added a comment - 03/Jul/24 1:29 PM I have updated the name to be R_varieornatus_YOKOZUNA-1_Nov_2016 based on NCBI and UCSC .

Hide

Permalink

Ann Loraine added a comment - 05/Jul/24 9:44 AM

My apologies for the problems with the subversion server.

They are now fixed. You should now be able to commit your new files and directories to the subversion host as before.

Show

Ann Loraine added a comment - 05/Jul/24 9:44 AM My apologies for the problems with the subversion server. They are now fixed. You should now be able to commit your new files and directories to the subversion host as before.

Hide

Permalink

Nowlan Freese added a comment - 08/Jul/24 4:16 PM - edited

The Ramazzottius varieornatus genome and annotation have been deployed to the subversion host: https://svn.bioviz.org/viewvc/genomes/quickload/R_varieornatus_YOKOZUNA-1_Nov_2016/

To test:
Will need to have the svn repository on local machine (I think)

Do a svn update to pull the most recent changes
Add the svn repository as a local quickload in IGB
Look for the Ramazzottius varieornatus genome in IGB
Check that hovering on the species shows "tardigrade water bear"
Select the Ramazzottius varieornatus species and the Nov 2016 genome version
Check that the annotation file loads by default
Check that you can load sequence (click Load Sequence)

Show

Nowlan Freese added a comment - 08/Jul/24 4:16 PM - edited The Ramazzottius varieornatus genome and annotation have been deployed to the subversion host: https://svn.bioviz.org/viewvc/genomes/quickload/R_varieornatus_YOKOZUNA-1_Nov_2016/ To test: Will need to have the svn repository on local machine (I think) Do a svn update to pull the most recent changes Add the svn repository as a local quickload in IGB Look for the Ramazzottius varieornatus genome in IGB Check that hovering on the species shows "tardigrade water bear" Select the Ramazzottius varieornatus species and the Nov 2016 genome version Check that the annotation file loads by default Check that you can load sequence (click Load Sequence)

Nowlan Freese made changes - 08/Jul/24 4:17 PM

Assignee

Nowlan Freese [ nfreese ]

Nowlan Freese made changes - 08/Jul/24 4:17 PM

Status

In Progress [ 3 ]

Needs 1st Level Review [ 10005 ]

Nowlan Freese made changes - 08/Jul/24 4:35 PM

Link

This issue relates to ~~IGBF-3742~~ [ ~~IGBF-3742~~ ]

Nowlan Freese made changes - 10/Jul/24 9:19 AM

Assignee

Paige Kulzer [ pkulzer ]

Paige Kulzer (Inactive) made changes - 11/Jul/24 9:05 AM

Status

Needs 1st Level Review [ 10005 ]

First Level Review in Progress [ 10301 ]

Hide

Permalink

Paige Kulzer (Inactive) added a comment - 11/Jul/24 9:37 AM

No issues found while conducting the test outlined above – I was able to add the R. varieornatus genome to IGB as a local quickload and confirm that the annotations and sequences load properly. This quickload is ready for next steps!

Show

Paige Kulzer (Inactive) added a comment - 11/Jul/24 9:37 AM No issues found while conducting the test outlined above – I was able to add the R. varieornatus genome to IGB as a local quickload and confirm that the annotations and sequences load properly. This quickload is ready for next steps!

Paige Kulzer (Inactive) made changes - 11/Jul/24 9:37 AM

Status

First Level Review in Progress [ 10301 ]

Needs 1st Level Review [ 10005 ]

Paige Kulzer (Inactive) made changes - 11/Jul/24 9:37 AM

Status

Needs 1st Level Review [ 10005 ]

First Level Review in Progress [ 10301 ]

Paige Kulzer (Inactive) made changes - 11/Jul/24 9:37 AM

Status

First Level Review in Progress [ 10301 ]

Ready for Pull Request [ 10304 ]

Paige Kulzer (Inactive) made changes - 11/Jul/24 9:37 AM

Assignee

Paige Kulzer [ pkulzer ]

Ann Loraine [ aloraine ]

Hide

Permalink

Nowlan Freese added a comment - 11/Jul/24 3:51 PM

Next step is for Dr. Loraine to deploy the changes to the various Quickloads.

Show

Nowlan Freese added a comment - 11/Jul/24 3:51 PM Next step is for Dr. Loraine to deploy the changes to the various Quickloads.

Nowlan Freese made changes - 11/Jul/24 3:51 PM

Status

Ready for Pull Request [ 10304 ]

Pull Request Submitted [ 10101 ]

Hide

Permalink

Ann Loraine added a comment - 13/Jul/24 5:43 PM - edited

Changes deployed to QL at:

RENCI host (primary quickload)
UNC Charlotte host (backup quickload)

Show

Ann Loraine added a comment - 13/Jul/24 5:43 PM - edited Changes deployed to QL at: RENCI host (primary quickload) UNC Charlotte host (backup quickload)

Ann Loraine made changes - 13/Jul/24 5:59 PM

Status

Pull Request Submitted [ 10101 ]

Reviewing Pull Request [ 10303 ]

Ann Loraine made changes - 13/Jul/24 5:59 PM

Status

Reviewing Pull Request [ 10303 ]

Merged Needs Testing [ 10002 ]

Ann Loraine made changes - 13/Jul/24 5:59 PM

Assignee

Ann Loraine [ aloraine ]

Ann Loraine made changes - 14/Jul/24 2:40 PM

Sprint

Spring 7, Spring 9, Summer 1, Summer 2, Summer 3, Summer 4 [ 191, 193, 195, 196, 197, 198 ]

Spring 7, Spring 9, Summer 1, Summer 2, Summer 3, Summer 4, Summer 5 [ 191, 193, 195, 196, 197, 198, 199 ]

Ann Loraine made changes - 14/Jul/24 2:40 PM

Rank

Ranked higher

Dylan Marrotte (Inactive) made changes - 25/Jul/24 11:31 AM

Assignee

Dylan Marrotte [ dmarrott ]

Dylan Marrotte (Inactive) made changes - 25/Jul/24 1:14 PM

Status

Merged Needs Testing [ 10002 ]

Post-merge Testing In Progress [ 10003 ]

Hide

Permalink

Dylan Marrotte (Inactive) added a comment - 25/Jul/24 1:15 PM

No issues found while conducting the test!

Show

Dylan Marrotte (Inactive) added a comment - 25/Jul/24 1:15 PM No issues found while conducting the test!

Dylan Marrotte (Inactive) made changes - 25/Jul/24 1:15 PM

Resolution		Done [ 10000 ]
Status	Post-merge Testing In Progress [ 10003 ]	Closed [ 6 ]

Dylan Marrotte (Inactive) made changes - 25/Jul/24 1:15 PM

Resolution	Done [ 10000 ]
Status	Closed [ 6 ]	To-Do [ 10305 ]

Dylan Marrotte (Inactive) made changes - 25/Jul/24 1:15 PM

Assignee

Dylan Marrotte [ dmarrott ]

Ann Loraine [ aloraine ]

Nowlan Freese made changes - 25/Jul/24 1:41 PM

Status

To-Do [ 10305 ]

Pull Request Submitted [ 10101 ]

Nowlan Freese made changes - 25/Jul/24 1:41 PM

Status

Pull Request Submitted [ 10101 ]

Reviewing Pull Request [ 10303 ]

Nowlan Freese made changes - 25/Jul/24 1:41 PM

Status

Reviewing Pull Request [ 10303 ]

Merged Needs Testing [ 10002 ]

Nowlan Freese made changes - 25/Jul/24 1:41 PM

Status

Merged Needs Testing [ 10002 ]

Post-merge Testing In Progress [ 10003 ]

Nowlan Freese made changes - 25/Jul/24 1:41 PM

Resolution		Done [ 10000 ]
Status	Post-merge Testing In Progress [ 10003 ]	Closed [ 6 ]

Nowlan Freese made changes - 25/Jul/24 1:42 PM

Comment

[ To do:
* -Fix typo in [annots.xml|https://igbquickload-main.bioviz.org/quickload/H_exemplaris_Z151_Apr_2017/annots.xml] - change March 3, 2022 -> March 4, 2022-
* Add [R. varieornatus|https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001949185.1/] tardigrade genome and annotation to IGB Quickload.
* Update gff3ToBedDetail.py in [GenomeSource|https://bitbucket.org/lorainelab/genomesource/src/master/] repo so that it handles non protein coding genes. ]

Hide

Permalink

Nowlan Freese added a comment - 25/Jul/24 1:43 PM

Closing ticket.

Show

Nowlan Freese added a comment - 25/Jul/24 1:43 PM Closing ticket.

Dylan Marrotte (Inactive) made changes - 26/Jul/24 10:15 AM

Assignee

Ann Loraine [ aloraine ]

Nowlan Freese made changes - 24/Sep/24 8:56 AM

Fix Version/s

10.0.1 [ 10901 ]

Nowlan Freese made changes - 24/Sep/24 8:58 AM

Fix Version/s		10.0.0 Major Release [ 10900 ]
Fix Version/s	10.0.1 [ 10901 ]

People

Assignee:

Unassigned

Reporter:

Ann Loraine

Votes:

0 Vote for this issue

Watchers:

4 Start watching this issue

Dates

Created:

02/Apr/24 10:36 AM

Updated:

24/Sep/24 8:58 AM

Resolved:

25/Jul/24 1:41 PM