[IGBF-259] Add D. melanogaster (Release 6) genome annotations to IGBQuickLoad - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
3
Epic Link:
Quickload/Data Repository
Sprint:
Sprint 9

Description

Users have requested that the most recent release of the D. melanogaster genome (Release 6) be added to IGB.

Attachments


FlyBase_vs_IGB_HR4genetranscripts.PNG
40 kB
02/Dec/14 9:06 AM

Issue Links

relates to

DB-130 Edge-matching is being randomly selective (for Fruit Fly Gene Model)

Open

DB-158 Resolve Fruit Fly gene annotations issues

Open

Activity


Mason Meyer (Inactive) added a comment - 12/Nov/14 2:00 PM

To keep track of user requests for this issue I have included links to the JIRA stories containing the user requests:

~~HELP-72~~ and ~~HELP-80~~

Please inform Mason once the task is complete so that he can inform the users of the update.


Ann Loraine added a comment - 24/Nov/14 8:35 AM

Message to FlyBase, sent via comment form, Nov 23:

Hello,

I'm working on a project to convert the latest (version 6) D. melanogaster gene model annotations and FASTA sequence into formats viewable in Integrated Genome Browser (http://www.bioviz.org).

I did this using some custom code (i.e., not yet good enough for others to re-use) and have deployed the files onto the IGB QuickLoad site, along with some documentation.

I've asked the team here to review the code, the files, and the documentation for accuracy. They will likely finish their review by Tuesday evening. I'll spend the next few days after that fixing any problems they find.

After that, I would be grateful if someone at FlyBase could take a look. I'm not very familiar with fruit fly informatics and could easily miss important details.

Lastly, if you are hosting NGS files at FlyBase via an HTTP server, then it would not be difficult to configure IGB to access those files. This could give users a great new way to visualize the wealth of fly data that FlyBase offers. In a nutshell, the way this works is that we would reference the URLs of BAM files in the "annots.xml" file for fruit fly on our server. This is a bit like the UCSC "track hub" concept, but simpler. I will ask the team here to set up a demonstration site so that you can see how it works - probably the demonstration site will be ready the week after the Thanksgiving holiday.

Also, I'm not sure if anyone working at FlyBase would remember me? I worked at BDGP many years ago in Suzi Lewis' group. That was in the late 1990s - seems like a lifetime ago!

I look forward to hearing from you.

Ann Loraine
aloraine@uncc.edu
Associate Professor
UNC Charlotte
Dept of Bioinformatics and Genomics
www.lorainelab.org
www.bioviz.org
www.bitbucket.org/lorainelab


Mason Meyer (Inactive) added a comment - 26/Nov/14 6:03 AM

Ivory noticed a strange edge-matching issue on a gene model in this new fruit fly genome release (See IGBF-278).


Mason Meyer (Inactive) added a comment - 02/Dec/14 9:12 AM - edited

Other than the issue with the one gene model, FBgn0002781 (IGBF-278), my testing has verified that the gene models are displaying properly in IGB. This story will now be closed.


Mason Meyer (Inactive) added a comment - 03/Dec/14 4:49 AM - edited

Ivory's Notes from BaseCamp

Adding dm6 annotation to IGBQuickLoad
Many users have asked us to add the July 2014 release of the Drosophila genome and its annotations to IGB. Here, we describe the steps we took to do this and include notes explaining technical details.

The organization that manages the fruit fly data is called FlyBase (flybase.org). They assembled the new fruit fly genome and annotated it. Ann got the sequence data and annotations from FlyBase, reformatted them for IGB, and added them to the IGBQuickLoad repository.

Synonyms for this new release include:

BDGP R6
dm6
D_melanogaster_Jul_2014

Annotation file downloads
We can get data files with annotations from here: ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.03_FB2014_06

Ann wrote about making the files in this blog post: http://lorainelab.org/?p=457

Testing annotation file in IGB
This is a line from the gtf file:

3R FlyBase CDS 21361670 21361827 15 + 2 gene_id "FBgn0261838"; gene_symbol "pre-mod(mdg4)-Z"; transcript_id "FBtr0303402"; transcript_symbol "pre-mod(mdg4)-Z-RA"; # SO:0000459:gene_with_trans_spliced_transcript SO:0000722:gene_with_dicistronic_mRNA

I searched for "FBgn0261838" (the gene id) in "ids or names", but IGB returned nothing.
IGB did return search results for "pre-mod(mdg4)-Z" and for "FBtr0303402".
The gene name is in the description, so I have to search "key words", not just "ids or names", if I want to find the gene name.
Maybe gene name and gene symbol should switch, so that the transcript ids and the gene ids are in the name and id fields and the symbol is in the description field.

The comment doesn't seem to have disrupted the feature, but the tags (SO:0000722 and SO:0000459) didn't make it into the bed detail description. I'm not sure whether including them would have been useful or not.
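To make the field layout concrete, here is a minimal Python sketch of how the gtf line above splits into columns, attributes, and the trailing comment. The function name `parse_gtf_line` is made up for illustration, and the sketch assumes that `#` never appears inside a quoted attribute value. It is not the code Ann used to build the QuickLoad files.

```python
import re

def parse_gtf_line(line):
    """Split one FlyBase-style GTF line into columns, attributes, and SO tags."""
    # Everything after '#' is treated as a comment (assumption: no '#'
    # inside quoted attribute values).
    record, _, comment = line.partition("#")
    # Real GTF is tab-delimited; splitting on any whitespace also handles
    # lines pasted with spaces, as in this ticket.
    fields = record.strip().split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 9 GTF columns, got %d" % len(fields))
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        # key "value"; pairs from the ninth column
        "attributes": dict(re.findall(r'(\S+) "([^"]*)";', attr_text)),
        # SO identifiers found in the trailing comment, if any
        "so_terms": re.findall(r"SO:\d+", comment),
    }

line = ('3R FlyBase CDS 21361670 21361827 15 + 2 gene_id "FBgn0261838"; '
        'gene_symbol "pre-mod(mdg4)-Z"; transcript_id "FBtr0303402"; '
        'transcript_symbol "pre-mod(mdg4)-Z-RA"; '
        '# SO:0000459:gene_with_trans_spliced_transcript '
        'SO:0000722:gene_with_dicistronic_mRNA')
rec = parse_gtf_line(line)
print(rec["attributes"]["gene_id"], rec["so_terms"])
```

Keeping the SO identifiers in their own list would make it straightforward to append them to a description field later, if we decide they belong there.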

FBgn0002781 has a ton of transcripts (31), and they are not all on the same strand.
Weird.
All start codons are described as being on the positive strand.
Strangely, some transcripts (e.g., FBtr0307760) have some components on the negative strand and some on the positive strand.
It looks like this transcript should actually be on the negative strand. The start codon in IGB is around 21,367,195, and in the gtf file the stop codon is:

3R FlyBase stop_codon 21367204 21367206 . + 0 gene_id "FBgn0002781"; gene_symbol "mod(mdg4)"; transcript_id "FBtr0307760"; transcript_symbol "mod(mdg4)-RAE";

and the START in the gtf is around 21377030 - 21377032, and IGB shows the translation stop as 21,377,031.

It looks like the problem stems from the fact that the FlyBase gtf has some components annotated with +. I don't see any pattern as to why those 4 out of 14 components for this transcript are annotated as + strand while all the rest are - strand. Weird.
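A check like the one I did by eye can be sketched in Python: group the gtf lines by transcript_id and report any transcript whose components do not all share one strand. This is only a rough sketch. The function name `mixed_strand_transcripts` is hypothetical, and the second demo line uses invented coordinates purely to exercise the check.

```python
import re
from collections import defaultdict

def mixed_strand_transcripts(gtf_lines):
    """Return {transcript_id: set of strands} for transcripts on more than one strand."""
    strands = defaultdict(set)
    for line in gtf_lines:
        record = line.split("#", 1)[0].strip()  # drop trailing comments
        fields = record.split(None, 8)
        if len(fields) < 9:
            continue  # blank, comment-only, or malformed line
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            strands[match.group(1)].add(fields[6])  # column 7 is the strand
    return {tid: s for tid, s in strands.items() if len(s) > 1}

# The first line is the stop_codon record quoted above. The second is an
# invented minus-strand exon (made-up coordinates), NOT taken from FlyBase.
demo = [
    '3R FlyBase stop_codon 21367204 21367206 . + 0 gene_id "FBgn0002781"; '
    'gene_symbol "mod(mdg4)"; transcript_id "FBtr0307760"; '
    'transcript_symbol "mod(mdg4)-RAE";',
    '3R FlyBase exon 100 200 . - . gene_id "FBgn0002781"; '
    'transcript_id "FBtr0307760";',
]
print(mixed_strand_transcripts(demo))
```

Run over the whole gtf file, this would list every transcript with the same mixed-strand pattern, which would tell us whether FBgn0002781 is really the only affected gene.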

Thank you, comment: SO:0000459:gene_with_trans_spliced_transcript
I didn't know that happened, and I don't know how best to show it in IGB. I guess the gene as a whole is considered to be on the negative strand but some of its exons are flipped during splicing.
Showing the entire transcript on the + strand is not accurate. But I don't know how to show that the translated region has a flipped sequence.

Looking in extreme detail at FBtr0088013 (a transcript of gene FBgn0014184), I thought there might be a small problem.
No, actually, everything is fine. It just threw me off that the range of the stop codon was not part of the translated region in IGB (which seems ideal) and it was not part of the untranslated region in the FlyBase file (which seems odd, but ok) so the ends of things looked like they didn't match up, but they do.

This does present a minor problem: should the stop codon be included in the translated region? I think you could make a good argument either way, but consistency is king, and we are inconsistent: In the Arabidopsis annotations, the stop codon is considered part of the translated region; in the fruit fly annotations, it is not. See attached images.
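The two conventions differ only in whether the stop codon's three bases extend the translated region. A minimal Python sketch of that choice follows. The function name and the coordinates are made up for illustration, and it assumes the CDS segments passed in exclude the stop codon and that the stop codon is not split across exons.

```python
def translated_region(cds_intervals, stop_codon, include_stop):
    """Return (start, end) of the translated region, 1-based inclusive.

    cds_intervals: list of (start, end) CDS segments that exclude the stop codon.
    stop_codon: (start, end) of the stop codon.
    Using min/max means the same code works on either strand.
    """
    start = min(s for s, _ in cds_intervals)
    end = max(e for _, e in cds_intervals)
    if include_stop:
        start = min(start, stop_codon[0])
        end = max(end, stop_codon[1])
    return start, end

# Made-up plus-strand example: the CDS ends at 500, the stop codon is 501-503.
cds = [(100, 200), (400, 500)]
stop = (501, 503)
print(translated_region(cds, stop, include_stop=False))  # (100, 500)
print(translated_region(cds, stop, include_stop=True))   # (100, 503)
```

Whichever convention we pick, applying it through a single switch like `include_stop` in every conversion script would keep the Arabidopsis and fruit fly annotations consistent.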

FBgn0031299
includes comment:

codon exception, see reports for details.

The comment is only applied to the start_codon component of each transcript. These transcripts are odd in that the translated region begins with CUG rather than the canonical AUG.

scores:
I don't know how FlyBase is using the score field, but they do have values in it, sometimes just a placeholder. In IGB the score is always 0. In the handful of cases that I checked, the score is the same for all components of a given transcript, so there's no reason we shouldn't be able to transfer it into a transcript score in bed detail.
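Before transferring scores, it would be worth confirming across the whole file that each transcript really has a single score. A rough Python sketch of that check is below. The function names are hypothetical, and every demo line except the first uses invented values.

```python
import re
from collections import defaultdict

def scores_by_transcript(gtf_lines):
    """Map each transcript_id to the set of distinct score strings on its lines."""
    scores = defaultdict(set)
    for line in gtf_lines:
        fields = line.split("#", 1)[0].strip().split(None, 8)
        if len(fields) < 9:
            continue  # blank, comment-only, or malformed line
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            scores[match.group(1)].add(fields[5])  # column 6 is the score
    return dict(scores)

def transferable_scores(gtf_lines):
    """Keep one score per transcript, only where every component agrees."""
    return {tid: next(iter(s))
            for tid, s in scores_by_transcript(gtf_lines).items()
            if len(s) == 1}

# The first line comes from the ticket; the others are invented to show a
# transcript with consistent scores and one with conflicting scores.
demo = [
    '3R FlyBase CDS 21361670 21361827 15 + 2 gene_id "FBgn0261838"; '
    'transcript_id "FBtr0303402";',
    '3R FlyBase exon 21361600 21361827 15 + . gene_id "FBgn0261838"; '
    'transcript_id "FBtr0303402";',
    '3R FlyBase exon 10 20 3 + . gene_id "G_example"; transcript_id "T_example";',
    '3R FlyBase CDS 12 18 7 + 0 gene_id "G_example"; transcript_id "T_example";',
]
print(transferable_scores(demo))
```

A placeholder score of "." is kept as the string "." here, so a converter would still need to decide what number, if any, to write into bed detail for those transcripts.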

In summary

Most of the genes that I looked at seem fine.
The issues I found that we would want to address are:

the inclusion of the stop codon as part of the translated region (applies to all genes, does not affect reading frame), and
the strand associated with the genes that have the comment
SO:0000459:gene_with_trans_spliced_transcript,
which is a small set (26 genes). Of those, only one gene, FBgn0002781, seems to have any issue, so it may or may not be related to the comment.
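The set of tagged genes can be collected with a short script, sketched here in Python. The function name `genes_with_so_term` is hypothetical, and the sketch assumes the SO tags only ever appear after a `#` at the end of a line, as in the example gtf line earlier in these notes.

```python
import re

def genes_with_so_term(gtf_lines, so_id="SO:0000459"):
    """Return the set of gene_ids whose trailing comment carries the given SO id."""
    genes = set()
    for line in gtf_lines:
        record, sep, comment = line.partition("#")
        if not sep or so_id not in re.findall(r"SO:\d+", comment):
            continue  # no comment on this line, or a different tag
        match = re.search(r'gene_id "([^"]+)"', record)
        if match:
            genes.add(match.group(1))
    return genes

# Both lines are quoted from the notes above; only the first has a comment.
demo = [
    '3R FlyBase CDS 21361670 21361827 15 + 2 gene_id "FBgn0261838"; '
    'gene_symbol "pre-mod(mdg4)-Z"; transcript_id "FBtr0303402"; '
    'transcript_symbol "pre-mod(mdg4)-Z-RA"; '
    '# SO:0000459:gene_with_trans_spliced_transcript '
    'SO:0000722:gene_with_dicistronic_mRNA',
    '3R FlyBase stop_codon 21367204 21367206 . + 0 gene_id "FBgn0002781"; '
    'gene_symbol "mod(mdg4)"; transcript_id "FBtr0307760"; '
    'transcript_symbol "mod(mdg4)-RAE";',
]
print(genes_with_so_term(demo))
```

Taking `len()` of the result on the full file is one way to reproduce the gene count, and intersecting it with a list of mixed-strand transcripts would show how many tagged genes actually have the strand problem.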


Mason Meyer (Inactive) added a comment - 03/Dec/14 4:57 AM

Here is a link to Ivory's notes on this issue from BaseCamp:
https://basecamp.com/2734534/projects/7303664/documents/7206320
