Ivory's Notes from BaseCamp
Adding dm6 annotation to IGBQuickLoad
Many users have asked us to add the July 2014 release of the Drosophila genome and its annotations to IGB. Here, we describe the steps we took to do this and also include notes explaining technical details.
The organization that manages the fruit fly data is called FlyBase (flybase.org). They assembled the new fruit fly genome and annotated it. Ann got the sequence data and annotations from FlyBase, reformatted them for IGB, and added them to the IGBQuickLoad repository.
Synonyms for this new release include:
BDGP R6
dm6
D_melanogaster_Jul_2014
Annotation file downloads
We can get data files with annotations from here: ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.03_FB2014_06
Ann wrote about making the files in this blog post: http://lorainelab.org/?p=457
Testing annotation file in IGB
This is a line from the gtf file:
3R FlyBase CDS 21361670 21361827 15 + 2 gene_id "FBgn0261838"; gene_symbol "pre-mod(mdg4)-Z"; transcript_id "FBtr0303402"; transcript_symbol "pre-mod(mdg4)-Z-RA"; # SO:0000459:gene_with_trans_spliced_transcript SO:0000722:gene_with_dicistronic_mRNA
I tried to search "FBgn0261838" (the gene id) in "ids or names" but IGB returned nothing.
IGB did return search results for "pre-mod(mdg4)-Z" and for "FBtr0303402".
The gene name is in the description, so I have to search "key words", not just "ids or names", if I want to find the gene name.
Maybe gene name and gene symbol should switch, so that the transcript ids and the gene ids are in the name and id fields and the symbol is in the description field.
The comment doesn't seem to have disrupted the feature, but the tags (SO:0000722 and SO:0000459) didn't make it into the bed detail description. I'm not sure whether including them would have been useful.
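To make the structure of that GTF line concrete, here is a minimal Python sketch that splits a line into its nine fields, the attribute pairs, and any SO tags in the trailing comment. It assumes the file is tab-delimited and that the comment follows the last quoted attribute value, as in the line shown above. The function name parse_gtf_line is made up for illustration and is not part of IGB or FlyBase tooling.

```python
import re

# Matches attribute pairs of the form: key "value";
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)";')
# Matches comment tags such as SO:0000459:gene_with_trans_spliced_transcript
SO_TAG = re.compile(r'SO:\d+:\w+')


def parse_gtf_line(line):
    """Split one tab-delimited GTF line into fields, attributes, and SO tags."""
    fields = line.rstrip("\n").split("\t", 8)
    if len(fields) != 9:
        raise ValueError("expected 9 tab-delimited fields, got %d" % len(fields))
    seqid, source, feature, start, end, score, strand, frame, attr_text = fields
    # Assumption: any comment comes after the last quoted attribute value.
    tail = attr_text.rsplit('";', 1)[-1]
    return {
        "seqid": seqid,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,      # kept as text because "." is a valid placeholder
        "strand": strand,
        "frame": frame,
        "attributes": dict(ATTRIBUTE.findall(attr_text)),
        "so_tags": SO_TAG.findall(tail),
    }


# The CDS line quoted above, rebuilt with tabs between the nine fields.
example = "\t".join([
    "3R", "FlyBase", "CDS", "21361670", "21361827", "15", "+", "2",
    'gene_id "FBgn0261838"; gene_symbol "pre-mod(mdg4)-Z"; '
    'transcript_id "FBtr0303402"; transcript_symbol "pre-mod(mdg4)-Z-RA"; '
    '# SO:0000459:gene_with_trans_spliced_transcript '
    'SO:0000722:gene_with_dicistronic_mRNA',
])
record = parse_gtf_line(example)
print(record["attributes"]["gene_id"], record["so_tags"])
```

With the fields separated like this, it would be straightforward to choose which attribute goes into the name, id, or description field, and to carry the SO tags along if we decide they are useful.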
FBgn0002781 has a ton of transcripts (31), and they are not all on the same strand.
Weird.
All start codons are described as being on the positive strand.
Strangely, some transcripts (e.g. FBtr0307760) have some components on the negative strand and some on the positive strand ...?
It looks like this transcript should actually be on the negative strand. The start codon in IGB is around 21,367,195, and in the gtf file the stop codon is:
3R FlyBase stop_codon 21367204 21367206 . + 0 gene_id "FBgn0002781"; gene_symbol "mod(mdg4)"; transcript_id "FBtr0307760"; transcript_symbol "mod(mdg4)-RAE";
and the START in the gtf is around 21377030 - 21377032, and IGB shows the translation stop as 21,377,031.
It looks like the problem stems from the fact that the FlyBase gtf has some components annotated with +. I don't see any pattern as to why those 4 out of 14 components for this transcript are annotated as + strand while all the rest are - strand. Weird.
Thank you, comment: SO:0000459:gene_with_trans_spliced_transcript
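One way to find every transcript with this problem, rather than spotting them by eye in IGB, is to group the GTF components by transcript_id and report the transcripts whose components use both strands. The following is a rough Python sketch under the assumption that the file is tab-delimited with the strand in the seventh column. The function names are made up for illustration.

```python
from collections import defaultdict
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def strands_by_transcript(lines):
    """Map each transcript_id to the set of strands used by its components."""
    strands = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header or comment lines
        fields = line.rstrip("\n").split("\t", 8)
        if len(fields) != 9:
            continue  # not a feature line
        match = TRANSCRIPT_ID.search(fields[8])
        # Only count explicit strands; "." means no strand was given.
        if match and fields[6] in ("+", "-"):
            strands[match.group(1)].add(fields[6])
    return strands


def mixed_strand_transcripts(lines):
    """Return the sorted ids of transcripts with components on both strands."""
    by_transcript = strands_by_transcript(lines)
    return sorted(tid for tid, s in by_transcript.items() if len(s) > 1)
```

Passing an open file handle for the FlyBase gtf to mixed_strand_transcripts would list the transcripts that need a closer look.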
I didn't know that happened, and I don't know how best to show it in IGB. I guess the gene as a whole is considered to be on the negative strand but some of its exons are flipped during splicing.
Showing the entire transcript on the + strand is not accurate. But I don't know how to show that the translated region has a flipped sequence.
Looking with extreme detail at FBtr0088013 (a transcript of gene FBgn0014184), there may be a small problem.
No, actually, everything is fine. It just threw me off that the range of the stop codon was not part of the translated region in IGB (which seems ideal) and was not part of the untranslated region in the FlyBase file (which seems odd, but ok), so the ends of things looked like they didn't match up, but they do.
This does present a minor problem: should the stop codon be included in the translated region? I think you could make a good argument either way, but consistency is king, and we are inconsistent: In the Arabidopsis annotations, the stop codon is considered part of the translated region; in the fruit fly annotations, it is not. See attached images.
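To check which convention a given annotation set follows, one could compare the overall translated range of a transcript with its stop codon range. Below is a small Python sketch of that comparison. It assumes 1-based inclusive coordinates as in GTF, and that the stop codon is contiguous with the last coding base (not split across an intron). The function name and the return labels are made up for illustration.

```python
def stop_codon_convention(cds_start, cds_end, stop_start, stop_end, strand):
    """Classify whether the stop codon lies inside the translated region.

    cds_start and cds_end are the lowest and highest coordinates of the
    translated region; all coordinates are 1-based and inclusive.
    """
    if strand == "+":
        # On the plus strand the stop codon sits at the high-coordinate end.
        if stop_end == cds_end:
            return "included"
        if stop_start == cds_end + 1:
            return "excluded"
    elif strand == "-":
        # On the minus strand the stop codon sits at the low-coordinate end.
        if stop_start == cds_start:
            return "included"
        if stop_end == cds_start - 1:
            return "excluded"
    return "unclear"
```

Running this over a few transcripts from each species would show whether a file follows one convention throughout, which is the consistency we care about.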
FBgn0031299
includes comment:
- codon exception, see reports for details.
The comment is only applied to the start_codon component of each transcript. These transcripts are odd in that the translated region begins with CUG rather than the canonical AUG.
scores:
I don't know how FlyBase is using the score field, but they do have scores in the score field, sometimes just a placeholder. In IGB the score is always 0. In the handful of cases that I checked, the score is the same for all components of a given transcript, so there's no reason we shouldn't be able to transfer it into a transcript score in bed detail.
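Since I only checked a handful of cases by hand, a script could confirm that the score really is the same for all components of every transcript before we rely on it. Here is a Python sketch that assumes a tab-delimited gtf with the score in the sixth column and "." as the placeholder. The function name is made up for illustration.

```python
from collections import defaultdict
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_scores(lines):
    """Map each transcript_id to its one shared score.

    The value is None when the components disagree or when every
    component only has the "." placeholder.
    """
    seen = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header or comment lines
        fields = line.rstrip("\n").split("\t", 8)
        if len(fields) != 9:
            continue  # not a feature line
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        # Look up the set first so placeholder-only transcripts are still listed.
        scores = seen[match.group(1)]
        if fields[5] != ".":
            scores.add(fields[5])
    return {tid: (next(iter(s)) if len(s) == 1 else None)
            for tid, s in seen.items()}
```

Transcripts that map to None would need a decision before conversion; the rest could have their score copied into the score column of the bed detail file.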
In summary
Most of the genes that I looked at seem fine.
The issues I found that we would want to address are
the inclusion of the stop codon as part of the translated region (applies to all genes, does not affect reading frame), and
the strand associated with the genes that have the comment:
SO:0000459:gene_with_trans_spliced_transcript
This is a small set (26 genes). Of those, only one gene, FBgn0002781, seems to have any issue, so the issue may or may not be related to the comment.
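The set of genes carrying this comment can be collected directly from the gtf file. The Python sketch below assumes the tag appears in a trailing comment after the last quoted attribute, as in the CDS line shown earlier in these notes. The function name is made up for illustration.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANS_SPLICED = "SO:0000459:gene_with_trans_spliced_transcript"


def genes_with_tag(lines, tag=TRANS_SPLICED):
    """Return the set of gene ids whose lines carry the given comment tag."""
    genes = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t", 8)
        if len(fields) != 9:
            continue  # header, comment, or blank line
        attr_text = fields[8]
        # Assumption: the comment follows the last quoted attribute value.
        tail = attr_text.rsplit('";', 1)[-1]
        if tag in tail:
            match = GENE_ID.search(attr_text)
            if match:
                genes.add(match.group(1))
    return genes
```

Taking len() of the result gives the number of tagged genes, and the ids can then be compared against a list of transcripts with mixed-strand components.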
To keep track of user requests for this issue I have included links to the JIRA stories containing the user requests:
HELP-72 and HELP-80
Please inform Mason once the task is complete so that he can inform the users of the update.