  1. IGB
  2. IGBF-259

Add D. melanogaster (Release 6) genome annotations to IGBQuickLoad

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Users have requested that the most recent D. melanogaster genome release (Release 6) be added to IGB.

            Activity

            mason Mason Meyer (Inactive) added a comment -

            To keep track of user requests for this issue I have included links to the JIRA stories containing the user requests:

            HELP-72 and HELP-80

            Please inform Mason once the task is complete so that he can inform the users of the update.

            ann.loraine Ann Loraine added a comment -

            Message to FlyBase, sent via comment form, Nov 23:

            Hello,

            I'm working on a project to convert the latest (version 6) D. melanogaster gene model annotations and fasta sequence into formats viewable in Integrated Genome Browser. (http://www.bioviz.org)

            I did this using some custom code (i.e., not yet good enough for others to re-use) and have deployed the files onto the IGB QuickLoad site, along with some documentation.

            I've asked the team here to review the code, the files, and the documentation for accuracy. They will likely finish their review by Tuesday evening. I'll spend the next few days after that fixing any problems they find.

            After that, I would be grateful if someone at FlyBase could take a look. I'm not very familiar with fruit fly informatics and could easily miss important details.

            Lastly, if you are hosting NGS files at FlyBase via an HTTP server, then it would not be difficult to configure IGB to access those files. This could give users a great new way to visualize the wealth of fly data that FlyBase offers. In a nutshell, the way this works is that we would reference the URLs of BAM files in the "annots.xml" file for fruit fly on our server. This is a bit like the UCSC "track hub" concept, but simpler. I will ask the team here to set up a demonstration site so that you can see how it works - probably the demonstration site will be ready the week after the Thanksgiving holiday.

            Also, I'm not sure if anyone working at FlyBase would remember me? I worked at BDGP many years ago in Suzi Lewis' group. That was in the late 1990s - seems like a lifetime ago!

            I look forward to hearing from you.

            Ann Loraine
            aloraine@uncc.edu
            Associate Professor
            UNC Charlotte
            Dept of Bioinformatics and Genomics
            www.lorainelab.org
            www.bioviz.org
            www.bitbucket.org/lorainelab
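
The "annots.xml" idea mentioned in the message can be sketched roughly as follows. This is a minimal illustration, not the actual QuickLoad configuration: the `files`/`file` element names and the `name`/`title` attributes are assumptions about the annots.xml layout, and the URL and track title are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def build_annots_xml(tracks):
    """Return an annots.xml-style document listing remote data files.

    Each track is a dict with a "url" (where the BAM file is hosted) and a
    "title" (the label shown to the user). The element and attribute names
    are assumptions, not confirmed against the IGB QuickLoad documentation.
    """
    root = ET.Element("files")
    for track in tracks:
        # One <file> entry per remote file; "name" carries the URL.
        ET.SubElement(root, "file", name=track["url"], title=track["title"])
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical example track; not a real FlyBase URL.
    print(build_annots_xml([
        {"url": "http://example.org/data/sample.bam", "title": "Sample RNA-Seq"},
    ]))
```

The point of the design is that the data files stay on the remote HTTP server, and only this small listing file on the QuickLoad side needs to change.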

            mason Mason Meyer (Inactive) added a comment -

            Ivory noticed a strange edge-matching issue on a gene model in this new fruit fly genome release (See IGBF-278).

            mason Mason Meyer (Inactive) added a comment - - edited

            Other than the issue with one gene model, FBgn0002781 (IGBF-278), my testing has verified that the gene models are displaying properly in IGB. This story will now be closed.

            mason Mason Meyer (Inactive) added a comment - - edited

            Ivory's Notes from BaseCamp

            Adding dm6 annotation to IGBQuickLoad
            Many users have asked us to add the July 2014 release of the Drosophila genome and its annotations to IGB. Here we describe the steps we took to do this and include notes explaining technical details.

            The organization that manages the fruit fly data is called FlyBase (flybase.org). They assembled the new fruit fly genome and annotated it. Ann got the sequence data and annotations from FlyBase, reformatted them for IGB, and added them to the IGBQuickLoad repository.

            Synonyms for this new release include:

            BDGP R6
            dm6
            D_melanogaster_Jul_2014

            Annotation file downloads
            We can get data files with annotations from here: ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.03_FB2014_06

            Ann wrote about making the files in this blog post: http://lorainelab.org/?p=457

            Testing annotation file in IGB
            This is a line from the gtf file:

            3R FlyBase CDS 21361670 21361827 15 + 2 gene_id "FBgn0261838"; gene_symbol "pre-mod(mdg4)-Z"; transcript_id "FBtr0303402"; transcript_symbol "pre-mod(mdg4)-Z-RA"; # SO:0000459:gene_with_trans_spliced_transcript SO:0000722:gene_with_dicistronic_mRNA

            I searched for "FBgn0261838" (the gene id) under "ids or names", but IGB returned nothing.
            IGB did return search results for "pre-mod(mdg4)-Z" and for "FBtr0303402".
            The gene name is in the description, so I have to search "key words", not just "ids or names", if I want to find the gene name.
            Maybe gene name and gene symbol should switch, so that the transcript ids and the gene ids are in the name and id fields and the symbol is in the description field.

            The trailing comment doesn't seem to have disrupted the feature, but the tags (SO:0000722 and SO:0000459) didn't make it into the bed detail description. I'm not sure whether including them would have been useful.
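
To make the layout of that line concrete, here is a minimal sketch of splitting it into its columns, its attributes, and the trailing SO tags. It assumes whitespace-separated columns (as the line appears here), that a trailing comment is introduced by a free-standing "#", and that attribute values are double-quoted. The function name `parse_gtf_line` is made up for illustration and is not part of the code used to build the QuickLoad files.

```python
import re


def parse_gtf_line(line):
    """Parse one GTF line into a dict of columns, attributes, and SO tags."""
    # Separate the optional trailing comment ("... ; # SO:...") from the record.
    parts = re.split(r"\s+#\s+", line.strip(), maxsplit=1)
    body = parts[0]
    comment = parts[1] if len(parts) > 1 else ""

    # The first eight columns are single tokens; the ninth holds the attributes.
    fields = body.split(None, 8)
    if len(fields) < 9:
        raise ValueError("expected 9 GTF columns, found %d" % len(fields))

    # Attributes look like: key "value";
    attributes = dict(re.findall(r'(\S+) "([^"]*)";', fields[8]))

    # Sequence Ontology tags look like: SO:0000459:gene_with_trans_spliced_transcript
    so_tags = re.findall(r"(SO:\d+):(\S+)", comment)

    return {
        "seqname": fields[0],
        "source": fields[1],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "score": fields[5],
        "strand": fields[6],
        "frame": fields[7],
        "attributes": attributes,
        "so_tags": so_tags,
    }
```

With the gene id, gene symbol, transcript id, and transcript symbol available as separate keys, it becomes a choice at conversion time which of them goes into the searchable name and id fields, and whether the SO tags are carried into the description.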

            FBgn0002781 has a ton of transcripts (31), and they are not all on the same strand.
            Weird.
            All start codons are described as being on the positive strand.
            Strangely, some transcripts (e.g., FBtr0307760) have some components on the negative strand and some on the positive strand.
            It looks like this transcript should actually be on the negative strand. The start codon in IGB is around 21,367,195, and in the gtf file the stop codon is:

            3R FlyBase stop_codon 21367204 21367206 . + 0 gene_id "FBgn0002781"; gene_symbol "mod(mdg4)"; transcript_id "FBtr0307760"; transcript_symbol "mod(mdg4)-RAE";

            and the START in the gtf is around 21377030 - 21377032, and IGB shows the translation stop as 21,377,031.

            It looks like the problem stems from the fact that the FlyBase gtf has some components annotated as + strand. I don't see any pattern as to why those 4 out of 14 components for this transcript are annotated as + strand while all the rest are - strand.
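
A check like the one described above could be automated along these lines. This is a minimal sketch, assuming whitespace-separated GTF columns with the strand in column 7 and a quoted transcript_id in the attributes. The function name is hypothetical.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def find_mixed_strand_transcripts(gtf_lines):
    """Return {transcript_id: {strand: count}} for transcripts on both strands."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in gtf_lines:
        # Skip blank lines and whole-line comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            counts[match.group(1)][fields[6]] += 1
    # Keep only transcripts whose components disagree about the strand.
    return {
        tid: dict(strands)
        for tid, strands in counts.items()
        if len(strands) > 1
    }
```

Running this over the full gtf file (for example, line by line from `gzip.open(path, "rt")` if the download is compressed) would list every transcript with components on both strands, which could then be compared against the set of genes carrying the SO:0000459 comment.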

            Thank you, comment: SO:0000459:gene_with_trans_spliced_transcript
            I didn't know that happened, and I don't know how best to show it in IGB. I guess the gene as a whole is considered to be on the negative strand but some of its exons are flipped during splicing.
            Showing the entire transcript on the + strand is not accurate. But I don't know how to show that the translated region has a flipped sequence.

            Looking in extreme detail at FBtr0088013 (a transcript of gene FBgn0014184), there may be a small problem.
            No, actually, everything is fine. It just threw me off that the range of the stop codon was not part of the translated region in IGB (which seems ideal) and it was not part of the untranslated region in the FlyBase file (which seems odd, but ok) so the ends of things looked like they didn't match up, but they do.

            This does present a minor problem: should the stop codon be included in the translated region? I think you could make a good argument either way, but consistency is king, and we are inconsistent: In the Arabidopsis annotations, the stop codon is considered part of the translated region; in the fruit fly annotations, it is not. See attached images.
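
For reference, the GTF convention is that CDS features do not include the stop codon, which matches what the FlyBase file shows. Whichever choice we make, it could be applied uniformly at conversion time. The sketch below assumes 1-based inclusive coordinates and a stop codon that lies outside the CDS blocks. The function name and the coordinates in the example are hypothetical.

```python
def translated_bounds(cds_blocks, stop_codon, include_stop):
    """Return (start, end) of the translated region for one transcript.

    cds_blocks: list of (start, end) tuples for the CDS features.
    stop_codon: (start, end) tuple for the stop_codon feature.
    include_stop: if True, extend the region to cover the stop codon.
    """
    start = min(block_start for block_start, _ in cds_blocks)
    end = max(block_end for _, block_end in cds_blocks)
    if include_stop:
        # Taking min/max handles both strands: on the + strand the stop codon
        # sits past the right end, on the - strand before the left end.
        start = min(start, stop_codon[0])
        end = max(end, stop_codon[1])
    return start, end


if __name__ == "__main__":
    # Hypothetical plus-strand transcript with two CDS blocks.
    cds = [(100, 200), (300, 397)]
    stop = (398, 400)
    print(translated_bounds(cds, stop, include_stop=False))
    print(translated_bounds(cds, stop, include_stop=True))
```

Applying the same flag to both the Arabidopsis and the fruit fly conversions would remove the inconsistency, regardless of which convention is picked.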

            FBgn0031299
            includes comment:

            # codon exception, see reports for details.

            The comment is only applied to the start_codon component of each transcript. These transcripts are odd in that the translated region begins with CUG rather than the canonical AUG.

            scores:
            I don't know how FlyBase is using the score field, but they do have values in it, sometimes just a placeholder. In IGB the score is always 0. In the handful of cases that I checked, the score is the same for all components of a given transcript, so there's no reason we shouldn't be able to transfer it into a transcript score in bed detail.
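
Before transferring scores, the "same for all components" observation could be checked across the whole file. A minimal sketch, assuming whitespace-separated GTF columns with the score in column 6, "." as the placeholder, and numeric values otherwise. The function name is hypothetical.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_scores(gtf_lines):
    """Map each transcript id to its single component score, or None.

    None means the transcript had only placeholder scores (".") or its
    components carried more than one distinct score.
    """
    scores = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if not match:
            continue
        seen = scores[match.group(1)]  # registers the transcript even if unscored
        if fields[5] != ".":
            seen.add(float(fields[5]))
    return {
        tid: (next(iter(seen)) if len(seen) == 1 else None)
        for tid, seen in scores.items()
    }
```

Transcripts that map to None would keep the current score of 0. Since BED scores are expected to fall between 0 and 1000, the FlyBase values may need checking against that range before being written out.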

            In summary

            Most of the genes that I looked at seem fine.
            The issues I found that we would want to address are:

            the inclusion of the stop codon as part of the translated region (applies to all genes, does not affect reading frame), and
            the strand associated with the genes that have the comment:
            SO:0000459:gene_with_trans_spliced_transcript
            This is a small set (26 genes). Of those, only one gene, FBgn0002781, seems to have any issue, so the problem may or may not be related to the comment.

            mason Mason Meyer (Inactive) added a comment -

            Here is a link to Ivory's notes on this issue from BaseCamp:
            https://basecamp.com/2734534/projects/7303664/documents/7206320


              People

              • Assignee:
                mason Mason Meyer (Inactive)
                Reporter:
                mason Mason Meyer (Inactive)