  1. IGB
  2. IGBF-2381

Investigate problem with FIMO GFF file parsing

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Story Points:
      0.5
    • Sprint:
      Spring 8 : 11 May to 25 May, Spring 9 : 25 May to 8 Jun

      Description

      Dr. Beth Krizek (krizek@biol.sc.edu) has let us know about a problem with parsing the attached file.

      It is GFF format, but when we view it in IGB, only a small part of the data are being shown.

      When you click on one of the items, the "Id" shown in IGB does not exactly match the ID value in the GFF "extra feature" field, that is, the semicolon-separated list of attributes in the final column.

      Please investigate why this is happening. Is there a problem with our GFF parsing code, or is it an issue with the formatting of this file?

      In addition, take a look at the unit testing code for this format. If the data in this file are OK, we should use some lines of data in the file to add new test cases to the testing code.

      The file was generated using a program called "FIMO" which is a popular tool in bioinformatics. Therefore we should do our best to support visualizing the output in IGB.

        Attachments

        1. 914_SmokeTestData.png
          320 kB
        2. 916_withMerge_SmokeTestData.png
          440 kB
        3. 916_withMerge_withFimo.png
          213 kB
        4. fimo_NO-ID.gff
          127 kB
        5. fimo.gff
          156 kB
        6. fimo file-version 1.JPG
          383 kB
        7. fimo-version1-Windows.JPG
          221 kB
        8. Screen Shot 2020-05-13 at 5.54.07 PM.png
          185 kB
        9. Screen Shot 2020-05-13 at 6.49.45 PM.png
          185 kB

          Activity

          ann.loraine Ann Loraine created issue -
          ann.loraine Ann Loraine made changes -
          Field Original Value New Value
          Epic Link IGBF-1765 [ 17855 ]
          ann.loraine Ann Loraine made changes -
          Rank Ranked higher
          ann.loraine Ann Loraine made changes -
          Rank Ranked higher
          shamika Shamika Gajanan Kulkarni (Inactive) made changes -
          Assignee Shamika Gajanan Kulkarni [ shamika ]
          nfreese Nowlan Freese added a comment -

          The file does not pass this gff3 validator: http://genometools.org/cgi-bin/gff3validator.cgi

          This file seems odd to me. Would need to do a comparison with the gff3 specifications to double-check the file formatting and compare that to the gff3 expectations of IGB.

          shamika Shamika Gajanan Kulkarni (Inactive) made changes -
          Status To-Do [ 10305 ] In Progress [ 3 ]
          shamika Shamika Gajanan Kulkarni (Inactive) added a comment - - edited

          Prutha Kulkarni and I worked on this today. Our findings are below.
          1) The HomoSapien gff3 file also failed the gff3 validator mentioned in the comment above. The error message was "Validation unsuccessful!
          GenomeTools error: the child feature with type 'pseudogenic_transcript' on line 8 in file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/GFF3_HS_only.gff3" is not part-of parent feature with type 'gene' given on line 3 (according to type checker 'OBO file /home/satta/genometools_for_web/gtdata/obo_files/so.obo')"
          The file we uploaded is available at "http://igbquickload.org/smokeTestingQuickload/H_sapiens_Dec_2013/GFF3/"
          So we are not sure whether we can rely on the validator's results.

          2) Dr. Nowlan Freese gave me a reference for the .gff3 file format:
          https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
          It says that the ID of each feature must be unique within the file:
          "ID
          Indicates the ID of the feature. The ID attribute is required for features that have children (e.g. gene and mRNAs), or for those that span multiple lines, but are optional for other features. IDs for each feature must be unique within the scope of the GFF file. In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID must collectively represent a single feature."
          The .gff file provided does not contain unique IDs. It also has no Parent attribute. We are not sure whether that is acceptable, since the other .gff3 files we looked at do have a Parent attribute.

          3) We tried running the GFFParser test case on the provided fimo.gff file and got this error:
          "java.io.IOException: Using GFF symloader but GFF3 file detected."
          So we tried the GFF3Parser test case instead. That test was being skipped with an @Ignore annotation, so we removed the annotation to check whether the file is parsed properly.
          We got a NullPointerException in the 'testParseCanonical' function, on the line "testResults(gff3.getGenome())" in GFF3ParserTest. While debugging, we found that the buildindex() call in init(), which getGenome() invokes, was causing the NPE.
          We then debugged buildindex() in SymLoader.java and found that the line "sortedResult = externalSortService.merge(input, uri.toString(), comparatorMetadata, conf);" is where it fails: sortedResult never receives a value. merge is a method declared in the externalSortService interface.
          We are not sure how to find out what the method does, since the interface only declares it and has no method body.
          Please suggest next steps so that we can continue working on this tomorrow.

          cc: Prof. [~aloraine]
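
          The non-unique IDs noted in finding 2 can be checked mechanically. The following is a minimal Java sketch and is not part of IGB: the class name Gff3DuplicateIdCheck, its methods, and the sample feature lines are all made up for illustration. It counts how many feature lines use each ID value in the ninth column. The spec allows one ID on several lines for a discontinuous feature, so a repeated ID is only a hint that needs a manual look.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not IGB code.
class Gff3DuplicateIdCheck {

    /** Returns the value of the ID attribute in a GFF3 attributes column, or null if absent. */
    public static String extractId(String attributes) {
        for (String pair : attributes.split(";")) {
            String trimmed = pair.trim();
            if (trimmed.startsWith("ID=")) {
                return trimmed.substring(3);
            }
        }
        return null;
    }

    /** Maps each ID used on more than one feature line to the number of lines that use it. */
    public static Map<String, Integer> duplicateIds(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip directives, comments and blank lines
            }
            String[] cols = line.split("\t");
            if (cols.length < 9) {
                continue; // not a complete nine-column feature line
            }
            String id = extractId(cols[8]);
            if (id != null) {
                counts.merge(id, 1, Integer::sum);
            }
        }
        counts.values().removeIf(n -> n < 2); // keep only IDs that repeat
        return counts;
    }

    public static void main(String[] args) {
        // Made-up lines in the general shape of FIMO GFF output.
        List<String> lines = List.of(
            "##gff-version 3",
            "chr1\tfimo\tnucleotide_motif\t100\t118\t45.2\t+\t.\tName=motifA_chr1+;Alias=MEME-1;ID=motifA-1-chr1;",
            "chr1\tfimo\tnucleotide_motif\t300\t318\t44.0\t+\t.\tName=motifA_chr1+;Alias=MEME-1;ID=motifA-1-chr1;",
            "chr1\tfimo\tnucleotide_motif\t500\t518\t41.7\t-\t.\tName=motifB_chr1-;Alias=MEME-2;ID=motifB-1-chr1;");
        System.out.println(duplicateIds(lines));
    }
}
```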

          nfreese Nowlan Freese added a comment -

          Here's the FIMO page where they include an example fimo.gff output: http://meme-suite.org/doc/fimo.html

          nfreese Nowlan Freese added a comment -

          I'm wondering if that gff3 validator I included in a previous comment is correct. May be worth checking another validator: http://www.raetschlab.org/suppl/gff-tools

          prutha Prutha Kulkarni (Inactive) added a comment - - edited

          Prof. [~aloraine], could you please contact the FIMO team and ask whether their output format follows the specification?

          We will also check the other validator mentioned by Dr. Nowlan Freese today.

          prutha Prutha Kulkarni (Inactive) added a comment - - edited

          Findings:

          • The test case for the simplest version of the GFF parser is written around the specific file in the "Resources/Data" directory and is not generic.
          • For example: assertEquals(99, sym.getStart());
          • Here the start position is hardcoded, although it differs from file to file. The test case contains many similar assert statements.
          • We first tried changing the code to make it work for this particular file, but realized that the entire test is specific to the file in the "Resources/Data" directory.
          • So instead we loaded the file directly in IGB, and we are attaching screenshots of the "selection info" and the changed file to the ticket.
            Shamika Gajanan Kulkarni and I are attaching screenshots for both platforms, macOS and Windows, just to be sure.
          • One thing we wanted to mention: the "selection info" shows "=" in each tag value, which is odd. We don't know whether some value is missing for the tag, so that "=" appears where something else was expected. Ideally, parsing should extract the value corresponding to the key, not the value starting from "=".

          Prof. [~aloraine], could you please let us know how to proceed further?

          prutha Prutha Kulkarni (Inactive) made changes -
          shamika Shamika Gajanan Kulkarni (Inactive) made changes -
          Attachment fimo file-version 1.JPG [ 14726 ]
          Attachment fimo-version1-Windows.JPG [ 14727 ]
          ann.loraine Ann Loraine added a comment -

          Can you look into the code to find out why it is showing the = character in the Selection Info values column?

          Somehow the code is able to figure out what the property names are, e.g., "Name", "Qvalue", etc. So it must be using the "=" character somehow.

          prutha Prutha Kulkarni (Inactive) added a comment -

          The code uses a regex to parse the file. We changed it so that the "=" is removed, and the file is now parsed properly.
          Attaching a screenshot for future reference.
          cc: [~aloraine] , Shamika Gajanan Kulkarni

          prutha Prutha Kulkarni (Inactive) made changes -
          nfreese Nowlan Freese added a comment - - edited

          This is a good writeup someone posted on Biostars detailing all of the different versions of GFF and GTF: https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/gxf.md

          shamika Shamika Gajanan Kulkarni (Inactive) added a comment -

          Please find the regex code changes in the link below:

          https://bitbucket.org/skulka2710/shamika_igb/branch/IGBF-2381#diff

          ann.loraine Ann Loraine made changes -
          Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
          ann.loraine Ann Loraine made changes -
          Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
          ann.loraine Ann Loraine made changes -
          Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
          ann.loraine Ann Loraine added a comment -

          Philip Badzuh - Can you please submit a PR for Shamika? You can do it by forking the IGB repo, cloning it to your local, adding her fork as a remote, fetching her branch, pushing her branch to your fork, and then submitting a PR from there to the team repo master branch.

          Her authorship of the commit will still be visible, by the way.

          ann.loraine Ann Loraine made changes -
          Assignee Shamika Gajanan Kulkarni [ shamika ] Philip Badzuh [ pbadzuh ]
          ann.loraine Ann Loraine made changes -
          Sprint Spring 8 : 11 May to 25 May [ 94 ] Spring 8 : 11 May to 25 May, Spring 9 : 25 May to 8 Jun [ 94, 95 ]
          ann.loraine Ann Loraine made changes -
          Rank Ranked higher
          pbadzuh Philip Badzuh (Inactive) added a comment -

          Please see the PR below: https://bitbucket.org/lorainelab/integrated-genome-browser/pull-requests/794/igbf-2381-fix-regex-to-not-display-in/diff

          pbadzuh Philip Badzuh (Inactive) made changes -
          Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
          pbadzuh Philip Badzuh (Inactive) made changes -
          Assignee Philip Badzuh [ pbadzuh ]
          ann.loraine Ann Loraine made changes -
          Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
          ann.loraine Ann Loraine made changes -
          Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
          nfreese Nowlan Freese made changes -
          Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
          nfreese Nowlan Freese made changes -
          Assignee Nowlan Freese [ nfreese ]
          nfreese Nowlan Freese made changes -
          Status Post-merge Testing In Progress [ 10003 ] Merged Needs Testing [ 10002 ]
          nfreese Nowlan Freese made changes -
          Assignee Nowlan Freese [ nfreese ]
          nfreese Nowlan Freese made changes -
          Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
          nfreese Nowlan Freese made changes -
          Assignee Nowlan Freese [ nfreese ]
          nfreese Nowlan Freese added a comment - - edited

          [~aloraine]

          My recommendation would be to revert these changes.

          The changes in the PR above do not appear to fix the issue. In testing, 9.1.6 loads the attached fimo.gff file exactly the same way as 9.1.4. The main issue is that only a small fraction of the data in the file appear in IGB. However, the data that do appear are formatted correctly.

          To force IGB to show all of the data, the ##gff-version 3 first line comment can be removed from the fimo.gff file. This appears to cause IGB to parse the data not as a GFF3 file, but as a GTF file.

          Note that GTF files format the attribute column as: attribute_name "attribute_value"; attribute_name "attribute_value";

          whereas GFF3 files format the attribute column as: ID=cds00004;Parent=mRNA00001,mRNA00002;Name=edenprotein.4

          The attribute values in the fimo.gff file are formatted according to the GFF3 spec as: Name=NNNNKCMTYGRDWHYYGTG_1-;Alias=MEME-2;

          So when ##gff-version 3 is removed and the fimo.gff file is loaded through IGB as a GTF, the attribute column values appear with the equals signs. Stripping the equals signs during parsing makes the fimo.gff file appear to load correctly (through the GTF parser), but it also affects the parsing of actual GTF files.
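
          The difference between the two attribute styles can be sketched in Java as below. This is an illustration only and is not IGB's parser: the class AttributeStyles, both methods, and the lenient GTF-style regex are assumptions chosen to reproduce the symptom described above, where GFF3 key=value pairs read through a GTF-style pattern keep the leading equals sign in the value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not IGB's parsing code.
class AttributeStyles {

    // Assumed lenient GTF-style pattern: a word as the tag, then an optionally quoted value.
    private static final Pattern GTF_PAIR = Pattern.compile("^(\\w+)\\s*\"?([^\"]*)\"?$");

    /** GFF3 style: semicolon-separated tag=value pairs, split on the first '='. */
    public static Map<String, String> parseGff3(String column) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (String pair : column.split(";")) {
            String p = pair.trim();
            int eq = p.indexOf('=');
            if (eq > 0) {
                attrs.put(p.substring(0, eq), p.substring(eq + 1));
            }
        }
        return attrs;
    }

    /** GTF style: semicolon-separated pairs of the form tag "value". */
    public static Map<String, String> parseGtf(String column) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (String pair : column.split(";")) {
            Matcher m = GTF_PAIR.matcher(pair.trim());
            if (m.matches()) {
                attrs.put(m.group(1), m.group(2));
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        String fimo = "Name=NNNNKCMTYGRDWHYYGTG_1-;Alias=MEME-2;";
        String gtf = "gene_id \"G1\"; transcript_id \"T1\";";
        System.out.println(parseGff3(fimo)); // values without the '='
        System.out.println(parseGtf(fimo));  // tags are found, but each value starts with '='
        System.out.println(parseGtf(gtf));   // quoted GTF values are unwrapped
    }
}
```

          Removing the "=" inside the GTF-style branch would hide the symptom for GFF3 input, but that branch is shared with every real GTF file, which is the risk described above.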

          In 9.1.6 the GTF_HomoSapienChr1Only.gtf file loads all of the data in a single row and the attribute column does not contain the correct values. In 9.1.4, the data appear correct visually, though the attribute column also does not appear to be parsed correctly.

          After reading through this description of GTF/GFF file formats I found this sentence describing GFF3 ID: "The ID attribute is required for features that have children (e.g. gene and mRNAs), or for those that span multiple lines, but are optional for other features." The ID field is optional and the NAME field does not have to be unique: "Unlike IDs, there is no requirement that the Name be unique within the file."

          I think it may be worth trying to understand further why the fimo.gff file is not displaying correctly in IGB. Edit: The fimo.gff file includes a non-unique ID attribute, which causes incorrect parsing of the file. Removal of the ID attribute from the fimo file appears to fix the issue.

          nfreese Nowlan Freese made changes -
          Status Post-merge Testing In Progress [ 10003 ] Merged Needs Testing [ 10002 ]
          ann.loraine Ann Loraine added a comment -

          I agree. We should revert. Should I go ahead and do that now?

          nfreese Nowlan Freese added a comment -

          Yes, I don't see any reason to leave the changes in master for the time being. I think this issue will need a much deeper dive into the current GFF/GTF parsing in IGB.

          nfreese Nowlan Freese made changes -
          Attachment 914_SmokeTestData.png [ 14743 ]
          Attachment 916_withMerge_SmokeTestData.png [ 14744 ]
          Attachment 916_withMerge_withFimo.png [ 14745 ]
          nfreese Nowlan Freese made changes -
          Assignee Nowlan Freese [ nfreese ] Ann Loraine [ aloraine ]
          ann.loraine Ann Loraine added a comment -

          Reverted. Moving to Closed.

          ann.loraine Ann Loraine made changes -
          Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
          ann.loraine Ann Loraine made changes -
          Resolution Done [ 10000 ]
          Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]
          ann.loraine Ann Loraine made changes -
          Assignee Ann Loraine [ aloraine ]
          nfreese Nowlan Freese added a comment -

          Dr. Loraine pointed out that the fimo.gff file does have an ID attribute in the 9th column. The ID must be unique: "IDs for each feature must be unique within the scope of the GFF file." Removing the ID attribute from each row appears to fix the file. All of the data are visible in IGB and the data appear to be parsed correctly (attached).

          This seems to be an issue with the output of FIMO, and not an issue with how IGB is parsing the GFF3.
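
          One way to produce a file like fimo_NO-ID.gff is sketched below in Java. It is a hypothetical helper: the class StripGff3Ids and the method stripId are not part of IGB or FIMO, and the attached file may have been made differently. It drops the ID attribute from the ninth column of each feature line and leaves directives, comments, and malformed lines untouched.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of IGB or FIMO.
class StripGff3Ids {

    /** Returns the line with any ID=... attribute removed from the ninth column. */
    public static String stripId(String line) {
        if (line.isEmpty() || line.startsWith("#")) {
            return line; // directives, comments and blank lines pass through
        }
        String[] cols = line.split("\t", -1);
        if (cols.length != 9) {
            return line; // leave anything that is not a nine-column feature line alone
        }
        List<String> kept = new ArrayList<>();
        for (String pair : cols[8].split(";")) {
            String p = pair.trim();
            if (!p.isEmpty() && !p.startsWith("ID=")) {
                kept.add(p);
            }
        }
        // GFF3 writes "." for an empty column.
        cols[8] = kept.isEmpty() ? "." : String.join(";", kept);
        return String.join("\t", cols);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java StripGff3Ids.java <input.gff> <output.gff>");
            return;
        }
        List<String> out = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            out.add(stripId(line));
        }
        Files.write(Paths.get(args[1]), out);
    }
}
```

          Run with an input path and an output path to rewrite a whole file. Because ID is optional for features without children, the remaining lines are still valid GFF3 feature lines.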

          nfreese Nowlan Freese made changes -
          Attachment fimo_NO-ID.gff [ 14746 ]
          nfreese Nowlan Freese made changes -
          Link This issue relates to IGBF-2406 [ IGBF-2406 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              ann.loraine Ann Loraine
            • Votes:
              0
              Watchers:
              5

              Dates

              • Created:
                Updated:
                Resolved: