  1. IGB
  2. IGBF-2381

Investigate problem with FIMO GFF file parsing

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Story Points:
      0.5
    • Sprint:
      Spring 8 : 11 May to 25 May, Spring 9 : 25 May to 8 Jun

      Description

      Dr. Beth Krizek (krizek@biol.sc.edu) has let us know about a problem with parsing the attached file.

      It is GFF format, but when we view it in IGB, only a small part of the data are being shown.

      When you click on one of the items, the "Id" value that IGB shows does not exactly match the "ID" value in the GFF attributes column, the semicolon-separated list of key-value pairs in the final (ninth) field.

      Please investigate why this is happening. Is there a problem with our GFF parsing code, or is it an issue with the formatting of this file?

      In addition, take a look at the unit testing code for this format. If the data in this file are OK, we should use some lines of data in the file to add new test cases to the testing code.

      The file was generated using a program called "FIMO" which is a popular tool in bioinformatics. Therefore we should do our best to support visualizing the output in IGB.

        Attachments

        1. 914_SmokeTestData.png
          320 kB
        2. 916_withMerge_SmokeTestData.png
          440 kB
        3. 916_withMerge_withFimo.png
          213 kB
        4. fimo_NO-ID.gff
          127 kB
        5. fimo.gff
          156 kB
        6. fimo file-version 1.JPG
          383 kB
        7. fimo-version1-Windows.JPG
          221 kB
        8. Screen Shot 2020-05-13 at 5.54.07 PM.png
          185 kB
        9. Screen Shot 2020-05-13 at 6.49.45 PM.png
          185 kB

          Activity

          nfreese Nowlan Freese added a comment - - edited

          [~aloraine]

          My recommendation would be to revert these changes.

          The changes in the PR above do not appear to fix the issue. In 9.1.6, the attached fimo.gff file loads exactly as it does in 9.1.4. The main issue is that only a small fraction of the features in the file appear in IGB. However, the data that do appear are formatted correctly.

          To force IGB to show all of the data, the ##gff-version 3 directive on the first line can be removed from the fimo.gff file. This appears to cause IGB to parse the data as a GTF file rather than as a GFF3 file.

          Note that the GTF format writes the attribute column as: attribute_name "attribute_value"; attribute_name "attribute_value";

          whereas the GFF3 format writes the attribute column as: ID=cds00004;Parent=mRNA00001,mRNA00002;Name=edenprotein.4

          The attribute values in the fimo.gff file are formatted according to the GFF3 spec as: Name=NNNNKCMTYGRDWHYYGTG_1-;Alias=MEME-2;
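To make the difference between the two attribute styles concrete, here is a minimal sketch in Python. It is not IGB's parsing code; the function names are hypothetical and the GTF sample string in the usage lines is made up for illustration. It only shows that text written in the GFF3 style yields no key-value pairs when read with a GTF-style pattern.

```python
import re

def parse_gff3_attributes(column: str) -> dict:
    """Parse a GFF3-style attribute column: key=value pairs separated by ';'.

    Values are kept as raw strings, so a multi-valued attribute such as
    Parent=mRNA00001,mRNA00002 is returned unsplit.
    """
    attrs = {}
    for field in column.strip().split(";"):
        field = field.strip()
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        attrs[key] = value
    return attrs

# GTF-style pairs look like: attribute_name "attribute_value";
GTF_PAIR = re.compile(r'(\S+)\s+"([^"]*)"')

def parse_gtf_attributes(column: str) -> dict:
    """Parse a GTF-style attribute column: name "value" pairs separated by ';'."""
    return {key: value for key, value in GTF_PAIR.findall(column)}

if __name__ == "__main__":
    fimo_style = "Name=NNNNKCMTYGRDWHYYGTG_1-;Alias=MEME-2;"
    print(parse_gff3_attributes(fimo_style))
    # The GTF-style pattern finds no quoted values in GFF3-style text.
    print(parse_gtf_attributes(fimo_style))
    # Hypothetical GTF-style attribute text, for comparison only.
    print(parse_gtf_attributes('gene_id "g1"; transcript_id "t1";'))
```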

          So when ##gff-version 3 is removed and the fimo.gff file is loaded as a GTF through IGB, the attribute column values appear with the equals signs still in them. Stripping the equals signs during parsing makes the fimo.gff file appear to load correctly (through the GTF parser), but it also changes how genuine GTF files are parsed.

          In 9.1.6 the GTF_HomoSapienChr1Only.gtf file loads all of the data in a single row and the attribute column does not contain the correct values. In 9.1.4, the data appear correct visually, though the attribute column also does not appear to be parsed correctly.

          After reading through this description of GTF/GFF file formats I found this sentence describing the GFF3 ID attribute: "The ID attribute is required for features that have children (e.g. gene and mRNAs), or for those that span multiple lines, but are optional for other features." So the ID attribute is optional, and the Name attribute does not have to be unique: "Unlike IDs, there is no requirement that the Name be unique within the file."

          I think it may be worth trying to understand further why the fimo.gff file is not displaying correctly in IGB. Edit: The fimo.gff file includes a non-unique ID attribute, which causes incorrect parsing of the file. Removal of the ID attribute from the fimo file appears to fix the issue.
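One way to check a file for the non-unique ID problem described in the edit above is to count ID values across feature lines. The following is a minimal Python sketch, not part of IGB; the function name and the sample lines are hypothetical and are not taken from the attached fimo.gff. Note that GFF3 also allows one feature to span several lines that share an ID, so a repeated ID is a prompt for inspection rather than proof of an error.

```python
from collections import Counter

def duplicate_ids(lines):
    """Return {ID: count} for ID values that occur on more than one feature line."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete nine-column feature line
        for field in columns[8].split(";"):
            key, _, value = field.strip().partition("=")
            if key == "ID":
                counts[value] += 1
    return {value: n for value, n in counts.items() if n > 1}

if __name__ == "__main__":
    # Hypothetical feature lines, made up for illustration only.
    sample = [
        "##gff-version 3\n",
        "chr1\tfimo\tnucleotide_motif\t10\t28\t50.1\t+\t.\tName=m1;ID=motif-1;\n",
        "chr1\tfimo\tnucleotide_motif\t90\t108\t48.3\t-\t.\tName=m1;ID=motif-1;\n",
        "chr2\tfimo\tnucleotide_motif\t5\t23\t47.0\t+\t.\tName=m2;ID=motif-2;\n",
    ]
    print(duplicate_ids(sample))
```

To run it on a real file, pass an open file handle, for example `duplicate_ids(open("fimo.gff"))`.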

          ann.loraine Ann Loraine added a comment -

          I agree. We should revert. Should I go ahead and do that now?

          nfreese Nowlan Freese added a comment -

          Yes, I don't see any reason to leave the changes in master for the time being. I think this issue will need a much deeper dive into the current GFF/GTF parsing in IGB.

          ann.loraine Ann Loraine added a comment -

          Reverted. Moving to Closed.

          nfreese Nowlan Freese added a comment -

          Dr. Loraine pointed out that the fimo.gff file does have an ID attribute in the 9th column. The ID must be unique: "IDs for each feature must be unique within the scope of the GFF file." Removing the ID attribute from each row appears to fix the problem. All of the data are visible in IGB and the data appear to be parsed correctly (attached).

          This seems to be an issue with the output of FIMO, and not an issue with how IGB is parsing the GFF3.
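The workaround described above, removing the ID attribute from each row, can be sketched in a few lines of Python. This is an illustration only; the ticket does not say how the attached fimo_NO-ID.gff was actually produced. The function name and the output file name in the usage note are hypothetical.

```python
def strip_id_attribute(line: str) -> str:
    """Return a GFF3 line with any ID=... entry removed from the ninth column.

    Directive, comment, blank, and incomplete lines are returned unchanged.
    """
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return line
    had_trailing_semicolon = columns[8].rstrip().endswith(";")
    kept = [
        field for field in columns[8].split(";")
        if field.strip() and field.strip().partition("=")[0] != "ID"
    ]
    if kept:
        columns[8] = ";".join(kept) + (";" if had_trailing_semicolon else "")
    else:
        columns[8] = "."  # GFF3 placeholder for an empty column
    return "\t".join(columns) + newline
```

A file could then be rewritten line by line, for example `out.writelines(strip_id_attribute(l) for l in open("fimo.gff"))`, where `out` is a handle opened for writing on a new file such as a hypothetical `fimo_without_ids.gff`.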


            People

            • Assignee:
              Unassigned
              Reporter:
              ann.loraine Ann Loraine