Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:

      Description

      Situation: The GFF3 file format may include a Sequence Section in FASTA format at the end of the file, but IGB is not currently able to parse a GFF3 file when this section is present. See IGBF-3924 for a report on the resulting error when a GFF3 file with this section is loaded into IGB.

      Here's some more info from the GFF3 documentation (https://gmod.org/wiki/GFF3):

      GFF3 Sequence Section
      GFF3 files can also include sequence in FASTA format at the end of the file. The FASTA sequences are preceded by a ##FASTA line. This sequence section is optional. If present, the sequence section can define sequence for any landmark used in column 1 (the frame of reference). For example:

      ##gff-version 3
      ctg123 . exon            1300  1500  .  +  .  ID=exon00001
      ctg123 . exon            1050  1500  .  +  .  ID=exon00002
      ctg123 . exon            3000  3902  .  +  .  ID=exon00003
      ctg123 . exon            5000  5500  .  +  .  ID=exon00004
      ctg123 . exon            7000  9000  .  +  .  ID=exon00005
      ##FASTA
      >ctg123
      cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
      tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
      tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
      aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
      aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
      cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
      ...
      

      When the GFF3 file is processed, the IDs on the header lines of FASTA entries are matched with the IDs used in column 1 of the annotation section of the file.

      You don’t have to store the FASTA in the GFF file. You can also store your sequences in a separate file containing only FASTA entries.
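      The ID matching described above can be sketched roughly as follows. This is a minimal Python illustration, not IGB's parser: the function names `split_gff3` and `match_landmarks` are hypothetical, and it assumes the sequence ID is the first whitespace-delimited token on each FASTA header line and on each feature line.

```python
from typing import Dict, List, Set, Tuple


def split_gff3(text: str) -> Tuple[List[str], Dict[str, str]]:
    """Split GFF3 text into feature lines and a {sequence_id: sequence} dict."""
    feature_lines: List[str] = []
    chunks: Dict[str, List[str]] = {}
    in_fasta = False
    current_id = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not in_fasta:
            if line == "##FASTA":
                in_fasta = True  # everything after this line is sequence
            elif not line.startswith("#"):
                feature_lines.append(line)  # skip directives and comments
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Assumption: the ID is the first token after '>'.
            current_id = header[0] if header else None
            if current_id is not None:
                chunks[current_id] = []
        elif current_id is not None:
            chunks[current_id].append(line)
    sequences = {seq_id: "".join(parts) for seq_id, parts in chunks.items()}
    return feature_lines, sequences


def match_landmarks(text: str) -> Tuple[Set[str], Set[str]]:
    """Return (landmarks with sequence, landmarks without sequence)."""
    feature_lines, sequences = split_gff3(text)
    landmarks = {line.split()[0] for line in feature_lines}  # column 1
    return landmarks & set(sequences), landmarks - set(sequences)
```

      Landmarks that end up in the second set simply have no embedded sequence, which is allowed because the Sequence Section is optional.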

      Task: Upgrade the GFF parser logic to be able to handle GFF3 files with a Sequence Section.
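
      One way a parser could tolerate the Sequence Section is to stop reading features once it reaches the ##FASTA line. The Python sketch below only illustrates that idea; it is not the change made in IGB (whose parser is Java code not shown in this issue). The `Feature` type and `iter_features` function are hypothetical names, and the sketch assumes tab-separated feature lines with nine columns.

```python
from typing import Iterable, Iterator, NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: str


def iter_features(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield features from GFF3 lines, stopping at the ##FASTA directive."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "##FASTA":
            return  # the rest of the file is sequence, not annotation
        if not line.strip() or line.startswith("#"):
            continue  # blank line, directive, or comment
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(
                f"expected 9 tab-separated columns, got {len(cols)}: {line!r}"
            )
        yield Feature(
            cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]), *cols[5:]
        )
```

      Without the early return, the `>ctg123` header and the sequence lines would reach the column check and raise an error, which is the general kind of failure a feature-only parser can hit on these files.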


      Example files which are not being parsed correctly:

      • prodigal_Lambda_phage_sequences.gff
      • FragGeneScan_Lambda_phage_sequences.gff

      Link to those files (Metacerberus output) on Google Drive: https://drive.google.com/drive/folders/14noPsmKYMxX9jgHYQhkjqaTGzT8z8bSK
      Link to files on Loraine Lab Google Drive: https://drive.google.com/drive/folders/1MLsVItXNcskfiCAg62GFmxWc1-NR40Tx?usp=drive_link

            Activity

            pkulzer Paige Kulzer created issue -
            pkulzer Paige Kulzer made changes -
            Field Original Value New Value
            Epic Link IGBF-1765 [ 17855 ]
            pkulzer Paige Kulzer made changes -
            Link This issue relates to IGBF-3924 [ IGBF-3924 ]
            nfreese Nowlan Freese made changes -
            Sprint Fall 4 [ 205 ]
            pkulzer Paige Kulzer made changes -
            Labels intermediate
            pkulzer Paige Kulzer made changes -
            Labels intermediate Intermediate
            pkulzer Paige Kulzer made changes -
            Link This issue relates to IGBF-3955 [ IGBF-3955 ]
            nfreese Nowlan Freese made changes -
            Assignee Nowlan Freese [ nfreese ]
            ann.loraine Ann Loraine made changes -
            Sprint Fall 4 [ 205 ]
            nfreese Nowlan Freese made changes -
            Sprint Fall 6 [ 207 ]
            pkulzer Paige Kulzer made changes -
            Link This issue relates to IGBF-3884 [ IGBF-3884 ]
            pkulzer Paige Kulzer made changes -
            Link This issue relates to IGBF-3884 [ IGBF-3884 ]
            pkulzer Paige Kulzer made changes -
            Link This issue blocks IGBF-3884 [ IGBF-3884 ]
            ann.loraine Ann Loraine made changes -
            Sprint Fall 6 [ 207 ] Fall 6, Fall 7 [ 207, 208 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            pkulzer Paige Kulzer made changes -
            Summary Upgrade GFF parser Investigate: GFF parser error
            pkulzer Paige Kulzer made changes -
            Assignee Paige Kulzer [ pkulzer ]
            pkulzer Paige Kulzer made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            pkulzer Paige Kulzer made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            pkulzer Paige Kulzer made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            pkulzer Paige Kulzer made changes -
            Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
            pkulzer Paige Kulzer made changes -
            Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
            pkulzer Paige Kulzer made changes -
            Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
            pkulzer Paige Kulzer made changes -
            Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
            pkulzer Paige Kulzer made changes -
            Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
            pkulzer Paige Kulzer made changes -
            Resolution Done [ 10000 ]
            Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]
            pkulzer Paige Kulzer made changes -
            Link This issue relates to IGBF-4002 [ IGBF-4002 ]
            nfreese Nowlan Freese made changes -
            Epic Link IGBF-1765 [ 17855 ] IGBF-4028 [ 23324 ]

              People

              • Assignee:
                pkulzer Paige Kulzer
                Reporter:
                pkulzer Paige Kulzer
              • Votes:
                0
                Watchers:
                3
