Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Done
- Affects Version/s: None
- Fix Version/s: None
- Labels:
- Story Points: 1.5
- Epic Link:
- Sprint: Fall 6, Fall 7
Description
Situation: The GFF3 file format may include a Sequence Section in FASTA format at the end of the file, but IGB is not currently able to parse a GFF3 file when this section is present. See IGBF-3924 for a report on the resulting error when a GFF3 file with this section is loaded into IGB.
Here's some more info from the GFF3 documentation - https://gmod.org/wiki/GFF3
GFF3 Sequence Section
GFF3 files can also include sequence in FASTA format at the end of the file. The FASTA sequences are preceded by a ##FASTA line. This sequence section is optional. If present, the sequence section can define sequence for any landmark used in column 1 (the frame of reference). For example:

##gff-version 3
ctg123 . exon 1300 1500 . + . ID=exon00001
ctg123 . exon 1050 1500 . + . ID=exon00002
ctg123 . exon 3000 3902 . + . ID=exon00003
ctg123 . exon 5000 5500 . + . ID=exon00004
ctg123 . exon 7000 9000 . + . ID=exon00005
##FASTA
>ctg123
cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
...

When the GFF3 file is processed, the IDs on the header line of FASTA entries are matched with IDs used in column 1 in the annotation section of the file.
You don’t have to store the FASTA in the GFF file. You can also store your sequences in a separate file containing only FASTA entries.
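To make the quoted rule concrete, here is a minimal Python sketch of splitting a GFF3 document at the ##FASTA line and matching FASTA header IDs against column 1. It is an illustration only, not IGB's parser: the names `split_gff3` and `landmarks` are hypothetical, and the sample data is a shortened, tab-separated version of the example above.

```python
def split_gff3(text):
    """Split GFF3 text into annotation lines and a {seq_id: sequence} dict."""
    annotation_lines = []
    chunks = {}          # seq_id -> list of sequence lines
    current_id = None
    in_fasta = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not in_fasta:
            if line == "##FASTA":
                in_fasta = True  # everything after this line is FASTA
            else:
                annotation_lines.append(line)
        elif line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            chunks[current_id] = []
        elif current_id is not None:
            chunks[current_id].append(line)
    sequences = {seq_id: "".join(parts) for seq_id, parts in chunks.items()}
    return annotation_lines, sequences


def landmarks(annotation_lines):
    """Collect the column 1 values of all feature (non-comment) lines."""
    return {ln.split("\t")[0] for ln in annotation_lines if not ln.startswith("#")}


EXAMPLE = "\n".join([
    "##gff-version 3",
    "ctg123\t.\texon\t1300\t1500\t.\t+\t.\tID=exon00001",
    "ctg123\t.\texon\t1050\t1500\t.\t+\t.\tID=exon00002",
    "##FASTA",
    ">ctg123",
    "cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg",
    "tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta",
])

annotation, fasta = split_gff3(EXAMPLE)
# Landmarks from column 1 that also have a sequence in the FASTA section.
matched = landmarks(annotation) & set(fasta)
print(sorted(matched))
```

Keeping the two sections in separate structures means the annotation lines can still go through the existing feature-parsing path, while the sequences are looked up by ID only when needed.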
Task: Upgrade the GFF parser logic to be able to handle GFF3 files with a Sequence Section.
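One possible shape for such a change, sketched in Python rather than in IGB's own code: a line-oriented feature reader that stops as soon as it reaches the ##FASTA line, so sequence lines are never handed to the nine-column feature parser. The `Feature` type and `iter_features` function are hypothetical names used only for this illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    strand: str
    attributes: str


def iter_features(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield features from GFF3 lines, stopping at the sequence section."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "##FASTA":
            return  # sequence section begins; no feature lines follow
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and other directives
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(
                f"expected 9 tab-separated columns, got {len(cols)}: {line!r}"
            )
        yield Feature(cols[0], cols[1], cols[2],
                      int(cols[3]), int(cols[4]), cols[6], cols[8])


SAMPLE = [
    "##gff-version 3",
    "ctg123\t.\texon\t1300\t1500\t.\t+\t.\tID=exon00001",
    "ctg123\t.\texon\t1050\t1500\t.\t+\t.\tID=exon00002",
    "##FASTA",
    ">ctg123",
    "cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg",
]

features = list(iter_features(SAMPLE))
for feature in features:
    print(feature.seqid, feature.feature_type, feature.start, feature.end)
```

Without the ##FASTA check, the `>ctg123` line would reach the column split and raise the `ValueError`. That illustrates how a parser with no such guard can fail on these files, although the actual error in IGB is the one reported in IGBF-3924.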
Example files which are not being parsed correctly:
- prodigal_Lambda_phage_sequences.gff
- FragGeneScan_Lambda_phage_sequences.gff
Link to those files on Google Drive: https://drive.google.com/drive/folders/14noPsmKYMxX9jgHYQhkjqaTGzT8z8bSK
Link to files on Loraine Lab Google Drive: https://drive.google.com/drive/folders/1MLsVItXNcskfiCAg62GFmxWc1-NR40Tx?usp=drive_link
Activity
| Field | Original Value | New Value |
|---|---|---|
| Epic Link | IGBF-1765 [ 17855 ] |
| Sprint | Fall 4 [ 205 ] |
| Description | Situation, quoted GFF3 Sequence Section documentation (https://gmod.org/wiki/GFF3), and Task only | Added a list of files not being parsed correctly (prodigal_Lambda_phage_sequences.gff, FragGeneScan_Lambda_phage_sequences.gff) and a link to the Metacerberus output on Google Drive: https://drive.google.com/drive/folders/14noPsmKYMxX9jgHYQhkjqaTGzT8z8bSK |
| Description | File list and Google Drive link placed before the quoted GFF3 documentation | File list and Google Drive link moved below the Task, after a horizontal rule, under "Example files which are not being parsed correctly:" |
| Labels | intermediate |
| Labels | intermediate | Intermediate |
| Link | This issue relates to IGBF-3955 [ IGBF-3955 ] |
| Description | GFF3 documentation link written as "(https://gmod.org/wiki/GFF3):" | GFF3 documentation link written as "- https://gmod.org/wiki/GFF3"; text otherwise unchanged |
| Assignee | Nowlan Freese [ nfreese ] |
| Sprint | Fall 4 [ 205 ] |
| Sprint | Fall 6 [ 207 ] |
| Sprint | Fall 6 [ 207 ] | Fall 6, Fall 7 [ 207, 208 ] |
| Rank | Ranked higher |
| Summary | Upgrade GFF parser | Investigate: GFF parser error |
| Assignee | Paige Kulzer [ pkulzer ] |
| Status | To-Do [ 10305 ] | In Progress [ 3 ] |
| Status | In Progress [ 3 ] | Needs 1st Level Review [ 10005 ] |
| Status | Needs 1st Level Review [ 10005 ] | First Level Review in Progress [ 10301 ] |
| Status | First Level Review in Progress [ 10301 ] | Ready for Pull Request [ 10304 ] |
| Status | Ready for Pull Request [ 10304 ] | Pull Request Submitted [ 10101 ] |
| Status | Pull Request Submitted [ 10101 ] | Reviewing Pull Request [ 10303 ] |
| Status | Reviewing Pull Request [ 10303 ] | Merged Needs Testing [ 10002 ] |
| Status | Merged Needs Testing [ 10002 ] | Post-merge Testing In Progress [ 10003 ] |
| Resolution | Done [ 10000 ] | |
| Status | Post-merge Testing In Progress [ 10003 ] | Closed [ 6 ] |
| Description | Description without the Loraine Lab link | Added "Link to files on Loraine Lab Google Drive: https://drive.google.com/drive/folders/1MLsVItXNcskfiCAg62GFmxWc1-NR40Tx?usp=drive_link" |