[HELP-319] Investigate why NCBI GFF files will not open - JIRA UNCC

Details

Type: Bug
Status: To-Do (View Workflow)
Priority: Major
Resolution: Unresolved
Labels:
- Intermediate

Story Points:
1.5

Description

Email from user, via sourceforge, from: sjack93@users.sourceforge.net

The IGB website says their program supports many different file formats, including gff. I have saved a genome from NCBI in several different file formats to my computer (using Ubuntu), but when I go to open the custom genome from IGB, it only recognizes the .fna file type, when I have genomes saved as .gbff and .gff in the same folder. It's like the program won't recognize these file types saved on my computer. Does anyone know why this might be happening? Thanks in advance.

and direct email from user:

Hi Ann,

I found your e-mail address on the troubleshooting page of the IGB (integrated genome browser) website and I'm hoping you can help me. The IGB user guide states that many different file formats are supported for this program (>20 file types including gff, gbff, fna and more). However, when I attempt to open genome from file, it recognizes only about half the file types it claims to be compatible with. I am running IGB on Ubuntu. Do you have any idea why this might be happening?

from: stephanie.jack@unb.ca

Reply:

Reply from Dr. Loraine:

Thank you for getting in touch.

Can you send me URLs of the GFF files you are trying to open?

We've had some issues with NCBI's GFF files in the past and I think we may have fixed those problems ... but this may need to be updated!

If we are able to fix the problem, we should be able to roll it out to you fairly quickly as an "early access" IGB release. We are setting up the early access mechanism in the next few weeks, so hopefully we can get your problem addressed within a few weeks as well.

Attachments

Issue Links

relates to

IGBF-1546 Add American eel genome

Closed

Activity

Nowlan Freese added a comment - 20/Feb/19 3:41 PM

User help thread on SourceForge:

Stephanie:
The IGB website says their program supports many different file formats, including gff. I have saved a genome from NCBI in several different file formats to my computer (using Ubuntu), but when I go to open the custom genome from IGB, it only recognizes the .fna file type, when I have genomes saved as .gbff and .gff in the same folder. It's like the program won't recognize these file types saved on my computer. Does anyone know why this might be happening? Thanks in advance.

Nowlan:
To load a custom genome, IGB is looking for sequence files such as fasta, fna, or 2bit. If the genome for your data is not available, and you do not have a sequence file, you can drag and drop the .gff file directly into IGB and then click Load Data.

Stephanie:
Thanks for the info. I loaded the genome as a fna file, then the annotation as a .gff file, which was recommended to me on another online forum. The program has been "retrieving chromosomes" for many hours, do you know why this might be? Also, do you know what the "maximum heap size" is? It's displayed on the bottom right corner of the IGB interface, and the proportion of max heap size being used continues to change as the program attempts to retrieve chromosomes.

Dr. Loraine:
Are there a lot of reference sequences mentioned in the GFF file?
I would recommend opening the same file in an IDE with a debugger to see where the hang-up occurs.
It would be nice if IGB could handle the various issues that come up with NCBI gff — NCBI is a major clearinghouse for genomic data that many people use.

Nowlan:
The issue appears to be with the gff file you are trying to view. The file does not appear to contain gene annotations mapped to a genome.

The Anguilla rostrata (American eel) annotation is available from Dryad. If you unpack the download, there is a file called american_eel_genome_v5.gff that appears to contain the annotation. I sorted, compressed, and indexed the file (attached). Try loading it (american_eel_genome_v5.sorted.gff.gz) in IGB and let me know if it works for you.
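One rough way to check whether a GFF file carries gene annotations is to tally the feature types in its third column. The sketch below assumes a tab-delimited GFF3 file. The function names gff_feature_types and has_gene_features are hypothetical helpers, not part of IGB, and treating a "gene" entry in column 3 as evidence of gene annotations is a simplifying assumption.

```shell
# Hypothetical helpers (not part of IGB). Assumes tab-delimited GFF3 input.

# Print each feature type (column 3) with its count, most frequent first.
gff_feature_types() {
    awk -F'\t' '
        /^##FASTA/ { exit }                 # stop at an embedded FASTA section
        !/^#/ && NF >= 9 { n[$3]++ }        # count complete feature lines only
        END { for (t in n) print n[t] "\t" t }
    ' "$1" | sort -k1,1nr
}

# Succeed (exit status 0) if at least one feature line has type "gene".
has_gene_features() {
    awk -F'\t' '
        /^##FASTA/ { exit }
        !/^#/ && $3 == "gene" { found = 1; exit }
        END { exit !found }
    ' "$1"
}
```

For example, `gff_feature_types american_eel_genome_v5.gff` would list every feature type in that file with its count, and `has_gene_features american_eel_genome_v5.gff && echo "has genes"` would report whether any gene rows are present.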

Nowlan Freese added a comment - 20/Feb/19 3:41 PM

User help thread on Biostars

Nowlan Freese added a comment - 20/Feb/19 3:49 PM

There seem to be a couple of issues. The gff file from NCBI she was trying to load is oddly formatted, and it is unclear if it contains actual annotations.

The other issue is that the file contains ~12,000 assemblies/chromosomes. This causes the following exception in IGB:

Feb 20, 2019 3:45:51 PM com.affymetrix.genometry.quickload.QuickLoadSymLoader logException
SEVERE: Too many open files
java.io.IOException: Too many open files
    at java.io.UnixFileSystem.createFileExclusively(Native Method)
    at java.io.File.createTempFile(File.java:2024)
    at java.io.File.createTempFile(File.java:2070)
    at com.affymetrix.genometry.symloader.SymLoader.addToLists(SymLoader.java:421)
    at com.affymetrix.genometry.symloader.GFF3.parseLines(GFF3.java:215)
    at com.affymetrix.genometry.symloader.SymLoader.buildIndex(SymLoader.java:241)
    at com.affymetrix.genometry.symloader.GFF3.init(GFF3.java:76)
    at com.affymetrix.genometry.symloader.GFF3.getChromosomeList(GFF3.java:83)
    at com.affymetrix.genometry.quickload.QuickLoadSymLoader.getChromosomeList(QuickLoadSymLoader.java:254)
    at com.affymetrix.igb.view.load.GeneralLoadUtils$3.runInBackground(GeneralLoadUtils.java:1002)
    at com.affymetrix.igb.view.load.GeneralLoadUtils$3.runInBackground(GeneralLoadUtils.java:996)
    at com.affymetrix.genometry.thread.CThreadWorker.doInBackground(CThreadWorker.java:73)
    at javax.swing.SwingWorker$1.call(SwingWorker.java:295)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at javax.swing.SwingWorker.run(SwingWorker.java:334)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
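
The trace shows the GFF3 loader reaching File.createTempFile through SymLoader.addToLists while building its index. That suggests the number of temporary files grows with the number of reference sequences, though this is an inference from the trace and has not been checked against the IGB source. A minimal sketch for comparing the number of distinct sequence IDs in a GFF file with the per-process open-file limit follows; the function names are hypothetical, and the input is assumed to be tab-delimited GFF3.

```shell
# Hypothetical helpers (not part of IGB). Assumes tab-delimited GFF3 input.

# Print the number of distinct sequence IDs (column 1) in a GFF file.
count_seqids() {
    awk -F'\t' '
        /^##FASTA/ { exit }                      # stop at an embedded FASTA section
        !/^#/ && NF > 0 && !seen[$1]++ { n++ }   # first time this ID is seen
        END { print n + 0 }
    ' "$1"
}

# Succeed (exit status 0) if the file names more sequences than the
# shell's open-file limit (ulimit -n) allows descriptors.
exceeds_fd_limit() {
    limit=$(ulimit -n)
    [ "$limit" != "unlimited" ] && [ "$(count_seqids "$1")" -gt "$limit" ]
}
```

Running `count_seqids` on the NCBI file and comparing the result with `ulimit -n` would show whether roughly 12,000 sequences is above the limit on the user's system.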

Nowlan Freese added a comment - 20/Feb/19 3:52 PM

The best current solution is to download the annotation file from Dryad, which is the annotation source listed in the original American eel paper. The file needs to be sorted, bgzipped, and tabix-indexed, or else IGB will throw an exception: Too many open files.

Unpack the file.
Navigate to the directory containing american_eel_genome_v5.gff
(grep ^"#" american_eel_genome_v5.gff; grep -v ^"#" american_eel_genome_v5.gff | sort -k1,1 -k4,4n) > american_eel_genome_v5.sorted.gff
bgzip american_eel_genome_v5.sorted.gff
tabix -p gff american_eel_genome_v5.sorted.gff.gz
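
Before compressing and indexing, it can help to confirm that the sorted file is in the order tabix expects: all lines for a given sequence together, with start positions non-decreasing within each sequence. The sketch below is one way to check this on the uncompressed file. The function name check_gff_sorted is hypothetical, and the input is assumed to be tab-delimited GFF3.

```shell
# Hypothetical helper (not part of IGB or htslib). Assumes tab-delimited GFF3.
# Exit status 0 if the file is grouped by sequence ID (column 1) and sorted
# by start position (column 4) within each sequence; otherwise prints the
# first problem found and returns 1.
check_gff_sorted() {
    awk -F'\t' '
        /^##FASTA/ { exit }             # stop at an embedded FASTA section
        /^#/ || NF == 0 { next }        # skip comments and blank lines
        $1 != prev {
            if ($1 in seen) {
                print "line " NR ": sequence " $1 " is not contiguous"
                bad = 1; exit
            }
            seen[$1] = 1; prev = $1; last = 0
        }
        ($4 + 0) < last {
            print "line " NR ": start " $4 " precedes " last
            bad = 1; exit
        }
        { last = $4 + 0 }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, `check_gff_sorted american_eel_genome_v5.sorted.gff && bgzip american_eel_genome_v5.sorted.gff` would compress the file only if the check passes.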
