[IGBF-3409] Fix wheat genome files and documentation - JIRA UNCC

Details

Type: Task
Status: To-Do
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
0.25
Epic Link:
Improve IGB for users
Sprint:
Spring 1, Spring 2

Description

This directory:

http://lorainelab-quickload.scidas.org/quickload/T_aestivum_Aug_2018/

contains a ".csi" file instead of a ".tbi" file.

Not clear why. Investigate and fix.

Also, modify the "annots.xml" to point IGB to a sequence file stored at EBI or wherever it is.

After doing this, update HEADER.md in the quickload svn repository to no longer advise users to retrieve and deploy the wheat genome file.

Attachments

Issue Links

relates to

IGBF-2538 Support CSI tabix index for very large genomes

To-Do

IGBF-2333 Add wheat genome to IGB quickload system

Closed

Activity

Paige Kulzer added a comment - 21/Jan/25 9:45 AM

The tabix (.tbi) index format can only handle individual chromosomes up to 512 Mbp in length. For longer chromosomes, the CSI index format should be used. Per the tabix documentation (https://www.htslib.org/doc/tabix.html):

The tabix (.tbi) and BAI index formats can handle individual chromosomes up to 512 Mbp (2^29 bases) in length. If your input file might contain data lines with begin or end positions greater than that, you will need to use a CSI index.

The bread wheat genome has chromosomes longer than 512 Mbp, so a CSI index was created for this Quickload instead of a tabix (.tbi) index.
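
As a minimal sketch of the rule quoted above, the check below picks the index format from the longest chromosome and builds the matching tabix command line. The function names, the file name, and the chromosome lengths are hypothetical placeholders, not values taken from this Quickload; the only tabix options assumed are `-p` (preset) and `--csi`.

```python
# Largest coordinate a .tbi (or BAI) index can address: 2^29 bases (512 Mbp).
TBI_MAX_COORD = 2 ** 29


def index_format(chrom_lengths):
    """Return "csi" if any chromosome is longer than the .tbi limit, else "tbi"."""
    longest = max(chrom_lengths.values(), default=0)
    return "csi" if longest > TBI_MAX_COORD else "tbi"


def tabix_command(path, chrom_lengths, preset="bed"):
    """Build (but do not run) a tabix command line for a bgzip-compressed file."""
    cmd = ["tabix", "-p", preset]
    if index_format(chrom_lengths) == "csi":
        cmd.append("--csi")  # write a .csi index instead of a .tbi index
    cmd.append(path)
    return cmd


if __name__ == "__main__":
    # Made-up lengths for illustration only.
    lengths = {"chrA": 400_000_000, "chrB": 800_000_000}
    print(index_format(lengths))
    print(" ".join(tabix_command("annotations.bed.gz", lengths)))
```

The command is only assembled here, not executed, so the sketch can be used to decide on an index format before any files are touched.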

Paige Kulzer added a comment - 21/Jan/25 10:57 AM - edited

I modified the annots.xml file to point to the genome sequence file hosted by Ensembl. I also added the label_field attribute so that the id of each gene model is displayed automatically.

Note: I tested these changes in IGB and found that clicking the Load Sequence button produced an error saying there was no genome sequence to load. I then retrieved the genome sequence file from Ensembl, converted it to 2bit format, and added it to the Triticum aestivum Quickload folder before testing again. Clicking the Load Sequence button no longer produces an error.
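
For reviewers, here is a rough sketch of the kind of annots.xml entries described above, generated with Python's standard library. Only the label_field attribute comes from this ticket. The `files`/`file` layout, the `name` and `title` attributes, the `reference="true"` marker for the sequence file, and all file names are assumptions for illustration and should be checked against the actual annots.xml in the Quickload.

```python
import xml.etree.ElementTree as ET


def build_annots(sequence_location, annotation_file, annotation_title):
    """Return an annots.xml-style document as a string (illustrative layout)."""
    files = ET.Element("files")
    # Assumed: the sequence file is marked with reference="true".
    ET.SubElement(files, "file", {
        "name": sequence_location,
        "title": "Genome sequence",
        "reference": "true",
    })
    # label_field="id" asks for each gene model's id to be shown as its label.
    ET.SubElement(files, "file", {
        "name": annotation_file,
        "title": annotation_title,
        "label_field": "id",
    })
    return ET.tostring(files, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical file names, not the ones used in this Quickload.
    print(build_annots("genome.2bit", "gene_models.bed.gz", "Gene models"))
```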

Paige Kulzer added a comment - 21/Jan/25 11:13 AM

I'm not seeing anything in HEADER.md that advises users to retrieve and deploy the wheat genome file, so this may already have been fixed. This Quickload is now ready for testing!

I've put a zipped copy of the Quickload on Google Drive for review: https://drive.google.com/drive/folders/1bFRx4PqldxNf400n7Vr9SD_dNeNmtpvk?usp=drive_link

Nowlan Freese added a comment - 24/Feb/25 2:01 PM

Unclear if anything needs to be done for this ticket.

See ~~IGBF-2333~~ for why the CSI index is needed (IGBF-2538 would add support for CSI).
The 2bit file is in the QuickLoad, but we did not put it in the SVN repo due to its size (see ~~IGBF-2333~~).
HEADER.md no longer advises users to retrieve and deploy the wheat genome file.

I think we can close this ticket.

Ann Loraine added a comment - 25/Feb/25 10:50 AM

To-do:

Improve header markdown (quickload) to better document where the files came from
