[IGBF-3893] Create Telomere to Telomere human genome Quickload - JIRA UNCC

Details

Type: Task
Status: Needs 1st Level Review
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
2
Epic Link:
Improve IGB for users
Sprint:
Fall 3, Fall 4

Description

Situation: There is a new version of the human genome, referred to as telomere to telomere, T2T, or hs1. UCSC provides this genome (link), and IGB pulls it in through the UCSC REST API. As part of IGBF-3902, IGB now includes the hs1 genome under the "Homo sapiens" Species dropdown, but a Quickload for this genome still needs to be made.

Tasks: Create a bed14 annotation file for this new genome, and add it and the 2bit file (link) to IGB Quickload.

Attachments

Issue Links

is blocked by

IGBF-2537 Enable Quickload to specify the 2bit file for a genome

Closed

relates to

IGBF-3902 Fix incorrect genome names in Species drop-down

Closed

Activity


Ann Loraine added a comment - 17/Sep/24 11:23 AM - edited

As per discussion with Nowlan Freese after scrum, we would like to do these two things:

Use species name "Homo sapiens T2T" for "human telomere to telomere reference"

This allows the assembly to exist alongside the traditional hg38, hg19, etc. assemblies. We want this because most people are still using hg38 and hg19, not the new assembly just yet.

The version should be something like: H_sapiens_T2T_MMM_YYYY
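A quick way to check candidate version names against this convention is a small pattern match. This is only a sketch: the regular expression is an assumption inferred from names that appear in this ticket (H_sapiens_T2T_Jan_2022, H_sapiens_Dec_2013), not a rule taken from IGB, and `is_valid_version` is a hypothetical helper.

```python
import re

# Assumed convention: <Genus initial>_<species>[_<tag>...]_<Mon>_<YYYY>
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
VERSION_RE = re.compile(
    rf"^[A-Z]_[a-z]+(?:_[A-Za-z0-9]+)*_(?:{_MONTHS})_\d{{4}}$"
)


def is_valid_version(name: str) -> bool:
    """Return True if name looks like an IGB-style genome version string."""
    return VERSION_RE.match(name) is not None


for candidate in ("H_sapiens_T2T_Jan_2022", "H_sapiens_Dec_2013", "hs1"):
    print(candidate, is_valid_version(candidate))
```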

Note that we need to be super duper careful about making sure that our version dates map correctly onto UCSC patch releases, or whatever they are doing to keep track of how the sequence itself (and all the constituent contigs) changes over time.

Instead of version-controlling the 2bit file in our subversion repository, let's use the "reference=TRUE" tag in annots.xml and have the file name be the URL of the 2bit file hosted on the UCSC genome web site.

This will enable IGB to locally cache the genome file instead of always having to use the JSON REST API to retrieve sequence data all the time. Also, retrieving data from a 2bit file may be faster than getting sequence data from the JSON REST API.

Testing: Make sure that IGB can also retrieve sequence data from the JSON REST API in case the URL of the 2bit file changes or the UCSC Web site messes up somehow.

To do this, a tester can edit the "file" tag to point to a bogus location. If everything works the way it is supposed to work, then the user won't even notice that the 2bit file is missing and will simply default to getting data from the UCSC JSON API.

However, note that the "load priority" numbers are related to which data source IGB retrieves sequence data from.
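The fallback behavior described above can be pictured with a short sketch. This is not IGB's code: `load_sequence`, `broken_2bit`, and `rest_stub` are hypothetical names, and the list order is only a stand-in for whatever ordering IGB actually applies through its load priority settings.

```python
def load_sequence(sources, region):
    """Try each (label, fetch) pair in order and return the first success.

    sources: list of (label, callable) pairs; each callable takes a region
    and either returns a sequence string or raises an exception.
    """
    errors = []
    for label, fetch in sources:
        try:
            return label, fetch(region)
        except Exception as exc:  # any failure moves on to the next source
            errors.append(f"{label}: {exc}")
    raise RuntimeError("no sequence source available: " + "; ".join(errors))


def broken_2bit(region):
    # Simulates the tester pointing the "file" tag at a bogus location.
    raise OSError("2bit file not found")


def rest_stub(region):
    # Stand-in for a REST lookup; returns made-up bases, not real sequence.
    return "ACGT"


label, sequence = load_sequence(
    [("2bit", broken_2bit), ("UCSC REST", rest_stub)],
    ("chr1", 0, 4),
)
print(label, sequence)
```

If the first source fails, the caller still gets sequence back from the second one, which is the outcome the test above is meant to confirm from the user's point of view.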


Paige Kulzer added a comment - 18/Sep/24 11:10 AM - edited

To the best of my ability, I've incorporated all of the above information in the creation of this Quickload.

Specifically, I've added the following line to contents.txt:

H_sapiens_T2T_Jan_2022  Homo sapiens T2T (Jan 2022) human being (T2T-CHM13 v2.0/hs1)

And I've added the following line to .htaccess:

AddDescription "Homo sapiens T2T hs1 (T2T Consortium T2T-CHM13 v2.0)" H_sapiens_T2T_Jan_2022

Finally, I created the following annots.xml file:

<files>
  <file name="https://hgdownload.soe.ucsc.edu/gbdb/hs1/hs1.2bit"
        title="Human telomere-to-telomere reference"
        reference="true"
  />
  <file name="H_sapiens_T2T_Jan_2022_ncbiRefSeq.bed.gz"
        title="UCSC Tracks/Genes and Gene Predictions/RefSeq All"
        description="Data from UCSC Table Browser - RefSeq All (ncbiRefSeq table) - updated May 29, 2023"
        url="H_sapiens_T2T_Jan_2022"
        foreground="B71725"
        name_size="14"
        direction_type="arrow"
        label_field="title"
        background="FFFFFF"
        show2tracks="true"
        load_hint="Whole Sequence"
  />
  <file name="H_sapiens_T2T_Jan_2022_ncbiRefSeqCurated.bed.gz"
        title="UCSC Tracks/Genes and Gene Predictions/RefSeq Curated"
        description="Data from UCSC Table Browser - RefSeq Curated (ncbiRefSeqCurated) table - updated May 29, 2023"
        url="H_sapiens_T2T_Jan_2022"
        foreground="000000"
        name_size="14"
        direction_type="arrow"
        label_field="title"
        background="FFFFFF"
        show2tracks="true"
        load_hint="Whole Sequence"
  />
</files>
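One way a reviewer might sanity-check an annots.xml like this before loading it in IGB is to parse it and separate the reference entry from the annotation entries. The sketch below uses Python's standard library on a trimmed copy of the file; `summarize_annots` is a hypothetical helper, and treating a missing `reference` attribute as "false" is an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the annots.xml above (most display attributes omitted).
ANNOTS_XML = """<files>
  <file name="https://hgdownload.soe.ucsc.edu/gbdb/hs1/hs1.2bit"
        title="Human telomere-to-telomere reference"
        reference="true" />
  <file name="H_sapiens_T2T_Jan_2022_ncbiRefSeq.bed.gz"
        title="UCSC Tracks/Genes and Gene Predictions/RefSeq All"
        load_hint="Whole Sequence" />
</files>"""


def summarize_annots(xml_text):
    """Return (reference names, annotation names) from annots.xml text."""
    root = ET.fromstring(xml_text)
    references, annotations = [], []
    for entry in root.findall("file"):
        name = entry.get("name")
        # Assumption: an absent reference attribute means "not a reference".
        if entry.get("reference", "false").lower() == "true":
            references.append(name)
        else:
            annotations.append(name)
    return references, annotations


refs, annots = summarize_annots(ANNOTS_XML)
print("reference sequence:", refs)
print("annotation files:", annots)
```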

Ann Loraine or Nowlan Freese, please let me know if I've overlooked anything here!


Paige Kulzer added a comment - 18/Sep/24 11:33 AM

Below is an outline of the steps I followed to create this Quickload:

1. Use wget to obtain the .2bit file from UCSC's track hub directory, then rename it

wget https://hgdownload.soe.ucsc.edu/gbdb/hs1/hs1.2bit
mv hs1.2bit H_sapiens_T2T_Jan_2022.2bit

2. Create genome.txt, then manually edit it so that the chromosomes are ordered logically (i.e., numerically)

./twoBitInfo H_sapiens_T2T_Jan_2022.2bit genome.txt
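The manual reordering in step 2 could also be scripted. The sketch below assumes genome.txt holds tab-separated name and size columns and that chromosomes are named chr1 through chr22, chrX, chrY, and chrM; the sizes in the example are placeholders, not real chromosome lengths, and both function names are hypothetical.

```python
import re


def chrom_sort_key(name):
    """Order chr1..chrN numerically, then chrX, chrY, chrM, then the rest."""
    match = re.fullmatch(r"chr(\d+)", name)
    if match:
        return (0, int(match.group(1)), "")
    special = {"chrX": 0, "chrY": 1, "chrM": 2}
    if name in special:
        return (1, special[name], "")
    return (2, 0, name)  # unplaced or unexpected names go last, by name


def sort_genome_lines(lines):
    """Sort 'name<TAB>size' lines into the logical chromosome order."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    rows.sort(key=lambda row: chrom_sort_key(row[0]))
    return ["\t".join(row) for row in rows]


# Placeholder sizes, for illustration only.
example = ["chr10\t100\n", "chrX\t50\n", "chr2\t200\n", "chrM\t10\n", "chr1\t300\n"]
print("\n".join(sort_genome_lines(example)))
```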

3. Use Homo sapiens' taxID (9606) to get the information needed from gene2accession.gz and gene_info.gz to create the BED14 file in a later step

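# $'...' makes the shell pass a real tab; inside plain single quotes, grep may treat \t as a literal "t"
gunzip -c gene2accession.gz | grep $'^9606\t' > 9606.gene2accession.txt
gunzip -c gene_info.gz | grep $'^9606\t' > 9606.gene_info.txt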
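The same taxID filter can be written in Python, which sidesteps shell quoting of the tab character. This is a sketch under the assumption that both NCBI files are tab-delimited with the taxonomy ID in the first column; `filter_by_taxid` and `filter_gz` are hypothetical helper names.

```python
import gzip


def filter_by_taxid(lines, taxid="9606"):
    """Yield only the lines whose first tab-delimited field equals taxid."""
    prefix = taxid + "\t"
    for line in lines:
        if line.startswith(prefix):
            yield line


def filter_gz(src, dest, taxid="9606"):
    """Stream a gzipped table and write the matching rows to dest."""
    with gzip.open(src, "rt") as fin, open(dest, "w") as fout:
        fout.writelines(filter_by_taxid(fin, taxid))


# Example (paths as used in this step):
# filter_gz("gene2accession.gz", "9606.gene2accession.txt")
# filter_gz("gene_info.gz", "9606.gene_info.txt")
```

Matching on the ID plus the trailing tab keeps rows for other taxa whose IDs merely begin with 9606 out of the result.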

4. Download the RefSeqAll and RefSeqCurated BED files from UCSC's table browser (Link: https://genome.ucsc.edu/cgi-bin/hgTables), then create the BED14 files using the following code:

cd ~/Documents/Repos/genomesource/
./ucscToBedDetail.py -a ~/Downloads/9606.gene2accession.txt -g ~/Downloads/9606.gene_info.txt ~/Downloads/H_sapiens_T2T_ncbiRefSeq.bed.gz ~/Downloads/H_sapiens_T2T_Jan_2022_ncbiRefSeq.bed
./ucscToBedDetail.py -a ~/Downloads/9606.gene2accession.txt -g ~/Downloads/9606.gene_info.txt ~/Downloads/H_sapiens_T2T_ncbiRefSeqCurated.bed.gz ~/Downloads/H_sapiens_T2T_Jan_2022_ncbiRefSeqCurated.bed

5. Sort, bgzip, and tabix-index the BED14 files

cd ~/Downloads/
sort -k1,1 -k2,2n H_sapiens_T2T_Jan_2022_ncbiRefSeq.bed | bgzip > H_sapiens_T2T_Jan_2022_ncbiRefSeq.bed.gz
sort -k1,1 -k2,2n H_sapiens_T2T_Jan_2022_ncbiRefSeqCurated.bed | bgzip > H_sapiens_T2T_Jan_2022_ncbiRefSeqCurated.bed.gz
tabix -0 -s 1 -b 2 -e 3 H_sapiens_T2T_Jan_2022_ncbiRefSeq.bed.gz
tabix -0 -s 1 -b 2 -e 3 H_sapiens_T2T_Jan_2022_ncbiRefSeqCurated.bed.gz
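Before compressing and indexing, it can help to confirm that each file really has 14 tab-separated columns and is position-sorted, since tabix expects records for a chromosome to be grouped together and ordered by start. The checker below is a sketch, not part of the genomesource repository; `check_bed14` is a hypothetical name.

```python
def check_bed14(lines):
    """Return a list of problems found in sorted BED14 text lines."""
    problems = []
    finished = set()  # chromosomes whose block of records has already ended
    current_chrom, last_start = None, -1
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 14:
            problems.append(
                f"line {number}: expected 14 fields, found {len(fields)}"
            )
            continue
        chrom, start = fields[0], int(fields[1])
        if chrom != current_chrom:
            if chrom in finished:
                problems.append(f"line {number}: {chrom} is not contiguous")
            if current_chrom is not None:
                finished.add(current_chrom)
            current_chrom, last_start = chrom, -1
        if start < last_start:
            problems.append(
                f"line {number}: start {start} precedes {last_start}"
            )
        last_start = start
    return problems


# Example (file name from step 4):
# with open("H_sapiens_T2T_Jan_2022_ncbiRefSeq.bed") as bed:
#     print(check_bed14(bed) or "looks sorted, 14 columns throughout")
```

The check deliberately ignores the order in which chromosomes appear, so it does not depend on the locale that `sort` used.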

6. Sanity check the 2bit and BED files - Add the 2bit file as a reference, then drag/drop the BED files into IGB. Confirm that gene models are present, labeled correctly, and that no error messages are present in the Log.

7. Create a new directory in the quickload repo, then create annots.xml

cd ~/Documents/Repos/quickload/
svn mkdir H_sapiens_T2T_Jan_2022
svn cp H_sapiens_Dec_2013/annots.xml H_sapiens_T2T_Jan_2022
nano H_sapiens_T2T_Jan_2022/annots.xml

8. Add H_sapiens_T2T_Jan_2022 to contents.txt and .htaccess

nano contents.txt
nano .htaccess

9. Create HEADER.md

../genomesource/writeQuickLoadHeaderUCSC.py H_sapiens_T2T_Jan_2022 > H_sapiens_T2T_Jan_2022/HEADER.md


Paige Kulzer added a comment - 18/Sep/24 11:48 AM

For review: I have zipped up a folder containing all of the files needed to create this Quickload and placed it in the shared Google Drive. Please take a look at these files and load the 2bit and BED files in IGB to double-check that they're working properly. Also, see Dr. Loraine's comment above for further testing instructions.

Location of the Quickload on Google Drive: research-big-lorainelab > IGB Project Documentation and Plans > IGB Genomes > H_sapiens_T2T_Jan_2022.zip

Question for the reviewer: In the description of UCSC's track hub directory for this genome, it says,

"This is the track hub directory for the T2T CHM13 v2.0 assembly of the UCSC Genome Browser (the internal assembly name is "hs1", but we suggest using CHM13 in publications)."

Should I associate all three assembly names with this Quickload (T2T CHM13 v2.0, hs1, and CHM13), or should I leave out the internal assembly name (hs1) and just use T2T CHM13 v2.0 and CHM13?


Nowlan Freese added a comment - 20/Sep/24 1:29 PM

I have made some changes to the species and synonyms files in IGB so that the human telomere to telomere genome is now its own species.

To test:

Start IGB
Click on the picture of the Mona Lisa - this should load the hg38 human genome
In the Species menu, select Homo sapiens T2T (the tooltip should say Human Telomere to Telomere)
In the Genome Version menu, select H_sapiens_T2T_Jan_2022 (the tooltip should say hs1 and T2T-CHM13 v2.0/hs1)
The genome should load. Zoom in, click Load Sequence, and check that data is available through UCSC REST in the Available Data window

Branch: https://bitbucket.org/nfreese/nowlanfork-igb/branch/IGBF-3893
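For an independent check that UCSC serves hs1 sequence over REST, a tester could build a request URL for a small region and open it in a browser or with curl. This sketch assumes the public UCSC endpoint `https://api.genome.ucsc.edu/getData/sequence`; whether IGB issues exactly this request is not confirmed in this ticket, and `sequence_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed endpoint: the public UCSC REST API sequence service.
UCSC_API = "https://api.genome.ucsc.edu/getData/sequence"


def sequence_url(genome, chrom, start, end):
    """Build a UCSC REST URL for a 0-based, half-open sequence region."""
    query = urlencode(
        {"genome": genome, "chrom": chrom, "start": start, "end": end}
    )
    return f"{UCSC_API}?{query}"


print(sequence_url("hs1", "chr1", 0, 100))
```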


Nowlan Freese added a comment - 20/Sep/24 2:13 PM

Ann Loraine

PR for the species.txt and synonyms.txt change: https://bitbucket.org/lorainelab/integrated-genome-browser/pull-requests/1038


People

Assignee:

Nowlan Freese

Reporter:

Nowlan Freese

Votes:

0

Watchers:

3

Dates

Created:

05/Sep/24 1:53 PM

Updated:

21/Oct/24 9:40 AM