IGB / IGBF-3893

Create Telomere to Telomere human genome Quickload

    Details

    • Type: Task
    • Status: Needs 1st Level Review
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Situation: There is a new version of the human genome, referred to as telomere to telomere, T2T, or hs1. UCSC provides this genome (link), and IGB retrieves it through the UCSC REST API. As part of IGBF-3902, IGB now includes the hs1 genome under the "Homo sapiens" Species dropdown, but a Quickload for this genome still needs to be made.

      Tasks: Create a BED14 annotation file for this new genome, and add it and the 2bit file (link) to the IGB Quickload.

            Activity

            ann.loraine Ann Loraine added a comment - - edited

            As per discussion with Nowlan Freese after scrum, we would like to do these two things:

            • Use species name "Homo sapiens T2T" for "human telomere to telomere reference"

            This is to allow this assembly to exist alongside the traditional hg38, hg19, etc. assemblies. We want to do this because most people are still using hg38 and hg19, not this new assembly just yet.

            The version should be something like: H_sapiens_T2T_MMM_YYYY

            Note that we need to be super duper careful about making sure that our version dates map correctly onto UCSC patch releases, or whatever they are doing to keep track of how the sequence itself (and all the constituent contigs) changes over time.

            • Instead of version-controlling the 2bit file in our subversion repository, let's use the reference="true" attribute in annots.xml and have the file name be the URL of the 2bit file hosted on the UCSC genome web site.

            This will enable IGB to locally cache the genome file instead of having to use the JSON REST API every time it retrieves sequence data. Also, retrieving data from a 2bit file may be faster than getting sequence data from the JSON REST API.
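The naming convention proposed in the first item above (H_sapiens_T2T_MMM_YYYY) could be checked with a small script before a new directory is committed. This is only a sketch: the regular expression and the function name `is_valid_version` are assumptions made for illustration, not an existing IGB or Quickload rule.

```python
import re

# Three-letter month abbreviations, as used in H_sapiens_Dec_2013.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Assumed pattern: G_species[_optional_tags]_MMM_YYYY
VERSION_RE = re.compile(
    rf"^[A-Z]_[a-z]+(?:_[A-Za-z0-9]+)*_(?:{MONTHS})_\d{{4}}$"
)


def is_valid_version(name):
    """Return True if name looks like a Quickload genome version name."""
    return VERSION_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("H_sapiens_T2T_Jan_2022", "H_sapiens_T2T_MMM_YYYY", "hs1"):
        print(candidate, is_valid_version(candidate))
```

A check like this only validates the shape of the name. Whether the month and year actually correspond to the right UCSC release still has to be confirmed by hand.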

            Testing: Make sure that IGB can also retrieve sequence data from the JSON REST API in case the URL of the 2bit file changes or the UCSC Web site messes up somehow.

            To do this, a tester can edit the "file" tag to point to a bogus location. If everything works the way it is supposed to work, then the user won't even notice that the 2bit file is missing, and IGB will simply default to getting data from the UCSC JSON API.

            However, note that the "load priority" numbers are related to which data source IGB retrieves sequence data from.
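For the fallback test described above, a tester could generate a deliberately broken copy of annots.xml instead of editing it by hand. The sketch below assumes the reference sequence is the `file` element carrying reference="true" and that its location is stored in the `name` attribute; the function name `break_reference` and the bogus URL are made up for illustration.

```python
import xml.etree.ElementTree as ET


def break_reference(annots_xml, bogus_url):
    """Return annots.xml text with the reference file pointed at bogus_url."""
    root = ET.fromstring(annots_xml)
    for file_el in root.iter("file"):
        # Only touch the entry flagged as the reference sequence.
        if file_el.get("reference", "").lower() == "true":
            file_el.set("name", bogus_url)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    original = (
        '<files>'
        '<file name="https://hgdownload.soe.ucsc.edu/gbdb/hs1/hs1.2bit" '
        'title="Human telomere-to-telomere reference" reference="true"/>'
        '</files>'
    )
    # ".invalid" is a reserved top-level domain, so this URL never resolves.
    print(break_reference(original, "https://example.invalid/missing.2bit"))
```

Pointing a local test Quickload at the broken copy, then loading sequence in IGB, exercises the fallback path without touching the real repository.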

            pkulzer Paige Kulzer added a comment - - edited

            To the best of my ability, I've incorporated all of the above information in the creation of this Quickload.

            Specifically, I've added the following line to contents.txt:

            H_sapiens_T2T_Jan_2022  Homo sapiens T2T (Jan 2022) human being (T2T-CHM13 v2.0/hs1)
            

            And I've added the following line to .htaccess:

            AddDescription "Homo sapiens T2T hs1 (T2T Consortium T2T-CHM13 v2.0)" H_sapiens_T2T_Jan_2022
            

            Finally, I created the following annots.xml file:

            <files>
              <file name="https://hgdownload.soe.ucsc.edu/gbdb/hs1/hs1.2bit"
                    title="Human telomere-to-telomere reference"
                    reference="true"
              />
              <file name="H_sapiens_T2T_Jan_2022_ncbiRefSeq.bed.gz"
                    title="UCSC Tracks/Genes and Gene Predictions/RefSeq All"
                    description="Data from UCSC Table Browser - RefSeq All (ncbiRefSeq table) - updated May 29, 2023"
                    url="H_sapiens_T2T_Jan_2022"
                    foreground="B71725"
                    name_size="14"
                    direction_type="arrow"
                    label_field="title"
                    background="FFFFFF"
                    show2tracks="true"
                    load_hint="Whole Sequence"
              />
              <file name="H_sapiens_T2T_Jan_2022_ncbiRefSeqCurated.bed.gz"
                    title="UCSC Tracks/Genes and Gene Predictions/RefSeq Curated"
                    description="Data from UCSC Table Browser - RefSeq Curated (ncbiRefSeqCurated) table - updated May 29, 2023"
                    url="H_sapiens_T2T_Jan_2022"
                    foreground="000000"
                    name_size="14"
                    direction_type="arrow"
                    label_field="title"
                    background="FFFFFF"
                    show2tracks="true"
                    load_hint="Whole Sequence"
              />
            </files>
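A reviewer might want a quick automated look at an annots.xml like the one above before loading it in IGB. The following is a minimal sketch, not an official validator: the two rules it applies (exactly one reference="true" entry, and annotation file names prefixed with the genome version) are assumptions drawn from this ticket, and `check_annots` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def check_annots(annots_xml, version):
    """Return a list of human-readable problems found in annots.xml text."""
    problems = []
    files = ET.fromstring(annots_xml).findall("file")

    # Assumed rule 1: exactly one entry is flagged as the reference sequence.
    references = [f for f in files if f.get("reference", "").lower() == "true"]
    if len(references) != 1:
        problems.append(f"expected 1 reference file, found {len(references)}")

    for file_el in files:
        if file_el in references:
            continue
        name = file_el.get("name", "")
        # Assumed rule 2: annotation files are named after the genome version.
        if not name.startswith(version + "_"):
            problems.append(f"unexpected file name: {name!r}")
        if not file_el.get("title"):
            problems.append(f"missing title for: {name!r}")
    return problems


if __name__ == "__main__":
    with open("H_sapiens_T2T_Jan_2022/annots.xml") as handle:
        for problem in check_annots(handle.read(), "H_sapiens_T2T_Jan_2022"):
            print(problem)
```

An empty result only means the file passes these two assumed checks; it does not replace loading the tracks in IGB and reading the Log.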
            

            Ann Loraine or Nowlan Freese, please let me know if I've overlooked anything here!

            pkulzer Paige Kulzer added a comment -

            Below is an outline of the steps I followed to create this Quickload:

            1. Use wget to obtain the .2bit file from UCSC's track hub directory, then rename it

            wget https://hgdownload.soe.ucsc.edu/gbdb/hs1/hs1.2bit
            mv hs1.2bit H_sapiens_T2T_Jan_2022.2bit
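After a large download like this, it can be worth confirming that the file really is a 2bit file before building on it. The sketch below reads the 16-byte header described in the UCSC twoBit format documentation (signature 0x1A412743, version, sequence count, reserved, each a 32-bit integer). The function name `read_twobit_header` is made up here, and the sketch does not parse the sequence index.

```python
import struct

TWOBIT_SIGNATURE = 0x1A412743  # magic number from the UCSC twoBit format description


def read_twobit_header(data):
    """Parse the first 16 bytes of a 2bit file; raise ValueError if invalid."""
    if len(data) < 16:
        raise ValueError("2bit header needs at least 16 bytes")
    # The file may have been written in either byte order, so try both.
    for order, label in (("<", "little"), (">", "big")):
        signature, version, seq_count, _reserved = struct.unpack(order + "IIII", data[:16])
        if signature == TWOBIT_SIGNATURE:
            return {"byte_order": label, "version": version, "sequence_count": seq_count}
    raise ValueError("signature mismatch: not a 2bit file")


if __name__ == "__main__":
    with open("H_sapiens_T2T_Jan_2022.2bit", "rb") as handle:
        print(read_twobit_header(handle.read(16)))
```

A truncated or HTML error-page download fails the signature check right away, which is cheaper than discovering the problem later in twoBitInfo or IGB.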
            

            2. Create genome.txt, then manually edit it so that the chromosomes are ordered logically (i.e., numerically)

            ./twoBitInfo H_sapiens_T2T_Jan_2022.2bit genome.txt
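The manual reordering of genome.txt could also be scripted. This is a sketch under a few assumptions: genome.txt has one tab-separated "name, size" pair per line, names use a "chr" prefix, and the desired order is numbered chromosomes, then X, Y, M, then anything else alphabetically. The helper names are hypothetical.

```python
# Assumed placement of the non-numeric chromosomes after the numbered ones.
SPECIAL_ORDER = {"X": 0, "Y": 1, "M": 2}


def chrom_key(name):
    """Sort key: chr1..chrN numerically, then chrX, chrY, chrM, then the rest."""
    suffix = name[3:] if name.startswith("chr") else name
    if suffix.isdigit():
        return (0, int(suffix), "")
    if suffix in SPECIAL_ORDER:
        return (1, SPECIAL_ORDER[suffix], "")
    return (2, 0, suffix)


def sort_genome_lines(lines):
    """Return the non-empty 'name<TAB>size' lines sorted by chromosome."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    rows.sort(key=lambda row: chrom_key(row[0]))
    return ["\t".join(row) for row in rows]


if __name__ == "__main__":
    with open("genome.txt") as handle:
        ordered = sort_genome_lines(handle)
    with open("genome.txt", "w") as handle:
        handle.write("\n".join(ordered) + "\n")
```

Sorting by an explicit key avoids the usual text-sort result where chr10 lands before chr2.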
            

            3. Use Homo sapiens' taxID (9606) to get the information needed from gene2accession.gz and gene_info.gz to create the BED14 file in a later step

            gunzip -c gene2accession.gz | awk -F'\t' '$1 == "9606"' > 9606.gene2accession.txt
            gunzip -c gene_info.gz | awk -F'\t' '$1 == "9606"' > 9606.gene_info.txt
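The same taxonomy filter can be written in Python, which makes the matching rule explicit: a line is kept only when its first tab-separated column equals the tax ID exactly. This is a sketch rather than part of the genomesource scripts; `filter_by_taxid` and `filter_gzip_file` are hypothetical names.

```python
import gzip


def filter_by_taxid(lines, taxid="9606"):
    """Yield lines whose first tab-separated field is exactly taxid."""
    for line in lines:
        # Split once on the first tab; "96060" or "#tax_id" will not match.
        if line.split("\t", 1)[0] == taxid:
            yield line


def filter_gzip_file(src_path, dest_path, taxid="9606"):
    """Stream a gzipped NCBI table and write only the rows for taxid."""
    with gzip.open(src_path, "rt") as src, open(dest_path, "w") as dest:
        dest.writelines(filter_by_taxid(src, taxid))


if __name__ == "__main__":
    filter_gzip_file("gene2accession.gz", "9606.gene2accession.txt")
    filter_gzip_file("gene_info.gz", "9606.gene_info.txt")
```

Streaming line by line keeps memory use flat, which matters because gene2accession.gz is large when uncompressed.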
            

            4. Download the RefSeqAll and RefSeqCurated BED files from UCSC's table browser (Link: https://genome.ucsc.edu/cgi-bin/hgTables), then create the BED14 files using the following code:

            cd ~/Documents/Repos/genomesource/
            ./ucscToBedDetail.py -a ~/Downloads/9606.gene2accession.txt -g ~/Downloads/9606.gene_info.txt ~/Downloads/H_sapiens_T2T_ncbiRefSeq.bed.gz ~/Downloads/H_sapiens_T2T_Jan_2022_ncbiRefSeq.bed
            ./ucscToBedDetail.py -a ~/Downloads/9606.gene2accession.txt -g ~/Downloads/9606.gene_info.txt ~/Downloads/H_sapiens_T2T_ncbiRefSeqCurated.bed.gz ~/Downloads/H_sapiens_T2T_Jan_2022_ncbiRefSeqCurated.bed
            

            5. Sort, bgzip, and tabix-index the BED14 files

            cd ~/Downloads/
            sort -k1,1 -k2,2n H_sapiens_T2T_Jan_2022_ncbiRefSeq.bed | bgzip > H_sapiens_T2T_Jan_2022_ncbiRefSeq.bed.gz
            sort -k1,1 -k2,2n H_sapiens_T2T_Jan_2022_ncbiRefSeqCurated.bed | bgzip > H_sapiens_T2T_Jan_2022_ncbiRefSeqCurated.bed.gz
            tabix -0 -s 1 -b 2 -e 3 H_sapiens_T2T_Jan_2022_ncbiRefSeq.bed.gz
            tabix -0 -s 1 -b 2 -e 3 H_sapiens_T2T_Jan_2022_ncbiRefSeqCurated.bed.gz
            

            6. Sanity check the 2bit and BED files - Add the 2bit file as a reference, then drag/drop the BED files into IGB. Confirm that gene models are present, labeled correctly, and that no error messages are present in the Log.

            7. Create a new directory in the quickload repo, then create annots.xml

            cd ~/Documents/Repos/quickload/
            svn mkdir H_sapiens_T2T_Jan_2022
            svn cp H_sapiens_Dec_2013/annots.xml H_sapiens_T2T_Jan_2022
            nano H_sapiens_T2T_Jan_2022/annots.xml
            

            8. Add H_sapiens_T2T_Jan_2022 to contents.txt and .htaccess

            nano contents.txt
            nano .htaccess
            

            9. Create HEADER.md

            ../genomesource/writeQuickLoadHeaderUCSC.py H_sapiens_T2T_Jan_2022 > H_sapiens_T2T_Jan_2022/HEADER.md
            
            pkulzer Paige Kulzer added a comment -

            For review: I have zipped up a folder containing all of the relevant files needed to create this Quickload and placed it in the shared Google Drive. Please take a look at these files and load the 2bit and BED files in IGB to double-check that they're working properly. Also, see Dr. Loraine's comment above for further testing instructions.

            Location of the Quickload on Google Drive: research-big-lorainelab > IGB Project Documentation and Plans > IGB Genomes > H_sapiens_T2T_Jan_2022.zip

            Question for the reviewer: In the description of UCSC's track hub directory for this genome, it says,

            "This is the track hub directory for the T2T CHM13 v2.0 assembly of the UCSC Genome Browser (the internal assembly name is "hs1", but we suggest using CHM13 in publications)."

            Should I associate all three assembly names with this Quickload (T2T CHM13 v2.0, hs1, and CHM13), or should I leave out the internal assembly name (hs1) and just use T2T CHM13 v2.0 and CHM13?

            nfreese Nowlan Freese added a comment -

            I have made some changes to the species and synonyms files in IGB so that the human telomere to telomere genome is now its own species.

            To test:

            1. Start IGB
            2. Click on the picture of the Mona Lisa - this should load the hg38 human genome
            3. In the Species menu, select Homo sapiens T2T (the tooltip should say Human Telomere to Telomere)
            4. In the Genome Version, select H_sapiens_T2T_Jan_2022 (the tooltip should say hs1 and T2T-CHM13 v2.0/hs1)
            5. The genome should load. Zoom in, click Load Sequence, and check that there is data through UCSC REST in the Available Data window

            Branch: https://bitbucket.org/nfreese/nowlanfork-igb/branch/IGBF-3893

            nfreese Nowlan Freese added a comment -

            Ann Loraine PR for the species.txt and synonyms.txt change: https://bitbucket.org/lorainelab/integrated-genome-browser/pull-requests/1038

              People

              • Assignee:
                nfreese Nowlan Freese
                Reporter:
                nfreese Nowlan Freese
              • Votes:
                0
                Watchers:
                3

                Dates

                • Created:
                  Updated: