Details
- Type: Task
- Status: Closed
- Priority: Major
- Resolution: Done
- Affects Version/s: None
- Fix Version/s: None
- Labels: None
- Story Points: 4
- Epic Link:
- Sprint: Spring 3 2022 Jan 31 - Feb 11, Spring 4 2022 Feb 14 - Feb 25, Spring 5 2022 Feb 28 - Mar 11, Spring 6 2022 Mar 14 - Mar 25, Spring 7 2022 Mar 28 - Apr 8, Spring 8 2022 Apr 11 - Apr 22, Spring 9 2022 May 9, Summer 1 2022 May 23, Summer 2 2022 June 6, Summer 3 2022 June 21, Summer 4 2022 July 4, Summer 5 2022 July 18, Summer 6 2022 Aug 1
Description
One major goal of making IGB capable of viewing data from Track Hubs is to enable users to explore and analyze data using two genome browsers instead of just one: Integrated Genome Browser and the UCSC Genome Browser.
We hypothesize that users will be able to notice and investigate new aspects of their data if they are able to use different genome browser tools to look at the same data. These new aspects could be anything - problems in data sets they were not aware of, problems with the browsers themselves, and, best of all, entirely new and undiscovered aspects of biology they were not previously aware of.
To emphasize and illustrate the above idea, let's create and deploy a Track Hub that will show genomes from the main IGB Quickload site, focusing on genome quickloads for important research organisms.
Create these track hubs "by hand", similar to what an ordinary (non-programmer) user would likely do. Take notes on issues and problems you encounter, as we will want to include these observations in papers and talks about the process.
For this task, create a Track Hub for the latest version of the human genome, including data sets from all the quickloads available for that human genome (except Genome in a Bottle).
See this repository for possibly useful code and documentation regarding the quickload format: https://bitbucket.org/lorainelab/igbquickload/src/master/
Attachments
- Alignments.JPG (186 kB)
- human_hub.zip (2 kB)
- Junctions.JPG (77 kB)
- quickload-to-trackhub.pptx (9.03 MB)
- RefSeq.JPG (96 kB)
Issue Links
Activity
I have uploaded the trackhub to cyverse using bioviz connect.
Note Nowlan Freese, Karthik Raveendran: I was unable to upload a folder and needed to manually recreate the directory structure in bioviz connect and then upload files into the respective folders. Not sure if this is a bug, because the New -> Upload option states "Upload File/Folder". Observed on macOS.
I have added the trackhub to UCSC at the link below
However, it seems that because all the data in quickload is compressed, UCSC is unable to read it. For example, clicking on 'view sequences' at the above link results in the error below:
- "Unsupported type 'psl' in hub https://data.cyverse.org/dav-anon/iplant/home/pbadzuh/human_hub/hub.txt genome hub_3279417_hg38 track all_est"
The file the trackhub refers to is H_sapiens_Dec_2013_all_est.psl.gz and the psl data type is listed as supported on the UCSC trackDB documentation page here, under the 'type' tab.
After searching the UCSC support group, I found the issue reported earlier here. It looks like some file types cannot be part of a hub, but can be loaded in the UCSC genome browser manually.
It would be nice to know the entire list of file types with this restriction. I was unable to find this in the documentation, and I don't seem to have permission to post to the UCSC support group.
The most recent human genome release in IGB Quickload system has three different Quickload data sources, called: RNA-Seq, Genome in a Bottle, and IGB Quickload.
We need to be able to examine the data from RNA-Seq and IGB Quickload in the UCSC Genome Browser.
PSL format is probably not going to be supported as per comments above.
I was able to find a list of filetypes that are supported in track hubs.
- bam/cram: Compressed Sequence Alignment/Map tracks
- bigBed: Item or region tracks
- bigBarChart: Bar charts of categorical variables displayed over genomic regions
- bigChain: Genome-wide Pairwise Alignments
- bigGenePred: Gene Annotations
- bigInteract: Pairwise interactions
- bigLolly: Lollipops
- bigNarrowPeak: Peaks
- bigMaf: Multiple Alignments
- bigPsl: Pairwise Alignments
- bigWig: Signal graphing tracks
- hic: Hi-C contact matrices
- halSnake: HAL Snake Format
- vcfTabix: Variant Call Format
- vcfPhasedTrio: Variant Call Format Trios
Please see more details here. So, unfortunately, of all the files in the quickloads you refer to in the previous comment, only the bam filetype is supported. bed can be converted to bigBed and psl can be converted to bigPsl, however, bedgraph doesn't seem to be supported in any format.
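The classification described above can be sketched in a few lines of Python. This is only an illustration, not the conversion script from the repository: the function name `hub_type_for` and the set of compression suffixes are assumptions. bedgraph is left unmapped here to match the comment above, although a later comment in this ticket lists bedGraphToBigWig among the converters used.

```python
from pathlib import PurePosixPath

# Types a hub accepts without conversion, per the list above.
DIRECT = {".bam": "bam", ".cram": "cram"}
# Conversions named in the comment above: source extension -> hub type.
CONVERTIBLE = {".bed": "bigBed", ".psl": "bigPsl"}
# Compression suffixes ignored when classifying (an assumption).
COMPRESSION = {".gz", ".bz2"}


def hub_type_for(filename):
    """Return (hub_type, needs_conversion); hub_type is None if no route is listed."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    # Strip trailing compression suffixes such as ".gz".
    while suffixes and suffixes[-1] in COMPRESSION:
        suffixes.pop()
    ext = suffixes[-1] if suffixes else ""
    if ext in DIRECT:
        return DIRECT[ext], False
    if ext in CONVERTIBLE:
        return CONVERTIBLE[ext], True
    return None, False


for name in ("H_sapiens_Dec_2013_all_est.psl.gz", "SRR1957124_Adrenal_gland.bam"):
    print(name, hub_type_for(name))
```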
Please provide a link to the list of supported file types.
For the next steps, please add the RNA-Seq Quickload BAM files to the track hub to close the ticket.
The significance of what you have done is starting to become clearer to me.
Seeing how you've translated a small part of "Quickload main" (one genome only) demonstrates how we could easily add new meta-data files to our existing Quickloads, without even needing to create new folders or subfolders, to transform them into Track Hubs.
If we do this, not only will it make it easier for us to compare the two genome browsers, and understand the distinct benefits of each, we will also make it easier for people who are already familiar with Track Hubs to understand our Quickload format, and transform their Track Hubs into Quickloads for easy visualization in IGB.
(Writing an article about this work is easier when you can state the significance.)
cc: Philip Badzuh
I'm curious how many of the accepted track hub file types will load correctly in IGB. For example, we had to fix the bigBed parsing in IGB to get it to load correctly. I don't think we currently support cram, and I've never heard of some of the other formats (bigLolly).
Philip Badzuh - The Discovery Environment does not currently allow for uploading folders through Terrain. This appears to be an oversight in the modal title as we were also working on a modal for renaming files/folders around the same time. Thank you for spotting this, I had not noticed it before. I have created ticket IGBF-3081 to fix the issue.
[~aloraine], I am trying to convert SRR1957124_Adrenal_gland.bed to bigBed using UCSC's bedToBigBed utility, but am observing the error below:
Error line 3157 of SRR1957124_Adrenal_gland.bed: score (2334) must be between 0 and 1000
What do score values represent, and can they be transformed to what UCSC expects? Dr. Freese noted that their specification of bed files differs from ours in its definition of scores - https://genome.ucsc.edu/FAQ/FAQformat.html#format1
During Wed Feb 22 scrum, Nowlan Freese explained that this is a "junctions" file, the output of tophat. Each line of data corresponds to an intron splicing event detected by the "tophat" algorithm. The "score" field indicates the number of sequence read alignments that aligned with and supported the splicing event, called a "junction" because it indicates the join between two flanking, spliced exons.
For our reference, the actual physical file is:
and the file title is: "20 tissues SRP056969/Junctions/Adrenal gland tophat junctions"
We note that this inability to represent numbers larger than 1000 could be considered a weakness or flaw in the UCSC Genome Browser Track Hubs scheme. But, it's possible there's another Track Hub compatible file format that would allow a larger score value.
What to do next:
Post to the UCSC Genome Browser help list. Describe the problem and ask for help. Post a link to the question when this is done. Specifically, mention that you are trying to visualize BED files produced by TopHat, a spliced aligner. The TopHat program reports the number of spliced reads supporting a junction as a positive integer that can be anything and is often very large. How do we make a bigBed file out of this?
I have sent an email out to the support group asking about how to proceed with the issue described previously.
Please see the current state of the script that is intended to eventually convert quickload data into a track hub.
I also found the following confirmation of supported hub data types here: "The data tracks provided by a hub must be formatted in one of the compressed binary index formats supported by the Genome Browser: bigWig, bigBed, bigGenePred, bigChain, bigNarrowPeak, bigBarChart, bigInteract, bigPsl, bigMaf, hic, BAM, CRAM, HAL or VCF."
Useful icommands documentation, for uploading data files to cyverse:
https://cyverse.atlassian.net/wiki/spaces/DS/pages/241869855/Using+iCommands
After reading through the response in the UCSC google group I would like to try and create a bigBed file using the bed detail workaround proposed by Jairo Navarro Gonzalez. I will update this comment depending on the result.
Unfortunately it doesn't look like the bed detail workaround will work for IGB.
Jairo Navarro Gonzalez's idea was to use a bed4+8 to create the bigBed. The bed4+8 tells the bedToBigBed parser to use the first 4 columns as traditional bed and then the remaining 8 columns are considered custom fields. However, this will not work in IGB due to an issue discussed in IGBF-2978 where IGB does not have access to the header information defining the custom columns. This forces IGB to only parse the information contained within the first 4 columns, in this example. Thus the bed4+8 bigBed file will not show correctly in IGB.
Another option proposed by Jairo Navarro Gonzalez would be to set the scores to an arbitrary value and then move the score column to the 13th column of the bed file. While this should work in IGB, it may be confusing for users that what IGB considers the score value is now arbitrary and what would normally be an ID is now the score.
The final option will be to scale the score values as suggested by Jairo Navarro Gonzalez. This approach should technically work in IGB though it may be somewhat confusing to users.
[~aloraine] suggests: If the score exceeds 1000, then change the score to 1000.
Also, this is something to bring up when we publish our work.
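A minimal sketch of the capping suggestion, assuming a tab-delimited BED file with the score in column 5. The function name `cap_bed_scores` is hypothetical and this is not the code in the repository.

```python
def cap_bed_scores(lines, max_score=1000):
    """Yield BED lines with the score column (column 5) capped at max_score."""
    for line in lines:
        stripped = line.rstrip("\n")
        # Pass blank lines and track/browser/comment lines through unchanged.
        if not stripped or stripped.startswith(("track", "browser", "#")):
            yield stripped
            continue
        fields = stripped.split("\t")
        if len(fields) >= 5:
            fields[4] = str(min(int(fields[4]), max_score))
        yield "\t".join(fields)


example = ["chr1\t100\t200\tJUNC1\t2334\t+\n", "chr1\t300\t400\tJUNC2\t12\t-\n"]
for out in cap_bed_scores(example):
    print(out)
```

Note that capping discards the true read count for any junction supported by more than 1000 reads, so the original BED file remains the authoritative source for those values.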
Please create a very clear step-by-step narrative explaining how to look at the new track hub in UCSC Genome Browser. Also, please make a new zip file with the latest files, including any new files created. Let's see how big it will be.
I have uploaded the multi-quickload-derived track hub to cyverse here.
To view the hub data in UCSC, you can open this link or follow the instructions below, in order to configure it yourself.
- Open the my hubs page
- Enter the above public cyverse hub URL suffixed with /hub.txt.
- Click 'Add Hub'
- You will be redirected to the 'Browse/Select Species' page for the trackhub
- Select the hg38 assembly (the only one configured) and click 'GO'
- The UCSC genome browser will open
- Toward the bottom, under the 'other' section, select a view other than 'hide' for any file of interest and press 'refresh'
- You should see the data displayed in the browser above
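As an alternative to the manual steps, a direct link can be built with the `hubUrl` and `db` parameters of the UCSC `hgTracks` page. The sketch below assumes the hub URL quoted elsewhere in this ticket; the function name `hub_session_url` is hypothetical, and the link has not been verified against the current UCSC site.

```python
from urllib.parse import urlencode


def hub_session_url(hub_url, db="hg38", position=None):
    """Build a UCSC Genome Browser link that attaches a hub and opens an assembly."""
    params = {"db": db, "hubUrl": hub_url}
    if position:
        # Optional region to open, e.g. "chr17:43639311-43662999".
        params["position"] = position
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + urlencode(params)


print(hub_session_url(
    "https://data.cyverse.org/dav-anon/iplant/home/pbadzuh/human_hub/hub.txt"
))
```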
Track hub directory structure
6.9G .
4.0K ./hub.txt
4.0K ./genomes.txt
6.9G ./hg38
4.0K ./hg38/trackDb.txt
6.2G ./hg38/SRR1957124_Adrenal_gland.bam
310M ./hg38/SRR1957124_Adrenal_gland.scaled.bw
292M ./hg38/SRR1957124_Adrenal_gland.bw
193M ./hg38/H_sapiens_Dec_2013_all_mrna.bigPsl.bb
7.1M ./hg38/SRR1957124_Adrenal_gland.bb
Notes:
It looks like all files except for SRR1957124_Adrenal_gland.bam are being displayed properly. Will investigate why UCSC can't read our bam format next week.
I think I understand why you were asking about copying big data files onto CyVerse.
I was confused because I assumed you could use actual URLs for "bigDataUrl".
That is, instead of:
track rnaseq_quickload_adrenal_gland_bam
bigDataUrl SRR1957124_Adrenal_gland.bam
shortLabel Alignment file
longLabel Alignment file for RNAseq quickload adrenal gland
type bam
you can use:
track rnaseq_quickload_adrenal_gland_bam
bigDataUrl http://lorainelab-quickload.scidas.org/rnaseq/H_sapiens_Dec_2013/SRP056969/SRR1957124_Adrenal_gland.bam
shortLabel Alignment file
longLabel Alignment file for RNAseq quickload adrenal gland
type bam
Almost by definition, every file available via Quickload is already deployed in the Web directories of the Quickload host.
So there is no need to copy the files to a new location.
You just need to put their URLs in the trackDb.txt file.
Is this right? Maybe I misunderstand?
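To make the proposal concrete, here is a hedged sketch that writes a trackDb.txt stanza pointing at a file on the existing Quickload host. The helper name `trackdb_stanza` is an assumption; the setting names and values are the ones shown in the comment above.

```python
def trackdb_stanza(track, big_data_url, short_label, long_label, track_type):
    """Return one trackDb.txt stanza; big_data_url may be relative or absolute."""
    settings = [
        ("track", track),
        ("bigDataUrl", big_data_url),
        ("shortLabel", short_label),
        ("longLabel", long_label),
        ("type", track_type),
    ]
    # One "setting value" pair per line, with a trailing newline to end the stanza.
    return "\n".join(f"{key} {value}" for key, value in settings) + "\n"


print(trackdb_stanza(
    "rnaseq_quickload_adrenal_gland_bam",
    "http://lorainelab-quickload.scidas.org/rnaseq/H_sapiens_Dec_2013/SRP056969/SRR1957124_Adrenal_gland.bam",
    "Alignment file",
    "Alignment file for RNAseq quickload adrenal gland",
    "bam",
))
```

This only helps for files already in a hub-supported format, such as BAM. Files that need conversion (bed, psl, bedgraph) would still have to be hosted somewhere in their converted form.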
For the use case images, focus on the following question:
- How often does the exon-skipped transcript variant of human MEOX1 gene occur?
I used IGB to answer this question and summarized the results in this slide deck:
The data shown are from the RNA-Seq Quickload. Using the track hub view of these same data, can you use the UCSC Genome Browser to answer this same question? What new information can you gain from using this alternative browser?
For example, one of the strengths of the UCSC Genome Browser is that there are many more datasets available to look at. Can some of these data sets that are not available in IGB be used to corroborate or refine the results from IGB?
I have attached a powerpoint showing the differences between IGB and UCSC in visualizing various file formats at MEOX1. I have not yet taken a look at additional datasets outside of the ones I created by converting quickload files.
My latest code changes can be found here.
The converted data files can be found here.
The track hub is hosted here and can be opened by following this link.
I have a question about why the "util" directory is excluded from version control. (I noticed a path with "util" in the name was listed in the .gitignore file.)
Normally, when I see the word "util" in a path, I assume that this directory will contain "utility" code, such as simple methods that other parts of the code might be using in various settings.
Is that what "util" is? If yes, shouldn't it be under version control? For example, if the code is expecting to import a particular utility function and that function is either not present or hasn't been updated to later versions due to the line to exclude it in the .gitignore file, what will happen?
Nowlan Freese - Could you check out the functionality? Now that we have built a thing, we need to experience what it's like to use it. Also, we want to make sure we understand and can follow Philip Badzuh's instructions above.
Since two, possibly three, of us will be working on this at the same time, kind of, I am moving it back to "In Progress".
The util/ directory stores the UCSC utility binaries required to perform the conversions. They are specific to OS, so I thought to exclude them. The ones required are bedToBigBed, pslToBigPsl, and bedGraphToBigWig, and they can be downloaded here. The naming of the util/ folder can be changed to something like bin/ if that would make more sense.
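For illustration, the sketch below builds the command lines for two of these binaries, using their documented argument order (input, chrom.sizes, output). The function name, the `hg38.chrom.sizes` file name, and the `util` default are assumptions; pslToBigPsl is omitted because it needs additional steps. Both tools expect uncompressed, sorted input, so gzipped Quickload files would need to be decompressed first.

```python
# Source extension -> UCSC converter binary (both take: in chrom.sizes out).
CONVERTERS = {".bed": "bedToBigBed", ".bedgraph": "bedGraphToBigWig"}


def conversion_command(src, chrom_sizes, dst, bin_dir="util"):
    """Return the argument list for converting src to dst, without running it."""
    lower = src.lower()
    for ext, tool in CONVERTERS.items():
        if lower.endswith(ext):
            return [f"{bin_dir}/{tool}", src, chrom_sizes, dst]
    raise ValueError(f"no converter configured for {src}")


cmd = conversion_command(
    "SRR1957124_Adrenal_gland.bed", "hg38.chrom.sizes", "SRR1957124_Adrenal_gland.bb"
)
print(" ".join(cmd))
# To execute: subprocess.run(cmd, check=True), once the binaries are downloaded.
```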
Testing in IGB 9.1.10 commit ee3c038144133d0251886d442f8f01ffba2b65d8:
In IGB:
- Select the Dec 2013 human genome.
- Navigate to MEOX1 gene at chr17:43,639,311-43,662,999
- Under RNA-Seq in Available Data, add the Heart data for scaled, unscaled, junctions, and reads.
- Click Load Data
In UCSC:
- Open the MyHub page: https://genome.ucsc.edu/cgi-bin/hgHubConnect#unlistedHubs
- Add Philip's trackhub: https://data.cyverse.org/dav-anon/iplant/home/pbadzuh/human_hub/hub.txt
- Navigate to MEOX1 gene at chr17:43,639,311-43,662,999
- Set Heart scaled coverage to full
- Set Heart unscaled coverage to full
- Set Heart tophat junctions to full
- Set Heart alignments to squish
- Set RefSeq Curated to full
- The two coverage files (scaled and unscaled) are very similar in appearance and color between UCSC and IGB.
- The alignment (bam) appears very similar, though the reads are colored by orientation by default in UCSC (unclear if this can be changed).
- The annotation files (bigbed RefSeq and Junctions) do not appear correctly in UCSC. Need to double-check how the bigBed files were created and what UCSC is expecting. May be related to ticket IGBF-2978.
Is there a way to organize the files into categories? I find the organization/display of the files in UCSC to be very confusing. I would have expected the different files to be grouped by folder instead of displayed all together under "other". May be worth looking into to determine if there is a way to force this to be organized differently.
Is there a way to display track hub data on top of the already existing UCSC genomes? Right now the only files I see available are those provided by Philip's track hub, but UCSC has many other files and annotations available for the hg38 genome.
Regarding the second question above:
One of the benefits of visualizing data in UCSC is the ability to see other data associated with the same genome. The UCSC Genome Browser has dozens and dozens of different datasets, including EST data, which contain information relevant to splicing. One of its major benefits as a visualization tool is the ability to view one's own data alongside other data sets. Being able to see all the data together in the same view makes it possible to explore new ideas about the underlying biological processes.
For example, the UCSC genome browser provides EST tracks, which are tracks showing the alignment of short sequence reads, called ESTs, from cDNA libraries prepared from different tissue types. The MEOX1 gene models probably originally came from ESTs, including the exon-skipped form. Can we find other data supporting the exon-skipped form?
Code review:
Change request for Philip Badzuh:
- Yes, please, change the name of the folder from "util" to "bin" as you suggested. Once that is done, please submit a PR for merging.
Thank you!
I have made the requested change. Please see the PR here.
New code is merged.
For testing, someone needs to try converting the same quickload sites to trackhubs.
Request for the tester:
Please read and review the code and also the comments above. Do your best to reproduce the work. If you get stuck and can't get unstuck after making a strong effort, only then ask the developer for assistance. The goal of doing this is to assess the usability and understandability of the new code, and, in so doing, better understand what computer savvy users are likely to experience when they use these new tools.
Test Report:
I was able to replicate the testing protocol mentioned by Dr. Freese in the comment above and I made similar observations.
1. Coverage graphs (scaled and unscaled) are similar in both browsers
2. Alignments track has way more reads in UCSC than IGB (see attached image)
3. UCSC heart tophat junctions track has two more reads than IGB (see attached image)
4. RefSeq Curated is slightly different at the ends (see attached image)
Add digital artifacts required for using the created resources to the Google Drive folder for the paper.
Please see the zip file containing all data and metadata files here.
Following my testing above:
- Having all of the UCSC tracks available is very nice, this is a great improvement.
- All of the TrackHub files are still clumped in one section titled Experimental. Unclear if this can be optimized.
- The name of each file is the path of that file (i.e. 20 tissues SRP056969/Graph - Scaled/Adrenal gland scaled coverage). This is somewhat confusing.
- The bigBed files are still showing incorrectly.
Next step: I'm going to try and fix the bigBed files.
After some initial testing of the bigBed file on the TrackHub site (H_sapiens_Dec_2013_ncbiRefSeqCurated.bb) I don't see anything wrong with the file itself. I also uploaded it to UCSC as a custom track and it appeared correctly. So the issue must be coming from the TrackHub itself, potentially a configuration issue. I will continue to investigate.
The issue with the bigBed files has to do with how the trackDb.txt was configured. From UCSC:
If the type is bigBed, it may be followed by an optional number denoting the number of fields in the bigBed file (e.g., "type bigBed 12" for a file with 12 fields or "type bigBed 12 +" for a file that contains additional non-standard columns). If no number is given, a default value of 3 is assumed (a very limited display that omits names, strand information, and exon boundaries).
The current trackDb.txt has all bigBeds set as "type bigBed". These will need to be changed to "type bigBed 12 +" for bed detail files such as the various annotations, and "type bigBed 12" for the bed 12 files.
I am working on fixing the trackDb.txt file.
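The rule quoted from UCSC can be sketched as a small helper that picks the type line from the number of columns in the source BED file. The name `bigbed_type_line` is hypothetical, and the sketch assumes the leading columns are standard BED fields, which would not hold for a layout such as bed4+8.

```python
def bigbed_type_line(field_count):
    """Return the trackDb 'type' line for a bigBed built from field_count columns."""
    if field_count < 3:
        raise ValueError("BED requires at least 3 fields")
    if field_count <= 12:
        # Plain BED: declare exactly how many standard fields are present.
        return f"type bigBed {field_count}"
    # More than 12 columns: 12 standard fields plus extra non-standard columns.
    return "type bigBed 12 +"


print(bigbed_type_line(12))
print(bigbed_type_line(14))
```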
Regarding the following issues:
- The alignment (bam) appears very similar, though the reads are colored by orientation by default in UCSC (unclear if this can be changed). -This has been fixed in the trackDb.txt file by setting the bam color mode to off.
- The bigBed files are still showing incorrectly. -The bigBed files now show correctly, the type bigBed has been set correctly for each file.
- All of the TrackHub files are still clumped in one section titled Experimental. Unclear if this can be optimized. -We may be able to use the superTrack setting to put the various tissues in folders together.
- The name of each file is the path of that file (i.e. 20 tissues SRP056969/Graph - Scaled/Adrenal gland scaled coverage). This is somewhat confusing. -I'm not sure why these were set like this. UCSC is showing the shortLabels, which are set to these odd names in the trackDb.txt file. Is there a reason they are set like this or can we change them?
Regarding why the name of each file is set to something like "SRP056969/Graph - Scaled/Adrenal gland scaled coverage" :
This appears to be carried over from the annots.xml "file" attributes, probably the "title" attribute.
For this particular sample, the annots.xml file is here:
Its file tag:
<file name="SRP056969/SRR1957124_Adrenal_gland.scaled.bedgraph.gz" title="20 tissues SRP056969/Graph - Scaled/Adrenal gland scaled coverage" description="scaled coverage graph" url="H_sapiens_Dec_2013/SRP056969" background="ffffff" foreground="9900FF" />
Note the title attribute. It is meant to be human-friendly and descriptive.
In the IGB interface, variations on this "title" value appear in a number of places:
- the track label, in which case only the value following the final forward-slash character is shown
- the "Available Data Sets" section, which displays the title as nested folders, where the folder names come from the parts of "title" separated by forward slashes
- various tables, including the IGB color preferences table and the table showing loaded datasets
Also, in IGB, the "name" value never appears, except in rare instances in which the title value is absent. In IGB, "name" is only used to locate the file for loading. That is, the "file name" is a kind of URL. We never want to show users this, except when they actually need to know the name of the physical file.
Unfortunately, track labels in the UCSC Genome Browser appear on one line only, so they need to be quite short. IGB, by contrast, wraps track labels onto multiple lines, and its tracks are often tall enough to fit a lot of text.
However, the UCSC Genome Browser allows display of long horizontal text above each track.
So, I think in our case, we might want to come up with a very short label that can fit in the track label but is descriptive enough. In this case, the sample or tissue name is the best option. We probably don't need to indicate the type of data, since this will likely be obvious from the track itself. Also, we include the type of data in the name in IGB because users need to know what type of data is available when they pick and choose which datasets to load. UCSC is a little different in that I believe all the tracks will be shown by default. In this latter situation, text explaining where the data came from is probably the most important.
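A minimal sketch of deriving labels from the annots.xml file tag quoted above, using the Python standard library. It mirrors IGB's track-label behavior by taking the text after the final forward slash as the short label and keeps the full title as the long label. The function name is hypothetical, and this is not necessarily how the conversion script does it.

```python
import xml.etree.ElementTree as ET

FILE_TAG = (
    '<file name="SRP056969/SRR1957124_Adrenal_gland.scaled.bedgraph.gz" '
    'title="20 tissues SRP056969/Graph - Scaled/Adrenal gland scaled coverage" '
    'description="scaled coverage graph" url="H_sapiens_Dec_2013/SRP056969" '
    'background="ffffff" foreground="9900FF" />'
)


def labels_from_file_tag(xml_text):
    """Return (short_label, long_label) taken from a Quickload <file> element."""
    elem = ET.fromstring(xml_text)
    # Fall back to "name" only when "title" is absent, as IGB does.
    title = elem.get("title") or elem.get("name")
    short_label = title.rsplit("/", 1)[-1]
    return short_label, title


print(labels_from_file_tag(FILE_TAG))
```

The short label produced this way may still be too long for a single-line UCSC track label, in which case a hand-chosen tissue or sample name is the safer choice.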
I have organized the tissue files into four superTrack folders (Junctions, Reads, Scaled Graph, Unscaled Graph).
I have shortened the short labels so that they do not get clipped, and I have added additional information to the long labels.
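For reference, grouping in trackDb.txt works by declaring a parent stanza with "superTrack on" and adding a "parent" line to each child. The sketch below generates such stanzas; the track names, labels, and the helper name are illustrative assumptions, not the contents of the deployed trackDb.txt.

```python
def supertrack_stanzas(folder, label, children):
    """Return trackDb text for one superTrack folder and its child tracks."""
    lines = [
        f"track {folder}",
        "superTrack on",
        f"shortLabel {label}",
        f"longLabel {label}",
        "",
    ]
    for child in children:
        lines += [
            f"track {child['track']}",
            f"parent {folder}",  # places the child inside the folder
            f"bigDataUrl {child['url']}",
            f"shortLabel {child['short']}",
            f"longLabel {child['long']}",
            f"type {child['type']}",
            "",
        ]
    return "\n".join(lines)


print(supertrack_stanzas("junctions", "Junctions", [{
    "track": "adrenal_gland_junctions",  # hypothetical track name
    "url": "SRR1957124_Adrenal_gland.bb",
    "short": "Adrenal gland",
    "long": "Adrenal gland tophat junctions",
    "type": "bigBed 12",
}]))
```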
Instructions for adding this track hub to UCSC can be found at the Bitbucket repository: https://bitbucket.org/nfreese/trackhub-human-hub/src/main/
The steps to view this on UCSC need more detail to avoid blockers when users visit UCSC pages and can't figure out what to do, because the pages have a lot of features and it's hard to know what the next step should be.
We also need some slides or images describing how it looks when everything is working correctly. Creating these slides & images should be part of first level review / testing.
Nowlan Freese - CyVerse and UCSC endpoints sometimes fail. It's either CyVerse or UCSC. We are not hosting any of this ourselves.
[~aloraine] - Our unreliability is the sum of CyVerse and UCSC unreliability.
[~freese] - In testing each track hub (on the hub table), a set of problems came up in IGB. We are assuming that the hubs are all set up "correctly." Many exceptions were encountered within the IGB application.
Waiting for an auspicious time to perform next step - [~aloraine] to use the interface.
Change requests:
- Instructions need to be updated as there is no longer a "My Hubs" tab.
- hub.txt probably needs a better "info URL"; currently it just goes to the Dec 2013 directory:
descriptionUrl http://www.igbquickload.org/quickload/H_sapiens_Dec_2013/
This is confusing because there are no descriptions of the datasets.
- Updated instructions to refer to "Connected Hub" instead of "My Hubs".
- What URL should we point the hub.txt to?
Suggestion: Link to the bitbucket repository.
The hub.txt descriptionUrl has been updated to point to the bitbucket repo.
Commit: https://bitbucket.org/nfreese/trackhub-human-hub/commits/e67fc5acd31d00ac60ef96b9f62e8199a7778b8e
All requested changes are made. Closing the ticket.
Please see the attached file human_hub.zip. This contains the hub folder and all required configuration.
Quickload source
UCSC documentation used
Testing - needs to be deployed first