[IGBF-1811] Investigate DE Analysis Apps - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
2
Epic Link:
Connect Integrated Genome Browser and CyVerse
Sprint:
Summer 2019 Sprint 8, Summer 2019 Sprint 9

Description

Task: Investigate how to build the Discovery Environment (DE) apps that BioViz-CyVerse will use to run analyses, such as generating depth graphs from BAM files, generating scaled depth graphs, or generating an index file.

Attachments

Issue Links

relates to

IGBF-1623 Implement DepthGraph App in CyVerse

Closed

IGBF-1857 Implement Scaled Coverage Graph App in DE

Closed

IGBF-1858 Implement DREME Motif Finder App in DE

Closed

Activity

Nowlan Freese added a comment - 27/Jun/19 2:37 PM - edited

There are three apps described in the grant:

1: Bedtools coverage graphs. Users will select a sequencing dataset (BAM file) stored on CyVerse and
calculate a genome-wide coverage graph using bedtools. The coverage graph will be saved to the same
directory where the BAM File resides on CyVerse and also added as a new track to IGB.

2: Average, normalized coverage graph for RNA-Seq data. Users will select two or more RNA-Seq read
alignment tracks together with a gene model annotations track. The CyVerse app will calculate average RPKM
values for each annotated gene and save results to the user’s workspace in CyVerse. The corresponding
Operator in IGB will display the result as a genome histogram.

3: Find Motifs. Users will search for over-represented motifs, as described in the narrative above. We will
likely use weeder to find motifs, as it is available under an open source license.

Consideration was given to how to make running apps as easy as possible. The current thinking is that apps will be enabled or hidden based on the file type: a user will select a file, right-click, select analyze, and be presented with the applicable apps. I would like these early apps to be single-input, i.e., the user-selected file is the only input required.

Coverage graphs can be produced with bedtools genomecov. It can take spliced features into account, but it still needs to be tested on paired-end reads. There is a bedtools genomecov tool in the DE; however, it is currently configured to require additional, unnecessary files (such as a genome or annotation). Therefore we need to make a copy of the app (completed), configure it to require only a BAM file as input (completed), and then make it public (in progress). Note that changes to the app must be made before it is public, because once it is public no edits can be made. For the time being I have shared the app with the other developers so that it is accessible for testing.
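
As a rough illustration of the single-input setup described above, the command the copied app needs to run could be assembled as follows. This is a minimal sketch, not the DE app's actual configuration: the function name and the file name sample.bam are hypothetical, and it assumes the standard bedtools genomecov flags -ibam, -bg, and -split.

```python
from typing import List


def genomecov_command(bam_path: str, split_spliced: bool = True) -> List[str]:
    """Build an argv list for an unscaled bedGraph coverage run on one BAM file."""
    # -ibam reads chromosome sizes from the BAM header, so no genome file is needed.
    # -bg writes bedGraph output.
    cmd = ["bedtools", "genomecov", "-ibam", bam_path, "-bg"]
    if split_spliced:
        # -split counts only the aligned blocks of a spliced read, not the gap between them.
        cmd.append("-split")
    return cmd


# Hypothetical input name, for illustration only.
print(" ".join(genomecov_command("sample.bam")))
```

The BAM file is the only argument, which matches the goal of requiring nothing beyond the user-selected file.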

Producing RPKM will require annotation files. Instead, Dr. Loraine and I discussed using Counts Per Million (CPM) to scale coverage based on the total number of reads in the file. While bedtools genomecov can scale the bedgraph output, this is not possible through the DE without asking the user to manually enter the scaling value (i.e., there is no way for us to calculate the scaling value first or read it in from a file in the DE). Another tool, deepTools bamCoverage, can perform Counts Per Million scaling. It can also output the bigWig file format, which may be better than bedgraph as it is more compressed. Unfortunately, deepTools bamCoverage is not currently a tool in the DE, so I have asked the DE staff to add it. As of June 27, 2019 I am still waiting for them to share the tool with me (their initial response was on June 20, 2019).
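
To make the Counts Per Million idea concrete, here is a small sketch of the arithmetic, assuming the scaling value is simply one million divided by the total read count. The function names and the example numbers are hypothetical, and deepTools may count reads differently (for example after filtering).

```python
def cpm_scale_factor(total_reads: int) -> float:
    """Return the factor that converts raw depth to counts per million reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 1_000_000 / total_reads


def scale_bedgraph_line(line: str, factor: float) -> str:
    """Multiply the value column of one four-column bedGraph line by factor."""
    chrom, start, end, value = line.rstrip("\n").split("\t")
    # :g keeps the output compact; it rounds to six significant digits.
    return f"{chrom}\t{start}\t{end}\t{float(value) * factor:g}"


# Hypothetical example: 20 million reads, raw depth of 40 over one interval.
factor = cpm_scale_factor(20_000_000)
print(scale_bedgraph_line("chr1\t100\t200\t40", factor))
```

The read count has to be known before the scaling value can be computed, which is the step that cannot currently be done inside the DE.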

Motifs can be identified by weeder; however, weeder requires background frequency files. The DREME tool (part of the MEME Suite) can be configured to require only the FASTA sequences centered around regions of interest (such as ChIP-Seq peaks). Output is an XML file that we could parse to show results. However, I'm not sure what, if anything, we would show in IGB, as the motifs that are returned do not have positional data. The DREME tool is currently available in the DE, but it was configured to require a background frequency file. Copying and editing the app is not possible, as it is currently listed as deprecated. I have asked the DE staff to update the tool. As of June 27, 2019 they have not responded.
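
Preparing sequences "centered around regions of interest" amounts to taking a fixed-width window around each peak. The sketch below shows one way to compute such windows; the function name, the 100-base default width, and the use of the interval midpoint as the center are all assumptions made for illustration, not settings taken from the DE app.

```python
from typing import Tuple


def centered_window(start: int, end: int, width: int = 100) -> Tuple[int, int]:
    """Return a 0-based, half-open window of `width` bases centered on the interval midpoint."""
    if width <= 0:
        raise ValueError("width must be positive")
    midpoint = (start + end) // 2
    # Clamp at the chromosome start; the chromosome end is not checked in this sketch.
    window_start = max(0, midpoint - width // 2)
    return window_start, window_start + width


# Hypothetical peak spanning positions 1000 to 1200.
print(centered_window(1000, 1200))
```

The resulting coordinates could then be used to extract FASTA sequences with whatever tool is available.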

Uploading new tools to the DE is possible, though there are caveats. The tool must have a Docker image uploaded to Docker Hub. It can then be added to the DE, but it is not publicly available. To make a tool public, it has to be requested through the DE. There is some confusing language about the requester being the owner/creator of the tool and a requirement for test files. Once a tool has been added, apps can be built off of the tool and customized as needed. These apps can then also be made public.

Ann Loraine added a comment - 27/Jun/19 2:52 PM

Small suggestion:

Show all apps by default, but display them as grayed out when they cannot be run on a file because of its type. Doing this can help users learn the system, as they will be reminded of the apps that are available.
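
One possible shape for this, sketched under assumptions: every app is always listed, and a flag says whether it is enabled for the selected file. The app names and the file extensions in the table are hypothetical placeholders, not the final list.

```python
from typing import Dict, List, Tuple

# Hypothetical app names and accepted extensions, for illustration only.
APP_INPUT_TYPES: Dict[str, Tuple[str, ...]] = {
    "Coverage graph": (".bam",),
    "Scaled coverage graph": (".bam",),
    "Find motifs": (".fa", ".fasta"),
}


def app_menu(file_name: str) -> List[Tuple[str, bool]]:
    """List every app with a flag: True means runnable, False means show grayed out."""
    lower = file_name.lower()
    # str.endswith accepts a tuple of suffixes.
    return [(app, lower.endswith(exts)) for app, exts in APP_INPUT_TYPES.items()]


for app, enabled in app_menu("reads.bam"):
    print(app, "" if enabled else "(grayed out)")
```

Hiding apps instead of graying them out would only require filtering this list on the flag.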

Nowlan Freese added a comment - 01/Jul/19 5:08 PM - edited

The mosdepth app is a fast way to produce coverage graph output; however, I don't see it as being any better than bedtools genomecov.
Pro: It is fast.
Pro: It produces pre-indexed files.
Neutral: The index used is a .csi, which I'm not sure we currently support in IGB.
Con: Output files are labeled as .bed instead of .bedgraph (they are bedgraph). This would add an additional step to rename them so they can be visualized correctly in IGB (not necessarily trivial in the DE).
Con: Does not allow for scaling.
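
For reference, the rename itself is simple outside the DE. The sketch below only computes the new name; the function name and example paths are hypothetical, and any companion index file would presumably need a matching rename.

```python
from pathlib import PurePosixPath


def bedgraph_name(path: str) -> str:
    """Swap a .bed or .bed.gz suffix for .bedgraph or .bedgraph.gz; leave other names alone."""
    p = PurePosixPath(path)
    name = p.name
    # Check the compressed suffix first so .bed.gz is not mistaken for another type.
    for old, new in ((".bed.gz", ".bedgraph.gz"), (".bed", ".bedgraph")):
        if name.endswith(old):
            return str(p.with_name(name[: -len(old)] + new))
    return path


# Hypothetical output file name, for illustration only.
print(bedgraph_name("out/sample.per-base.bed.gz"))
```

The difficulty noted above is doing this as a step inside the DE, not the rename logic.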

Nowlan Freese added a comment - 01/Jul/19 5:15 PM

The deepTools bamCoverage tool has been added to the Discovery Environment.
Pro: Can produce either bedgraph or bigwig.
Pro: Can scale data values based on Counts Per Million.
Con: Slower than mosdepth or bedtools. A 2.4GB file using 4 processors takes ~30 minutes to complete. deepTools can bin regions to increase speed, but this would decrease the resolution/quality of the output.

My recommendation would be to use bedtools to produce a standard bedgraph, as the tool is already in the Discovery Environment and runs quickly. We would use deepTools for producing scaled bedgraph.
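
For the scaled case, a bamCoverage invocation might be assembled roughly as follows. This is a sketch under assumptions: the flag spellings follow deepTools 3.x as I understand them (the version installed in the DE is not recorded here), and the function name, file names, and defaults are hypothetical.

```python
from typing import List


def bamcoverage_command(
    bam_path: str,
    out_path: str,
    out_format: str = "bedgraph",
    bin_size: int = 1,
    processors: int = 4,
) -> List[str]:
    """Build an argv list for a CPM-scaled deepTools bamCoverage run."""
    if out_format not in ("bedgraph", "bigwig"):
        raise ValueError("out_format must be 'bedgraph' or 'bigwig'")
    return [
        "bamCoverage",
        "--bam", bam_path,
        "--outFileName", out_path,
        "--outFileFormat", out_format,
        "--normalizeUsing", "CPM",      # Counts Per Million scaling
        "--binSize", str(bin_size),     # larger bins run faster but lower the resolution
        "--numberOfProcessors", str(processors),
    ]


# Hypothetical file names, for illustration only.
print(" ".join(bamcoverage_command("sample.bam", "sample.cpm.bedgraph")))
```

Raising bin_size trades resolution for speed, which is the trade-off described above.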

Ann Loraine added a comment - 01/Jul/19 8:16 PM

30 minutes for that size of file seems very slow. But I am not sure how much faster bedtools or mosdepth would be on the same file. My guess is they would finish much quicker on the same system, but I don't know.

Nowlan Freese added a comment - 02/Jul/19 9:32 AM

deepTools bamCoverage is very slow. Bedtools on default settings can process the same file in 10 minutes. deepTools with one processor and no binning takes ~100 minutes; with four processors it takes ~30 minutes. The default deepTools settings use 50 bp bins to make the processing faster.
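
Putting those timings side by side, as a quick calculation. The minute values are the approximate figures above; the helper names are hypothetical.

```python
def speedup(serial_minutes: float, parallel_minutes: float) -> float:
    """How many times faster the parallel run is than the single-processor run."""
    return serial_minutes / parallel_minutes


def parallel_efficiency(serial_minutes: float, parallel_minutes: float, processors: int) -> float:
    """Speedup divided by processor count; 1.0 would be perfect scaling."""
    return speedup(serial_minutes, parallel_minutes) / processors


# Approximate timings for the same file, in minutes.
deeptools_one_cpu, deeptools_four_cpu, bedtools_default = 100, 30, 10

print(f"deepTools speedup on four processors: {speedup(deeptools_one_cpu, deeptools_four_cpu):.1f}x")
print(f"parallel efficiency: {parallel_efficiency(deeptools_one_cpu, deeptools_four_cpu, 4):.0%}")
print(f"bedtools vs deepTools (four processors): {deeptools_four_cpu / bedtools_default:.1f}x faster")
```

So deepTools scales reasonably well across processors, but even on four processors it is still about three times slower than bedtools on default settings.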

The primary advantage of deepTools is that it can both scale and produce bigWig files. Mosdepth does not appear to have a way to scale the output. Bedtools can scale the output, but it cannot accept a file (containing the scaling factor) produced as part of a pipeline as an argument (necessary for use in the Discovery Environment).
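
For comparison, outside the DE a small wrapper could bridge this gap by reading the factor from a file and passing it to bedtools on the command line. This is only a sketch of that idea: the function names and the one-number file layout are assumptions, and it relies on the bedtools genomecov -scale option.

```python
from pathlib import Path
from typing import List


def read_scale_factor(path: str) -> float:
    """Read a single scaling factor from a text file written by an earlier step."""
    return float(Path(path).read_text().strip())


def scaled_genomecov_command(bam_path: str, factor: float) -> List[str]:
    """Build an argv list for a scaled bedGraph coverage run."""
    # -scale multiplies every reported depth value by the given factor.
    return ["bedtools", "genomecov", "-ibam", bam_path, "-bg", "-split", "-scale", str(factor)]


# Hypothetical file name and factor, for illustration only.
print(" ".join(scaled_genomecov_command("sample.bam", 0.05)))
```

The limitation in the DE is the first step: there is no way to feed the contents of a file into the app as an argument.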

People

Assignee:

Nowlan Freese

Reporter:

Nowlan Freese

Votes:

0

Watchers:

2

Dates

Created:

13/Jun/19 9:40 AM

Updated:

02/Jul/19 9:32 AM

Resolved:

27/Jun/19 2:47 PM