There are three apps described in the grant:
1: Bedtools coverage graphs. Users will select a sequencing dataset (BAM file) stored on CyVerse and
calculate a genome-wide coverage graph using bedtools. The coverage graph will be saved to the same
directory where the BAM file resides on CyVerse and also added as a new track to IGB.
2: Average, normalized coverage graph for RNA-Seq data. Users will select two or more RNA-Seq read
alignment tracks together with a gene model annotations track. The CyVerse app will calculate average RPKM
values for each annotated gene and save results to the user’s workspace in CyVerse. The corresponding
Operator in IGB will display the result as a genome histogram.
3: Find Motifs. Users will search for over-represented motifs, as described in the narrative above. We will
likely use weeder to find motifs, as it is available under an open source license.
We considered how to make running apps as easy as possible. The current thinking is that apps will be enabled or hidden based on the file type. A user will select a file, right-click, select analyze, and be presented with the applicable apps. I would like these early apps to be single-input, i.e., the user-selected file is the only input required.
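The file-type filtering described above could be sketched roughly as follows. This is only an illustration: the registry name, the app labels, and the extension list are hypothetical and do not come from IGB or the DE.

```python
from pathlib import PurePosixPath

# Hypothetical registry: file extension -> apps that accept that file
# as their only input. Names and extensions are assumptions for illustration.
APPS_BY_EXTENSION = {
    ".bam": ["Coverage graph", "Normalized coverage graph"],
    ".fasta": ["Find motifs"],
    ".fa": ["Find motifs"],
}


def applicable_apps(file_path: str) -> list[str]:
    """Return the apps to offer in the analyze menu for the selected file."""
    extension = PurePosixPath(file_path).suffix.lower()
    # Unknown file types get an empty menu, i.e. every app stays hidden.
    return list(APPS_BY_EXTENSION.get(extension, []))
```

Keying on the extension keeps each app single-input: the selected file alone determines which apps are shown.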
Coverage graphs can be produced with bedtools genomecov. It can take spliced features into account, but it still needs to be tested on paired-end reads. There is a bedtools genomecov tool in the DE; however, it is currently configured to require additional, unnecessary files (such as a genome or annotation). Therefore we need to make a copy of the app (completed), configure it to require only a BAM file as input (completed), and then make it public (in progress). Note that all changes to the app must be made before it is public, because a public app cannot be edited. For the time being I have shared the app with the other developers so that it is accessible for testing.
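For reference, a single-input genomecov run might look like the sketch below. The helper names and the ".bedgraph" output naming are assumptions; the flags used (-ibam, -bg, -split) are standard bedtools genomecov options, and -ibam expects a position-sorted BAM file. This is not the actual DE app configuration.

```python
from pathlib import PurePosixPath


def genomecov_command(bam_path: str, split_spliced: bool = True) -> list[str]:
    """Build a bedtools genomecov command that needs only a BAM file.

    -ibam reads the BAM directly, so no separate genome file is required.
    -bg writes bedGraph output; -split counts spliced reads block by block.
    """
    command = ["bedtools", "genomecov", "-ibam", bam_path, "-bg"]
    if split_spliced:
        command.append("-split")
    return command


def coverage_output_path(bam_path: str) -> str:
    """Place the coverage graph next to the BAM file (hypothetical naming)."""
    return str(PurePosixPath(bam_path).with_suffix(".bedgraph"))
```

bedtools writes the bedGraph to standard output, so a caller would redirect it to the path returned by `coverage_output_path`.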
Producing RPKM values requires annotation files. Instead, Dr. Loraine and I discussed using Counts Per Million (CPM) to scale coverage by the total number of reads in the file. Although bedtools genomecov can scale its bedGraph output, this is not possible through the DE without asking the user to enter the scaling value manually (i.e., there is no way for us to calculate the scaling value first or read it in from a file in the DE). Another tool, deepTools bamCoverage, can perform CPM scaling. It can also output the bigWig file format, which may be better than bedGraph because it is more compact. Unfortunately, deepTools bamCoverage is not currently a tool in the DE, so I have asked the DE staff to add it. As of June 27, 2019, I am still waiting for them to share the tool with me (their initial response was on June 20, 2019).
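To make the scaling concrete: the CPM scale factor is one million divided by the total read count, which is the value a user would otherwise have to type in for bedtools genomecov. The sketch below shows that arithmetic alongside an equivalent bamCoverage call. The function names are hypothetical, and which read count to use (for example, mapped reads only) is an open assumption. The bamCoverage flags shown are from deepTools 3.x.

```python
def cpm_scale_factor(total_reads: int) -> float:
    """Scale factor that converts raw coverage to counts per million reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 1_000_000 / total_reads


def bam_coverage_command(bam_path: str, bigwig_path: str) -> list[str]:
    """Build a deepTools bamCoverage command with CPM scaling and bigWig output."""
    return [
        "bamCoverage",
        "-b", bam_path,
        "-o", bigwig_path,
        "--outFileFormat", "bigwig",
        "--normalizeUsing", "CPM",
    ]
```

For a file with 20,000,000 reads the factor is 0.05, so a position covered by 40 reads would be reported as 2.0 after scaling. bamCoverage computes this internally, which is why it fits the single-input design better than passing a value to the genomecov -scale option.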
Motifs can be identified by weeder; however, weeder requires background frequency files. The DREME tool (part of the MEME Suite) can be configured to require only the FASTA sequences centered on regions of interest (such as ChIP-Seq peaks). Its output is an XML file that we could parse to show results. However, I'm not sure what, if anything, we would show in IGB, because the motifs that are returned do not have positional data. The DREME tool is currently available in the DE, but it was configured to require a background frequency file. Copying and editing the app is not possible because it is currently listed as deprecated. I have asked the DE staff to update the tool. As of June 27, 2019, they have not responded.
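Parsing the XML output could look something like the sketch below. The element and attribute names (motif, id, seq, nsites, evalue) are assumptions about the DREME output layout and should be checked against a real result file. The sample document is made-up illustrative data, not actual DREME output.

```python
import xml.etree.ElementTree as ET

# Made-up sample for illustration only; the real DREME XML layout may differ.
SAMPLE_DREME_XML = """<dreme>
  <motifs>
    <motif id="m01" seq="CACGTG" nsites="120" evalue="1.0e-10"/>
    <motif id="m02" seq="GATAA" nsites="85" evalue="3.0e-04"/>
  </motifs>
</dreme>"""


def parse_motifs(xml_text: str) -> list[dict]:
    """Collect one summary record per motif element in the XML text."""
    root = ET.fromstring(xml_text)
    motifs = []
    for motif in root.iter("motif"):
        motifs.append({
            "id": motif.get("id"),
            "sequence": motif.get("seq"),
            "sites": int(motif.get("nsites")),
            "evalue": float(motif.get("evalue")),
        })
    return motifs
```

Records like these carry no genomic coordinates, so they would suit a results table better than an IGB track, consistent with the concern above.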
Uploading new tools to the DE is possible, though there are caveats. The tool must have a Docker image uploaded to Docker Hub. It can then be added to the DE, but it is not publicly available at that point. To make a tool public, it has to be requested through the DE. There is some confusing language in that process about the requester being the owner or creator of the tool, and about a requirement for test files. Once a tool has been added, apps can be built on the tool and customized as needed. These apps can then also be made public.