[IGBF-3044] Investigate: Alternative splicing statistical analysis - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels: None
Story Points: 1
Epic Link: Publish rice splicing and methylation results
Sprint: Spring 1 2022 Jan 3 - Jan 14, Spring 2 2022 Jan 18 - Jan 28

Description

The Loraine Lab has developed alternative splicing RNA-Seq analysis tools called "find_junctions" and "arabitag" that we've used in conjunction with statistical testing to detect when splicing changes in an experiment.

This software works reasonably well, but interpreting results when one gene has multiple alternative splicing choices is difficult. Another potential weakness is the statistical test we use to detect a change across conditions. To test for a change, we transform the proportions so that they are closer to normally distributed and then compare the means of the transformed proportions using a t-test. In practice, this works, but things get confusing when multiple splicing patterns affect the same region of a gene, because we end up comparing each option pairwise against every other option. This is not a deal-breaker, but it would be nice if we could perform a single test even for alternative splicing choices with more than two options. There may be better software or better methods we can use instead.
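As a rough illustration of the pairwise test described above, here is a minimal sketch. It assumes an arcsine square-root transformation (the ticket does not say which transformation is used) and made-up proportions for four replicates per condition. To stay within the Python standard library, it swaps the usual t-distribution lookup for an exact permutation p-value based on Welch's t statistic. All function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance


def arcsine_sqrt(p):
    """Variance-stabilizing transform for a proportion in [0, 1] (assumed choice)."""
    return math.asin(math.sqrt(p))


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def permutation_p_value(a, b):
    """Exact two-sided p-value: share of relabelings with |t| >= observed |t|."""
    pooled = list(a) + list(b)
    observed = abs(welch_t(a, b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        grp_a = [pooled[i] for i in idx]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        if abs(welch_t(grp_a, grp_b)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


# Hypothetical splice-choice proportions, one value per replicate sample.
control = [0.20, 0.25, 0.22, 0.18]
treated = [0.45, 0.50, 0.40, 0.48]

p_value = permutation_p_value(
    [arcsine_sqrt(x) for x in control],
    [arcsine_sqrt(x) for x in treated],
)
print(f"two-sided permutation p-value: {p_value:.4f}")
```

With four replicates per condition there are only 70 possible relabelings, so the smallest two-sided p-value this approach can report is 2/70. That limit is worth keeping in mind when every splicing option is compared pairwise against every other option.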

However, one thing to always keep in mind is that we mainly care about understanding whether and how splicing is regulated, and about detecting changes in the splicing machinery's function. What ultimately matters is the frequency with which individual splice sites get used relative to alternative options, and what causes this frequency to change. Currently, we assess splice site selection frequency by counting relevant reads and ignoring irrelevant reads, such as reads that align outside the differentially spliced regions or reads we can't assign to a single choice. Moreover, because of the complexity of splicing and the uncertainty surrounding reads that map far away from a given splice site, we ought to focus on the reads whose mapping we are most confident in. What we care about most is whether the underlying frequency of splice site usage changes between treatments or conditions. It's nice if we can know what that frequency is most likely to be, but a very rough estimate of the actual value is fine. Again, we care about the fact of a change, because the change signals that something biologically interesting might be happening.
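The counting rule described above can be sketched as follows. This is only an illustration, not how arabitag is implemented: it assumes each read has already been reduced to the set of splice choices it supports, where an empty set means the read is irrelevant and a set with more than one label means it cannot be assigned to a single choice. The function name and the labels "A" and "B" are hypothetical.

```python
from collections import Counter


def splice_choice_frequencies(read_assignments):
    """Estimate usage frequency per splice choice from unambiguous reads only.

    read_assignments: iterable of sets, one per read, holding the labels of
    the splice choices that the read supports.
    """
    counts = Counter()
    for choices in read_assignments:
        # Skip irrelevant reads (no choice) and ambiguous reads (several choices).
        if len(choices) == 1:
            counts[next(iter(choices))] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {choice: n / total for choice, n in counts.items()}


# Hypothetical reads for one sample at one alternatively spliced region.
reads = [{"A"}, {"A"}, {"B"}, set(), {"A", "B"}, {"A"}]
print(splice_choice_frequencies(reads))
```

Only four of the six reads count here: three support choice A and one supports choice B, giving frequencies of 0.75 and 0.25.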

For this task, let's scan the papers found in the literature review results (see: ~~IGBF-3041~~) to identify alternative splicing analysis software packages and determine: Which of these, if any, should we explore as an alternative to the find_junctions pipeline?

Also, let's investigate the statistics literature to determine whether the test we need has already been developed for an analogous setting, so that we can deploy it ourselves. We already have software that counts and reports relevant read alignments (see: arabitag), so we really only need a good way to test for differences in splicing across conditions.
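One possible shape for a single test that covers more than two splicing options is sketched below. This is an assumption-laden illustration, not a method the ticket endorses: it takes per-sample read counts for K options, pools them by condition, computes a likelihood-ratio (G) statistic on the resulting 2 x K table, and judges that statistic by permuting sample labels so that replicate-to-replicate variation enters the null distribution. The counts and function names are hypothetical.

```python
import math
from itertools import combinations


def pooled(samples):
    """Sum per-option read counts over the replicate samples of one condition."""
    return [sum(col) for col in zip(*samples)]


def g_statistic(row_a, row_b):
    """Likelihood-ratio (G) statistic for a 2 x K table of read counts."""
    total = sum(row_a) + sum(row_b)
    g = 0.0
    for row in (row_a, row_b):
        row_sum = sum(row)
        for k, observed in enumerate(row):
            expected = row_sum * (row_a[k] + row_b[k]) / total
            if observed > 0:
                g += 2.0 * observed * math.log(observed / expected)
    return g


def permutation_test(cond_a, cond_b):
    """Return (observed G, p-value) from all relabelings of the samples."""
    samples = list(cond_a) + list(cond_b)
    observed = g_statistic(pooled(cond_a), pooled(cond_b))
    hits = total = 0
    for idx in combinations(range(len(samples)), len(cond_a)):
        grp_a = [samples[i] for i in idx]
        grp_b = [samples[i] for i in range(len(samples)) if i not in idx]
        if g_statistic(pooled(grp_a), pooled(grp_b)) >= observed - 1e-9:
            hits += 1
        total += 1
    return observed, hits / total


# Hypothetical read counts for three splice options, three replicates each.
control = [[30, 50, 20], [28, 55, 17], [33, 47, 20]]
treated = [[50, 30, 20], [55, 27, 18], [48, 33, 19]]

g, p = permutation_test(control, treated)
print(f"G = {g:.2f}, permutation p-value = {p:.3f}")
```

Pooling weights each sample by its read depth, and with few replicates the permutation p-value has coarse resolution, so this is a starting point for discussion rather than a finished method.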

Attachments

3044.xlsx (19 kB, 13/Jan/22 5:05 PM)
Mosaic.jpeg (55 kB, 04/Jan/22 4:21 PM)

Activity

Ann Loraine added a comment - 29/Dec/21 7:53 AM - edited

rMATS tests whether the difference in proportions between conditions is larger than a user-supplied threshold (the default difference is 0.05). It also tries to model variability in the proportions obtained from replicate samples.

It also tries to normalize for splicing event length by including a length factor in the "percent spliced in" proportion. In addition, it uses reads aligned on top of exons, not just junction reads, as counting toward the use of one splice site or another.
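To make the length factor concrete, here is a minimal sketch of a length-normalized "percent spliced in" estimate, following my reading of the 2014 paper: read counts for the inclusion and skipping forms are each divided by that form's effective length before taking the proportion. The function and parameter names are hypothetical and are not part of rMATS.

```python
def percent_spliced_in(inclusion_reads, skipping_reads, inclusion_len, skipping_len):
    """Length-normalized proportion of reads supporting the inclusion form.

    inclusion_len and skipping_len are effective lengths: the number of
    positions at which a read could align and support each form.
    """
    inclusion_density = inclusion_reads / inclusion_len
    skipping_density = skipping_reads / skipping_len
    if inclusion_density + skipping_density == 0:
        return None  # no informative reads for this event
    return inclusion_density / (inclusion_density + skipping_density)


# Hypothetical event: the inclusion form has twice the effective length.
naive = 20 / (20 + 10)
normalized = percent_spliced_in(20, 10, inclusion_len=2, skipping_len=1)
print(f"naive = {naive:.3f}, length-normalized = {normalized:.3f}")
```

In this made-up case the naive proportion is about 0.667, while the normalized value is 0.5, because the inclusion form offers twice as many positions for reads to land on.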

The statistical approach is described in a 2014 PNAS paper. I'm not sure if the paper provides a solid justification for the method, however. Simulations are presented, but the simulations assume a binary choice only, which, as we know from experience, is not realistic for many of the most interesting cases, e.g., highly alternatively spliced genes that may alter their splicing patterns under different conditions or treatments.

The current version of rMATS is called rMATS "turbo", which includes a number of speed enhancements, including some parallelization of the code.

The big question I still have is: Does it detect differential splicing among un-annotated splice variants? This matters for the tomato pollen project because the tomato genome annotations do not include splice variants.

Reading the rMATS Google group posts has been helpful.

2014 rMATS paper link: https://www.pnas.org/content/111/51/E5593
rMATS google group: https://groups.google.com/g/rmats-user-group

Ann Loraine added a comment - 04/Jan/22 4:35 PM - edited

Other links and resources:

Slides from Iowa State biostatistics professor Dan Nettleton: "A Generalized Linear Model for Binomial Response Data"
Stats StackExchange question: "Statistical test on percentage data with replicates"
Stats StackExchange question: "Analyzing FACS Data"
Stats StackExchange question: "Which test to use when comparing multiple sets of proportions (with unknown sample size)?"

Ann Loraine added a comment - 11/Jan/22 6:21 PM

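rMATS script from an older project:

#!/bin/bash
# Run rMATS 3.0.9 on four control and four treatment BAM files.
# only works with samtools/0.1.19
RMATDIR=/lustre/groups/lorainelab/sw/rMATS.3.0.9
S=$RMATDIR/RNASeq-MATS.py
GTF=$RMATDIR/gtf/TAIR10.gtf
# control BAM files share the prefix 'w'
P='w'
CONTR="${P}1.bam,${P}2.bam,${P}3.bam,${P}4.bam"
# treatment BAM files share the prefix 'm'
P='m'
TREAT="${P}1.bam,${P}2.bam,${P}3.bam,${P}4.bam"
# output directory; also the base name of the .out and .err log files
DIR=AS
CMD="python $S -o $DIR -b1 $CONTR -b2 $TREAT -gtf $GTF -len 101 -t single"
echo "running: $CMD" > $DIR.out
# stderr goes to $DIR.err; stdout is appended to $DIR.out
$CMD 2>$DIR.err 1>>$DIR.out
echo "DONE."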

Ann Loraine added a comment - 13/Jan/22 10:08 AM

Nowlan Freese will make a table of methods and an EndNote library.

Nowlan Freese added a comment - 13/Jan/22 5:07 PM

I have attached an Excel spreadsheet (3044.xlsx) listing the methods used in the articles identified in ~~IGBF-3041~~. Blank cells indicate that the paper did not mention the methods.

The EndNote library is attached to ticket ~~IGBF-3041~~.

Ann Loraine added a comment - 18/Jan/22 10:52 AM - edited

Note: ~~IGBF-3041~~ discusses the 2021 paper titled "Genome-wide discovery of natural variation in pre-mRNA splicing and prioritising causal alternative splicing to salt stress response in rice" (from the Harkamal Walia lab at the University of Nebraska), which explains why they needed to develop a different method and why rMATS was not sufficient for their data analysis.

This is probably the paper we need to orient our work.

For example, we can assess whether or not the results from this previous work are reproduced in our work. If they are, that makes the conclusions much stronger.

People

Assignee: Ann Loraine
Reporter: Ann Loraine
Votes: 0
Watchers: 2

Dates

Created: 23/Dec/21 7:17 AM
Updated: 24/Jan/22 10:12 AM
Resolved: 24/Jan/22 10:12 AM