[IGBF-3044] Investigate: Alternative splicing statistical analysis - JIRA UNCC

Ann Loraine created issue - 23/Dec/21 7:17 AM

Ann Loraine made changes - 23/Dec/21 7:17 AM

Field	Original Value	New Value
Epic Link		IGBF-3039 [ 21553 ]

Ann Loraine made changes - 25/Dec/21 11:53 AM

Sprint

Spring 1 2022 Jan 3 - Jan 14 [ 136 ]

Ann Loraine made changes - 25/Dec/21 11:53 AM

Rank

Ranked lower

Hide

Permalink

Ann Loraine added a comment - 29/Dec/21 7:53 AM - edited

rMATs - tests whether there is a difference in proportion higher than a user-supplied threshold (the default difference is 0.05). Also, it tries to model variability in the proportions obtained from replicate samples.

It also tries to normalize for splicing event length by including a length factor in the "percent spliced in" proportion. In addition, it uses reads aligned on top of exons, not just junction reads, as counting toward the use of one splice site or another.

The statistical approach is described in a 2014 PNAS paper. I'm not sure if the paper provides a solid justification for the method, however. Simulations are presented, but the simulations assume a binary choice only, which, as we know from experience, is not realistic for many of the most interesting cases, e.g., highly alternatively spliced genes that may alter their splicing patterns under different conditions or treatments.

The current version of rMATs is called rMATS "turbo" which includes a number of speed enhancements, including some parallelization of the code.

Big question I still have is: Does it detect differential splicing among un-annotated splice variants? This matters for the tomato pollen project because the tomato genome annotations do not include splice variants.

Reading google group posts for rMATs has been helpful.

2014 rMATS paper link: https://www.pnas.org/content/111/51/E5593
rMATS google group: https://groups.google.com/g/rmats-user-group

Show

Ann Loraine added a comment - 29/Dec/21 7:53 AM - edited rMATs - tests whether there is a difference in proportion higher than a user-supplied threshold (the default difference is 0.05). Also, it tries to model variability in the proportions obtained from replicate samples. It also tries to normalize for splicing event length by including a length factor in the "percent spliced in" proportion. In addition, it uses reads aligned on top of exons, not just junction reads, as counting toward the use of one splice site or another. The statistical approach is described in a 2014 PNAS paper. I'm not sure if the paper provides a solid justification for the method, however. Simulations are presented, but the simulations assume a binary choice only, which, as we know from experience, is not realistic for many of the most interesting cases, e.g., highly alternatively spliced genes that may alter their splicing patterns under different conditions or treatments. The current version of rMATs is called rMATS "turbo" which includes a number of speed enhancements, including some parallelization of the code. Big question I still have is: Does it detect differential splicing among un-annotated splice variants? This matters for the tomato pollen project because the tomato genome annotations do not include splice variants. Reading google group posts for rMATs has been helpful. 2014 rMATS paper link: https://www.pnas.org/content/111/51/E5593 rMATS google group: https://groups.google.com/g/rmats-user-group

Hide

Permalink

Ann Loraine added a comment - 29/Dec/21 8:08 AM - edited

As noted in the Description section above, we need a way to test whether a proportion is different across groups where we have multiple samples from each group. This method needs to take advantage of sample size and also tolerate sample sizes being different.

The scenario described in this explanation from University of Virginia Library Data Services explains a scenario in which a proportion is testing across multiple samples:

https://data.library.virginia.edu/pairwise-comparisons-of-proportions/

In the scenario described in the above link, we ask whether there is any difference in proportion of students who floss and students who do not floss across several schools in a city. To answer the question, we create a large table listing the number of yes and no answers from each school. We then use prop.test and a multiple hypothesis testing correction to perform the proportion testing.

However, this different from situations in which you want to test whether a particular treatment changes splicing.

The method described above is better suited to the following question:

Does a splicing proportion ever change for a particular splicing choice? Phrased another way: is there evidence that a particular splicing pattern ever changes? This is similar to the flossing question because we are simply asking whether the proportion ever changes across multiple experiments. For example, we could "harvest" several high quality RNA-Seq data sets, find the number of reads per data set that support either the long or the short form splice choice, and then ask: does the resulting proportion ever change? Doing this for most of the documented splicing events could identify truly regulated splicing events. Our expectation is that these "truly regulated" splicing events probably arise in genes involved in splicing, based on our prior knowledge of splicing.

Show

Ann Loraine added a comment - 29/Dec/21 8:08 AM - edited As noted in the Description section above, we need a way to test whether a proportion is different across groups where we have multiple samples from each group. This method needs to take advantage of sample size and also tolerate sample sizes being different. The scenario described in this explanation from University of Virginia Library Data Services explains a scenario in which a proportion is testing across multiple samples: https://data.library.virginia.edu/pairwise-comparisons-of-proportions/ In the scenario described in the above link, we ask whether there is any difference in proportion of students who floss and students who do not floss across several schools in a city. To answer the question, we create a large table listing the number of yes and no answers from each school. We then use prop.test and a multiple hypothesis testing correction to perform the proportion testing. However, this different from situations in which you want to test whether a particular treatment changes splicing. The method described above is better suited to the following question: Does a splicing proportion ever change for a particular splicing choice? Phrased another way: is there evidence that a particular splicing pattern ever changes? This is similar to the flossing question because we are simply asking whether the proportion ever changes across multiple experiments. For example, we could "harvest" several high quality RNA-Seq data sets, find the number of reads per data set that support either the long or the short form splice choice, and then ask: does the resulting proportion ever change? Doing this for most of the documented splicing events could identify truly regulated splicing events. Our expectation is that these "truly regulated" splicing events probably arise in genes involved in splicing, based on our prior knowledge of splicing.

Hide

Permalink

Ann Loraine added a comment - 29/Dec/21 8:19 AM - edited

Another post discusses a question very similar to ours:

https://www.researchgate.net/post/How_to_test_the_equality_of_two_proportions_with_multiple_samplings

Hello,
I am searching for a statistical test to compare two proportions of responsive cells.
I know that Khi2, z-test and Fisher’s exact test allow this kind of comparison but the issue I have is that I have several proportions in each cases (ie : several samples on the same condition that all contain a proportion).
For example:
Drug A = [12/25, 1/100, 40/70]
Drug B = [17/35, 40/200, 20/50]
I suppose that I cannot use a t-test because the data is discrete (1 if the cell respond, 0 if it does not).
I also does not want to do the mean of my data because I want to consider the variation between each measures.
Should I use a GLM with a Bernoulli law or something like that?
Really looking forward to your inputs!
Thank you very much in advance
Angel

One user proposed using a "Cochran-Mantel-Haenszel test, with Drugs as the strata". The user even provided example R code!

However, when I looked into it, I realized that the proposed test is probably not what we need because it assumes replicates from the two different groups can be paired to form contingency tables.

I posted a follow-up question about this, asking how the code should change if the number of replicates differs between groups. I don't think the code could accommodate this, based on my understanding. But I could be wrong. So far the person who recommended this test has not replied.

However, there is one bit of useful help here! The person's code gives what I think is a potentially very useful example visualization. We could use this I think to illustrate how splicing proportions differ between treatments. See attached file titled "Mosaic.jpeg".

In our version of this type of plot, the vertical rectangles would represent individual replicates. The y axis would show the fraction of reads supporting short and long forms, with colored regions representing short and long forms. The x-axis would represent the number of reads relevant to the choice. Thus the area would represent the amount of evidence provided by the sample. Bigger (wider) rectangles would show how we attach more weight to events with more reads as evidence.

Show

Ann Loraine added a comment - 29/Dec/21 8:19 AM - edited Another post discusses a question very similar to ours: https://www.researchgate.net/post/How_to_test_the_equality_of_two_proportions_with_multiple_samplings Hello, I am searching for a statistical test to compare two proportions of responsive cells. I know that Khi2, z-test and Fisher’s exact test allow this kind of comparison but the issue I have is that I have several proportions in each cases (ie : several samples on the same condition that all contain a proportion). For example: Drug A = [12/25, 1/100, 40/70] Drug B = [17/35, 40/200, 20/50] I suppose that I cannot use a t-test because the data is discrete (1 if the cell respond, 0 if it does not). I also does not want to do the mean of my data because I want to consider the variation between each measures. Should I use a GLM with a Bernoulli law or something like that? Really looking forward to your inputs! Thank you very much in advance Angel One user proposed using a "Cochran-Mantel-Haenszel test, with Drugs as the strata". The user even provided example R code! However, when I looked into it, I realized that the proposed test is probably not what we need because it assumes replicates from the two different groups can be paired to form contingency tables. I posted a follow-up question about this, asking how the code should change if the number of replicates differs between groups. I don't think the code could accommodate this, based on my understanding. But I could be wrong. So far the person who recommended this test has not replied. However, there is one bit of useful help here! The person's code gives what I think is a potentially very useful example visualization. We could use this I think to illustrate how splicing proportions differ between treatments. See attached file titled "Mosaic.jpeg". In our version of this type of plot, the vertical rectangles would represent individual replicates. The y axis would show the fraction of reads supporting short and long forms, with colored regions representing short and long forms. The x-axis would represent the number of reads relevant to the choice. Thus the area would represent the amount of evidence provided by the sample. Bigger (wider) rectangles would show how we attach more weight to events with more reads as evidence.

Ann Loraine made changes - 29/Dec/21 8:32 AM

Description

The Loraine Lab has developed an alternative splicing RNA-Seq analysis tool called "find_junctions" that we've used in conjunction with statistical testing to detect when splicing changes in an experiment.

This software works OK but interpreting results when one gene has multiple alternative splicing choices is difficult. There may be better software we can use instead.

For this task, use the literature review results to identify alternative splicing analysis software packages and determine: Which of these, if any, should we explore as an alternative to the find junctions pipeline?

The Loraine Lab has developed alternative splicing RNA-Seq analysis tools called "find_junctions" and "arabitag" that we've used in conjunction with statistical testing to detect when splicing changes in an experiment.

This software works OK but interpreting results when one gene has multiple alternative splicing choices is difficult. Another potential weakness is the statistical test we use to detect a change across conditions. To test for a change, we compare proportion means using a t-test, after converting the proportions data to something more normal, using a transformation. In practice, this works OK, but things get confusing when we have multiple splicing patterns affect the same region of a gene, as we end up comparing each option pairwise against each other option. Another weakness is that the test does not take into account the number of informative reads collected. There may be better software or better methods we can use instead.

For this task, use the literature review results to identify alternative splicing analysis software packages and determine: Which of these, if any, should we explore as an alternative to the find junctions pipeline?

Ann Loraine made changes - 29/Dec/21 8:33 AM

Description

The Loraine Lab has developed alternative splicing RNA-Seq analysis tools called "find_junctions" and "arabitag" that we've used in conjunction with statistical testing to detect when splicing changes in an experiment.

This software works OK but interpreting results when one gene has multiple alternative splicing choices is difficult. Another potential weakness is the statistical test we use to detect a change across conditions. To test for a change, we compare proportion means using a t-test, after converting the proportions data to something more normal, using a transformation. In practice, this works OK, but things get confusing when we have multiple splicing patterns affect the same region of a gene, as we end up comparing each option pairwise against each other option. Another weakness is that the test does not take into account the number of informative reads collected. There may be better software or better methods we can use instead.

For this task, use the literature review results to identify alternative splicing analysis software packages and determine: Which of these, if any, should we explore as an alternative to the find junctions pipeline?

The Loraine Lab has developed alternative splicing RNA-Seq analysis tools called "find_junctions" and "arabitag" that we've used in conjunction with statistical testing to detect when splicing changes in an experiment.

This software works OK but interpreting results when one gene has multiple alternative splicing choices is difficult. Another potential weakness is the statistical test we use to detect a change across conditions. To test for a change, we compare proportion means using a t-test, after converting the proportions data to something more normal, using a transformation. In practice, this works OK, but things get confusing when we have multiple splicing patterns affect the same region of a gene, as we end up comparing each option pairwise against each other option. Another weakness is that the test does not take into account the number of informative reads collected. There may be better software or better methods we can use instead.

For this task, use the literature review results to identify alternative splicing analysis software packages and determine: Which of these, if any, should we explore as an alternative to the find junctions pipeline?

Also, investigate the statistics literature to determine if the test we need is already developed.

Ann Loraine made changes - 29/Dec/21 8:33 AM

Description

The Loraine Lab has developed alternative splicing RNA-Seq analysis tools called "find_junctions" and "arabitag" that we've used in conjunction with statistical testing to detect when splicing changes in an experiment.

This software works OK but interpreting results when one gene has multiple alternative splicing choices is difficult. Another potential weakness is the statistical test we use to detect a change across conditions. To test for a change, we compare proportion means using a t-test, after converting the proportions data to something more normal, using a transformation. In practice, this works OK, but things get confusing when we have multiple splicing patterns affect the same region of a gene, as we end up comparing each option pairwise against each other option. Another weakness is that the test does not take into account the number of informative reads collected. There may be better software or better methods we can use instead.

For this task, use the literature review results to identify alternative splicing analysis software packages and determine: Which of these, if any, should we explore as an alternative to the find junctions pipeline?

Also, investigate the statistics literature to determine if the test we need is already developed.

The Loraine Lab has developed alternative splicing RNA-Seq analysis tools called "find_junctions" and "arabitag" that we've used in conjunction with statistical testing to detect when splicing changes in an experiment.

This software works OK but interpreting results when one gene has multiple alternative splicing choices is difficult. Another potential weakness is the statistical test we use to detect a change across conditions. To test for a change, we compare proportion means using a t-test, after converting the proportions data to something more normal, using a transformation. In practice, this works OK, but things get confusing when we have multiple splicing patterns affect the same region of a gene, as we end up comparing each option pairwise against each other option. Another weakness is that the test does not take into account the number of informative reads collected. There may be better software or better methods we can use instead.

For this task, use the literature review results to identify alternative splicing analysis software packages and determine: Which of these, if any, should we explore as an alternative to the find junctions pipeline?

Also, investigate the statistics literature to determine if the test we need is already developed for an analogous setting that we can deploy for ourselves.

Hide

Permalink

Ann Loraine added a comment - 29/Dec/21 8:36 AM

Beta regression could be an option. Here is a link explaining it:

Beta Regression for Percent and Proportion Data

Show

Ann Loraine added a comment - 29/Dec/21 8:36 AM Beta regression could be an option. Here is a link explaining it: Beta Regression for Percent and Proportion Data

Hide

Permalink

Ann Loraine added a comment - 31/Dec/21 1:35 PM

A 2020 paper from Mark Robinson's group:

BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty and link: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01967-8

Show

Ann Loraine added a comment - 31/Dec/21 1:35 PM A 2020 paper from Mark Robinson's group: BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty and link: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01967-8

Hide

Permalink

Ann Loraine added a comment - 31/Dec/21 2:04 PM - edited

I'm starting to like a Bayesian approach in which we explicitly treat the underlying frequency of splice site usage as an unknown parameter. Selecting one of two mutually exclusive splice sites can be modeled as a Bernoulli trial, like flipping a coin. Each relevant read alignment from a single replicate represents a sequence of coin flips. There may be some dependencies between coin flips in that some reads could come from PCR duplicates during library construction. In fact, there is probably some physical limitation to the number of times we can sample an alternatively spliced region, a limitation arising from the amount of spliced RNA transcript molecules we can isolate.

I am interested in the beta distribution because it seems like this could help me narrow in on realistic values for theta, where theta is the proportion of splicing events that use one or the other choice.

As I mentioned above, often there is more than one choice for a given intronic region. There are many examples of situations in which the 3' or 5' boundary of an intron (a splice site) could be located in more than two places. Maybe this is OK if we define the Bernoulli trial as: assemble the 5' or 3' splicing complex at this location, or don't. In this case, an "informative" read alignment counting in favor of the splice site being used will always be a split read, and an "informative" read alignment counting against the splice site being used will always be a read that maps to the splice site but does not "split" across the site. The pattern we need to see in the affirmative case (1) is obvious and easy to recognize. The pattern we accept as a negative case is less obvious or clear.

The Arabitag algorithm identifies allowable negative examples by using gene model annotations and by defining alternatively spliced regions. (See the article titled Many rice genes are differentially spliced between roots and shoots but cytokinin has minimal effect on splicing for detailed explanation.)

Show

Ann Loraine added a comment - 31/Dec/21 2:04 PM - edited I'm starting to like a Bayesian approach in which we explicitly treat the underlying frequency of splice site usage as an unknown parameter. Selecting one of two mutually exclusive splice sites can be modeled as a Bernoulli trial, like flipping a coin. Each relevant read alignment from a single replicate represents a sequence of coin flips. There may be some dependencies between coin flips in that some reads could come from PCR duplicates during library construction. In fact, there is probably some physical limitation to the number of times we can sample an alternatively spliced region, a limitation arising from the amount of spliced RNA transcript molecules we can isolate. I am interested in the beta distribution because it seems like this could help me narrow in on realistic values for theta, where theta is the proportion of splicing events that use one or the other choice. As I mentioned above, often there is more than one choice for a given intronic region. There are many examples of situations in which the 3' or 5' boundary of an intron (a splice site) could be located in more than two places. Maybe this is OK if we define the Bernoulli trial as: assemble the 5' or 3' splicing complex at this location, or don't. In this case, an "informative" read alignment counting in favor of the splice site being used will always be a split read, and an "informative" read alignment counting against the splice site being used will always be a read that maps to the splice site but does not "split" across the site. The pattern we need to see in the affirmative case (1) is obvious and easy to recognize. The pattern we accept as a negative case is less obvious or clear. The Arabitag algorithm identifies allowable negative examples by using gene model annotations and by defining alternatively spliced regions. (See the article titled Many rice genes are differentially spliced between roots and shoots but cytokinin has minimal effect on splicing for detailed explanation.)

Ann Loraine made changes - 04/Jan/22 12:46 AM

Summary

Investigate: Alternative splicing statistical analysis tools

Investigate: Alternative splicing statistical analysis

Ann Loraine made changes - 04/Jan/22 12:50 AM

Assignee

Ann Loraine [ aloraine ]

Ann Loraine made changes - 04/Jan/22 12:51 AM

Status

To-Do [ 10305 ]

In Progress [ 3 ]

Ann Loraine made changes - 04/Jan/22 4:18 PM

Description

The Loraine Lab has developed alternative splicing RNA-Seq analysis tools called "find_junctions" and "arabitag" that we've used in conjunction with statistical testing to detect when splicing changes in an experiment.

This software works OK but interpreting results when one gene has multiple alternative splicing choices is difficult. Another potential weakness is the statistical test we use to detect a change across conditions. To test for a change, we compare proportion means using a t-test, after converting the proportions data to something more normal, using a transformation. In practice, this works OK, but things get confusing when we have multiple splicing patterns affect the same region of a gene, as we end up comparing each option pairwise against each other option. Another weakness is that the test does not take into account the number of informative reads collected. There may be better software or better methods we can use instead.

For this task, use the literature review results to identify alternative splicing analysis software packages and determine: Which of these, if any, should we explore as an alternative to the find junctions pipeline?

Also, investigate the statistics literature to determine if the test we need is already developed for an analogous setting that we can deploy for ourselves.

The Loraine Lab has developed alternative splicing RNA-Seq analysis tools called "find_junctions" and "arabitag" that we've used in conjunction with statistical testing to detect when splicing changes in an experiment.

This software works OK but interpreting results when one gene has multiple alternative splicing choices is difficult. Another potential weakness is the statistical test we use to detect a change across conditions. To test for a change, we compare proportion means using a t-test, after converting the proportions data to something more normal, using a transformation. In practice, this works OK, but things get confusing when we have multiple splicing patterns affect the same region of a gene, as we end up comparing each option pairwise against each other option. This is not a deal-breaker, but it would be nice if we could perform a single test even for alternative splicing choices where there are more than two options. There may be better software or better methods we can use instead.

However, one thing to always keep in mind is that we mainly care about understanding how and if splicing is regulated, and in detecting changes in the splicing machinery's function. In this case, what matters ultimately the frequency with which individual splice sites get used relative to alternative options, and what causes this frequency to change. Currently, we assess splice site selection frequency by counting relevant reads and ignoring both irrelevant reads, such as reads that align outside the differentially spliced regions, or reads we can't assign to a single choice. Moreover, because of the complexity of splicing and because of the unknowableness of reads mapping far away from a given splice site, we ought to focus our attention on the reads for which we are most confident in their mapping. What we care about the most is whether or not the underlying frequency of splice site usage changes between treatments or conditions. It's nice if we can know what that frequency is most likely to be, but a very rough estimate of the actual value is fine. Again, we care about the fact of a change, because the change signals that something biologically interesting might be happening.

For this task, let's scan the papers found in the literature review results (see: ~~IGBF-3041~~) to identify alternative splicing analysis software packages and determine: Which of these, if any, should we explore as an alternative to the find junctions pipeline?

Also, let's investigate the statistics literature to determine if the test we need is already developed for an analogous setting that we can deploy for ourselves. We already have software that counts and reports relevant read alignments (see: [arabitag|https://bitbucket.org/lorainelab/altspliceanalysis]), and so really we only need a good way to test for differences in splicing across conditions.

Ann Loraine made changes - 04/Jan/22 4:21 PM

Attachment

Mosaic.jpeg [ 17048 ]

Ann Loraine made changes - 04/Jan/22 4:23 PM

Comment

[ If we care about how splicing is regulated, and in detecting changes in the splicing machinery's function, then what matters ultimately is the frequency with which individual splice sites get used relative to alternative options. We can assess splice site selection frequency by counting relevant reads and ignoring both irrelevant reads, such as reads that align the differentially spliced regions, or reads we can't assign to a single choice. Moreover, because of the complexity of splicing and because of the unknowableness of reads mapping far away from a given splice site, we ought to focus our attention on the reads for which we are most confident in their mapping. What we care about the most is whether or not the underlying frequency of splice site usage changes between treatments or conditions. It's nice if we can know what that frequency is most likely to be, but a very rough estimate of the actual value is fine. Again, we care about the fact of a change, because the change signals that something biologically interesting might be happening. ]

Hide

Permalink

Ann Loraine added a comment - 04/Jan/22 4:29 PM - edited

I'm revising my opinion!

I think what we actually would need is to fit a binomial generalized linear model, as described in this post on the statistics stack exchange web site: https://stats.stackexchange.com/questions/315428/how-to-perform-a-binomial-test-when-having-replicates

code:

Mod <- glm(data = DF, cbind(irregular,regular) ~ Treat, family = "binomial") 
summary(Mod)      #This prints the results, p.values and statistics. 
exp(confint(Mod)) #This gives you the CIs for the different terms in the model

where "DF" is a data frame such that:

with numbers of successes (called "irregular") and failures ("regular") , and a column for treatment/group called Treat, with one Petri dish in each row

In our situation, the data frame would be a small data frame containing read counts for a single event, with each row being read counts from a single library.

This post contains some good discussion. It is worth reading!

Show

Ann Loraine added a comment - 04/Jan/22 4:29 PM - edited I'm revising my opinion! I think what we actually would need is to fit a binomial generalized linear model, as described in this post on the statistics stack exchange web site: https://stats.stackexchange.com/questions/315428/how-to-perform-a-binomial-test-when-having-replicates code: Mod <- glm(data = DF, cbind(irregular,regular) ~ Treat, family = "binomial" ) summary(Mod) #This prints the results, p.values and statistics. exp(confint(Mod)) #This gives you the CIs for the different terms in the model where "DF" is a data frame such that: with numbers of successes (called "irregular") and failures ("regular") , and a column for treatment/group called Treat, with one Petri dish in each row In our situation, the data frame would be a small data frame containing read counts for a single event, with each row being read counts from a single library. This post contains some good discussion. It is worth reading!

Ann Loraine made changes - 04/Jan/22 4:31 PM

Status

In Progress [ 3 ]

Needs 1st Level Review [ 10005 ]

Ann Loraine made changes - 04/Jan/22 4:31 PM

Assignee

Ann Loraine [ aloraine ]

Hide

Permalink

Ann Loraine added a comment - 04/Jan/22 4:35 PM - edited

Resolution		Done [ 10000 ]
Status	Post-merge Testing In Progress [ 10003 ]	Closed [ 6 ]

Investigate: Alternative splicing statistical analysis

Details

Description

Attachments

Attachments

Activity

People

Dates