IGB / IGBF-3044

Investigate: Alternative splicing statistical analysis

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      The Loraine Lab has developed alternative splicing RNA-Seq analysis tools called "find_junctions" and "arabitag" that we've used in conjunction with statistical testing to detect when splicing changes in an experiment.

      This software works reasonably well, but interpreting results when one gene has multiple alternative splicing choices is difficult. Another potential weakness is the statistical test we use to detect a change across conditions. To test for a change, we compare proportion means using a t-test, after applying a transformation to make the proportions data more nearly normal. In practice this works acceptably, but things get confusing when multiple splicing patterns affect the same region of a gene, because we end up comparing each option pairwise against each other option. This is not a deal-breaker, but it would be nice if we could perform a single test even for alternative splicing choices with more than two options. There may be better software or better methods we can use instead.

      However, one thing to always keep in mind is that we mainly care about understanding how and if splicing is regulated, and about detecting changes in the splicing machinery's function. In this case, what matters ultimately is the frequency with which individual splice sites get used relative to alternative options, and what causes this frequency to change. Currently, we assess splice site selection frequency by counting relevant reads and ignoring irrelevant reads, such as reads that align outside the differentially spliced regions, as well as reads we cannot assign to a single choice. Moreover, because of the complexity of splicing and the ambiguity of reads mapping far away from a given splice site, we ought to focus our attention on the reads whose assignment we are most confident about. What we care about most is whether the underlying frequency of splice site usage changes between treatments or conditions. It is nice if we can know what that frequency is most likely to be, but a rough estimate of the actual value is fine. Again, we care about the fact of a change, because a change signals that something biologically interesting might be happening.

      For this task, let's scan the papers found in the literature review results (see: IGBF-3041) to identify alternative splicing analysis software packages and determine: Which of these, if any, should we explore as an alternative to the find_junctions pipeline?

      Also, let's investigate the statistics literature to determine if the test we need is already developed for an analogous setting that we can deploy for ourselves. We already have software that counts and reports relevant read alignments (see: arabitag), and so really we only need a good way to test for differences in splicing across conditions.

        Attachments

        1. 3044.xlsx
          19 kB
          Nowlan Freese
        2. Mosaic.jpeg
          55 kB
          Ann Loraine

          Activity

Ann Loraine added a comment (edited)

rMATS tests whether there is a difference in proportion greater than a user-supplied threshold (the default difference is 0.05). It also tries to model the variability in the proportions obtained from replicate samples.

It also tries to normalize for splicing event length by including a length factor in the "percent spliced in" proportion. In addition, it counts reads aligned on top of exons, not just junction reads, as evidence for the use of one splice site or the other.
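
For reference, here is my rough understanding of the length-normalized "percent spliced in" proportion, written as a small R sketch. The function and the counts below are mine (hypothetical), not rMATS code:

psi <- function(inc_reads, skip_reads, inc_len, skip_len) {
  # Normalize each count by the effective length of its isoform so that the
  # inclusion form does not get extra weight just because it offers more
  # mappable positions.
  inc_norm  <- inc_reads / inc_len
  skip_norm <- skip_reads / skip_len
  inc_norm / (inc_norm + skip_norm)
}

# Hypothetical example: 80 inclusion reads over a 200 nt effective length,
# 30 skipping reads over a 100 nt effective length.
psi(80, 30, 200, 100)  # about 0.57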

          The statistical approach is described in a 2014 PNAS paper. I'm not sure if the paper provides a solid justification for the method, however. Simulations are presented, but the simulations assume a binary choice only, which, as we know from experience, is not realistic for many of the most interesting cases, e.g., highly alternatively spliced genes that may alter their splicing patterns under different conditions or treatments.

The current version of rMATS is called rMATS "turbo", which includes a number of speed enhancements, including some parallelization of the code.

The big question I still have is: does rMATS detect differential splicing among un-annotated splice variants? This matters for the tomato pollen project because the tomato genome annotations do not include splice variants.

Reading Google group posts for rMATS has been helpful.

2014 rMATS paper: https://www.pnas.org/content/111/51/E5593
rMATS Google group: https://groups.google.com/g/rmats-user-group

Ann Loraine added a comment (edited)

As noted in the Description section above, we need a way to test whether a proportion differs across groups where we have multiple samples from each group. This method needs to take advantage of sample size and also tolerate unequal sample sizes between groups.

This explanation from University of Virginia Library Data Services describes a scenario in which a proportion is tested across multiple samples: https://data.library.virginia.edu/pairwise-comparisons-of-proportions/

In the scenario described at the above link, we ask whether the proportion of students who floss differs across several schools in a city. To answer the question, we create a large table listing the number of yes and no answers from each school. We then use prop.test and a multiple hypothesis testing correction to perform the proportion testing.
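
A minimal sketch of that workflow in R; the flossing counts below are made up, not taken from the article:

yes <- c(35, 52, 41, 28)   # students who floss, one entry per school
n   <- c(90, 110, 95, 80)  # students surveyed at each school

# Overall test: is the flossing proportion the same at every school?
prop.test(yes, n)

# Pairwise school-vs-school comparisons with a multiple-testing correction.
pairwise.prop.test(yes, n, p.adjust.method = "holm")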

However, this is different from situations in which you want to test whether a particular treatment changes splicing.

          The method described above is better suited to the following question:

          Does a splicing proportion ever change for a particular splicing choice? Phrased another way: is there evidence that a particular splicing pattern ever changes? This is similar to the flossing question because we are simply asking whether the proportion ever changes across multiple experiments. For example, we could "harvest" several high quality RNA-Seq data sets, find the number of reads per data set that support either the long or the short form splice choice, and then ask: does the resulting proportion ever change? Doing this for most of the documented splicing events could identify truly regulated splicing events. Our expectation is that these "truly regulated" splicing events probably arise in genes involved in splicing, based on our prior knowledge of splicing.
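
A rough sketch of how that "does it ever change?" question might look in R for a single splicing event; the data set names and read counts below are hypothetical:

counts <- rbind(
  dataset1 = c(long = 120, short = 30),
  dataset2 = c(long =  95, short = 40),
  dataset3 = c(long =  60, short = 65),
  dataset4 = c(long = 110, short = 25)
)

# Null hypothesis: the long-form proportion is the same in every data set.
# A small p-value suggests the proportion changes somewhere, i.e., the event
# may be regulated.
prop.test(counts)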

Ann Loraine added a comment (edited)

Another post discusses a question very similar to ours: https://www.researchgate.net/post/How_to_test_the_equality_of_two_proportions_with_multiple_samplings

          Hello,
          I am searching for a statistical test to compare two proportions of responsive cells.
          I know that Khi2, z-test and Fisher’s exact test allow this kind of comparison but the issue I have is that I have several proportions in each cases (ie : several samples on the same condition that all contain a proportion).
          For example:
          Drug A = [12/25, 1/100, 40/70]
          Drug B = [17/35, 40/200, 20/50]
          I suppose that I cannot use a t-test because the data is discrete (1 if the cell respond, 0 if it does not).
          I also does not want to do the mean of my data because I want to consider the variation between each measures.
          Should I use a GLM with a Bernoulli law or something like that?
          Really looking forward to your inputs!
          Thank you very much in advance
          Angel

          One user proposed using a "Cochran-Mantel-Haenszel test, with Drugs as the strata". The user even provided example R code!

          However, when I looked into it, I realized that the proposed test is probably not what we need because it assumes replicates from the two different groups can be paired to form contingency tables.

          I posted a follow-up question about this, asking how the code should change if the number of replicates differs between groups. I don't think the code could accommodate this, based on my understanding. But I could be wrong. So far the person who recommended this test has not replied.

However, there is one bit of useful help here! The person's code produces what I think is a potentially very useful example visualization. I think we could use this to illustrate how splicing proportions differ between treatments. See the attached file titled "Mosaic.jpeg".

In our version of this type of plot, the vertical rectangles would represent individual replicates. The y-axis would show the fraction of reads supporting the short and long forms, with colored regions representing the two forms. The x-axis would represent the number of reads relevant to the choice, so the area of each rectangle would represent the amount of evidence provided by that sample. Wider rectangles would thus give more visual weight to replicates with more reads as evidence.
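
A quick sketch of how we might draw this with base R's mosaicplot(); the replicate names and read counts are invented:

counts <- rbind(
  rep1 = c(short = 40, long = 160),
  rep2 = c(short = 55, long = 145),
  rep3 = c(short = 10, long =  35)   # low-coverage replicate: narrow rectangle
)

# Each vertical rectangle is one replicate; its width is proportional to the
# total relevant reads (the evidence), and the colored segments show the
# fraction of reads supporting the short vs. long form.
mosaicplot(counts, color = c("steelblue", "orange"),
           xlab = "Replicate", ylab = "Fraction of reads",
           main = "Splice choice by replicate")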

Ann Loraine added a comment

Beta regression could be an option. Here is a link explaining it: "Beta Regression for Percent and Proportion Data".
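
A minimal sketch of what beta regression could look like for one splicing event, assuming the betareg package; the per-replicate proportions, treatment labels, and column names are hypothetical. Note that this models the per-replicate proportions directly, so it does not weight replicates by read depth, and the response must lie strictly between 0 and 1:

library(betareg)

df <- data.frame(
  psi   = c(0.72, 0.68, 0.75, 0.41, 0.38, 0.45),  # per-replicate splicing proportion
  treat = factor(c("control", "control", "control", "heat", "heat", "heat"))
)

fit <- betareg(psi ~ treat, data = df)
summary(fit)  # tests whether the treatment shifts the mean proportion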

Ann Loraine added a comment

          A 2020 paper from Mark Robinson's group:

BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty. Link: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01967-8

Ann Loraine added a comment (edited)

          I'm starting to like a Bayesian approach in which we explicitly treat the underlying frequency of splice site usage as an unknown parameter. Selecting one of two mutually exclusive splice sites can be modeled as a Bernoulli trial, like flipping a coin. Each relevant read alignment from a single replicate represents a sequence of coin flips. There may be some dependencies between coin flips in that some reads could come from PCR duplicates during library construction. In fact, there is probably some physical limitation to the number of times we can sample an alternatively spliced region, a limitation arising from the amount of spliced RNA transcript molecules we can isolate.

          I am interested in the beta distribution because it seems like this could help me narrow in on realistic values for theta, where theta is the proportion of splicing events that use one or the other choice.
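
To make the idea concrete, here is a small sketch of the conjugate Beta update for one replicate; the prior and the read counts below are hypothetical:

split_reads   <- 42  # reads split across the site (site used)
unsplit_reads <- 18  # reads covering the site without splitting (site not used)

# With a Beta(a, b) prior on theta and binomial counts, the posterior is
# Beta(a + successes, b + failures). Start from a flat Beta(1, 1) prior.
a_post <- 1 + split_reads
b_post <- 1 + unsplit_reads

a_post / (a_post + b_post)               # posterior mean for theta
qbeta(c(0.025, 0.975), a_post, b_post)   # 95% credible interval for theta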

          As I mentioned above, often there is more than one choice for a given intronic region. There are many examples of situations in which the 3' or 5' boundary of an intron (a splice site) could be located in more than two places. Maybe this is OK if we define the Bernoulli trial as: assemble the 5' or 3' splicing complex at this location, or don't. In this case, an "informative" read alignment counting in favor of the splice site being used will always be a split read, and an "informative" read alignment counting against the splice site being used will always be a read that maps to the splice site but does not "split" across the site. The pattern we need to see in the affirmative case (1) is obvious and easy to recognize. The pattern we accept as a negative case is less obvious or clear.

          The Arabitag algorithm identifies allowable negative examples by using gene model annotations and by defining alternatively spliced regions. (See the article titled Many rice genes are differentially spliced between roots and shoots but cytokinin has minimal effect on splicing for detailed explanation.)

Ann Loraine added a comment (edited)

          I'm revising my opinion!

          I think what we actually would need is to fit a binomial generalized linear model, as described in this post on the statistics stack exchange web site: https://stats.stackexchange.com/questions/315428/how-to-perform-a-binomial-test-when-having-replicates

          code:

          Mod <- glm(data = DF, cbind(irregular,regular) ~ Treat, family = "binomial") 
          summary(Mod)      #This prints the results, p.values and statistics. 
          exp(confint(Mod)) #This gives you the CIs for the different terms in the model
          

          where "DF" is a data frame such that:

          with numbers of successes (called "irregular") and failures ("regular") , and a column for treatment/group called Treat, with one Petri dish in each row

In our situation, DF would be a small data frame containing read counts for a single splicing event, with each row holding the counts from a single library.
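
To make this concrete, here is a sketch of what that could look like for a single event; the library names, treatment labels, and read counts are invented:

DF <- data.frame(
  library = paste0("lib", 1:6),
  treat   = factor(c("control", "control", "control", "heat", "heat", "heat")),
  long    = c(120, 98, 110, 60, 55, 70),   # reads supporting the long form
  short   = c(30, 25, 35, 80, 90, 75)      # reads supporting the short form
)

# Same binomial GLM as in the Stack Exchange answer, applied to these counts.
# Each row contributes according to its total reads, so deeper libraries
# carry more weight.
mod <- glm(cbind(long, short) ~ treat, data = DF, family = "binomial")
summary(mod)        # the treatment coefficient tests for a shift in splice choice
exp(confint(mod))   # confidence intervals on the odds-ratio scale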

          This post contains some good discussion. It is worth reading!

Ann Loraine added a comment (edited)

Other links and resources:

• Slides from Iowa State Biostatistics professor Dan Nettleton: "A Generalized Linear Model for Binomial Response Data"
• Stats StackExchange question: "Statistical test on percentage data with replicates"
• Stats StackExchange question: "Analyzing FACS Data"
• Stats StackExchange question: "Which test to use when comparing multiple sets of proportions (with unknown sample size)?"
Ann Loraine added a comment

rMATS script from an older project:

#!/bin/bash
# rMATS 3.0.9 run script; only works with samtools/0.1.19
RMATDIR=/lustre/groups/lorainelab/sw/rMATS.3.0.9
S=$RMATDIR/RNASeq-MATS.py
GTF=$RMATDIR/gtf/TAIR10.gtf
# comma-separated BAM lists for the two sample groups; P is the file name prefix
P='w'
CONTR="${P}1.bam,${P}2.bam,${P}3.bam,${P}4.bam"
P='m'
TREAT="${P}1.bam,${P}2.bam,${P}3.bam,${P}4.bam"
# output directory; -len gives the read length, -t the read type (single-end)
DIR=AS
CMD="python $S -o $DIR -b1 $CONTR -b2 $TREAT -gtf $GTF -len 101 -t single"
echo "running: $CMD" > $DIR.out
$CMD 2>$DIR.err 1>>$DIR.out
echo "DONE."
          
Ann Loraine added a comment

Nowlan Freese will make a table of methods and an EndNote library.

Nowlan Freese added a comment

I have attached an Excel spreadsheet (3044.xlsx) listing the methods used in the articles identified in IGBF-3041. Blank cells indicate that the paper did not mention the method.

The EndNote library is attached to ticket IGBF-3041.

Ann Loraine added a comment (edited)

Note: IGBF-3041 discusses the 2021 paper titled "Genome-wide discovery of natural variation in pre-mRNA splicing and prioritising causal alternative splicing to salt stress response in rice" (from the Harkamal Walia lab at the University of Nebraska), which explains why the authors needed to develop a different method and why rMATS was not sufficient for their data analysis.

          This is probably the paper we need to orient our work.

          For example, we can assess whether or not the results from this previous work are reproduced in our work. If they are, that makes the conclusions much stronger.


            People

            • Assignee:
              Ann Loraine
              Reporter:
              Ann Loraine
            • Votes:
              0

              Dates

              • Created:
                Updated:
                Resolved: