IGB / IGBF-3172

Tabulate splicing support by running arabitag algorithm on junction and bam files


    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:


      Run the arabitag algorithm to tabulate splice junction support, using the output from find_junctions.sh in the linked issue.

      Ref: See diagram from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6589529/figure/pld3136-fig-0001/
      See paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6589529/

      Rice paper repository: https://bitbucket.org/lorainelab/ricealtsplice/src/master/

      Arabitag algorithm repository: https://bitbucket.org/lorainelab/altspliceanalysis/src/master/




            Mdavis4290 Molly Davis added a comment - - edited

            During our meeting I brought up using deep learning/machine learning to identify splicing events. Here are some papers I found that were related to the idea:

            • https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6722613/
            • https://academic.oup.com/bioinformatics/article/33/14/i274/3953982 (code: https://majiq.biociphers.org/jha_et_al_2017/)
            • https://www.sciencedirect.com/science/article/pii/S0092867418316295 (code: https://github.com/Illumina/SpliceAI)

            ann.loraine Ann Loraine added a comment -

            Running arabitag tabulation algorithm with:

            sbatch --export=BED=S_lycopersicum_Jun_2022_noPRAM.bed,F=5,RIF=10 --output=arabitag.out --error=arabitag.err arabitag.sh 2>jobs.err 1>jobs.out

            in:

            /nobackup/tomato_genome/alt_splicing/arabitag

            using code from my home directory.

            ann.loraine Ann Loraine added a comment -

            Made changes to arabitag.sh to enable the splicing analysis algorithm to be run in parallel on the slurm cluster.
            Ran it, and now the files are all made.
            Within directory /nobackup/tomato_genome/alt_splicing/arabitag/all, there are multiple files, each named with an "SRR" identifier as prefix and splice_support.txt as suffix.
            The first line of each file is a comment (marked with a hash character) giving the flanking base number and other parameters used to run the data processing steps.
            Each row represents an alternative splicing event with two mutually exclusive choices, described by these fields:

            • chromosome - the location of the event
            • strand - the strand of the event
            • start - interbase coordinates for the start position of the difference region, the alternatively spliced region
            • end - interbase coordinates for the end position of the difference region
            • Ga - gene model lacking the difference region
            • Gp - gene model containing the difference region
            • Ga_[sample name] - the number of alignments from "sample name" (an RNA-Seq library) that unambiguously support model Ga
            • Gp_[sample name] - the number of alignments from "sample name" that unambiguously support model Gp

            For the next step, we need to consolidate all of these into a single data frame, or something like that, so that we can then determine how or if support for Ga and Gp splicing choices varies between or among samples.
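
One way the consolidation step could look, as a minimal Python sketch rather than the code actually used for this issue. It assumes each splice_support.txt file is tab-delimited and that a header row with the column names listed above follows the '#' comment line; neither detail is confirmed here. The function names and the "NA" placeholder are hypothetical choices.

```python
import csv

# Columns that identify a splicing event; every other column holds counts.
KEY_FIELDS = ["chromosome", "strand", "start", "end", "Ga", "Gp"]


def read_support_file(path):
    """Return the data rows of one splice_support.txt file as dicts."""
    with open(path, newline="") as handle:
        # Drop the leading '#' comment line that records the run parameters.
        lines = [line for line in handle if not line.startswith("#")]
    # Assumption: the first remaining line is a tab-delimited header row.
    return list(csv.DictReader(lines, delimiter="\t"))


def consolidate(paths):
    """Merge per-sample files into one table keyed by splicing event."""
    table = {}    # event key -> {count column name: count}
    columns = []  # count columns, in first-seen order
    for path in sorted(paths):
        for row in read_support_file(path):
            key = tuple(row[field] for field in KEY_FIELDS)
            counts = table.setdefault(key, {})
            for name, value in row.items():
                if name not in KEY_FIELDS:
                    counts[name] = int(value)
                    if name not in columns:
                        columns.append(name)
    return columns, table


def write_table(columns, table, out_path):
    """Write one row per event, with "NA" where a sample lacks the event."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(KEY_FIELDS + columns)
        for key in sorted(table):
            counts = table[key]
            writer.writerow(list(key) + [counts.get(c, "NA") for c in columns])
```

Calling consolidate(glob.glob("SRR*splice_support.txt")) from the results directory and passing the result to write_table would give a single tab-delimited file with one Ga and one Gp count column per sample, which R can read with read.delim.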

            ann.loraine Ann Loraine added a comment -

            To review:

            • check that files have reasonable sizes (no "zero" size files, for example)
            • check that every "SRR" bam file in our control and experimental sample directories has a corresponding "splice_support.txt" file
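
The two review checks above could be scripted along these lines. This is a sketch under assumptions, not the procedure used in the review: it assumes every BAM file and every result file name begins with its SRR accession, that BAM files end in ".bam", and that the sample directories are passed in by the caller. The function names are hypothetical.

```python
import os
import re

# Assumption: file names begin with the run accession, e.g. "SRR" plus digits.
ACCESSION = re.compile(r"SRR\d+")


def accessions(directory, suffix):
    """Map SRR accession -> path for files in `directory` ending in `suffix`."""
    found = {}
    for name in os.listdir(directory):
        match = ACCESSION.match(name)
        if match and name.endswith(suffix):
            found[match.group()] = os.path.join(directory, name)
    return found


def review(bam_dirs, results_dir):
    """Return (accessions with empty result files, accessions lacking results)."""
    results = accessions(results_dir, "splice_support.txt")
    empty = sorted(acc for acc, path in results.items()
                   if os.path.getsize(path) == 0)
    bams = set()
    for directory in bam_dirs:
        bams.update(accessions(directory, ".bam"))
    missing = sorted(bams - set(results))
    return empty, missing
```

Both returned lists being empty would mean every BAM file has a non-empty splice_support.txt file. Files without an SRR prefix, such as jobs.err, are ignored.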
            Mdavis4290 Molly Davis added a comment - - edited


            Review:

            Directory: /nobackup/tomato_genome/alt_splicing/arabitag/all

            • Using the ll command, I checked whether any files had zero size. Only the file 'jobs.err' had zero size; all of the SRR and support files were non-empty.
            • Every "SRR" bam file in our control and experimental sample directories has a corresponding "splice_support.txt" file. The SRP328042-molly SRR files are interleaved with the SRP252265 files in the directory, so it might be hard to tell which are control and which are experimental.

            [~aloraine]

            ann.loraine Ann Loraine added a comment -

            Thanks [~molly]. I decided to store all the files in the same folder and will use an index / table of contents type strategy to distinguish them. Moving to Done.

            ann.loraine Ann Loraine added a comment -

            Re-opening to include new work to tabulate results and read into R for interactive analysis of results.

            ann.loraine Ann Loraine added a comment -

            Splicing support output files added as a "tar" file to https://bitbucket.org/hotpollen/splicing-analysis as TabulateSplicingSupport/results/2022-09-08-results.tar.

            ann.loraine Ann Loraine added a comment - - edited

            Creating R code for organizing and sanity-checking results. Working on this fork and branch: https://bitbucket.org/aloraine/splicing-analysis/branch/IGBF-3172

            ann.loraine Ann Loraine added a comment -

            Merged changes into master branch. Created new folder with consolidated results:

            • TabulateSplicingSupport/results/ containing file 2022-09-08-results.txt


              • Assignee:
                ann.loraine Ann Loraine


                • Created: