Uploaded image for project: 'IGB'
  1. IGB
  2. IGBF-3454

Develop an R prototype script EB-Seq using Muday time course

    Details

    • Type: Task
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Purpose: Write a first draft of an EB-Seq script to analyze the time course data.

      Why? Because comparing time points violates an assumption of independence, a few tools have been developed to address this and improve the RNA-seq comparisons on time dependent studies.
      A few examples:

      • mfuzz
      • PAL-D
      • EB-Seq

      I tested out PAL-D last January and it was a horrible experience.
      Rasha ran mfuzz with success.
      EB-Seq has been run by me and Liz in the past.

      Liz's details on Eb0seq can be found here:
      _Original details from Liz Cooper
      “EBSeq script on my GitHub page associated with the sweet sorghum paper (in case you decide to use that package): “
      Rio/EBSeqHMM_Cluster.R at master · eacooper400/Rio · GitHub

      Liz_

      In R:

      The R package should be acquired like so:

          1. Installing Required Packages (only done once)
            #source("https://bioconductor.org/biocLite.R")
            BiocManager::install("EBSeq")
            BiocManager::install("EBSeqHMM")
            install.packages("blockmodeling")
          1. Load Required Pacakges
            library(EBSeq)
            library(EBSeqHMM)

        Attachments

          Issue Links

            Activity

            Hide
            robofjoy Robert Reid added a comment -

            Here is journal article discussing rna-seq statistical methods. They discuss EB-SEQ.

            And they mention that is performs well when n=3, as in this project!

            journal.pone.0264246.pdf

            Show
            robofjoy Robert Reid added a comment - Here is journal article discussing rna-seq statistical methods. They discuss EB-SEQ. And they mention that is performs well when n=3, as in this project! journal.pone.0264246.pdf
            Hide
            robofjoy Robert Reid added a comment -

            It's slowly coming back to me now!

            EB-SEQ does the cool pattern identifying of up and down patterns.

            For example, using their example dataset, you get a results like so,

            • Most_Likely_Path Max_PP *
              Gene_40 "Up-Up-Down-Down" "0.9884"
              Gene_1 "Down-Up-Down-Down" "0.7742"
              Gene_15 "Down-Down-Up-Up" "0.5733"
              Gene_9 "Down-Down-Up-Up" "0.5524"
              Gene_2 "Up-Up-Down-Down" "0.5321"
              Gene_17 "Down-Down-Up-Up" "0.5084"

            (In this example, there are 5 time points, so you get 4 states up genes going up or down)

            Show
            robofjoy Robert Reid added a comment - It's slowly coming back to me now! EB-SEQ does the cool pattern identifying of up and down patterns. For example, using their example dataset, you get a results like so, Most_Likely_Path Max_PP * Gene_40 "Up-Up-Down-Down" "0.9884" Gene_1 "Down-Up-Down-Down" "0.7742" Gene_15 "Down-Down-Up-Up" "0.5733" Gene_9 "Down-Down-Up-Up" "0.5524" Gene_2 "Up-Up-Down-Down" "0.5321" Gene_17 "Down-Down-Up-Up" "0.5084" (In this example, there are 5 time points, so you get 4 states up genes going up or down)
            Hide
            robofjoy Robert Reid added a comment -

            Success!

            I've read in the data and produced output.
            I ran this test 6 times, breaking down the data into these 6 categories:

            1. ARE 28C
            2. ARE 34C
            3. VF36 28C
            4. VF36 34C
            5. F3H-OE3 28C
            6. FSH-OE3 34C

            I have produced 6 files that summarizes the genes that were found to show some change in expression pattern.
            E.g.,
            "Most_Likely_Path" "Max_PP"
            "Solyc08G002520" "Down-Up-Down" "0.9777"
            "Solyc05G000582" "Up-Down-Up" "0.9661"
            "Solyc04G002780" "Up-Down-Up" "0.9656"
            "Solyc00G000013" "Down-Up-Down" "0.9651"
            "Solyc04G000324" "Down-Up-Down" "0.9651"
            "Solyc03G002116" "Down-Up-Down" "0.964"
            "Solyc02G002947" "Down-Down-Up" "0.9596"
            "Solyc03G003029" "Down-Down-Up" "0.9489"
            "Solyc00G000015" "Down-Up-Down" "0.9388"

            The R code for this will need review.

            Will summarize number of genes found for each type of pattern possible: ("Down-Down-Up" )

            Show
            robofjoy Robert Reid added a comment - Success! I've read in the data and produced output. I ran this test 6 times, breaking down the data into these 6 categories: ARE 28C ARE 34C VF36 28C VF36 34C F3H-OE3 28C FSH-OE3 34C I have produced 6 files that summarizes the genes that were found to show some change in expression pattern. E.g., "Most_Likely_Path" "Max_PP" "Solyc08G002520" "Down-Up-Down" "0.9777" "Solyc05G000582" "Up-Down-Up" "0.9661" "Solyc04G002780" "Up-Down-Up" "0.9656" "Solyc00G000013" "Down-Up-Down" "0.9651" "Solyc04G000324" "Down-Up-Down" "0.9651" "Solyc03G002116" "Down-Up-Down" "0.964" "Solyc02G002947" "Down-Down-Up" "0.9596" "Solyc03G003029" "Down-Down-Up" "0.9489" "Solyc00G000015" "Down-Up-Down" "0.9388" The R code for this will need review. Will summarize number of genes found for each type of pattern possible: ("Down-Down-Up" )
            Hide
            robofjoy Robert Reid added a comment -

            Ha! major bug.

            We set levels for the time points like so.
            CondVector=c("T15","T30","T45","T75","T15","T30","T45","T75","T15","T30","T45","T75")

            And that should correspond with our data we sub-select:
            a28<-select(x,matches("A.28"))

            [1] "A.28.15.7" "A.28.30.7" "A.28.45.7" "A.28.75.7" "A.28.45.8" "A.28.15.8" "A.28.30.8" "A.28.75.8" "A.28.30.9"
            [10] "A.28.15.9" "A.28.75.9" "A.28.45.9"

            However, the order in our sheet is not always the same. I'll need to assign levels correctly.

            Show
            robofjoy Robert Reid added a comment - Ha! major bug. We set levels for the time points like so. CondVector=c("T15","T30","T45","T75","T15","T30","T45","T75","T15","T30","T45","T75") And that should correspond with our data we sub-select: a28<-select(x,matches("A.28")) [1] "A.28.15.7" "A.28.30.7" "A.28.45.7" "A.28.75.7" "A.28.45.8" "A.28.15.8" "A.28.30.8" "A.28.75.8" "A.28.30.9" [10] "A.28.15.9" "A.28.75.9" "A.28.45.9" However, the order in our sheet is not always the same. I'll need to assign levels correctly.
            Hide
            robofjoy Robert Reid added a comment -

            Resolved the ordering issue.
            WHen we select our experiments of interest like so:

            > colnames(f28)
            [1] "F.28.15.7" "F.28.30.7" "F.28.45.7" "F.28.75.7" "F.28.75.8" "F.28.45.8" "F.28.15.8" "F.28.30.8" "F.28.30.9"
            [10] "F.28.15.9" "F.28.75.9" "F.28.45.9"

            We'd like the 15 min, first, followed by the 30, then 45 and 75. The original sheet is not set up that way.
            But this tidyverse command rectifies the problem:

            > colnames(f28[,order(colnames(f28))])
            [1] "F.28.15.7" "F.28.15.8" "F.28.15.9" "F.28.30.7" "F.28.30.8" "F.28.30.9" "F.28.45.7" "F.28.45.8" "F.28.45.9"
            [10] "F.28.75.7" "F.28.75.8" "F.28.75.9"

            This allows everything to consistently be ordered for the next EBseq step.

            Show
            robofjoy Robert Reid added a comment - Resolved the ordering issue. WHen we select our experiments of interest like so: > colnames(f28) [1] "F.28.15.7" "F.28.30.7" "F.28.45.7" "F.28.75.7" "F.28.75.8" "F.28.45.8" "F.28.15.8" "F.28.30.8" "F.28.30.9" [10] "F.28.15.9" "F.28.75.9" "F.28.45.9" We'd like the 15 min, first, followed by the 30, then 45 and 75. The original sheet is not set up that way. But this tidyverse command rectifies the problem: > colnames(f28 [,order(colnames(f28))] ) [1] "F.28.15.7" "F.28.15.8" "F.28.15.9" "F.28.30.7" "F.28.30.8" "F.28.30.9" "F.28.45.7" "F.28.45.8" "F.28.45.9" [10] "F.28.75.7" "F.28.75.8" "F.28.75.9" This allows everything to consistently be ordered for the next EBseq step.
            Hide
            robofjoy Robert Reid added a comment -

            Tweaked my function so we can easily get how many genes are found in each type of up-down condition.

            The function now returns 2 items, the EBseq Object results and a list of how many genes in each condition:

            Unable to find source-code formatter for language: r. Available languages are: actionscript, html, java, javascript, none, sql, xhtml, xml
            fetch_GeneCalls <- function(tibby) {
              dim(tibby)
              tibby <- tibby[,order(colnames(tibby))]  ##Reordering the columns to match the conditions
              mat <- as.data.frame(tibby)
              rownames(mat) <- geneid
              head(mat)
              Sizes=MedianNorm(mat)
              qSizes=QuantileNorm(mat,.75)
              print(qSizes)
              GeneNormData=GetNormalizedMat(mat, Sizes)
              EBSeqHMMGeneOut=EBSeqHMMTest(Data=data.matrix(mat), sizeFactors=Sizes, Conditions=Conditions,UpdateRd=50)
              dim(EBSeqHMMGeneOut)
              GeneDECalls=GetDECalls(EBSeqHMMGeneOut, FDR=.05)
              
              #print the gene COnf calls summary
              GeneConfCalls=GetConfidentCalls(EBSeqHMMGeneOut, FDR=.05,cutoff=.5, OnlyDynamic=T)
              print(GeneConfCalls$NumEach) ##We print these, but results are not returned. Should make a separate function.
              list_data <- list(GeneDECalls,GeneConfCalls$NumEach)
              return(list_data)
            }
            
            Show
            robofjoy Robert Reid added a comment - Tweaked my function so we can easily get how many genes are found in each type of up-down condition. The function now returns 2 items, the EBseq Object results and a list of how many genes in each condition: Unable to find source-code formatter for language: r. Available languages are: actionscript, html, java, javascript, none, sql, xhtml, xml fetch_GeneCalls <- function(tibby) { dim(tibby) tibby <- tibby[,order(colnames(tibby))] ##Reordering the columns to match the conditions mat <- as.data.frame(tibby) rownames(mat) <- geneid head(mat) Sizes=MedianNorm(mat) qSizes=QuantileNorm(mat,.75) print(qSizes) GeneNormData=GetNormalizedMat(mat, Sizes) EBSeqHMMGeneOut=EBSeqHMMTest(Data=data.matrix(mat), sizeFactors=Sizes, Conditions=Conditions,UpdateRd=50) dim(EBSeqHMMGeneOut) GeneDECalls=GetDECalls(EBSeqHMMGeneOut, FDR=.05) #print the gene COnf calls summary GeneConfCalls=GetConfidentCalls(EBSeqHMMGeneOut, FDR=.05,cutoff=.5, OnlyDynamic=T) print(GeneConfCalls$NumEach) ##We print these, but results are not returned. Should make a separate function. list_data <- list(GeneDECalls,GeneConfCalls$NumEach) return (list_data) }
            Hide
            robofjoy Robert Reid added a comment -

            Great success. A master table has been created.

            But I am going to go back to the Python code and add a Description column for each gene to make the sheet more insightful!

            Show
            robofjoy Robert Reid added a comment - Great success. A master table has been created. But I am going to go back to the Python code and add a Description column for each gene to make the sheet more insightful!
            Hide
            robofjoy Robert Reid added a comment -

            To make 1 big table from the command line:

            for file in *bseqGeneCalls-SL4.txt; do python ./add2MasterTable.py ./masterTable-ebseqhmm.txt ./$

            {file}

            ./muday-144-SL4_counts-salmon.txt; mv ./masterTable-ebseqhmm.new ./masterTable-ebseqhmm.txt ; done

            Description of genes now added as a column.

            Show
            robofjoy Robert Reid added a comment - To make 1 big table from the command line: for file in *bseqGeneCalls-SL4.txt; do python ./add2MasterTable.py ./masterTable-ebseqhmm.txt ./$ {file} ./muday-144-SL4_counts-salmon.txt; mv ./masterTable-ebseqhmm.new ./masterTable-ebseqhmm.txt ; done Description of genes now added as a column.
            Hide
            robofjoy Robert Reid added a comment -

            To create a master table, we first need to all of the genes that show some sort of path pattern change.
            To get these we iterate through all the result files, get the first column, add them all to a temp file.
            We then sort that temp file and keep the unique gene_ID entries.
            These are the basis for the master table.
            Commands were as follows:

            for file in *_ebseqGeneCalls-SL4.txt; do echo $file; awk '

            { print $1 }

            ' $file >> tempFoundIds.txt ; done

            sort tempFoundIds.txt | uniq > tempFoundIds-uniq.txt

            cp tempFoundIds-uniq.txt masterTable-ebseqhmm.orig
            cp tempFoundIds-uniq.txt masterTable-ebseqhmm.txt

            Show
            robofjoy Robert Reid added a comment - To create a master table, we first need to all of the genes that show some sort of path pattern change. To get these we iterate through all the result files, get the first column, add them all to a temp file. We then sort that temp file and keep the unique gene_ID entries. These are the basis for the master table. Commands were as follows: for file in *_ebseqGeneCalls-SL4.txt; do echo $file; awk ' { print $1 } ' $file >> tempFoundIds.txt ; done sort tempFoundIds.txt | uniq > tempFoundIds-uniq.txt cp tempFoundIds-uniq.txt masterTable-ebseqhmm.orig cp tempFoundIds-uniq.txt masterTable-ebseqhmm.txt
            Hide
            robofjoy Robert Reid added a comment -

            Need testing and get the code into the repo.

            Starting with running things on the cluster as a test.

            Location = /projects/tomato_genome/fnb/robswork/timecluster/ebseq

            Show
            robofjoy Robert Reid added a comment - Need testing and get the code into the repo. Starting with running things on the cluster as a test. Location = /projects/tomato_genome/fnb/robswork/timecluster/ebseq
            Hide
            robofjoy Robert Reid added a comment -

            masterTable-ebseqhmm.xlsx

            I sent this to Gloria's lab to see if this has any value. I have not heard back yet.

            The next step is review. I have run it on my mac and on the powermac.

            For the cluster, my R library package is not set up correctly. I need to resolve this.

            Show
            robofjoy Robert Reid added a comment - masterTable-ebseqhmm.xlsx I sent this to Gloria's lab to see if this has any value. I have not heard back yet. The next step is review. I have run it on my mac and on the powermac. For the cluster, my R library package is not set up correctly. I need to resolve this.
            Hide
            robofjoy Robert Reid added a comment -

            I got no response on 2 emails.
            So I don't plan to add any more to this tool at this time.

            We should review the R code and get it in the hot pollen repo once it is universally working?

            Right now the code is across 2 r scripts. I am going to simplify it into 1 script. The 2nd script is just a functions script.

            Show
            robofjoy Robert Reid added a comment - I got no response on 2 emails. So I don't plan to add any more to this tool at this time. We should review the R code and get it in the hot pollen repo once it is universally working? Right now the code is across 2 r scripts. I am going to simplify it into 1 script. The 2nd script is just a functions script.
            Hide
            robofjoy Robert Reid added a comment - - edited

            2 Rscripts are now just 1.

            This is testable and reviewable!

            Show
            robofjoy Robert Reid added a comment - - edited 2 Rscripts are now just 1. This is testable and reviewable!
            Hide
            robofjoy Robert Reid added a comment - - edited

            For R script, see comments below....... muday-144-SL4_counts-salmon.txt

            Instructions for review:

            1. Download the attached R script. And run it on the attached Salmon counts file.
            2. Open the script in Rstudio and try to run it. A few libraries will need to be installed possibly depending on your environment.
            3. You will need to change the locations on lines 20 and 26 to your folder location.
            4. Takes many minutes to run.
            5. The end result should be a collection of result files that look similar in size to these:
            6. rw-rr-@ 1 robreid staff 63100 Oct 19 10:24 a28gc_ebseqGeneCalls-SL4.txt
              rw-rr-@ 1 robreid staff 63544 Oct 19 10:24 a34gc_ebseqGeneCalls-SL4.txt
              rw-rr-@ 1 robreid staff 49124 Oct 19 10:24 v28gc_ebseqGeneCalls-SL4.txt
              rw-rr-@ 1 robreid staff 49817 Oct 19 10:24 v34gc_ebseqGeneCalls-SL4.txt
              rw-rr-@ 1 robreid staff 40735 Oct 19 10:24 f28gc_ebseqGeneCalls-SL4.txt
              rw-rr-@ 1 robreid staff 42402 Oct 19 10:24 f34gc_ebseqGeneCalls-SL4.txt
              rw-rr-@ 1 robreid staff 483 Oct 19 10:24 a28gc_numIneachPath-SL4.txt
              rw-rr-@ 1 robreid staff 484 Oct 19 10:24 a34gc_numIneachPath-SL4.txt
              rw-rr-@ 1 robreid staff 482 Oct 19 10:24 v28gc_numIneachPath-SL4.txt
              rw-rr-@ 1 robreid staff 483 Oct 19 10:24 v34gc_numIneachPath-SL4.txt
              rw-rr-@ 1 robreid staff 481 Oct 19 10:24 f28gc_numIneachPath-SL4.txt
              rw-rr-@ 1 robreid staff 482 Oct 19 10:24 f34gc_numIneachPath-SL4.txt

            Take a peak at a result file or 2 and see if it is intuitive!

            Show
            robofjoy Robert Reid added a comment - - edited For R script, see comments below....... muday-144-SL4_counts-salmon.txt Instructions for review: Download the attached R script. And run it on the attached Salmon counts file. Open the script in Rstudio and try to run it. A few libraries will need to be installed possibly depending on your environment. You will need to change the locations on lines 20 and 26 to your folder location. Takes many minutes to run. The end result should be a collection of result files that look similar in size to these: rw-r r -@ 1 robreid staff 63100 Oct 19 10:24 a28gc_ebseqGeneCalls-SL4.txt rw-r r -@ 1 robreid staff 63544 Oct 19 10:24 a34gc_ebseqGeneCalls-SL4.txt rw-r r -@ 1 robreid staff 49124 Oct 19 10:24 v28gc_ebseqGeneCalls-SL4.txt rw-r r -@ 1 robreid staff 49817 Oct 19 10:24 v34gc_ebseqGeneCalls-SL4.txt rw-r r -@ 1 robreid staff 40735 Oct 19 10:24 f28gc_ebseqGeneCalls-SL4.txt rw-r r -@ 1 robreid staff 42402 Oct 19 10:24 f34gc_ebseqGeneCalls-SL4.txt rw-r r -@ 1 robreid staff 483 Oct 19 10:24 a28gc_numIneachPath-SL4.txt rw-r r -@ 1 robreid staff 484 Oct 19 10:24 a34gc_numIneachPath-SL4.txt rw-r r -@ 1 robreid staff 482 Oct 19 10:24 v28gc_numIneachPath-SL4.txt rw-r r -@ 1 robreid staff 483 Oct 19 10:24 v34gc_numIneachPath-SL4.txt rw-r r -@ 1 robreid staff 481 Oct 19 10:24 f28gc_numIneachPath-SL4.txt rw-r r -@ 1 robreid staff 482 Oct 19 10:24 f34gc_numIneachPath-SL4.txt Take a peak at a result file or 2 and see if it is intuitive!
            Hide
            Mdavis4290 Molly Davis added a comment - - edited

            Review:

            • The script runs good in RStudio and ran into no issues.
            • I did have issues with the output files. There is a third column after "Max_PP" that is not named and is full of numbers. I am unsure what those numbers represent.
            • I also try to open the txt files in excel and they do not visually transfer correctly. The strings are mismatched and do not separate correctly.
            • Might be beneficial to create a data frame object for each result and make sure the columns save correctly with column names.

            Thanks!

            Show
            Mdavis4290 Molly Davis added a comment - - edited Review : The script runs good in RStudio and ran into no issues. I did have issues with the output files. There is a third column after "Max_PP" that is not named and is full of numbers. I am unsure what those numbers represent. I also try to open the txt files in excel and they do not visually transfer correctly. The strings are mismatched and do not separate correctly. Might be beneficial to create a data frame object for each result and make sure the columns save correctly with column names. Thanks!
            Hide
            robofjoy Robert Reid added a comment -

            Ebseq-hmm_mudayTimeCourse-Version3.R

            Updated version that handles the headers and keeps things in line for better exporting to Excel.

            Molly can you run this version?

            Show
            robofjoy Robert Reid added a comment - Ebseq-hmm_mudayTimeCourse-Version3.R Updated version that handles the headers and keeps things in line for better exporting to Excel. Molly can you run this version?
            Hide
            Mdavis4290 Molly Davis added a comment -

            Review:

            • The EBSeq output txt files are fixed now and column names are fixed.
            • The numIneachPath txt files still have some formatting issues but that's only if you try to open in excel. These files are readable as just txt files so I see no reason for them needing to be readable in excel also.

            Moving ticket to done!

            Show
            Mdavis4290 Molly Davis added a comment - Review : The EBSeq output txt files are fixed now and column names are fixed. The numIneachPath txt files still have some formatting issues but that's only if you try to open in excel. These files are readable as just txt files so I see no reason for them needing to be readable in excel also. Moving ticket to done!

              People

              • Assignee:
                robofjoy Robert Reid
                Reporter:
                robofjoy Robert Reid
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: