[IGBF-3454] Develop an R prototype script EB-Seq using Muday time course - JIRA UNCC

Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
3
Epic Link:
Prepare flavonoid RNA-Seq time course analyses for publication
Sprint:
Fall 2 2023 Sep 17, Fall 3 2023 Oct 2, Fall 4 2023 Oct 16

Description

Purpose: Write a first draft of an EB-Seq script to analyze the time course data.

Why? Because comparing time points violates an assumption of independence, a few tools have been developed to address this and improve the RNA-seq comparisons on time dependent studies.
A few examples:

mfuzz
PAL-D
EB-Seq

I tested out PAL-D last January and it was a horrible experience.
Rasha ran mfuzz with success.
EB-Seq has been run by me and Liz in the past.

Liz's details on Eb0seq can be found here:
_Original details from Liz Cooper
“EBSeq script on my GitHub page associated with the sweet sorghum paper (in case you decide to use that package): “
Rio/EBSeqHMM_Cluster.R at master · eacooper400/Rio · GitHub

Liz_

In R:

The R package should be acquired like so:

1. 1. Installing Required Packages (only done once)
    #source("https://bioconductor.org/biocLite.R")
    BiocManager::install("EBSeq")
    BiocManager::install("EBSeqHMM")
    install.packages("blockmodeling")

1. 1. Load Required Pacakges
    library(EBSeq)
    library(EBSeqHMM)

Attachments

Options
- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

Attachments

Ebseq-hmm_mudayTimeCourse-Version2.R
6 kB
20/Oct/23 9:03 AM
Ebseq-hmm_mudayTimeCourse-Version2 2.R
4 kB
18/Oct/23 3:14 PM
Ebseq-hmm_mudayTimeCourse-Version3.R
6 kB
27/Oct/23 9:58 AM
journal.pone.0264246.pdf
2.21 MB
25/Sep/23 9:02 AM
masterTable-ebseqhmm.xlsx
272 kB
13/Oct/23 10:10 AM
muday-144-SL4_counts-salmon.txt
8.22 MB
20/Oct/23 9:03 AM
muday-144-SL4_counts-salmon.txt
8.22 MB
18/Oct/23 3:14 PM
RNAseq_fnxs.R
5 kB
18/Oct/23 3:14 PM

Issue Links

is blocked by

IGBF-3462 Even Expression failure in EBSEQ-HMM

Closed

Activity

Ascending order - Click to sort in descending order

Hide

Permalink

Robert Reid added a comment - 25/Sep/23 9:02 AM

Here is journal article discussing rna-seq statistical methods. They discuss EB-SEQ.

And they mention that is performs well when n=3, as in this project!

journal.pone.0264246.pdf

Show

Robert Reid added a comment - 25/Sep/23 9:02 AM Here is journal article discussing rna-seq statistical methods. They discuss EB-SEQ. And they mention that is performs well when n=3, as in this project! journal.pone.0264246.pdf

Hide

Permalink

Robert Reid added a comment - 25/Sep/23 9:18 AM

It's slowly coming back to me now!

EB-SEQ does the cool pattern identifying of up and down patterns.

For example, using their example dataset, you get a results like so,

Most_Likely_Path Max_PP *
Gene_40 "Up-Up-Down-Down" "0.9884"
Gene_1 "Down-Up-Down-Down" "0.7742"
Gene_15 "Down-Down-Up-Up" "0.5733"
Gene_9 "Down-Down-Up-Up" "0.5524"
Gene_2 "Up-Up-Down-Down" "0.5321"
Gene_17 "Down-Down-Up-Up" "0.5084"

(In this example, there are 5 time points, so you get 4 states up genes going up or down)

Show

Robert Reid added a comment - 25/Sep/23 9:18 AM It's slowly coming back to me now! EB-SEQ does the cool pattern identifying of up and down patterns. For example, using their example dataset, you get a results like so, Most_Likely_Path Max_PP * Gene_40 "Up-Up-Down-Down" "0.9884" Gene_1 "Down-Up-Down-Down" "0.7742" Gene_15 "Down-Down-Up-Up" "0.5733" Gene_9 "Down-Down-Up-Up" "0.5524" Gene_2 "Up-Up-Down-Down" "0.5321" Gene_17 "Down-Down-Up-Up" "0.5084" (In this example, there are 5 time points, so you get 4 states up genes going up or down)

Hide

Permalink

Robert Reid added a comment - 29/Sep/23 8:50 AM

Success!

I've read in the data and produced output.
I ran this test 6 times, breaking down the data into these 6 categories:

ARE 28C
ARE 34C
VF36 28C
VF36 34C
F3H-OE3 28C
FSH-OE3 34C

I have produced 6 files that summarizes the genes that were found to show some change in expression pattern.
E.g.,
"Most_Likely_Path" "Max_PP"
"Solyc08G002520" "Down-Up-Down" "0.9777"
"Solyc05G000582" "Up-Down-Up" "0.9661"
"Solyc04G002780" "Up-Down-Up" "0.9656"
"Solyc00G000013" "Down-Up-Down" "0.9651"
"Solyc04G000324" "Down-Up-Down" "0.9651"
"Solyc03G002116" "Down-Up-Down" "0.964"
"Solyc02G002947" "Down-Down-Up" "0.9596"
"Solyc03G003029" "Down-Down-Up" "0.9489"
"Solyc00G000015" "Down-Up-Down" "0.9388"

The R code for this will need review.

Will summarize number of genes found for each type of pattern possible: ("Down-Down-Up" )

Show

Robert Reid added a comment - 29/Sep/23 8:50 AM Success! I've read in the data and produced output. I ran this test 6 times, breaking down the data into these 6 categories: ARE 28C ARE 34C VF36 28C VF36 34C F3H-OE3 28C FSH-OE3 34C I have produced 6 files that summarizes the genes that were found to show some change in expression pattern. E.g., "Most_Likely_Path" "Max_PP" "Solyc08G002520" "Down-Up-Down" "0.9777" "Solyc05G000582" "Up-Down-Up" "0.9661" "Solyc04G002780" "Up-Down-Up" "0.9656" "Solyc00G000013" "Down-Up-Down" "0.9651" "Solyc04G000324" "Down-Up-Down" "0.9651" "Solyc03G002116" "Down-Up-Down" "0.964" "Solyc02G002947" "Down-Down-Up" "0.9596" "Solyc03G003029" "Down-Down-Up" "0.9489" "Solyc00G000015" "Down-Up-Down" "0.9388" The R code for this will need review. Will summarize number of genes found for each type of pattern possible: ("Down-Down-Up" )

Hide

Permalink

Robert Reid added a comment - 29/Sep/23 1:51 PM

Ha! major bug.

We set levels for the time points like so.
CondVector=c("T15","T30","T45","T75","T15","T30","T45","T75","T15","T30","T45","T75")

And that should correspond with our data we sub-select:
a28<-select(x,matches("A.28"))

[1] "A.28.15.7" "A.28.30.7" "A.28.45.7" "A.28.75.7" "A.28.45.8" "A.28.15.8" "A.28.30.8" "A.28.75.8" "A.28.30.9"
[10] "A.28.15.9" "A.28.75.9" "A.28.45.9"

However, the order in our sheet is not always the same. I'll need to assign levels correctly.

Show

Robert Reid added a comment - 29/Sep/23 1:51 PM Ha! major bug. We set levels for the time points like so. CondVector=c("T15","T30","T45","T75","T15","T30","T45","T75","T15","T30","T45","T75") And that should correspond with our data we sub-select: a28<-select(x,matches("A.28")) [1] "A.28.15.7" "A.28.30.7" "A.28.45.7" "A.28.75.7" "A.28.45.8" "A.28.15.8" "A.28.30.8" "A.28.75.8" "A.28.30.9" [10] "A.28.15.9" "A.28.75.9" "A.28.45.9" However, the order in our sheet is not always the same. I'll need to assign levels correctly.

Hide

Permalink

Robert Reid added a comment - 03/Oct/23 9:17 AM

Resolved the ordering issue.
WHen we select our experiments of interest like so:

> colnames(f28)
[1] "F.28.15.7" "F.28.30.7" "F.28.45.7" "F.28.75.7" "F.28.75.8" "F.28.45.8" "F.28.15.8" "F.28.30.8" "F.28.30.9"
[10] "F.28.15.9" "F.28.75.9" "F.28.45.9"

We'd like the 15 min, first, followed by the 30, then 45 and 75. The original sheet is not set up that way.
But this tidyverse command rectifies the problem:

> colnames(f28[,order(colnames(f28))])
[1] "F.28.15.7" "F.28.15.8" "F.28.15.9" "F.28.30.7" "F.28.30.8" "F.28.30.9" "F.28.45.7" "F.28.45.8" "F.28.45.9"
[10] "F.28.75.7" "F.28.75.8" "F.28.75.9"

This allows everything to consistently be ordered for the next EBseq step.

Show

Robert Reid added a comment - 03/Oct/23 9:17 AM Resolved the ordering issue. WHen we select our experiments of interest like so: > colnames(f28) [1] "F.28.15.7" "F.28.30.7" "F.28.45.7" "F.28.75.7" "F.28.75.8" "F.28.45.8" "F.28.15.8" "F.28.30.8" "F.28.30.9" [10] "F.28.15.9" "F.28.75.9" "F.28.45.9" We'd like the 15 min, first, followed by the 30, then 45 and 75. The original sheet is not set up that way. But this tidyverse command rectifies the problem: > colnames(f28 [,order(colnames(f28))] ) [1] "F.28.15.7" "F.28.15.8" "F.28.15.9" "F.28.30.7" "F.28.30.8" "F.28.30.9" "F.28.45.7" "F.28.45.8" "F.28.45.9" [10] "F.28.75.7" "F.28.75.8" "F.28.75.9" This allows everything to consistently be ordered for the next EBseq step.

Hide

Permalink

Robert Reid added a comment - 03/Oct/23 9:32 AM

Tweaked my function so we can easily get how many genes are found in each type of up-down condition.

The function now returns 2 items, the EBseq Object results and a list of how many genes in each condition:

Unable to find source-code formatter for language: r. Available languages are: actionscript, html, java, javascript, none, sql, xhtml, xml

fetch_GeneCalls <- function(tibby) {
  dim(tibby)
  tibby <- tibby[,order(colnames(tibby))]  ##Reordering the columns to match the conditions
  mat <- as.data.frame(tibby)
  rownames(mat) <- geneid
  head(mat)
  Sizes=MedianNorm(mat)
  qSizes=QuantileNorm(mat,.75)
  print(qSizes)
  GeneNormData=GetNormalizedMat(mat, Sizes)
  EBSeqHMMGeneOut=EBSeqHMMTest(Data=data.matrix(mat), sizeFactors=Sizes, Conditions=Conditions,UpdateRd=50)
  dim(EBSeqHMMGeneOut)
  GeneDECalls=GetDECalls(EBSeqHMMGeneOut, FDR=.05)
  
  #print the gene COnf calls summary
  GeneConfCalls=GetConfidentCalls(EBSeqHMMGeneOut, FDR=.05,cutoff=.5, OnlyDynamic=T)
  print(GeneConfCalls$NumEach) ##We print these, but results are not returned. Should make a separate function.
  list_data <- list(GeneDECalls,GeneConfCalls$NumEach)
  return(list_data)
}

Show

Robert Reid added a comment - 03/Oct/23 9:32 AM Tweaked my function so we can easily get how many genes are found in each type of up-down condition. The function now returns 2 items, the EBseq Object results and a list of how many genes in each condition: Unable to find source-code formatter for language: r. Available languages are: actionscript, html, java, javascript, none, sql, xhtml, xml fetch_GeneCalls <- function(tibby) { dim(tibby) tibby <- tibby[,order(colnames(tibby))] ##Reordering the columns to match the conditions mat <- as.data.frame(tibby) rownames(mat) <- geneid head(mat) Sizes=MedianNorm(mat) qSizes=QuantileNorm(mat,.75) print(qSizes) GeneNormData=GetNormalizedMat(mat, Sizes) EBSeqHMMGeneOut=EBSeqHMMTest(Data=data.matrix(mat), sizeFactors=Sizes, Conditions=Conditions,UpdateRd=50) dim(EBSeqHMMGeneOut) GeneDECalls=GetDECalls(EBSeqHMMGeneOut, FDR=.05) #print the gene COnf calls summary GeneConfCalls=GetConfidentCalls(EBSeqHMMGeneOut, FDR=.05,cutoff=.5, OnlyDynamic=T) print(GeneConfCalls$NumEach) ##We print these, but results are not returned. Should make a separate function. list_data <- list(GeneDECalls,GeneConfCalls$NumEach) return (list_data) }

Hide

Permalink

Robert Reid added a comment - 10/Oct/23 9:51 AM

Great success. A master table has been created.

But I am going to go back to the Python code and add a Description column for each gene to make the sheet more insightful!

Show

Robert Reid added a comment - 10/Oct/23 9:51 AM Great success. A master table has been created. But I am going to go back to the Python code and add a Description column for each gene to make the sheet more insightful!

Hide

Permalink

Robert Reid added a comment - 11/Oct/23 10:15 AM

To make 1 big table from the command line:

for file in *bseqGeneCalls-SL4.txt; do python ./add2MasterTable.py ./masterTable-ebseqhmm.txt ./$

{file}

./muday-144-SL4_counts-salmon.txt; mv ./masterTable-ebseqhmm.new ./masterTable-ebseqhmm.txt ; done

Description of genes now added as a column.

Show

Robert Reid added a comment - 11/Oct/23 10:15 AM To make 1 big table from the command line: for file in *bseqGeneCalls-SL4.txt; do python ./add2MasterTable.py ./masterTable-ebseqhmm.txt ./$ {file} ./muday-144-SL4_counts-salmon.txt; mv ./masterTable-ebseqhmm.new ./masterTable-ebseqhmm.txt ; done Description of genes now added as a column.

Hide

Permalink

Robert Reid added a comment - 12/Oct/23 9:18 AM

To create a master table, we first need to all of the genes that show some sort of path pattern change.
To get these we iterate through all the result files, get the first column, add them all to a temp file.
We then sort that temp file and keep the unique gene_ID entries.
These are the basis for the master table.
Commands were as follows:

for file in *_ebseqGeneCalls-SL4.txt; do echo $file; awk '

{ print $1 }

' $file >> tempFoundIds.txt ; done

sort tempFoundIds.txt | uniq > tempFoundIds-uniq.txt

cp tempFoundIds-uniq.txt masterTable-ebseqhmm.orig
cp tempFoundIds-uniq.txt masterTable-ebseqhmm.txt

Show

Robert Reid added a comment - 12/Oct/23 9:18 AM To create a master table, we first need to all of the genes that show some sort of path pattern change. To get these we iterate through all the result files, get the first column, add them all to a temp file. We then sort that temp file and keep the unique gene_ID entries. These are the basis for the master table. Commands were as follows: for file in *_ebseqGeneCalls-SL4.txt; do echo $file; awk ' { print $1 } ' $file >> tempFoundIds.txt ; done sort tempFoundIds.txt | uniq > tempFoundIds-uniq.txt cp tempFoundIds-uniq.txt masterTable-ebseqhmm.orig cp tempFoundIds-uniq.txt masterTable-ebseqhmm.txt

Hide

Permalink

Robert Reid added a comment - 12/Oct/23 9:56 AM

Need testing and get the code into the repo.

Starting with running things on the cluster as a test.

Location = /projects/tomato_genome/fnb/robswork/timecluster/ebseq

Show

Robert Reid added a comment - 12/Oct/23 9:56 AM Need testing and get the code into the repo. Starting with running things on the cluster as a test. Location = /projects/tomato_genome/fnb/robswork/timecluster/ebseq

Hide

Permalink

Robert Reid added a comment - 13/Oct/23 10:12 AM

masterTable-ebseqhmm.xlsx

I sent this to Gloria's lab to see if this has any value. I have not heard back yet.

The next step is review. I have run it on my mac and on the powermac.

For the cluster, my R library package is not set up correctly. I need to resolve this.

Show

Robert Reid added a comment - 13/Oct/23 10:12 AM masterTable-ebseqhmm.xlsx I sent this to Gloria's lab to see if this has any value. I have not heard back yet. The next step is review. I have run it on my mac and on the powermac. For the cluster, my R library package is not set up correctly. I need to resolve this.

Hide

Permalink

Robert Reid added a comment - 19/Oct/23 9:35 AM

I got no response on 2 emails.
So I don't plan to add any more to this tool at this time.

We should review the R code and get it in the hot pollen repo once it is universally working?

Right now the code is across 2 r scripts. I am going to simplify it into 1 script. The 2nd script is just a functions script.

Show

Robert Reid added a comment - 19/Oct/23 9:35 AM I got no response on 2 emails. So I don't plan to add any more to this tool at this time. We should review the R code and get it in the hot pollen repo once it is universally working? Right now the code is across 2 r scripts. I am going to simplify it into 1 script. The 2nd script is just a functions script.

Hide

Permalink

Robert Reid added a comment - 20/Oct/23 8:57 AM - edited

2 Rscripts are now just 1.

This is testable and reviewable!

Show

Robert Reid added a comment - 20/Oct/23 8:57 AM - edited 2 Rscripts are now just 1. This is testable and reviewable!

Hide

Permalink

Robert Reid added a comment - 20/Oct/23 9:19 AM - edited

For R script, see comments below....... muday-144-SL4_counts-salmon.txt

Instructions for review:

Download the attached R script. And run it on the attached Salmon counts file.
Open the script in Rstudio and try to run it. A few libraries will need to be installed possibly depending on your environment.
You will need to change the locations on lines 20 and 26 to your folder location.
Takes many minutes to run.
The end result should be a collection of result files that look similar in size to these:
~~rw-r~~r-@ 1 robreid staff 63100 Oct 19 10:24 a28gc_ebseqGeneCalls-SL4.txt
~~rw-r~~r-@ 1 robreid staff 63544 Oct 19 10:24 a34gc_ebseqGeneCalls-SL4.txt
~~rw-r~~r-@ 1 robreid staff 49124 Oct 19 10:24 v28gc_ebseqGeneCalls-SL4.txt
~~rw-r~~r-@ 1 robreid staff 49817 Oct 19 10:24 v34gc_ebseqGeneCalls-SL4.txt
~~rw-r~~r-@ 1 robreid staff 40735 Oct 19 10:24 f28gc_ebseqGeneCalls-SL4.txt
~~rw-r~~r-@ 1 robreid staff 42402 Oct 19 10:24 f34gc_ebseqGeneCalls-SL4.txt
~~rw-r~~r-@ 1 robreid staff 483 Oct 19 10:24 a28gc_numIneachPath-SL4.txt
~~rw-r~~r-@ 1 robreid staff 484 Oct 19 10:24 a34gc_numIneachPath-SL4.txt
~~rw-r~~r-@ 1 robreid staff 482 Oct 19 10:24 v28gc_numIneachPath-SL4.txt
~~rw-r~~r-@ 1 robreid staff 483 Oct 19 10:24 v34gc_numIneachPath-SL4.txt
~~rw-r~~r-@ 1 robreid staff 481 Oct 19 10:24 f28gc_numIneachPath-SL4.txt
~~rw-r~~r-@ 1 robreid staff 482 Oct 19 10:24 f34gc_numIneachPath-SL4.txt

Take a peak at a result file or 2 and see if it is intuitive!

Show

Robert Reid added a comment - 20/Oct/23 9:19 AM - edited For R script, see comments below....... muday-144-SL4_counts-salmon.txt Instructions for review: Download the attached R script. And run it on the attached Salmon counts file. Open the script in Rstudio and try to run it. A few libraries will need to be installed possibly depending on your environment. You will need to change the locations on lines 20 and 26 to your folder location. Takes many minutes to run. The end result should be a collection of result files that look similar in size to these: rw-r r -@ 1 robreid staff 63100 Oct 19 10:24 a28gc_ebseqGeneCalls-SL4.txt rw-r r -@ 1 robreid staff 63544 Oct 19 10:24 a34gc_ebseqGeneCalls-SL4.txt rw-r r -@ 1 robreid staff 49124 Oct 19 10:24 v28gc_ebseqGeneCalls-SL4.txt rw-r r -@ 1 robreid staff 49817 Oct 19 10:24 v34gc_ebseqGeneCalls-SL4.txt rw-r r -@ 1 robreid staff 40735 Oct 19 10:24 f28gc_ebseqGeneCalls-SL4.txt rw-r r -@ 1 robreid staff 42402 Oct 19 10:24 f34gc_ebseqGeneCalls-SL4.txt rw-r r -@ 1 robreid staff 483 Oct 19 10:24 a28gc_numIneachPath-SL4.txt rw-r r -@ 1 robreid staff 484 Oct 19 10:24 a34gc_numIneachPath-SL4.txt rw-r r -@ 1 robreid staff 482 Oct 19 10:24 v28gc_numIneachPath-SL4.txt rw-r r -@ 1 robreid staff 483 Oct 19 10:24 v34gc_numIneachPath-SL4.txt rw-r r -@ 1 robreid staff 481 Oct 19 10:24 f28gc_numIneachPath-SL4.txt rw-r r -@ 1 robreid staff 482 Oct 19 10:24 f34gc_numIneachPath-SL4.txt Take a peak at a result file or 2 and see if it is intuitive!

Hide

Permalink

Molly Davis added a comment - 24/Oct/23 3:46 PM - edited

Review:

The script runs good in RStudio and ran into no issues.
I did have issues with the output files. There is a third column after "Max_PP" that is not named and is full of numbers. I am unsure what those numbers represent.
I also try to open the txt files in excel and they do not visually transfer correctly. The strings are mismatched and do not separate correctly.
Might be beneficial to create a data frame object for each result and make sure the columns save correctly with column names.

Thanks!

Show

Molly Davis added a comment - 24/Oct/23 3:46 PM - edited Review : The script runs good in RStudio and ran into no issues. I did have issues with the output files. There is a third column after "Max_PP" that is not named and is full of numbers. I am unsure what those numbers represent. I also try to open the txt files in excel and they do not visually transfer correctly. The strings are mismatched and do not separate correctly. Might be beneficial to create a data frame object for each result and make sure the columns save correctly with column names. Thanks!

Hide

Permalink

Robert Reid added a comment - 27/Oct/23 9:58 AM

Ebseq-hmm_mudayTimeCourse-Version3.R

Updated version that handles the headers and keeps things in line for better exporting to Excel.

Molly can you run this version?

Show

Robert Reid added a comment - 27/Oct/23 9:58 AM Ebseq-hmm_mudayTimeCourse-Version3.R Updated version that handles the headers and keeps things in line for better exporting to Excel. Molly can you run this version?

Hide

Permalink

Molly Davis added a comment - 27/Oct/23 4:38 PM

Review:

The EBSeq output txt files are fixed now and column names are fixed.
The numIneachPath txt files still have some formatting issues but that's only if you try to open in excel. These files are readable as just txt files so I see no reason for them needing to be readable in excel also.

Moving ticket to done!

Show

Molly Davis added a comment - 27/Oct/23 4:38 PM Review : The EBSeq output txt files are fixed now and column names are fixed. The numIneachPath txt files still have some formatting issues but that's only if you try to open in excel. These files are readable as just txt files so I see no reason for them needing to be readable in excel also. Moving ticket to done!

People

Assignee:

Robert Reid

Reporter:

Robert Reid

Votes:

0 Vote for this issue

Watchers:

2 Start watching this issue

Dates

Created:

25/Sep/23 8:57 AM

Updated:

27/Oct/23 4:38 PM

Resolved:

27/Oct/23 4:38 PM