[IGBF-3143] Run RNA-Seq data processing pipeline on positive splicing control and experimental samples - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels: None
Story Points: 3
Epic Link: Support NSF pollen grant
Sprint: Summer 4 2022 July 4, Summer 5 2022 July 18, Summer 6 2022 Aug 1, Fall 1 2022 Aug 15

Description

Data sets to process:

Positive control: SRP328042. Data are published in this article.
Experimental: SRP252265

To-Do:

Obtain data in fastq format from the Sequence Read Archive using fasterq-dump with options for paired-end data - DONE
Place data into directories named for the SRP number, e.g., SRP328042 and SRP252265, within a directory named "alt_splicing" under "nobackup" - DONE
Make a note of the particular commands used to perform the data retrieval (see comment below)
Create a "samples" text file listing the SRR fastq files for running the nf-core/rnaseq Nextflow pipeline
Run nf-core/rnaseq using proper maximum intron size parameter using "tomato.config"
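A minimal sketch of the "samples" file step, assuming the four-column samplesheet layout (sample,fastq_1,fastq_2,strandedness) from the nf-core/rnaseq 3.4 usage documentation and the SRRnnn_1.fastq / SRRnnn_2.fastq names that split paired-end retrieval produces. The function name make_samplesheet and the default strandedness value are illustrative only; the correct strandedness for each data set has to be confirmed separately.

```shell
#!/bin/bash
# make_samplesheet DIR [STRANDEDNESS]
# Hypothetical helper: writes a samplesheet to stdout with one row per
# paired-end run found in DIR. STRANDEDNESS defaults to "unstranded",
# which is an assumption, not a recommendation for these data sets.
make_samplesheet() {
    local dir="$1" strand="${2:-unstranded}"
    local r1 r2 sample
    echo "sample,fastq_1,fastq_2,strandedness"
    for r1 in "$dir"/*_1.fastq; do
        [ -e "$r1" ] || continue              # no matches: glob stays literal
        r2="${r1%_1.fastq}_2.fastq"           # expected mate file
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: mate file not found" >&2
            continue
        fi
        sample="$(basename "$r1" _1.fastq)"   # e.g. the SRR accession
        echo "$sample,$r1,$r2,$strand"
    done
}

# Example (not executed here):
#   make_samplesheet /nobackup/tomato_genome/alt_splicing/SRP252265 > SRP252265.csv
```

Runs without a mate file are reported on stderr and left out, so an incomplete download shows up before the pipeline starts.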

Notes:

Experimental datasets originally processed using code in https://bitbucket.org/hotpollen/rna-seq/src/master/ and https://bitbucket.org/hotpollen/flavonoid-rnaseq
All Pollen project datasets are now in the SRA under the same project number (SRP252265)!
Ann is using a fork of flavonoid-rnaseq for all new code she's writing, on branch IGBF-3143. To find her fork, go to https://bitbucket.org/hotpollen/flavonoid-rnaseq and select "forks"
Documentation for the pipeline we are using is here: https://nf-co.re/rnaseq/3.4/usage

Methods used to create positive control RNA-Seq data from SRP328042, according to the paper:

2.5.2. Preparation of RNA-Seq Library and Sequencing

Total RNA was extracted utilizing Trizol reagent (Invitrogen, Waltham, MA, USA). RNA quantity and quality were determined by NanoDrop 1000 spectrophotometer (Thermo Scientific Inc., Waltham, MA, USA), 1% agarose gel electrophoresis and Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). Following the protocol described by [28], strand-specific RNA-Seq libraries from 3 biological replicates for each group from WW and DS anthers were prepared using 1 ng/µL of total RNA sample and sequenced by Novogene Biotech (Beijing, China) on Illumina HiSeq 4000 system (Illumina, Inc., San Diego, CA, USA) according to the manufacturer’s instructions. The raw sequence reads were deposited into NCBI Sequence Read Archive under accession the number PRJNA746070.

A PDF copy of the protocol paper (reference 28) for RNA-Seq library synthesis is attached.

Attachments


10.1.1.1052.3871.pdf
25/Jul/22 1:39 PM
393 kB
Ann Loraine

Issue Links

blocks

IGBF-3144 Make annots.xml for RNA-Seq junction, alignment, and coverage graphs

Closed

IGBF-3165 Make junction files for experimental and positive control alignments data

Closed

is blocked by

IGBF-3127 Version control nf-core configuration files used for rna-seq analysis

Closed

relates to

IGBF-2970 Re-run nf-core/rnaseq using proper strand designation and better sample prefix

Closed

IGBF-3127 Version control nf-core configuration files used for rna-seq analysis

Closed

IGBF-3162 Generate scaled coverage graphs for RNA-Seq alignments

Closed

IGBF-3228 Download and process data for SRP100604 and SRP268884

Closed

IGBF-2947 Investigate using nf-core/rnaseq pipeline

Closed

IGBF-3135 Add new tomato genome and annotations to IGB Quickload repository

Closed


Activity


Ann Loraine added a comment - 07/Jul/22 12:32 PM

Example data retrieval command sequence:

/Users/mollydavis333/Desktop/sratoolkit.3.0.0-mac64/bin/fasterq-dump -S SRR15111745
rsync --progress /Users/mollydavis333/Desktop/SRR15111737_1.fastq  mdavi258@hpc.uncc.edu:/nobackup/tomato_genome/alt_splicing
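To repeat the two commands above for several runs, a dry-run helper along these lines may be useful. It only prints the commands, so they can be reviewed before anything is downloaded or copied. The function name print_fetch_commands is hypothetical, fasterq-dump is assumed to be on the PATH, and the destination is passed in rather than hard-coded.

```shell
#!/bin/bash
# print_fetch_commands DEST ACCESSION...
# Hypothetical helper: prints (does not run) one fasterq-dump command and
# one rsync command per SRR accession. DEST is an rsync destination such
# as user@host:/path/to/SRPnnnnnn.
print_fetch_commands() {
    local dest="$1"
    shift
    local acc
    for acc in "$@"; do
        # -S splits paired-end reads into <acc>_1.fastq and <acc>_2.fastq
        echo "fasterq-dump -S $acc"
        echo "rsync --progress ${acc}_1.fastq ${acc}_2.fastq $dest"
    done
}

# Example (not executed here): review the output, then pipe it to bash.
#   print_fetch_commands mdavi258@hpc.uncc.edu:/nobackup/tomato_genome/alt_splicing SRR15111745
```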


8 older comments


Ann Loraine added a comment - 27/Jul/22 2:58 PM - edited

Same error occurred. Ann, Rob, and Molly met via Zoom to troubleshoot. Rob suggested trying to run the pipeline using the older configuration and older data, since that worked. Rob & Molly are proceeding with troubleshooting, as I am out of ideas!


Molly Davis added a comment - 12/Aug/22 12:55 PM - edited

Nextflow nf-core/rnaseq pipeline update:

Directory Location: /nobackup/tomato_genome/alt_splicing/SRP252265

Dr. Reid and I successfully ran the pipeline with SRP252265 for out.4.txt:

[nf-core/rnaseq] Pipeline completed successfully
Completed at: 02-Aug-2022 16:48:38
Duration : 1m 38s
CPU hours : 481.7 (100% cached)
Succeeded : 2
Cached : 828

Solution: Added "module load singularity" to the steps, removed the symbolic link to nf-core-rnaseq-3.4, and downloaded the pipeline files directly into the directory while not in an interactive session:

module load nf-core
nf-core download rnaseq -r 3.4

The only symbolic links present were for the file resources:

S_lycopersicum_Jun_2022.bed
S_lycopersicum_Jun_2022.fa
S_lycopersicum_Jun_2022.gtf
tomato.config
doIt.sh
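Since the run depends on these symbolic links resolving, a quick check before launching could look like the sketch below. The function name report_inputs is hypothetical; it relies only on the standard shell tests -e (which follows links) and -L.

```shell
#!/bin/bash
# report_inputs FILE...
# Hypothetical helper: prints "ok", "broken link", or "missing" for each
# argument and returns non-zero if any input is unusable.
report_inputs() {
    local f status=0
    for f in "$@"; do
        if [ -e "$f" ]; then        # -e follows symlinks: the target exists
            echo "ok: $f"
        elif [ -L "$f" ]; then      # the link exists but its target does not
            echo "broken link: $f"
            status=1
        else
            echo "missing: $f"
            status=1
        fi
    done
    return "$status"
}

# Example (not executed here), run from the working directory:
#   report_inputs S_lycopersicum_Jun_2022.bed S_lycopersicum_Jun_2022.fa \
#                 S_lycopersicum_Jun_2022.gtf tomato.config doIt.sh
```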

Steps for successful run:

Add .bash_profile configurations
Start a tmux session (tmux new -s base)
Start an interactive session (srun --partition Andromeda --cpus-per-task 16 --mem-per-cpu 12000 --time 60:00:00 --pty bash)
Load the required modules (module load nf-core, module load singularity)
Start Nextflow (./doIt.sh SRP252265.csv S_lycopersicum_Jun_2022.fa S_lycopersicum_Jun_2022.gtf S_lycopersicum_Jun_2022.bed tomato.config 1> out.0.txt 2> err.0.txt )
Check the output file and the results directory
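For the last step, one way to check the redirected output file is to search it for the completion message and the summary fields shown in the run report above. This is a sketch only; the function names pipeline_succeeded and run_summary are hypothetical, and the patterns are taken from the report text in this comment.

```shell
#!/bin/bash
# pipeline_succeeded FILE
# Hypothetical helper: succeeds if FILE contains the completion message.
pipeline_succeeded() {
    grep -q "Pipeline completed successfully" "$1"
}

# run_summary FILE
# Hypothetical helper: prints the summary lines of the run report.
run_summary() {
    grep -E "Completed at|Duration|CPU hours|Succeeded|Cached" "$1"
}

# Example (not executed here):
#   pipeline_succeeded out.0.txt && run_summary out.0.txt
```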

Error that may arise:
Error executing process > 'NFCORE_RNASEQ:RNASEQ:CUSTOM_DUMPSOFTWAREVERSIONS (1)'

Caused by:
Process `NFCORE_RNASEQ:RNASEQ:CUSTOM_DUMPSOFTWAREVERSIONS (1)` terminated with an error exit status (1)

Work dir:
/nobackup/tomato_genome/alt_splicing/SRP252265/work/1e/a6b39f6fb5d57e905cdc43045523cd

Line 52 in versions file:
TRIMGALORE:
trimgalore: perl: error while loading shared libraries: libperl.so: cannot open shared object file: No such file or directory

Discussion: The error above comes from the step that writes a file listing the versions of all software used during the pipeline run. Line 52 of the versions file is the entry for trimgalore. Trimgalore ran successfully using the pipeline's own resources, but the module does not exist on the cluster, so its version could not be determined. The pipeline itself worked and the results were still present after it finished; the error only reports that the trimgalore version could not be recorded in the versions file.
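When a process fails like this, the work directory named in the error holds the hidden files Nextflow writes for each task, including .exitcode and .command.err. A small sketch for looking at them, with the hypothetical function name inspect_task:

```shell
#!/bin/bash
# inspect_task WORKDIR [LINES]
# Hypothetical helper: prints the recorded exit code of a Nextflow task and
# the last LINES lines (default 20) of its captured stderr.
inspect_task() {
    local workdir="$1" lines="${2:-20}"
    if [ -f "$workdir/.exitcode" ]; then
        echo "exit code: $(cat "$workdir/.exitcode")"
    else
        echo "exit code: not recorded"
    fi
    if [ -f "$workdir/.command.err" ]; then
        echo "--- last $lines lines of .command.err ---"
        tail -n "$lines" "$workdir/.command.err"
    fi
}

# Example (not executed here):
#   inspect_task /nobackup/tomato_genome/alt_splicing/SRP252265/work/1e/a6b39f6fb5d57e905cdc43045523cd
```

The script that the task actually ran is in .command.sh in the same directory.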

We also created a batch script to make runs easier (not yet confirmed to work):

#!/bin/bash

#SBATCH --time=296:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --mem=401gb
#SBATCH --job-name=nnnnfcore-splice
#SBATCH --partition=Draco
#SBATCH --output=rrr-%x.%j.out
#SBATCH --error=rrr-%x.%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=mdavi258@uncc.edu

umask 007
set -eu

#file=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p" /nobackup/tomato_genome/rnaseq-phase1/trinity/names.txt)
#dir=$(sed -n -e "${PBS_ARRAYID}p" /lustre/groups/lorainelab/data/illumina/sweet_potato/filtered/dir.txt)

echo "Launching Nextflow NF-Core"

#module load java8 
module load singularity
module load nf-core

##Always use slurm
export NXF_EXECUTOR=slurm
##Save Singularity containers in home dir
export NXF_SINGULARITY_CACHEDIR="$HOME/nxf"
## Tell nextflow not to use internet
export NXF_OFFLINE='TRUE'
## Control Java Heap Size
export NXF_OPTS="-Xms2g -Xmx8g"

###Change to correct directory with fastq and symbolic link files
cd /nobackup/tomato_genome/alt_splicing/SRP328042-molly

#makeblastdb -in  /projects/tomato_genome/db/reference-4.0/ITAG4.0_cDNA.fasta -dbtype nucl

#nextflow run nf-core/rnaseq -profile ../../test,singularity

#/nobackup/tomato_genome/nfcore_rnaseq/tomato.config



###Change to correct names in directory
nextflow run nf-core-rnaseq-3.4/workflow \
         -resume \
         -profile singularity \
         -c tomato.config \
         --aligner star_salmon \
         --save_trimmed \
         --fasta S_lycopersicum_Jun_2022.fa \
         --input SRP328042.csv \
         --gtf S_lycopersicum_Jun_2022.gtf  \
         --gene_bed S_lycopersicum_Jun_2022.bed \
         --skip_biotype_qc \
         --skip_markduplicates \
         --skip_bigwig \
         --skip_stringtie \
         --skip_qualimap \
         --skip_fastqc


echo "Nextflow Pipeline Finished"

Troubleshooting:

Need to figure out whether the script would work better on the Andromeda partition than on Draco.
Running via the batch script seems to produce more random, hard-to-explain errors than running interactively with srun, which feels more organized and precise.


Molly Davis added a comment - 15/Aug/22 11:22 AM

Nextflow nf-core/rnaseq pipeline update:

SRP328042 successfully ran through the nf-core pipeline.
Working Directory:
/nobackup/tomato_genome/alt_splicing/SRP328042-molly

Troubleshooting: I skipped custom_dumpsoftwareversions in the doIt.sh script:
nextflow run nf-core-rnaseq-3.4/workflow \
-resume \
-profile singularity \
-c $CONFIG \
--aligner star_salmon \
--save_trimmed \
--fasta $GENOMEFASTA \
--input $SAMPLESHEET \
--gtf $GTF \
--gene_bed $GENEBED \
--skip_biotype_qc \
--skip_markduplicates \
--skip_bigwig \
--skip_stringtie \
--skip_qualimap \
--skip_fastqc \
--skip_custom_dumpsoftwareversions
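The variables in this command suggest that doIt.sh maps its positional arguments to names. The sketch below is a guess at that mapping, based only on the argument order used when doIt.sh was invoked earlier in this ticket (samplesheet, genome fasta, GTF, BED, config); it is not the actual script. The function name nextflow_command is hypothetical, and it prints the command instead of running it so the result can be checked first. Any options after the fifth argument are appended unchanged.

```shell
#!/bin/bash
# nextflow_command SAMPLESHEET GENOMEFASTA GTF GENEBED CONFIG [options...]
# Hypothetical helper: prints the nextflow command line built from the
# five positional arguments, followed by any extra options as given.
nextflow_command() {
    if [ "$#" -lt 5 ]; then
        echo "usage: nextflow_command SAMPLESHEET GENOMEFASTA GTF GENEBED CONFIG [options...]" >&2
        return 1
    fi
    local SAMPLESHEET="$1" GENOMEFASTA="$2" GTF="$3" GENEBED="$4" CONFIG="$5"
    shift 5
    echo nextflow run nf-core-rnaseq-3.4/workflow \
        -resume \
        -profile singularity \
        -c "$CONFIG" \
        --aligner star_salmon \
        --save_trimmed \
        --fasta "$GENOMEFASTA" \
        --input "$SAMPLESHEET" \
        --gtf "$GTF" \
        --gene_bed "$GENEBED" \
        "$@"
}

# Example (not executed here):
#   nextflow_command SRP328042.csv S_lycopersicum_Jun_2022.fa \
#       S_lycopersicum_Jun_2022.gtf S_lycopersicum_Jun_2022.bed tomato.config \
#       --skip_biotype_qc --skip_fastqc
```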


Ann Loraine added a comment - 26/Aug/22 12:51 PM

During our meeting today, we looked over the files. Everything looks great, so moving this ticket to Done.

Summary:

Data processing output files are on the HPC cluster here:

/nobackup/tomato_genome/alt_splicing/SRP328042-molly
/nobackup/tomato_genome/alt_splicing/SRP252265


Ann Loraine added a comment - 17/Nov/22 12:46 PM

Planning to align data from two new possible controls: SRP100604 and SRP268884
