  1. IGB
  2. IGBF-3143

Run RNA-Seq data processing pipeline on positive splicing control and experimental samples

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Story Points:
      3
    • Sprint:
      Summer 4 2022 July 4, Summer 5 2022 July 18, Summer 6 2022 Aug 1, Fall 1 2022 Aug 15

      Description

      Data sets to process:

      Positive control: SRP328042. Data are published in this article.
      Experimental: SRP252265

      To-Do:

      • Obtain data in fastq format from the Sequence Read Archive using fasterq-dump with options for paired-end data - DONE
      • Place data into directories named for the SRP number, e.g., SRP328042 and SRP252265, within a directory named "alt_splicing" under "nobackup" - DONE
      • Make a note of the particular commands used to perform the data retrieval (see comment below)
      • Create a "samples" text file listing the SRR fastq files for running the nf-core/rnaseq Nextflow pipeline
      • Run nf-core/rnaseq with the proper maximum intron size parameter, using "tomato.config"
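The "samples" file step can be sketched as a small shell function. This is a minimal sketch, not the commands actually used for this ticket. It assumes the reads are gzipped and named <SRR>_1.fastq.gz and <SRR>_2.fastq.gz, and that the pipeline release in use expects the header sample,fastq_1,fastq_2,strandedness (worth checking against the nf-core/rnaseq documentation for that release). The function name make_samplesheet and the default strandedness value are placeholders.

```shell
# Sketch: print an nf-core/rnaseq style sample sheet for paired-end reads.
# Assumptions (not from the ticket): file naming <SRR>_1.fastq.gz / <SRR>_2.fastq.gz,
# and a placeholder default strandedness that must be set to match the library.
make_samplesheet() {
    local fastq_dir="$1"                   # directory holding the fastq files
    local strandedness="${2:-unstranded}"  # unstranded | forward | reverse
    local r1 r2 sample

    echo "sample,fastq_1,fastq_2,strandedness"
    for r1 in "$fastq_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue           # the glob matched nothing
        r2="${r1%_1.fastq.gz}_2.fastq.gz"  # expected mate file
        if [ ! -e "$r2" ]; then
            echo "warning: no mate file for $r1, skipping" >&2
            continue
        fi
        sample="$(basename "$r1" _1.fastq.gz)"
        echo "$sample,$r1,$r2,$strandedness"
    done
}
```

For example, make_samplesheet "$PWD" reverse > SRP328042.csv would write one row per read pair found in the current directory.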

      Notes:

      Methods used to create positive control RNA-Seq data from SRP328042, according to the paper:

      2.5.2. Preparation of RNA-Seq Library and Sequencing

      Total RNA was extracted utilizing Trizol reagent (Invitrogen, Waltham, MA, USA). RNA quantity and quality were determined by NanoDrop 1000 spectrophotometer (Thermo Scientific Inc., Waltham, MA, USA), 1% agarose gel electrophoresis and Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). Following the protocol described by [28], strand-specific RNA-Seq libraries from 3 biological replicates for each group from WW and DS anthers were prepared using 1 ng/µL of total RNA sample and sequenced by Novogene Biotech (Beijing, China) on Illumina HiSeq 4000 system (Illumina, Inc., San Diego, CA, USA) according to the manufacturer’s instructions. The raw sequence reads were deposited into NCBI Sequence Read Archive under accession the number PRJNA746070.

      A PDF copy of the protocol paper (reference 28) for RNA-Seq library synthesis is attached.

            Activity

            ann.loraine Ann Loraine added a comment - - edited

            Same error occurred. Ann, Rob, and Molly met via Zoom to troubleshoot. Rob suggested trying to run the pipeline using the older configuration and older data, since that worked. Rob & Molly are proceeding with troubleshooting as I am out of ideas!

            Mdavis4290 Molly Davis added a comment - - edited

            Nextflow nf-core/rnaseq pipeline update:

            Directory Location: /nobackup/tomato_genome/alt_splicing/SRP252265

            Dr. Reid and I successfully ran the pipeline with SRP252265 for out.4.txt:

            [nf-core/rnaseq] Pipeline completed successfully
            Completed at: 02-Aug-2022 16:48:38
            Duration : 1m 38s
            CPU hours : 481.7 (100% cached)
            Succeeded : 2
            Cached : 828

            Solution: Added "module load singularity" to the steps, removed the symbolic link for nf-core-rnaseq-3.4, and downloaded the pipeline files directly into the directory while not in an interactive session:

            module load nf-core
            nf-core download rnaseq -r 3.4

            The only symbolic links present were for the file resources:

            S_lycopersicum_Jun_2022.bed
            S_lycopersicum_Jun_2022.fa
            S_lycopersicum_Jun_2022.gtf
            tomato.config
            doIt.sh
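Because a stale symbolic link was part of the problem, a quick pre-flight check on the linked resource files may help before launching. The following is only a sketch; the function name check_resources is hypothetical and not part of the existing setup.

```shell
# Sketch: report whether each expected resource file exists in a directory.
# "test -e" follows symbolic links, so a dangling link is reported as MISSING.
check_resources() {
    local dir="$1"
    shift
    local missing=0 name
    for name in "$@"; do
        if [ -e "$dir/$name" ]; then
            echo "ok       $name"
        else
            echo "MISSING  $name"
            missing=1
        fi
    done
    return "$missing"    # non-zero when at least one file is absent
}
```

It could be called as check_resources . S_lycopersicum_Jun_2022.bed S_lycopersicum_Jun_2022.fa S_lycopersicum_Jun_2022.gtf tomato.config doIt.sh from the run directory.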

            Steps for successful run:

            1. Add .bash_profile configurations
            2. Start a tmux session (tmux new -s base)
            3. Start an interactive session (srun --partition Andromeda --cpus-per-task 16 --mem-per-cpu 12000 --time 60:00:00 --pty bash)
            4. Load the required modules (module load nf-core, module load singularity)
            5. Start Nextflow (./doIt.sh SRP252265.csv S_lycopersicum_Jun_2022.fa S_lycopersicum_Jun_2022.gtf S_lycopersicum_Jun_2022.bed tomato.config 1> out.0.txt 2> err.0.txt )
            6. Check the output file and the results directory
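The last step can be partly automated by searching the captured standard output for the completion banner. This is a minimal sketch; the function name pipeline_succeeded is hypothetical, and the search string is taken from the banner text quoted earlier in this comment.

```shell
# Sketch: succeed (exit 0) if a captured Nextflow log contains the
# nf-core completion banner, fail otherwise.
pipeline_succeeded() {
    local log="$1"
    # -q: print nothing, only set the exit status; -F: match a fixed string
    grep -qF "Pipeline completed successfully" "$log"
}
```

For example: pipeline_succeeded out.0.txt && echo "run finished cleanly".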

            An error that may arise:
            Error executing process > 'NFCORE_RNASEQ:RNASEQ:CUSTOM_DUMPSOFTWAREVERSIONS (1)'

            Caused by:
            Process `NFCORE_RNASEQ:RNASEQ:CUSTOM_DUMPSOFTWAREVERSIONS (1)` terminated with an error exit status (1)

            Work dir:
            /nobackup/tomato_genome/alt_splicing/SRP252265/work/1e/a6b39f6fb5d57e905cdc43045523cd

            Line 52 in versions file:
            TRIMGALORE:
            trimgalore: perl: error while loading shared libraries: libperl.so: cannot open shared object file: No such file or directory

            Discussion: The error above comes from the step that writes a file listing the versions of all the tools used during the pipeline run. Line 52 of that versions file is the entry for Trim Galore. Trim Galore itself ran successfully using the pipeline's own resources, but the corresponding module did not exist on the cluster. So the pipeline worked and the results were still there after it finished; the error was only telling us that the Trim Galore version could not be found and recorded in the file.
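When a process fails like this, the work directory named in the error message can be inspected directly. The sketch below relies on the hidden files Nextflow normally leaves in each task directory (.exitcode and .command.err); the function name inspect_task and the default of 20 lines are illustrative choices.

```shell
# Sketch: summarize a failed Nextflow task from its work directory.
inspect_task() {
    local workdir="$1"
    local lines="${2:-20}"    # how many trailing stderr lines to show
    if [ -f "$workdir/.exitcode" ]; then
        echo "exit code: $(cat "$workdir/.exitcode")"
    else
        echo "exit code: (no .exitcode file)"
    fi
    if [ -s "$workdir/.command.err" ]; then
        echo "--- last $lines lines of .command.err ---"
        tail -n "$lines" "$workdir/.command.err"
    fi
}
```

For the failure above this would be run as inspect_task /nobackup/tomato_genome/alt_splicing/SRP252265/work/1e/a6b39f6fb5d57e905cdc43045523cd.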

            We also created a batch script to make runs easier (not yet confirmed to work):

            #!/bin/bash
            
            #SBATCH --time=296:30:00
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=24
            #SBATCH --mem=401gb
            #SBATCH --job-name=nnnnfcore-splice
            #SBATCH --partition=Draco
            #SBATCH --output=rrr-%x.%j.out
            #SBATCH --error=rrr-%x.%j.err
            #SBATCH --mail-type=END,FAIL
            #SBATCH --mail-user=mdavi258@uncc.edu
            
            umask 007
            set -eu
            
            #file=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p" /nobackup/tomato_genome/rnaseq-phase1/trinity/names.txt)
            #dir=$(sed -n -e "${PBS_ARRAYID}p" /lustre/groups/lorainelab/data/illumina/sweet_potato/filtered/dir.txt)
            
            echo "Launching Nextflow NF-Core"
            
            #module load java8 
            module load singularity
            module load nf-core
            
            ##Always use slurm
            export NXF_EXECUTOR=slurm
            ##Save Singularity containers in home dir
            export NXF_SINGULARITY_CACHEDIR="$HOME/nxf"
            ## Tell nextflow not to use internet
            export NXF_OFFLINE='TRUE'
            ## Control Java Heap Size
            export NXF_OPTS="-Xms2g -Xmx8g"
            
            ###Change to correct directory with fastq and symbolic link files
            cd /nobackup/tomato_genome/alt_splicing/SRP328042-molly
            
            #makeblastdb -in  /projects/tomato_genome/db/reference-4.0/ITAG4.0_cDNA.fasta -dbtype nucl
            
            #nextflow run nf-core/rnaseq -profile ../../test,singularity
            
            #/nobackup/tomato_genome/nfcore_rnaseq/tomato.config
            
            
            
            ###Change to correct names in directory
            nextflow run nf-core-rnaseq-3.4/workflow \
                     -resume \
                     -profile singularity \
                     -c tomato.config \
                     --aligner star_salmon \
                     --save_trimmed \
                     --fasta S_lycopersicum_Jun_2022.fa \
                     --input SRP328042.csv \
                     --gtf S_lycopersicum_Jun_2022.gtf  \
                     --gene_bed S_lycopersicum_Jun_2022.bed \
                     --skip_biotype_qc \
                     --skip_markduplicates \
                     --skip_bigwig \
                     --skip_stringtie \
                     --skip_qualimap \
                     --skip_fastqc
            
            
            echo "Nextflow Pipeline Finished"
            

            Troubleshooting:

            • Need to figure out whether it would work better on the Andromeda partition than on Draco.
            • Using the batch script seems to produce more random errors that do not make sense than using srun, which feels like a more organized and precise way to run.
            Mdavis4290 Molly Davis added a comment -

            Nextflow nf-core/rnaseq pipeline update:

            SRP328042 successfully ran through the nf-core pipeline.
            Working Directory:
            /nobackup/tomato_genome/alt_splicing/SRP328042-molly

            Troubleshooting: I skipped custom_dumpsoftwareversions in the doIt.sh script:
            nextflow run nf-core-rnaseq-3.4/workflow \
            -resume \
            -profile singularity \
            -c $CONFIG \
            --aligner star_salmon \
            --save_trimmed \
            --fasta $GENOMEFASTA \
            --input $SAMPLESHEET \
            --gtf $GTF \
            --gene_bed $GENEBED \
            --skip_biotype_qc \
            --skip_markduplicates \
            --skip_bigwig \
            --skip_stringtie \
            --skip_qualimap \
            --skip_fastqc \
            --skip_custom_dumpsoftwareversions
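The full doIt.sh is not reproduced in this ticket, so the following is only a guess at how its positional arguments might map onto the variables used above. It assumes the argument order from the earlier invocation (sample sheet, genome FASTA, GTF, gene BED, config). The function name build_rnaseq_cmd is hypothetical, and the skip options are left out for brevity. It prints the command instead of running it, so it can be checked on a machine without Nextflow.

```shell
# Sketch: map five positional arguments onto the variables used in the
# nextflow command and print the resulting command line (dry run).
# Assumed argument order: SAMPLESHEET GENOMEFASTA GTF GENEBED CONFIG.
build_rnaseq_cmd() {
    if [ "$#" -ne 5 ]; then
        echo "usage: build_rnaseq_cmd SAMPLESHEET GENOMEFASTA GTF GENEBED CONFIG" >&2
        return 2
    fi
    local SAMPLESHEET="$1" GENOMEFASTA="$2" GTF="$3" GENEBED="$4" CONFIG="$5"

    # Rebuild the function's own argument list as the command to run.
    set -- nextflow run nf-core-rnaseq-3.4/workflow \
        -resume \
        -profile singularity \
        -c "$CONFIG" \
        --aligner star_salmon \
        --save_trimmed \
        --fasta "$GENOMEFASTA" \
        --input "$SAMPLESHEET" \
        --gtf "$GTF" \
        --gene_bed "$GENEBED"

    echo "$*"    # replace with "$@" to actually launch the pipeline
}
```

Printing first makes it easy to confirm that each file landed in the intended option before committing cluster time.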

            ann.loraine Ann Loraine added a comment -

            During our meeting today, we looked over the files. Everything looks great, so moving this ticket to Done.

            Summary:

            Data processing output files are on the HPC cluster here:

            • /nobackup/tomato_genome/alt_splicing/SRP328042-molly
            • /nobackup/tomato_genome/alt_splicing/SRP252265
            ann.loraine Ann Loraine added a comment -

            Planning to align data from two new possible controls: SRP100604 and SRP268884


              People

              • Assignee:
                Mdavis4290 Molly Davis
                Reporter:
                ann.loraine Ann Loraine