  1. IGB
  2. IGBF-2945

Run trimmomatic on HPC system using nextflow and Singularity

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      To make our HPC data processing easier and more robust, we are exploring using the nextflow workflow management system in conjunction with Singularity containers.

      For this task, develop a nextflow script that runs "trimmomatic" on all fastq files in a directory.

        Attachments

          Issue Links

            Activity

            ann.loraine Ann Loraine created issue -
            ann.loraine Ann Loraine made changes -
            Field Original Value New Value
            Epic Link IGBF-2323 [ 18477 ]
            ann.loraine Ann Loraine made changes -
            Link This issue relates to IGBF-2909 [ IGBF-2909 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            ann.loraine Ann Loraine added a comment -

            From RR:

            The command can be found in this script:
            /projects/tomato_genome/scripts/rob/trim-phase2.slurm
            
            
            The command from that script is below:
            
            file=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p"   /projects/tomato_genome/rnaseq/phase2-rnaseq-Sep2021/halffile.txt)
            
            module load trimmomatic
            
            java -jar /apps/pkg/trimmomatic/0.39/Trimmomatic-0.39/trimmomatic-0.39.jar \
            PE -summary summary-${file}.txt -validatePairs  /projects/tomato_genome/rnaseq/phase2-rnaseq-Sep2021/${file}_R1_001.fastq.gz \
            /projects/tomato_genome/rnaseq/phase2-rnaseq-Sep2021//${file}_R2_001.fastq.gz \
             ${file}-R1_paired.fq  ${file}-R1_unpaired.fq ${file}-R2_paired.fq  ${file}-R2_unpaired.fq \
             ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50
            
            
            
            For nextflow, we'd have 2 file inputs. 4 file outputs.
            But I am hard coding the names file too so that should become an input in nextflow ( /projects/tomato_genome/rnaseq/phase2-rnaseq-Sep2021/halffile.txt).
            
            Rob
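
A possible way to avoid hard coding the names file is to derive the sample names from the fastq files themselves. The sketch below assumes the `_R1_001.fastq.gz` / `_R2_001.fastq.gz` naming shown in the command above; the function name `list_samples` is hypothetical and not part of the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one sample prefix per complete R1/R2 pair
# found in the given directory.
list_samples() {
    local dir="$1"
    local r1 sample
    for r1 in "$dir"/*_R1_001.fastq.gz; do
        # If the glob matched nothing, the literal pattern is returned; skip it.
        [ -e "$r1" ] || continue
        sample="$(basename "$r1" _R1_001.fastq.gz)"
        # Only report samples whose R2 mate is also present.
        if [ -e "$dir/${sample}_R2_001.fastq.gz" ]; then
            printf '%s\n' "$sample"
        fi
    done
}
```

Running `list_samples /projects/tomato_genome/rnaseq/phase2-rnaseq-Sep2021 > halffile.txt` should then produce a names file with one prefix per line, which is the layout the `sed -n -e "${SLURM_ARRAY_TASK_ID}p"` lookup expects.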
            
            ann.loraine Ann Loraine added a comment - Trimmomatic documentation: http://www.usadellab.org/cms/?page=trimmomatic
            ann.loraine Ann Loraine added a comment - - edited
            [aloraine@str-i1 nextflow]$ module load singularity
            [aloraine@str-i1 nextflow]$ singularity pull trimmomatic_v0.39.sif oras://registry.forgemia.inra.fr/gafl/singularity/trimmomatic/trimmomatic:latest
            INFO:    Downloading oras image
            [aloraine@str-i1 nextflow]$ singularity run-help ./trimmomatic_v0.39.sif
            Container for trimmomatic
            A flexible read trimming tool for Illumina NGS data
            http://www.usadellab.org/cms/?page=trimmomatic
            
            Version: 0.39
            Package installation using Miniconda3-4.7.12
            All packages are in /opt/miniconda/bin & are in PATH
            Default runscript: trimmomatic
            
            Usage:
                trimmomatic_v0.39.sif --help
                or:
                singularity exec trimmomatic_v0.39.sif trimmomatic --help
            [aloraine@str-i1 nextflow]$ singularity exec trimmomatic_v0.39.sif trimmomatic --help
            INFO:    Converting SIF file to temporary sandbox...
            Usage: 
                   PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
               or: 
                   SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
               or: 
                   -version
            INFO:    Cleaning up image...
            

            The above worked fine on a head node, but failed with a network error of some kind when run from an interactive session on the Andromeda partition.

            ann.loraine Ann Loraine made changes -
            Comment [ Retrieving trimmomatic singularity image:

            {code}
            singularity pull trimmomatic_v0.39.sif oras://registry.forgemia.inra.fr/gafl/singularity/trimmomatic/trimmomatic:latest
            {code}

            Worked fine on a head node, but failed with a network error of some kind when run from an interactive session on the Andromeda partition.
            ]
            ann.loraine Ann Loraine added a comment -

            Asked about error pulling Singularity container onto a cluster node. Reply:

            "Compute nodes do not have internet access, so Singularity pulls can only occur on the head nodes. So in this case, you'll want to pull the image on the head node into your account so that it's available on the compute nodes."
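
Given that reply, a small guard before launching work on a compute node can catch a missing image early. This is only a sketch: the function name `image_status` and the `IMAGE` variable are hypothetical, and the check is simply whether a non-empty file exists at that path.

```shell
#!/usr/bin/env bash
# Hypothetical guard: report whether the Singularity image is already on disk.
image_status() {
    local image="$1"
    if [ -s "$image" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Default to the image name used in this ticket; override by setting IMAGE.
IMAGE="${IMAGE:-trimmomatic_v0.39.sif}"
if [ "$(image_status "$IMAGE")" = "missing" ]; then
    # Compute nodes have no internet access, so only print the reminder here.
    echo "Image $IMAGE not found; pull it on a head node first:" >&2
    echo "  singularity pull $IMAGE oras://registry.forgemia.inra.fr/gafl/singularity/trimmomatic/trimmomatic:latest" >&2
fi
```

The guard deliberately does not attempt the pull itself, since on a compute node that would fail in the same way as described above.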

            ann.loraine Ann Loraine added a comment -

            Running with the Singularity container on a cluster node:

            nextflow run trim.nf -with-singularity trimmomatic_v0.39.sif 
            

            The first time I did this, I got an error about the work directory not being found, possibly because the container was not mounting the local file system. Adding the following line to "nextflow.config" fixed the problem:

            singularity.autoMounts = true
            

            Can I add this configuration to the nextflow script itself?
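
One option that keeps the setting next to the pipeline script, without editing a shared "nextflow.config", is to write it to a separate config file and pass that file with Nextflow's `-c` option. This is a sketch under that assumption; the file name `singularity.config` is hypothetical.

```shell
#!/usr/bin/env bash
# Write a small config file holding only the auto-mount setting.
# The file name singularity.config is an arbitrary choice for illustration.
cat > singularity.config <<'EOF'
// Let Nextflow bind-mount host paths into the Singularity container.
singularity.autoMounts = true
EOF

# The pipeline could then be launched with, for example:
#   nextflow run trim.nf -c singularity.config -with-singularity trimmomatic_v0.39.sif
```

Keeping the file in the same repository as `trim.nf` means the setting travels with the script even though it lives in a config file.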

            ann.loraine Ann Loraine added a comment - - edited

            Running:

            #!/usr/bin/env nextflow
            
            params.saveMode = 'copy'
            params.filePattern = "/projects/tomato_genome/rnaseq/phase2-rnaseq-Sep2021/*_{R1,R2}_001.fastq.gz"
            params.resultsDir = 'results/trimmomatic'
            
            Channel
                .fromFilePairs( params.filePattern )
                .ifEmpty { error "Cannot find any reads matching: ${params.filePattern}" }
                .set { read_pairs_ch }
                
            process trim {
                time '2h'
                publishDir params.resultsDir, mode: params.saveMode
                
                input:
                tuple val(prefix), file(reads) from read_pairs_ch
            
                output:
                stdout result
            
                script:
            
                fq_1_paired = prefix + '_R1.paired.fastq'
                fq_1_unpaired = prefix + '_R1.unpaired.fastq'
                fq_2_paired = prefix + '_R2.paired.fastq'
                fq_2_unpaired = prefix + '_R2.unpaired.fastq'
            	  
                """
                trimmomatic \
                PE -phred33 \
                ${reads[0]} \
                ${reads[1]} \
                $fq_1_paired \
                $fq_1_unpaired \
                $fq_2_paired \
                $fq_2_unpaired \
                ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50
                printf 'processed sample $prefix with trimmomatic version '
                trimmomatic -version
                """
            }
            
            result.view()
            

            So far it appears to be working OK. But even though a few jobs have finished, there's no "results" directory visible. Not sure why not.

            ann.loraine Ann Loraine added a comment -

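            Answer to above: files created by the process need to be declared in its output block. The script above declares only stdout, so publishDir has no files to copy into the results directory.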

            ann.loraine Ann Loraine added a comment - This was useful: https://gencore.bio.nyu.edu/three-useful-nextflow-patterns-every-computational-biologist-should-know/
            ann.loraine Ann Loraine made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            ann.loraine Ann Loraine made changes -
            Comment [ Final version of the nextflow script:

            {code}
            #!/usr/bin/env nextflow

            // test on one sample: nextflow run trim2.nf --dev -with-singularity trimmomatic_v0.39.sif
            // run all: nextflow run trim2.nf -with-singularity trimmomatic_v0.39.sif
            params.dev = false
            params.number_of_inputs = 1
            params.saveMode = 'copy'
            //params.filePattern = "/projects/tomato_genome/rnaseq/phase2-rnaseq-Sep2021/*_{R1,R2}_001.fastq.gz"
            params.filePattern = "fastq/*_{R1,R2}_001.fastq.gz"
            params.outdir = 'results'

            Channel
                .fromFilePairs( params.filePattern )
                .ifEmpty { error "Cannot find any reads matching: ${params.filePattern}" }
                .take( params.dev ? params.number_of_inputs : -1 )
                .set { read_pairs_ch }

                
            process trim {
                time '2h'

                publishDir "$params.outdir", pattern: '*.fq.gz', mode: 'copy'
                
                input:
                tuple val(prefix), file(reads) from read_pairs_ch

                output:
                file '*.fq.gz'

                script:
                fq_1_paired = prefix + '_R1.p.fq'
                fq_1_unpaired = prefix + '_R1.u.fq'
                fq_2_paired = prefix + '_R2.p.fq'
                fq_2_unpaired = prefix + '_R2.u.fq'
                """
                trimmomatic \
                PE -phred33 \
                ${reads[0]} \
                ${reads[1]} \
                $fq_1_paired \
                $fq_1_unpaired \
                $fq_2_paired \
                $fq_2_unpaired \
                ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50
                gzip $fq_1_paired
                gzip $fq_1_unpaired
                gzip $fq_2_paired
                gzip $fq_2_unpaired
                """
            }
            {code}

            Outputs copied to a results directory. ]
            ann.loraine Ann Loraine added a comment -

            Final version of the script, ready to be merged into the team repository:

            https://bitbucket.org/hotpollen/rna-seq/src/master/src/trimmomatic.nf

            ann.loraine Ann Loraine made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            ann.loraine Ann Loraine made changes -
            Status First Level Review in Progress [ 10301 ] Needs 1st Level Review [ 10005 ]
            ann.loraine Ann Loraine made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            ann.loraine Ann Loraine made changes -
            Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
            ann.loraine Ann Loraine made changes -
            Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
            ann.loraine Ann Loraine made changes -
            Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
            ann.loraine Ann Loraine made changes -
            Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
            ann.loraine Ann Loraine made changes -
            Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
            ann.loraine Ann Loraine made changes -
            Resolution Done [ 10000 ]
            Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]
            ann.loraine Ann Loraine made changes -
            Link This issue blocks IGBF-2949 [ IGBF-2949 ]

              People

              • Assignee:
                ann.loraine Ann Loraine
                Reporter:
                ann.loraine Ann Loraine
              • Votes:
                0
                Watchers:
                1
