      To make our HPC data processing easier and more robust, we are exploring the Nextflow workflow management system in conjunction with Singularity containers.

      For this task, develop a Nextflow script that runs Trimmomatic on all FASTQ files in a directory.


            ann.loraine Ann Loraine added a comment -

            From RR:

            The command can be found in this script:
            The command from that script is below:
            file=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p"   /projects/tomato_genome/rnaseq/phase2-rnaseq-Sep2021/halffile.txt)
            module load trimmomatic
            java -jar /apps/pkg/trimmomatic/0.39/Trimmomatic-0.39/trimmomatic-0.39.jar \
            PE -summary summary-${file}.txt -validatePairs  /projects/tomato_genome/rnaseq/phase2-rnaseq-Sep2021/${file}_R1_001.fastq.gz \
            /projects/tomato_genome/rnaseq/phase2-rnaseq-Sep2021//${file}_R2_001.fastq.gz \
             ${file}-R1_paired.fq  ${file}-R1_unpaired.fq ${file}-R2_paired.fq  ${file}-R2_unpaired.fq \
            For Nextflow, we'd have two file inputs and four file outputs.
            I am also hard-coding the path to the sample names file (/projects/tomato_genome/rnaseq/phase2-rnaseq-Sep2021/halffile.txt), so that should become a Nextflow input as well.
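The mapping described above (two inputs and four outputs per sample, with sample names no longer hard-coded) can be sketched in shell. This is only an illustration: the function names `list_samples` and `trim_io` are hypothetical, and it assumes every `*_R1_001.fastq.gz` file in the directory should be processed, which may not match the contents of halffile.txt.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not from the original script.

# Print one sample prefix per line, derived from the *_R1_001.fastq.gz files
# in the given directory (stands in for the hard-coded names file lookup).
list_samples() {
    local dir="$1" f
    for f in "$dir"/*_R1_001.fastq.gz; do
        [ -e "$f" ] || continue   # glob matched nothing
        basename "$f" _R1_001.fastq.gz
    done
}

# Print the two input paths and four output names for one sample,
# in the order Trimmomatic PE expects them.
trim_io() {
    local dir="$1" sample="$2"
    printf '%s\n' \
        "$dir/${sample}_R1_001.fastq.gz" \
        "$dir/${sample}_R2_001.fastq.gz" \
        "${sample}-R1_paired.fq" "${sample}-R1_unpaired.fq" \
        "${sample}-R2_paired.fq" "${sample}-R2_unpaired.fq"
}
```

Calling `trim_io "$dir" "$sample"` for each line printed by `list_samples "$dir"` yields the six file arguments that a Nextflow process would declare as its inputs and outputs.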
            ann.loraine Ann Loraine added a comment -

            Trimmomatic documentation: http://www.usadellab.org/cms/?page=trimmomatic

            ann.loraine Ann Loraine added a comment - - edited
            [aloraine@str-i1 nextflow]$ module load singularity
            [aloraine@str-i1 nextflow]$ singularity pull trimmomatic_v0.39.sif oras://registry.forgemia.inra.fr/gafl/singularity/trimmomatic/trimmomatic:latest
            INFO:    Downloading oras image
            [aloraine@str-i1 nextflow]$ singularity run-help ./trimmomatic_v0.39.sif
            Container for trimmomatic
            A flexible read trimming tool for Illumina NGS data
            Version: 0.39
            Package installation using Miniconda3-4.7.12
            All packages are in /opt/miniconda/bin & are in PATH
            Default runscript: trimmomatic
                trimmomatic_v0.39.sif --help
                singularity exec trimmomatic_v0.39.sif trimmomatic --help
            [aloraine@str-i1 nextflow]$ singularity exec trimmomatic_v0.39.sif trimmomatic --help
            INFO:    Converting SIF file to temporary sandbox...
                   PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
                   SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
            INFO:    Cleaning up image...

            The above worked fine on a head node, but failed with a network error of some kind when run from an interactive session on an Andromeda partition.

            ann.loraine Ann Loraine added a comment -

            Asked about error pulling Singularity container onto a cluster node. Reply:

            "Compute nodes do not have internet access, so Singularity pulls can only occur on the head nodes. So in this case, you'll want to pull the image on the head node into your account so that it's available on the compute nodes."
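Since the image has to be pulled in advance, a small guard can catch a missing image before any jobs are launched on a compute node. This is a minimal sketch; the function name `require_image` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: require_image is a hypothetical helper.

# Succeed if the Singularity image file exists locally; otherwise print a
# hint to stderr and return non-zero.
require_image() {
    local image="$1"
    if [ ! -f "$image" ]; then
        echo "Image $image not found; pull it on a head node first" >&2
        return 1
    fi
}
```

On a compute node this could be chained in front of the run, for example `require_image trimmomatic_v0.39.sif && nextflow run trim.nf -with-singularity trimmomatic_v0.39.sif`.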

            ann.loraine Ann Loraine added a comment -

            Running with the Singularity container on a cluster node:

            nextflow run trim.nf -with-singularity trimmomatic_v0.39.sif 

            The first time I did this, I got an error about the work directory not being found, possibly because the container was not mounting the local file system. Adding the following line to "nextflow.config" fixed the problem:

            singularity.autoMounts = true

            Can I add this configuration to the nextflow script itself?
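One option that keeps the setting with the pipeline without putting it in the script: Nextflow reads a `nextflow.config` file from the launch directory and from the directory containing the pipeline script, so the file can be committed next to trim.nf. The sketch below writes that file; the function name `write_singularity_config` is hypothetical, and it overwrites any existing nextflow.config in the target directory.

```shell
#!/usr/bin/env bash
# Sketch only: write_singularity_config is a hypothetical helper.

# Write a nextflow.config containing the autoMounts setting into the given
# directory (default: current directory). Overwrites an existing file.
write_singularity_config() {
    local dir="${1:-.}"
    cat > "$dir/nextflow.config" <<'EOF'
// Bind host paths into the container automatically
singularity.autoMounts = true
EOF
}
```

Running `write_singularity_config .` in the directory that holds trim.nf produces the same one-line configuration described above.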

            ann.loraine Ann Loraine added a comment - - edited


            #!/usr/bin/env nextflow

            params.saveMode = 'copy'
            params.filePattern = "/projects/tomato_genome/rnaseq/phase2-rnaseq-Sep2021/*_{R1,R2}_001.fastq.gz"
            params.resultsDir = 'results/trimmomatic'

            // Emit one (prefix, [R1, R2]) tuple per read pair matching the pattern
            Channel
                .fromFilePairs( params.filePattern )
                .ifEmpty { error "Cannot find any reads matching: ${params.filePattern}" }
                .set { read_pairs_ch }

            process trim {
                time '2h'
                publishDir params.resultsDir, mode: params.saveMode

                input:
                tuple val(prefix), file(reads) from read_pairs_ch

                output:
                stdout result

                script:
                // Output file names are derived from the sample prefix
                fq_1_paired = prefix + '_R1.paired.fastq'
                fq_1_unpaired = prefix + '_R1.unpaired.fastq'
                fq_2_paired = prefix + '_R2.paired.fastq'
                fq_2_unpaired = prefix + '_R2.unpaired.fastq'
                """
                trimmomatic \
                PE -phred33 \
                ${reads[0]} \
                ${reads[1]} \
                $fq_1_paired \
                $fq_1_unpaired \
                $fq_2_paired \
                $fq_2_unpaired \
                ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50
                printf 'processed sample $prefix with trimmomatic version '
                trimmomatic -version
                """
            }

            So far it appears to be working OK. But even though a few jobs have finished, there's no "results" directory visible. Not sure why not.

            ann.loraine Ann Loraine added a comment -

            Answer to the above: publishDir only publishes files that are declared in the process's output block, so the four trimmed FASTQ files need to be declared there.
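A quick way to confirm the fix is to check, after a run, that all four files for a sample are present in the results directory. This is a minimal sketch that assumes the file naming used in the trim process (`_R1.paired.fastq`, `_R1.unpaired.fastq`, `_R2.paired.fastq`, `_R2.unpaired.fastq`); the function name `missing_outputs` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: missing_outputs is a hypothetical helper.

# Print the name of each expected Trimmomatic output file for a sample
# prefix that is NOT present in the results directory. Prints nothing
# when all four files exist.
missing_outputs() {
    local results_dir="$1" prefix="$2" suffix
    for suffix in _R1.paired.fastq _R1.unpaired.fastq \
                  _R2.paired.fastq _R2.unpaired.fastq; do
        [ -e "$results_dir/$prefix$suffix" ] || echo "$prefix$suffix"
    done
}
```

For example, `missing_outputs results/trimmomatic SAMPLE` (with `SAMPLE` standing in for a real prefix) lists whichever of the four files have not been published.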

            ann.loraine Ann Loraine added a comment -

            Final version of the script, ready to be merged into the team repository:


