Details
- Type: Task
- Status: Closed
- Priority: Major
- Resolution: Done
- Affects Version/s: None
- Fix Version/s: None
- Labels: None
- Story Points: 2
- Sprint: Spring 1 2022 Jan 3 - Jan 14, Spring 2 2022 Jan 18 - Jan 28, Spring 3 2022 Jan 31 - Feb 11, Spring 4 2022 Feb 14 - Feb 25, Spring 5 2022 Feb 28 - Mar 11
Description
Align and process RNA-Seq data using nf-core/rnaseq pipeline
Attachments
Issue Links
- relates to: IGBF-3099 Analyze differential expression using RNA-Seq (To-Do)
Activity
Started downloading the latest nf-core/rnaseq pipeline code into /nobackup/lorainelab/salty_rice/rna-seq.
Version 3.4 of the rnaseq pipeline is available. Downloading the rnaseq pipeline for offline use, since the cluster nodes do not allow connections to the internet.
Invoked:
module load nf-core
Then:
nf-core download rnaseq
However, when I selected the "singularity" option, I got an error about a directory not being available. Re-ran the download command, this time without selecting that option.
There is now a new rnaseq pipeline directory: /nobackup/lorainelab/salty_rice/rna-seq/nf-core-rnaseq-3.5
Made symbolic links to sample sheet samplesheet_RNA-Seq.csv located in my home directory in my cloned copy of bseq_rice repository.
Tried running version 3.5 and got this error:
Nextflow version 21.04.0 does not match workflow required version: >=21.10.3
Installing previous version 3.4.
None of the errors seen when installing version 3.5 occurred with 3.4. Not sure why. Proceeding with version 3.4 instead of 3.5.
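The version mismatch reported for 3.5 can be checked before launching a run. A minimal sketch, assuming plain dotted numeric version strings like the ones in the error message; the function names are illustrative and are not part of Nextflow or nf-core:

```python
def parse_version(text):
    """Turn a dotted version string such as '21.10.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, required):
    """Return True when the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


# Values taken from the error message: installed 21.04.0, required >=21.10.3.
print(meets_minimum("21.04.0", "21.10.3"))  # False
```

Comparing tuples of integers avoids the string-comparison trap where "21.4" would sort after "21.10".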
Testing with:
nextflow run nf-core-rnaseq-3.4/workflow -profile test,singularity -c nfcore-rnaseq.config
Forgot to include the genomic sequence data in the local directory.
Need to make the gtf and fa files.
Edited sample sheet. Columns 2 and 3 need to contain the name of the file. Added extension fastq.gz to each value.
When I tried to run, got this error:
Command output: ERROR: Please check samplesheet header -> SRA run identifier,fastq_1,fastq_2,strandedness,genotype,treatment,5-Azacytidine,tissue,replicate,read length != sample,fastq_1,fastq_2,strandedness
Fixing the header to match spec: https://nf-co.re/rnaseq/usage#samplesheet-input
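The header check can be reproduced locally before submitting a run. This is a sketch only: it assumes, based on the error message, that the first four columns must be exactly sample, fastq_1, fastq_2, strandedness, and the function name is made up for illustration rather than taken from the pipeline:

```python
import csv
import io

# The four columns named on the right-hand side of the error message.
REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def header_problems(csv_text):
    """Return a list of problems with the sample sheet header (empty if none)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    found = header[:len(REQUIRED_COLUMNS)]
    if found != REQUIRED_COLUMNS:
        return ["expected %s but found %s" % (REQUIRED_COLUMNS, found)]
    return []


# The original header from the error message fails the check.
old_header = ("SRA run identifier,fastq_1,fastq_2,strandedness,genotype,"
              "treatment,5-Azacytidine,tissue,replicate,read length\n")
print(header_problems(old_header))
```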
New error:
WARN: Process 'NFCORE_RNASEQ:RNASEQ:MULTIQC_TSV_FAIL_MAPPED' cannot be executed by 'slurm' executor -- Using 'local' executor instead
WARN: Process 'NFCORE_RNASEQ:RNASEQ:MULTIQC_TSV_STRAND_CHECK' cannot be executed by 'slurm' executor -- Using 'local' executor instead
ERROR: Please check input samplesheet -> Read 1 FastQ file does not exist! SRR7591232_1.fastq.gz
Sample sheet includes the _1 suffix even for samples that are not paired-end. Fixing this.
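A quick way to catch this kind of naming problem is to confirm that every file named in the sample sheet exists before launching. A minimal sketch, assuming the sample sheet uses the fastq_1 and fastq_2 column names and that file names are relative to one directory; the function name is hypothetical:

```python
import csv
import os


def missing_fastq_files(samplesheet_path, fastq_dir):
    """List FASTQ file names from the sample sheet that are absent from fastq_dir."""
    missing = []
    with open(samplesheet_path, newline="") as handle:
        for row in csv.DictReader(handle):
            for column in ("fastq_1", "fastq_2"):
                name = (row.get(column) or "").strip()
                # An empty fastq_2 is taken to mean the sample is single-end.
                if name and not os.path.exists(os.path.join(fastq_dir, name)):
                    missing.append(name)
    return missing
```

Called on the sample sheet and the directory holding the FASTQ files, it returns the names that would trigger the "does not exist" error.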
nf-core/rnaseq is now running.
Edits to sample sheet checked in. Script doIt.sh used to run the pipeline also added to the repo.
New error:
Command exit status: 255
Command output:
rsem-extract-reference-transcripts rsem/genome 0 O_sativa_japonica_Oct_2011_genes.gtf None 0 rsem/O_sativa_japonica_Oct_2011.fa
"rsem-extract-reference-transcripts rsem/genome 0 O_sativa_japonica_Oct_2011_genes.gtf None 0 rsem/O_sativa_japonica_Oct_2011.fa" failed! Plase check if you provide correct parameters/options for the pipeline!
Command error:
INFO: Converting SIF file to temporary sandbox...
WARNING: Skipping mount /usr/local/singularity/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
The GTF file might be corrupted!
Stop at line : Chr1 NA exon 2903 3268 . + . transcript_id "LOC_Os01g01010.1";
Error Message: Cannot find gene_id!
INFO: Cleaning up image...
Work dir: /nobackup/lorainelab/salty_rice/rna-seq/work/51/a375c11f7ab0a69457e646f29f0f25
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
Re-running bed2gtf.py with gene_id option:
bed2gtf.py -g 13 O_sativa_japonica_Oct_2011.bed | grep -v ChrUn | grep -v ChrSy > O_sativa_japonica_Oct_2011.gtf
bed2gtf.py version:
34eac2e (HEAD -> master, upstream/master) Improve GTF to/from BED conversion
Added output to bseq_rice repo in ExternalDataSets/for_nf-core
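Before relaunching, the regenerated GTF can be scanned for lines that would trigger the same "Cannot find gene_id!" error. A minimal sketch, assuming tab-separated GTF with attributes in the ninth column; the function name and the gene_id value in the second example line are made up for illustration:

```python
import re

# Matches an attribute of the form: gene_id "some value"
GENE_ID = re.compile(r'gene_id "[^"]+"')


def lines_missing_gene_id(gtf_lines):
    """Return (line_number, line) pairs for feature lines lacking a gene_id attribute."""
    bad = []
    for number, line in enumerate(gtf_lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        attributes = fields[8] if len(fields) >= 9 else ""
        if not GENE_ID.search(attributes):
            bad.append((number, line))
    return bad


example = [
    # The line RSEM stopped at, rebuilt with tabs.
    'Chr1\tNA\texon\t2903\t3268\t.\t+\t.\ttranscript_id "LOC_Os01g01010.1";\n',
    # Same feature with a gene_id attribute (hypothetical value).
    'Chr1\tNA\texon\t2903\t3268\t.\t+\t.\tgene_id "LOC_Os01g01010"; transcript_id "LOC_Os01g01010.1";\n',
]
print(lines_missing_gene_id(example))
```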
Asked NF to modify column 1 of the nf-core/rnaseq sample sheet so that each sample name is built by concatenating the sample type codes from the subsequent columns.
Relaunched the nf-core/rnaseq version 3.4 pipeline using the new gtf file and the new sample sheet file.
Got a new error affecting A.C.Y.S.3 (reported twice with identical output):
>>>>> Now validing the length of the 2 paired-end infiles: A.C.Y.S.3_1_trimmed.fq.gz and A.C.Y.S.3_2_trimmed.fq.gz <<<<<
Writing validated paired-end Read 1 reads to A.C.Y.S.3_1_val_1.fq.gz
Writing validated paired-end Read 2 reads to A.C.Y.S.3_2_val_2.fq.gz
Read 2 output is truncated at sequence count: 23469680, please check your paired-end input files! Terminating...
INFO: Cleaning up image...
Work dir: /nobackup/lorainelab/salty_rice/rna-seq/work/c7/25844b7fa5600e4c60222407461493
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
NF recommends discarding R2 for the problem sample.
Deleted R2 file for A.C.Y.S.3. Re-running the pipeline.
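Truncated or mismatched read files like this one can be spotted before a long pipeline run by counting records in each file of a pair. A rough sketch using only the standard library, assuming ordinary four-line FASTQ records; the function names are illustrative and this is not how Trim Galore performs its own validation:

```python
import gzip


def count_fastq_records(path):
    """Count 4-line FASTQ records in a gzipped file.

    Returns (records, intact). intact is False when the gzip stream ends
    early or the line count is not a multiple of four.
    """
    lines = 0
    intact = True
    try:
        with gzip.open(path, "rt") as handle:
            for _ in handle:
                lines += 1
    except EOFError:
        # gzip raises EOFError when the compressed stream is cut short.
        intact = False
    if lines % 4 != 0:
        intact = False
    return lines // 4, intact


def pair_is_consistent(read1_path, read2_path):
    """True when both files are intact and hold the same number of records."""
    n1, ok1 = count_fastq_records(read1_path)
    n2, ok2 = count_fastq_records(read2_path)
    return ok1 and ok2 and n1 == n2
```

Reading every line is slow on large files, so this is better suited to checking a few suspect samples than the whole data set.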
Still running. Waiting for pipeline to finish.
The Nextflow job was killed when its 30-hour time limit expired while items were still in the queue. Afterward, no items remained in the queue, but the files had not finished processing. It seems that when the Nextflow job manager dies, all of its jobs die too? Confusing. Anyway, started a new interactive job running Nextflow, scheduled for 60 hours instead of the previous 30. Relaunched the rnaseq pipeline to hopefully finish the jobs.
Shortly after re-running the rnaseq pipeline, 72 jobs added to the queue.
Pipeline crashed again. Restarted with:
- Logged onto cluster.
- Re-attached to existing tmux session with "tmux attach" command
- Launched interactive job with:
srun --partition Andromeda --job-name kitten --cpus-per-task 5 --mem-per-cpu 12000 --time 60:00:00 --pty bash
- Re-started pipeline with:
[aloraine@str-ac9 rna-seq]$ doIt.sh samplesheet_RNA-Seq.csv O_sativa_japonica_Oct_2011.fa O_sativa_japonica_Oct_2011.gtf O_sativa_japonica_Oct_2011.bed 2>3.err 1>3.out
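For reference, the resources requested by the srun command above work out as follows. This is simple arithmetic, assuming Slurm's default of megabytes for a bare --mem-per-cpu value and the hours:minutes:seconds form of --time; the helper name is made up:

```python
def hms_to_hours(limit):
    """Convert a Slurm 'hours:minutes:seconds' time limit to hours."""
    hours, minutes, seconds = (int(part) for part in limit.split(":"))
    return hours + minutes / 60 + seconds / 3600


cpus_per_task = 5        # --cpus-per-task 5
mem_per_cpu_mb = 12000   # --mem-per-cpu 12000, read as megabytes
total_mem_mb = cpus_per_task * mem_per_cpu_mb

print(total_mem_mb)              # 60000
print(hms_to_hours("60:00:00"))  # 60.0
```

So the interactive job holds 5 CPUs and 60000 MB of memory for up to 60 hours.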
The pipeline appears to have completed. To test / review:
- Dr. Freese please check that you have full access to all the files in the "results" directory
- Nextflow runs a data QC sub-pipeline using "multiQC"; please check the output (probably need to download it)
You could also use 'rclone' to migrate all or some of the data into Google Drive. If you search Jira for 'rclone' you will probably find pretty good instructions on how to do this.
I can access the files in the results directory.
The following processed bam files appear to be missing:
- A.C.Y.S.1
- A.C.Y.S.3
- A.E.Y.S.2
- A.E.Y.S.3
I cannot find a multiqc_report.html but there does appear to be a complete folder of all of the samples run through fastqc here:
/nobackup/lorainelab/salty_rice/rna-seq/results/fastqc
Since there appear to be missing files I am moving back to To Do.
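A check like this can be scripted so that missing outputs are found the same way on every re-run. A minimal sketch, assuming one file named <sample>.sorted.bam per sample in a single results directory (the naming used for this pipeline's star_salmon output); the function name is hypothetical:

```python
import os


def missing_bams(sample_names, bam_dir):
    """Return the sample names that have no <sample>.sorted.bam file in bam_dir."""
    missing = []
    for name in sample_names:
        path = os.path.join(bam_dir, name + ".sorted.bam")
        if not os.path.isfile(path):
            missing.append(name)
    return sorted(missing)
```

The sample names can be read from column 1 of the sample sheet, so the list of expected files always matches what was submitted.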
I think I found the problem:
RUN STATISTICS FOR INPUT FILE: A.C.Y.S.3_2.fastq.gz
=============================================
23469680 sequences processed in total
The length threshold of paired-end sequences gets evaluated later on (in the validation step)
Validate paired-end files A.C.Y.S.3_1_trimmed.fq.gz and A.C.Y.S.3_2_trimmed.fq.gz
file_1: A.C.Y.S.3_1_trimmed.fq.gz, file_2: A.C.Y.S.3_2_trimmed.fq.gz
>>>>> Now validing the length of the 2 paired-end infiles: A.C.Y.S.3_1_trimmed.fq.gz and A.C.Y.S.3_2_trimmed.fq.gz <<<<<
Writing validated paired-end Read 1 reads to A.C.Y.S.3_1_val_1.fq.gz
Writing validated paired-end Read 2 reads to A.C.Y.S.3_2_val_2.fq.gz
Read 2 output is truncated at sequence count: 23469680, please check your paired-end input files! Terminating...
To deal with this, let's only use read 1 for sample A.C.Y.S.3.
Error reports:
- A.C.Y.S.1 - doIt.out: "Read 2 output is truncated at sequence count: 40774078, please check your paired-end input files! Terminating..."
- A.C.Y.S.3 - 3.out: "Read 2 output is truncated at sequence count: 23469680, please check your paired-end input files! Terminating..."
- A.E.Y.S.2 - ?
- A.E.Y.S.3 - ?
Edited sample sheet to exclude R2 for A.C.Y.S.1 and A.C.Y.S.3 as per error message above.
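The same edit can be scripted so it is repeatable if more samples turn out to have truncated R2 files. A sketch only: it assumes an empty fastq_2 value makes the pipeline treat a sample as single-end, and the strandedness value in the example rows is a placeholder rather than the value in the real sample sheet:

```python
import csv
import io


def drop_read2(csv_text, samples):
    """Blank the fastq_2 column for the named samples and return the new CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["sample"] in samples:
            row["fastq_2"] = ""
        writer.writerow(row)
    return out.getvalue()


sheet = (
    "sample,fastq_1,fastq_2,strandedness\n"
    "A.C.Y.S.2,A.C.Y.S.2_1.fastq.gz,A.C.Y.S.2_2.fastq.gz,unstranded\n"
    "A.C.Y.S.3,A.C.Y.S.3_1.fastq.gz,A.C.Y.S.3_2.fastq.gz,unstranded\n"
)
print(drop_read2(sheet, {"A.C.Y.S.1", "A.C.Y.S.3"}))
```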
Re-ran pipeline after launching 60-hour interactive session on Andromeda cluster:
(nf-core) [aloraine@str-abm1 rna-seq]$ doIt.sh samplesheet_RNA-Seq.csv O_sativa_japonica_Oct_2011.fa O_sativa_japonica_Oct_2011.gtf O_sativa_japonica_Oct_2011.bed 2>4.err 1>4.out
The pipeline halted very quickly after starting, choking on sample A.E.Y.S.2. The error appeared to come from pigz, a parallel implementation of gzip. Error message was:
pigz: error while loading shared libraries: libz.so.1: cannot open shared object file: No such file or directory
Trying again:
(nf-core) [aloraine@str-abm1 rna-seq]$ doIt.sh samplesheet_RNA-Seq.csv O_sativa_japonica_Oct_2011.fa O_sativa_japonica_Oct_2011.gtf O_sativa_japonica_Oct_2011.bed 2>5.err 1>5.out
New error on sample A.E.Y.S.3:
Read 2 output is truncated at sequence count: 25069958, please check your paired-end input files! Terminating...
Edited samples file to exclude R2 for A.E.Y.S.3 and tried again with:
(nf-core) [aloraine@str-abm1 rna-seq]$ doIt.sh samplesheet_RNA-Seq.csv O_sativa_japonica_Oct_2011.fa O_sativa_japonica_Oct_2011.gtf O_sativa_japonica_Oct_2011.bed 2>6.err 1>6.out
New error on A.E.Y.S.2:
Read 2 output is truncated at sequence count: 28905992, please check your paired-end input files! Terminating...
Edited samples file to exclude R2 for A.E.Y.S.2 and tried again with:
(nf-core) [aloraine@str-abm1 rna-seq]$ doIt.sh samplesheet_RNA-Seq.csv O_sativa_japonica_Oct_2011.fa O_sativa_japonica_Oct_2011.gtf O_sativa_japonica_Oct_2011.bed 2>7.err 1>7.out
New error:
yaml.scanner.ScannerError: mapping values are not allowed here
in "versions.yml", line 23, column 21
Tried again and got this result:
-[nf-core/rnaseq] Pipeline completed successfully-
Completed at: 03-Mar-2022 14:11:45
Duration    : 1m 2s
CPU hours   : 280.3 (100% cached)
Succeeded   : 2
Cached      : 534
Checked bam output files in /nobackup/lorainelab/salty_rice/rna-seq/results/star_salmon.
They are:
A.C.N.R.1.sorted.bam A.C.N.S.3.sorted.bam A.E.N.R.2.sorted.bam A.E.Y.S.2.sorted.bam M.C.N.S.1.sorted.bam M.E.N.R.3.sorted.bam
A.C.N.R.2.sorted.bam A.C.Y.S.1.sorted.bam A.E.N.R.3.sorted.bam A.E.Y.S.3.sorted.bam M.C.N.S.2.sorted.bam M.E.N.S.1.sorted.bam
A.C.N.R.3.sorted.bam A.C.Y.S.2.sorted.bam A.E.N.S.1.sorted.bam M.C.N.R.1.sorted.bam M.C.N.S.3.sorted.bam M.E.N.S.2.sorted.bam
A.C.N.S.1.sorted.bam A.C.Y.S.3.sorted.bam A.E.N.S.2.sorted.bam M.C.N.R.2.sorted.bam M.E.N.R.1.sorted.bam M.E.N.S.3.sorted.bam
A.C.N.S.2.sorted.bam A.E.N.R.1.sorted.bam A.E.N.S.3.sorted.bam M.C.N.R.3.sorted.bam M.E.N.R.2.sorted.bam
Multi-qc directory is also available.
Requesting new first level review by Nowlan Freese.
All sorted bam files are present with file sizes that are in the range of what I would expect. I was able to download the multiqc_report.html and view it in a web browser.
Added config file to repository: nfcore-rnaseq.config. The file is required to ensure correct max intron size parameter is used.