Looking at the "General Statistics" tables in the SL4 and SL5 MultiQC reports.
Because the Web page is difficult for me to read, I imported the tables into an Excel spreadsheet, Ravi_2021_SL4_v_SL5_GeneralStatistics_multiqc_report.xlsx, and added it to the repository.
I note right off the bat that "M Reads Mapping" (millions of reads mapped) is larger for SL5 in 7 out of 10 samples. To keep track of which samples had larger or smaller "M Reads Mapping" values, I color-coded the cells: pale blue if the value was larger, pale orange if it was smaller. Thus, the SL5 table contained many more pale blue cells than the SL4 table.
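The per-sample comparison above could also be done programmatically instead of by eye. A minimal sketch, using made-up sample names and "M Reads Mapping" values (placeholders, not the real report numbers):

```python
# Count how many samples mapped more reads against SL5 than against SL4.
# Sample names and values below are illustrative placeholders only.
sl4_m_reads = {"sample-A": 20.1, "sample-B": 18.5, "sample-C": 22.0}
sl5_m_reads = {"sample-A": 21.3, "sample-B": 18.2, "sample-C": 23.4}

# Samples that would get a pale blue cell in the SL5 table:
higher_in_sl5 = [s for s in sl4_m_reads if sl5_m_reads[s] > sl4_m_reads[s]]
print(f"{len(higher_in_sl5)} of {len(sl4_m_reads)} samples mapped more reads in SL5")
```

With the real tables, the two dicts would be read from the exported spreadsheet (or from MultiQC's `multiqc_data/multiqc_general_stats.txt`) rather than typed in.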
Next, I did some sanity-checking on the pipelines. The "% BP trimmed", "% Dups", "%GC", "Length" and "M Seqs" columns should be identical in the SL4 and SL5 pipeline results, as these steps process the reads themselves, independent of the genome alignment target. Thus, for these metrics, using a different reference genome assembly (SL4 versus SL5) should not matter. I scanned the values manually; no differences spotted. Sanity check passes.
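This sanity check is also easy to automate so it doesn't rely on eyeballing. A sketch with toy placeholder values (the real values would come from the two exported tables):

```python
# Sanity check: alignment-independent QC metrics should be identical
# between the SL4 and SL5 runs. Sample names and values here are
# illustrative placeholders, not the actual report numbers.
shared_cols = ["% BP trimmed", "% Dups", "%GC", "Length", "M Seqs"]

sl4 = {"sample-A": {"% BP trimmed": 1.2, "% Dups": 12.3, "%GC": 43.0,
                    "Length": 150, "M Seqs": 25.0}}
sl5 = {"sample-A": {"% BP trimmed": 1.2, "% Dups": 12.3, "%GC": 43.0,
                    "Length": 150, "M Seqs": 25.0}}

# Collect every (sample, column) pair where the two runs disagree.
mismatches = [
    (sample, col)
    for sample in sl4
    for col in shared_cols
    if sl4[sample][col] != sl5[sample][col]
]
print("Sanity check passes" if not mismatches else f"Mismatches: {mismatches}")
```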
As [~molly] noted during scrum meetings, two samples triggered a "WARNING: Fail Strand Check" alert in the SL4 mapping but elicited no such warning in the SL5 mapping. Odd. This could have something to do with the difference in the annotations used. We cannot really tell at this level what is going on, so I recommended we proceed with deploying the data for genome browser visualization and then manually inspect these samples for a possible error. We might find something interesting.
HPC Directory: /nobackup/tomato_genome/Ravi_tamaulipas_2021
SL5 Directory: /nobackup/tomato_genome/Ravi_tamaulipas_2021/Ravi_2021_SL5
SL4 Directory: /nobackup/tomato_genome/Ravi_tamaulipas_2021/Ravi_2021_SL4
Next steps:
MultiQC report notes: Compared the SL4 and SL5 MultiQC reports. SL5 had no warnings, and strandedness was reverse throughout. The SL4 report did have a failed-strandedness warning for 'Tamaulipas-pistils-3hr-25C-R2' and 'Tamaulipas-pistils-0hr-25C-R2': those two samples were inferred as unstranded rather than reverse, unlike the rest of the samples. For mapping, SL4 and SL5 are quite similar with no major differences, but overall SL5 has higher alignment scores.
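For context on what the strand check is reporting: tools like RSeQC's infer_experiment estimate the fraction of reads agreeing with the annotated transcript strand and classify the library from that fraction. A hedged sketch of that classification logic (the 0.8/0.2 cutoffs below are assumed typical thresholds, not necessarily the pipeline's exact values):

```python
def classify_strandedness(sense_fraction, threshold=0.8):
    """Classify a library from the fraction of reads matching the
    annotated (sense) strand. Thresholds are illustrative assumptions."""
    if sense_fraction >= threshold:
        return "forward"
    if sense_fraction <= 1 - threshold:
        return "reverse"
    return "unstranded"

# A reverse-stranded library has most reads on the antisense strand:
print(classify_strandedness(0.05))  # reverse
# A roughly 50/50 split looks unstranded, which would trip the warning
# when the samplesheet declares 'reverse':
print(classify_strandedness(0.52))  # unstranded
```

This is consistent with the warning text: the two flagged samples fell into the ambiguous middle range against the SL4 annotation, while the rest looked clearly reverse.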
nf-core warnings:
Notes: I reran Nextflow and the same warnings came up for SL4, but looking at the MultiQC report, the mapping doesn't seem wrong, and the SL5 mapping had no issues with those specific samples. I believe it is an issue with the pipeline itself, or something needs to be changed for the SL4 files. For example, I am using directory locations for the SL4 files instead of copied versions.
Could also be an issue with Slurm not processing SL4 correctly while doing fine with SL5.
Next step: rerun with a samplesheet CSV that sets strandedness to 'unstranded' for SL4.
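For reference, that rerun would amount to editing the strandedness column of the nf-core samplesheet for the flagged samples. A hedged sketch of what the rows might look like (the FASTQ paths below are placeholders, not the real file locations):

```csv
sample,fastq_1,fastq_2,strandedness
Tamaulipas-pistils-0hr-25C-R2,/path/to/0hr_R1.fastq.gz,/path/to/0hr_R2.fastq.gz,unstranded
Tamaulipas-pistils-3hr-25C-R2,/path/to/3hr_R1.fastq.gz,/path/to/3hr_R2.fastq.gz,unstranded
```

Declaring 'unstranded' should silence the strand-check warning for these samples on the SL4 run; whether the quantification changes meaningfully is worth checking in the resulting report.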