Looking at the "General Statistics" tables in the SL4 and SL5 MultiQC reports.
Because the Web page is difficult for me to read, I imported the tables into an Excel spreadsheet, Ravi_2021_SL4_v_SL5_GeneralStatistics_multiqc_report.xlsx, and added it to the repository.
I note right off the bat that "M Reads Mapping" (millions of reads mapped) is larger for SL5 in 7 out of 10 samples. To keep track of which samples had larger or smaller "M Reads Mapping" values, I color-coded the cells: pale blue if the value was larger, pale orange if it was smaller. Thus, the SL5 table contained many more pale blue cells than the SL4 table.
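The per-sample comparison above could also be done programmatically instead of by eye. A minimal sketch, using made-up sample names and "M Reads Mapping" values (placeholders, not the real report numbers):

```python
# Count how many samples mapped more reads against SL5 than against SL4.
# Sample names and values below are illustrative placeholders only.
sl4_m_reads = {"sample-A": 20.1, "sample-B": 18.5, "sample-C": 22.0}
sl5_m_reads = {"sample-A": 21.3, "sample-B": 18.2, "sample-C": 23.4}

# Samples that would get a pale blue cell in the SL5 table:
higher_in_sl5 = [s for s in sl4_m_reads if sl5_m_reads[s] > sl4_m_reads[s]]
print(f"{len(higher_in_sl5)} of {len(sl4_m_reads)} samples mapped more reads in SL5")
```

With the real tables, the two dicts would be read from the exported spreadsheet (or from MultiQC's `multiqc_data/multiqc_general_stats.txt`) rather than typed in.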
Next, I did some sanity-checking on the pipelines. The "% BP trimmed", "% Dups", "%GC", "Length" and "M Seqs" columns should be identical in the SL4 and SL5 pipeline results, as these steps process the reads themselves, independent of the genome alignment target. Thus, for these metrics, using a different reference genome assembly (SL4 versus SL5) should not matter. I scanned the values manually; no differences spotted. Sanity check passes.
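This sanity check is also easy to automate so it doesn't rely on eyeballing. A sketch with toy placeholder values (the real values would come from the two exported tables):

```python
# Sanity check: alignment-independent QC metrics should be identical
# between the SL4 and SL5 runs. Sample names and values here are
# illustrative placeholders, not the actual report numbers.
shared_cols = ["% BP trimmed", "% Dups", "%GC", "Length", "M Seqs"]

sl4 = {"sample-A": {"% BP trimmed": 1.2, "% Dups": 12.3, "%GC": 43.0,
                    "Length": 150, "M Seqs": 25.0}}
sl5 = {"sample-A": {"% BP trimmed": 1.2, "% Dups": 12.3, "%GC": 43.0,
                    "Length": 150, "M Seqs": 25.0}}

# Collect every (sample, column) pair where the two runs disagree.
mismatches = [
    (sample, col)
    for sample in sl4
    for col in shared_cols
    if sl4[sample][col] != sl5[sample][col]
]
print("Sanity check passes" if not mismatches else f"Mismatches: {mismatches}")
```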
As [~molly] noted during scrum meetings, two samples triggered a "WARNING: Fail Strand Check" alert in the SL4 mapping but elicited no such warning in the SL5 mapping. Odd. This could have something to do with the difference in the annotations used. We cannot really tell at this level what is going on, so I recommended we proceed with deploying the data for genome browser visualization and then manually inspect these samples for a possible error. We might find something interesting.
HPC Directory: /nobackup/tomato_genome/Ravi_tamaulipas_2021
SL5 Directory: /nobackup/tomato_genome/Ravi_tamaulipas_2021/Ravi_2021_SL5
SL4 Directory: /nobackup/tomato_genome/Ravi_tamaulipas_2021/Ravi_2021_SL4
Next steps:
MultiQC report notes: Compared the SL4 and SL5 MultiQC reports. SL5 had no warnings, and strandedness was reverse throughout. The SL4 report did have a failed-strandedness warning for 'Tamaulipas-pistils-3hr-25C-R2' and 'Tamaulipas-pistils-0hr-25C-R2': those two samples were inferred as unstranded rather than reverse, unlike the rest of the samples. For mapping, SL4 and SL5 are quite similar with no major differences, but overall SL5 has higher alignment scores.
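For context on what the strand check is reporting: tools like RSeQC's infer_experiment estimate the fraction of reads agreeing with the annotated transcript strand and classify the library from that fraction. A hedged sketch of that classification logic (the 0.8/0.2 cutoffs below are assumed typical thresholds, not necessarily the pipeline's exact values):

```python
def classify_strandedness(sense_fraction, threshold=0.8):
    """Classify a library from the fraction of reads matching the
    annotated (sense) strand. Thresholds are illustrative assumptions."""
    if sense_fraction >= threshold:
        return "forward"
    if sense_fraction <= 1 - threshold:
        return "reverse"
    return "unstranded"

# A reverse-stranded library has most reads on the antisense strand:
print(classify_strandedness(0.05))  # reverse
# A roughly 50/50 split looks unstranded, which would trip the warning
# when the samplesheet declares 'reverse':
print(classify_strandedness(0.52))  # unstranded
```

This is consistent with the warning text: the two flagged samples fell into the ambiguous middle range against the SL4 annotation, while the rest looked clearly reverse.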
nf-core warnings:
Notes: I reran Nextflow and the same warnings came up for SL4, but looking at the MultiQC report, the mapping doesn't seem wrong, and the SL5 mapping had no issues with those specific samples. I believe it is an issue with the pipeline itself, or something needs to be changed for the SL4 files. For example, I am using directory locations for the SL4 files instead of copied versions.
Could also be an issue with Slurm not processing SL4 correctly while doing fine with SL5.
Next step: rerun with a samplesheet CSV that sets strandedness to 'unstranded' for SL4.
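For reference, that rerun would amount to editing the strandedness column of the nf-core samplesheet for the flagged samples. A hedged sketch of what the rows might look like (the FASTQ paths below are placeholders, not the real file locations):

```csv
sample,fastq_1,fastq_2,strandedness
Tamaulipas-pistils-0hr-25C-R2,/path/to/0hr_R1.fastq.gz,/path/to/0hr_R2.fastq.gz,unstranded
Tamaulipas-pistils-3hr-25C-R2,/path/to/3hr_R1.fastq.gz,/path/to/3hr_R2.fastq.gz,unstranded
```

Declaring 'unstranded' should silence the strand-check warning for these samples on the SL4 run; whether the quantification changes meaningfully is worth checking in the resulting report.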