[IGBF-3420] Process 30-804059537 (KP 2023) data using S_lycopersicum_Sep_2019 (SL4) genome assembly and annotations - JIRA UNCC

Molly Davis added a comment - 25/Aug/23 3:59 PM - edited

Directory: /projects/tomato_genome/fnb/dataprocessing/30-804059537-KP/S_lycopersicum_Sep_2019

./doIt.sh Kelsey_samples_2023.csv /projects/tomato_genome/fnb/nfcore_rnaseq/S_lycopersicum_Sep_2019.fa /projects/tomato_genome/fnb/nfcore_rnaseq/S_lycopersicum_Sep_2019.gtf /projects/tomato_genome/fnb/nfcore_rnaseq/S_lycopersicum_Sep_2019.bed tomato.config 1> out.0.txt 2> err.0.txt

Molly Davis added a comment - 28/Aug/23 12:57 PM - edited

Launch renameBams.sh script:
./renameBams.sh
Launch Scaled Coverage graphs script:
./sbatch-doIt.sh .bam bamCoverage.sh >jobs.out 2>jobs.err
Launch Junction files script:
./sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err

Reviewer:

Check that files have reasonable sizes (no "zero" size files, for example)
Check that every "FJ.bed.gz" file has a corresponding "FJ.bed.gz.tbi" index file
Check that every bam file has a corresponding "FJ.bed.gz" file
Check that every bam file has a corresponding "scaled.bedgraph.gz" file
Check that every "scaled.bedgraph.gz" has a corresponding "scaled.bedgraph.gz.tbi"
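The checklist above could be automated along these lines. This is only a sketch: it assumes each sample `X.bam` sits in the same directory as `X.FJ.bed.gz` and `X.scaled.bedgraph.gz` (the naming is inferred from the suffixes quoted in the checklist), and the function name `review_outputs` is made up for illustration. Empty `.err` files are skipped because an empty error log is the expected outcome.

```python
from pathlib import Path

# (suffix, required companion suffix) pairs, taken from the reviewer checklist.
# Assumption: companions share the sample prefix and live in the same directory.
REQUIRED_COMPANIONS = [
    (".bam", ".FJ.bed.gz"),
    (".bam", ".scaled.bedgraph.gz"),
    (".FJ.bed.gz", ".FJ.bed.gz.tbi"),
    (".scaled.bedgraph.gz", ".scaled.bedgraph.gz.tbi"),
]


def review_outputs(directory):
    """Return a list of human-readable problems found in `directory`."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    problems = []

    # Check 1: no zero-size files (empty .err logs are fine, so skip them).
    for path in files:
        if not path.name.endswith(".err") and path.stat().st_size == 0:
            problems.append(f"empty file: {path.name}")

    # Checks 2-5: every file with a given suffix has its companion file.
    names = {p.name for p in files}
    for suffix, companion in REQUIRED_COMPANIONS:
        for name in sorted(names):
            if name.endswith(suffix):
                expected = name[: -len(suffix)] + companion
                if expected not in names:
                    problems.append(f"{name} has no {expected}")
    return problems
```

Calling `review_outputs(".")` from the star_salmon results directory and printing the returned list would give the reviewer one line per missing or empty file.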

Reviewer: [~RobertReid]

Next Steps after review:

Move only the coverage graphs, bam files, and junction files to this location: /projects/tomato_genome/fnb/dataprocessing/30-804059537-KP/for_quickload/S_lycopersicum_Sep_2019/30-804059537
Make sure permissions are granted for copying

Molly Davis added a comment - 28/Aug/23 1:03 PM - edited

Branch: https://bitbucket.org/mdavis4290/molly-pistil-rna-seq/branch/IGBF-3420
Includes 30-804059537_SL4_multiqc_report.html & 30-804059537_SL4_salmon.merged.gene_counts.tsv

Reviewer: [~aloraine]

Robert Reid added a comment - 31/Aug/23 8:22 AM

Checking out files in :
/projects/tomato_genome/fnb/dataprocessing/30-804059537-KP/S_lycopersicum_Sep_2019/results/star_salmon

.err files
All files here are 0 in size.
Except one!

-rw-r--r-- 1 mdavi258 tomato_genome 370 Aug 28 13:56 Malintka-R1-0hr-25C-self.err

The error is

Exception in thread "main" java.lang.RuntimeException: Postion is too high (more than 64792705)
at org.biojava.nbio.genome.parsers.twobit.TwoBitParser.setCurrentSequencePosition(TwoBitParser.java:191)
at org.biojava.nbio.genome.parsers.twobit.TwoBitParser.loadFragment(TwoBitParser.java:332)
at org.lorainelab.findjunctions.FindJunctions.main(FindJunctions.java:249)
Malintka-R1-0hr-25C-self.err lines 1-4/4 (END)

Bam and bai
63 bam files
63 bai files

-rw-r----- 1 mdavi258 tomato_genome 590M Aug 28 11:58 Malintka-R1-8hr-25C-self.bam
Sure enough, this file is 590M while the rest range from 950M to 3GB. Broken!

Ann Loraine added a comment - 31/Aug/23 4:57 PM

Moving this back to "to-do" as the first level review found an issue. MD will look into it, as per her comment during today's morning meeting. Please feel free to move this somewhere different as you see fit, of course.

attn: [~molly]

Molly Davis added a comment - 05/Sep/23 4:09 PM - edited

Nextflow Pipeline ran successfully with SL4 genome
Directory: /projects/tomato_genome/fnb/dataprocessing/30-804059537-KP/S_lycopersicum_Sep_2019
MultiQC report notes: No errors or warnings were present in the report. The output file is named '30-804059537_SL4_multiqc_report.html'.

Molly Davis added a comment - 05/Sep/23 4:11 PM

Next steps:
Commit CSV and multiqc report to Pistil repo on bitbucket
Change sorted bam names
Create junction files
Create Coverage graphs

Molly Davis added a comment - 05/Sep/23 4:17 PM

Launch renameBams.sh script:
./renameBams.sh
Launch Scaled Coverage graphs script:
./sbatch-doIt.sh .bam bamCoverage.sh >jobs.out 2>jobs.err
Launch Junction files script:
./sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err

Molly Davis added a comment - 06/Sep/23 10:00 AM - edited

Directory: /projects/tomato_genome/fnb/dataprocessing/30-804059537-KP/S_lycopersicum_Sep_2019/results/star_salmon

Still running into same error from before when running find_junctions.sh script: Malintka-R1-0hr-25C-self.err

Exception in thread "main" java.lang.RuntimeException: Postion is too high (more than 64792705)
at org.biojava.nbio.genome.parsers.twobit.TwoBitParser.setCurrentSequencePosition(TwoBitParser.java:191)
at org.biojava.nbio.genome.parsers.twobit.TwoBitParser.loadFragment(TwoBitParser.java:332)
at org.lorainelab.findjunctions.FindJunctions.main(FindJunctions.java:249)

Next Step: Ask Nowlan Freese for help with the java jar file (find-junctions-1.0.0-jar-with-dependencies.jar) and the 2bit file (S_lycopersicum_Sep_2019.2bit).

Molly Davis added a comment - 06/Sep/23 12:51 PM - edited

Meeting notes:
Issue: Looked at the bam file, which looks fine, but there are reads that extend past the end of the chromosome. They have been soft-clipped, but find_junctions isn't taking that into consideration and won't process the whole file to create the .FJ.bed.gz file.

Extra notes: There is a read at position 64,792,798 that is past the end of the chromosome in the SL4 genome. The read should have been clipped to fit but was not, so it does not fit correctly at the chromosome end. The alignment failed to clip the ends to match the SL4 genome only for this one sample, Malintka-R1-0hr-25C-self. The soft-clipped alignment passed through the pipeline, but find_junctions cannot use this bam file because it does not respect the soft clipping. The aligner should hard-clip but is soft-clipping instead.

Code used:

samtools view -b Malintka-R1-0hr-25C-self.bam SL4.0ch10:64782127-64791136 > samtoolsOutput.bam
java -jar find-junctions-1.0.0-jar-with-dependencies.jar -u -f 5 -b S_lycopersicum_Sep_2019.2bit -o myShortFileoutput.bed samtoolsOutput.bam

Possible solution: Change the find_junctions code to handle soft-clipped reads that extend past the end of the chromosome. But this affects only one file, so would it be worth changing the code?

Could also review the multiqc report to maybe discover the clipping issue: Per Base Sequence Content has failed, which usually means there are contaminants in the library. Aligners should be able to deal with bases that don't align by soft-clipping them. But in this case, for this one sample, that did not happen correctly?
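To make the soft-clipping discussion concrete, here is a rough sketch of how one could flag reads whose soft-clipped tail would run past the chromosome end if a tool treated those bases as aligned. It is not the FindJunctions logic, which is not shown in this ticket; the helper names and the numbers in the usage note are hypothetical. It relies only on the SAM convention that the CIGAR operations M, D, N, = and X advance along the reference while S (soft clip) does not.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def reference_end(pos, cigar):
    """Last 1-based reference position covered by the aligned part of the read."""
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return pos + ref_len - 1


def trailing_soft_clip(cigar):
    """Number of soft-clipped bases at the right-hand end of the read (0 if none)."""
    ops = CIGAR_OP.findall(cigar)
    # A hard clip (H) may follow the soft clip, so skip it first.
    while ops and ops[-1][1] == "H":
        ops.pop()
    return int(ops[-1][0]) if ops and ops[-1][1] == "S" else 0


def overhang(pos, cigar, chrom_length):
    """Bases past the chromosome end if the trailing soft clip were counted as aligned."""
    end_with_clip = reference_end(pos, cigar) + trailing_soft_clip(cigar)
    return max(0, end_with_clip - chrom_length)
```

For a made-up chromosome of length 1000, a read at position 951 with CIGAR `50M20S` ends its aligned part exactly at base 1000, yet `overhang(951, "50M20S", 1000)` reports 20 bases beyond the end, which is the kind of position a sequence reader would reject.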

Ann Loraine added a comment - 07/Sep/23 10:54 AM - edited

Ann comment: Identify the aligned region and visualize it in IGB. Document and capture the image for presentations.

From meeting: Re-run / double-check on Malintka-R1-8hr-25C-self (not the same sample that messed up the find junctions program.)

Molly Davis added a comment - 07/Sep/23 1:39 PM - edited

New ticket was made for junctions java error: IGBF-3434

Next step for this ticket: Review the Malintka-R1-8hr-25C-self.bam file size issue and see if the problem recurs after a Nextflow rerun

Molly Davis added a comment - 07/Sep/23 3:07 PM - edited

Malintka-R1-8hr-25C-self.bam

The file does seem to be very small (618061348 bytes, about 590M) compared to other bam files on the cluster. I checked the MultiQC report, and this sample has the worst alignment score, with many reads unmapped because they are too short. This means that this is more of a sample issue than a pipeline issue.

Next Step: Please review my conclusions and move ticket to done if there are no issues.

Robert Reid added a comment - 13/Sep/23 1:54 PM

I took a peek at the number of reads that align for each expt.

Total reads that align:
Total Read Count
Heinz-Ovary-R1-0hr-25C-unpol.bam 50979464
Heinz-Ovary-R2-0hr-25C-unpol.bam 39862409
Heinz-Ovary-R3-0hr-25C-unpol.bam 52051210
Heinz-R1-0hr-25C-self.bam 40438938
Heinz-R1-3hr-25C-self.bam 49881450
Heinz-R1-3hr-37C-self.bam 51660872
Heinz-R1-8hr-25C-self.bam 22196798
Heinz-R1-8hr-37C-self.bam 50926978
Heinz-R2-0hr-25C-self.bam 49858292
Heinz-R2-3hr-25C-self.bam 50700025
Heinz-R2-3hr-37C-self.bam 54923225
Heinz-R2-8hr-25C-self.bam 35292739
Heinz-R2-8hr-37C-self.bam 51480404
Heinz-R3-0hr-25C-self.bam 51577031
Heinz-R3-3hr-25C-self.bam 45984816
Heinz-R3-3hr-37C-self.bam 50612275
Heinz-R3-8hr-25C-self.bam 48707431
Heinz-R3-8hr-37C-self.bam 51695156
Malintka-R1-0hr-25C-self.bam 50644328
Malintka-R1-3hr-25C-self.bam 28575876
Malintka-R1-3hr-37C-self.bam 50707327
Malintka-R1-8hr-25C-self.bam 11997289
Malintka-R1-8hr-37C-self.bam 49495351
Malintka-R2-0hr-25C-self.bam 52432853
Malintka-R2-3hr-25C-self.bam 45473568
Malintka-R2-3hr-37C-self.bam 44168728
Malintka-R2-8hr-25C-self.bam 46900467
Malintka-R2-8hr-37C-self.bam 51370963
Malintka-R3-0hr-25C-self.bam 47086317
Malintka-R3-3hr-25C-self.bam 51085380
Malintka-R3-3hr-37C-self.bam 29231411
Malintka-R3-8hr-25C-self.bam 44275394
Malintka-R3-8hr-37C-self.bam 55644007
Nagcarlang-R1-0hr-25C-self.bam 40968404
Nagcarlang-R1-3hr-25C-self.bam 46600257
Nagcarlang-R1-3hr-37C-self.bam 32781471
Nagcarlang-R1-8hr-25C-self.bam 46430696
Nagcarlang-R1-8hr-37C-self.bam 49122203
Nagcarlang-R2-0hr-25C-self.bam 41919616
Nagcarlang-R2-3hr-25C-self.bam 25538537
Nagcarlang-R2-3hr-37C-self.bam 48518069
Nagcarlang-R2-8hr-25C-self.bam 44389303
Nagcarlang-R2-8hr-37C-self.bam 55567552
Nagcarlang-R3-0hr-25C-self.bam 35216477
Nagcarlang-R3-3hr-25C-self.bam 53420765
Nagcarlang-R3-3hr-37C-self.bam 55258962
Nagcarlang-R3-8hr-25C-self.bam 50940477
Nagcarlang-R3-8hr-37C-self.bam 51768199
Tamaulipas-R1-0hr-25C-self.bam 19287669
Tamaulipas-R1-3hr-25C-self.bam 54518901
Tamaulipas-R1-3hr-37C-self.bam 47761377
Tamaulipas-R1-8hr-25C-self.bam 48604790
Tamaulipas-R1-8hr-37C-self.bam 50484499
Tamaulipas-R2-0hr-25C-self.bam 46447680
Tamaulipas-R2-3hr-25C-self.bam 48552903
Tamaulipas-R2-3hr-37C-self.bam 46727628
Tamaulipas-R2-8hr-25C-self.bam 56810522
Tamaulipas-R2-8hr-37C-self.bam 48005344
Tamaulipas-R3-0hr-25C-self.bam 40520115
Tamaulipas-R3-3hr-25C-self.bam 42195143
Tamaulipas-R3-3hr-37C-self.bam 49785458
Tamaulipas-R3-8hr-25C-self.bam 45882212
Tamaulipas-R3-8hr-37C-self.bam 44242182

Script that does this is here:
/projects/tomato_genome/scripts/rob/summarizeBAMAlignments.slurm

Robert Reid added a comment - 13/Sep/23 1:58 PM

Compared to the average number of reads aligned, the 3 expts with much lower total reads are:
name                             num aligned   percent difference from average
Tamaulipas-R1-0hr-25C-self.bam   19287669      -57.8
Malintka-R1-8hr-25C-self.bam     11997289      -73.7
Heinz-R1-8hr-25C-self.bam        22196798      -51.4

Going to explore next if this pattern is observed throughout NEXTFLOW and post processing.
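The comparison against the average can be reproduced with a few lines of Python. This is a sketch rather than the contents of summarizeBAMAlignments.slurm, which is not shown here; the function names and the -50 percent cutoff are assumptions chosen for illustration.

```python
from statistics import mean


def percent_diff_from_mean(counts):
    """Map each name to its percent difference from the mean of all counts."""
    average = mean(counts.values())
    return {
        name: round(100.0 * (n - average) / average, 1)
        for name, n in counts.items()
    }


def low_outliers(counts, threshold=-50.0):
    """Names whose count is at or below `threshold` percent relative to the mean."""
    diffs = percent_diff_from_mean(counts)
    return sorted(name for name, diff in diffs.items() if diff <= threshold)
```

Passing a dictionary that maps each bam file name to its aligned read count (all 63 rows of the table above) to `low_outliers` would list the samples that fall furthest below the average.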

Robert Reid added a comment - 21/Sep/23 2:36 PM

Bam checking shows the BAMs seem intact. The temporary steps in NEXTFLOW are lost as part of the post cleanup, so we can't peek at what happens mid process.

I left a copy of the slurm script in:

/projects/tomato_genome/scripts/rob/summarizeBAMAlignments.slurm

So Bam files seem fine at this stage.

Molly Davis added a comment - 25/Sep/23 12:34 PM - edited

Had to rebase my branch:

git checkout IGBF-3420
git pull --rebase upstream main
git push --force origin IGBF-3420

Branch: https://bitbucket.org/mdavis4290/molly-pistil-rna-seq/branch/IGBF-3420
Pull Request: https://bitbucket.org/hotpollen/pistil-rna-seq/pull-requests/9

Note: Files should include SL4 Multiqc report and counts.tsv file

Molly Davis added a comment - 25/Sep/23 12:58 PM - edited

Just moved all coverage graphs, bam files, and junction files to this location: /projects/tomato_genome/fnb/dataprocessing/30-804059537-KP/for_quickload/S_lycopersicum_Sep_2019/30-804059537

Permission note:
'-rw-rw-r--' (the first rw means the owner can read and write, the next rw means the group can read and write, and the final r means everyone else can read)
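For anyone checking the copied files programmatically, Python's standard `stat` module can render a numeric mode as the same ls-style string and split it into the three permission classes. The helper name `describe_mode` is made up for this sketch.

```python
import stat


def describe_mode(mode):
    """Return the ls-style string and a per-class breakdown for a numeric file mode."""
    text = stat.filemode(mode)  # ten characters: file type, then owner, group, others
    return {
        "string": text,
        "owner": text[1:4],
        "group": text[4:7],
        "others": text[7:10],
    }
```

With a real file, `describe_mode(os.stat(path).st_mode)` (after `import os`) gives the breakdown for that file; a regular file with octal mode 664 corresponds to the permission string in the note above.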

Ann Loraine added a comment - 26/Sep/23 4:17 PM

Suggestions for testing:

View the multi_qc report, compare to SL5 counterpart
Check that the gene counts file has the expected number of lines (same as equivalent files from other SL4 data processing)
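The line-count check could be done with a short script like the one below. It is a sketch that assumes each counts file is tab-separated with a single header row and the gene ID in the first column; those layout details and the function names are assumptions, not something stated in this ticket.

```python
def count_data_rows(path):
    """Number of non-empty lines after the header in a tab-separated file."""
    with open(path, encoding="utf-8") as handle:
        next(handle, None)  # skip the header row
        return sum(1 for line in handle if line.strip())


def gene_ids(path):
    """Set of values in the first column, header excluded."""
    with open(path, encoding="utf-8") as handle:
        next(handle, None)  # skip the header row
        return {
            line.rstrip("\n").split("\t", 1)[0]
            for line in handle
            if line.strip()
        }


def same_gene_ids(path_a, path_b):
    """True when both files list the same gene IDs, ignoring row order."""
    return gene_ids(path_a) == gene_ids(path_b)
```

Comparing `count_data_rows` for the new SL4 counts file with the value from an earlier SL4 run covers the suggestion above; `same_gene_ids` is a slightly stricter variant that also catches two files with equal length but different genes.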

Molly Davis added a comment - 27/Sep/23 1:34 PM

Testing:

Comparing SL4 and SL5 multiqc reports: Both reports are pretty similar. Of course, mapping differs between the genomes, but overall there are no major differences.
The counts.tsv file has the correct number of rows for genes. This was compared with muday counts.tsv SL4 output data.
One note is that the SL5 counts.tsv file is not present in the repo.

Molly Davis added a comment - 06/Oct/23 3:26 PM

Branch: https://bitbucket.org/mdavis4290/molly-pistil-rna-seq/branch/IGBF-3420b

Adds file: ExternalData/30-804059537-SL5-salmon.merged.gene_counts.tsv

Robert Reid added a comment - 11/Oct/23 10:32 AM - edited

Beginning with review of all things in branch above.
Took a quick glance at the .tsv salmon file in the recent branch and it appears intact and correct, i.e., # of rows expected and # of columns.

Robust counts! Lots of things going on expression wise. This might be a very interesting dataset!

Moving on to the cluster ( /projects/tomato_genome/fnb/dataprocessing/30-804059537-KP/for_quickload/S_lycopersicum_Sep_2019/30-804059537) to check those bits.

Robert Reid added a comment - 11/Oct/23 10:56 AM

In /projects/tomato_genome/fnb/dataprocessing/30-804059537-KP/for_quickload/S_lycopersicum_Sep_2019/30-804059537

As expected:
63 bams
63 bais
126 .gz files (63 bedgraphs, 63 beds)
126 .tbi files to go with the bed files.

The BAM files have a wide range of sizes, which at this point we think is just how those sequences are.
Taking a quick peek at the raw files in /projects/tomato_genome/rnaseq/30-804059537-kelsie/00_fastq
The sizes of these fastq files are not as wide-ranging as those of the BAM files.
Makes me think that either the BAM alignment had issues on some samples (less likely, but we will find out later when we rerun these as SRA's) or something in the sample prep went awry and there is contamination. To test this we could run STAR, and get all the reads that do not align. I think NEXTFLOW is deleting the unaligned reads.

If we get our hands on unaligned reads we can blast against the NR DB and see if there are other things that got sequenced, like for example the technician!
I'll review more soon.

Robert Reid added a comment - 12/Oct/23 9:11 AM

For this:
https://bitbucket.org/mdavis4290/molly-pistil-rna-seq/branch/IGBF-3420b

Is all that is needed to check out the .tsv file?
I'll assume so and bounce this ticket back to Molly!

Molly Davis added a comment - 12/Oct/23 10:17 AM - edited

Pull Request: https://bitbucket.org/hotpollen/pistil-rna-seq/pull-requests/10

Note: Adds 30-804059537-SL5-salmon.merged.gene_counts.tsv to external folder with SL4 tsv file so comparisons can be made.

Ann Loraine added a comment - 12/Oct/23 12:02 PM

Merged the PR but then changed name from 30-804059537-SL5-salmon.merged.gene_counts.tsv to 30-804059537-SL5_salmon.merged.gene_counts.tsv to match convention established for other such files throughout the project.
