Project: IGB
Issue: IGBF-3406

Re-run Nextflow Muday time course data with SL5 data downloaded from SRA

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels: None

      Description

      Refer to ticket IGBF-3423 to locate SRA data.

      SRP460750

      For this task, we need to confirm and sanity-check the Muday time course data that Rob uploaded and submitted to the Sequence Read Archive.
      If the data are good, we will replace all the existing BAM, junctions, etc. files deployed in the "hotpollen" quickload site with newly processed data.
      For this task:

      • Check SRP on NCBI and review submission
      • Download the data onto the cluster by using the SRP name
      • Run nf-core/rnaseq pipeline
      • Run our coverage graph and junctions scripts on the data

      Note that all files should now use their "SRR" names instead of the existing file names.
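      The renaming step can be sketched as a small helper (a hypothetical sketch, not the project's actual script: the mapping file name `name_to_srr.tsv`, its two-column layout, and the `.bam` suffix are all invented for illustration):

      ```shell
      # Hypothetical sketch: rename processed files to their SRA run ("SRR")
      # accessions using a tab-separated mapping file whose rows are
      # old_basename<TAB>SRR_accession. The file name name_to_srr.tsv is invented.
      rename_to_srr() {
        local mapfile=$1 suffix=$2 old srr
        while IFS=$'\t' read -r old srr; do
          # Only rename when the old file is actually present.
          [ -e "${old}${suffix}" ] && mv "${old}${suffix}" "${srr}${suffix}" || true
        done < "$mapfile"
      }

      # Example: rename_to_srr name_to_srr.tsv .bam
      ```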

    Attachments

    Issue Links

    Activity

            Mdavis4290 Molly Davis added a comment (edited)

            Re-run Directory: /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5

            Prefetch SRR Script:

            #!/bin/bash
            
            #SBATCH --job-name=prefetch_SRR
            #SBATCH --partition=Orion
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=1
            #SBATCH --mem=4gb
            #SBATCH --output=%x_%j.out
            #SBATCH --time=24:00:00
            
            cd /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5
            module load sra-tools/2.11.0
            # Note: vdb-config --interactive opens an interactive screen; it is
            # typically run once beforehand (e.g. on a login node) rather than
            # inside a batch job.
            vdb-config --interactive
            
            files=(
            SRR25478240
            SRR25478241
            SRR25478242
            SRR25478243
            SRR25478244
            SRR25478245
            SRR25478246
            SRR25478247
            SRR25478248
            SRR25478249
            SRR25478250
            SRR25478251
            SRR25478252
            SRR25478253
            SRR25478254
            SRR25478255
            SRR25478256
            SRR25478257
            SRR25478258
            SRR25478259
            SRR25478260
            SRR25478261
            SRR25478262
            SRR25478263
            SRR25478264
            SRR25478265
            SRR25478266
            SRR25478267
            SRR25478268
            SRR25478269
            SRR25478270
            SRR25478271
            SRR25478272
            SRR25478273
            SRR25478274
            SRR25478275
            SRR25478276
            SRR25478277
            SRR25478278
            SRR25478279
            SRR25478280
            SRR25478281
            SRR25478282
            SRR25478283
            SRR25478284
            SRR25478285
            SRR25478286
            SRR25478287
            SRR25478288
            SRR25478289
            SRR25478290
            SRR25478291
            SRR25478292
            SRR25478293
            SRR25478294
            SRR25478295
            SRR25478296
            SRR25478297
            SRR25478298
            SRR25478299
            SRR25478300
            SRR25478301
            SRR25478302
            SRR25478303
            SRR25478304
            SRR25478305
            SRR25478306
            SRR25478307
            SRR25478308
            SRR25478309
            SRR25478310
            SRR25478311
            
            )
            
            for f in "${files[@]}"; do echo "$f"; prefetch "$f"; done
            
            
            

            Execute:

            chmod u+x prefetch.slurm
            
            sbatch prefetch.slurm
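            After prefetch completes, each accession should have its own directory containing a non-empty `.sra` file (the default prefetch layout in recent sra-tools). A quick sanity check, sketched here as a reusable function (the function name and the accession-list file are assumptions for illustration):

            ```shell
            # For each accession listed one-per-line in idfile, check that
            # dir/ACC/ACC.sra exists and is non-empty; report any that are not.
            check_prefetch() {
              local dir=$1 idfile=$2 acc
              while read -r acc; do
                [ -n "$acc" ] || continue
                [ -s "${dir}/${acc}/${acc}.sra" ] || echo "missing or empty: ${acc}"
              done < "$idfile"
            }
            ```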
            

            Fasterq-dump Script:

            #!/bin/bash
            
            #SBATCH --job-name=fastqdump_SRR
            #SBATCH --partition=Orion
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=1
            #SBATCH --mem=40gb
            #SBATCH --output=%x_%j.out
            #SBATCH --time=24:00:00
            #SBATCH --array=1-72
            
            #setting up where to grab files from
            file=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p"  /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5/Sra_ids.txt)
            
            
            cd /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5
            module load sra-tools/2.11.0
            
            echo "Starting fasterq-dump on $file";
            
            cd /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5/$file
            
            fasterq-dump ${file}.sra
            
            perl /projects/tomato_genome/scripts/validateHiseqPairs.pl ${file}_1.fastq ${file}_2.fastq
            
            cp ${file}_1.fastq /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5/${file}_1.fastq
            cp ${file}_2.fastq /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5/${file}_2.fastq 
            
            echo "finished"
            

            Execute:

            chmod u+x fasterdump.slurm
            
            sbatch fasterdump.slurm
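            The array job above reads line `${SLURM_ARRAY_TASK_ID}` from `Sra_ids.txt`, so that file presumably holds one accession per line to match the `--array=1-72` range. It can be generated from the accession range rather than typed by hand (a sketch; the file name and location are taken from the script above):

            ```shell
            # Generate Sra_ids.txt, one accession per line, for
            # SRR25478240..SRR25478311 (72 runs total).
            seq 25478240 25478311 | sed 's/^/SRR/' > Sra_ids.txt
            wc -l < Sra_ids.txt   # expect 72
            ```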
            
            Mdavis4290 Molly Davis added a comment (edited)

            Nextflow Pipeline ran successfully with SL5 genome
            Directory: /projects/tomato_genome/fnb/dataprocessing/SRP460750
            MultiQC report notes: No errors or warnings were present in the report. The output file is named 'SRP460750_multiqc_report.html'.

            Mdavis4290 Molly Davis added a comment

            Next steps:

            • Commit CSV and MultiQC report to the Flavonoid repo on Bitbucket
            • Change sorted BAM names
            • Create junction files
            • Create coverage graphs
            Mdavis4290 Molly Davis added a comment

            Launch renameBams.sh script:
            ./renameBams.sh
            Launch Scaled Coverage graphs script:
            ./sbatch-doIt.sh .bam bamCoverage.sh >jobs.out 2>jobs.err
            Launch Junction files script:
            ./sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err

            ann.loraine Ann Loraine added a comment (edited)

            We discovered that the SRA has assigned the same "SRP" identifier to the samples from the Muday lab timecourse and the Johnson lab timecourse experiments.
            Because of this, there is no easy way to separate the data files into different collections unless the analyst already knows which samples belong together in a given analysis.
            We will stop running the re-processing pipeline pending resolution of the study name (SRP id) issue. We want the two collections of runs to have different study identifiers (SRP identifiers).
            Restating the problem: the SRA has assigned the same "SRP" study name to all the samples from the Muday lab's time course experiments and the Johnson lab's time course. Thus, there is no easy way to distinguish the samples from the two very different experiments.

            References regarding what an "SRP" is supposed to be, according to the community and to the SRA itself:
            https://www.ncbi.nlm.nih.gov/sra/docs/submitportal/
            https://www.ncbi.nlm.nih.gov/sra/docs/submitmeta/

            Also, the SRA search interface at https://trace.ncbi.nlm.nih.gov/Traces/?view=study uses the term "study". Thus, an "SRP" is a "study".

            After more discussion among RR, AL, and MD, we realized that the issue may stem from the Muday lab and Johnson lab samples having been submitted with the same descriptive text for the study. That text is not captured in the sample submission spreadsheet; it is added by the submitting investigator through a form on the SRA submission web site. This may be why the SRA grouped all the samples under the same "SRP" (study) accession.

            Next, we will give each study a more distinctive title and description. This should make it clearer to everybody at the SRA which samples and sequence files truly belong together under the same SRP identifier.

            Mdavis4290 Molly Davis added a comment

            Directory: /projects/tomato_genome/fnb/dataprocessing/SRP460750/results/star_salmon

            Review:
            • Check that files have reasonable sizes (no zero-size files, for example)
            • Check that every "FJ.bed.gz" file has a corresponding "FJ.bed.gz.tbi" index file
            • Check that every BAM file has a corresponding "FJ.bed.gz" file
            • Check that every BAM file has a corresponding "scaled.bedgraph.gz" file
            • Check that every "scaled.bedgraph.gz" has a corresponding "scaled.bedgraph.gz.tbi"
            Reviewer: Robert Reid
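            The checks above can be scripted. A hedged sketch, assuming companion files share the BAM's basename (e.g. `X.bam` → `X.FJ.bed.gz`); the actual naming in star_salmon may differ:

            ```shell
            # Report zero-size files and BAMs whose companion files are missing.
            check_outputs() {
              local dir=$1 b s suf
              find "$dir" -type f -size 0 -print   # zero-size files, if any
              for b in "$dir"/*.bam; do
                [ -e "$b" ] || continue            # glob matched nothing
                s=${b%.bam}
                for suf in .FJ.bed.gz .FJ.bed.gz.tbi .scaled.bedgraph.gz .scaled.bedgraph.gz.tbi; do
                  [ -e "${s}${suf}" ] || echo "missing ${s}${suf}"
                done
              done
            }

            # Example: check_outputs /projects/tomato_genome/fnb/dataprocessing/SRP460750/results/star_salmon
            ```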

            robofjoy Robert Reid added a comment
            • The folder /projects/tomato_genome/fnb/dataprocessing/SRP460750/results/star_salmon exists and is accessible.
            • Every jobs.err file is zero-size, good.
            • All .out files are of similar, expected size, and there are the expected 72 files.
            • There are 72 .tbi index files for the bed and bedgraph files as expected. All are around 70k.
            • There are 144 gzipped files, 72 bed and 72 bedgraph. All sizes are as expected and consistent.
            • The file salmon.merged.gene_counts.tsv has the number of genes we expect to see (~36,649).
            • There are 72 .bai and .bam files, and the BAM files are consistently about 2.4 GB. All good.

            Everything appears to be correct!

            Mdavis4290 Molly Davis added a comment

            Branch: https://bitbucket.org/mdavis4290/molly3-flavonoid-rnaseq/branch/IGBF-3406
            Mdavis4290 Molly Davis added a comment

            It seems the files were already merged into the team repo. Moving to done!


              People

              • Assignee:
                Mdavis4290 Molly Davis
                Reporter:
                Mdavis4290 Molly Davis
              • Votes:
                0
                Watchers:

                Dates

                • Created:
                  Updated:
                  Resolved: