[IGBF-3500] Re-run mark-2022-timeseries data with data downloaded from SRA - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
3
Epic Link:
Support NSF pollen grant
Sprint:
Fall 6, Fall 7, Spring 1

Description

Re-run mark 2022 timeseries data with the name SRP441343 from SRA for both SL4 and SL5 genomes.

For this task, we need to confirm and sanity-check the mark 2022 time series data that Rob uploaded and submitted to the Sequence Read Archive.

If the data are good, we will replace all the existing BAM, junctions, etc. files deployed in the "hotpollen" quickload site with newly processed data.
For this task:

* Check SRP on NCBI and review submission
* Download the data onto the cluster by using the SRP name
* Run nf-core/rnaseq pipeline
* Run our coverage graph and junctions scripts on the data

Note that all files should now use their "SRR" names instead of the existing file names.

Attachments

Issue Links

relates to

IGBF-3406 Re-run Nextflow Muday time course data with SL5 data downloaded from SRA

Closed

IGBF-3499 Complete evaluation of RNA-Seq dataset submissions

Closed

Activity

Molly Davis created issue - 15/Nov/23 4:04 PM

Molly Davis made changes - 15/Nov/23 4:04 PM

Field	Original Value	New Value
Epic Link		IGBF-2993 [ 21429 ]

Molly Davis made changes - 15/Nov/23 4:04 PM

Link

This issue relates to IGBF-3406 [ IGBF-3406 ]

Molly Davis made changes - 15/Nov/23 4:04 PM

Link

This issue relates to IGBF-3499 [ IGBF-3499 ]

Molly Davis made changes - 15/Nov/23 4:04 PM

Rank

Ranked higher

Molly Davis made changes - 15/Nov/23 4:11 PM

Assignee

Molly Davis [ molly ]

Molly Davis made changes - 16/Nov/23 1:57 PM

Status

To-Do [ 10305 ]

In Progress [ 3 ]

Molly Davis added a comment - 16/Nov/23 2:06 PM - edited

Re-run Directory: /projects/tomato_genome/fnb/dataprocessing/SRP441343
SL4: /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4
SL5: /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL5

Prefetch SRR Script:

#! /bin/bash

#SBATCH --job-name=prefetch_SRR
#SBATCH --partition=Orion
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4gb
#SBATCH --output=%x_%j.out
#SBATCH --time=24:00:00

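cd /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4
module load sra-tools/2.11.0
# vdb-config --interactive opens a full-screen text UI and needs a terminal,
# so run it once from a login shell before submitting rather than inside the batch job.
# vdb-config --interactive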

files=(
SRR24836276
SRR24836277
SRR24836278
SRR24836279
SRR24836280
SRR24836281
SRR24836282
SRR24836283
SRR24836284
SRR24836285
SRR24836286
SRR24836287
SRR24836288
SRR24836289
SRR24836290
SRR24836291
SRR24836292
SRR24836293
SRR24836294
SRR24836295
SRR24836296
SRR24836297
SRR24836298
SRR24836299
SRR24836300
SRR24836301
SRR24836302
SRR24836303
SRR24836304
SRR24836305
SRR24836306
SRR24836307
SRR24836308
SRR24836309
SRR24836310
SRR24836311
SRR24836312
SRR24836313
SRR24836314
SRR24836315
SRR24836316
SRR24836317
SRR24836318
SRR24836319
SRR24836320
SRR24836321
SRR24836322
SRR24836323
SRR24836324
SRR24836325
SRR24836326
SRR24836327
SRR24836328
SRR24836329
)

for f in "${files[@]}"; do echo "$f"; prefetch "$f"; done

Execute:

chmod u+x prefetch.slurm

sbatch prefetch.slurm

Faster Dump Script:

#! /bin/bash

#SBATCH --job-name=fastqdump_SRR
#SBATCH --partition=Orion
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=40gb
#SBATCH --output=%x_%j.out
#SBATCH --time=24:00:00
#SBATCH --array=1-54

# Look up this array task's SRR accession (task N reads line N of Sra_ids.txt)
file=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p" /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/Sra_ids.txt)

module load sra-tools/2.11.0

echo "Starting fasterq-dump on $file"

# prefetch put each run in its own subdirectory; stop if it is missing
cd "/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/$file" || exit 1

fasterq-dump "${file}.sra"

# Validate the read pairs in the two FASTQ files
perl /projects/tomato_genome/scripts/validateHiseqPairs.pl "${file}_1.fastq" "${file}_2.fastq"

# Copy the FASTQ files up to the nfcore-SL4 directory
cp "${file}_1.fastq" "/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/${file}_1.fastq"
cp "${file}_2.fastq" "/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/${file}_2.fastq"

echo "finished"

Execute:

chmod u+x fasterdump.slurm

sbatch fasterdump.slurm
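
The array job above reads one accession per line from Sra_ids.txt, but the ticket does not show how that file was made. A minimal sketch, assuming the 54 runs are numbered consecutively from SRR24836276 to SRR24836329 as in the prefetch list (make_sra_ids is a hypothetical helper, not part of the original scripts):

```shell
# make_sra_ids FIRST LAST: print SRR<FIRST> .. SRR<LAST>, one per line.
# Assumption: the run accessions are consecutive, as in the prefetch list.
make_sra_ids() {
    n=$1
    while [ "$n" -le "$2" ]; do
        printf 'SRR%s\n' "$n"
        n=$((n + 1))
    done
}

# Write the list the array job expects and report how many IDs it holds.
make_sra_ids 24836276 24836329 > Sra_ids.txt
wc -l < Sra_ids.txt
```

The line count should equal the upper bound of the --array range (54), since each array task reads the line matching its task ID.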

Molly Davis made changes - 16/Nov/23 2:12 PM

Description

SRP441343

Re-run mark 2022 timeseries data with the name SRP441343 from SRA for both SL4 and SL5 genomes.

Ann Loraine made changes - 27/Nov/23 9:11 AM

Sprint

Fall 6 [ 182 ]

Fall 6, Fall 7 [ 182, 183 ]

Ann Loraine made changes - 27/Nov/23 9:11 AM

Rank

Ranked higher

Molly Davis added a comment - 05/Dec/23 10:19 AM - edited

The Nextflow pipeline ran successfully with both the SL4 and SL5 genomes.
Directories:

/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL5
/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4

MultiQC report notes: No errors or warnings were present in the reports. The output files are named 'SRP441343_SL5_multiqc_report.html' & 'SRP441343_SL4_multiqc_report.html'.

Molly Davis added a comment - 05/Dec/23 10:19 AM

Next steps:
Commit CSV and multiqc report to Flavonoid repo on bitbucket
Change sorted bam names
Create junction files
Create Coverage graphs

Molly Davis added a comment - 05/Dec/23 10:20 AM

Launch renameBams.sh script:
./renameBams.sh
Launch Scaled Coverage graphs script:
./sbatch-doIt.sh .bam bamCoverage.sh >jobs.out 2>jobs.err
Launch Junction files script:
./sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err
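
The contents of renameBams.sh are not shown in this ticket. As a rough sketch of what such a rename step could look like, assuming the pipeline wrote BAMs named SAMPLE.markdup.sorted.bam (an assumption about the output naming, not confirmed here) and that the goal is SAMPLE.bam with a matching index (rename_bams is a hypothetical helper):

```shell
# rename_bams DIR: rename SAMPLE.markdup.sorted.bam to SAMPLE.bam,
# and SAMPLE.markdup.sorted.bam.bai to SAMPLE.bam.bai.
# Assumption: the ".markdup.sorted" infix is what needs removing.
rename_bams() {
    for f in "$1"/*.markdup.sorted.bam; do
        [ -e "$f" ] || continue              # no match: glob stayed literal
        new="${f%.markdup.sorted.bam}.bam"
        mv -n -- "$f" "$new"                 # -n: never overwrite an existing file
        if [ -e "$f.bai" ]; then
            mv -n -- "$f.bai" "$new.bai"
        fi
    done
}
```

Running it as "rename_bams ." from the star_salmon directory would rename in place; files that do not match the pattern are left alone.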

Molly Davis added a comment - 05/Dec/23 4:41 PM

Directories:

/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon/
/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL5/results/star_salmon/

Reviewer:
Check that files have reasonable sizes (no "zero" size files, for example)
Check that every "FJ.bed.gz" file has a corresponding "FJ.bed.gz.tbi" index file
Check that every bam file has a corresponding "FJ.bed.gz" file
Check that every bam file has a corresponding "scaled.bedgraph.gz" file
Check that every "scaled.bedgraph.gz" has a corresponding "scaled.bedgraph.gz.tbi"
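
The reviewer checks above could be scripted roughly as follows. This is only a sketch: it assumes each BAM is named SAMPLE.bam and that its companion files are SAMPLE.FJ.bed.gz and SAMPLE.scaled.bedgraph.gz plus their .tbi indexes, which this ticket does not confirm. check_outputs is a hypothetical helper.

```shell
# check_outputs DIR: for every BAM in DIR, report companion files that are
# missing or empty. Returns non-zero if anything was reported.
# Assumed naming: SAMPLE.bam -> SAMPLE.<suffix> for each suffix below.
check_outputs() {
    problems=0
    for bam in "$1"/*.bam; do
        [ -e "$bam" ] || continue            # no BAMs: glob stayed literal
        if [ ! -s "$bam" ]; then
            echo "empty: $bam"
            problems=$((problems + 1))
        fi
        sample=${bam%.bam}
        for suffix in FJ.bed.gz FJ.bed.gz.tbi scaled.bedgraph.gz scaled.bedgraph.gz.tbi; do
            # -s is true only when the file exists and has a size above zero
            if [ ! -s "$sample.$suffix" ]; then
                echo "missing or empty: $sample.$suffix"
                problems=$((problems + 1))
            fi
        done
    done
    [ "$problems" -eq 0 ]
}

# Example (paths from this ticket):
#   check_outputs /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon
```

No output and a zero exit status mean every BAM has all four companion files and none of them is empty.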

Branch: https://bitbucket.org/mdavis4290/molly-splicing-analysis/branch/IGBF-3500

SRP441343_SL4_multiqc_report.html
SRP441343_SL5_multiqc_report.html
SRP441343.csv

Molly Davis made changes - 05/Dec/23 4:42 PM

Assignee

Molly Davis [ molly ]

Molly Davis made changes - 05/Dec/23 4:42 PM

Status

In Progress [ 3 ]

Needs 1st Level Review [ 10005 ]

Molly Davis made changes - 07/Dec/23 10:53 AM

Assignee

Robert Reid [ robertreid ]

Molly Davis made changes - 07/Dec/23 4:37 PM

Description

Re-run mark 2022 timeseries data with the name SRP441343 from SRA for both SL4 and SL5 genomes.

Re-run mark 2022 timeseries data with the name SRP441343 from SRA for both SL4 and SL5 genomes.

For this task, we need to confirm and sanity-check the mark 2022 time series data that Rob uploaded and submitted to the Sequence Read Archive.

If the data are good, we will replace all the existing BAM, junctions, etc. files deployed in the "hotpollen" quickload site with newly processed data.
For this task:
* Check SRP on NCBI and review submission
* Download the data onto the cluster by using the SRP name
* Run nf-core/rnaseq pipeline
* Run our coverage graph and junctions scripts on the data

Note that all files should now use their "SRR" names instead of the existing file names.

Robert Reid added a comment - 08/Dec/23 8:43 AM

SL4 Folder:

Overall NFCore structure appears as expected.

We see the expected # of bam, bai, bedgraph and bed files, 54!
/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon$ ll *bam | wc -l
54
rreid2@str-i1:/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon$ ll *bai | wc -l
54
rreid2@str-i1:/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon$ ll *gz | wc -l
108
rreid2@str-i1:/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon$ ll *bedgraph.gz | wc -l
54
rreid2@str-i1:/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon$ ll *bed.gz | wc -l
54
There are 54 tbi files as expected. (*scaled.bedgraph.gz.tbi)

All of the .err files are 0 bytes in size. As expected. All other files have at least some size!

scaled.bedgraph.gz.tbi files are all around 60-70kb in size.
All of the bamfiles are 1.3GB to 2.7GB in size as expected.
.bai files look fine.

Next up SL5.

Molly Davis made changes - 08/Dec/23 9:23 AM

Status

Needs 1st Level Review [ 10005 ]

First Level Review in Progress [ 10301 ]

Robert Reid added a comment - 08/Dec/23 9:43 AM

SL5 folder

The NFcore file structure is as expected.

54 files each for bam, bai, bed.gz, and bed.gz.tbi.

BAM files are as expected.

All .err files are 0 bytes.

All looks good.

Next up:
csv files and html in bitbucket.

Robert Reid added a comment - 08/Dec/23 9:46 AM

In https://bitbucket.org/mdavis4290/molly-splicing-analysis/branch/IGBF-3500

the HTML files both exist with content.

The .csv file looks to have the full 55 lines (54 expts plus a header) with the proper fastq in each column.
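
A check like this one can also be scripted. The sketch below assumes the CSV follows the usual nf-core/rnaseq samplesheet layout (sample, fastq_1, fastq_2, strandedness) and that each fastq file name contains its sample's SRR name followed by _1.fastq or _2.fastq. Both are assumptions about SRP441343.csv, and check_samplesheet is a hypothetical helper.

```shell
# check_samplesheet CSV EXPECTED: verify the sheet has EXPECTED data rows
# and that each row's fastq columns carry that row's sample name.
# Assumed columns: 1 = sample, 2 = fastq_1, 3 = fastq_2.
check_samplesheet() {
    awk -F, -v expected="$2" '
        NR == 1 { next }    # skip the header row
        {
            rows++
            if (index($2, $1 "_1.fastq") == 0 || index($3, $1 "_2.fastq") == 0) {
                printf "line %d: %s does not match its fastq columns\n", NR, $1
                bad++
            }
        }
        END {
            if (rows != expected) {
                printf "expected %d samples, found %d\n", expected, rows
                bad++
            }
            exit (bad > 0)  # non-zero status if any problem was reported
        }
    ' "$1"
}

# Example: check_samplesheet SRP441343.csv 54
```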

I call this done!

Robert Reid made changes - 08/Dec/23 9:46 AM

Status

First Level Review in Progress [ 10301 ]

Ready for Pull Request [ 10304 ]

Robert Reid made changes - 08/Dec/23 9:46 AM

Assignee

Robert Reid [ robertreid ]

Molly Davis [ molly ]

Molly Davis added a comment - 08/Dec/23 10:03 AM

Thank you! Robert Reid

PR: https://bitbucket.org/hotpollen/splicing-analysis/pull-requests/13

Molly Davis made changes - 08/Dec/23 10:03 AM

Assignee

Molly Davis [ molly ]

Molly Davis made changes - 08/Dec/23 10:03 AM

Status

Ready for Pull Request [ 10304 ]

Pull Request Submitted [ 10101 ]

Ann Loraine made changes - 11/Dec/23 9:44 AM

Sprint

Fall 6, Fall 7 [ 182, 183 ]

Fall 6, Fall 7, Fall 8 [ 182, 183, 184 ]

Ann Loraine made changes - 11/Dec/23 9:44 AM

Rank

Ranked higher

Molly Davis made changes - 13/Dec/23 10:27 AM

Assignee

Ann Loraine [ aloraine ]

Ann Loraine made changes - 13/Dec/23 3:44 PM

Status

Pull Request Submitted [ 10101 ]

Reviewing Pull Request [ 10303 ]

Ann Loraine made changes - 13/Dec/23 3:44 PM

Status

Reviewing Pull Request [ 10303 ]

Merged Needs Testing [ 10002 ]

Ann Loraine made changes - 13/Dec/23 3:44 PM

Assignee

Ann Loraine [ aloraine ]

Ann Loraine added a comment - 13/Dec/23 3:48 PM

Testing suggestions:

Open the newly added .html files in a Web browser to check that they didn't get corrupted somehow (by mistake, of course)
Check that the .html files mention the expected SRA identifiers
Check that the SRA identifiers listed in the added csv files match up with the .html files
Check that the csv file SRA identifiers are repeated in the expected way in the expected columns (e.g., sample names match up with file names)
Make a note of any interesting (or not so interesting!) differences in results obtained for SL4 and SL5, recalling whether SL5 has more or fewer gene models and genes than SL4
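
The identifier check against the .html files could be done with a small loop. This is a sketch only: it assumes a plain-text list of expected SRR IDs, one per line (for example the Sra_ids.txt used by the fasterq-dump job), and it tests only whether each ID appears anywhere in the report text. missing_ids is a hypothetical helper.

```shell
# missing_ids IDS_FILE REPORT: print every ID from IDS_FILE that does not
# appear anywhere in REPORT. No output means every ID was found.
missing_ids() {
    while IFS= read -r id; do
        [ -n "$id" ] || continue                 # ignore blank lines
        grep -qF -- "$id" "$2" || echo "$id"     # -F: match the ID literally
    done < "$1"
}

# Example: missing_ids Sra_ids.txt SRP441343_SL4_multiqc_report.html
```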

Ann Loraine made changes - 22/Dec/23 10:16 AM

Sprint

Fall 6, Fall 7, Fall 8 [ 182, 183, 184 ]

Fall 6, Fall 7, Spring 2 [ 182, 183, 186 ]

Molly Davis made changes - 16/Jan/24 9:49 AM

Sprint

Fall 6, Fall 7, Spring 2 [ 182, 183, 186 ]

Fall 6, Fall 7, Spring 1 [ 182, 183, 185 ]

Molly Davis made changes - 17/Jan/24 1:41 PM

Assignee

Molly Davis [ molly ]

Molly Davis added a comment - 17/Jan/24 2:14 PM

Testing:

html files open and report accurate information
SRA identifiers are present
csv SRA identifiers match the SRA identifiers in the html files
the fastq file SRA identifiers match the sample SRA identifiers in the csv file
There seem to be more 'reads mapped' for SL5 than for SL4, but SL4 has a higher '% Proper Pairs' than SL5.

Next step: prepare data to be moved from the cluster to IGB quick load. Refer to IGBF-3499

Moving ticket to done!

Molly Davis made changes - 17/Jan/24 2:16 PM

Status

Merged Needs Testing [ 10002 ]

Post-merge Testing In Progress [ 10003 ]

Molly Davis made changes - 17/Jan/24 2:16 PM

Resolution		Done [ 10000 ]
Status	Post-merge Testing In Progress [ 10003 ]	Closed [ 6 ]

People

Assignee:

Molly Davis

Reporter:

Molly Davis

Votes:

0

Watchers:

3

Dates

Created:

15/Nov/23 4:04 PM

Updated:

17/Jan/24 2:16 PM

Resolved:

17/Jan/24 2:16 PM