[IGBF-3500] Re-run mark-2022-timeseries data with data downloaded from SRA - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
3
Epic Link:
Support NSF pollen grant
Sprint:
Fall 6, Fall 7, Spring 1

Description

Re-run the mark 2022 time series data, available from SRA as SRP441343, for both the SL4 and SL5 genomes.

For this task, we need to confirm and sanity-check the mark 2022 time series data that Rob uploaded and submitted to the Sequence Read Archive.

If the data are good, we will replace all the existing BAM, junctions, and related files deployed in the "hotpollen" quickload site with newly processed data.
For this task:

Check SRP on NCBI and review submission
Download the data onto the cluster by using the SRP name
Run nf-core/rnaseq pipeline
Run our coverage graph and junctions scripts on the data

Note that all files should now use their "SRR" names instead of the existing file names.

Attachments

Issue Links

relates to

IGBF-3406 Re-run Nextflow Muday time course data with SL5 data downloaded from SRA

Closed

IGBF-3499 Complete evaluation of RNA-Seq dataset submissions

Closed

Activity

Molly Davis added a comment - 16/Nov/23 2:06 PM - edited

Re-run Directory: /projects/tomato_genome/fnb/dataprocessing/SRP441343
SL4: /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4
SL5: /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL5

Prefetch SRR Script:

#! /bin/bash

#SBATCH --job-name=prefetch_SRR
#SBATCH --partition=Orion
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4gb
#SBATCH --output=%x_%j.out
#SBATCH --time=24:00:00

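cd /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4
module load sra-tools/2.11.0
# vdb-config --interactive needs a terminal, so it cannot run inside a batch job.
# Run it once on a login node before submitting this script.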

files=(
SRR24836276
SRR24836277
SRR24836278
SRR24836279
SRR24836280
SRR24836281
SRR24836282
SRR24836283
SRR24836284
SRR24836285
SRR24836286
SRR24836287
SRR24836288
SRR24836289
SRR24836290
SRR24836291
SRR24836292
SRR24836293
SRR24836294
SRR24836295
SRR24836296
SRR24836297
SRR24836298
SRR24836299
SRR24836300
SRR24836301
SRR24836302
SRR24836303
SRR24836304
SRR24836305
SRR24836306
SRR24836307
SRR24836308
SRR24836309
SRR24836310
SRR24836311
SRR24836312
SRR24836313
SRR24836314
SRR24836315
SRR24836316
SRR24836317
SRR24836318
SRR24836319
SRR24836320
SRR24836321
SRR24836322
SRR24836323
SRR24836324
SRR24836325
SRR24836326
SRR24836327
SRR24836328
SRR24836329
)

for f in "${files[@]}"; do echo "$f"; prefetch "$f"; done

Execute:

chmod u+x prefetch.slurm

sbatch prefetch.slurm
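
The fasterq-dump array job reads its run accessions from Sra_ids.txt, one per line, but the ticket does not show how that file was made. Because the 54 accessions in the prefetch script form one contiguous range, the file can be generated rather than typed. The following is a minimal sketch; the helper name write_sra_ids is hypothetical and is not part of the original scripts.

```shell
#!/bin/bash
# Sketch only: write_sra_ids is a hypothetical helper, not one of the ticket's scripts.
# Writes one run accession per line for the contiguous range SRR<first>..SRR<last>.
write_sra_ids() {
    local first=$1 last=$2 out=$3
    local n=$first
    while [ "$n" -le "$last" ]; do
        printf 'SRR%s\n' "$n"
        n=$((n + 1))
    done > "$out"
}
```

For example, `write_sra_ids 24836276 24836329 Sra_ids.txt` writes the 54 accessions listed in the prefetch script, which matches the 1-54 array range.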

fasterq-dump Script:

#! /bin/bash

#SBATCH --job-name=fastqdump_SRR
#SBATCH --partition=Orion
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=40gb
#SBATCH --output=%x_%j.out
#SBATCH --time=24:00:00
#SBATCH --array=1-54

# Look up this array task's run accession (one per line in Sra_ids.txt)
file=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p" /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/Sra_ids.txt)

module load sra-tools/2.11.0

echo "Starting fasterq-dump on $file"

# The .sra file for each run sits in a directory named after its accession
cd "/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/$file"

fasterq-dump "${file}.sra"

# Validate the paired FASTQ files
perl /projects/tomato_genome/scripts/validateHiseqPairs.pl "${file}_1.fastq" "${file}_2.fastq"

# Copy the FASTQ files up to the nfcore-SL4 directory
cp "${file}_1.fastq" "/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/${file}_1.fastq"
cp "${file}_2.fastq" "/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/${file}_2.fastq"

echo "finished"

Execute:

chmod u+x fasterdump.slurm

sbatch fasterdump.slurm
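
Once the array job finishes, it may be worth confirming that every accession produced both mate files before starting the pipeline. The sketch below assumes the naming used by the script above (ACCESSION_1.fastq and ACCESSION_2.fastq in one directory); the function name check_fastq_pairs is hypothetical.

```shell
#!/bin/bash
# Sketch only: check_fastq_pairs is a hypothetical helper.
# Usage: check_fastq_pairs IDS_FILE FASTQ_DIR
# Prints every expected FASTQ file that is missing or empty.
# Returns non-zero if any file is missing or empty.
check_fastq_pairs() {
    local ids=$1 dir=$2 id mate missing=0
    while IFS= read -r id; do
        [ -n "$id" ] || continue          # skip blank lines
        for mate in 1 2; do
            if [ ! -s "$dir/${id}_${mate}.fastq" ]; then
                echo "missing or empty: $dir/${id}_${mate}.fastq"
                missing=1
            fi
        done
    done < "$ids"
    return "$missing"
}
```

A call such as `check_fastq_pairs Sra_ids.txt .` from the nfcore-SL4 directory would list any run that still needs attention.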

5 older comments

Robert Reid added a comment - 08/Dec/23 9:43 AM

SL5 folder

The NFcore file structure is as expected.

54 files each for bam, bai, bed.gz, and bed.gz.tbi.

BAM files are as expected.

All .err files are 0 bytes.

All looks good.

Next up:
csv files and html in bitbucket.
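
The checks in this comment (54 files per output type, all .err files empty) could be scripted roughly as follows. This is a sketch under assumptions: it counts files by suffix anywhere under a directory, and the function names are hypothetical. The real output layout may need a narrower search path.

```shell
#!/bin/bash
# Sketch only: all function names here are hypothetical.

# count_by_suffix DIR SUFFIX: number of files under DIR whose name ends in SUFFIX
count_by_suffix() {
    find "$1" -type f -name "*$2" | wc -l
}

# nonempty_err_files DIR: list .err files under DIR that contain any output
nonempty_err_files() {
    find "$1" -type f -name '*.err' ! -empty
}

# check_outputs DIR EXPECTED: report the count for each suffix and
# return non-zero if a count differs from EXPECTED or an .err file has content
check_outputs() {
    local dir=$1 expected=$2 suffix n status=0
    for suffix in .bam .bai .bed.gz .bed.gz.tbi; do
        n=$(count_by_suffix "$dir" "$suffix")
        echo "$suffix: $n"
        [ "$n" -eq "$expected" ] || status=1
    done
    if [ -n "$(nonempty_err_files "$dir")" ]; then
        echo "non-empty .err files found"
        status=1
    fi
    return "$status"
}
```

Running `check_outputs /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL5 54` would then print one count per suffix.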

Robert Reid added a comment - 08/Dec/23 9:46 AM

Branch: https://bitbucket.org/mdavis4290/molly-splicing-analysis/branch/IGBF-3500

Both HTML files exist in the branch and have content.

The .csv file has the full 55 lines (54 experiments plus a header), with the proper FASTQ file in each column.

I call this done!
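
The CSV check described here can also be automated. The sketch below assumes the file follows the nf-core/rnaseq samplesheet layout (header row, then sample, fastq_1, fastq_2 as the first three comma-separated columns) and that the sample column holds the SRR accession. Both are assumptions about this branch's file, and check_samplesheet is a hypothetical name.

```shell
#!/bin/bash
# Sketch only: check_samplesheet is a hypothetical helper.
# Assumes columns 1-3 are sample, fastq_1, fastq_2 and that the
# sample name is the SRR accession embedded in the FASTQ file names.
# Usage: check_samplesheet CSV EXPECTED_DATA_ROWS
check_samplesheet() {
    local csv=$1 expected=$2
    awk -F',' -v expected="$expected" '
        NR == 1 { next }    # skip the header row
        {
            rows++
            # fastq_1 should contain <sample>_1 and fastq_2 should contain <sample>_2
            if (index($2, $1 "_1") == 0 || index($3, $1 "_2") == 0) {
                printf "line %d: fastq columns do not match sample %s\n", NR, $1
                bad = 1
            }
        }
        END {
            if (rows != expected) {
                printf "expected %d data rows, found %d\n", expected, rows
                bad = 1
            }
            exit (bad ? 1 : 0)
        }
    ' "$csv"
}
```

With 54 experiments, the call would be `check_samplesheet samplesheet.csv 54`, where samplesheet.csv stands in for the actual file name in the branch.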

Molly Davis added a comment - 08/Dec/23 10:03 AM

Thank you, Robert Reid!

PR: https://bitbucket.org/hotpollen/splicing-analysis/pull-requests/13

Ann Loraine added a comment - 13/Dec/23 3:48 PM

Testing suggestions:

Open the newly added .html files in a Web browser to check that they didn't get corrupted somehow (by mistake, of course)
Check that the .html files mention the expected SRA identifiers
Check that the SRA identifiers listed in the added csv files match up with the .html files
Check that the csv file SRA identifiers are repeated in the expected way in the expected columns (e.g., sample names match up with file names)
Make a note of any interesting (or not so interesting!) differences in results obtained for SL4 and SL5, recalling whether SL5 has more or fewer gene models and genes than SL4
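
The identifier comparisons suggested above can be done mechanically by extracting every SRR accession from each file and comparing the two sets. This is a minimal sketch: the function names are hypothetical, and it only checks that the same accessions appear in both files, not where they appear.

```shell
#!/bin/bash
# Sketch only: srr_ids and compare_srr_ids are hypothetical helpers.

# srr_ids FILE: print the unique SRR accessions mentioned in FILE, sorted
srr_ids() {
    grep -oE 'SRR[0-9]+' "$1" | sort -u
}

# compare_srr_ids FILE_A FILE_B: diff the accession sets of two files.
# Lines starting with "<" appear only in FILE_A, ">" only in FILE_B.
# Returns 0 when both files mention exactly the same accessions.
compare_srr_ids() {
    local a b status=0
    a=$(mktemp)
    b=$(mktemp)
    srr_ids "$1" > "$a"
    srr_ids "$2" > "$b"
    diff "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

A call would look like `compare_srr_ids samplesheet.csv report.html`, where both file names are placeholders for the files in the branch.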

Molly Davis added a comment - 17/Jan/24 2:14 PM

Testing:

html files open and report accurate information
SRA identifiers are present
csv SRA identifiers match the SRA identifiers in the html files
the fastq file SRA identifiers match the sample SRA identifiers in the csv file
SL5 appears to have more 'reads mapped' than SL4, but SL4 has a higher '% Proper Pairs' than SL5.

Next step: prepare data to be moved from the cluster to IGB Quickload. Refer to IGBF-3499.

Moving ticket to done!

People

Assignee:

Molly Davis

Reporter:

Molly Davis

Votes:

0

Watchers:

3

Dates

Created:

15/Nov/23 4:04 PM

Updated:

17/Jan/24 2:16 PM

Resolved:

17/Jan/24 2:16 PM