IGB / IGBF-3720

Re-run Nextflow on the Muday time course data with SL5, using data downloaded from SRA

    Details

    • Type: Task
• Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
• Labels: None

      Description

      SRP460750

      Directory: /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5-2024-05-07

Previously we noticed that SRA had mismatched some of the data: 16 of the sample names were mislabeled. Dr. Reid reached out and had SRA change everything to the correct sample names. Now we must rerun the Muday SRA data on the cluster with Nextflow and make sure the data are correctly labeled.

For this task, we need to confirm and sanity-check the Muday time course data that Rob recently uploaded and submitted to the Sequence Read Archive.
If the data are good, we will replace all of the existing files (BAM, junctions, etc.) deployed on the "hotpollen" quickload site with newly processed data.
      For this task:

      • Check SRP on NCBI and review submission
      • Download the data onto the cluster by using the SRP name
• Run the nf-core/rnaseq pipeline (a sample launch command is sketched after this list)
      • Run our coverage graph and junctions scripts on the data

      Note that all files should now use their "SRR" names instead of the existing file names.
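For reference, a typical nf-core/rnaseq launch for this step might look like the sketch below. The samplesheet name and the SL5 FASTA/GTF paths are placeholders, not the exact values used for this run:

#!/bin/bash
# Hypothetical launch command; samplesheet and genome paths are assumptions.
nextflow run nf-core/rnaseq \
    --input samplesheet.csv \
    --outdir results \
    --fasta SL5.fa \
    --gtf SL5.gtf \
    --aligner star_salmon \
    -profile singularity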

Attachments

Issue Links

Activity

Molly Davis added a comment - edited

            Re-run Directory: /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5-2024-05-07

            Prefetch Script:

            #!/bin/bash
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=2
            #SBATCH --mem=8gb
            #SBATCH --time=15:00:00
            #SBATCH --partition=Orion
            
            #
# This script uses a tool from NCBI called "prefetch" to download run
# files from the Sequence Read Archive in ".sra" format. Next,
# it uses another tool called "vdb-validate" to make sure the download
# did not fail.
#
# You can run this in parallel, on a cluster, like this:
#
#    cat srr.txt | xargs -I A sbatch --export=S=A --job-name=A --output=A.out --error=A.err prefetch.sh
            #
            # where srr.txt is a file with SRR run accessions, one per line.
            #
            # How it works:
            #
#  The preceding command delivers SRR run accessions, one at a time, to the
#  xargs command. The xargs command then runs sbatch once per accession. The
#  -I option tells xargs to replace the literal string "A" in the command line
#  with each accession; because this is plain string substitution performed by
#  xargs, not a shell variable, no "$" prefix is needed.
            #
#  Observe that the "sbatch" command gets some options, too. Its final argument
#  is the name of this script, and it passes the accession in as an environment
#  variable named S.
            #
            # The code is passing a lot of stuff around, from cat to xargs to sbatch to prefetch. 
            #
# Did the download work as expected? How do you know? View the output files,
# which are all named after the SRR identifier, and check what vdb-validate wrote.
            #
            #
            # Other things to know:
            #
# (1) This script assumes that the user has configured their account
#     (see: ~/.ncbi/user-settings.mkfg) to save output to the current working
#     directory.
# (2) prefetch saves each run as an ".sra" file inside a new folder named
#     for the SRR accession. It creates that folder if it does not already
#     exist.
# (3) Why prefetch? Because if a download is interrupted, prefetch can
#     resume where it left off, saving time and bandwidth.
            #
            # For more info about prefetch, see this excellent documentation:
            # https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump
            #
            #
            cd $SLURM_SUBMIT_DIR
            module load sra-tools/2.11.0
            CMD1="prefetch $S"
            echo "Running: $CMD1"
            $CMD1
            CMD2="vdb-validate $S/$S.sra"
            echo "Running: $CMD2"
            $CMD2
            

            Note: Make sure to create the srr.txt file with the list of SRR names. Use the Accession list from NCBI.
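Before submitting jobs, it may help to sanity-check srr.txt. A minimal sketch (the expected line count of 72 comes from the array size used below):

# Quick checks on srr.txt before launching:
wc -l srr.txt                     # should match the number of runs (72 here)
sort srr.txt | uniq -d            # prints duplicate accessions, if any
grep -vc '^SRR[0-9]*$' srr.txt    # counts malformed lines; should be 0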

            Run Script:

            cat srr.txt | xargs -I A sbatch --export=S=A --job-name=A --output=A.out --error=A.err prefetch.sh
            

fasterq-dump Script:

#!/bin/bash
            
            #SBATCH --job-name=fastqdump_SRR
            #SBATCH --partition=Orion
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=1
            #SBATCH --mem=40gb
            #SBATCH --output=%x_%j.out
            #SBATCH --time=24:00:00
            #SBATCH --array=1-72
# Look up the SRR accession for this array task (line N of srr.txt for task N)
SRR_name=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p" $SLURM_SUBMIT_DIR/srr.txt)
            
            
            cd $SLURM_SUBMIT_DIR
            module load sra-tools/2.11.0
            
            echo "Starting faster-qdump on $SRR_name";
            
            cd $SLURM_SUBMIT_DIR/$SRR_name
            
            fasterq-dump ${SRR_name}.sra
            
            gzip *.fastq
            
# gzip appended ".gz" to the fastq names, so move the compressed files:
mv ${SRR_name}_1.fastq.gz $SLURM_SUBMIT_DIR/
mv ${SRR_name}_2.fastq.gz $SLURM_SUBMIT_DIR/
            
            echo "finished"
            

            Run Script:

            chmod u+x fasterdump.slurm
            
            sbatch fasterdump.slurm
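After the array finishes, a quick completeness check may help (a sketch; assumes paired-end runs and the srr.txt used above):

# Report any accession missing (or with an empty) mate file:
while read -r srr; do
  for mate in 1 2; do
    [ -s "${srr}_${mate}.fastq.gz" ] || echo "MISSING or empty: ${srr}_${mate}.fastq.gz"
  done
done < srr.txt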
            
Molly Davis added a comment - edited

Nextflow pipeline ran successfully with the SL5 genome.
            Directory: /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5-2024-05-07
            MultiQC report notes: No errors or warnings were present in the report. The output file is named 'SRP460750_SL5_multiqc_report.html'.

Molly Davis added a comment

Next steps:
• Commit MultiQC report to the Flavonoid repo on Bitbucket
• Change sorted BAM names
• Create junction files
• Create coverage graphs

Molly Davis added a comment

Launch renameBams.sh script:
./renameBams.sh
Launch scaled coverage graphs script:
./sbatch-doIt.sh .bam bamCoverage.sh >jobs.out 2>jobs.err
Launch junction files script:
./sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err
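sbatch-doIt.sh is a project helper whose contents are not shown in this ticket. As a rough, hypothetical sketch of the pattern it implements (submit the given per-file script once for every file matching the given suffix, passing the file name as the script's argument):

#!/bin/bash
# Hypothetical reconstruction of a dispatcher like sbatch-doIt.sh.
# Usage: ./sbatch-doIt.sh <suffix> <per-file-script>
suffix=$1
script=$2
for f in *"$suffix"; do
    # One cluster job per matching file; the per-file script receives the file name.
    sbatch --job-name="$f" "$script" "$f"
done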

Molly Davis added a comment

            Directory: /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5-2024-05-07/results/star_salmon
Reviewer checklist (a verification sketch follows this list):
            Check that files have reasonable sizes (no "zero" size files, for example)
            Check that every "FJ.bed.gz" file has a corresponding "FJ.bed.gz.tbi" index file
            Check that every bam file has a corresponding "FJ.bed.gz" file
            Check that every bam file has a corresponding "scaled.bedgraph.gz" file
            Check that every "scaled.bedgraph.gz" has a corresponding "scaled.bedgraph.gz.tbi"

Molly Davis added a comment

            Branch: https://bitbucket.org/mdavis4290/molly5-flavonoid-rnaseq/branch/IGBF-3720

Note: The new MultiQC report for SL5 is in this branch. I deleted the old SL4 report. SL4 has not yet been rerun with the newly re-annotated SRA data.

Molly Davis added a comment

PR: https://bitbucket.org/hotpollen/flavonoid-rnaseq/pull-requests/46

              People

• Assignee: Molly Davis
• Reporter: Molly Davis
• Votes: 0

                Dates

• Created:
• Updated:
• Resolved: