[IGBF-3228] Download and process data for SRP100604 and SRP268884 - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
2
Epic Link:
Support NSF pollen grant
Sprint:
Fall 7 2022 Nov 21, Fall 8 2022 Dec 5, Spring 1 2023 Dec 26, Spring 2 2023 Jan 16, Spring 3 2023 Feb 1, Spring 4 2023 Feb 21

Attachments


Final_SRP268884_multiqc_report.html
7.69 MB
20/Jan/23 10:33 AM
Screen Shot 2022-11-30 at 6.32.58 PM.png
47 kB
01/Dec/22 10:54 AM
SRP100604_multiqc_report.html
1.85 MB
19/Jan/23 1:22 PM

Issue Links

blocks

IGBF-3239 Create scaled coverage graphs for SRP100604 and SRP268884

Closed

relates to

IGBF-3238 Create sample description sheets for SRP100604 and SRP268884

Closed

IGBF-3143 Run RNA-Seq data processing pipeline on positive splicing control and experimental samples

Closed

Activity

Molly Davis added a comment - 22/Nov/22 2:34 PM - edited

SRP100604 and SRP268884 have been uploaded and fastq files created.

SRA links:
https://www.ncbi.nlm.nih.gov/sra?term=SRP100604
https://www.ncbi.nlm.nih.gov/sra?term=SRP268884

Code Example

prefetch.slurm

#! /bin/bash

#SBATCH --job-name=prefetch_SRR
#SBATCH --partition=Orion
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4gb
#SBATCH --output=%x_%j.out
#SBATCH --time=24:00:00

cd /nobackup/tomato_genome/alt_splicing/SRP100604
module load sra-tools/2.11.0
vdb-config --interactive

files=(
        SRR5279858
        SRR5279875
        SRR5279883
        SRR5280323
        SRR5280370
        SRR5280382
        SRR5280383
        SRR5280392
        SRR5282476
        SRR5282478
        SRR5282480
        SRR5282481
)

for f in "${files[@]}"; do echo "$f"; prefetch "$f"; done

fasterdump.slurm

#! /bin/bash

#SBATCH --job-name=fastqdump_SRR
#SBATCH --partition=Orion
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=40gb
#SBATCH --output=%x_%j.out
#SBATCH --time=24:00:00
#SBATCH --array=1-12

# Look up this task's accession: line N of Sra_ids.txt, where N is the array task ID
file=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p" /nobackup/tomato_genome/alt_splicing/SRP100604/Sra_ids.txt)

cd /nobackup/tomato_genome/alt_splicing/SRP100604
module load sra-tools/2.11.0

echo "Starting fasterq-dump on $file"

# Each run's .sra file sits in its own subdirectory, named after the accession
cd "/nobackup/tomato_genome/alt_splicing/SRP100604/$file"

fasterq-dump "${file}.sra"

# Validate the paired reads; this step assumes both _1 and _2 files exist
perl /projects/tomato_genome/scripts/validateHiseqPairs.pl "${file}_1.fastq" "${file}_2.fastq"

# Copy the FASTQ files up to the project directory
cp "${file}_1.fastq" "/nobackup/tomato_genome/alt_splicing/SRP100604/${file}_1.fastq"
cp "${file}_2.fastq" "/nobackup/tomato_genome/alt_splicing/SRP100604/${file}_2.fastq"

echo "finished"
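
The array job above reads its accession from line N of Sra_ids.txt, but the ticket does not show how that file was made. A minimal sketch, assuming the file holds one accession per line in the same order as the files array in prefetch.slurm; the function names write_ids_file and nth_id are hypothetical and not part of the original scripts:

```shell
#!/bin/bash
# Hypothetical helpers; not part of the original scripts.

# Write one accession per line so that line N corresponds to array task N.
write_ids_file() {
    local out_file=$1
    shift
    printf '%s\n' "$@" > "$out_file"
}

# Print the accession on line N, the same way the sed call in fasterdump.slurm does.
nth_id() {
    local n=$1 ids_file=$2
    sed -n -e "${n}p" "$ids_file"
}
```

For example, write_ids_file Sra_ids.txt "${files[@]}" followed by wc -l < Sra_ids.txt gives the upper bound to use in --array=1-N, which keeps the array size in step with the accession list.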

Comments on results
Directory: /nobackup/tomato_genome/alt_splicing/SRP100604

SRP100604: Some SRR files were single stranded rather than double stranded, so fasterq-dump could not make _1.fastq and _2.fastq files for them.

List of those SRRs:
SRR5282476
SRR5282478
SRR5282480
SRR5282481

Directory: /nobackup/tomato_genome/alt_splicing/SRP268884

SRP268884: All runs produced double stranded _1.fastq and _2.fastq files.

Next Step: Run Nextflow rnaseq/nf-core pipeline on SRP268884.

Question: Should we still use SRP100604 if it contains single stranded SRR files or just use the double stranded files that it contained?

[~aloraine]

aloraine's answer to the above query: Go ahead and use all the available data in SRP100604. I believe that nextflow is able to handle this complication intelligently. I think you can omit the "second" file name in the "samples" file for single end runs. (Please note that "single strand" is not the same thing as "single end" - make sure that we are talking about the same thing before proceeding.)
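
The suggestion to omit the second file name for single-end runs could be sketched as follows. This assumes the nf-core/rnaseq sample sheet columns sample,fastq_1,fastq_2,strandedness, and that fasterq-dump wrote ACCESSION_1.fastq and ACCESSION_2.fastq for paired runs but a single ACCESSION.fastq for single-end runs. The function name make_samplesheet, the "unstranded" default, and the extension argument are placeholders and do not come from the ticket. The pipeline is understood to expect gzip-compressed FASTQ, so the files would need to be compressed first and the extension set to fastq.gz.

```shell
#!/bin/bash
# Hypothetical sketch; column names follow the nf-core/rnaseq sample sheet format.
# Arguments: FASTQ directory, file of accessions (one per line),
#            strandedness (placeholder default), file extension.
make_samplesheet() {
    local fastq_dir=$1 ids_file=$2 strandedness=${3:-unstranded} ext=${4:-fastq.gz}
    local id
    echo "sample,fastq_1,fastq_2,strandedness"
    while IFS= read -r id; do
        [ -n "$id" ] || continue
        if [ -f "$fastq_dir/${id}_1.$ext" ] && [ -f "$fastq_dir/${id}_2.$ext" ]; then
            # Paired-end run: both mates present
            echo "$id,$fastq_dir/${id}_1.$ext,$fastq_dir/${id}_2.$ext,$strandedness"
        elif [ -f "$fastq_dir/${id}.$ext" ]; then
            # Single-end run: leave the fastq_2 column empty
            echo "$id,$fastq_dir/${id}.$ext,,$strandedness"
        else
            echo "warning: no FASTQ files found for $id" >&2
        fi
    done < "$ids_file"
}
```

Deciding the layout from the files that exist, rather than from a hand-kept list, means the four single-end runs in SRP100604 would get an empty second column without special handling.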

Molly Davis added a comment - 20/Jan/23 10:31 AM - edited

Updated MultiQC report:
Final_SRP268884_multiqc_report.html

Updated Bitbucket CSV file:
https://bitbucket.org/hotpollen/splicing-analysis/pull-requests/9

Updated Scaled Coverage Graphs:
/nobackup/tomato_genome/alt_splicing/for_igbquickload/coverage_graphs_2/coverage_graphs_SRP268884

Ann Loraine added a comment - 29/Jan/23 4:01 PM

Processing parameters additionally include a "tomato.config" file:

params {
    modules {
        'star_align' {
            args            = '--alignIntronMax 13000 --quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand zcat --runRNGseed 0 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan Singleend'
        }
        'hisat2_align' {
            args            = "--max-intronlen 13000 --met-stderr --new-summary --dta"
        }
    }
}
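
A custom config like this is normally passed to Nextflow with the -c option. The ticket does not record the actual command, so the following is only a sketch that prints a command instead of running it. The names samples.csv and results are placeholders, and a real run would also need genome, annotation, profile, and pipeline-version options, which are omitted here because the ticket does not give them.

```shell
#!/bin/bash
# Sketch only: prints the command instead of running it.
# samples.csv and results are placeholder names, not taken from the ticket.
print_nextflow_command() {
    local config=$1 samplesheet=$2 outdir=$3
    echo "nextflow run nf-core/rnaseq -c $config --input $samplesheet --outdir $outdir"
}

print_nextflow_command tomato.config samples.csv results
```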

Ann Loraine added a comment - 23/Feb/23 9:02 AM

Adding MultiQC files to repository hotpollen/splicing-analysis.git.

Ann Loraine added a comment - 23/Feb/23 9:11 AM - edited

Copied files to RENCI host:

SRP100604 coverage graphs
SRP100604 alignment files
SRP100604 junction files
SRP268884 coverage graphs
SRP268884 alignment files
SRP268884 junction files

Example invocation:

scp -J aloraine@hop.renci.org -r SRP268884_bam aloraine@lorainelab-quickload.scidas.org:/projects/igbquickload/lorainelab/www/main/htdocs/hotpollen/S_lycopersicum_Jun_2022/SRP268884/.
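
The same invocation could be generated for several directories at once. The sketch below only prints the commands as a dry run. It assumes that each local directory name starts with its SRP accession followed by an underscore, as SRP268884_bam does, and that each dataset is copied to a remote subdirectory named after that accession. Only SRP268884_bam appears in the ticket; any other directory name passed in would be hypothetical, as is the function name print_scp_commands.

```shell
#!/bin/bash
# Dry run: prints one scp command per directory; nothing is copied.
# Hosts and the remote root are taken from the example invocation above.
REMOTE_ROOT=/projects/igbquickload/lorainelab/www/main/htdocs/hotpollen/S_lycopersicum_Jun_2022

print_scp_commands() {
    local dir srp
    for dir in "$@"; do
        # Assumption: the SRP accession is everything before the first underscore
        srp=${dir%%_*}
        echo "scp -J aloraine@hop.renci.org -r $dir aloraine@lorainelab-quickload.scidas.org:$REMOTE_ROOT/$srp/."
    done
}
```

Printing the commands first makes it possible to check each destination path before any transfer starts.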

Ann Loraine added a comment - 23/Feb/23 1:01 PM

All files are made and transferred to RENCI for hosting. Moving to DONE.

People

Assignee:

Molly Davis

Reporter:

Ann Loraine

Votes:

0

Watchers:

3

Dates

Created:

17/Nov/22 12:49 PM

Updated:

23/Feb/23 1:01 PM

Resolved:

23/Feb/23 1:01 PM