[IGBF-3228] Download and process data for SRP100604 and SRP268884 - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
2
Epic Link:
Support NSF pollen grant
Sprint:
Fall 7 2022 Nov 21, Fall 8 2022 Dec 5, Spring 1 2023 Dec 26, Spring 2 2023 Jan 16, Spring 3 2023 Feb 1, Spring 4 2023 Feb 21

Attachments


Final_SRP268884_multiqc_report.html
7.69 MB
20/Jan/23 10:33 AM
Screen Shot 2022-11-30 at 6.32.58 PM.png
47 kB
01/Dec/22 10:54 AM
SRP100604_multiqc_report.html
1.85 MB
19/Jan/23 1:22 PM

Issue Links

blocks

IGBF-3239 Create scaled coverage graphs for SRP100604 and SRP268884

Closed

relates to

IGBF-3238 Create sample description sheets for SRP100604 and SRP268884

Closed

IGBF-3143 Run RNA-Seq data processing pipeline on positive splicing control and experimental samples

Closed

Activity

Molly Davis added a comment - 22/Nov/22 2:34 PM - edited

SRP100604 and SRP268884 have been uploaded and fastq files created.

SRA links:
https://www.ncbi.nlm.nih.gov/sra?term=SRP100604
https://www.ncbi.nlm.nih.gov/sra?term=SRP268884

Code Example

prefetch.slurm

#! /bin/bash

#SBATCH --job-name=prefetch_SRR
#SBATCH --partition=Orion
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4gb
#SBATCH --output=%x_%j.out
#SBATCH --time=24:00:00

cd /nobackup/tomato_genome/alt_splicing/SRP100604
module load sra-tools/2.11.0
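# vdb-config --interactive opens a text UI and needs a terminal, which a batch
# job does not provide. Run it once on a login node before submitting this script.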

files=(
        SRR5279858
        SRR5279875
        SRR5279883
        SRR5280323
        SRR5280370
        SRR5280382
        SRR5280383
        SRR5280392
        SRR5282476
        SRR5282478
        SRR5282480
        SRR5282481
)

# Download each run into its own subdirectory of the current directory
for f in "${files[@]}"; do echo "$f"; prefetch "$f"; done
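A possible refinement, sketched under assumptions: the accession list is hardcoded in prefetch.slurm, while fasterdump.slurm reads the same IDs from Sra_ids.txt with a fixed --array=1-12. The helpers below (write_id_list and array_spec are hypothetical names, not part of the original scripts) keep one list and size the array job from it. An --array value given on the sbatch command line takes precedence over the #SBATCH directive in the script.

```shell
#!/bin/bash
# Hypothetical helpers: keep a single accession list for both scripts.

# Write one accession per line to the given file.
write_id_list() {
    local out="$1"
    shift
    printf '%s\n' "$@" > "$out"
}

# Print a SLURM array range (1-N), where N is the number of non-empty lines.
array_spec() {
    local n
    n=$(grep -c . "$1")
    echo "1-${n}"
}

# Example (not run here):
#   write_id_list Sra_ids.txt SRR5279858 SRR5279875 SRR5279883
#   sbatch --array="$(array_spec Sra_ids.txt)" fasterdump.slurm
```

With this, adding or removing a run only requires editing the list in one place.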

fasterdump.slurm

#! /bin/bash

#SBATCH --job-name=fastqdump_SRR
#SBATCH --partition=Orion
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=40gb
#SBATCH --output=%x_%j.out
#SBATCH --time=24:00:00
#SBATCH --array=1-12

# Look up this array task's accession: array task N reads line N of Sra_ids.txt
file=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p" /nobackup/tomato_genome/alt_splicing/SRP100604/Sra_ids.txt)


cd /nobackup/tomato_genome/alt_splicing/SRP100604
module load sra-tools/2.11.0

echo "Starting fasterq-dump on $file"

# prefetch placed each .sra file in a subdirectory named after its accession
cd "/nobackup/tomato_genome/alt_splicing/SRP100604/$file"

fasterq-dump "${file}.sra"

# Check the two mate files, then copy them up to the project directory
perl /projects/tomato_genome/scripts/validateHiseqPairs.pl "${file}_1.fastq" "${file}_2.fastq"

cp "${file}_1.fastq" "/nobackup/tomato_genome/alt_splicing/SRP100604/${file}_1.fastq"
cp "${file}_2.fastq" "/nobackup/tomato_genome/alt_splicing/SRP100604/${file}_2.fastq"

echo "finished"
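The script above assumes every run yields _1.fastq and _2.fastq. With its default settings, fasterq-dump writes a single <accession>.fastq for a run that has only one read per spot, so the pair validation and copy steps fail for those runs. A minimal sketch of a check that could guard those steps (fastq_layout is a hypothetical helper name, not part of the original script):

```shell
#!/bin/bash
# Classify the fasterq-dump output for one accession in a directory.
# Prints "paired", "single", or "missing".
fastq_layout() {
    local dir="$1" acc="$2"
    if [ -f "$dir/${acc}_1.fastq" ] && [ -f "$dir/${acc}_2.fastq" ]; then
        echo paired
    elif [ -f "$dir/${acc}.fastq" ]; then
        echo single
    else
        echo missing
    fi
}

# Example (not run here): only validate and copy mate files for paired runs.
#   case "$(fastq_layout . "$file")" in
#       paired) cp "${file}_1.fastq" "${file}_2.fastq" ../ ;;
#       single) cp "${file}.fastq" ../ ;;
#       *)      echo "no fastq output for $file" >&2 ;;
#   esac
```

The paired test comes first because a paired run can also leave an extra <accession>.fastq holding unmated reads.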

Comments on results
Directory: /nobackup/tomato_genome/alt_splicing/SRP100604

SRP100604: Some SRR runs were single stranded rather than double stranded, so fasterq-dump could not make _1.fastq and _2.fastq files for them.

Affected SRR runs:
SRR5282476
SRR5282478
SRR5282480
SRR5282481

Directory: /nobackup/tomato_genome/alt_splicing/SRP268884

SRP268884: All runs produced double stranded _1.fastq and _2.fastq files.

Next Step: Run the Nextflow nf-core/rnaseq pipeline on SRP268884.

Question: Should we still use all of SRP100604 even though it contains single stranded SRR files, or use only its double stranded files?

[~aloraine]

aloraine's answer to the above query: Go ahead and use all the available data in SRP100604. I believe that nextflow is able to handle this complication intelligently. I think you can omit the "second" file name in the "samples" file for single end runs. (Please note that "single strand" is not the same thing as "single end" - make sure that we are talking about the same thing before proceeding.)
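The suggestion to omit the second file name for single end runs can be sketched as a small row generator. This assumes the nf-core/rnaseq sample sheet column order sample,fastq_1,fastq_2,strandedness and gzipped fastq files named <accession>_1.fastq.gz / <accession>_2.fastq.gz or <accession>.fastq.gz; samplesheet_row is a hypothetical helper, and the strandedness value is passed in rather than detected.

```shell
#!/bin/bash
# Print one sample sheet row for an accession, leaving fastq_2 empty
# when only a single-end file exists. Returns 1 if no files are found.
samplesheet_row() {
    local acc="$1" dir="$2" strand="$3"
    if [ -f "$dir/${acc}_1.fastq.gz" ] && [ -f "$dir/${acc}_2.fastq.gz" ]; then
        printf '%s,%s,%s,%s\n' "$acc" "$dir/${acc}_1.fastq.gz" "$dir/${acc}_2.fastq.gz" "$strand"
    elif [ -f "$dir/${acc}.fastq.gz" ]; then
        printf '%s,%s,,%s\n' "$acc" "$dir/${acc}.fastq.gz" "$strand"
    else
        echo "no fastq.gz files found for $acc in $dir" >&2
        return 1
    fi
}

# Example (not run here):
#   echo "sample,fastq_1,fastq_2,strandedness" > samples.csv
#   while read -r acc; do samplesheet_row "$acc" "$PWD" unstranded; done < Sra_ids.txt >> samples.csv
```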

Molly Davis added a comment - 01/Dec/22 10:54 AM

Nextflow Update:

Pipeline ran successfully with SRP268884.

Notes: I could not find out whether the data were stranded or unstranded, so I entered "stranded" in the csv sample sheet. To check, you can run an alignment first and view the files in IGB. Stranded = no mismatch of stranded.

Nextflow Notes: Nextflow took a couple of attempts to finish. It would stop partway with an error that varied from run to run; rerunning it without changing anything eventually worked.
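Since rerunning without changes was enough, the reruns could be automated. A minimal sketch, assuming the failures are transient: retry is a hypothetical helper, and the pipeline arguments in the example are placeholders rather than the ones used for this ticket. Nextflow's -resume option reuses cached results from completed tasks, so a rerun does not start from scratch.

```shell
#!/bin/bash
# Run a command up to <max> times, stopping at the first success.
# Returns 0 on success, 1 if every attempt failed.
retry() {
    local max="$1" attempt=1
    shift
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Example (not run here; arguments are placeholders):
#   retry 3 nextflow run nf-core/rnaseq --input samples.csv --outdir results -resume
```

If the same error recurs on every attempt, it is probably not transient and the log is worth reading before retrying again.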

[~aloraine]

Molly Davis added a comment - 16/Dec/22 2:43 PM - edited

SRP100604 Update:

Directory: /nobackup/tomato_genome/alt_splicing/SRP100604

Nextflow ran successfully with SRP100604. I made a csv sample sheet with paired and single end fastq files.

Notes: No issues with Nextflow this time. I am having trouble getting rsync to work with my local machine, so I cannot yet view the output files to determine whether the projects are stranded or unstranded. If I can't get rsync to work, I will probably have to contact the researchers for that information.

Resources/Instruction:
Stranded or unstranded: https://nf-co.re/rnaseq/1.2/output#rseqc
How to make a .csv sample sheet for single and paired end data: https://nf-co.re/rnaseq/usage

Next step: I need to rename the bam files so that only the SRR names remain.
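One way the renaming could be scripted, as a sketch only: the ticket does not record the original bam file names, so the pattern here (an SRR accession followed by extra suffixes, e.g. a hypothetical SRR5279858.markdup.sorted.bam) is an assumption, as are the helper names srr_from_name and rename_bams.

```shell
#!/bin/bash
# Print the SRR accession at the start of a file name; fail if there is none.
srr_from_name() {
    local base acc
    base="${1##*/}"
    acc=$(printf '%s\n' "$base" | sed -n 's/^\(SRR[0-9][0-9]*\).*/\1/p')
    [ -n "$acc" ] && printf '%s\n' "$acc"
}

# Rename every bam file in a directory to <SRR>.bam.
# Files without an SRR prefix, or already named <SRR>.bam, are left alone.
rename_bams() {
    local dir="$1" f acc
    for f in "$dir"/*.bam; do
        [ -e "$f" ] || continue
        acc=$(srr_from_name "$f") || continue
        # -n refuses to overwrite an existing <SRR>.bam
        [ "$f" = "$dir/$acc.bam" ] || mv -n -- "$f" "$dir/$acc.bam"
    done
}

# Example (not run here):
#   rename_bams /path/to/bam_directory
```

Index files (.bai) are not handled here and would need the same treatment or regenerating.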

[~aloraine]

Molly Davis added a comment - 12/Jan/23 4:04 PM - edited

Update: I have renamed the bam files and reviewed the results from the nextflow pipeline. I added the sample csv files used for nextflow to bitbucket.

Next Steps: The files need to be reviewed by someone on the team, please. The next task is to make description sheets for both samples (IGBF-3238).

[~aloraine]

Ann Loraine added a comment - 13/Jan/23 10:18 AM - edited

In reviewing the PR for the sample sheets, I found what looks like a possible problem. Here is the comment I made on the PR regarding this possible problem:

I have a concern about this line: SRR5282476,SRR5282476.fastq.gz,SRR5282476.fastq.gz,unstranded

Here, the sample run (SRR5282476) was sequenced on only one end, making it a “single end” not “paired end” dataset. For runs like this, I believe what you need to do with nextflow is simply omit field 3, like so:

SRR5282476,SRR5282476.fastq.gz,,unstranded

I’m concerned about what happens to the “run” if you do not do this. For example, the software may run incorrectly.

Just to be on the safe side, I think it would be a good idea to re-run the nextflow pipeline for this entire set of samples.

Also, another way to assess the situation is to look at the output of the multiqc processing. This might be helpful.
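The problem described above (the same file listed as both fastq_1 and fastq_2) can also be caught mechanically before a run. A minimal sketch, assuming a comma-separated sheet with the file names in fields 2 and 3; check_samplesheet is a hypothetical helper name:

```shell
#!/bin/bash
# Print sample sheet rows whose fastq_1 and fastq_2 fields are identical.
# Exit status is 0 when no such rows exist and 1 otherwise.
check_samplesheet() {
    awk -F, '$2 != "" && $2 == $3 { print; bad = 1 } END { exit bad }' "$1"
}

# Example (not run here):
#   check_samplesheet samples.csv || echo "fix the rows listed above" >&2
```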

Molly Davis added a comment - 13/Jan/23 11:37 AM

I reviewed the files I submitted to bitbucket. The SRP100604 sample sheet there was out of date and needed to be replaced with the version I had manually fixed on the cluster. The results were generated with the corrected file, so I believe there is no need to re-run nextflow. I have now updated bitbucket with the corrected file and resubmitted the pull request.

https://bitbucket.org/hotpollen/splicing-analysis/pull-requests/8

Update: The pull request was merged with the correct csv files.

Ann Loraine added a comment - 13/Jan/23 11:39 AM

Updated PR is now merged to branch "main."

Molly Davis added a comment - 18/Jan/23 10:46 AM

Comment: I need to check the MultiQC report to determine strandedness. I am having trouble with scp and rsync, so I haven't been able to download the report from the cluster and check it myself.

Ann Loraine added a comment - 19/Jan/23 8:52 AM - edited

I agree with the previous comment.

Next steps:

Download multiqc reports
Rename each to use the SRP number, e.g., SRP268884_multiqc_report.html
Attach to this ticket
Review the content

To download, use scp.

Example invocation - on your local:

scp aloraine@hpc.uncc.edu:/nobackup/tomato_genome/alt_splicing/SRP268884/results/multiqc/star_salmon/multiqc_report.html SRP268884_multiqc_report.html

When you do this, you need to authorize via the usual two-factor authentication.
The above example command renames the downloaded file.
To download and keep the same file name, replace the second argument with "."

Molly Davis added a comment - 19/Jan/23 1:22 PM - edited

Notes: Downloaded the multiqc reports to my local machine with:

scp mdavi258@hpc.uncc.edu:/nobackup/tomato_genome/alt_splicing/SRP268884/results/multiqc/star_salmon/multiqc_report.html SRP268884_multiqc_report.html

SRP268884_multiqc_report.html
SRP100604_multiqc_report.html

Update: I need to change the sample strandedness in the SRP268884 csv to 'reverse' and rerun the pipeline. After a successful run, I need to review the multiqc report and add it to this ticket, update the csv file on bitbucket, remove sorted names, and redo the scaled coverage graphs for SRP268884.

Side Note: I would like to improve my skills in reading and understanding multiqc reports.
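For the strandedness part of the report, a rough rule of thumb can be written down as code. RSeQC's infer_experiment.py reports two fractions: reads explained by "1++,1--,2+-,2-+" (reads agree with the gene strand, which corresponds to 'forward' in the sample sheet) and reads explained by "1+-,1-+,2++,2--" (reads oppose the gene strand, 'reverse'). Single end data use the labels "++,--" and "+-,-+" instead. In the sketch below, classify_strandedness is a hypothetical helper and the 0.8 cutoff is an illustrative assumption, not a value taken from the pipeline.

```shell
#!/bin/bash
# Classify strandedness from the two infer_experiment.py fractions.
#   $1: fraction explained by "1++,1--,2+-,2-+" (sense)
#   $2: fraction explained by "1+-,1-+,2++,2--" (antisense)
#   $3: cutoff, default 0.8 (illustrative assumption)
# Prints "forward", "reverse", or "unstranded".
classify_strandedness() {
    awk -v fwd="$1" -v rev="$2" -v cut="${3:-0.8}" 'BEGIN {
        if (fwd >= cut)      print "forward"
        else if (rev >= cut) print "reverse"
        else                 print "unstranded"
    }'
}

# Example (not run here): classify_strandedness 0.03 0.94
```

When the two fractions are both near 0.5, the library is most likely unstranded; values in between deserve a closer look in IGB.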

Molly Davis added a comment - 20/Jan/23 10:31 AM - edited

Updated Multiqc report:
Final_SRP268884_multiqc_report.html

Updated Bitbucket CSV file:
https://bitbucket.org/hotpollen/splicing-analysis/pull-requests/9

Updated Scaled Coverage Graphs:
/nobackup/tomato_genome/alt_splicing/for_igbquickload/coverage_graphs_2/coverage_graphs_SRP268884

Ann Loraine added a comment - 29/Jan/23 4:01 PM

Processing parameters additionally include "tomato.config" file:

params {
    modules {
        'star_align' {
            args            = '--alignIntronMax 13000 --quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand zcat --runRNGseed 0 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan Singleend'
        }
        'hisat2_align' {
            args            = "--max-intronlen 13000 --met-stderr --new-summary --dta"
        }
    }
}

Ann Loraine added a comment - 23/Feb/23 9:02 AM

Adding multiqc files to repository hotpollen/splicing-analysis.git.

Ann Loraine added a comment - 23/Feb/23 9:11 AM - edited

Copied files to renci host:

SRP100604 coverage graphs
SRP100604 alignment files
SRP100604 junction files
SRP268884 coverage graphs
SRP268884 alignment files
SRP268884 junction files

Example invocation:

scp -J aloraine@hop.renci.org -r SRP268884_bam aloraine@lorainelab-quickload.scidas.org:/projects/igbquickload/lorainelab/www/main/htdocs/hotpollen/S_lycopersicum_Jun_2022/SRP268884/.

Ann Loraine added a comment - 23/Feb/23 1:01 PM

All files are made and transferred to RENCI for hosting. Moving to DONE.

People

Assignee:

Molly Davis

Reporter:

Ann Loraine

Votes:

0

Watchers:

3

Dates

Created:

17/Nov/22 12:49 PM

Updated:

23/Feb/23 1:01 PM

Resolved:

23/Feb/23 1:01 PM