Details
- Type: Task
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: None
- Labels: None
- Story Points: 1
- Sprint: Summer 6
Description
Following the protocol notes described in IGBF-3790, process data set SRP454305.
This data set consists of 12 samples of the tardigrade H. exemplaris. Animals were exposed to radiation or a non-radiation control treatment.
Article link:
Attachments
Issue Links
- relates to: IGBF-3790 Run nf-core/rnaseq v 3.14 on SRP484252 (2024 Goldstein Lab) - Closed
Activity
FASTERQ-DUMP step
Create fastq files with:
cut -d , -f 1 *_SraRunTable.txt | grep -v Run | xargs -I A sbatch --export=S=A --output=A.out --error=A.err fasterq-dump.sh
in:
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305
using fasterq-dump.sh:
[aloraine@str-i1 SRP454305]$ ls -lh fasterq-dump.sh
lrwxrwxrwx 1 aloraine tomato_genome 50 Aug 6 13:51 fasterq-dump.sh -> /users/aloraine/src/tardigrade/src/fasterq-dump.sh
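Note: fasterq-dump.sh itself is not reproduced in this ticket. A minimal sketch of what such a SLURM wrapper might contain (module name and resource settings are guesses; the run accession is assumed to arrive via the exported variable S):
#!/bin/bash
#SBATCH --job-name=fasterq-dump
#SBATCH --cpus-per-task=4
# Hypothetical sketch only; the real script lives in /users/aloraine/src/tardigrade/src/.
# The SRA run accession is expected in the exported variable S.
module load sra-tools   # module name is an assumption
fasterq-dump --split-files --threads 4 "$S"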
GZIP step
Compressed the newly created fastq files with:
ls *.fastq | xargs -I A sbatch --export=F=A --job-name=A --output=A.out --error=A.err gzip.sh
in
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305
using gzip.sh:
[aloraine@str-i1 SRP454305]$ ls -lh gzip.sh
lrwxrwxrwx 1 aloraine tomato_genome 42 Aug 6 16:17 gzip.sh -> /users/aloraine/src/tardigrade/src/gzip.sh
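Note: gzip.sh is likewise not reproduced here. A plausible minimal version (assuming the fastq file name arrives via the exported variable F) would be:
#!/bin/bash
#SBATCH --job-name=gzip
# Hypothetical sketch only; the real script lives in /users/aloraine/src/tardigrade/src/.
# The fastq file name is expected in the exported variable F.
gzip "$F"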
NF-CORE/RNASEQ SETUP step
1) Make samples.csv:
echo sample,fastq_1,fastq_2,strandedness > samples.csv
cut -f 1 -d , *_SraRunTable.txt | grep -v Run | xargs -I A echo A,A_1.fastq.gz,A_2.fastq.gz,auto >> samples.csv
Using *_SraRunTable.txt:
[aloraine@str-i1 SRP454305]$ ls -lh *.txt
lrwxrwxrwx 1 aloraine tomato_genome 88 Aug 6 13:43 SRP454305_SraRunTable.txt -> /users/aloraine/src/tardigrade/Documentation/RunSelectorOutput/SRP454305_SraRunTable.txt
Note: the "echo" command assumes each sample has _1 and _2 files and all are paired-end. This same command won't work if there are samples that are not paired end. In those cases, the "fastq_2" field should be empty.
samples.csv:
[aloraine@str-i1 SRP454305]$ more samples.csv
sample,fastq_1,fastq_2,strandedness
SRR25590736,SRR25590736_1.fastq.gz,SRR25590736_2.fastq.gz,auto
SRR25590737,SRR25590737_1.fastq.gz,SRR25590737_2.fastq.gz,auto
SRR25590738,SRR25590738_1.fastq.gz,SRR25590738_2.fastq.gz,auto
SRR25590739,SRR25590739_1.fastq.gz,SRR25590739_2.fastq.gz,auto
SRR25590740,SRR25590740_1.fastq.gz,SRR25590740_2.fastq.gz,auto
SRR25590741,SRR25590741_1.fastq.gz,SRR25590741_2.fastq.gz,auto
SRR25590742,SRR25590742_1.fastq.gz,SRR25590742_2.fastq.gz,auto
SRR25590743,SRR25590743_1.fastq.gz,SRR25590743_2.fastq.gz,auto
SRR25590744,SRR25590744_1.fastq.gz,SRR25590744_2.fastq.gz,auto
SRR25590745,SRR25590745_1.fastq.gz,SRR25590745_2.fastq.gz,auto
SRR25590746,SRR25590746_1.fastq.gz,SRR25590746_2.fastq.gz,auto
SRR25590747,SRR25590747_1.fastq.gz,SRR25590747_2.fastq.gz,auto
2) Confirm nextflow environment variables are set up properly:
[aloraine@str-i1 SRP454305]$ printenv | grep NXF
NXF_OFFLINE=FALSE
NXF_SINGULARITY_CACHEDIR=/projects/tomato_genome/scripts/nxf_singularity_cachedir2
NXF_OPTS=-Xms1g -Xmx4g
NXF_EXECUTOR=slurm
3) Download genome assembly-specific files required for the pipeline to run:
wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.fa
wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.bed.gz
wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.gtf
4) Uncompress bed.gz file to stdout, remove columns 13 and 14, and save uncompressed trimmed version to a file:
gunzip -c H_exemplaris_Z151_Apr_2017.bed.gz | cut -f1-12 > H_exemplaris_Z151_Apr_2017.bed
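A quick optional sanity check that the trimmed BED file now has 12 tab-separated columns:
head -1 H_exemplaris_Z151_Apr_2017.bed | awk -F'\t' '{print NF}'   # expect 12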
5) Make a link to parameters file needed to run nf-core/rnaseq pipeline:
ln -s ~/src/tardigrade/src/H_exemplaris_Z151_Apr_2017-params.yaml .
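The params file itself is not reproduced in this ticket. As a rough sketch, a minimal nf-core/rnaseq params file for this assembly might look like the following (parameter values are assumptions based on the files downloaded above, not the actual file contents):
cat > H_exemplaris_Z151_Apr_2017-params.yaml <<'EOF'
input: samples.csv
outdir: results
fasta: H_exemplaris_Z151_Apr_2017.fa
gtf: H_exemplaris_Z151_Apr_2017.gtf
gene_bed: H_exemplaris_Z151_Apr_2017.bed
EOF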
NF-CORE/RNASEQ RUN step
1) Launch tmux session with:
tmux new -s base
2) Launch long-lived interactive cluster job with:
[aloraine@str-i1 SRP454305]$ srun --partition Orion --cpus-per-task 5 --mem-per-cpu 12000 --time 60:00:00 --pty bash
[aloraine@str-bm1 SRP454305]$
Note how the change in prompt lets me know the interactive job has started.
3) Load nextflow module with:
module load nf-core
with output:
Loading nf-core/2.12.1
  Loading requirement: anaconda3/2020.11
(nf-core-2.12.1) [aloraine@str-bm1 SRP454305]$
4) Launch the pipeline with:
nextflow run nf-core/rnaseq -resume -profile singularity -r 3.14.0 -params-file H_exemplaris_Z151_Apr_2017-params.yaml 1>out.1 2>err.1
5) Check it worked:
Re-attach to the tmux session used to launch nf-core/rnaseq pipeline:
[aloraine@str-i1 ~]$ tmux attach -t base
Observed that the nextflow command completed.
Checked output with:
(nf-core-2.12.1) [aloraine@str-bm1 SRP454305]$ tail out.1
...
-[nf-core/rnaseq] Pipeline completed successfully -
Completed at: 06-Aug-2024 17:17:40
Duration    : 31m 21s
CPU hours   : 62.3
Succeeded   : 230
DOCUMENT RESULTS step
Adding the multiqc report to the git repository with:
[aloraine@str-i1 SRP454305]$ cp results/multiqc/star_salmon/multiqc_report.html ~/src/tardigrade/Documentation/multiqcReports/SRP454305_H_exemplaris_Z151_Apr_2017-multiqc_report.html
[aloraine@str-i1 SRP454305]$ cd ~/src/tardigrade/
[aloraine@str-i1 tardigrade]$ git add Documentation/multiqcReports/SRP454305_H_exemplaris_Z151_Apr_2017-multiqc_report.html
[aloraine@str-i1 tardigrade]$ git commit -m "IGBF-3849 Align SRP454305 RNA-Seq data to April 2017 H exemplaris reference genome assembly"
[main 68c8518] IGBF-3849 Align SRP454305 RNA-Seq data to April 2017 H exemplaris reference genome assembly
1 file changed, 9140 insertions(+)
create mode 100644 Documentation/multiqcReports/SRP454305_H_exemplaris_Z151_Apr_2017-multiqc_report.html
[aloraine@str-i1 tardigrade]$ git push origin main
Note how the file I added to the repository has a prefix: the SRA data set accession "SRP454305" prepended to the IGB genome name "H_exemplaris_Z151_Apr_2017". This is because we will likely run the same data set against more than one genome assembly.
Also, I was a little lazy here and did not bother to make a branch - I pushed directly to the main branch.
Checked output by opening the multiqc report in my local Web browser. I observed no warning messages about the data. Also, I noticed that the principal components plot had four distinct clusters, three points per cluster. The SRR accessions in each cluster were numbered consecutively, corresponding to three biological replicates, I presume. Very nice data, looks like! Good job, Goldstein lab!
RENAME BAMs step
Nextflow inserts the suffix "sorted" in BAM file names. This is redundant, so we always remove that suffix.
Renaming BAMs with:
[aloraine@str-i1 star_salmon]$ ./renameBams.sh
in:
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305/results/star_salmon
using:
lrwxrwxrwx 1 aloraine tomato_genome 48 Aug 7 10:55 renameBams.sh -> /users/aloraine/src/tardigrade/src/renameBams.sh
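renameBams.sh is not reproduced here; a minimal sketch of what such a rename script could look like (assuming the nf-core naming pattern *.markdup.sorted.bam / *.markdup.sorted.bam.bai; the real script may differ):
#!/bin/bash
# Hypothetical sketch only; drops the redundant ".markdup.sorted" portion, e.g.
# SRR25590736.markdup.sorted.bam -> SRR25590736.bam
for f in *.sorted.bam *.sorted.bam.bai; do
    [ -e "$f" ] || continue
    mv -v "$f" "$(echo "$f" | sed 's/\.markdup//; s/\.sorted//')"
done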
Check it worked:
[aloraine@str-i1 star_salmon]$ ls -lh *bam*
-rw-r----- 1 aloraine tomato_genome 1.2G Aug 6 17:06 SRR25590736.bam
-rw-r----- 1 aloraine tomato_genome 294K Aug 6 17:06 SRR25590736.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.5G Aug 6 17:09 SRR25590737.bam
-rw-r----- 1 aloraine tomato_genome 317K Aug 6 17:09 SRR25590737.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.4G Aug 6 17:06 SRR25590738.bam
-rw-r----- 1 aloraine tomato_genome 315K Aug 6 17:07 SRR25590738.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.4G Aug 6 17:08 SRR25590739.bam
-rw-r----- 1 aloraine tomato_genome 316K Aug 6 17:08 SRR25590739.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.4G Aug 6 17:06 SRR25590740.bam
-rw-r----- 1 aloraine tomato_genome 316K Aug 6 17:06 SRR25590740.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.4G Aug 6 17:07 SRR25590741.bam
-rw-r----- 1 aloraine tomato_genome 316K Aug 6 17:08 SRR25590741.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.3G Aug 6 17:07 SRR25590742.bam
-rw-r----- 1 aloraine tomato_genome 294K Aug 6 17:08 SRR25590742.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.4G Aug 6 17:08 SRR25590743.bam
-rw-r----- 1 aloraine tomato_genome 308K Aug 6 17:08 SRR25590743.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.5G Aug 6 17:10 SRR25590744.bam
-rw-r----- 1 aloraine tomato_genome 314K Aug 6 17:10 SRR25590744.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.3G Aug 6 17:05 SRR25590745.bam
-rw-r----- 1 aloraine tomato_genome 310K Aug 6 17:05 SRR25590745.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.3G Aug 6 17:04 SRR25590746.bam
-rw-r----- 1 aloraine tomato_genome 309K Aug 6 17:04 SRR25590746.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.3G Aug 6 17:05 SRR25590747.bam
-rw-r----- 1 aloraine tomato_genome 307K Aug 6 17:06 SRR25590747.bam.bai
COVERAGE GRAPHS step
I am doing this step in a subdirectory of /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305/results/star_salmon. This is so that I can easily collect results for distribution in an IGB Quickload and also to avoid name collisions with a run of "find junctions," which I will probably do at the same time.
Making coverage graphs:
1) Made a sub-directory and filled it with symbolic links to the renamed BAM files from the parent directory:
[aloraine@str-i1 star_salmon]$ mkdir coverage_graphs
[aloraine@str-i1 star_salmon]$ cd coverage_graphs/
[aloraine@str-i1 coverage_graphs]$ ln -s ../*bam* .
[aloraine@str-i1 coverage_graphs]$ ls
SRR25590736.bam      SRR25590740.bam      SRR25590744.bam
SRR25590736.bam.bai  SRR25590740.bam.bai  SRR25590744.bam.bai
SRR25590737.bam      SRR25590741.bam      SRR25590745.bam
SRR25590737.bam.bai  SRR25590741.bam.bai  SRR25590745.bam.bai
SRR25590738.bam      SRR25590742.bam      SRR25590746.bam
SRR25590738.bam.bai  SRR25590742.bam.bai  SRR25590746.bam.bai
SRR25590739.bam      SRR25590743.bam      SRR25590747.bam
SRR25590739.bam.bai  SRR25590743.bam.bai  SRR25590747.bam.bai
2) Make symbolic links to two scripts I need to run the coverage graph code:
[aloraine@str-i1 coverage_graphs]$ ln -s ~/src/tardigrade/src/sbatch-doIt.sh .
[aloraine@str-i1 coverage_graphs]$ ln -s ~/src/tardigrade/src/bamCoverage.sh .
3) Launch the coverage graph generation code with:
[aloraine@str-i1 coverage_graphs]$ sbatch-doIt.sh .bam bamCoverage.sh >jobs.out 2>jobs.err
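sbatch-doIt.sh and bamCoverage.sh are kept in the tardigrade repository and are not shown here. A minimal sketch of how a dispatcher like sbatch-doIt.sh might work (one SLURM job per file matching the given suffix, file name passed via the exported variable F; the real script may differ):
#!/bin/bash
# Hypothetical sketch only. Usage: sbatch-doIt.sh <suffix> <worker-script>
SUFFIX=$1
SCRIPT=$2
for f in *"$SUFFIX"; do
    [ -e "$f" ] || continue
    sbatch --export=F="$f" --job-name="$f" --output="$f".out --error="$f".err "$SCRIPT"
done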
FINDJUNCTIONS step
Similar to the coverage graphs step, made a new subdirectory in /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305/results/star_salmon.
1) Made the find junctions "working" directory and added symbolic links to BAM and BAM index files in the parent directory with:
[aloraine@str-i1 star_salmon]$ mkdir find_junctions
[aloraine@str-i1 star_salmon]$ cd find_junctions/
[aloraine@str-i1 find_junctions]$ ln -s ../*bam* .
2) Download the required input 2bit file into the directory with:
wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.2bit
3) Make symbolic links to scripts and jar file with code:
ln -s ~/src/tardigrade/src/sbatch-doIt.sh .
ln -s ~/src/tardigrade/src/find_junctions.sh .
ln -s ~/src/tardigrade/src/find-junctions-1.0.0-jar-with-dependencies.jar .
4) Launch jobs with:
sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err
DATA TRANSFER SETUP step
1) Create directory for transfer in /projects/tomato_genome/fnb/dataprocessing/tardigrade
[aloraine@str-i1 tardigrade]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade
[aloraine@str-i1 tardigrade]$ mkdir for_quickload
We will use this directory to store everything we will transfer to Quickload for this "tardigrade" project.
2) Make directory for tardigrade genome assembly
[aloraine@str-i1 for_quickload]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload
[aloraine@str-i1 for_quickload]$ mkdir H_exemplaris_Z151_Apr_2017
Note: the above two steps only need to be done once!
3) Make subdirectory for this data set, in the genome assembly directory used for alignments:
[aloraine@str-i1 for_quickload]$ cd H_exemplaris_Z151_Apr_2017/
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ mkdir SRP454305
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ cd SRP454305/
[aloraine@str-i1 SRP454305]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305
4) Move BAM, scaled coverage graph, and junction files into this location:
Coverage graphs, from inside the directory containing them:
[aloraine@str-i1 SRP454305]$ mv ../../../SRP454305/results/star_salmon/coverage_graphs/*bedgraph* .
[aloraine@str-i1 SRP454305]$ ls
SRR25590736.scaled.bedgraph.gz      SRR25590739.scaled.bedgraph.gz      SRR25590742.scaled.bedgraph.gz      SRR25590745.scaled.bedgraph.gz
SRR25590736.scaled.bedgraph.gz.tbi  SRR25590739.scaled.bedgraph.gz.tbi  SRR25590742.scaled.bedgraph.gz.tbi  SRR25590745.scaled.bedgraph.gz.tbi
SRR25590737.scaled.bedgraph.gz      SRR25590740.scaled.bedgraph.gz      SRR25590743.scaled.bedgraph.gz      SRR25590746.scaled.bedgraph.gz
SRR25590737.scaled.bedgraph.gz.tbi  SRR25590740.scaled.bedgraph.gz.tbi  SRR25590743.scaled.bedgraph.gz.tbi  SRR25590746.scaled.bedgraph.gz.tbi
SRR25590738.scaled.bedgraph.gz      SRR25590741.scaled.bedgraph.gz      SRR25590744.scaled.bedgraph.gz      SRR25590747.scaled.bedgraph.gz
SRR25590738.scaled.bedgraph.gz.tbi  SRR25590741.scaled.bedgraph.gz.tbi  SRR25590744.scaled.bedgraph.gz.tbi  SRR25590747.scaled.bedgraph.gz.tbi
Bam files, from inside the directory containing them:
mv *.bam* ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305/.
Junction files, from inside the directory containing them:
mv *.FJ.* ../../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305/.
5) Make all files world-readable and make all directories world-readable and world-executable:
files:
[aloraine@str-i1 SRP454305]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305
[aloraine@str-i1 SRP454305]$ chmod a+r *
directory:
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ chmod a+rx SRP454305
RSYNC step
1) Logged into data.bioviz.org (a virtual machine hosted on UNC Charlotte infrastructure) and moved to data deployment location in the file system there:
local aloraine$ ssh aloraine@data.bioviz.org
cd /mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017
Some things to note:
- I have deployed my public key into the authorized_keys file in my "aloraine" account on data.bioviz.org. This way, I don't have to enter my password.
- If I did need to enter my password, I would enter my Charlotte.edu password.
- Anyone else wanting to do this will need to get an account on data.bioviz.org.
- Note that we are inside a directory named for the reference genome assembly we used.
2) Make a new directory for this new data set to be deployed:
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ pwd
/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ ls
SRP450893  SRP484252
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ mkdir SRP454305
3) Make sure it is group write-able and that its permissions match the other directories in the same location:
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ ls -lh
total 12K
drwxrwsr-x 3 aloraine cci-igbquickload_users 4.0K Jul 2 09:52 SRP450893
drwxr-xr-x 2 aloraine domain users 4.0K Aug 7 20:08 SRP454305
drwxrwxr-x 2 aloraine domain users 4.0K Jul 3 13:42 SRP484252
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ chmod g+w SRP454305
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ ls -lh
total 12K
drwxrwsr-x 3 aloraine cci-igbquickload_users 4.0K Jul 2 09:52 SRP450893
drwxrwxr-x 2 aloraine domain users 4.0K Aug 7 20:08 SRP454305
drwxrwxr-x 2 aloraine domain users 4.0K Jul 3 13:42 SRP484252
4) Start the data transfer using tmux and then rsync:
tmux:
tmux new -s transfer
rsync:
rsync -rtpvz aloraine@hpc.charlotte.edu:/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305/* SRP454305/.
Note: You can repeat the above rsync command any time you add new content to the source directory on hpc.charlotte.edu. Only the new files will get copied.
Note: I could probably just "rsync" the entire genome directory. I think that this would automatically copy any new "SRP" directories and their contents over to data.bioviz.org.
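For example, the whole-genome-directory variant of the transfer (a sketch, run from /mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017 on data.bioviz.org) would be:
rsync -rtpvz aloraine@hpc.charlotte.edu:/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017/ .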
ANNOTS.XML step
1) Opened the run file for this data set in Excel and saved it, in Excel format, to tardigrade/Documentation/inputForMakeAnnotsXml (the tardigrade repository).
Note: Open SRP484252_for_AnnotsXml as a reference and guide!
2) Added six new columns to the front of the file, in front of "Run":
- file name prefix
- color
- physical folder
- study name
- display name
- url
3) Used Excel references to copy the values from the "Run" column into the "file name prefix" column.
4) Inserted hexadecimal color codes for each sample. Gave those cells the same fill color as the chosen colors to help me assess their potential appearance and contrast in IGB.
5) Inserted the study code (e.g., SRP454305) in the "physical folder" column.
6) Used Excel reference to insert a human-friendly "study name" - this becomes the name of the folder where the data files will be listed in IGB.
7) Used Excel references to insert human-friendly "display name" values - these become the checkbox labels in IGB.
8) Used Excel references to make URLs for each file / data set. Used the "SRX" values in the existing "Experiment" column to construct the URL.
9) Added new columns as needed after the first six to use for sorting. For example, I added "Concentration" and then sorted the spreadsheet by concentration and then by run so that the lower-concentration, control samples would appear first in the IGB data display list.
10) Edited the script makeAnnots.py to include the new spreadsheet in the function getSampleSheets. Ran the script, which adds the new data files to annots.xml in tardigrade/ForGenomeBrowsers/quickload.
11) Checked how it looks by adding the above directory to IGB as a new quickload data source.
CLEANUP step:
- Removed the "work" directory within SRP454305 because it is ENORMOUS and we no longer need it.
- Moved the entire SRP454305 directory into tardigrade/DONE
PREFETCH step
Pre-fetching SRA files with:
in:
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305
Confirmed it worked with:
using prefetch.sh:
using SRP454305_SraRunTable.txt: