Details
- Type: Task
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: None
- Labels: None
- Story Points: 1
- Sprint: Summer 6
Description
Following the protocol notes described in IGBF-3790, process data set SRP454305.
This data set consists of 12 samples of the tardigrade H. exemplaris. Animals were exposed to radiation or a non-radiation control treatment.
Article link:
Attachments
Issue Links
- relates to: IGBF-3790 Run nf-core/rnaseq v 3.14 on SRP484252 (2024 Goldstein Lab) - Closed
Activity
FASTERQ-DUMP step
Create fastq files with:
cut -d , -f 1 *_SraRunTable.txt | grep -v Run | xargs -I A sbatch --export=S=A --output=A.out --error=A.err fasterq-dump.sh
in:
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305
using fasterq-dump.sh:
[aloraine@str-i1 SRP454305]$ ls -lh fasterq-dump.sh
lrwxrwxrwx 1 aloraine tomato_genome 50 Aug 6 13:51 fasterq-dump.sh -> /users/aloraine/src/tardigrade/src/fasterq-dump.sh
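Note: fasterq-dump.sh itself is not reproduced in this ticket. A minimal sketch of what such a SLURM wrapper might contain (module name and resource settings are guesses; the run accession is assumed to arrive via the exported variable S):
#!/bin/bash
#SBATCH --job-name=fasterq-dump
#SBATCH --cpus-per-task=4
# Hypothetical sketch only; the real script lives in /users/aloraine/src/tardigrade/src/.
# The SRA run accession is expected in the exported variable S.
module load sra-tools   # module name is an assumption
fasterq-dump --split-files --threads 4 "$S"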
GZIP step
Compressed the newly created fastq files with:
ls *.fastq | xargs -I A sbatch --export=F=A --job-name=A --output=A.out --error=A.err gzip.sh
in
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305
using gzip.sh:
[aloraine@str-i1 SRP454305]$ ls -lh gzip.sh
lrwxrwxrwx 1 aloraine tomato_genome 42 Aug 6 16:17 gzip.sh -> /users/aloraine/src/tardigrade/src/gzip.sh
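Note: gzip.sh is likewise not reproduced here. A plausible minimal version (assuming the fastq file name arrives via the exported variable F) would be:
#!/bin/bash
#SBATCH --job-name=gzip
# Hypothetical sketch only; the real script lives in /users/aloraine/src/tardigrade/src/.
# The fastq file name is expected in the exported variable F.
gzip "$F"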
NF-CORE/RNASEQ SETUP step
1) Make samples.csv:
echo sample,fastq_1,fastq_2,strandedness > samples.csv
cut -f 1 -d , *_SraRunTable.txt | grep -v Run | xargs -I A echo A,A_1.fastq.gz,A_2.fastq.gz,auto >> samples.csv
Using *_SraRunTable.txt:
[aloraine@str-i1 SRP454305]$ ls -lh *.txt
lrwxrwxrwx 1 aloraine tomato_genome 88 Aug 6 13:43 SRP454305_SraRunTable.txt -> /users/aloraine/src/tardigrade/Documentation/RunSelectorOutput/SRP454305_SraRunTable.txt
Note: the "echo" command assumes each sample has _1 and _2 files and all are paired-end. This same command won't work if there are samples that are not paired end. In those cases, the "fastq_2" field should be empty.
samples.csv:
[aloraine@str-i1 SRP454305]$ more samples.csv
sample,fastq_1,fastq_2,strandedness
SRR25590736,SRR25590736_1.fastq.gz,SRR25590736_2.fastq.gz,auto
SRR25590737,SRR25590737_1.fastq.gz,SRR25590737_2.fastq.gz,auto
SRR25590738,SRR25590738_1.fastq.gz,SRR25590738_2.fastq.gz,auto
SRR25590739,SRR25590739_1.fastq.gz,SRR25590739_2.fastq.gz,auto
SRR25590740,SRR25590740_1.fastq.gz,SRR25590740_2.fastq.gz,auto
SRR25590741,SRR25590741_1.fastq.gz,SRR25590741_2.fastq.gz,auto
SRR25590742,SRR25590742_1.fastq.gz,SRR25590742_2.fastq.gz,auto
SRR25590743,SRR25590743_1.fastq.gz,SRR25590743_2.fastq.gz,auto
SRR25590744,SRR25590744_1.fastq.gz,SRR25590744_2.fastq.gz,auto
SRR25590745,SRR25590745_1.fastq.gz,SRR25590745_2.fastq.gz,auto
SRR25590746,SRR25590746_1.fastq.gz,SRR25590746_2.fastq.gz,auto
SRR25590747,SRR25590747_1.fastq.gz,SRR25590747_2.fastq.gz,auto
2) Confirm nextflow environment variables are set up properly:
[aloraine@str-i1 SRP454305]$ printenv | grep NXF
NXF_OFFLINE=FALSE
NXF_SINGULARITY_CACHEDIR=/projects/tomato_genome/scripts/nxf_singularity_cachedir2
NXF_OPTS=-Xms1g -Xmx4g
NXF_EXECUTOR=slurm
3) Download genome assembly-specific files required for the pipeline to run:
wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.fa
wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.bed.gz
wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.gtf
4) Uncompress bed.gz file to stdout, remove columns 13 and 14, and save uncompressed trimmed version to a file:
gunzip -c H_exemplaris_Z151_Apr_2017.bed.gz | cut -f1-12 > H_exemplaris_Z151_Apr_2017.bed
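A quick optional sanity check that the trimmed BED file now has 12 tab-separated columns:
head -1 H_exemplaris_Z151_Apr_2017.bed | awk -F'\t' '{print NF}'   # expect 12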
5) Make a link to parameters file needed to run nf-core/rnaseq pipeline:
ln -s ~/src/tardigrade/src/H_exemplaris_Z151_Apr_2017-params.yaml .
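The params file itself is not reproduced in this ticket. As a rough sketch, a minimal nf-core/rnaseq params file for this assembly might look like the following (parameter values are assumptions based on the files downloaded above, not the actual file contents):
cat > H_exemplaris_Z151_Apr_2017-params.yaml <<'EOF'
input: samples.csv
outdir: results
fasta: H_exemplaris_Z151_Apr_2017.fa
gtf: H_exemplaris_Z151_Apr_2017.gtf
gene_bed: H_exemplaris_Z151_Apr_2017.bed
EOF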
NF-CORE/RNASEQ RUN step
1) Launch tmux session with:
tmux new -s base
2) Launch long-lived interactive cluster job with:
[aloraine@str-i1 SRP454305]$ srun --partition Orion --cpus-per-task 5 --mem-per-cpu 12000 --time 60:00:00 --pty bash
[aloraine@str-bm1 SRP454305]$
Note how the change in prompt lets me know the interactive job has started.
3) Load nextflow module with:
module load nf-core
with output:
Loading nf-core/2.12.1
  Loading requirement: anaconda3/2020.11
(nf-core-2.12.1) [aloraine@str-bm1 SRP454305]$
4) Launch the pipeline with:
nextflow run nf-core/rnaseq -resume -profile singularity -r 3.14.0 -params-file H_exemplaris_Z151_Apr_2017-params.yaml 1>out.1 2>err.1
5) Check it worked:
Re-attach to the tmux session used to launch nf-core/rnaseq pipeline:
[aloraine@str-i1 ~]$ tmux attach -t base
Observed that the nextflow command completed.
Checked output with:
(nf-core-2.12.1) [aloraine@str-bm1 SRP454305]$ tail out.1
...
-[nf-core/rnaseq] Pipeline completed successfully -
Completed at: 06-Aug-2024 17:17:40
Duration    : 31m 21s
CPU hours   : 62.3
Succeeded   : 230
DOCUMENT RESULTS step
Adding the multiqc report to the git repository with:
[aloraine@str-i1 SRP454305]$ cp results/multiqc/star_salmon/multiqc_report.html ~/src/tardigrade/Documentation/multiqcReports/SRP454305_H_exemplaris_Z151_Apr_2017-multiqc_report.html
[aloraine@str-i1 SRP454305]$ cd ~/src/tardigrade/
[aloraine@str-i1 tardigrade]$ git add Documentation/multiqcReports/SRP454305_H_exemplaris_Z151_Apr_2017-multiqc_report.html
[aloraine@str-i1 tardigrade]$ git commit -m "IGBF-3849 Align SRP454305 RNA-Seq data to April 2017 H exemplaris reference genome assembly"
[main 68c8518] IGBF-3849 Align SRP454305 RNA-Seq data to April 2017 H exemplaris reference genome assembly
1 file changed, 9140 insertions(+)
create mode 100644 Documentation/multiqcReports/SRP454305_H_exemplaris_Z151_Apr_2017-multiqc_report.html
[aloraine@str-i1 tardigrade]$ git push origin main
Note how the file I added to the repository has a prefix: the SRA data set accession "SRP454305" prepended to the IGB genome name "H_exemplaris_Z151_Apr_2017". This is because we will likely run the same data set against more than one genome assembly.
Also, I was a little lazy here and did not bother to make a branch - I pushed directly to the main branch.
Checked output by opening the multiqc report in my local Web browser. I observed no warning messages about the data. Also, I noticed that the principal components plot had four distinct clusters, three points per cluster. The SRR accessions in each cluster were numbered consecutively, corresponding to three biological replicates, I presume. Very nice data, looks like! Good job, Goldstein lab!
RENAME BAMs step
Nextflow inserts the suffix "sorted" in BAM file names. This is redundant, so we always remove that suffix.
Renaming BAMs with:
[aloraine@str-i1 star_salmon]$ ./renameBams.sh
in:
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305/results/star_salmon
using:
lrwxrwxrwx 1 aloraine tomato_genome 48 Aug 7 10:55 renameBams.sh -> /users/aloraine/src/tardigrade/src/renameBams.sh
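renameBams.sh is not reproduced here; a minimal sketch of what such a rename script could look like (assuming the nf-core naming pattern *.markdup.sorted.bam / *.markdup.sorted.bam.bai; the real script may differ):
#!/bin/bash
# Hypothetical sketch only; drops the redundant ".markdup.sorted" portion, e.g.
# SRR25590736.markdup.sorted.bam -> SRR25590736.bam
for f in *.sorted.bam *.sorted.bam.bai; do
    [ -e "$f" ] || continue
    mv -v "$f" "$(echo "$f" | sed 's/\.markdup//; s/\.sorted//')"
done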
Check it worked:
[aloraine@str-i1 star_salmon]$ ls -lh *bam*
-rw-r----- 1 aloraine tomato_genome 1.2G Aug 6 17:06 SRR25590736.bam
-rw-r----- 1 aloraine tomato_genome 294K Aug 6 17:06 SRR25590736.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.5G Aug 6 17:09 SRR25590737.bam
-rw-r----- 1 aloraine tomato_genome 317K Aug 6 17:09 SRR25590737.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.4G Aug 6 17:06 SRR25590738.bam
-rw-r----- 1 aloraine tomato_genome 315K Aug 6 17:07 SRR25590738.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.4G Aug 6 17:08 SRR25590739.bam
-rw-r----- 1 aloraine tomato_genome 316K Aug 6 17:08 SRR25590739.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.4G Aug 6 17:06 SRR25590740.bam
-rw-r----- 1 aloraine tomato_genome 316K Aug 6 17:06 SRR25590740.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.4G Aug 6 17:07 SRR25590741.bam
-rw-r----- 1 aloraine tomato_genome 316K Aug 6 17:08 SRR25590741.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.3G Aug 6 17:07 SRR25590742.bam
-rw-r----- 1 aloraine tomato_genome 294K Aug 6 17:08 SRR25590742.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.4G Aug 6 17:08 SRR25590743.bam
-rw-r----- 1 aloraine tomato_genome 308K Aug 6 17:08 SRR25590743.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.5G Aug 6 17:10 SRR25590744.bam
-rw-r----- 1 aloraine tomato_genome 314K Aug 6 17:10 SRR25590744.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.3G Aug 6 17:05 SRR25590745.bam
-rw-r----- 1 aloraine tomato_genome 310K Aug 6 17:05 SRR25590745.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.3G Aug 6 17:04 SRR25590746.bam
-rw-r----- 1 aloraine tomato_genome 309K Aug 6 17:04 SRR25590746.bam.bai
-rw-r----- 1 aloraine tomato_genome 1.3G Aug 6 17:05 SRR25590747.bam
-rw-r----- 1 aloraine tomato_genome 307K Aug 6 17:06 SRR25590747.bam.bai
COVERAGE GRAPHS step
I am doing this step in a subdirectory of /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305/results/star_salmon. This is so that I can easily collect results for distribution in an IGB Quickload and also to avoid name collisions with a run of "find junctions," which I will probably do at the same time.
Making coverage graphs:
1) Made a sub-directory and filled it with symbolic links to the renamed BAM files from the parent directory:
[aloraine@str-i1 star_salmon]$ mkdir coverage_graphs
[aloraine@str-i1 star_salmon]$ cd coverage_graphs/
[aloraine@str-i1 coverage_graphs]$ ln -s ../*bam* .
[aloraine@str-i1 coverage_graphs]$ ls
SRR25590736.bam      SRR25590740.bam      SRR25590744.bam
SRR25590736.bam.bai  SRR25590740.bam.bai  SRR25590744.bam.bai
SRR25590737.bam      SRR25590741.bam      SRR25590745.bam
SRR25590737.bam.bai  SRR25590741.bam.bai  SRR25590745.bam.bai
SRR25590738.bam      SRR25590742.bam      SRR25590746.bam
SRR25590738.bam.bai  SRR25590742.bam.bai  SRR25590746.bam.bai
SRR25590739.bam      SRR25590743.bam      SRR25590747.bam
SRR25590739.bam.bai  SRR25590743.bam.bai  SRR25590747.bam.bai
2) Make symbolic links to two scripts I need to run the coverage graph code:
[aloraine@str-i1 coverage_graphs]$ ln -s ~/src/tardigrade/src/sbatch-doIt.sh .
[aloraine@str-i1 coverage_graphs]$ ln -s ~/src/tardigrade/src/bamCoverage.sh .
3) Launch the coverage graph generation code with:
[aloraine@str-i1 coverage_graphs]$ sbatch-doIt.sh .bam bamCoverage.sh >jobs.out 2>jobs.err
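sbatch-doIt.sh and bamCoverage.sh are kept in the tardigrade repository and are not shown here. A minimal sketch of how a dispatcher like sbatch-doIt.sh might work (one SLURM job per file matching the given suffix, file name passed via the exported variable F; the real script may differ):
#!/bin/bash
# Hypothetical sketch only. Usage: sbatch-doIt.sh <suffix> <worker-script>
SUFFIX=$1
SCRIPT=$2
for f in *"$SUFFIX"; do
    [ -e "$f" ] || continue
    sbatch --export=F="$f" --job-name="$f" --output="$f".out --error="$f".err "$SCRIPT"
done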
FINDJUNCTIONS step
Similar to the coverage graphs step, made a new subdirectory in /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305/results/star_salmon.
1) Made the find junctions "working" directory and added symbolic links to BAM and BAM index files in the parent directory with:
[aloraine@str-i1 star_salmon]$ mkdir find_junctions
[aloraine@str-i1 star_salmon]$ cd find_junctions/
[aloraine@str-i1 find_junctions]$ ln -s ../*bam* .
2) Download the required input 2bit file into the directory with:
wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.2bit
3) Make symbolic links to scripts and jar file with code:
ln -s ~/src/tardigrade/src/sbatch-doIt.sh .
ln -s ~/src/tardigrade/src/find_junctions.sh .
ln -s ~/src/tardigrade/src/find-junctions-1.0.0-jar-with-dependencies.jar .
4) Launch jobs with:
sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err
DATA TRANSFER SETUP step
1) Create directory for transfer in /projects/tomato_genome/fnb/dataprocessing/tardigrade
[aloraine@str-i1 tardigrade]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade
[aloraine@str-i1 tardigrade]$ mkdir for_quickload
We will use this directory to store everything we will transfer to Quickload for this "tardigrade" project.
2) Make directory for tardigrade genome assembly
[aloraine@str-i1 for_quickload]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload
[aloraine@str-i1 for_quickload]$ mkdir H_exemplaris_Z151_Apr_2017
Note: the above two steps only need to be done once!
3) Make subdirectory for this data set, in the genome assembly directory used for alignments:
[aloraine@str-i1 for_quickload]$ cd H_exemplaris_Z151_Apr_2017/
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ mkdir SRP454305
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ cd SRP454305/
[aloraine@str-i1 SRP454305]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305
4) Move BAM, scaled coverage graph, and junction files into this location:
Coverage graphs, from inside the directory containing them:
[aloraine@str-i1 SRP454305]$ mv ../../../SRP454305/results/star_salmon/coverage_graphs/*bedgraph* .
[aloraine@str-i1 SRP454305]$ ls
SRR25590736.scaled.bedgraph.gz      SRR25590739.scaled.bedgraph.gz      SRR25590742.scaled.bedgraph.gz      SRR25590745.scaled.bedgraph.gz
SRR25590736.scaled.bedgraph.gz.tbi  SRR25590739.scaled.bedgraph.gz.tbi  SRR25590742.scaled.bedgraph.gz.tbi  SRR25590745.scaled.bedgraph.gz.tbi
SRR25590737.scaled.bedgraph.gz      SRR25590740.scaled.bedgraph.gz      SRR25590743.scaled.bedgraph.gz      SRR25590746.scaled.bedgraph.gz
SRR25590737.scaled.bedgraph.gz.tbi  SRR25590740.scaled.bedgraph.gz.tbi  SRR25590743.scaled.bedgraph.gz.tbi  SRR25590746.scaled.bedgraph.gz.tbi
SRR25590738.scaled.bedgraph.gz      SRR25590741.scaled.bedgraph.gz      SRR25590744.scaled.bedgraph.gz      SRR25590747.scaled.bedgraph.gz
SRR25590738.scaled.bedgraph.gz.tbi  SRR25590741.scaled.bedgraph.gz.tbi  SRR25590744.scaled.bedgraph.gz.tbi  SRR25590747.scaled.bedgraph.gz.tbi
Bam files, from inside the directory containing them:
mv *.bam* ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305/.
Junction files, from inside the directory containing them:
mv *.FJ.* ../../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305/.
5) Make all files world-readable and make all directories world-readable and world-executable:
files:
[aloraine@str-i1 SRP454305]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305
[aloraine@str-i1 SRP454305]$ chmod a+r *
directory:
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ chmod a+rx SRP454305
RSYNC step
1) Logged into data.bioviz.org (a virtual machine hosted on UNC Charlotte infrastructure) and moved to data deployment location in the file system there:
local aloraine$ ssh aloraine@data.bioviz.org
cd /mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017
Some things to note:
- I have deployed my public key into the authorized_keys file in my "aloraine" account on data.bioviz.org. This way, I don't have to enter my password.
- If I did need to enter my password, I would enter my Charlotte.edu password.
- Anyone else wanting to do this will need to get an account on data.bioviz.org.
- Note that we are inside a directory named for the reference genome assembly we used.
2) Make a new directory for this new data set to be deployed:
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ pwd
/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ ls
SRP450893  SRP484252
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ mkdir SRP454305
3) Make sure it is group write-able and that its permissions match the other directories in the same location:
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ ls -lh
total 12K
drwxrwsr-x 3 aloraine cci-igbquickload_users 4.0K Jul 2 09:52 SRP450893
drwxr-xr-x 2 aloraine domain users 4.0K Aug 7 20:08 SRP454305
drwxrwxr-x 2 aloraine domain users 4.0K Jul 3 13:42 SRP484252
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ chmod g+w SRP454305
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ ls -lh
total 12K
drwxrwsr-x 3 aloraine cci-igbquickload_users 4.0K Jul 2 09:52 SRP450893
drwxrwxr-x 2 aloraine domain users 4.0K Aug 7 20:08 SRP454305
drwxrwxr-x 2 aloraine domain users 4.0K Jul 3 13:42 SRP484252
4) Start the data transfer using tmux and then rsync:
tmux:
tmux new -s transfer
rsync:
rsync -rtpvz aloraine@hpc.charlotte.edu:/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305/* SRP454305/.
Note: You can repeat the above rsync command any time you add new content to the source directory on hpc.charlotte.edu. Only the new files will get copied.
Note: I could probably just "rsync" the entire genome directory. I think that this would automatically copy any new "SRP" directories and their contents over to data.bioviz.org.
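For example, the whole-genome-directory variant of the transfer (a sketch, run from /mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017 on data.bioviz.org) would be:
rsync -rtpvz aloraine@hpc.charlotte.edu:/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017/ .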
ANNOTS.XML step
1) Opened the run file for this data set in Excel and saved it, in Excel format, to tardigrade/Documentation/inputForMakeAnnotsXml (the tardigrade repository).
Note: Open SRP484252_for_AnnotsXml as a reference and guide!
2) Added six new columns to the front of the file, in front of "Run":
- file name prefix
- color
- physical folder
- study name
- display name
- url
3) Used Excel references to copy the values from the "Run" column into the "file name prefix" column.
4) Inserted hexadecimal color codes for each sample. Gave those cells the same fill color as the chosen colors to help me assess their potential appearance and contrast in IGB.
5) Inserted the study code (e.g., SRP454305) in the "physical folder" column.
6) Used Excel reference to insert a human-friendly "study name" - this becomes the name of the folder where the data files will be listed in IGB.
7) Used Excel references to insert human-friendly "display name" values - these become the checkbox labels in IGB.
8) Used Excel references to make URLs for each file / data set. Used the "SRX" values in the existing "Experiment" column to construct the URL.
9) Added new columns as needed after the first six to use for sorting. For example, I added "Concentration" and then sorted the spreadsheet by concentration and then by run so that the lower-concentration, control samples would appear first in the IGB data display list.
10) Edited the script makeAnnots.py to include the new spreadsheet in the function getSampleSheets. Ran the script, which adds the new data files to annots.xml in tardigrade/ForGenomeBrowsers/quickload.
11) Checked how it looks by adding the above directory to IGB as a new quickload data source.
CLEANUP step:
- Removed the "work" directory within SRP454305 because it is ENORMOUS and we no longer need it.
- Moved the entire SRP454305 directory into tardigrade/DONE
PREFETCH step
Pre-fetching SRA files with:
in:
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305
Confirmed it worked with:
using prefetch.sh:
using SRP454305_SraRunTable.txt: