[IGBF-3790] Run nf-core/rnaseq v 3.14 on SRP484252 (2024 Goldstein Lab) - JIRA UNCC

Ann Loraine created issue - 26/Jun/24 3:25 AM

Ann Loraine made changes - 26/Jun/24 3:25 AM

Field	Original Value	New Value
Epic Link		IGBF-3778 [ 22997 ]

Ann Loraine made changes - 26/Jun/24 3:25 AM

Field	Original Value	New Value
Status	To-Do [ 10305 ]	In Progress [ 3 ]

Ann Loraine made changes - 26/Jun/24 5:54 PM

Field	Original Value	New Value
Status	In Progress [ 3 ]	To-Do [ 10305 ]

Ann Loraine made changes - 30/Jun/24 8:52 AM

Field	Original Value	New Value
Sprint	Summer 3 [ 197 ]	Summer 3, Summer 4 [ 197, 198 ]

Ann Loraine made changes - 30/Jun/24 8:52 AM

Field	Original Value	New Value
Rank		Ranked higher

Ann Loraine made changes - 01/Jul/24 4:50 PM

Field	Original Value	New Value
Status	To-Do [ 10305 ]	In Progress [ 3 ]

Ann Loraine added a comment - 01/Jul/24 5:08 PM - edited

Running prefetch jobs with:

cut -d , -f 1 SRP484252_SraRunTable.txt | grep -v Run | xargs -I A sbatch --export=S=A --job-name=A --output=A.out --error=A.err prefetch.sh

in:

/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/fastq

Confirmed it worked with:

[aloraine@str-i2 fastq]$ cat *.out | grep -c  "was downloaded successfully"
12
[aloraine@str-i2 fastq]$ cut -d , -f 1 SRP484252_SraRunTable.txt | grep -v Run | wc -l
12

All 12 runs were prefetched correctly.
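
The counts above match, but they do not show which run a given log belongs to. A per-run version of the same check could look like the sketch below. The function name missing_runs is hypothetical, and it assumes each job's log is named <run id>.out, as in the sbatch command above.

```shell
# Hypothetical helper: print every run id from the run table whose
# <run id>.out log lacks the "was downloaded successfully" message.
missing_runs() {
    run_table=$1
    log_dir=$2
    cut -d , -f 1 "$run_table" | grep -v Run | while read -r run; do
        if ! grep -q "was downloaded successfully" "$log_dir/$run.out" 2>/dev/null; then
            echo "$run"
        fi
    done
}

# Example: missing_runs SRP484252_SraRunTable.txt .
```

If the function prints nothing, every run id in the table has a log that reports a successful download.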

Ann Loraine added a comment - 02/Jul/24 2:46 PM - edited

An issue:

It looks like the .sra files got put into subdirectories for some reason:

[aloraine@str-i2 fastq]$ find . | grep .sra
./SRR27595102/SRR27595102/SRR27595102.sra
./SRR27595103/SRR27595103/SRR27595103.sra
./SRR27595099/SRR27595099/SRR27595099.sra
./SRR27595110/SRR27595110/SRR27595110.sra
./SRR27595101/SRR27595101/SRR27595101.sra
./SRR27595100/SRR27595100/SRR27595100.sra
./SRR27595108/SRR27595108/SRR27595108.sra
./SRR27595104/SRR27595104/SRR27595104.sra
./SRR27595105/SRR27595105/SRR27595105.sra
./SRR27595107/SRR27595107/SRR27595107.sra
./SRR27595109/SRR27595109/SRR27595109.sra

Maybe I did not actually need to specify the subdirectories for the accessions to be saved?

Ann Loraine added a comment - 02/Jul/24 3:19 PM - edited

Before I proceed to the next steps, I am going to re-do this using a change in the script code. It will take more time, but I don't want to leave this issue unresolved.

The problem: My first version of slurm script prefetch.sh specified the output directory using:

prefetch $S -O $SLURM_SUBMIT_DIR/$S

This was wrong. There was no need to specify the name of the output directory this way. The prefetch program "knows" to create a directory named for the run id. Correct invocation is:

prefetch $S -O $SLURM_SUBMIT_DIR
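
To illustrate why the first version nested the files one level too deep, here is a small sketch. It only builds path strings and does not call prefetch. It assumes, as described above, that prefetch writes <output dir>/<run id>/<run id>.sra. The function name sra_path and the directory /path/to/fastq are placeholders.

```shell
# Illustration only: print where the .sra file lands for a given -O value,
# assuming prefetch creates <output dir>/<run id>/<run id>.sra.
sra_path() {
    out_dir=$1
    run=$2
    echo "$out_dir/$run/$run.sra"
}

submit_dir=/path/to/fastq   # placeholder standing in for $SLURM_SUBMIT_DIR
sra_path "$submit_dir/SRR27595104" SRR27595104   # first version: -O $SLURM_SUBMIT_DIR/$S
sra_path "$submit_dir" SRR27595104               # corrected:     -O $SLURM_SUBMIT_DIR
```

The first call prints a path with the run id repeated three times, matching the find output in the earlier comment. The second prints the expected <run id>/<run id>.sra layout.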

New run completed without blocking errors. However, there was a warning. Not sure what it means.

Example:

Loading sra-tools/2.11.0
  Loading requirement: hdf5/1.10.7
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Database 'SRR27595104.sra' metadata: md5 ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Table 'SEQUENCE' metadata: md5 ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'ALTREAD': checksums ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'ORIGINAL_QUALITY': checksums ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'READ': checksums ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'SPOT_GROUP': checksums ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'X': checksums ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'Y': checksums ok
2024-07-02T19:24:46 vdb-validate.2.11.0 warn: type unrecognized while validating database - Database '/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/fastq/SRR27595104/SRR27595104.sra' has unrecognized type 'NCBI:SRA:Illumina:db'
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Database 'SRR27595104.sra' is consistent

I don't know what this means. Moving ahead to the next step anyway.

Ann Loraine added a comment - 02/Jul/24 3:31 PM - edited

Wrote a new script "fasterq-dump.sh" to convert the downloaded .sra files to fastq files.
Ran the script using a command that pipes the first column of the run table file to xargs. The xargs -I option substitutes each run id for the placeholder A, and sbatch then submits fasterq-dump.sh with the value of A exported as variable S, as with the preceding commands. See the comment in the script fasterq-dump.sh for an example invocation.

Result: The .sra files were converted to fastq files, and each .sra file produced _1 and _2 (read 1 and read 2) files, as expected since each of the .sra files was from a paired-end run of an Illumina sequencer.
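
The exact invocation is recorded in the script comment rather than here. By analogy with the prefetch command, it might look like the sketch below. The function name submit_cmds is hypothetical, and it prints the sbatch commands instead of running them, so it can be tried without a Slurm cluster.

```shell
# Hypothetical dry-run helper: print one sbatch command per run id,
# mirroring the prefetch invocation shown in the earlier comment.
# Note: xargs -I A replaces every "A" in the arguments, so the script
# name passed in must not contain a capital A.
submit_cmds() {
    run_table=$1
    script=$2
    cut -d , -f 1 "$run_table" | grep -v Run |
        xargs -I A echo sbatch --export=S=A --job-name=A --output=A.out --error=A.err "$script"
}

# Example: submit_cmds SRP484252_SraRunTable.txt fasterq-dump.sh
```

Piping the printed lines to sh would submit the jobs. Reviewing them first is a cheap way to catch a wrong column or a stray header line.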

Next, ran gzip.sh to compress (gzip) each fastq file, using xargs to loop over each fastq file name, like this:

ls *.fastq | xargs -I A sbatch --export=F=A --job-name=A --output=A.out --error=A.err gzip.sh

After compressing the fastq files, deleted the downloaded .sra files. The downloaded fastq files are in /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/fastq.

Ann Loraine added a comment - 02/Jul/24 10:30 PM - edited

Setting up everything needed to launch nf-core/rnaseq pipeline:

(1) Make input sample names file for SRP484252. It should look like this:

header: sample,fastq_1,fastq_2,strandedness
[sample name],[read 1 fastq file name],[read2 fastq file name], auto
... one line per SRR run id

Made the above required file with:

echo sample,fastq_1,fastq_2,strandedness > samples.csv
cut -f 1 -d , SRP484252_SraRunTable.txt | grep -v Run | xargs -I A echo  A,A_1.fastq.gz,A_2.fastq.gz,auto >> samples.csv

Confirmed contents of samples.csv:

[aloraine@str-i1 SRP484252]$ cat samples.csv 
sample,fastq_1,fastq_2,strandedness
SRR27595099,SRR27595099_1.fastq.gz,SRR27595099_2.fastq.gz,auto
SRR27595100,SRR27595100_1.fastq.gz,SRR27595100_2.fastq.gz,auto
SRR27595101,SRR27595101_1.fastq.gz,SRR27595101_2.fastq.gz,auto
SRR27595102,SRR27595102_1.fastq.gz,SRR27595102_2.fastq.gz,auto
SRR27595103,SRR27595103_1.fastq.gz,SRR27595103_2.fastq.gz,auto
SRR27595104,SRR27595104_1.fastq.gz,SRR27595104_2.fastq.gz,auto
SRR27595105,SRR27595105_1.fastq.gz,SRR27595105_2.fastq.gz,auto
SRR27595106,SRR27595106_1.fastq.gz,SRR27595106_2.fastq.gz,auto
SRR27595107,SRR27595107_1.fastq.gz,SRR27595107_2.fastq.gz,auto
SRR27595108,SRR27595108_1.fastq.gz,SRR27595108_2.fastq.gz,auto
SRR27595109,SRR27595109_1.fastq.gz,SRR27595109_2.fastq.gz,auto
SRR27595110,SRR27595110_1.fastq.gz,SRR27595110_2.fastq.gz,auto
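
Beyond reading the file by eye, the sample sheet can be checked against the fastq files on disk. The sketch below is one way to do that. The function name check_samples is hypothetical, and the directory holding the fastq files is passed in as an argument because the sheet lists bare file names.

```shell
# Hypothetical check: report rows that do not have exactly four
# comma-separated fields, and fastq files that are missing or empty.
check_samples() {
    sheet=$1
    fastq_dir=$2
    # Skip the header line, then split each row on commas.
    tail -n +2 "$sheet" | while IFS=, read -r sample fq1 fq2 strand extra; do
        if [ -z "$strand" ] || [ -n "$extra" ]; then
            echo "bad row: $sample"
        fi
        for fq in "$fq1" "$fq2"; do
            if [ -z "$fq" ] || [ ! -s "$fastq_dir/$fq" ]; then
                echo "missing: $fq"
            fi
        done
    done
}

# Example: check_samples samples.csv fastq
```

No output means every row has four fields and both read files exist with non-zero size.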

(2) Set the following environment variables in my account by adding these lines to my .bash_profile file:

export NXF_OFFLINE=FALSE
export NXF_SINGULARITY_CACHEDIR=/projects/tomato_genome/scripts/nxf_singularity_cachedir2
export NXF_OPTS='-Xms1g -Xmx4g'
export NXF_EXECUTOR=slurm

NXF_SINGULARITY_CACHEDIR is a location where my account has write permission. A location in my home directory would probably be fine.
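
Since the cache directory must be writable, a quick check before launching the pipeline may save a failed run. This is a minimal sketch; the function name cache_dir_ok is hypothetical.

```shell
# Hypothetical check: confirm a directory exists and is writable by the
# current account before using it as NXF_SINGULARITY_CACHEDIR.
cache_dir_ok() {
    dir=$1
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "ok: $dir"
    else
        echo "not writable: $dir"
        return 1
    fi
}

# Example: cache_dir_ok "$NXF_SINGULARITY_CACHEDIR"
```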

(3) Downloaded genome assembly-specific files required for the pipeline to run:

wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.fa \
  http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.bed.gz \
  http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.gff

Uncompressed bed.gz and removed the final two columns:

gunzip -c H_exemplaris_Z151_Apr_2017.bed.gz | cut -f1-12 > H_exemplaris_Z151_Apr_2017.bed
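
The column trimming can be wrapped in a small function and tested on a toy file before running it on the real annotations. This is a sketch; the function name bed12 is hypothetical, and it assumes the input is tab-separated with the two extra columns at the end.

```shell
# Sketch: decompress a gzipped BED file and keep only the first
# 12 tab-separated columns (the standard BED12 fields).
bed12() {
    gunzip -c "$1" | cut -f 1-12
}

# Example: bed12 H_exemplaris_Z151_Apr_2017.bed.gz > H_exemplaris_Z151_Apr_2017.bed
```

Counting fields afterward with awk -F '\t' '{print NF}' on the output, piped to sort -u, should report 12 for every line.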

(4) Started a tmux session on the head node by entering:

tmux new -s base

This ensures that if I lose my connection, the session will continue to run.

If I get disconnected, I can log back into the same login (head) node, and enter:

tmux attach-session -t base

(5) Inside the tmux session, launched an interactive "job" on the cluster with:

[aloraine@str-i2 SRP484252]$ srun --partition Orion --cpus-per-task 5 --mem-per-cpu 12000 --time 60:00:00 --pty bash
[aloraine@str-c141 SRP484252]$

(6) Loaded nextflow module with:

[aloraine@str-c141 SRP484252]$ module load nf-core
Loading nf-core/2.12.1
  Loading requirement: anaconda3/2020.11
(nf-core-2.12.1) [aloraine@str-c141 SRP484252]$

(7) Ran nextflow with:

(nf-core-2.12.1) [aloraine@str-c141 SRP484252]$ nextflow run nf-core/rnaseq -resume -profile singularity -r 3.14.0 -params-file H_exemplaris_Z151_Apr_2017-params.yaml 1>out.1 2>err.1

This command runs the nf-core/rnaseq pipeline within the interactive job inside the tmux session, saving file streams "standard out" to file out.1 and "standard error" to err.1. If there are errors, I will see them written to these files. Also, nextflow creates logging files with file name prefix ".nextflow.log". If something goes wrong, I can look at those files for help.

Ann Loraine added a comment - 03/Jul/24 8:55 AM - edited

Update:

Pipeline nf-core/rnaseq revision 3.14 has finished.
Added multiqc report to repository "tardigrade" in Documentation/multiqcReports as file name SRP484252-multiqc_report.html.
Reviewed SRP484252-multiqc_report.html and found no problems.

Ann Loraine made changes - 03/Jul/24 8:55 AM

Field	Original Value	New Value
Summary	Run pipeline on SRP484252 (2024 Goldstein Lab)	Run nf-core/rnaseq v 3.14 on SRP484252 (2024 Goldstein Lab)

Ann Loraine added a comment - 03/Jul/24 9:32 AM

Proceeding to post nf-core/rnaseq data processing steps:

Changed to results directory /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon
Made a symbolic link to the file renameBams.sh, located under the "src" directory in my home folder, with:

[aloraine@str-i2 star_salmon]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon
[aloraine@str-i2 star_salmon]$ ln -s ~/src/tardigrade/src/renameBams.sh .

Ran with:

[aloraine@str-i2 star_salmon]$ renameBams.sh 
[aloraine@str-i2 star_salmon]$

This is not a slurm script. All it does is change file names.

For example, here are the new BAM file names:

[aloraine@str-i2 star_salmon]$ ls -lh *.bam
-rw-r----- 1 aloraine tomato_genome 1.5G Jul  3 00:42 SRR27595099.bam
-rw-r----- 1 aloraine tomato_genome 1.6G Jul  3 00:42 SRR27595100.bam
-rw-r----- 1 aloraine tomato_genome 1.5G Jul  3 00:42 SRR27595101.bam
-rw-r----- 1 aloraine tomato_genome 1.5G Jul  3 00:42 SRR27595102.bam
-rw-r----- 1 aloraine tomato_genome 1.6G Jul  3 00:43 SRR27595103.bam
-rw-r----- 1 aloraine tomato_genome 1.7G Jul  3 00:43 SRR27595104.bam
-rw-r----- 1 aloraine tomato_genome 1.5G Jul  3 00:42 SRR27595105.bam
-rw-r----- 1 aloraine tomato_genome 1.5G Jul  3 00:41 SRR27595106.bam
-rw-r----- 1 aloraine tomato_genome 1.6G Jul  3 00:42 SRR27595107.bam
-rw-r----- 1 aloraine tomato_genome 1.6G Jul  3 00:42 SRR27595108.bam
-rw-r----- 1 aloraine tomato_genome 1.6G Jul  3 00:42 SRR27595109.bam
-rw-r----- 1 aloraine tomato_genome 1.6G Jul  3 00:42 SRR27595110.bam

Made scaled coverage graphs in a subfolder of /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon:

[aloraine@str-i2 star_salmon]$ mkdir coverage_graphs
[aloraine@str-i2 star_salmon]$ cd coverage_graphs/
[aloraine@str-i2 coverage_graphs]$ ln -s ../*bam* .
[aloraine@str-i2 coverage_graphs]$ ln -s ~/src/tardigrade/src/bamCoverage.sh .
[aloraine@str-i2 coverage_graphs]$ ln -s ~/src/tardigrade/src/sbatch-doIt.sh .
[aloraine@str-i2 coverage_graphs]$ sbatch-doIt.sh .bam bamCoverage.sh >jobs.out 2>jobs.err

Made junction files in another subfolder of /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon:

[aloraine@str-i2 star_salmon]$ cd /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon
[aloraine@str-i2 star_salmon]$ mkdir find_junctions
[aloraine@str-i2 star_salmon]$ cd find_junctions/
[aloraine@str-i2 find_junctions]$ ln -s ../*bam* .
[aloraine@str-i2 find_junctions]$ ln -s ~/src/tardigrade/src/sbatch-doIt.sh .
[aloraine@str-i2 find_junctions]$ ln -s ~/src/tardigrade/src/find_junctions.sh .
[aloraine@str-i2 find_junctions]$ ln -s ~/src/tardigrade/src/find-junctions-1.0.0-jar-with-dependencies.jar .
[aloraine@str-i2 find_junctions]$ wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.2bit 
[aloraine@str-i2 find_junctions]$ sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err

Ann Loraine added a comment - 03/Jul/24 9:53 AM - edited

After the data processing completes, I will copy the files to the data hosting location. The location must be visible to the public internet. It is basically just a file hosting service that supports HTTP access from applications like IGB and, of course, Web browsers.

Once that is done, I will need to make it possible for users to select these new data files from within the IGB interface, in the "Available Data" section of the "Data Access" tab.

Currently, we are making tardigrade RNA-Seq data available as part of IGB's "RNA-Seq" Quickload data source.

For that, I just need to add the new files to the "annots.xml" file deployed in IGB's Quickload data source named "RNA-Seq." This is a "default" data source, meaning that when I download and install IGB, it is present in the list of Data Sources I can work with and access. Whenever I need to know more about a given Data Source, I can get more information about it by opening the Data Sources tab of the Preferences window. As of now, the "RNA-Seq" Quickload Data Source occupies the top of the list.

To add this new data set to the annots.xml, I need to:

Using the Run Table to start, create a new Excel spreadsheet file: tardigrade / Documentation / inputForMakeAnnotsXml / SRP484252_for_AnnotsXml.xlsx.

This file should have new columns specifying visual styles, such as foreground colors, for each sample (SRR run identifier). All the data files for a given SRR run identifier have the SRR run identifier as the first part of the file name.

I then add new code to tardigrade / src / makeAnnotsXml.py function "getSampleSheets" to import the new Excel spreadsheet SRP484252_for_AnnotsXml.xlsx

Within the tardigrade "src" directory, I will run makeAnnotsXml.py.

This code will read the spreadsheets and output a new "annots.xml", saving it to a local directory within the tardigrade clone: tardigrade/ForGenomeBrowsers/quickload/H_exemplaris_Z151_Apr_2017.

The directory "quickload" is itself a valid quickload data source. For testing, I add it as a local Quickload data source to IGB. All the files should be accessible now. I can open them and look around, and if I don't like the colors, I can change them by editing the spreadsheet "SRP484252_for_AnnotsXml.xlsx" and re-running makeAnnotsXml.py. When I do that, however, I will need to click the "refresh" button in the first column of the Data Sources table in the Data Sources tab of the Preferences window in IGB. I've noticed that sometimes this refresh doesn't work. I don't know why! If I observe weird behavior, I usually just remove the data source I'm testing and add it back again. Or restart IGB.

Note that makeAnnotsXml.py has dependencies on another repository called "igbquickload," which means I need to make sure that other code is in my "PYTHONPATH," an environment variable specifying where the python program can find dependencies imported in makeAnnotsXml.py code.

To make this work, I have added these lines to my .bash_profile on my personal computer:

export SRC=$HOME/src
export PYTHONPATH=.:$SRC/igbquickload

And then I clone the repository in a subdirectory "src" I created in my home directory. (That's where I keep all my cloned repositories.)

The repository with dependencies is here: https://bitbucket.org/lorainelab/igbquickload/
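
Before running makeAnnotsXml.py, it can help to confirm that the igbquickload clone really is on PYTHONPATH in the current shell. A minimal sketch follows; the function name on_pythonpath is hypothetical.

```shell
# Hypothetical check: succeed only if the given directory appears as an
# entry in the colon-separated PYTHONPATH.
on_pythonpath() {
    case ":${PYTHONPATH:-}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
# on_pythonpath "$HOME/src/igbquickload" && echo "igbquickload is on PYTHONPATH"
```

The match is on the literal string, so the path must be written the same way as in .bash_profile (for example, with $HOME expanded in both places).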

Ann Loraine added a comment - 03/Jul/24 10:25 AM

Coverage graphs completed. Each "job" wrote information about its parameter settings to stderr. Here is an example:

SRR27595110.err
::::::::::::::
normalization: CPM
bamFilesList: ['SRR27595110.bam']
binLength: 1
numberOfSamples: None
blackListFileName: None
skipZeroOverZero: False
bed_and_bin: False
genomeChunkSize: None
defaultFragmentLength: read length
numberOfProcessors: 1
verbose: False
region: None
bedFile: None
minMappingQuality: None
ignoreDuplicates: False
chrsToSkip: []
stepSize: 1
center_read: False
samFlag_include: None
samFlag_exclude: None
minFragmentLength: 0
maxFragmentLength: 0
zerosToNans: False
smoothLength: None
save_data: False
out_file_for_raw_data: None
maxPairedFragmentLength: 1000

This is important for our future reference because users often want to know how the scaling (a kind of normalization) was done. The above notes indicate the scaling / normalization was done using a method called "CPM". CPM stands for "counts per million."
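
As a worked example of the arithmetic, assuming the usual definition of CPM (the read count for a bin divided by the total number of mapped reads, expressed in millions), the scaling can be sketched as below. The function name cpm is hypothetical and the numbers are made up for illustration; they are not taken from these samples.

```shell
# Sketch of the CPM arithmetic: count * 1,000,000 / total mapped reads.
cpm() {
    awk -v count="$1" -v total="$2" 'BEGIN { printf "%.2f\n", count * 1000000 / total }'
}

# Made-up numbers: 50 reads covering a bin, 20 million mapped reads in the sample.
cpm 50 20000000   # prints 2.50
```

Because each sample is divided by its own total, coverage values become comparable between samples with different sequencing depths.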

Ann Loraine added a comment - 03/Jul/24 10:37 AM - edited

Preparing to transfer data to data hosting file system:

In the top-level directory /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252, made a new subdirectory structure: for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252
Confirmed that the coverage graph jobs completed with:

[aloraine@str-i2 coverage_graphs]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon/coverage_graphs
[aloraine@str-i2 coverage_graphs]$ ls -lh *.tbi | wc -l
12
[aloraine@str-i2 coverage_graphs]$ ls -lh *.bedgraph.gz | cut -f5 -d ' ' | grep -c "M" 
12

The above code confirmed that all 12 index files (suffix .tbi) exist and that all 12 data files (suffix .bedgraph.gz) have sizes reported in megabytes, so none of them is empty.
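
A more direct way to test for non-empty files is the shell's -s test, which is true only when a file exists and has a size greater than zero. The following is a minimal sketch; the function name count_nonempty is hypothetical.

```shell
# Hypothetical helper: count how many of the given files exist and are non-empty.
count_nonempty() {
    local n=0 f
    for f in "$@"; do
        # -s is true only when the file exists and its size is greater than zero
        if [ -s "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example, run from the coverage_graphs directory:
#   count_nonempty *.bedgraph.gz
#   count_nonempty *.tbi
```

If every expected file is present and non-empty, each call should print the number of samples.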

Moved coverage graphs (since those are done) to SRP484252 staging location with:

[aloraine@str-i2 coverage_graphs]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon/coverage_graphs
[aloraine@str-i2 coverage_graphs]$ ls *bedgraph*
SRR27595099.scaled.bedgraph.gz      SRR27595102.scaled.bedgraph.gz      SRR27595105.scaled.bedgraph.gz      SRR27595108.scaled.bedgraph.gz
SRR27595099.scaled.bedgraph.gz.tbi  SRR27595102.scaled.bedgraph.gz.tbi  SRR27595105.scaled.bedgraph.gz.tbi  SRR27595108.scaled.bedgraph.gz.tbi
SRR27595100.scaled.bedgraph.gz      SRR27595103.scaled.bedgraph.gz      SRR27595106.scaled.bedgraph.gz      SRR27595109.scaled.bedgraph.gz
SRR27595100.scaled.bedgraph.gz.tbi  SRR27595103.scaled.bedgraph.gz.tbi  SRR27595106.scaled.bedgraph.gz.tbi  SRR27595109.scaled.bedgraph.gz.tbi
SRR27595101.scaled.bedgraph.gz      SRR27595104.scaled.bedgraph.gz      SRR27595107.scaled.bedgraph.gz      SRR27595110.scaled.bedgraph.gz
SRR27595101.scaled.bedgraph.gz.tbi  SRR27595104.scaled.bedgraph.gz.tbi  SRR27595107.scaled.bedgraph.gz.tbi  SRR27595110.scaled.bedgraph.gz.tbi
[aloraine@str-i2 coverage_graphs]$ mv *bedgraph* ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/. 
[aloraine@str-i2 coverage_graphs]$ chmod a+r ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/.

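Note that the final command is meant to ensure that all files in the staging location are world-readable. As written, its target (ending in /.) is the staging directory itself, so it changes that directory's permissions rather than those of the files inside it; a target ending in /* would cover the files. I did that because when I run the transfer command, I'll use an option that preserves file permissions.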
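
Before a transfer that preserves permissions, it can help to list any staged files that are not yet world-readable. The following is a minimal sketch using find; the function name list_not_world_readable is hypothetical.

```shell
# Hypothetical helper: list regular files under a directory that lack the
# world-readable permission bit (octal 004).
list_not_world_readable() {
    find "$1" -type f ! -perm -004
}

# Example, run from the top-level SRP484252 directory:
#   list_not_world_readable for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252
```

Empty output means every file under the directory is world-readable.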

Deployed RNA-Seq alignment files (bam and bam.bai) to staging location with:

[aloraine@str-i2 star_salmon]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon
[aloraine@str-i2 star_salmon]$ ls *bam*
SRR27595099.bam      SRR27595101.bam      SRR27595103.bam      SRR27595105.bam      SRR27595107.bam      SRR27595109.bam
SRR27595099.bam.bai  SRR27595101.bam.bai  SRR27595103.bam.bai  SRR27595105.bam.bai  SRR27595107.bam.bai  SRR27595109.bam.bai
SRR27595100.bam      SRR27595102.bam      SRR27595104.bam      SRR27595106.bam      SRR27595108.bam      SRR27595110.bam
SRR27595100.bam.bai  SRR27595102.bam.bai  SRR27595104.bam.bai  SRR27595106.bam.bai  SRR27595108.bam.bai  SRR27595110.bam.bai
[aloraine@str-i2 star_salmon]$ cp *bam* ../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/.

The find junctions jobs are still running, but I can get started copying the completed files that are ready now to the hosting site.


Ann Loraine added a comment - 03/Jul/24 10:53 AM - edited

Setting up and deploying data to hosting site:

Logged into the hosting site from my local computer, while using the UNC Charlotte VPN, with:

local aloraine$ ssh aloraine@data.bioviz.org

#############################################################################

Use of the University's computing and electronic communication resources is
conditioned on compliance with the University's Information Technology (IT)
policies (Policy Statements 311, 304, 303, 601.14, 307 and 302.) Pursuant to
those policies, the University will take any steps necessary to safeguard the
integrity of the University's computing and electronic communication resources
and to minimize the risk to both those resources and the end users of those
resources. Such safeguarding includes monitoring data traffic to detect
anomalous network activity, as well as accessing, retrieving, reading and/or
disclosing data communications when there is reasonable cause to suspect a
violation of applicable University policy or criminal law, or when monitoring
is otherwise required or permitted by law. 

#############################################################################

Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-88-generic x86_64)

 System information as of Wed Jul  3 10:44:00 AM EDT 2024

  System load:    0.22             Processes:               261
  Usage of /home: 0.0% of 5.82GB   Users logged in:         2
  Memory usage:   9%               IPv4 address for ens160: 10.16.57.232
  Swap usage:     0%

Expanded Security Maintenance for Applications is not enabled.

1 update can be applied immediately.
To see these additional updates run: apt list --upgradable

Enable ESM Apps to receive additional future security updates.
See https://ubuntu.com/esm or run: sudo pro status


3 updates could not be installed automatically. For more details,
see /var/log/unattended-upgrades/unattended-upgrades.log

*** System restart required ***

Note: I did not need to enter a password when I logged in because I previously added my local computer's public key to the "authorized_keys" file in my home directory's .ssh folder on the remote host.
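
OpenSSH reads the public keys allowed to log in from the file ~/.ssh/authorized_keys in the remote account. The following is a minimal sketch of adding a key by hand, assuming a one-line public key file. The function name add_authorized_key and its arguments are hypothetical; tools such as ssh-copy-id do a similar job.

```shell
# Hypothetical helper: append a public key to an authorized_keys file,
# unless that exact line is already there.
#   $1 = path to a one-line public key file (for example id_ed25519.pub)
#   $2 = path to the .ssh directory of the account being logged into
add_authorized_key() {
    local pubkey_file="$1" ssh_dir="$2"
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"                      # sshd expects restrictive permissions
    touch "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"
    # -x whole-line match, -F fixed string, -f read the pattern from the key file
    if ! grep -qxF -f "$pubkey_file" "$ssh_dir/authorized_keys"; then
        cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
    fi
}
```

Calling the function twice with the same key leaves a single copy of that key in the file.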

Create deployment directory for this new SRP484252 dataset:

aloraine@cci-vm12:~$ cd /mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017/
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ 
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ mkdir SRP484252

Copy data from the cluster to this new location, using rsync, after starting a tmux session:

tmux new -s base

Start the transfer with:

aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ rsync -rtpvz aloraine@10.16.115.245:/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/* SRP484252/.
Connected.

(aloraine@10.16.115.245) Password: 
(aloraine@10.16.115.245) Duo two-factor login for aloraine

Enter a passcode or select one of the following options:

 1. Phone call to XXX-XXX-XXXX
 2. Phone call to XXX-XXX-XXXX
 3. Phone call to XXX-XXX-XXXX
 4. Phone call to XXX-XXX-XXXX
 5. SMS passcodes to XXX-XXX-XXXX (next code starts with: 2)
 6. SMS passcodes to XXX-XXX-XXXX
 7. SMS passcodes to XXX-XXX-3048 (next code starts with: 2)

Passcode or option (1-7): 3
receiving incremental file list
SRR27595099.bam
...

The tmux command launches a session, ensuring that if I get disconnected, the transfer won't halt. The rsync command then copies files from the source file system onto this recipient file system. Later, when the find junctions output files are ready, I'll repeat the same command, and only the new files will get copied; files that are already copied won't get re-copied.
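
The transfer step can be sketched as a small wrapper that spells out what each rsync option does. This is only an illustration: the function name sync_staging and the RSYNC override variable are hypothetical, and the host and paths in the example are placeholders.

```shell
# RSYNC can be overridden (for example RSYNC=echo) to preview the command
# without running it.
RSYNC="${RSYNC:-rsync}"

# Hypothetical wrapper around the transfer command.
#   -r  recurse into directories
#   -t  preserve modification times
#   -p  preserve permissions
#   -v  verbose output
#   -z  compress data during transfer
# By default rsync skips files whose size and modification time already match
# at the destination, which is why a repeated run copies only new files.
sync_staging() {
    local src="$1" dest="$2"
    "$RSYNC" -rtpvz "$src" "$dest"
}

# Example (placeholder host and paths):
#   sync_staging user@cluster:/path/to/staging/ SRP484252/
# If the connection drops, the tmux session named "base" can be rejoined with:
#   tmux attach -t base
```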

Note how rsync triggers a request for me to authenticate my user account. Because the UNC Charlotte cluster system is a desirable target for pests to bitcoin mine and do other useless awful crap, we have heavy security. No one can access that system without providing a password and a second form of authentication. Putting your public key into your cluster account does nothing.


Ann Loraine added a comment - 03/Jul/24 10:56 AM - edited

Oops! I forgot to change the file permissions for the sequence alignment (bam and bai) files. That's OK. I'm curious to see if rsync is smart enough to pick up changes in file permissions on the source. I'll use this mistake as an opportunity to see how rsync behaves. I'll make the source bam and bai files world-readable and then repeat the command. Ideally, the recipient file permissions will change without the system re-copying the entire (very large) files onto the host.

Result:

I re-ran the rsync command after changing file permissions. Nothing got transferred. The file permissions did not update.


Ann Loraine added a comment - 03/Jul/24 1:41 PM

Transferring find junctions outputs.

Checked that the files were made, as expected, with:

[aloraine@str-i2 find_junctions]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon/find_junctions
[aloraine@str-i2 find_junctions]$ ls *FJ* | wc -l
24

Copying files to deployment directory (rsync source) and specifying desired permissions with:

[aloraine@str-i2 find_junctions]$ cp *FJ* ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/.
[aloraine@str-i2 find_junctions]$ chmod a+r ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/*
[aloraine@str-i2 find_junctions]$ chmod g+w ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/*


Ann Loraine added a comment - 03/Jul/24 1:43 PM - edited

Repeated the rsync command and noticed that after that, all the file permissions in the target directory match the source directory. Maybe file permissions only get updated properly when some data actually get transferred?
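
One way to check whether permissions match on both sides is to print the mode of every file in the directory on each host and compare the two listings. The following is a minimal sketch, assuming GNU stat (the -c option) is available on both systems; the function name list_modes is hypothetical.

```shell
# Hypothetical helper: print "<octal mode> <name>" for each entry in a
# directory, sorted by name. Assumes GNU stat, as found on Linux.
list_modes() {
    ( cd "$1" && stat -c '%a %n' -- * | sort -k2 )
}

# Example: run on each host, save the output, then compare with diff:
#   list_modes SRP484252 > modes.txt
```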


Ann Loraine added a comment - 03/Jul/24 1:44 PM - edited

Proceeding with adding the new data to annots.xml, as described in preceding comment.

Launched Excel and opened SRA/SRP484252_SraRunTable.txt (selected comma-delimiter option)
Saved first draft to tardigrade/Documentation/inputForMakeAnnotsXml as SRP484252_for_AnnotsXml.xlsx
Also opened SRP450893_for_AnnotsXml.xlsx for reference
Edited SRP484252_for_AnnotsXml.xlsx by adding new columns to the front of the spreadsheet, and a new column "concentration" for sorting; I wanted the samples to be listed in ascending order of Bleomycin concentration. Used formulas in most of the new columns so that SRR codes and the like would get included automatically, without my having to type a bunch of stuff.
Picked colors and saved the hex codes and the actual color as a fill color in the spreadsheet.
Edited src/makeAnnotsXml.py by adding the spreadsheet to a function.
Ran makeAnnotsXml.py. New annots.xml file got written out.
Committed and pushed all the changes.
To deploy the new changes to the rnaseq data source host, logged onto the RENCI VM, changed to the directory for Hypsibius, and ran a script there called "update.sh", which copies the latest annots.xml from the tardigrade repository hosted on bitbucket, with:

cd /projects/igbquickload/lorainelab/www/main/htdocs/rnaseq/H_exemplaris_Z151_Apr_2017
./update.sh

Here is update.sh:

#! /bin/bash
wget --backups=3 https://bitbucket.org/lorainelab/tardigrade/raw/main/ForGenomeBrowsers/quickload/H_exemplaris_Z151_Apr_2017/annots.xml

Confirmed the colors and the coverage graph files were accessible.
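
A scripted check can complement the manual confirmation by listing any run accessions from the SRA run table that are not mentioned in annots.xml. The following is a minimal sketch. It assumes the first comma-separated column of the run table holds the accession under a header containing "Run", and that annots.xml refers to each run by its accession (for example in file names). The function name missing_from_annots is hypothetical.

```shell
# Hypothetical check: print run accessions from the run table that do not
# appear anywhere in annots.xml.
#   $1 = SRA run table (comma-separated, accession in column 1)
#   $2 = annots.xml
missing_from_annots() {
    local run_table="$1" annots="$2" run
    cut -d , -f 1 "$run_table" | grep -v Run | while read -r run; do
        # -F treats the accession as a fixed string, -q suppresses output
        if ! grep -qF "$run" "$annots"; then
            echo "$run"
        fi
    done
}

# Example:
#   missing_from_annots SRP484252_SraRunTable.txt annots.xml
```

Empty output means every run in the table is mentioned at least once in annots.xml.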

Moving to "needs testing"
