IGB / IGBF-3790

Run nf-core/rnaseq v 3.14 on SRP484252 (2024 Goldstein Lab)

    Details

    • Type: Task
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      For this task, create RNA-Seq alignment files (BAM), junction files (bed.gz), and scaled coverage graphs (bedgraph.gz) for data set SRP484252, submitted to the Sequence Read Archive by the Goldstein Lab at UNC Chapel Hill in 2024.

      All code will be saved to main branch in this repository - see: https://bitbucket.org/lorainelab/tardigrade/src/main/

        Attachments

          Issue Links

            Activity

            ann.loraine Ann Loraine created issue -
            ann.loraine Ann Loraine made changes -
            Field Original Value New Value
            Epic Link IGBF-3778 [ 22997 ]
            ann.loraine Ann Loraine made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            ann.loraine Ann Loraine made changes -
            Status In Progress [ 3 ] To-Do [ 10305 ]
            ann.loraine Ann Loraine made changes -
            Sprint Summer 3 [ 197 ] Summer 3, Summer 4 [ 197, 198 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            ann.loraine Ann Loraine added a comment - - edited

            Running prefetch jobs with:

            cut -d , -f 1 SRP484252_SraRunTable.txt | grep -v Run | xargs -I A sbatch --export=S=A --job-name=A --output=A.out --error=A.err prefetch.sh
            

            in:

            /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/fastq

            Confirmed it worked with:

            [aloraine@str-i2 fastq]$ cat *.out | grep -c  "was downloaded successfully"
            12
            [aloraine@str-i2 fastq]$ cut -d , -f 1 SRP484252_SraRunTable.txt | grep -v Run | wc -l
            12
            

            All 12 runs were prefetched correctly.
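The fan-out pattern above can be checked locally by swapping sbatch for echo. This is a sketch using a made-up two-row run table (demo_SraRunTable.txt and its run IDs are hypothetical stand-ins, not real data):

```shell
# Build a tiny stand-in run table: header row plus two fake run IDs.
printf 'Run,Sample\nSRR000001,a\nSRR000002,b\n' > demo_SraRunTable.txt

# Same pipeline as above, with "echo" standing in for sbatch so no
# jobs are actually submitted; each output line is one would-be submission.
cut -d , -f 1 demo_SraRunTable.txt | grep -v Run \
    | xargs -I A echo sbatch --export=S=A --job-name=A --output=A.out --error=A.err prefetch.sh \
    > submissions.txt

cat submissions.txt
```

One submission line per non-header row confirms the cut/grep/xargs plumbing before any real jobs go to the cluster.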

            ann.loraine Ann Loraine added a comment - - edited

            An issue:

            It looks like the .sra files got put into subdirectories for some reason:

            [aloraine@str-i2 fastq]$ find . | grep .sra
            ./SRR27595102/SRR27595102/SRR27595102.sra
            ./SRR27595103/SRR27595103/SRR27595103.sra
            ./SRR27595099/SRR27595099/SRR27595099.sra
            ./SRR27595110/SRR27595110/SRR27595110.sra
            ./SRR27595101/SRR27595101/SRR27595101.sra
            ./SRR27595100/SRR27595100/SRR27595100.sra
            ./SRR27595108/SRR27595108/SRR27595108.sra
            ./SRR27595104/SRR27595104/SRR27595104.sra
            ./SRR27595105/SRR27595105/SRR27595105.sra
            ./SRR27595107/SRR27595107/SRR27595107.sra
            ./SRR27595109/SRR27595109/SRR27595109.sra
            

            Maybe I did not actually need to specify the subdirectories for the accessions to be saved?

            ann.loraine Ann Loraine added a comment - - edited

            Before proceeding to the next steps, I am going to re-do this with a corrected script. It will take more time, but I don't want to leave this issue unresolved.

            The problem: My first version of slurm script prefetch.sh specified the output directory using:

            prefetch $S -O $SLURM_SUBMIT_DIR/$S
            

            This was wrong. There was no need to specify the name of the output directory this way. The prefetch program "knows" to create a directory named for the run id. Correct invocation is:

            prefetch $S -O $SLURM_SUBMIT_DIR
            
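For reference, a minimal prefetch.sh consistent with the corrected invocation might look like the sketch below. This is a reconstruction, not the actual script; the #SBATCH resource values are assumptions. Writing it out via a heredoc lets the sketch be syntax-checked with bash -n without submitting anything:

```shell
# Hypothetical reconstruction of prefetch.sh; $S is the SRA run ID
# passed in with sbatch --export=S=...
cat > prefetch_sketch.sh <<'EOF'
#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --mem=8g

module load sra-tools

# Let prefetch create its own run-id directory under the submit dir.
prefetch $S -O $SLURM_SUBMIT_DIR

# Validate the downloaded archive.
vdb-validate $SLURM_SUBMIT_DIR/$S/$S.sra
EOF

# Syntax-check the sketch without executing it.
bash -n prefetch_sketch.sh && echo "syntax ok"
```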

            New run completed without blocking errors. However, there was a warning. Not sure what it means.

            Example:

            Loading sra-tools/2.11.0
              Loading requirement: hdf5/1.10.7
            2024-07-02T19:24:46 vdb-validate.2.11.0 info: Database 'SRR27595104.sra' metadata: md5 ok
            2024-07-02T19:24:46 vdb-validate.2.11.0 info: Table 'SEQUENCE' metadata: md5 ok
            2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'ALTREAD': checksums ok
            2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'ORIGINAL_QUALITY': checksums ok
            2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'READ': checksums ok
            2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'SPOT_GROUP': checksums ok
            2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'X': checksums ok
            2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'Y': checksums ok
            2024-07-02T19:24:46 vdb-validate.2.11.0 warn: type unrecognized while validating database - Database '/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/fastq/SRR27595104/SRR27595104.sra' has unrecognized type 'NCBI:SRA:Illumina:db'
            2024-07-02T19:24:46 vdb-validate.2.11.0 info: Database 'SRR27595104.sra' is consistent
            

            I don't know what this means. Moving ahead to the next step anyway.

            ann.loraine Ann Loraine added a comment - - edited

            Wrote a new script "fasterq-dump.sh" to convert the downloaded .sra files to fastq files.
            Ran it with a command that pipes the run table file to xargs, sets variable A to each input value using the -I xargs option, and then uses sbatch to submit fasterq-dump.sh, exporting variable A as variable S, as with the preceding commands. See the comment in fasterq-dump.sh for an example invocation.

            Result: The .sra files were converted to fastq files, and each .sra file produced _1 and _2 (read 1 and read 2) files, as expected, since each .sra file came from a paired-end run of an Illumina sequencer.

            Next, ran gzip.sh to compress (gzip) each fastq file, using xargs to loop over each fastq file name, like this:

            ls *.fastq | xargs -I A sbatch --export=F=A --job-name=A --output=A.out --error=A.err gzip.sh
            

            After compressing the fastq files, deleted the downloaded .sra files. The downloaded fastq files are in /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/fastq.
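A minimal gzip.sh consistent with that submission command could be as small as the sketch below. This is a reconstruction, not the actual script; $F is the fastq file name exported by sbatch, and the #SBATCH value is an assumption. Since the body is plain gzip, it can be exercised locally without slurm:

```shell
# Hypothetical reconstruction of gzip.sh.
cat > gzip_sketch.sh <<'EOF'
#!/bin/bash
#SBATCH --time=02:00:00

# $F is the fastq file to compress, passed via sbatch --export=F=...
gzip "$F"
EOF

# Local dry run: compress a throwaway file the same way slurm would.
printf '@read1\nACGT\n+\nFFFF\n' > demo_1.fastq
F=demo_1.fastq bash gzip_sketch.sh
ls demo_1.fastq.gz
```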

            ann.loraine Ann Loraine added a comment - - edited

            Setting up everything needed to launch nf-core/rnaseq pipeline:

            (1) Make input sample names file for SRP484252. It should look like this:

            header: sample,fastq_1,fastq_2,strandedness
            [sample name],[read 1 fastq file name],[read 2 fastq file name],auto
            ... one line per SRR run id

            Made the above required file with:

            echo sample,fastq_1,fastq_2,strandedness > samples.csv
            cut -f 1 -d , SRP484252_SraRunTable.txt | grep -v Run | xargs -I A echo  A,A_1.fastq.gz,A_2.fastq.gz,auto >> samples.csv
            

            Confirmed contents of samples.csv:

            [aloraine@str-i1 SRP484252]$ cat samples.csv 
            sample,fastq_1,fastq_2,strandedness
            SRR27595099,SRR27595099_1.fastq.gz,SRR27595099_2.fastq.gz,auto
            SRR27595100,SRR27595100_1.fastq.gz,SRR27595100_2.fastq.gz,auto
            SRR27595101,SRR27595101_1.fastq.gz,SRR27595101_2.fastq.gz,auto
            SRR27595102,SRR27595102_1.fastq.gz,SRR27595102_2.fastq.gz,auto
            SRR27595103,SRR27595103_1.fastq.gz,SRR27595103_2.fastq.gz,auto
            SRR27595104,SRR27595104_1.fastq.gz,SRR27595104_2.fastq.gz,auto
            SRR27595105,SRR27595105_1.fastq.gz,SRR27595105_2.fastq.gz,auto
            SRR27595106,SRR27595106_1.fastq.gz,SRR27595106_2.fastq.gz,auto
            SRR27595107,SRR27595107_1.fastq.gz,SRR27595107_2.fastq.gz,auto
            SRR27595108,SRR27595108_1.fastq.gz,SRR27595108_2.fastq.gz,auto
            SRR27595109,SRR27595109_1.fastq.gz,SRR27595109_2.fastq.gz,auto
            SRR27595110,SRR27595110_1.fastq.gz,SRR27595110_2.fastq.gz,auto
            

            (2) Set the following environment variables in my account by adding these lines to my .bash_profile file:

            export NXF_OFFLINE=FALSE
            export NXF_SINGULARITY_CACHEDIR=/projects/tomato_genome/scripts/nxf_singularity_cachedir2
            export NXF_OPTS="-Xms1g -Xmx4g"
            export NXF_EXECUTOR=slurm
            

            NXF_SINGULARITY_CACHEDIR is a location where my account has write permission. A location in my home directory would probably also work.

            (3) Downloaded genome assembly-specific files required for the pipeline to run:

            wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.fa
            wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.bed.gz
            wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.gff
            

            Uncompressed bed.gz and removed the final two columns:

            gunzip -c H_exemplaris_Z151_Apr_2017.bed.gz | cut -f 1-12 > H_exemplaris_Z151_Apr_2017.bed
            

            (4) Started a tmux session on the head node by entering:

            tmux new -s base
            

            This ensures that if I lose my connection, the session will continue to run.

            If I get disconnected, I can log back into the same login (head) node, and enter:

            tmux attach-session -t base 
            

            (5) Inside the tmux session, launched an interactive "job" on the cluster with:

            [aloraine@str-i2 SRP484252]$ srun --partition Orion --cpus-per-task 5 --mem-per-cpu 12000 --time 60:00:00 --pty bash
            [aloraine@str-c141 SRP484252]$
            

            (6) Loaded nextflow module with:

            [aloraine@str-c141 SRP484252]$ module load nf-core
            Loading nf-core/2.12.1
              Loading requirement: anaconda3/2020.11
            (nf-core-2.12.1) [aloraine@str-c141 SRP484252]$ 
            

            (7) Ran nextflow with:

            (nf-core-2.12.1) [aloraine@str-c141 SRP484252]$ nextflow run nf-core/rnaseq -resume -profile singularity -r 3.14.0 -params-file H_exemplaris_Z151_Apr_2017-params.yaml 1>out.1 2>err.1 

            This command runs the nf-core/rnaseq pipeline within the interactive session, saving the "standard out" stream to file out.1 and "standard error" to err.1. If there are errors, I will see them written to these files. Nextflow also creates logging files with the file name prefix ".nextflow.log"; if something goes wrong, I can look at those files for help.
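The params file itself is not reproduced in this ticket. Based on the files downloaded in step (3) and the samplesheet made in step (1), H_exemplaris_Z151_Apr_2017-params.yaml likely resembles the sketch below; the keys are standard nf-core/rnaseq 3.14 parameter names, but the exact contents are an assumption:

```yaml
# Assumed contents; not copied from the actual params file.
input: samples.csv
outdir: results
fasta: H_exemplaris_Z151_Apr_2017.fa
gff: H_exemplaris_Z151_Apr_2017.gff
gene_bed: H_exemplaris_Z151_Apr_2017.bed
```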

            ann.loraine Ann Loraine added a comment - - edited

            Update:

            • Pipeline nf-core/rnaseq revision 3.14 has finished.
            • Added multiqc report to repository "tardigrade" in Documentation/multiqcReports as file name SRP484252-multiqc_report.html.
            • Reviewed SRP484252-multiqc_report.html and found no problems.
            ann.loraine Ann Loraine made changes -
            Summary Run pipeline on SRP484252 (2024 Goldstein Lab) Run nf-core/rnaseq v 3.14 on SRP484252 (2024 Goldstein Lab)
            ann.loraine Ann Loraine added a comment -

            Proceeding to post nf-core/rnaseq data processing steps:

            • Changed to results directory /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon
            • Made a symbolic link to renameBams.sh from the "src" directory in my home folder with:
            [aloraine@str-i2 star_salmon]$ pwd
            /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon
            [aloraine@str-i2 star_salmon]$ ln -s ~/src/tardigrade/src/renameBams.sh .
            

            Ran with:

            [aloraine@str-i2 star_salmon]$ renameBams.sh 
            [aloraine@str-i2 star_salmon]$ 
            

            This is not a slurm script. All it does is change file names.
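renameBams.sh is not shown in the ticket; below is a sketch of the renaming it presumably performs. The `.markdup.sorted.bam` suffix is an assumption about the nf-core/rnaseq star_salmon output naming, and the touch line just creates a stand-in file for the demo:

```shell
# Stand-in file so the loop has something to rename in this demo.
touch SRR27595099.markdup.sorted.bam

# Hypothetical reconstruction: strip the pipeline suffix so files are
# named by SRR run ID alone (SRRxxxx.markdup.sorted.bam -> SRRxxxx.bam).
for f in *.markdup.sorted.bam; do
    mv "$f" "${f%.markdup.sorted.bam}.bam"
done

ls *.bam
```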

            For example, here are the new BAM file names:

            [aloraine@str-i2 star_salmon]$ ls -lh *.bam
            -rw-r----- 1 aloraine tomato_genome 1.5G Jul  3 00:42 SRR27595099.bam
            -rw-r----- 1 aloraine tomato_genome 1.6G Jul  3 00:42 SRR27595100.bam
            -rw-r----- 1 aloraine tomato_genome 1.5G Jul  3 00:42 SRR27595101.bam
            -rw-r----- 1 aloraine tomato_genome 1.5G Jul  3 00:42 SRR27595102.bam
            -rw-r----- 1 aloraine tomato_genome 1.6G Jul  3 00:43 SRR27595103.bam
            -rw-r----- 1 aloraine tomato_genome 1.7G Jul  3 00:43 SRR27595104.bam
            -rw-r----- 1 aloraine tomato_genome 1.5G Jul  3 00:42 SRR27595105.bam
            -rw-r----- 1 aloraine tomato_genome 1.5G Jul  3 00:41 SRR27595106.bam
            -rw-r----- 1 aloraine tomato_genome 1.6G Jul  3 00:42 SRR27595107.bam
            -rw-r----- 1 aloraine tomato_genome 1.6G Jul  3 00:42 SRR27595108.bam
            -rw-r----- 1 aloraine tomato_genome 1.6G Jul  3 00:42 SRR27595109.bam
            -rw-r----- 1 aloraine tomato_genome 1.6G Jul  3 00:42 SRR27595110.bam
            
            • Made scaled coverage graphs in a subfolder of /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon:
            [aloraine@str-i2 star_salmon]$ mkdir coverage_graphs
            [aloraine@str-i2 star_salmon]$ cd coverage_graphs/
            [aloraine@str-i2 coverage_graphs]$ ln -s ../*bam* .
            [aloraine@str-i2 coverage_graphs]$ ln -s ~/src/tardigrade/src/bamCoverage.sh .
            [aloraine@str-i2 coverage_graphs]$ ln -s ~/src/tardigrade/src/sbatch-doIt.sh .
            [aloraine@str-i2 coverage_graphs]$ sbatch-doIt.sh .bam bamCoverage.sh >jobs.out 2>jobs.err
            
            • Made junction files in another subfolder of /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon:
            [aloraine@str-i2 star_salmon]$ cd /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon
            [aloraine@str-i2 star_salmon]$ mkdir find_junctions
            [aloraine@str-i2 star_salmon]$ cd find_junctions/
            [aloraine@str-i2 find_junctions]$ ln -s ../*bam* .
            [aloraine@str-i2 find_junctions]$ ln -s ~/src/tardigrade/src/sbatch-doIt.sh .
            [aloraine@str-i2 find_junctions]$ ln -s ~/src/tardigrade/src/find_junctions.sh .
            [aloraine@str-i2 find_junctions]$ ln -s ~/src/tardigrade/src/find-junctions-1.0.0-jar-with-dependencies.jar .
            [aloraine@str-i2 find_junctions]$ wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.2bit 
            [aloraine@str-i2 find_junctions]$ sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err
            
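sbatch-doIt.sh is also not reproduced here. Given how it is invoked (a file suffix plus a worker script), it is probably close to the sketch below; this is a reconstruction, not the actual script, and it is only syntax-checked here since sbatch is unavailable outside the cluster:

```shell
# Hypothetical reconstruction of sbatch-doIt.sh.
cat > sbatch-doIt_sketch.sh <<'EOF'
#!/bin/bash
# Usage: sbatch-doIt.sh <suffix> <worker script>
# Submits one slurm job per file ending in <suffix>, exporting the
# file name as S, as in the earlier per-run submissions.
SUFFIX=$1
SCRIPT=$2
ls *"$SUFFIX" | xargs -I A sbatch --export=S=A --job-name=A --output=A.out --error=A.err "$SCRIPT"
EOF

# Syntax-check the sketch without executing it.
bash -n sbatch-doIt_sketch.sh && echo "syntax ok"
```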
            ann.loraine Ann Loraine added a comment - - edited

            After the data processing completes, I will copy the files to the data hosting location. The location must be visible to the public internet; it is basically a file hosting service that supports HTTP access from applications like IGB and, of course, Web browsers.

            Once that is done, I will need to make it possible for users to select these new data files from within the IGB interface, in the "Available Data" section of the "Data Access" tab.

            Currently, we are making tardigrade RNA-Seq data available as part of IGB's "RNA-Seq" Quickload data source.

            For that, I just need to add the new files to the "annots.xml" file deployed in IGB's Quickload data source named "RNA-Seq." This is a "default" data source, meaning that when I download and install IGB, it is present in the list of Data Sources I can work with and access. Whenever I need to know more about a given Data Source, I can get more information about it by opening the Data Sources tab of the Preferences window. As of now, the "RNA-Seq" Quickload Data Source occupies the top of the list.

            To add this new data set to the annots.xml, I need to:

            • Using the Run Table to start, create a new Excel spreadsheet file: tardigrade / Documentation / inputForMakeAnnotsXml / SRP484252_for_AnnotsXml.xlsx.

            This file should have new columns specifying visual styles, such as foreground colors, for each sample (SRR run identifier). All the data files for a given SRR run identifier have the SRR run identifier as the first part of the file name.

            • I then add new code to tardigrade / src / makeAnnotsXml.py function "getSampleSheets" to import the new Excel spreadsheet SRP484252_for_AnnotsXml.xlsx
            • Within the tardigrade "src" directory, I will run makeAnnotsXml.py.

            This code will read the spreadsheets and output a new "annots.xml", saving it to a local directory within the tardigrade clone: tardigrade/ForGenomeBrowsers/quickload/H_exemplaris_Z151_Apr_2017.

            The directory "quickload" is itself a valid quickload data source. For testing, I add it as a local Quickload data source to IGB. All the files should be accessible now. I can open them and look around, and if I don't like the colors, I can change them by editing the spreadsheet "SRP484252_for_AnnotsXml.xlsx" and re-running makeAnnotsXml.py. When I do that, however, I will need to click the "refresh" button in the first column of the Data Sources table in the Data Sources tab of the Preferences window in IGB. I've noticed that sometimes this refresh doesn't work. I don't know why! If I observe weird behavior, I usually just remove the data source I'm testing and add it back again. Or restart IGB.

            Note that makeAnnotsXml.py has dependencies on another repository called "igbquickload," which means I need to make sure that other code is in my "PYTHONPATH," an environment variable specifying where the python program can find dependencies imported in makeAnnotsXml.py code.

            To make this work, I have added these lines to the .bash_profile on my personal computer:

            export SRC=$HOME/src
            export PYTHONPATH=.:$SRC/igbquickload
            

            I then clone the repository into a subdirectory "src" that I created in my home directory. (That's where I keep all my cloned repositories.)

            The repository with dependencies is here: https://bitbucket.org/lorainelab/igbquickload/

            ann.loraine Ann Loraine added a comment -

            Coverage graphs completed. Each "job" wrote information about its parameter settings to stderr. Here is an example:

            SRR27595110.err
            ::::::::::::::
            normalization: CPM
            bamFilesList: ['SRR27595110.bam']
            binLength: 1
            numberOfSamples: None
            blackListFileName: None
            skipZeroOverZero: False
            bed_and_bin: False
            genomeChunkSize: None
            defaultFragmentLength: read length
            numberOfProcessors: 1
            verbose: False
            region: None
            bedFile: None
            minMappingQuality: None
            ignoreDuplicates: False
            chrsToSkip: []
            stepSize: 1
            center_read: False
            samFlag_include: None
            samFlag_exclude: None
            minFragmentLength: 0
            maxFragmentLength: 0
            zerosToNans: False
            smoothLength: None
            save_data: False
            out_file_for_raw_data: None
            maxPairedFragmentLength: 1000
            

            This is important for future reference because users often want to know how the scaling (a form of normalization) was done. The notes above show that scaling used the "CPM" method, which stands for "counts per million."
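            For reference, CPM scaling itself is simple arithmetic: each per-base coverage value is multiplied by one million divided by the total number of mapped reads in the library. A minimal sketch (the read counts and coverage values here are made up):

            ```python
            def cpm_scale_factor(total_mapped_reads):
                """Scale factor that converts raw counts to counts per million (CPM)."""
                return 1_000_000 / total_mapped_reads

            # Example: a library with 25 million mapped reads.
            factor = cpm_scale_factor(25_000_000)      # 0.04
            raw_coverage = [0, 10, 250, 50]            # raw read depth at four positions
            scaled = [round(c * factor, 3) for c in raw_coverage]
            print(scaled)  # [0.0, 0.4, 10.0, 2.0]
            ```

            Because every sample is divided by its own library size, CPM-scaled coverage graphs from libraries of different depths can be compared on the same vertical scale.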

            ann.loraine Ann Loraine added a comment - - edited

            Preparing to transfer data to data hosting file system:

            • In top-level directory /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252 made new subdirectory structure: for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252
            • Confirmed completion with:
            [aloraine@str-i2 coverage_graphs]$ pwd
            /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon/coverage_graphs
            [aloraine@str-i2 coverage_graphs]$ ls -lh *.tbi | wc -l
            12
            [aloraine@str-i2 coverage_graphs]$ ls -lh *.bedgraph.gz | cut -f5 -d ' ' | grep -c "M" 
            12
            

            The above commands confirmed that all 12 index files (suffix .tbi) exist and that all 12 data files (suffix .bedgraph.gz) have sizes in the megabyte range (the grep matches the "M" in the human-readable size column printed by ls -lh).
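            An equivalent, stricter check can be scripted in Python, asserting directly on file counts and sizes instead of grepping ls output (the directory path and glob patterns are illustrative):

            ```python
            import glob
            import os

            def check_outputs(directory, data_glob="*.bedgraph.gz",
                              index_glob="*.tbi", expected=12):
                """Verify that the expected number of data and index files
                exist in `directory` and that none of them is empty."""
                data = glob.glob(os.path.join(directory, data_glob))
                idx = glob.glob(os.path.join(directory, index_glob))
                assert len(data) == expected, f"expected {expected} data files, found {len(data)}"
                assert len(idx) == expected, f"expected {expected} index files, found {len(idx)}"
                assert all(os.path.getsize(f) > 0 for f in data + idx), "zero-length file found"
                return True
            ```

            Running check_outputs on the coverage_graphs directory would fail loudly if any job died partway and left a truncated or missing file.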

            • Moved coverage graphs (since those are done) to SRP484252 staging location with:
            [aloraine@str-i2 coverage_graphs]$ pwd
            /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon/coverage_graphs
            [aloraine@str-i2 coverage_graphs]$ ls *bedgraph*
            SRR27595099.scaled.bedgraph.gz      SRR27595102.scaled.bedgraph.gz      SRR27595105.scaled.bedgraph.gz      SRR27595108.scaled.bedgraph.gz
            SRR27595099.scaled.bedgraph.gz.tbi  SRR27595102.scaled.bedgraph.gz.tbi  SRR27595105.scaled.bedgraph.gz.tbi  SRR27595108.scaled.bedgraph.gz.tbi
            SRR27595100.scaled.bedgraph.gz      SRR27595103.scaled.bedgraph.gz      SRR27595106.scaled.bedgraph.gz      SRR27595109.scaled.bedgraph.gz
            SRR27595100.scaled.bedgraph.gz.tbi  SRR27595103.scaled.bedgraph.gz.tbi  SRR27595106.scaled.bedgraph.gz.tbi  SRR27595109.scaled.bedgraph.gz.tbi
            SRR27595101.scaled.bedgraph.gz      SRR27595104.scaled.bedgraph.gz      SRR27595107.scaled.bedgraph.gz      SRR27595110.scaled.bedgraph.gz
            SRR27595101.scaled.bedgraph.gz.tbi  SRR27595104.scaled.bedgraph.gz.tbi  SRR27595107.scaled.bedgraph.gz.tbi  SRR27595110.scaled.bedgraph.gz.tbi
            [aloraine@str-i2 coverage_graphs]$ mv *bedgraph* ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/. 
            [aloraine@str-i2 coverage_graphs]$ chmod a+r ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/.
            

            Note that the final command was meant to ensure that files in the staging location are world-readable; I did that because when I run the transfer command, I'll use an option that preserves file permissions. (As written, without a trailing /*, the chmod applies only to the directory itself, not to the files inside it.)

            • Deployed RNA-Seq alignment files (bam and bam.bai) to staging location with:
            [aloraine@str-i2 star_salmon]$ pwd
            /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon
            [aloraine@str-i2 star_salmon]$ ls *bam*
            SRR27595099.bam      SRR27595101.bam      SRR27595103.bam      SRR27595105.bam      SRR27595107.bam      SRR27595109.bam
            SRR27595099.bam.bai  SRR27595101.bam.bai  SRR27595103.bam.bai  SRR27595105.bam.bai  SRR27595107.bam.bai  SRR27595109.bam.bai
            SRR27595100.bam      SRR27595102.bam      SRR27595104.bam      SRR27595106.bam      SRR27595108.bam      SRR27595110.bam
            SRR27595100.bam.bai  SRR27595102.bam.bai  SRR27595104.bam.bai  SRR27595106.bam.bai  SRR27595108.bam.bai  SRR27595110.bam.bai
            [aloraine@str-i2 star_salmon]$ cp *bam* ../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/.
            

            The find junctions jobs are still running, but I can get started copying the completed files that are ready now to the hosting site.

            ann.loraine Ann Loraine added a comment - - edited

            Setting up and deploying data to hosting site:

            • Logged into the hosting site from my local computer, while using the UNC Charlotte VPN, with:
            local aloraine$ ssh aloraine@data.bioviz.org
            
            #############################################################################
            
            Use of the University's computing and electronic communication resources is
            conditioned on compliance with the University's Information Technology (IT)
            policies (Policy Statements 311, 304, 303, 601.14, 307 and 302.) Pursuant to
            those policies, the University will take any steps necessary to safeguard the
            integrity of the University's computing and electronic communication resources
            and to minimize the risk to both those resources and the end users of those
            resources. Such safeguarding includes monitoring data traffic to detect
            anomalous network activity, as well as accessing, retrieving, reading and/or
            disclosing data communications when there is reasonable cause to suspect a
            violation of applicable University policy or criminal law, or when monitoring
            is otherwise required or permitted by law. 
            
            #############################################################################
            
            Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-88-generic x86_64)
            
             System information as of Wed Jul  3 10:44:00 AM EDT 2024
            
              System load:    0.22             Processes:               261
              Usage of /home: 0.0% of 5.82GB   Users logged in:         2
              Memory usage:   9%               IPv4 address for ens160: 10.16.57.232
              Swap usage:     0%
            
            Expanded Security Maintenance for Applications is not enabled.
            
            1 update can be applied immediately.
            To see these additional updates run: apt list --upgradable
            
            Enable ESM Apps to receive additional future security updates.
            See https://ubuntu.com/esm or run: sudo pro status
            
            
            3 updates could not be installed automatically. For more details,
            see /var/log/unattended-upgrades/unattended-upgrades.log
            
            *** System restart required ***
            

            Note: I did not need to enter a password when logging in because I previously added my local computer's public key to the "authorized_keys" file in the .ssh folder of my home directory on the remote host.

            • Create deployment directory for this new SRP484252 dataset:
            aloraine@cci-vm12:~$ cd /mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017/
            aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ 
            aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ mkdir SRP484252
            
            • Copy data from the cluster to this new location, using rsync, after starting a tmux session:
            tmux new -s base
            

            Start the transfer with:

            aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ rsync -rtpvz aloraine@10.16.115.245:/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/* SRP484252/.
            Connected.
            
            (aloraine@10.16.115.245) Password: 
            (aloraine@10.16.115.245) Duo two-factor login for aloraine
            
            Enter a passcode or select one of the following options:
            
             1. Phone call to XXX-XXX-XXXX
             2. Phone call to XXX-XXX-XXXX
             3. Phone call to XXX-XXX-XXXX
             4. Phone call to XXX-XXX-XXXX
             5. SMS passcodes to XXX-XXX-XXXX (next code starts with: 2)
             6. SMS passcodes to XXX-XXX-XXXX
             7. SMS passcodes to XXX-XXX-3048 (next code starts with: 2)
            
            Passcode or option (1-7): 3
            receiving incremental file list
            SRR27595099.bam
            ...
            

            The tmux session ensures that the transfer won't halt if I get disconnected. The rsync command then copies files from the source file system onto this recipient file system. Later, when the find junctions output files are ready, I'll repeat this same command: the new files will get copied, but files that have already been transferred won't be re-copied.

            Note how rsync triggers a request for me to authenticate my user account. Because the UNC Charlotte cluster system is an attractive target for cryptocurrency mining and other abuse, it has heavy security: no one can access the system without providing a password plus a second form of authentication, so putting your public key into your cluster account is not sufficient on its own.

            ann.loraine Ann Loraine added a comment - - edited

            Oops! I forgot to change the file permissions for the sequence alignment (bam and bai) files. That's OK. I'm curious to see if rsync is smart enough to pick up changes in file permissions on the source. I'll use this mistake as an opportunity to see how rsync behaves. I'll make the source bam and bai files world-readable and then repeat the command. Ideally, the recipient file permissions will change without the system re-copying the entire (very large) files onto the host.

            Result:

            I re-ran the rsync command after changing file permissions. Nothing got transferred. The file permissions did not update.
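            One plausible explanation (an assumption on my part, not a verified statement about rsync internals): by default, rsync decides whether a file needs transferring using a "quick check" that compares only size and modification time, and a chmod changes neither. A sketch of that comparison:

            ```python
            import os

            def quick_check_says_skip(src_path, dst_path):
                """Mimic rsync's default quick check: a file is considered
                up to date when size and mtime match. Permission bits play
                no part in the decision."""
                s, d = os.stat(src_path), os.stat(dst_path)
                return s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime)
            ```

            Under this model, a source file whose only change is its mode bits looks identical to its copy, so rsync has nothing to transfer.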

            ann.loraine Ann Loraine added a comment -

            Transferring find junctions outputs.

            Checked that the files were made as expected (24 files, two for each of the 12 runs) with:

            [aloraine@str-i2 find_junctions]$ pwd
            /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon/find_junctions
            [aloraine@str-i2 find_junctions]$ ls *FJ* | wc -l
            24
            

            Copied the files to the staging directory (the rsync source) and set the desired permissions with:

            [aloraine@str-i2 find_junctions]$ cp *FJ* ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/.
            [aloraine@str-i2 find_junctions]$ chmod a+r ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/*
            [aloraine@str-i2 find_junctions]$ chmod g+w ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/*
            
            ann.loraine Ann Loraine added a comment - - edited

            Repeated the rsync command and noticed that after that, all the file permissions in the target directory match the source directory. Maybe file permissions only get updated properly when some data get actually transferred?

            ann.loraine Ann Loraine added a comment - - edited

            Proceeding with adding the new data to annots.xml, as described in the preceding comment.

            • Launched Excel and opened SRA/SRP484252_SraRunTable.txt (selected comma-delimiter option)
            • Saved first draft to tardigrade/Documentation/inputForMakeAnnotsXml as SRP484252_for_AnnotsXml.xlsx
            • Also opened SRP450893_for_AnnotsXml.xlsx for reference
            • Edited SRP484252_for_AnnotsXml.xlsx by adding new columns to the front of the spreadsheet, plus a new column "concentration" for sorting; I wanted the samples listed in ascending order of Bleomycin concentration. Used formulas in most of the new columns so that SRR codes and the like would get filled in automatically, without having to type everything by hand.
            • Picked colors and saved the hex codes and the actual color as a fill color in the spreadsheet.
            • Edited src/makeAnnotsXml.py by adding the new spreadsheet to the "getSampleSheets" function.
            • Ran makeAnnotsXml.py. New annots.xml file got written out.
            • Committed and pushed all the changes.
            • To deploy the new changes to the rnaseq data source host, logged onto the RENCI VM, changed to the directory for Hypsibius, and ran a script there called "update.sh", which fetches the latest annots.xml from the tardigrade repository hosted on Bitbucket, with:
            cd /projects/igbquickload/lorainelab/www/main/htdocs/rnaseq/H_exemplaris_Z151_Apr_2017
            ./update.sh
            

            Here is update.sh:

            #! /bin/bash
            wget --backups=3 https://bitbucket.org/lorainelab/tardigrade/raw/main/ForGenomeBrowsers/quickload/H_exemplaris_Z151_Apr_2017/annots.xml
            

            Confirmed the colors and the coverage graph files were accessible.
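            In outline, what makeAnnotsXml.py did in the steps above can be sketched as follows. This is a simplified illustration, not the actual code: the sample rows, attribute names, and file paths here are assumptions, and the real script reads them from the Excel spreadsheets.

            ```python
            import os
            import tempfile
            import xml.etree.ElementTree as ET

            # Hypothetical per-sample rows, as they might come from the
            # SRP484252_for_AnnotsXml.xlsx spreadsheet (column names are made up).
            samples = [
                {"run": "SRR27595099", "title": "Bleomycin 0 uM rep 1", "foreground": "0000FF"},
                {"run": "SRR27595100", "title": "Bleomycin 0 uM rep 2", "foreground": "0000FF"},
            ]

            root = ET.Element("files")
            for s in samples:
                # One <file> entry per track; the attribute set here is
                # illustrative, not the exact schema the script emits.
                ET.SubElement(root, "file", {
                    "name": f"SRP484252/{s['run']}.scaled.bedgraph.gz",
                    "title": f"SRP484252/coverage/{s['title']}",
                    "foreground": s["foreground"],
                })

            out_path = os.path.join(tempfile.mkdtemp(), "annots.xml")
            ET.ElementTree(root).write(out_path)
            print(open(out_path).read())
            ```

            The key idea is that track styling (colors, titles, grouping) lives in the spreadsheet, and the script mechanically turns each row into an annots.xml entry, so restyling a track never requires hand-editing XML.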

            Moving to "needs testing"

            ann.loraine Ann Loraine made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            ann.loraine Ann Loraine made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            ann.loraine Ann Loraine made changes -
            Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
            ann.loraine Ann Loraine made changes -
            Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
            ann.loraine Ann Loraine made changes -
            Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
            ann.loraine Ann Loraine made changes -
            Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
            ann.loraine Ann Loraine added a comment - - edited

            To test:

            • open genome assembly version H_exemplaris_Z151_Apr_2017
            • open folder containing SRP484252 in the name
            • open read alignments, coverage graphs, and junction files folders
            • check that each file can load by selecting each checkbox, zooming in, and clicking Load Data

            See example image:

            https://bitbucket.org/lorainelab/tardigrade/src/main/GenomeBrowserImages/SRP484252-CoverageGraphs-1.png

            ann.loraine Ann Loraine made changes -
            Description For this task, create RNA-Seq alignments files (BAM), junction files (bed.gz), and scaled coverage graphs (bedgraph.gz) for data set SRP484252, submitted to Sequence Read Archive by Goldstein Lab at UNC Chapel Hill in 2024. For this task, create RNA-Seq alignments files (BAM), junction files (bed.gz), and scaled coverage graphs (bedgraph.gz) for data set SRP484252, submitted to Sequence Read Archive by Goldstein Lab at UNC Chapel Hill in 2024.

            All code will be saved in this repository: https://bitbucket.org/lorainelab/tardigrade/src/main/

            ann.loraine Ann Loraine made changes -
            Description For this task, create RNA-Seq alignments files (BAM), junction files (bed.gz), and scaled coverage graphs (bedgraph.gz) for data set SRP484252, submitted to Sequence Read Archive by Goldstein Lab at UNC Chapel Hill in 2024.

            All code will be saved in this repository: https://bitbucket.org/lorainelab/tardigrade/src/main/

            For this task, create RNA-Seq alignments files (BAM), junction files (bed.gz), and scaled coverage graphs (bedgraph.gz) for data set SRP484252, submitted to Sequence Read Archive by Goldstein Lab at UNC Chapel Hill in 2024.

            All code will be saved to main branch in this repository - see: https://bitbucket.org/lorainelab/tardigrade/src/main/

            ann.loraine Ann Loraine made changes -
            Assignee Ann Loraine [ aloraine ]
            dmarrott Dylan Marrotte (Inactive) made changes -
            Assignee Dylan Marrotte [ dmarrott ]
            dmarrott Dylan Marrotte (Inactive) added a comment -

            Testing:

            Opened genome assembly version H_exemplaris_Z151_Apr_2017, opened the folder containing SRP484252, and loaded each read alignment, coverage graph, and junction file by selecting its checkbox, zooming in, and clicking Load Data.

            All files properly loaded!

            dmarrott Dylan Marrotte (Inactive) made changes -
            Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
            dmarrott Dylan Marrotte (Inactive) made changes -
            Resolution Done [ 10000 ]
            Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]
            dmarrott Dylan Marrotte (Inactive) made changes -
            Assignee Dylan Marrotte [ dmarrott ]
            ann.loraine Ann Loraine made changes -
            Link This issue relates to IGBF-3721 [ IGBF-3721 ]
            ann.loraine Ann Loraine made changes -
            Link This issue relates to IGBF-3735 [ IGBF-3735 ]
            ann.loraine Ann Loraine made changes -
            Link This issue relates to IGBF-3849 [ IGBF-3849 ]

              People

              • Assignee:
                Unassigned
                Reporter:
                ann.loraine Ann Loraine
              • Votes:
                0
                Watchers:
                2

                Dates

                • Created:
                  Updated:
                  Resolved: