Details
Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels: None
Story Points: 2
Sprint: Summer 3, Summer 4
Description
For this task, create RNA-Seq alignment files (BAM), junction files (bed.gz), and scaled coverage graphs (bedgraph.gz) for data set SRP484252, submitted to the Sequence Read Archive by the Goldstein Lab at UNC Chapel Hill in 2024.
All code will be saved to the main branch of this repository - see: https://bitbucket.org/lorainelab/tardigrade/src/main/
Activity
An issue:
It looks like the .sra files got put into subdirectories for some reason:
[aloraine@str-i2 fastq]$ find . | grep .sra
./SRR27595102/SRR27595102/SRR27595102.sra
./SRR27595103/SRR27595103/SRR27595103.sra
./SRR27595099/SRR27595099/SRR27595099.sra
./SRR27595110/SRR27595110/SRR27595110.sra
./SRR27595101/SRR27595101/SRR27595101.sra
./SRR27595100/SRR27595100/SRR27595100.sra
./SRR27595108/SRR27595108/SRR27595108.sra
./SRR27595104/SRR27595104/SRR27595104.sra
./SRR27595105/SRR27595105/SRR27595105.sra
./SRR27595107/SRR27595107/SRR27595107.sra
./SRR27595109/SRR27595109/SRR27595109.sra
Maybe I did not actually need to specify the subdirectories for the accessions to be saved?
Before I proceed to the next steps, I am going to re-do this using a change in the script code. It will take more time, but I don't want to leave this issue unresolved.
The problem: my first version of the slurm script prefetch.sh specified the output directory using:
prefetch $S -O $SLURM_SUBMIT_DIR/$S
This was wrong. There was no need to specify the output directory name this way; the prefetch program already "knows" to create a directory named for the run id. The correct invocation is:
prefetch $S -O $SLURM_SUBMIT_DIR
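For reference, a minimal sketch of what the corrected prefetch.sh could look like. Only the prefetch line comes from these notes; the #SBATCH setting and the vdb-validate step are assumptions (the validate step is suggested by the log output shown below):
#!/bin/bash
#SBATCH --job-name=prefetch   # assumed; the actual #SBATCH settings are not recorded here
module load sra-tools/2.11.0
# S holds the SRR run id, exported via sbatch --export=S=...
prefetch $S -O $SLURM_SUBMIT_DIR
# check the downloaded archive (assumed step; matches the vdb-validate output below)
vdb-validate $SLURM_SUBMIT_DIR/$S/$S.sra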
New run completed without blocking errors. However, there was a warning. Not sure what it means.
Example:
Loading sra-tools/2.11.0
Loading requirement: hdf5/1.10.7
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Database 'SRR27595104.sra' metadata: md5 ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Table 'SEQUENCE' metadata: md5 ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'ALTREAD': checksums ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'ORIGINAL_QUALITY': checksums ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'READ': checksums ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'SPOT_GROUP': checksums ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'X': checksums ok
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Column 'Y': checksums ok
2024-07-02T19:24:46 vdb-validate.2.11.0 warn: type unrecognized while validating database - Database '/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/fastq/SRR27595104/SRR27595104.sra' has unrecognized type 'NCBI:SRA:Illumina:db'
2024-07-02T19:24:46 vdb-validate.2.11.0 info: Database 'SRR27595104.sra' is consistent
I don't know what this means. Moving ahead to the next step anyway.
Wrote a new script "fasterq-dump.sh" to convert the downloaded .sra files to fastq files.
Ran the script with a command that pipes the run table file to xargs, uses the xargs -I option to set a variable A to each run id, and then calls sbatch to submit fasterq-dump.sh for each run, exporting A as the variable S, as with the preceding commands. See the comment in fasterq-dump.sh for an example invocation; a sketch follows.
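That invocation probably looked something like this sketch, which follows the same xargs pattern used for gzip.sh below; the exact sbatch options live in the comment inside fasterq-dump.sh and are assumptions here:
# Hypothetical sketch: submit one fasterq-dump.sh job per run id from the run table,
# exporting each run id (A) as the variable S used inside the script.
cut -d , -f 1 SRP484252_SraRunTable.txt | grep -v Run | \
    xargs -I A sbatch --export=S=A --job-name=A --output=A.out --error=A.err fasterq-dump.sh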
Result: The .sra files were converted to fastq files, and each .sra file produced _1 and _2 (read 1 and read 2) files, as expected, since each of the .sra files was from a paired-end run of an Illumina sequencer.
Next, ran gzip.sh to compress (gzip) each fastq file, using xargs to loop over each fastq file name, like this:
ls *.fastq | xargs -I A sbatch --export=F=A --job-name=A --output=A.out --error=A.err gzip.sh
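gzip.sh itself can be very small; a minimal sketch, assuming it does nothing but compress the file named by the exported variable F:
#!/bin/bash
# Hypothetical sketch of gzip.sh: compress the fastq file whose name is passed in
# via sbatch --export=F=<file name>.
gzip $F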
After compressing the fastq files, deleted the downloaded .sra files. The compressed fastq files are in /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/fastq.
Setting up everything needed to launch nf-core/rnaseq pipeline:
(1) Make input sample names file for SRP484252. It should look like this:
header: sample,fastq_1,fastq_2,strandedness
[sample name],[read 1 fastq file name],[read 2 fastq file name],auto
... one line per SRR run id
Made the above required file with:
echo sample,fastq_1,fastq_2,strandedness > samples.csv
cut -f 1 -d , SRP484252_SraRunTable.txt | grep -v Run | xargs -I A echo A,A_1.fastq.gz,A_2.fastq.gz,auto >> samples.csv
Confirmed contents of samples.csv:
[aloraine@str-i1 SRP484252]$ cat samples.csv
sample,fastq_1,fastq_2,strandedness
SRR27595099,SRR27595099_1.fastq.gz,SRR27595099_2.fastq.gz,auto
SRR27595100,SRR27595100_1.fastq.gz,SRR27595100_2.fastq.gz,auto
SRR27595101,SRR27595101_1.fastq.gz,SRR27595101_2.fastq.gz,auto
SRR27595102,SRR27595102_1.fastq.gz,SRR27595102_2.fastq.gz,auto
SRR27595103,SRR27595103_1.fastq.gz,SRR27595103_2.fastq.gz,auto
SRR27595104,SRR27595104_1.fastq.gz,SRR27595104_2.fastq.gz,auto
SRR27595105,SRR27595105_1.fastq.gz,SRR27595105_2.fastq.gz,auto
SRR27595106,SRR27595106_1.fastq.gz,SRR27595106_2.fastq.gz,auto
SRR27595107,SRR27595107_1.fastq.gz,SRR27595107_2.fastq.gz,auto
SRR27595108,SRR27595108_1.fastq.gz,SRR27595108_2.fastq.gz,auto
SRR27595109,SRR27595109_1.fastq.gz,SRR27595109_2.fastq.gz,auto
SRR27595110,SRR27595110_1.fastq.gz,SRR27595110_2.fastq.gz,auto
(2) Set the following environment variables in my account by adding these lines to my .bash_profile file:
export NXF_OFFLINE=FALSE
export NXF_SINGULARITY_CACHEDIR=/projects/tomato_genome/scripts/nxf_singularity_cachedir2
export NXF_OPTS="-Xms1g -Xmx4g"
export NXF_EXECUTOR=slurm
NXF_SINGULARITY_CACHEDIR is a location where my account has write permission. A location in my home directory would probably work, too.
(3) Downloaded genome assembly-specific files required for the pipeline to run:
wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.fa \
     http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.bed.gz \
     http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.gff
Uncompressed bed.gz and removed the final two columns:
gunzip -c H_exemplaris_Z151_Apr_2017.bed.gz | cut -f1-12 > H_exemplaris_Z151_Apr_2017.bed
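As a quick sanity check (not part of the original notes), the number of columns in the trimmed file can be confirmed with awk:
# Should print 12 if the trailing columns were removed as intended.
awk -F'\t' '{print NF; exit}' H_exemplaris_Z151_Apr_2017.bed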
(4) Started a tmux session on the head node by entering:
tmux new -s base
This ensures that if I lose my connection, the session will continue to run.
If I get disconnected, I can log back into the same login (head) node, and enter:
tmux attach-session -t base
(5) Inside the tmux session, launched an interactive "job" on the cluster with:
[aloraine@str-i2 SRP484252]$ srun --partition Orion --cpus-per-task 5 --mem-per-cpu 12000 --time 60:00:00 --pty bash
[aloraine@str-c141 SRP484252]$
(6) Loaded nextflow module with:
[aloraine@str-c141 SRP484252]$ module load nf-core
Loading nf-core/2.12.1
  Loading requirement: anaconda3/2020.11
(nf-core-2.12.1) [aloraine@str-c141 SRP484252]$
(7) Ran nextflow with:
(nf-core-2.12.1) [aloraine@str-c141 SRP484252]$ nextflow run nf-core/rnaseq -resume -profile singularity -r 3.14.0 -params-file H_exemplaris_Z151_Apr_2017-params.yaml 1>out.1 2>err.1
This command runs the nf-core/rnaseq pipeline inside the interactive session, saving the "standard out" stream to file out.1 and the "standard error" stream to err.1. If there are errors, I will see them written to these files. Nextflow also creates log files with the file name prefix ".nextflow.log"; if something goes wrong, I can look at those files for help.
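The pipeline inputs come in through the -params-file argument. The exact contents of H_exemplaris_Z151_Apr_2017-params.yaml are not recorded in this ticket; a hypothetical sketch, using standard nf-core/rnaseq parameter names and assuming the genome files downloaded in step (3), might be written like this:
# Hypothetical sketch only - paths and parameter choices are assumptions, not the actual file.
cat > H_exemplaris_Z151_Apr_2017-params.yaml <<'EOF'
input: samples.csv
outdir: results
fasta: H_exemplaris_Z151_Apr_2017.fa
gff: H_exemplaris_Z151_Apr_2017.gff
gene_bed: H_exemplaris_Z151_Apr_2017.bed
EOF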
Update:
- Pipeline nf-core/rnaseq revision 3.14.0 has finished.
- Added the MultiQC report to repository "tardigrade" in Documentation/multiqcReports as file name SRP484252-multiqc_report.html.
- Reviewed SRP484252-multiqc_report.html and found no problems.
Proceeding to post-nf-core/rnaseq data processing steps:
- Changed to results directory /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon
- Made a symbolic link to renameBams.sh in my home folder's "src" directory with:
[aloraine@str-i2 star_salmon]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon
[aloraine@str-i2 star_salmon]$ ln -s ~/src/tardigrade/src/renameBams.sh .
Ran with:
[aloraine@str-i2 star_salmon]$ renameBams.sh
[aloraine@str-i2 star_salmon]$
This is not a slurm script. All it does is change file names.
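A sketch of what renameBams.sh might contain, assuming the BAMs come out of nf-core/rnaseq with names like SRR27595099.markdup.sorted.bam (the exact suffix being stripped is an assumption):
#!/bin/bash
# Hypothetical sketch: rename nf-core output BAMs and indexes so each file is named
# by its SRR run id only, e.g. SRR27595099.markdup.sorted.bam -> SRR27595099.bam
for f in *.markdup.sorted.bam*; do
    mv "$f" "${f/.markdup.sorted/}"
done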
For example, here are the new BAM file names:
[aloraine@str-i2 star_salmon]$ ls -lh *.bam
-rw-r----- 1 aloraine tomato_genome 1.5G Jul 3 00:42 SRR27595099.bam
-rw-r----- 1 aloraine tomato_genome 1.6G Jul 3 00:42 SRR27595100.bam
-rw-r----- 1 aloraine tomato_genome 1.5G Jul 3 00:42 SRR27595101.bam
-rw-r----- 1 aloraine tomato_genome 1.5G Jul 3 00:42 SRR27595102.bam
-rw-r----- 1 aloraine tomato_genome 1.6G Jul 3 00:43 SRR27595103.bam
-rw-r----- 1 aloraine tomato_genome 1.7G Jul 3 00:43 SRR27595104.bam
-rw-r----- 1 aloraine tomato_genome 1.5G Jul 3 00:42 SRR27595105.bam
-rw-r----- 1 aloraine tomato_genome 1.5G Jul 3 00:41 SRR27595106.bam
-rw-r----- 1 aloraine tomato_genome 1.6G Jul 3 00:42 SRR27595107.bam
-rw-r----- 1 aloraine tomato_genome 1.6G Jul 3 00:42 SRR27595108.bam
-rw-r----- 1 aloraine tomato_genome 1.6G Jul 3 00:42 SRR27595109.bam
-rw-r----- 1 aloraine tomato_genome 1.6G Jul 3 00:42 SRR27595110.bam
- Made scaled coverage graphs in a subfolder of /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon:
[aloraine@str-i2 star_salmon]$ mkdir coverage_graphs
[aloraine@str-i2 star_salmon]$ cd coverage_graphs/
[aloraine@str-i2 coverage_graphs]$ ln -s ../*bam* .
[aloraine@str-i2 coverage_graphs]$ ln -s ~/src/tardigrade/src/bamCoverage.sh .
[aloraine@str-i2 coverage_graphs]$ ln -s ~/src/tardigrade/src/sbatch-doIt.sh .
[aloraine@str-i2 coverage_graphs]$ sbatch-doIt.sh .bam bamCoverage.sh >jobs.out 2>jobs.err
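sbatch-doIt.sh is a small helper that submits a given script once per file with a given suffix. A hypothetical sketch of how it might work (the exported variable name and sbatch options are assumptions):
#!/bin/bash
# Hypothetical sketch of sbatch-doIt.sh: usage is sbatch-doIt.sh <suffix> <script>,
# e.g. sbatch-doIt.sh .bam bamCoverage.sh
SUFFIX=$1
SCRIPT=$2
for f in *"$SUFFIX"; do
    sbatch --export=S="$f" --job-name="$f" --output="$f".out --error="$f".err "$SCRIPT"
done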
- Made junction files in another subfolder of /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon:
[aloraine@str-i2 star_salmon]$ cd /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon
[aloraine@str-i2 star_salmon]$ mkdir find_junctions
[aloraine@str-i2 star_salmon]$ cd find_junctions/
[aloraine@str-i2 find_junctions]$ ln -s ../*bam* .
[aloraine@str-i2 find_junctions]$ ln -s ~/src/tardigrade/src/sbatch-doIt.sh .
[aloraine@str-i2 find_junctions]$ ln -s ~/src/tardigrade/src/find_junctions.sh .
[aloraine@str-i2 find_junctions]$ ln -s ~/src/tardigrade/src/find-junctions-1.0.0-jar-with-dependencies.jar .
[aloraine@str-i2 find_junctions]$ wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.2bit
[aloraine@str-i2 find_junctions]$ sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err
After the data processing completes, I will copy the files to the data hosting location. The location must be visible to the public internet - it's basically just a file hosting service that supports HTTP access from applications like IGB and, of course, Web browsers.
Once that is done, I will need to make it possible for users to select these new data files from within the IGB interface, in the "Available Data" section of the "Data Access" tab.
Currently, we are making tardigrade RNA-Seq data available as part of IGB's "RNA-Seq" Quickload data source.
For that, I just need to add the new files to the "annots.xml" file deployed in IGB's Quickload data source named "RNA-Seq." This is a "default" data source, meaning that it is already present in the list of available Data Sources in a fresh IGB installation. Whenever I need to know more about a given Data Source, I can open the Data Sources tab of the Preferences window. As of now, the "RNA-Seq" Quickload Data Source sits at the top of the list.
To add this new data set to the annots.xml, I need to:
- Starting from the Run Table, create a new Excel spreadsheet file: tardigrade/Documentation/inputForMakeAnnotsXml/SRP484252_for_AnnotsXml.xlsx.
This file should have new columns specifying visual styles, such as foreground colors, for each sample (SRR run identifier). All the data files for a given SRR run identifier have the SRR run identifier as the first part of the file name.
- I then add new code to the function "getSampleSheets" in tardigrade/src/makeAnnotsXml.py to import the new Excel spreadsheet SRP484252_for_AnnotsXml.xlsx
- Within the tardigrade "src" directory, I will run makeAnnotsXml.py.
This code will read the spreadsheets and output a new "annots.xml", saving it to a local directory within the tardigrade clone: tardigrade/ForGenomeBrowsers/quickload/H_exemplaris_Z151_Apr_2017.
The directory "quickload" is itself a valid quickload data source. For testing, I add it as a local Quickload data source to IGB. All the files should be accessible now. I can open them and look around, and if I don't like the colors, I can change them by editing the spreadsheet "SRP484252_for_AnnotsXml.xlsx" and re-running makeAnnotsXml.py. When I do that, however, I will need to click the "refresh" button in the first column of the Data Sources table in the Data Sources tab of the Preferences window in IGB. I've noticed that sometimes this refresh doesn't work. I don't know why! If I observe weird behavior, I usually just remove the data source I'm testing and add it back again. Or restart IGB.
Note that makeAnnotsXml.py has dependencies on another repository called "igbquickload," which means I need to make sure that other code is in my "PYTHONPATH," an environment variable specifying where the python program can find dependencies imported in makeAnnotsXml.py code.
To make this work, I added these lines to my .bash_profile on my personal computer:
export SRC=$HOME/src
export PYTHONPATH=.:$SRC/igbquickload
And then I clone the repository in a subdirectory "src" I created in my home directory. (That's where I keep all my cloned repositories.)
The repository with dependencies is here: https://bitbucket.org/lorainelab/igbquickload/
Coverage graphs completed. Each "job" wrote information about its parameter settings to stderr. Here is an example:
SRR27595110.err
::::::::::::::
normalization: CPM
bamFilesList: ['SRR27595110.bam']
binLength: 1
numberOfSamples: None
blackListFileName: None
skipZeroOverZero: False
bed_and_bin: False
genomeChunkSize: None
defaultFragmentLength: read length
numberOfProcessors: 1
verbose: False
region: None
bedFile: None
minMappingQuality: None
ignoreDuplicates: False
chrsToSkip: []
stepSize: 1
center_read: False
samFlag_include: None
samFlag_exclude: None
minFragmentLength: 0
maxFragmentLength: 0
zerosToNans: False
smoothLength: None
save_data: False
out_file_for_raw_data: None
maxPairedFragmentLength: 1000
This is important for our future reference because users often want to know how the scaling (a kind of normalization) was done. The parameters above indicate the scaling / normalization was done using a method called "CPM," which stands for "counts per million."
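The parameter dump above comes from deepTools bamCoverage. A sketch of what the core of bamCoverage.sh might look like, consistent with those settings; the sort, bgzip, and tabix steps are assumptions inferred from the .bedgraph.gz and .tbi output files:
#!/bin/bash
# Hypothetical sketch; S is the BAM file name exported by sbatch-doIt.sh.
N=${S%.bam}
bamCoverage -b $N.bam -o $N.scaled.bedgraph --outFileFormat bedgraph \
    --normalizeUsing CPM --binSize 1 --numberOfProcessors 1
# compress and index so genome browsers can fetch regions over HTTP (assumed post-processing)
sort -k1,1 -k2,2n $N.scaled.bedgraph | bgzip > $N.scaled.bedgraph.gz
tabix -p bed $N.scaled.bedgraph.gz
rm $N.scaled.bedgraph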
Preparing to transfer data to data hosting file system:
- In top-level directory /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252 made new subdirectory structure: for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252
- Confirmed completion with:
[aloraine@str-i2 coverage_graphs]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon/coverage_graphs
[aloraine@str-i2 coverage_graphs]$ ls -lh *.tbi | wc -l
12
[aloraine@str-i2 coverage_graphs]$ ls -lh *.bedgraph.gz | cut -f5 -d ' ' | grep -c "M"
12
The above commands confirmed that all 12 index files (suffix .tbi) exist and that all 12 data files (suffix .bedgraph.gz) have sizes in the megabyte range.
- Moved coverage graphs (since those are done) to SRP484252 staging location with:
[aloraine@str-i2 coverage_graphs]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon/coverage_graphs
[aloraine@str-i2 coverage_graphs]$ ls *bedgraph*
SRR27595099.scaled.bedgraph.gz      SRR27595102.scaled.bedgraph.gz      SRR27595105.scaled.bedgraph.gz      SRR27595108.scaled.bedgraph.gz
SRR27595099.scaled.bedgraph.gz.tbi  SRR27595102.scaled.bedgraph.gz.tbi  SRR27595105.scaled.bedgraph.gz.tbi  SRR27595108.scaled.bedgraph.gz.tbi
SRR27595100.scaled.bedgraph.gz      SRR27595103.scaled.bedgraph.gz      SRR27595106.scaled.bedgraph.gz      SRR27595109.scaled.bedgraph.gz
SRR27595100.scaled.bedgraph.gz.tbi  SRR27595103.scaled.bedgraph.gz.tbi  SRR27595106.scaled.bedgraph.gz.tbi  SRR27595109.scaled.bedgraph.gz.tbi
SRR27595101.scaled.bedgraph.gz      SRR27595104.scaled.bedgraph.gz      SRR27595107.scaled.bedgraph.gz      SRR27595110.scaled.bedgraph.gz
SRR27595101.scaled.bedgraph.gz.tbi  SRR27595104.scaled.bedgraph.gz.tbi  SRR27595107.scaled.bedgraph.gz.tbi  SRR27595110.scaled.bedgraph.gz.tbi
[aloraine@str-i2 coverage_graphs]$ mv *bedgraph* ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/.
[aloraine@str-i2 coverage_graphs]$ chmod a+r ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/.
Note that the final command ensures that all files in the staging location are world-readable. I did that because when I run the transfer command, I'll use an option that preserves file permissions.
- Deployed RNA-Seq alignment files (bam and bam.bai) to staging location with:
[aloraine@str-i2 star_salmon]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon
[aloraine@str-i2 star_salmon]$ ls *bam*
SRR27595099.bam      SRR27595101.bam      SRR27595103.bam      SRR27595105.bam      SRR27595107.bam      SRR27595109.bam
SRR27595099.bam.bai  SRR27595101.bam.bai  SRR27595103.bam.bai  SRR27595105.bam.bai  SRR27595107.bam.bai  SRR27595109.bam.bai
SRR27595100.bam      SRR27595102.bam      SRR27595104.bam      SRR27595106.bam      SRR27595108.bam      SRR27595110.bam
SRR27595100.bam.bai  SRR27595102.bam.bai  SRR27595104.bam.bai  SRR27595106.bam.bai  SRR27595108.bam.bai  SRR27595110.bam.bai
[aloraine@str-i2 star_salmon]$ cp *bam* ../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/.
The find junctions jobs are still running, but I can start copying the files that are already done to the hosting site.
Setting up and deploying data to hosting site:
- Logged into the hosting site from my local computer, while using the UNC Charlotte VPN, with:
local aloraine$ ssh aloraine@data.bioviz.org
Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-88-generic x86_64)
Note: I did not need to enter a password when I logged in because I previously added my local computer's public key to the "authorized_keys" file in the .ssh folder of my home directory on the remote host.
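For reference, one common way to set that up ahead of time (not recorded in these notes) is ssh-copy-id:
# Run once from the local computer; appends the local public key to
# ~/.ssh/authorized_keys on the remote host.
ssh-copy-id aloraine@data.bioviz.org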
- Create deployment directory for this new SRP484252 dataset:
aloraine@cci-vm12:~$ cd /mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017/
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ mkdir SRP484252
- Copy data from the cluster to this new location, using rsync, after starting a tmux session:
tmux new -s base
Start the transfer with:
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ rsync -rtpvz aloraine@10.16.115.245:/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/* SRP484252/.
Connected.
(aloraine@10.16.115.245) Password:
(aloraine@10.16.115.245) Duo two-factor login for aloraine
Enter a passcode or select one of the following options:
1. Phone call to XXX-XXX-XXXX
2. Phone call to XXX-XXX-XXXX
3. Phone call to XXX-XXX-XXXX
4. Phone call to XXX-XXX-XXXX
5. SMS passcodes to XXX-XXX-XXXX (next code starts with: 2)
6. SMS passcodes to XXX-XXX-XXXX
7. SMS passcodes to XXX-XXX-3048 (next code starts with: 2)
Passcode or option (1-7): 3
receiving incremental file list
SRR27595099.bam
...
The above commands launch a tmux session, ensuring that if I get disconnected, the transfer won't halt. The rsync command then copies files from the source file system onto the receiving file system. Later, when the find junctions output files are ready, I'll repeat the same command and the new files will get copied; files that have already been copied won't get re-copied.
Note how rsync triggers a request for me to authenticate my user account. Because the UNC Charlotte cluster system is a desirable target for pests to bitcoin mine and do other useless awful crap, we have heavy security. No-one can access that system without providing a password and a second form of authentication. Putting your public key into your cluster account does nothing.
Oops! I forgot to change the file permissions for the sequence alignment (bam and bai) files. That's OK. I'm curious to see if rsync is smart enough to pick up changes in file permissions on the source. I'll use this mistake as an opportunity to see how rsync behaves. I'll make the source bam and bai files world-readable and then repeat the command. Ideally, the recipient file permissions will change without the system re-copying the entire (very large) files onto the host.
Result:
I re-ran the rsync command after changing file permissions. Nothing got transferred. The file permissions did not update.
Transferring find junctions outputs.
Checked that the files were made, as expected, with:
[aloraine@str-i2 find_junctions]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/results/star_salmon/find_junctions
[aloraine@str-i2 find_junctions]$ ls *FJ* | wc -l
24
Copied the files to the staging directory (the rsync source) and set the desired permissions with:
[aloraine@str-i2 find_junctions]$ cp *FJ* ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/.
[aloraine@str-i2 find_junctions]$ chmod a+r ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/*
[aloraine@str-i2 find_junctions]$ chmod g+w ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP484252/*
Repeated the rsync command and noticed that afterward, all the file permissions in the target directory matched the source directory. Maybe file permissions only get updated properly when some data actually get transferred?
Proceeding with adding the new data to annots.xml, as described in preceding comment.
- Launched Excel and opened SRA/SRP484252_SraRunTable.txt (selected comma-delimiter option)
- Saved the first draft to tardigrade/Documentation/inputForMakeAnnotsXml as SRP484252_for_AnnotsXml.xlsx
- Also opened SRP450893_for_AnnotsXml.xlsx for reference
- Edited SRP484252_for_AnnotsXml.xlsx by adding new columns to the front of the spreadsheet, plus a new column "concentration" for sorting; I wanted the samples listed in ascending order of Bleomycin concentration. Used formulas in most of the new columns so that SRR codes and the like get filled in automatically, without my having to type a bunch of stuff.
- Picked colors and saved the hex codes and the actual color as a fill color in the spreadsheet.
- Edited src/makeAnnotsXml.py by adding the spreadsheet to a function.
- Ran makeAnnotsXml.py. New annots.xml file got written out.
- Committed and pushed all the changes.
- To deploy the new changes to the RNA-Seq data source host, logged onto the RENCI VM, changed to the directory for Hypsibius, and ran a script there called "update.sh", which copies the latest annots.xml from the tardigrade repository hosted on Bitbucket, with:
cd /projects/igbquickload/lorainelab/www/main/htdocs/rnaseq/H_exemplaris_Z151_Apr_2017 ./update.sh
Here is update.sh:
#! /bin/bash
wget --backups=3 https://bitbucket.org/lorainelab/tardigrade/raw/main/ForGenomeBrowsers/quickload/H_exemplaris_Z151_Apr_2017/annots.xml
Confirmed the colors and the coverage graph files were accessible.
Moving to "needs testing"
To test:
- open genome assembly version H_exemplaris_Z151_Apr_2017
- open folder containing SRP484252 in the name
- open read alignments, coverage graphs, and junction files folders
- check that each file can load by selecting each checkbox, zooming in, and clicking Load Data
See example image:
Testing:
Opened genome assembly version H_exemplaris_Z151_Apr_2017, opened the folder containing SRP484252 in the name, opened the read alignments, coverage graphs, and junction files folders, and checked that each file loads by selecting each checkbox, zooming in, and clicking Load Data.
All files properly loaded!
Ran the prefetch jobs (via prefetch.sh) in:
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/fastq
Confirmed it worked with:
[aloraine@str-i2 fastq]$ cat *.out | grep -c "was downloaded successfully"
12
[aloraine@str-i2 fastq]$ cut -d , -f 1 SRP484252_SraRunTable.txt | grep -v Run | wc -l
12
All 12 runs were prefetched correctly.