[IGBF-3849] Process SRP454305 Goldstein Irradiation 2024 data set - JIRA UNCC

Ann Loraine added a comment - 06/Aug/24 1:45 PM - edited

PREFETCH step

Pre-fetching SRA files with:

cut -d , -f 1 SRP454305_SraRunTable.txt | grep -v Run | xargs -I A sbatch --export=S=A --job-name=A --output=A.out --error=A.err prefetch.sh

in:

/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305
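
Before submitting, the same pipeline can be previewed by putting echo in front of sbatch, so each job command is printed instead of run. This is only a sketch. It assumes the run table is comma-separated with the run accession in the first column (as the cut command above implies), and the function name preview_prefetch_jobs is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the sbatch command for each run instead of submitting it.
# Assumes the first comma-separated column of the run table holds the run accession.
preview_prefetch_jobs() {
    cut -d , -f 1 "$1" | grep -v Run |
        xargs -I A echo sbatch --export=S=A --job-name=A --output=A.out --error=A.err prefetch.sh
}

# Usage (run table name taken from the command above):
# preview_prefetch_jobs SRP454305_SraRunTable.txt
```

Removing the echo turns the preview back into the real submission.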

Confirmed it worked with:

[aloraine@str-i1 SRP454305]$ cut -d , -f 1 SRP484252_SraRunTable.txt | grep -v Run | wc -l
[aloraine@str-i1 SRP454305]$ cut -d , -f 1 *_SraRunTable.txt | grep -v Run | wc -l
12

using prefetch.sh:

[aloraine@str-i1 SRP454305]$ ls -l prefetch.sh 
lrwxrwxrwx 1 aloraine tomato_genome 46 Aug  6 13:42 prefetch.sh -> /users/aloraine/src/tardigrade/src/prefetch.sh

using SRP454305_SraRunTable.txt:

[aloraine@str-i1 SRP454305]$ ls -lh *.txt
lrwxrwxrwx 1 aloraine tomato_genome 88 Aug  6 13:43 SRP454305_SraRunTable.txt -> /users/aloraine/src/tardigrade/Documentation/RunSelectorOutput/SRP454305_SraRunTable.txt
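
Counting rows in the run table shows how many runs to expect, but not whether each download arrived. A possible follow-up check is sketched here. It assumes prefetch wrote either RUN/RUN.sra or RUN.sra into the working directory, which may differ depending on the sra-tools version and on what prefetch.sh does. The function name missing_runs is hypothetical:

```shell
#!/bin/sh
# Hypothetical check: list run accessions from the run table that have no
# downloaded data in the given directory.
# Assumption: prefetch wrote either <run>/<run>.sra or <run>.sra there.
missing_runs() (
    table=$1
    dir=${2:-.}
    cut -d , -f 1 "$table" | grep -v Run | while read -r run; do
        if [ ! -e "$dir/$run/$run.sra" ] && [ ! -e "$dir/$run.sra" ]; then
            echo "$run"
        fi
    done
)

# Usage: missing_runs SRP454305_SraRunTable.txt .
```

If this prints nothing, every run in the table has a matching file.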

Ann Loraine added a comment - 07/Aug/24 11:12 AM - edited

FINDJUNCTIONS step

Similar to the coverage graphs step, made a new subdirectory in /projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP454305/results/star_salmon.

1) Made the find junctions "working" directory and added symbolic links to BAM and BAM index files in the parent directory with:

[aloraine@str-i1 star_salmon]$ mkdir find_junctions
[aloraine@str-i1 star_salmon]$ cd find_junctions/
[aloraine@str-i1 find_junctions]$ ln -s ../*bam* .

2) Downloaded the required input 2bit file into the directory with:

wget http://lorainelab-quickload.scidas.org/quickload/H_exemplaris_Z151_Apr_2017/H_exemplaris_Z151_Apr_2017.2bit

3) Make symbolic links to scripts and jar file with code:

ln -s ~/src/tardigrade/src/sbatch-doIt.sh .
ln -s ~/src/tardigrade/src/find_junctions.sh
ln -s ~/src/tardigrade/src/find-junctions-1.0.0-jar-with-dependencies.jar

4) Launch jobs with:

sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err
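
ln -s creates a symbolic link even when the target path does not exist, so it can be worth checking the working directory for dangling links before launching jobs. A minimal sketch follows; the function name broken_links is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print every symbolic link in a directory whose target
# does not exist.
broken_links() (
    dir=${1:-.}
    for f in "$dir"/*; do
        # -L is true for any symlink; -e follows the link and fails if the
        # target is missing.
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            echo "$f"
        fi
    done
)

# Usage, from inside find_junctions: broken_links .
```

No output means every link in the directory resolves to an existing file.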

Ann Loraine added a comment - 07/Aug/24 7:51 PM - edited

DATA TRANSFER SETUP step

1) Create directory for transfer in /projects/tomato_genome/fnb/dataprocessing/tardigrade

[aloraine@str-i1 tardigrade]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade
mkdir for_quickload

We will use this to store everything we will transfer to Quickload for this "tardigrade" project.

2) Make directory for tardigrade genome assembly

[aloraine@str-i1 for_quickload]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload
mkdir H_exemplaris_Z151_Apr_2017

Note: the above two steps only need to be done once!

3) Make subdirectory for this data set, in the genome assembly directory used for alignments:

[aloraine@str-i1 for_quickload]$ cd H_exemplaris_Z151_Apr_2017/
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ mkdir SRP454305
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ cd SRP454305/
[aloraine@str-i1 SRP454305]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305
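
Steps 1 through 3 could also be collapsed into a single mkdir -p call, which creates any missing parent directories and does nothing if they already exist. A sketch, with a made-up function name (make_transfer_dir) and the directory layout taken from the steps above:

```shell
#!/bin/sh
# Hypothetical helper: create <base>/for_quickload/<assembly>/<study> in one call.
# mkdir -p creates missing parent directories and succeeds if they already exist.
make_transfer_dir() (
    base=$1
    assembly=$2
    study=$3
    target="$base/for_quickload/$assembly/$study"
    mkdir -p "$target" && echo "$target"
)

# Usage:
# make_transfer_dir /projects/tomato_genome/fnb/dataprocessing/tardigrade \
#     H_exemplaris_Z151_Apr_2017 SRP454305
```

Because the call is safe to repeat, the same line works for the first data set and for every later one.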

4) Move BAM, scaled coverage graph, and junction files into this location:

Coverage graphs, from inside the destination directory:

[aloraine@str-i1 SRP454305]$ mv ../../../SRP454305/results/star_salmon/coverage_graphs/*bedgraph* .
[aloraine@str-i1 SRP454305]$ ls
SRR25590736.scaled.bedgraph.gz      SRR25590739.scaled.bedgraph.gz      SRR25590742.scaled.bedgraph.gz      SRR25590745.scaled.bedgraph.gz
SRR25590736.scaled.bedgraph.gz.tbi  SRR25590739.scaled.bedgraph.gz.tbi  SRR25590742.scaled.bedgraph.gz.tbi  SRR25590745.scaled.bedgraph.gz.tbi
SRR25590737.scaled.bedgraph.gz      SRR25590740.scaled.bedgraph.gz      SRR25590743.scaled.bedgraph.gz      SRR25590746.scaled.bedgraph.gz
SRR25590737.scaled.bedgraph.gz.tbi  SRR25590740.scaled.bedgraph.gz.tbi  SRR25590743.scaled.bedgraph.gz.tbi  SRR25590746.scaled.bedgraph.gz.tbi
SRR25590738.scaled.bedgraph.gz      SRR25590741.scaled.bedgraph.gz      SRR25590744.scaled.bedgraph.gz      SRR25590747.scaled.bedgraph.gz
SRR25590738.scaled.bedgraph.gz.tbi  SRR25590741.scaled.bedgraph.gz.tbi  SRR25590744.scaled.bedgraph.gz.tbi  SRR25590747.scaled.bedgraph.gz.tbi

Bam files, from inside the directory containing them:

mv *.bam*  ../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305/.

Junction files, from inside the directory containing them:

mv *.FJ.* ../../../../for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305/.
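
After the moves, a quick check can confirm that no index file was left behind. The sketch below only covers the bedgraph files, whose .gz and .gz.tbi naming is visible in the listing above. The function name missing_tbi is hypothetical:

```shell
#!/bin/sh
# Hypothetical check: report each compressed bedgraph file that lacks its
# .tbi index in the same directory.
missing_tbi() (
    dir=${1:-.}
    for f in "$dir"/*.bedgraph.gz; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
        [ -e "$f.tbi" ] || echo "$f"
    done
)

# Usage, from inside the destination directory: missing_tbi .
```

The same pattern could be adapted for BAM files and their index files once their exact names are known.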

5) Make all files world-readable and make all directories world-readable and world-executable:

files:

[aloraine@str-i1 SRP454305]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305
[aloraine@str-i1 SRP454305]$ chmod a+r *

directory:

[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ pwd
/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017
[aloraine@str-i1 H_exemplaris_Z151_Apr_2017]$ chmod a+rx SRP454305
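
If the data set directory ever contains subdirectories, the two chmod commands above will not reach files inside them. One way to apply the same permissions recursively is sketched here with find. The function name make_world_readable is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: make every file under a directory world-readable and
# every directory (including the top one) world-readable and world-executable.
make_world_readable() (
    top=$1
    find "$top" -type f -exec chmod a+r {} +
    find "$top" -type d -exec chmod a+rx {} +
)

# Usage, from inside H_exemplaris_Z151_Apr_2017: make_world_readable SRP454305
```

Handling files and directories separately avoids setting the execute bit on data files.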

Ann Loraine added a comment - 07/Aug/24 8:15 PM - edited

RSYNC step

1) Logged into data.bioviz.org (a virtual machine hosted on UNC Charlotte infrastructure) and moved to data deployment location in the file system there:

local aloraine$ ssh aloraine@data.bioviz.org
cd /mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017

Some things to note:

I have deployed my public key into the authorized_keys file in my "aloraine" account on data.bioviz.org. This way, I don't have to enter my password.
If I did need to enter my password, I would enter my Charlotte.edu password.
Anyone else wanting to do this will need to get an account on data.bioviz.org.
Note that we are inside a directory named for the reference genome assembly we used.

2) Make a new directory for this new data set to be deployed:

aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ pwd
/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ ls
SRP450893  SRP484252
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ mkdir SRP454305

3) Make sure it is group write-able and that its permissions match the other directories in the same location:

aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ ls -lh
total 12K
drwxrwsr-x 3 aloraine cci-igbquickload_users 4.0K Jul  2 09:52 SRP450893
drwxr-xr-x 2 aloraine domain users           4.0K Aug  7 20:08 SRP454305
drwxrwxr-x 2 aloraine domain users           4.0K Jul  3 13:42 SRP484252
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ chmod g+w SRP454305
aloraine@cci-vm12:/mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017$ ls -lh
total 12K
drwxrwsr-x 3 aloraine cci-igbquickload_users 4.0K Jul  2 09:52 SRP450893
drwxrwxr-x 2 aloraine domain users           4.0K Aug  7 20:08 SRP454305
drwxrwxr-x 2 aloraine domain users           4.0K Jul  3 13:42 SRP484252
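
Comparing the ls -lh output by eye works for three directories. With more of them, a small check like the following could flag any directory whose permission string differs from a chosen reference. It compares mode strings only, not group ownership, and the function names perm_string and mismatched_dirs are made up for illustration:

```shell
#!/bin/sh
# Hypothetical check: print each subdirectory whose permission string differs
# from that of a reference subdirectory. Group ownership is not compared.
perm_string() {
    # First ten characters of ls -ld output, e.g. drwxrwxr-x
    ls -ld "$1" | cut -c1-10
}

mismatched_dirs() (
    parent=$1
    reference=$2
    want=$(perm_string "$parent/$reference")
    for d in "$parent"/*/; do
        [ -d "$d" ] || continue
        got=$(perm_string "${d%/}")
        [ "$got" = "$want" ] || echo "${d%/} $got (expected $want)"
    done
)

# Usage: mismatched_dirs /mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017 SRP484252
```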

4) Start the data transfer using tmux and then rsync:

tmux:

tmux new -s transfer

rsync:

rsync -rtpvz aloraine@hpc.charlotte.edu:/projects/tomato_genome/fnb/dataprocessing/tardigrade/for_quickload/H_exemplaris_Z151_Apr_2017/SRP454305/* SRP454305/.

Note: You can repeat the above rsync command any time you add new content to the source directory on hpc.charlotte.edu. Only new or changed files will get copied.

Note: I could probably just "rsync" the entire genome directory. I think that this would automatically copy any new "SRP" directories and their contents over to data.bioviz.org.
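
The idea of syncing the whole genome directory could be tried safely with rsync's --dry-run option, which lists what would be transferred without copying anything. The sketch below has not been run against these servers. The function name sync_assembly and the RSYNC override variable are made up for illustration. A trailing slash on the source path tells rsync to copy the directory's contents rather than the directory itself:

```shell
#!/bin/sh
# Hypothetical helper: sync an entire genome assembly directory.
# Does a dry run unless the third argument is "run".
# RSYNC can be set to another command for testing (assumption, not an rsync feature).
sync_assembly() (
    src=$1    # e.g. aloraine@hpc.charlotte.edu:/projects/.../for_quickload/H_exemplaris_Z151_Apr_2017/
    dest=$2   # e.g. /mnt/igbdata/tardigrade/H_exemplaris_Z151_Apr_2017/
    mode=${3:-dry-run}
    if [ "$mode" = "run" ]; then
        "${RSYNC:-rsync}" -rtpvz "$src" "$dest"
    else
        "${RSYNC:-rsync}" -rtpvz --dry-run "$src" "$dest"
    fi
)

# Usage: sync_assembly SOURCE/ DEST/        (preview only)
#        sync_assembly SOURCE/ DEST/ run    (actually copy)
```

Defaulting to the dry run means a mistyped path shows up in the preview before any files move.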

Ann Loraine added a comment - 07/Aug/24 8:26 PM - edited

ANNOTS.XML step

1) Opened the run file for this data set in Excel and saved it, in Excel format, to tardigrade/Documentation/inputForMakeAnnotsXml (the tardigrade repository)

Note: Open SRP48452_for_AnnotsXml as a reference and guide!

2) Added six new columns to the front of the file, in front of "Run":

file name prefix
color
physical folder
study name
display name
url

3) Used Excel referencing to insert all the values in "Run" in "file name prefix"

4) Inserted hexadecimal color codes for each sample. Gave those cells the same fill color as the codes I chose, to help me assess their potential appearance and contrast in IGB.

5) Inserted the study code (e.g., SRP454305) in "physical folder" column

6) Used Excel references to insert a human-friendly "study name" - this becomes the name of the folder where the data files will be listed in IGB.

7) Used Excel references to insert human-friendly "display name" values - these become the checkbox labels in IGB.

8) Used Excel references to make URLs for each file / data set. Used the "SRX" values in the existing "Experiment" column to construct the URL.

9) Added new columns as needed after the columns added in step 2 to use for sorting. For example, I added "Concentration" and then sorted the spreadsheet by concentration and then by run so that the lower concentration, control samples would appear first in the IGB data display list.

10) Edited the script makeAnnots.py to include the new spreadsheet in function getSampleSheets. Ran the script, which adds the new data files to annots.xml in tardigrade/ForGenomeBrowsers/quickload.

11) Checked how it looks by adding the above directory to IGB as a new quickload data source.
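
The mechanical part of steps 2, 3, and 5 could be scripted to produce a starting point for the spreadsheet. The sketch below writes a tab-delimited file that Excel can open. It assumes the run accession is the first comma-separated column of the run table, and the function name starter_sheet is made up. It fills in only "file name prefix" and "physical folder", leaving colors, names, and URLs for manual editing; it is not part of makeAnnots.py:

```shell
#!/bin/sh
# Hypothetical helper: write a tab-delimited starter sheet with the new columns
# from step 2 in front of "Run". Only "file name prefix" and "physical folder"
# are filled in; the other new columns are left empty.
# Assumes the run accession is the first comma-separated column of the run table.
starter_sheet() (
    table=$1
    study=$2
    printf 'file name prefix\tcolor\tphysical folder\tstudy name\tdisplay name\turl\tRun\n'
    cut -d , -f 1 "$table" | grep -v Run | while read -r run; do
        printf '%s\t\t%s\t\t\t\t%s\n' "$run" "$study" "$run"
    done
)

# Usage: starter_sheet SRP454305_SraRunTable.txt SRP454305 > starter.tsv
```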

Ann Loraine added a comment - 12/Aug/24 5:44 AM - edited

CLEANUP step:

Removed the "work" directory within SRP454305 because it is ENORMOUS and we no longer need it.
Moved the entire SRP454305 directory into tardigrade/DONE
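
The two cleanup actions could be wrapped in one small function that reports how much space the "work" directory occupies before deleting it. This is a sketch only: the function name archive_study is made up, and the exact commands used for this step are not recorded here:

```shell
#!/bin/sh
# Hypothetical helper: report the size of <study_dir>/work, remove it, then
# move the study directory into <done_dir>.
archive_study() (
    study_dir=$1
    done_dir=$2
    # Refuse to continue with empty arguments so rm -rf never sees "/work".
    [ -n "$study_dir" ] && [ -n "$done_dir" ] || exit 1
    if [ -d "$study_dir/work" ]; then
        du -sh "$study_dir/work"
        rm -rf "$study_dir/work"
    fi
    mkdir -p "$done_dir" && mv "$study_dir" "$done_dir"/
)

# Usage, from inside the tardigrade directory: archive_study SRP454305 DONE
```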
