After the data processing completes, I will copy the files to the data hosting location. The location must be visible on the public internet; it's basically just a file hosting service that supports HTTP access from applications like IGB and, of course, Web browsers.
Once that is done, I will need to make it possible for users to select these new data files from within the IGB interface, in the "Available Data" section of the "Data Access" tab.
Currently, we are making tardigrade RNA-Seq data available as part of IGB's "RNA-Seq" Quickload data source.
For that, I just need to add the new files to the "annots.xml" file deployed in IGB's Quickload data source named "RNA-Seq." This is a "default" data source, meaning that when I download and install IGB, it is already present in the list of Data Sources I can work with and access. To learn more about a given Data Source, I can open the Data Sources tab of the Preferences window. As of now, the "RNA-Seq" Quickload Data Source occupies the top of the list.
To add this new data set to the annots.xml, I need to:
- Starting from the Run Table, I create a new Excel spreadsheet file: tardigrade / Documentation / inputForMakeAnnotsXml / SRP484252_for_AnnotsXml.xlsx.
This file should have new columns specifying visual styles, such as foreground colors, for each sample (SRR run identifier). All the data files for a given SRR run identifier have the SRR run identifier as the first part of the file name.
- I then add new code to the function "getSampleSheets" in tardigrade / src / makeAnnotsXml.py to import the new Excel spreadsheet SRP484252_for_AnnotsXml.xlsx.
- Within the tardigrade "src" directory, I will run makeAnnotsXml.py.
This code will read the spreadsheets and output a new "annots.xml", saving it to a local directory within the tardigrade clone: tardigrade/ForGenomeBrowsers/quickload/H_exemplaris_Z151_Apr_2017.
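The spreadsheet-to-annots.xml step above can be sketched roughly as follows. This is not the actual makeAnnotsXml.py code, just a minimal illustration that assumes each spreadsheet row has already been read into a dictionary. The function name build_annots_xml, the column names (run, title, foreground), the ".bam" suffix, and the placeholder run identifiers are all hypothetical. The "files" / "file" elements and the name, title, and foreground attributes follow the Quickload annots.xml format as I understand it, and should be checked against a working annots.xml.

```python
import xml.etree.ElementTree as ET


def build_annots_xml(rows, suffix=".bam"):
    """Return annots.xml text with one <file> element per sample row.

    Each row is a dict with hypothetical keys "run", "title", "foreground".
    File names start with the SRR run identifier, as described above.
    """
    root = ET.Element("files")
    for row in rows:
        ET.SubElement(
            root,
            "file",
            name=row["run"] + suffix,      # e.g. <SRR id>.bam
            title=row["title"],            # label shown in Available Data
            foreground=row["foreground"],  # hex color for the track
        )
    return ET.tostring(root, encoding="unicode")


# Placeholder rows; real values would come from the Excel spreadsheet.
example_rows = [
    {"run": "SRR0000001", "title": "Example / sample 1", "foreground": "1F78B4"},
    {"run": "SRR0000002", "title": "Example / sample 2", "foreground": "E31A1C"},
]
print(build_annots_xml(example_rows))
```

Keeping the colors in the spreadsheet rather than in the code is what makes the edit, re-run, refresh cycle possible without touching makeAnnotsXml.py.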
The directory "quickload" is itself a valid Quickload data source. For testing, I add it to IGB as a local Quickload data source. All the files should be accessible now. I can open them and look around, and if I don't like the colors, I can change them by editing the spreadsheet "SRP484252_for_AnnotsXml.xlsx" and re-running makeAnnotsXml.py. When I do that, however, I need to click the "refresh" button in the first column of the Data Sources table, in the Data Sources tab of IGB's Preferences window. I've noticed that sometimes this refresh doesn't work. I don't know why! If I observe weird behavior, I usually just remove the data source I'm testing and add it back again, or restart IGB.
Note that makeAnnotsXml.py depends on another repository called "igbquickload," which means I need to make sure that other code is in my "PYTHONPATH," an environment variable that tells the Python interpreter where to find the modules imported in the makeAnnotsXml.py code.
To make this work, I have added these lines to the .bash_profile on my personal computer:
export SRC=$HOME/src
export PYTHONPATH=.:$SRC/igbquickload
And then I clone the repository into a subdirectory "src" I created in my home directory. (That's where I keep all my cloned repositories.)
The repository with dependencies is here: https://bitbucket.org/lorainelab/igbquickload/
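One way to confirm the PYTHONPATH setup above before running makeAnnotsXml.py is a small check like the one below. It is only a sketch: the helper name pythonpath_contains is hypothetical, and the expected location ($HOME/src/igbquickload) is taken from the export lines above.

```python
import os


def pythonpath_contains(directory, pythonpath=None):
    """Return True if `directory` is one of the entries in PYTHONPATH."""
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")
    target = os.path.abspath(os.path.expanduser(directory))
    # Skip empty entries, which appear if the variable is unset or has "::".
    entries = [entry for entry in pythonpath.split(os.pathsep) if entry]
    return any(
        os.path.abspath(os.path.expanduser(entry)) == target for entry in entries
    )


# Expected clone location, per the export lines above.
expected = os.path.join(os.path.expanduser("~"), "src", "igbquickload")
if pythonpath_contains(expected):
    print("igbquickload is on PYTHONPATH")
else:
    print("igbquickload is missing from PYTHONPATH:", expected)
```

If the check reports the directory as missing in a new terminal, the likely cause is that .bash_profile has not been re-read since the export lines were added.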
Running prefetch jobs with:
in:
/projects/tomato_genome/fnb/dataprocessing/tardigrade/SRP484252/fastq
Confirmed it worked with:
[aloraine@str-i2 fastq]$ cat *.out | grep -c "was downloaded successfully"
12
[aloraine@str-i2 fastq]$ cut -d , -f 1 SRP484252_SraRunTable.txt | grep -v Run | wc -l
12
All 12 runs were prefetched correctly.
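The same comparison could be scripted so that it can be re-run for other projects. The sketch below is an assumption-laden equivalent of the two shell commands, not something that was run here: the function names are hypothetical, and it assumes the run table's first line is the header and that the job logs end in ".out".

```python
import csv
import glob
import os


def count_successes(log_dir, message="was downloaded successfully"):
    """Count lines containing `message` across all *.out files in log_dir."""
    total = 0
    for path in glob.glob(os.path.join(log_dir, "*.out")):
        with open(path) as handle:
            total += sum(1 for line in handle if message in line)
    return total


def count_runs(run_table_path):
    """Count data rows in the run table, skipping the header line."""
    with open(run_table_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # assumed header row
        return sum(1 for row in reader if row)


# Only runs when executed in the fastq directory that holds the run table.
run_table = "SRP484252_SraRunTable.txt"
if os.path.exists(run_table):
    n_ok = count_successes(".")
    n_runs = count_runs(run_table)
    status = "OK" if n_ok == n_runs else "MISMATCH"
    print(n_ok, "successful downloads,", n_runs, "runs:", status)
```

Using the csv module rather than splitting on commas keeps the row count correct even if a field in the run table contains a quoted comma.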