Notes on updating the human genome:
Most of the annotation files come from the UCSC Table Browser: https://genome-euro.ucsc.edu/cgi-bin/hgTables
Set group to "All Tracks" and track to "NCBI RefSeq".
Output format for bed files should be set to "BED - browser extensible data".
Output format for psl files should be set to "all fields from selected table".
Many of the bed files need to be converted to bed detail and tabix indexed following instructions in the Google Drive under the name: "How we add new genomes to IGB Quickload using UCSC Genome Informatics as a data source".
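As a rough illustration of the sort and tabix-index step only (the bed detail conversion is covered in the Google Drive document and is not reproduced here), something like the following could be used. The helper names bed_sort and bed_index are hypothetical, and the sketch assumes that any header lines in the download start with "track", "browser", or "#", and that bgzip and tabix from htslib are installed for the indexing step.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any existing tool.

# Drop header lines (assumed to start with "track", "browser", or "#"),
# then sort by chromosome (column 1) and start position (column 2),
# which is the order tabix expects.
bed_sort() {
    in_bed=$1   # BED file downloaded from the Table Browser
    out_bed=$2  # sorted BED file to write
    grep -v -e '^track' -e '^browser' -e '^#' "$in_bed" \
        | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n > "$out_bed"
}

# Compress with bgzip and build a tabix index using the BED preset.
# Requires bgzip and tabix (htslib) on the PATH.
bed_index() {
    bgzip -f "$1" && tabix -f -p bed "$1.gz"
}
```

Usage would be along the lines of: bed_sort refseq.bed sorted.bed, then bed_index sorted.bed (the file names here are placeholders).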
*Updated gene2accession and gene_info files can be found here: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
*genomeSource repo: https://bitbucket.org/lorainelab/genomesource/src/master/
The psl files need to be tabix indexed using the following instructions: https://wiki.bioviz.org/confluence/display/igbdevelopers/Adding+a+new+genome+-+X.+tropicalis+Nov+2009
psl example:
To process the EST data set, I used the following command to drop the header line (any line containing "bin") and strip off the first (bin) column:
$ gunzip -c X_tropicalis_Nov_2009_all_est.gz | grep -v bin | cut -f2- > X_tropicalis_Nov_2009_all_est.psl
Next, I sorted and created an index using bgzip and tabix:
$ sort -k14,14 -k16,16n X_tropicalis_Nov_2009_all_est.psl > sorted.psl
$ mv sorted.psl X_tropicalis_Nov_2009_all_est.psl
$ bgzip X_tropicalis_Nov_2009_all_est.psl
$ tabix -s 14 -b 16 -0 X_tropicalis_Nov_2009_all_est.psl.gz
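The same steps could be wrapped in reusable functions, sketched below. The function names psl_strip_and_sort and psl_index are hypothetical. Unlike the command above, this sketch filters the header with '^#', on the assumption that the header line of the download starts with "#"; that avoids discarding data lines that merely contain the string "bin". After the bin column is removed, column 14 is the target name and column 16 is the target start, matching the sort and tabix options used above.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Strip the header line (assumed to start with "#") and the leading bin
# column from a gzipped table dump, then sort by target name (column 14)
# and target start (column 16).
psl_strip_and_sort() {
    in_gz=$1    # gzipped "all fields from selected table" download
    out_psl=$2  # uncompressed, sorted PSL file to write
    gunzip -c "$in_gz" \
        | grep -v '^#' \
        | cut -f2- \
        | LC_ALL=C sort -t "$(printf '\t')" -k14,14 -k16,16n > "$out_psl"
}

# Compress with bgzip and index with tabix, using the same column options
# as the commands above. Requires bgzip and tabix (htslib) on the PATH.
psl_index() {
    bgzip -f "$1" && tabix -s 14 -b 16 -0 "$1.gz"
}
```

For the example above, this would be called as: psl_strip_and_sort X_tropicalis_Nov_2009_all_est.gz X_tropicalis_Nov_2009_all_est.psl, followed by psl_index X_tropicalis_Nov_2009_all_est.psl.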
Once the files are ready to be submitted, you need access to the Subversion repository. Instructions for submitting to the repository can be found in Google Drive under the name "Setting up the Quickload subversion repository on EC2 with EBS".
The Subversion repository can be found here: https://svn.bioviz.org/viewvc/genomes/quickload/H_sapiens_Dec_2013/
We updated dbSNP and genome annotations 15 months ago - see https://svn.bioviz.org/viewvc/genomes/quickload/H_sapiens_Dec_2013/ for history.
We should re-update these now. Moving to next sprint.