Downloaded the Jun. 2020 GRCm39/mm39 mm39.2bit file from https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/
I had to use ftp to get the gene2accession.gz file. Running gunzip -c directly against the ftp location did not work, so I just pulled the whole file. The Mus_musculus.gene_info file downloaded with no issues, but it is small.
ftp ftp:get gene2accession.gz
NCBI:txid10090 for mouse:
gunzip -c gene2accession.gz | awk -F'\t' '$1 == "10090"' > ~/Desktop/jiraIssues/3330/10090.gene2accession.txt
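The taxid filter needs to match the whole first column, so that a taxid such as 100900 is not picked up along with 10090. A minimal sketch on made-up rows; the filter_taxid name and the sample rows are illustrative only and are not real gene2accession content:

```shell
# Keep only rows whose first tab-separated column equals the given taxid.
filter_taxid() {
    awk -F'\t' -v taxid="$1" '$1 == taxid'
}

# Made-up rows (tax_id, GeneID, status); not real gene2accession data.
sample=$(printf '10090\t11287\tREVIEWED\n100900\t99999\tMODEL\n9606\t1\tREVIEWED\n10090\t11298\tVALIDATED\n')

# Prints only the two rows whose first column is exactly 10090.
printf '%s\n' "$sample" | filter_taxid 10090
```

Comparing the first field as a whole avoids depending on how a particular grep handles a tab escape inside the pattern.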
I pulled the refGene and ncbiRefSeq tables as bed files from the UCSC Table Browser.
I then created bed14 files for the refGene and ncbiRefSeq files.
ucscToBedDetail.py -a 10090.gene2accession.txt -g Mus_musculus.gene_info mm39_refGene.bed.gz M_musculus_Jun_2020_refGene.bed
ucscToBedDetail.py -a 10090.gene2accession.txt -g Mus_musculus.gene_info mm39_ncbiRefSeq.bed.gz M_musculus_Jun_2020_ncbiRefSeq.bed
I then sorted the two bed files and indexed them with tabix. tabix (version 1.17) reported a lot of errors about coordinates being zero and asked whether I had forgotten the -0 flag for zero-based coordinates. I don't remember this being an issue before, but since bed files are 0-based, I added the flag.
sort -k1,1 -k2,2n M_musculus_Jun_2020_refGene.bed | bgzip > M_musculus_Jun_2020_refGene.bed.gz
tabix -0 -s 1 -b 2 -e 3 M_musculus_Jun_2020_refGene.bed.gz
sort -k1,1 -k2,2n M_musculus_Jun_2020_ncbiRefSeq.bed | bgzip > M_musculus_Jun_2020_ncbiRefSeq.bed.gz
tabix -0 -s 1 -b 2 -e 3 M_musculus_Jun_2020_ncbiRefSeq.bed.gz
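Since bed starts are 0-based, a feature at the very beginning of a sequence has a start of 0, which is presumably what tabix was complaining about. A quick check along these lines could confirm that a file looks like 0-based bed before indexing. The check_bed_coords name and the sample rows are made up for illustration:

```shell
# Count records that start at 0 and records whose start is not below the end.
# In 0-based, half-open bed a start of 0 is legal, but start >= end is not.
check_bed_coords() {
    awk -F'\t' '
        $2 == 0  { zero++ }
        $2 >= $3 { bad++ }
        END { printf "zero_start=%d bad_interval=%d\n", zero, bad }
    '
}

# Made-up bed rows: chrom, start, end, name.
printf 'chr1\t0\t100\tgeneA\nchr1\t500\t900\tgeneB\nchr2\t300\t200\tgeneC\n' | check_bed_coords
# prints: zero_start=1 bad_interval=1
```

On a real file the same function could be fed from the sorted bed file, for example check_bed_coords < M_musculus_Jun_2020_refGene.bed.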
I pulled the all_mrna and all_est tables as psl files by selecting "all fields from selected table" as the output format. Once they were downloaded, I ran the following commands:
gunzip -c M_musculus_Jun_2020_all_est.psl.gz | grep -v bin | cut -f2- > M_musculus_Jun_2020_all_est.psl
sort -k14,14 -k16,16n M_musculus_Jun_2020_all_est.psl > sorted.psl
mv sorted.psl M_musculus_Jun_2020_all_est.psl
bgzip M_musculus_Jun_2020_all_est.psl
tabix -s 14 -b 16 -0 M_musculus_Jun_2020_all_est.psl.gz
gunzip -c M_musculus_Jun_2020_all_mrna.psl.gz | grep -v bin | cut -f2- > M_musculus_Jun_2020_all_mrna.psl
sort -k14,14 -k16,16n M_musculus_Jun_2020_all_mrna.psl > sorted.psl
mv sorted.psl M_musculus_Jun_2020_all_mrna.psl
bgzip M_musculus_Jun_2020_all_mrna.psl
tabix -s 14 -b 16 -0 M_musculus_Jun_2020_all_mrna.psl.gz
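The cut -f2- step drops the leading bin column from the table browser output, so that the target name and target start land in psl columns 14 and 16, which are the columns sort and tabix use. A minimal sketch on a made-up row; the strip_bin and sample_psl names and all row values are hypothetical, and the sketch matches the header by its leading # rather than by the word bin, which is an assumption about how the header line looks:

```shell
# Drop header lines (assumed to start with '#') and the leading bin column.
strip_bin() {
    grep -v '^#' | cut -f2-
}

# Made-up table browser output: a shortened header plus one 22-column row.
sample_psl() {
    printf '#bin\tmatches\tmisMatches\n'
    printf '585\t100\t0\t0\t0\t0\t0\t0\t0\t+\tAB000001\t120\t0\t100\tchr1\t1000000\t3000\t3100\t1\t100,\t0,\t3000,\n'
}

# After stripping, 21 columns remain; column 14 is the target name
# and column 16 is the target start.
sample_psl | strip_bin | awk -F'\t' '{ print NF, $14, $16 }'
# prints: 21 chr1 3000
```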
To finish, add a new genome "setup" for this genome to our subversion (like git) data repository.