Some notes on creating the D_melanogaster_Aug_2014_refGene.bed.gz file.
The bed12 file is from the UCSC Table Browser, using these settings:
assembly: Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6)
track: NCBI RefSeq
table: UCSC RefSeq (refGene)
I downloaded the updated gene2accession and gene_info files, which can be found here: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, and selected the entries with taxid 7227.
I then ran ucscToBedDetail.py to create the bed14 file.
To include the FlyBase transcript and gene IDs, I downloaded the NCBI annotation for dm6 (GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gff.gz). This file contains the NM/NR RefSeq accessions and their corresponding FlyBase gene and transcript IDs.
I then grepped the uncompressed file for lines where the feature type was one of: mRNA, lnc_RNA, snRNA, miRNA, antisense_RNA, ncRNA, rRNA, snoRNA, RNase_MRP_RNA, RNase_P_RNA, SRP_RNA, primary_transcript - as these features contain the NR/NM RefSeq identifier and the FlyBase gene and transcript IDs.
For example:
awk -F'\t' '$3 == "SRP_RNA"' GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gff >> output.txt
I then removed any lines that had no "Name=" field. This resulted in 34,450 lines, which is the same as the number of unique NM/NR accessions in the UCSC RefSeq (refGene) file.
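The feature filtering and the "Name=" check could also be done in one pass. The following is only a sketch of that idea, not the commands actually run: it assumes standard 9-column tab-separated GFF lines with the feature type in column 3 and attributes in column 9, and the function names are made up for illustration.

```python
# Feature types listed in these notes (column 3 of the GFF).
FEATURES = {
    "mRNA", "lnc_RNA", "snRNA", "miRNA", "antisense_RNA", "ncRNA", "rRNA",
    "snoRNA", "RNase_MRP_RNA", "RNase_P_RNA", "SRP_RNA", "primary_transcript",
}


def has_name_attribute(attributes):
    """Return True if the GFF attribute column contains a Name= entry."""
    return any(item.startswith("Name=") for item in attributes.split(";"))


def filter_gff_lines(lines):
    """Keep lines whose feature type is in FEATURES and that carry Name=."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip GFF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # assumption: well-formed GFF has exactly 9 columns
        if fields[2] in FEATURES and has_name_attribute(fields[8]):
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Hypothetical usage on the uncompressed annotation file.
    path = "GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gff"
    with open(path) as handle:
        kept = filter_gff_lines(handle)
    print(len(kept))
```

Comparing on the exact value of column 3 avoids accidentally matching a feature name that appears elsewhere in a line.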
Next, I extracted just the NR/NM values and their corresponding FlyBase gene and transcript identifiers.
The resulting file looks like this:
NM_001103384 FBgn0025837 FBtr0112921
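As an illustration of how such a three-column line could be pulled out of a GFF attribute string, here is a minimal sketch. It assumes the accession sits in the "Name=" attribute with a version suffix (e.g. ".3") that gets stripped, and that the FlyBase IDs appear somewhere in the attributes as FBgn.../FBtr... tokens; the exact attribute layout and the function name are assumptions, not taken from these notes.

```python
import re

# FlyBase gene and transcript identifiers, e.g. FBgn0025837 / FBtr0112921.
FBGN_RE = re.compile(r"FBgn\d+")
FBTR_RE = re.compile(r"FBtr\d+")


def extract_ids(attributes):
    """Return (accession, FBgn, FBtr) from a GFF attribute string, or None."""
    attrs = dict(
        item.split("=", 1) for item in attributes.split(";") if "=" in item
    )
    name = attrs.get("Name", "")
    if not name.startswith(("NM_", "NR_")):
        return None
    accession = name.split(".")[0]  # assumption: drop the version suffix
    gene = FBGN_RE.search(attributes)
    transcript = FBTR_RE.search(attributes)
    if gene is None or transcript is None:
        return None
    return accession, gene.group(), transcript.group()


if __name__ == "__main__":
    # Made-up attribute string shaped like the assumed layout.
    example = (
        "ID=rna-NM_001103384.3;"
        "Dbxref=FLYBASE:FBtr0112921,FLYBASE:FBgn0025837;"
        "Name=NM_001103384.3"
    )
    print(" ".join(extract_ids(example)))
```

Taking the first FBgn/FBtr match in the whole attribute string is a simplification; if a line lists several FlyBase IDs, restricting the search to one attribute would be safer.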
The NM/NR code was then used as a key to sort the data and add the FlyBase gene and transcript identifiers to the 14th column of the D_melanogaster_Aug_2014_refGene.bed.gz file.
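The join itself could look roughly like the sketch below. Several details are assumptions rather than facts from these notes: that the NM/NR accession is in column 4 (the BED name field), that the file is tab-separated with at least 14 columns, and that the FlyBase IDs are appended to the existing column 14 text with a "; " separator. The function name and separator are hypothetical.

```python
def annotate_bed_line(line, id_map, sep="; "):
    """Append FlyBase IDs to column 14 of one bed14 line.

    id_map maps an accession such as "NM_001103384" to a
    (FBgn, FBtr) tuple. Lines without a match are returned unchanged.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 14:
        return "\t".join(fields)  # not a bed14 line; leave it alone
    key = fields[3].split(".")[0]  # assumption: accession is in column 4
    ids = id_map.get(key)
    if ids is not None:
        gene, transcript = ids
        parts = [part for part in (fields[13], gene, transcript) if part]
        fields[13] = sep.join(parts)
    return "\t".join(fields)


if __name__ == "__main__":
    id_map = {"NM_001103384": ("FBgn0025837", "FBtr0112921")}
    # Made-up bed14 line; coordinates and description are placeholders.
    row = ["chrX", "100", "200", "NM_001103384", "0", "+", "100", "200",
           "0", "1", "100,", "0,", "12345", "example description"]
    print(annotate_bed_line("\t".join(row), id_map))
```

A dictionary lookup keyed on the accession makes the result independent of the sort order of either file.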
Once this issue is complete, I can push the dm6 blacklist to the SVN repo.