  1. IGB
  2. IGBF-4130

Remove older 2bit files and point at UCSC hosted 2bit files in annots.xml

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Situation: We are hosting our subversion repository on our own EC2 instance. The large size of the repository makes it expensive to host.

      Task: Remove the 2bit file from the svn repository for older "legacy" genome versions, such as older versions of the human or mouse genomes. Then use the annots.xml "name" attribute to point to that same file hosted on the UCSC Genome Web (link: https://hgdownload.soe.ucsc.edu/gbdb/) and set "reference" to true (see https://wiki.bioviz.org/confluence/display/igbman/About+annots.xml for more details).
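
The annots.xml change described above could be scripted roughly as follows. This is a minimal sketch, not the procedure used on this ticket: the `<files>`/`<file>` layout, the `title` attribute, the sample file names, and the `GENOME` placeholder in the URL are all assumptions made for illustration, and the function name `point_to_remote_2bit` is hypothetical. Only the "name" and "reference" attributes come from the task description.

```python
import xml.etree.ElementTree as ET

# Hypothetical annots.xml content. The layout, the title attribute, and the
# file names are assumptions for illustration, not copied from the repository.
ANNOTS = """<files>
  <file name="G_species_Jan_2000.2bit" title="Reference sequence"/>
  <file name="genes.bed.gz" title="Genes"/>
</files>"""

# Placeholder URL: GENOME stands in for a real directory and file name.
REMOTE_URL = "https://hgdownload.soe.ucsc.edu/gbdb/GENOME/GENOME.2bit"


def point_to_remote_2bit(xml_text, local_name, remote_url):
    """Repoint the entry named local_name at remote_url and mark it as reference."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("file"):
        if entry.get("name") == local_name:
            entry.set("name", remote_url)    # "name" now holds the hosted URL
            entry.set("reference", "true")   # flag the file as the reference sequence
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(point_to_remote_2bit(ANNOTS, "G_species_Jan_2000.2bit", REMOTE_URL))
```

Entries whose name does not match are left untouched, so the same call can be repeated per genome without disturbing annotation files listed in the same annots.xml.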

            Activity

            pkulzer Paige Kulzer (Inactive) added a comment - - edited

            See the attached document (2bit_sizes.txt), created by Dr. Loraine, which provides more information about this task. It also contains a running list of the genomes that had already been processed at the time of this comment.

            My task will be to modify the annots.xml file and remove the 2bit file for each genome I've added from NCBI whose 2bit file is 100 MB or larger.
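
A to-do list like this could be generated from a local checkout with a short script. The following is a sketch under stated assumptions: "100 MB" is taken to mean 10^8 bytes, the checkout path is supplied by the user, and the function name `large_twobit_files` is hypothetical. It is not how 2bit_sizes.txt was produced.

```python
from pathlib import Path

# Assumption: "100 MB" is read as 100 * 10^6 bytes.
THRESHOLD_BYTES = 100 * 1000 * 1000


def large_twobit_files(repo_root, threshold=THRESHOLD_BYTES):
    """Return (size_in_bytes, relative_path) for each .2bit file at or above threshold."""
    root = Path(repo_root)
    hits = []
    for path in root.rglob("*.2bit"):
        if path.is_file():
            size = path.stat().st_size
            if size >= threshold:
                hits.append((size, path.relative_to(root)))
    # Largest files first, since they save the most space when removed.
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    import sys
    for size, rel_path in large_twobit_files(sys.argv[1]):
        print(f"{size / 1e6:8.0f} MB  {rel_path}")
```

Run it with the path to the working copy as the only argument; it reads file sizes only and does not modify the checkout.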

            pkulzer Paige Kulzer (Inactive) added a comment -

            To test that I'm modifying annots.xml files correctly, I modified annots.xml for just one genome (Aedes albopictus), deleted the corresponding 2bit file, then pushed those changes to the svn repo (revision 224). I then disabled IGB Quickload as a data provider in IGB and added my local copy of the svn repo as a data provider, opened the A. albopictus genome, zoomed in, and clicked Load Sequence. The reference sequence loaded successfully which means that I've correctly modified annots.xml!

            I'll now modify the remaining annots.xml files on my to-do list, remove those corresponding 2bit files from the repo, and commit those revisions all at once.

            pkulzer Paige Kulzer (Inactive) added a comment -

            I've finished removing the 2bit files I added from NCBI. Our working list has been updated (see attached; 2bit_sizes_3_7_2025.txt).

            FYI, it looks like the remaining genomes with 2bit files not hosted in bigzips are hosted in GenArk (https://hgdownload.soe.ucsc.edu/hubs/).

            ann.loraine Ann Loraine added a comment -

            Status of the svn repository:

            • As of Paige Kulzer's changes, the repository is at revision 225
            • As of today, a "fresh" checkout of repository revision 225 creates a local directory of size 20 GB

            Checked out to my local machine using:

            svn co -r 225 https://svn.bioviz.org/repos/genomes/quickload quickload.r225

            • An svn "dump" file of r225 is 10 GB

            Made on the svn host svn.bioviz.org using:

            svnadmin dump -r 225:HEAD /svn/genomes > genomes.r225.dump

            • An svn "dump" file of the entire repository is 32 GB

            Made on the svn host svn.bioviz.org using:

            svnadmin dump /svn/genomes > genomes.all.dump
            
            ann.loraine Ann Loraine added a comment - - edited

            To test:

            • De-activate all data providers except for Quickload "main"
            • Visit each of the genome versions listed in attached file with an "x" at the front of the line (sorry, there are a lot!)
            • Choose a location in the genome and zoom in
            • Click "Load sequence" and check for errors in the log
            ann.loraine Ann Loraine added a comment - - edited

            I have a suggestion for the testing task!

            What do you think about automating this testing task?

            I can think of a few different ways it could be done, some more complicated than others.

            One kind of interesting and fun way might be to write a simple "IGB App" (we could call it "2bit Checker") that a tester could install into IGB. It would try to get a listing of all genomes offered within IGB where we are using the "file" tag to point to an externally hosted 2bit file. (I'm not sure if the IGB API perfectly supports this, but that's OK. We could use XML parsing code within the App instead.) Once the App has the list of files to check, it would try to retrieve a bit of sequence from each of the referenced files and then report the outcome of each attempt.

            One possible value of doing this would be that we would have yet another example of an IGB App and we would perhaps quickly find out if any of our 2bit files are corrupted or unavailable.
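
The checking logic itself can be sketched outside IGB. The following is a standalone Python script, not an IGB App, and it makes a simplification: instead of retrieving sequence, it requests only the first 16 bytes of each remote file and validates the 2bit header (signature 0x1A412743, version 0, sequence count, reserved word), which is enough to show that the URL resolves to a 2bit file. The function names are hypothetical, and the `<file name="...">` layout is the same assumption about annots.xml as in the ticket description.

```python
import struct
import urllib.request
import xml.etree.ElementTree as ET

TWOBIT_MAGIC = 0x1A412743  # signature word defined by the UCSC 2bit format


def remote_twobit_urls(annots_xml):
    """Return the name attribute of every <file> entry that is a remote .2bit URL."""
    root = ET.fromstring(annots_xml)
    urls = []
    for entry in root.iter("file"):
        name = entry.get("name", "")
        if name.startswith(("http://", "https://")) and name.endswith(".2bit"):
            urls.append(name)
    return urls


def parse_twobit_header(header):
    """Validate a 16-byte 2bit header and return the number of sequences."""
    if len(header) < 16:
        raise ValueError("need at least 16 bytes of header")
    # The format allows either byte order, so try both.
    for order in ("<", ">"):
        magic, version, seq_count, _reserved = struct.unpack(order + "4I", header[:16])
        if magic == TWOBIT_MAGIC:
            if version != 0:
                raise ValueError(f"unsupported 2bit version {version}")
            return seq_count
    raise ValueError("signature mismatch: not a 2bit file")


def check_url(url, timeout=30):
    """Fetch only the header via an HTTP Range request and report the outcome."""
    request = urllib.request.Request(url, headers={"Range": "bytes=0-15"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            count = parse_twobit_header(response.read(16))
        return f"OK ({count} sequences)"
    except (OSError, ValueError) as err:  # URLError and HTTPError are OSError subclasses
        return f"FAILED: {err}"


if __name__ == "__main__":
    import sys
    # Usage: pass one or more local annots.xml paths on the command line.
    for annots_path in sys.argv[1:]:
        with open(annots_path, encoding="utf-8") as handle:
            for url in remote_twobit_urls(handle.read()):
                print(annots_path, url, check_url(url))
```

A header check catches wrong or dead URLs, which is the failure seen on this ticket, but it would not detect corruption deeper in a file; reading a short stretch of sequence, as proposed above, would be a stronger test.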

            We could create a simple IGB App called that would attempt to retrieve a bit of genome sequence for every genome version within a

            nfreese Nowlan Freese added a comment - - edited

            Tested following instructions above on Mac on IGB 10.1.0 release.

            All genome versions marked with "x" were able to load the sequence except for the two below:

            • x 617M ./E_caballus_Sep_2007/E_caballus_Sep_2007.2bit
            • x 577M ./A_melanoleuca_Dec_2009/A_melanoleuca_Dec_2009.2bit

            My guess is that the URL to the 2bit file is incorrect for both of these genomes.

            E_caballus annots: http://igbquickload-main.bioviz.org/quickload/E_caballus_Sep_2007/annots.xml
            A_melanoleuca annots: http://igbquickload-main.bioviz.org/quickload/A_melanoleuca_Dec_2009/annots.xml

            pkulzer Paige Kulzer (Inactive) added a comment -

            Attn: Ann Loraine, the SVN server is down again which is blocking work on this ticket.

            ann.loraine Ann Loraine added a comment -

            Sorry for the block. The server should now be available.

            Attn: Paige Kulzer

            pkulzer Paige Kulzer (Inactive) added a comment -

            The URLs to the 2bit files were, in fact, incorrect for both of these genomes. I've fixed both URLs and tested these changes with IGB 10.1.0 and my local quickload repository. Both reference sequences load successfully, with no errors in the log. Ready for testing!

            nfreese Nowlan Freese added a comment - - edited

            SVN is down so I just tested the new URLs as custom genomes.

            • E_caballus_Sep_2007 - working correctly
            • A_melanoleuca_Dec_2009 - kind of working. Paige Kulzer - the URL looks good, but when I try to view the genome in IGB, it pauses for a while and then IGB starts to load the sequence for every single chromosome, which it shouldn't do. Not sure why this is happening, but it may not be an issue if the genome is accessed through a Quickload. Could you try it on your machine? If it looks good, then let's deploy the changes.
            pkulzer Paige Kulzer (Inactive) added a comment -

            It took much longer to load the A_melanoleuca genome than it did the E_caballus genome (1.600 min vs 151.2 ms), but I'm not seeing IGB attempt to load sequences. I don't see anything out of the ordinary with the annots.xml file that might be causing this.

            pkulzer Paige Kulzer (Inactive) added a comment -

            Reviewed with Dr. Freese - ready for deployment!

            ann.loraine Ann Loraine added a comment - - edited

            Thanks Paige Kulzer and Nowlan Freese!

            I have deployed the new version of the repository and also made full and "thin" backups of this latest version, which is revision 229.

            Backups are available on-line here:

            https://data.bioviz.org/svnbackup/


              People

              • Assignee:
                ann.loraine Ann Loraine
                Reporter:
                nfreese Nowlan Freese
              • Votes:
                0
                Watchers:
                3