  1. IGB
  2. IGBF-4038

Improve documentation describing Quickload files for painted lady genome assembly

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      We are offering a genome assembly for painted lady as part of our Quickload site - see: http://igbquickload-main.bioviz.org/quickload/V_cardui_Feb_2021/

      However, the HEADER file (see above link) does not have enough information for a user to locate the original files we used to create our IGB-friendly versions of the files. This can become a problem for users who are working with bioinformatics pipelines that use assembly or annotation files. They will need to know whether the files we are showing in IGB are the same as the files they are working with.

      Also, it would be good to link to a research article describing this assembly, to help users learn more about this assembly.

      For this task, improve the documentation so that a user who clicks the "get info" link within IGB will be able to find information needed to locate the original data files we used to create our IGB-friendly versions for the Quickload repository.

      Notes:

            Activity

            pkulzer Paige Kulzer added a comment -

            I've added a link to the research article describing this assembly to HEADER.md. There, direct links to the sequence files (chromosome-specific FASTAs) on ENA are provided in the Data Availability section. The NCBI Genomes resource is also linked in HEADER.md and the specific genome assembly (ilVanCard2.1) has been specified in the first line so that a user can determine which of the reference genomes we used.

            Ready for review, as well as any more suggestions for improving quickload descriptions!

            ann.loraine Ann Loraine added a comment - - edited

            Thank you for making the suggested changes!

            I have a request which I hope will not be too difficult to implement:

            • The link to the article is good, but one thing that can easily happen over time is that the provided link could become obsolete when or if the organization that hosts the paper changes the link pattern. So, instead of simply providing just the link, I recommend using the paper title as the link label, and using the PubMed "perma-URL" as the link. All articles that are published in the scientific literature should have a record in the PubMed database. The pattern of links to those articles looks like: https://pubmed.ncbi.nlm.nih.gov/[ pubmed numeric identifier ].
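
            The link pattern described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample identifier "12345678" are hypothetical, and the Markdown link syntax is assumed because the file in question is HEADER.md.

```python
def pubmed_url(pmid: str) -> str:
    """Build a PubMed link from a numeric PubMed identifier (PMID)."""
    pmid_text = str(pmid).strip()
    # PMIDs are plain ASCII digits; reject anything else early.
    if not (pmid_text.isascii() and pmid_text.isdigit()):
        raise ValueError(f"PMID must be numeric, got {pmid!r}")
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid_text}/"


def markdown_link(title: str, pmid: str) -> str:
    """Use the paper title as the link label and the PubMed URL as the target."""
    return f"[{title}]({pubmed_url(pmid)})"


if __name__ == "__main__":
    # "12345678" is a placeholder identifier, not a real citation.
    print(markdown_link("Example paper title", "12345678"))
```

            Keeping the title as the label means a reader can still search for the paper by name if a link ever stops resolving.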

            I tried to test whether the provided information is sufficient to locate the original data files used.

            Here's what I did:

            • The HEADER.md file mentions NCBI Genome, so I visited https://www.ncbi.nlm.nih.gov/home/genomes/
            • I entered "ilVanCard2.1" in the above page's query form. (ilVanCard2.1 is listed in the contents.txt file's second column, which shows up in IGB's title page when displaying this assembly.)
            • The above search showed me a page listing the assembly, with a hyperlink to https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_905220355.1/
            • From there, I tried to locate (a) the original fasta file and (b) the original gene models file that Paige Kulzer used as inputs to create the 2bit and bed.gz files.
            • The above link (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_905220355.1/) had another link labeled "ftp" that opened a web directory with some files. The URL of that Web directory is: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/905/220/365/GCF_905220365.1_ilVanCard2.1/README_Vanessa_cardui_annotation_release_100

            The two files I think might be the original data files Paige Kulzer used are:

            genome file: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/905/220/365/GCF_905220365.1_ilVanCard2.1/GCF_905220365.1_ilVanCard2.1_genomic.fna.gz

            and

            gene models file: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/905/220/365/GCF_905220365.1_ilVanCard2.1/GCF_905220365.1_ilVanCard2.1_genomic.gff.gz

            The above GFF file is probably not the right file because its last update was in 2022, not 2021, as mentioned in the annots.xml. I tried to find something that included the word "refgene", but I saw nothing with that term in the file name.

            Question for Paige Kulzer: Is there some way that you can make it clearer to users where the gene model file comes from? The ftp directory seems to show a more recent data file than what IGB shows, because the release date for what NCBI shows is 2022, not 2021.

            pkulzer Paige Kulzer added a comment -

            I've changed the publication link label to the name of the paper ("The genome sequence of the painted lady, Vanessa cardui Linnaeus 1758") and I've changed the publication link to the PubMed URL (https://pubmed.ncbi.nlm.nih.gov/37008186/).

            Note: The DOI link that I used originally (https://doi.org/10.12688/wellcomeopenres.17358.1) should also serve as a "perma-URL" because DOIs never change. See this FAQ from the UIC Library for more info: https://ask.library.uic.edu/faq/345899#:~:text=A%20DOI%20is%20like%20a,DOIs%20will%20stay%20the%20same..

            The HEADER.md file mentions NCBI Genome, but the URL that Ann Loraine used above (https://www.ncbi.nlm.nih.gov/home/genomes/) is different from what is listed in the file (ncbi.nlm.nih.gov/datasets/genome/). This is what resulted in her being unable to find the correct files in her comment above.

            I'm not sure how useful it is to reference annots.xml for more details like we're currently recommending in HEADER.md. Instead, it might be clearer to users where the gene model file comes from if we include the NCBI RefSeq assembly identifier (GCF_905220365.1). Once on that correct page, the Download button is fairly obvious, and the files to download are specified in HEADER.md (GFF, fasta).
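
            To illustrate why the assembly identifier helps, here is a minimal Python sketch that derives the NCBI FTP directory and file URLs from the accession and assembly name. The directory layout is inferred only from the URLs quoted earlier in this ticket, so treat it as an assumption rather than documented NCBI behavior; the function names are hypothetical.

```python
import re

FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def assembly_ftp_dir(accession: str, assembly_name: str) -> str:
    """Derive the FTP directory for an assembly, e.g. GCF_905220365.1 + ilVanCard2.1."""
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.\d+", accession)
    if match is None:
        raise ValueError(f"Unexpected accession format: {accession!r}")
    prefix, digits = match.groups()
    # The nine digits are split into three directory levels: 905/220/365.
    triplets = "/".join(digits[i:i + 3] for i in range(0, 9, 3))
    return f"{FTP_ROOT}/{prefix}/{triplets}/{accession}_{assembly_name}/"


def assembly_file_url(accession: str, assembly_name: str, suffix: str) -> str:
    """Build a file URL such as the *_genomic.fna.gz or *_genomic.gff.gz files."""
    base = assembly_ftp_dir(accession, assembly_name)
    return f"{base}{accession}_{assembly_name}_{suffix}"


if __name__ == "__main__":
    for suffix in ("genomic.fna.gz", "genomic.gff.gz"):
        print(assembly_file_url("GCF_905220365.1", "ilVanCard2.1", suffix))
```

            Note that the GCF (RefSeq) and GCA (GenBank) accessions for this assembly differ in their digits, so the two lead to different directories.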

            Please review HEADER_V2.md which contains all of these changes and let me know what you think!

            ann.loraine Ann Loraine added a comment -

            Let's do an on-line session so that I can show you the problems I am having understanding where the data file came from.

            ann.loraine Ann Loraine added a comment -

            Adding the new file to the ticket by pasting in the code here:

            <html>
            <body>
            <h1>Vanessa cardui (Feb 2021) painted lady (ilVanCard2.1) genome assembly</h1>
            <p>
            The files listed below are formatted for visualization in the Integrated Genome
            Browser, available from <a href="https://bioviz.org">BioViz.org</a>.
            </p>
            <p>
            Annotation features (gff) files were downloaded from NCBI Genome (<a href="https://www.ncbi.nlm.nih.gov/datasets/genome/">ncbi.nlm.nih.gov/datasets/genome/</a>; NCBI RefSeq assembly GCF_905220365.1). See the <a href="annots.xml">annots.xml</a> meta-data file in this directory for details.
            </p>
            <p>
            Files with extension .gz were compressed and indexed using bgzip and
            tabix tools from <a href="https://www.htslib.org">htslib.org</a>.
            </p>
            <p>
            The file named V_cardui_Feb_2021.2bit contains sequence data. It was originally downloaded in fasta format from NCBI Genome and then converted to 2bit by running the faToTwoBit program provided by UCSC. The file <a href="genome.txt">genome.txt</a> lists sequences and their sizes. It was made from the V_cardui_Feb_2021.2bit sequence file using the twoBitInfo program.
            </p>
            <p>
            Both twoBitInfo and faToTwoBit are available from <a href="http://hgdownload.cse.ucsc.edu/admin/exe/">http://hgdownload.cse.ucsc.edu/admin/exe/</a>.
            </p>
            <p>
            More information about this genome assembly and its gene models can be found in the following publication: <a href="https://doi.org/10.12688/wellcomeopenres.17358.1">The genome sequence of the painted lady, Vanessa cardui Linnaeus 1758</a>.
            </p>
            </body>
            </html>
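
            The processing steps named in the file above (faToTwoBit, twoBitInfo, bgzip, tabix) could be written down as explicit commands. The Python sketch below only builds and prints the command lines; it does not run them. The exact invocations used for this Quickload are not recorded in this ticket, so the argument order follows the tools' usual usage and the input file names are hypothetical.

```python
import shlex
from typing import List


def quickload_commands(fasta: str, genome_version: str, bed: str) -> List[List[str]]:
    """Return the commands that would turn a fasta and a BED file into Quickload files."""
    two_bit = f"{genome_version}.2bit"
    return [
        # Convert the downloaded fasta sequence file to 2bit format.
        ["faToTwoBit", fasta, two_bit],
        # List sequence names and sizes from the 2bit file.
        ["twoBitInfo", two_bit, "genome.txt"],
        # Compress the BED file; bgzip writes <bed>.gz next to the input.
        # tabix expects the input to be sorted by sequence name and start position.
        ["bgzip", bed],
        # Index the compressed file using the BED preset.
        ["tabix", "-p", "bed", f"{bed}.gz"],
    ]


if __name__ == "__main__":
    # "genome.fa" and "genes.bed" are placeholder file names.
    for command in quickload_commands("genome.fa", "V_cardui_Feb_2021", "genes.bed"):
        print(shlex.join(command))
```

            Recording the commands alongside the HEADER text would let a user repeat the conversion and compare the result with the files served from the Quickload site.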
            
            pkulzer Paige Kulzer added a comment -

            After our discussion this morning, my task is to rewrite this Quickload description without using past HEADER.md files from UCSC Quickloads as templates because they don't adequately document the process of retrieving files from NCBI. This rewrite should include a description of annots.xml rather than a simple reference to it, an updated link to the NCBI database, instructions for downloading files from NCBI, information about the date these NCBI files were accessed to create this Quickload, and documentation regarding how we converted GFF to BED.
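
            As a starting point for the GFF-to-BED documentation, the coordinate difference between the two formats can be shown with a small Python sketch. This is not the conversion used to build this Quickload, which is not described in this ticket. It only illustrates that GFF uses 1-based inclusive coordinates while BED uses 0-based half-open coordinates. The function name and the sample line are hypothetical.

```python
def gff_line_to_bed6(line: str) -> str:
    """Convert one GFF3 feature line to a six-column BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"Expected 9 tab-separated GFF columns, got {len(fields)}")
    seqid, _source, feature_type, start, end, _score, strand, _phase, attributes = fields

    # Use the ID attribute as the BED name, falling back to the feature type.
    name = feature_type
    for pair in attributes.split(";"):
        key, _, value = pair.partition("=")
        if key == "ID" and value:
            name = value
            break

    # GFF start is 1-based; BED start is 0-based. The end value is unchanged.
    bed_start = int(start) - 1
    # BED allows only "+", "-", or "." for strand.
    bed_strand = strand if strand in ("+", "-") else "."
    # The score is set to 0 because GFF scores are not on the BED 0-1000 scale.
    return "\t".join([seqid, str(bed_start), str(int(end)), name, "0", bed_strand])


if __name__ == "__main__":
    example = "chr1\tRefSeq\tgene\t101\t200\t.\t+\t.\tID=gene-example"
    print(gff_line_to_bed6(example))
```

            A full gene-model conversion also has to group exons and coding regions under each transcript, which this sketch does not attempt.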

            ann.loraine Ann Loraine added a comment -

            Thank you for your patience with my many questions.
            How about if I make a stab at updating the documentation?
            Then you can see if I got it right or not, and correct as needed?

            ann.loraine Ann Loraine added a comment - - edited

            New documentation page is deployed to IGB quickload sites and ready for testing.
            See: https://data.bioviz.org/quickload/D_dama_Nov_2023/

            pkulzer Paige Kulzer added a comment - - edited

            Question for Ann Loraine - Did you forget to link to the research article describing this Dama dama assembly (see https://www.sciencedirect.com/science/article/pii/S2666937423000124#da0005)?

            pkulzer Paige Kulzer added a comment - - edited

            I've added a reference to the research article mentioned in my previous comment. This update has been pushed to the SVN repository, which is now at revision 228. The new documentation format is clear, concise, and easy to follow. Ready for a final review by Ann Loraine!

            ann.loraine Ann Loraine added a comment - - edited

            Our quickload sites are now updated to revision 228. New documentation is deployed. I checked all the links and they all go to the expected location.

            Moving to Done.


              People

              • Assignee:
                pkulzer Paige Kulzer
                Reporter:
                ann.loraine Ann Loraine