Details

    • Type: Documentation
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Situation: As part of our efforts to add additional data sources to IGB (UCSC - IGBF-3129, Ensembl - IGBF-3555), we would like to add NCBI data if possible.

      Task: Identify and investigate the NCBI APIs to understand the available endpoints and whether we could use them in IGB.

      API v2: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/rest-api/#

      Note: If we decide that we can use the NCBI APIs in IGB, change this ticket to an epic.

        Attachments

          Activity

          nfreese Nowlan Freese created issue -
          nfreese Nowlan Freese made changes -
          Field Original Value New Value
          Epic Link IGBF-1765 [ 17855 ]
          nfreese Nowlan Freese made changes -
          Sprint Fall 6 [ 207 ] Fall 4 [ 205 ]
          Assignee Jaya Sravani Sirigineedi [ jsirigin ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status To-Do [ 10305 ] In Progress [ 3 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) added a comment -

          Some NCBI services are down, and the link provided in the ticket is currently unavailable. I will start working on this once it is back.

          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status In Progress [ 3 ] To-Do [ 10305 ]
          ann.loraine Ann Loraine made changes -
          Sprint Fall 4 [ 205 ] Fall 4, Fall 5 [ 205, 206 ]
          ann.loraine Ann Loraine made changes -
          Rank Ranked higher
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status To-Do [ 10305 ] In Progress [ 3 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) added a comment - - edited

          Completed the initial investigation of the NCBI API; it can be used to integrate NCBI data into IGB. Below are the useful APIs:

          • To get all the genomes and their taxon IDs (which will be used to get the datasets), we can use the API below. This is the same API we use to get the datasets for a genome by its taxon ID; if we use 1 as the taxon ID, we get all the datasets, because 1 is the root taxon.
            Documentation: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/rest-api/#get-/genome/taxon/-taxons-/dataset_report
            Example: https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/1/dataset_report
          • To get the datasets of a particular genome, use the same API with the required taxon ID in place of 1. Note: Because the URL above already returns all the datasets, we might not need this call. It depends on how the implementation goes.
          • To get the chromosome information of a genome, we can use the APIs below.
            Chromosomes with length info: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/rest-api/#get-/genome/accession/-accession-/sequence_reports
            Example: https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCF_000001405.40/sequence_reports?role_filters=assembled-molecule
            Note: There are role filters that can be used to remove the scaffolds.
            Only chromosomes: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/rest-api/#post-/genome/annotation_summary
            Example:
              curl -X POST "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/annotation_summary" \
                -H 'accept: application/json' \
                -H 'content-type: application/json' \
                -d '{"accession":"GCF_000001405.40"}'
          • Finally, to get the annotations, we can use the API below. It also accepts a locations query that we can use to send the start and end we require.
            Documentation: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/rest-api/#get-/genome/accession/-accession-/annotation_report
            Example: https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCF_000001405.40/annotation_report?locations=2%3A1000-2000

          A few more things to consider before we start working on this:

          • The total number of genomes currently in NCBI is 2.62M (https://www.ncbi.nlm.nih.gov/datasets/genome/). The website offers a few categories that we can use as filters, such as reference genomes only or annotated genomes only. I am not sure what they mean at this point, but we can use those filters to reduce the number.
          • The APIs do not return all the data at once. There is a page_size attribute that limits how many records are sent back per request, and its maximum is only 1000. To get the next page, we have to send the request again with the next_page_token value.
          • There is an API key that we can use in the request (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/api-keys/). I haven't looked into this much, and I was able to send requests without it, so we have to investigate how useful it is and whether we need to use it.
          • Lastly, there is an option to create a client library from the YAML specification using the OpenAPI Generator (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/languages/). We can use this, or call the REST API directly. We have to decide which approach would be more beneficial to us.
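          The endpoints and the paging behavior listed in this comment can be sketched in a short script. This is only an illustration, not IGB code: the function names are made up, the `page_token` request parameter and the `reports` response field are assumptions based on a reading of the NCBI documentation, and `fetch_json` is passed in so the paging loop can be tried without network access.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def dataset_report_url(taxon, page_size=1000, page_token=None):
    """URL for the genome dataset report of a taxon (1 is the root taxon)."""
    params = {"page_size": page_size}
    if page_token:
        # Assumption: the token from the previous response goes back as page_token.
        params["page_token"] = page_token
    return f"{BASE}/genome/taxon/{quote(str(taxon))}/dataset_report?{urlencode(params)}"


def sequence_reports_url(accession):
    """URL for chromosome names and lengths, with scaffolds filtered out."""
    query = urlencode({"role_filters": "assembled-molecule"})
    return f"{BASE}/genome/accession/{quote(accession)}/sequence_reports?{query}"


def annotation_report_url(accession, chrom, start, end):
    """URL for annotations restricted to one region, e.g. 2:1000-2000."""
    query = urlencode({"locations": f"{chrom}:{start}-{end}"})
    return f"{BASE}/genome/accession/{quote(accession)}/annotation_report?{query}"


def fetch_json(url):
    """Plain HTTP GET that parses the JSON body (no API key is sent)."""
    with urlopen(url) as response:
        return json.load(response)


def iter_reports(taxon, fetch=fetch_json, page_size=1000):
    """Yield dataset reports page by page until no next_page_token is returned."""
    token = None
    while True:
        page = fetch(dataset_report_url(taxon, page_size, token))
        # Assumption: each page carries its records under "reports".
        yield from page.get("reports", [])
        token = page.get("next_page_token")
        if not token:
            break
```

          Calling `iter_reports(1)` would walk every dataset under the root taxon, which is millions of records, so a real caller would pass a narrower taxon ID or stop early.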
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Assignee Jaya Sravani Sirigineedi [ jsirigin ]
          ann.loraine Ann Loraine added a comment - - edited

          Notes from a discussion with Nowlan Freese on Wednesday:

          Given the large number of available genome assemblies, we should not read them all into memory at once or add them all to a user interface component for the user to choose from.

          Instead, we need some kind of query interface that would allow a user to search for their genome assembly (or assemblies) of interest. This could be implemented as part of IGB itself, or we could simply ask the user to enter a specific, named identifier for their assembly of interest into IGB, and IGB would then use that identifier to retrieve and display the requested genome assembly. For example, we could ask the user to enter a "Nucleotide accession", which should uniquely identify a single assembly and assembly version. Indeed, the word "accession" is even included in the REST URL path described above!

          To start, we can implement a very simple interface where the user enters the accession for their genome of interest, e.g., GCF_000001405.40.

          Then, IGB would reach out to the APIs described above to retrieve the number of chromosomes (also called "contigs") and their sizes. If the accession entered is not found, the interface would report that it has not been found and invite the user to try again. The interface would also provide a link to the "genomes" page at NCBI so that the user could use the query interface there to look up the accession for the genome assembly of interest.

          One easy way to do this would be to ask the user to provide a unique identifier for the genome assembly of interest. Then, IGB could send a request to one of the NCBI APIs to retrieve the data and load the requested assembly, or reply to the user that the requested assembly could not be found.

          In this way, we could leverage the query interfaces hosted at the NCBI Web site, which will always be well-maintained and will improve over time. Also, I believe most users will be learning and becoming expert with the identifier systems used at NCBI. Pretty much everybody in biology understands that there are these things called "accessions" that identify sequences, so I think we are in good shape with this approach.

          Here is the Web page at NCBI with information about the genome assemblies available at the site. We would include this link in our interface to help the user find accessions and also learn about the vast amount of information available at NCBI: https://www.ncbi.nlm.nih.gov/datasets/genome/
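          The accession-entry flow described in these notes could look roughly like the sketch below. It is an illustration under stated assumptions, not a design for IGB: the accession pattern (GCA_ or GCF_, nine digits, a version), the `chr_name` and `length` field names, and the treatment of an empty result or an HTTP error as "not found" are all assumptions, and `fetch_json` is a hypothetical callable that takes a URL and returns the parsed JSON response.

```python
import re

BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"
GENOMES_PAGE = "https://www.ncbi.nlm.nih.gov/datasets/genome/"
# Assumed shape of an assembly accession such as GCF_000001405.40.
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def lookup_chromosomes(accession, fetch_json):
    """Return (sizes, None) on success or (None, message) when the lookup fails.

    sizes maps chromosome name to length in bases.
    """
    accession = accession.strip()
    hint = f"You can look up accessions at {GENOMES_PAGE}"
    if not ACCESSION_RE.match(accession):
        return None, f"'{accession}' does not look like an assembly accession. {hint}"

    url = (f"{BASE}/genome/accession/{accession}/sequence_reports"
           "?role_filters=assembled-molecule")
    try:
        page = fetch_json(url)
    except OSError:
        # urllib's HTTPError and URLError are both subclasses of OSError.
        return None, f"Accession {accession} could not be retrieved. {hint}"

    # Assumption: an unknown accession yields a response without reports.
    reports = page.get("reports", [])
    if not reports:
        return None, f"Accession {accession} was not found. {hint}"
    return {r["chr_name"]: int(r["length"]) for r in reports}, None
```

          Returning a message instead of raising keeps the "invite the user to try again" step simple: the interface can show the message, which already carries the link to the NCBI genomes page.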
          ann.loraine Ann Loraine added a comment -

          For the next steps, Jaya Sravani Sirigineedi would you please read the comment above and let us know your thoughts?

          ann.loraine Ann Loraine made changes -
          Assignee Jaya Sravani Sirigineedi [ jsirigin ]
          ann.loraine Ann Loraine made changes -
          Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
          ann.loraine Ann Loraine made changes -
          Status First Level Review in Progress [ 10301 ] Needs 1st Level Review [ 10005 ]
          ann.loraine Ann Loraine made changes -
          Sprint Fall 4, Fall 5 [ 205, 206 ] Fall 4, Fall 5, Fall 6 [ 205, 206, 207 ]
          ann.loraine Ann Loraine made changes -
          Rank Ranked higher
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status First Level Review in Progress [ 10301 ] To-Do [ 10305 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status To-Do [ 10305 ] In Progress [ 3 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) added a comment -

          The sequence of a genome accession can be retrieved with this API: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/rest-api/#get-/genome/accession/-accession-/annotation_report/download. However, it downloads the FASTA sequence and a detailed annotation report for that accession as a zipped folder. Because this ticket is only for investigating the NCBI APIs we need, and we have found the required information, I am closing this ticket. Whether and how to implement and integrate this in IGB will be investigated in this ticket: https://jira.bioviz.org/browse/IGBF-3975.
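          Because the download arrives as a zipped folder, any client would have to open the archive before it can use the sequence. The sketch below shows one way to find FASTA members in such an archive held in memory. The file extensions and the layout inside the archive are assumptions (the ticket does not describe them), and `download_url` only mirrors the path given in the documentation link above, without any query parameters.

```python
import io
import zipfile

BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"
# Assumed FASTA extensions; the real archive layout was not checked.
FASTA_SUFFIXES = (".fna", ".fa", ".fasta")


def download_url(accession):
    """Path of the download endpoint named in the documentation link."""
    return f"{BASE}/genome/accession/{accession}/annotation_report/download"


def fasta_headers(zip_bytes):
    """Map each FASTA member of the archive to its list of sequence headers."""
    headers = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(FASTA_SUFFIXES):
                continue
            with archive.open(name) as handle:
                # Header lines start with ">"; the rest of the file is sequence.
                headers[name] = [
                    line.decode().rstrip()[1:]
                    for line in handle
                    if line.startswith(b">")
                ]
    return headers
```

          Reading member by member, line by line, avoids holding a whole decompressed genome in memory at once, although the zipped bytes themselves are still held in memory in this sketch.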

          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
          jsirigin Jaya Sravani Sirigineedi (Inactive) made changes -
          Resolution Done [ 10000 ]
          Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]

            People

            • Assignee:
              jsirigin Jaya Sravani Sirigineedi (Inactive)
              Reporter:
              nfreese Nowlan Freese
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved: