[IGBF-3963] Investigate NCBI API - JIRA UNCC

Details

Type: Documentation
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
2
Epic Link:
Improve IGB for users
Sprint:
Fall 4, Fall 5, Fall 6

Description

Situation: As part of our efforts to add additional data sources to IGB (UCSC - ~~IGBF-3129~~, Ensembl - IGBF-3555), we would like to add NCBI data if possible.

Task: Identify and investigate the NCBI APIs to understand the various endpoints and whether we could use them in IGB.

API v2: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/rest-api/#

Note: If we decide that we can use the NCBI APIs in IGB, change this ticket to an epic.

Attachments

Activity

Jaya Sravani Sirigineedi (Inactive) added a comment - 04/Nov/24 9:03 AM

Some NCBI services are down, and the link provided in the ticket is currently unavailable. I will start working on this once it is back.

Jaya Sravani Sirigineedi (Inactive) added a comment - 08/Nov/24 6:23 PM - edited

I completed the initial investigation of the NCBI API, and it can be integrated into IGB. The useful APIs are listed below:

To get all the genomes and their taxon IDs (which will be used to get the datasets), we can use the API below. It is the same API we use to get the datasets for a genome by its taxon ID; if we pass 1 as the taxon ID, we get all the datasets, because 1 is the root taxon.
Documentation: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/rest-api/#get-/genome/taxon/-taxons-/dataset_report
Example: https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/1/dataset_report
To get the datasets of a particular genome, use the same API but replace the taxon ID with the required one. Note: Because the URL above already returns all the datasets, we might not need this. It depends on how the implementation goes.
To get the chromosome information of the genome, we can use the APIs below:
Chromosomes with length info: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/rest-api/#get-/genome/accession/-accession-/sequence_reports (e.g., https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCF_000001405.40/sequence_reports?role_filters=assembled-molecule) Note: The role filters can be used to exclude the scaffolds.
Only chromosomes: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/rest-api/#post-/genome/annotation_summary
(e.g., curl -X POST "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/annotation_summary" \
-H 'accept: application/json' \
-H 'content-type: application/json' \
-d '{"accession":"GCF_000001405.40"}')
Finally, to get the annotations, we can use the API below. It also accepts a locations query parameter, which we can use to send the start and end positions that we require:
https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/rest-api/#get-/genome/accession/-accession-/annotation_report (e.g., https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCF_000001405.40/annotation_report?locations=2%3A1000-2000)
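
As a rough illustration of how the example URLs in this comment are put together, here is a minimal Python sketch that only builds the request URLs; it does not send anything. The function names are made up for illustration, and the paths and query parameters are copied from the example URLs above, so anything beyond those examples should be checked against the NCBI documentation.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the example requests in this comment.
BASE_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def dataset_report_url(taxon):
    """URL listing the genome dataset reports under a taxon (1 is the root taxon)."""
    return f"{BASE_URL}/genome/taxon/{quote(str(taxon), safe='')}/dataset_report"


def sequence_reports_url(accession, role_filter=None):
    """URL for the sequence reports (names and lengths) of one assembly."""
    url = f"{BASE_URL}/genome/accession/{quote(accession, safe='')}/sequence_reports"
    if role_filter:
        # For example "assembled-molecule", as in the example URL above.
        url += "?" + urlencode({"role_filters": role_filter})
    return url


def annotation_report_url(accession, seq_name, start, end):
    """URL for the annotations overlapping seq_name:start-end."""
    # urlencode percent-encodes the colon, matching the %3A in the example URL.
    query = urlencode({"locations": f"{seq_name}:{start}-{end}"})
    return f"{BASE_URL}/genome/accession/{quote(accession, safe='')}/annotation_report?{query}"


if __name__ == "__main__":
    print(dataset_report_url(1))
    print(sequence_reports_url("GCF_000001405.40", "assembled-molecule"))
    print(annotation_report_url("GCF_000001405.40", "2", 1000, 2000))
```

Keeping URL construction separate from the HTTP call would make it easy to unit test without network access, whichever client we end up using.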

A few more things to consider before we start working on this:

The total number of genomes currently in NCBI is 2.62M (https://www.ncbi.nlm.nih.gov/datasets/genome/). Their website offers a few categories that we can use as filters, such as reference genomes only or annotated genomes only. I am not sure what they mean at this point, but we can use those filters to reduce the number.
Also, the APIs do not return all the data at once. There is a page_size attribute that limits how much data is sent per request, and its maximum is only 1000, so to get the next page we have to send the next_page_token value from the response in a follow-up request.
There is an API key that we can use in the request (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/api-keys/). I have not looked into this in detail, and I was able to send requests without it, so we have to investigate how useful it is and whether we need to use it.
Lastly, there is an option to create a client library from the YAML specification using the OpenAPI Generator (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/languages/). We can use this or call the REST API directly. We have to decide which approach would be more beneficial to us.
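
To make the paging point concrete, here is a minimal Python sketch of following next_page_token until the results run out. Several names are assumptions that I have not verified against the documentation and that need to be confirmed: the request parameter page_token, the response field reports, and the api-key header. The fetch function is injectable so the loop can be exercised without calling NCBI.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(url, api_key=None):
    """Default fetcher: GET the URL and decode the JSON body."""
    headers = {"accept": "application/json"}
    if api_key:
        headers["api-key"] = api_key  # assumed header name; confirm in the API key docs
    with urlopen(Request(url, headers=headers)) as response:
        return json.load(response)


def iter_reports(base_url, page_size=1000, fetch=fetch_json):
    """Yield every report from a paginated endpoint, one page at a time."""
    page_token = None
    while True:
        params = {"page_size": page_size}
        if page_token:
            params["page_token"] = page_token  # assumed request parameter name
        page = fetch(f"{base_url}?{urlencode(params)}")
        yield from page.get("reports", [])  # assumed response field name
        page_token = page.get("next_page_token")
        if not page_token:
            break  # no token in the response means this was the last page
```

Because this is a generator, the caller can stop early (for example after the first few hundred assemblies) instead of pulling all pages into memory.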

Ann Loraine added a comment - 17/Nov/24 10:30 AM - edited

Notes from a discussion with Nowlan Freese on Wednesday:

Given the large number of available genome assemblies, we should not read them all into memory at once or add them all to a user interface component for the user to choose from.

Instead, we need some kind of query interface that would allow a user to search for their genome assembly (or assemblies) of interest. This could be implemented as part of IGB itself, or, we could simply ask the user to enter some specific, named identifier for their assembly of interest into IGB, and then, IGB would use that identifier to retrieve and display the requested genome assembly. For example, we could ask the user to enter a "Nucleotide accession" which should uniquely identify a single assembly and assembly version. Indeed, the word "accession" is even included in the REST URL path described above!

To start, we can implement a very simple interface where the user will enter the accession for their genome of interest, e.g., GCF_000001405.40.

Then, IGB would reach out to the APIs described above to retrieve the number of chromosomes (also called "contigs") and their sizes. Or, if the accession entered is not found, the interface would report that the accession has not been found, and invite the user to try again. The interface would also provide a link to the "genomes" page at NCBI so that the user could use the query interface there to look up the accession for the genome assembly of interest.

One easy way to do this would be to ask the user to provide a unique identifier for the genome assembly of their interest. Then, IGB could send a request to one of the NCBI APIs to retrieve the data and load the requested assembly, or reply to the user that the requested assembly could not be found.

In this way, we could leverage the query interfaces hosted at the NCBI Web site, which will always be well-maintained and will improve over time. Also, I believe most users will be learning and becoming expert with the identifier systems used at NCBI. Pretty much everybody in biology understands there are these things called "accessions" that identify sequences, so I think we are in good shape with this approach.

Here is the Web page at NCBI with information about the genome assemblies available at the site - we would include this link in our interface to help the user find accessions and also learn about the vast amount of information available at NCBI:

https://www.ncbi.nlm.nih.gov/datasets/genome/
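
The lookup flow described in this comment could be sketched roughly as follows, in Python for brevity. This is only an illustration under stated assumptions: fetch_reports is a hypothetical callable standing in for the sequence_reports request, and the field names chr_name and length are guesses about the report layout that need to be confirmed against a real response.

```python
import re

# Assembly accessions look like GCF_000001405.40: GCA or GCF, nine digits, a version.
ACCESSION_PATTERN = re.compile(r"^GC[AF]_\d{9}\.\d+$")
GENOMES_PAGE = "https://www.ncbi.nlm.nih.gov/datasets/genome/"


def chromosome_sizes(accession, fetch_reports):
    """Return {sequence name: length} for the assembly the user asked for.

    fetch_reports is a hypothetical callable that takes an accession and
    returns the list of sequence reports decoded from the API response.
    Raises ValueError for a malformed accession and LookupError when the
    accession is well formed but nothing is returned for it.
    """
    accession = accession.strip()
    if not ACCESSION_PATTERN.match(accession):
        raise ValueError(f"'{accession}' does not look like an assembly accession")
    reports = fetch_reports(accession)
    if not reports:
        raise LookupError(
            f"Accession {accession} was not found. Look it up at {GENOMES_PAGE}"
        )
    # Field names "chr_name" and "length" are assumptions about the report layout.
    return {report["chr_name"]: int(report["length"]) for report in reports}
```

The two exception types map onto the two messages the interface would show: one asking the user to fix the format, the other inviting them to try again and pointing them to the NCBI genomes page.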

Ann Loraine added a comment - 17/Nov/24 10:46 AM

For the next steps, Jaya Sravani Sirigineedi, would you please read the comment above and let us know your thoughts?

Jaya Sravani Sirigineedi (Inactive) added a comment - 27/Nov/24 3:29 PM

The sequence of a genome accession can be retrieved by using this API: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/rest-api/#get-/genome/accession/-accession-/annotation_report/download , but this downloads the FASTA sequence and a detailed annotation report for that genome accession in a zipped folder. Because this ticket is only for investigating the APIs we need from NCBI, and we found the required info, I am closing this ticket. Whether to implement and integrate this in IGB, and how to approach the integration, will be investigated in this ticket: https://jira.bioviz.org/browse/IGBF-3975.
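
Since the download arrives as a zip archive, handling it could look something like the Python sketch below, which works on the downloaded bytes in memory. The file extensions are assumptions; I have not checked how the files inside the archive are actually named, so the suffix list is only a placeholder.

```python
import io
import zipfile

# Assumed FASTA file extensions; adjust after inspecting a real download.
FASTA_SUFFIXES = (".fna", ".fa", ".fasta", ".faa")


def fasta_members(zip_bytes):
    """Return the names of FASTA-like files inside a downloaded zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            name for name in archive.namelist()
            if name.lower().endswith(FASTA_SUFFIXES)
        ]


def read_member(zip_bytes, name):
    """Return one archive member decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(name).decode("utf-8")
```

For whole genomes the archive may be large, so a real implementation would probably stream it to disk rather than hold it in memory.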

People

Assignee:

Jaya Sravani Sirigineedi (Inactive)

Reporter:

Nowlan Freese

Votes:

0

Watchers:

3

Dates

Created:

31/Oct/24 11:47 AM

Updated:

27/Nov/24 3:29 PM

Resolved:

27/Nov/24 3:29 PM