Details
Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels: None
Story Points: 2
Epic Link:
Sprint: Summer 1, Summer 2
Description
Task: Investigate the Ensembl API at https://rest.ensembl.org/ to determine if the API will return data that can be used by IGB, such as genomic sequence and gene annotations.
Attachments
Activity
I read the documentation, tried several of the endpoints, and found the following useful information:
GET info/divisions
Get the list of all Ensembl divisions for which information is available: https://rest.ensembl.org/documentation/info/info_divisions
Example: https://rest.ensembl.org/info/divisions?content-type=application/json
GET info/species
Lists all available species, their aliases, available adaptor groups, and data release.
To get the species for a particular Ensembl division, specify the optional division parameter.
API to get the list of all species: https://rest.ensembl.org/documentation/info/species
Example: https://rest.ensembl.org/info/species?content-type=application/json
https://rest.ensembl.org/info/species?content-type=application/json;division=EnsemblPlants (default division is EnsemblVertebrates)
Note: No version info present in the response
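A minimal Python sketch of how the species request could be built, with and without the optional division parameter. The names species_url and fetch_species are hypothetical helpers, not part of Ensembl or IGB, and the sketch assumes the server accepts "&" as a parameter separator in addition to the ";" used in the examples above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def species_url(division=None):
    """Build the info/species URL; division is optional (hypothetical helper)."""
    params = {"content-type": "application/json"}
    if division:
        params["division"] = division
    # safe="/" keeps "application/json" readable instead of percent-encoding the slash
    return f"{ENSEMBL_REST}/info/species?{urlencode(params, safe='/')}"


def fetch_species(division=None):
    """Download the species list (needs network access; not exercised here)."""
    with urlopen(species_url(division)) as resp:
        return json.load(resp)["species"]


print(species_url("EnsemblPlants"))
```

Calling species_url() with no argument leaves the division out, so the server default applies.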
GET info/assembly/:species
API to get the list of chromosomes for a selected species: https://rest.ensembl.org/documentation/info/assembly_info
Example: https://rest.ensembl.org/info/assembly/homo_sapiens?content-type=application/json
Note: The response has three arrays. One is top_level_region, an array of objects that each have a name, length, and coord_system. It looks like a list of chromosomes with names and lengths, but coord_system takes different values, including scaffold and chromosome, which raises the question of whether all of these entries should be treated as chromosomes. It is also unclear what the other two arrays are meant to describe.
Additional notes: A karyotype is defined as an individual’s complete set of chromosomes, so can the karyotype in the response be considered the list of chromosomes for that species?
GET info/genomes/:genome_name
Find information about a given genome
API to get more information about the genome: https://rest.ensembl.org/documentation/info/info_genome
Example: https://rest.ensembl.org/info/genomes/homo_sapiens?content-type=application/json
Note: genebuild in the response looks similar to genomeVersionName
GET info/assembly/:species/:region_name
Returns information about the specified top-level sequence region for the given species: https://rest.ensembl.org/documentation/info/assembly_stats
Example: https://rest.ensembl.org/info/assembly/homo_sapiens/X?content-type=application/json
https://rest.ensembl.org/info/assembly/homo_sapiens/KI270757.1?content-type=application/json
Note: Both regions in the examples were taken from the assembly info API response. In the first response the is_chromosome variable is 1 (true), whereas in the second it is 0 (false). Does this imply that the coord_system variable in the assembly API response indicates whether an entry should be considered a chromosome?
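One way to explore the question above is to group the top_level_region entries by their coord_system value. The sketch below is only an illustration: group_by_coord_system is a hypothetical helper, and the sample dictionary is hand-written to mimic the shape of the info/assembly response (the lengths are illustrative, not copied from a live response).

```python
from collections import defaultdict


def group_by_coord_system(assembly):
    """Map each coord_system value to the region names that carry it."""
    groups = defaultdict(list)
    for region in assembly.get("top_level_region", []):
        groups[region["coord_system"]].append(region["name"])
    return dict(groups)


# Hand-written sample in the shape of the info/assembly/:species response.
sample_assembly = {
    "top_level_region": [
        {"name": "X", "length": 156040895, "coord_system": "chromosome"},
        {"name": "KI270757.1", "length": 71251, "coord_system": "scaffold"},
    ],
}

print(group_by_coord_system(sample_assembly))
```

If the grouping on real data matches the is_chromosome flag returned per region, coord_system could be used to separate chromosomes from scaffolds without one extra call per region.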
Information about other APIs:
GET sequence/id/:id
Request multiple types of sequence by stable identifier. Supports feature masking and expand options.
API to get the sequence: https://rest.ensembl.org/documentation/info/sequence_id
Example: https://rest.ensembl.org/sequence/id/GENSCAN00000000001?object_type=predictiontranscript;db_type=core;content-type=application/json;type=protein;species=homo_sapiens
Note: The id in the path argument is an Ensembl stable ID. Not sure where we can get this from.
GET sequence/region/:species/:region
Returns the genomic sequence of the specified region of the given species. Supports feature masking and expand options.
API to get the sequence for a specified region: https://rest.ensembl.org/documentation/info/sequence_region
Example: https://rest.ensembl.org/sequence/region/human/X:1000..26000:-1?content-type=application/json
Note: the maximum allowed length to request the sequence for is 10000000
GET alignment/region/:species/:region
Retrieves genomic alignments as separate blocks based on a region and species
API to get the alignment for the specified region: https://rest.ensembl.org/documentation/info/genomic_alignment_region
Example: https://rest.ensembl.org/alignment/region/homo_sapiens/1:0-100000?content-type=application/json;species_set_group=primates;display_species_set=homo_sapiens
Note: This looks like the region API that we use to get the information about the glyphs (or Syms), but the response isn’t similar to any known file type, and it contains the sequence with dashes in between. The parameters for this API are also fairly complicated, and it is not clear how to decide which species_set_group a species belongs to. There are other APIs that return the species_set_groups and the species that belong to each group:
GET info/compara/species_sets/:method
List all collections of species analysed with the specified compara method: https://rest.ensembl.org/documentation/info/compara_species_sets
Example: https://rest.ensembl.org/info/compara/species_sets/EPO?content-type=application/json
GET info/compara/methods
List all compara analyses available (an analysis defines the type of comparative data): https://rest.ensembl.org/documentation/info/compara_methods
Example: https://rest.ensembl.org/info/compara/methods/?content-type=application/json
However, even this is a bit complicated.
GET info/variation/:species
List the variation sources used in Ensembl for a species: https://rest.ensembl.org/documentation/info/variation
Example: https://rest.ensembl.org/info/variation/homo_sapiens?content-type=application/json
Note: Have to check whether this information is useful or not.
GET lookup/symbol/:species/:symbol
Find the species and database for a symbol in a linked external database: https://rest.ensembl.org/documentation/info/symbol_lookup
Example: https://rest.ensembl.org/lookup/symbol/homo_sapiens/BRCA2?content-type=application/json;expand=1
Note: This API response includes exons and other details, and it looks more like the get region API than the previously mentioned API does.
After discussing with Dr. Freese, we finalised the following API calls:
To get the list of species (genomes) available in Ensembl:
GET info/divisions
Get the list of all Ensembl divisions for which information is available: https://rest.ensembl.org/documentation/info/info_divisions
Example: https://rest.ensembl.org/info/divisions?content-type=application/json
GET info/species
Lists all available species, their aliases, available adaptor groups, and data release.
To get the species for a particular Ensembl division, specify the optional division parameter.
API to get the list of all species: https://rest.ensembl.org/documentation/info/species
Example: https://rest.ensembl.org/info/species?content-type=application/json
Use name, display_name, and accession from the API response as the species name, tooltip description, and version respectively.
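The field mapping above could look roughly like this in Python. SpeciesEntry and to_species_entries are hypothetical names used only for illustration, and the sample response is hand-written with illustrative values; it assumes the species records sit under a top-level "species" key.

```python
from dataclasses import dataclass


@dataclass
class SpeciesEntry:
    name: str     # from "name": used as the species name
    tooltip: str  # from "display_name": used as the tooltip description
    version: str  # from "accession": used as the version


def to_species_entries(response):
    """Convert an info/species response into SpeciesEntry records."""
    return [
        SpeciesEntry(
            name=s["name"],
            tooltip=s.get("display_name") or "",
            version=s.get("accession") or "",  # tolerate a missing or null accession
        )
        for s in response.get("species", [])
    ]


# Hand-written sample in the shape of the info/species response.
sample_response = {
    "species": [
        {"name": "homo_sapiens", "display_name": "Human", "accession": "GCA_000001405.29"},
        {"name": "example_species", "display_name": "Example", "accession": None},
    ]
}

entries = to_species_entries(sample_response)
print(entries[0])
```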
To get the chromosome list:
GET info/assembly/:species
API to get the list of chromosomes for a selected species: https://rest.ensembl.org/documentation/info/assembly_info
Example: https://rest.ensembl.org/info/assembly/homo_sapiens?content-type=application/json
The top_level_region array has all the chromosomes; name and length are enough to get the assembly info.
To get the sequence for a particular region of a chromosome:
GET sequence/region/:species/:region
Returns the genomic sequence of the specified region of the given species. Supports feature masking and expand options.
API to get the sequence for a specified region: https://rest.ensembl.org/documentation/info/sequence_region
Example: https://rest.ensembl.org/sequence/region/human/X:1000..26000:-1?content-type=application/json
The maximum sequence length that can be requested is 10000000, so for larger regions we have to split the range into smaller pieces and make multiple API calls (use multithreading to optimize this).
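A rough Python sketch of the splitting and multithreading idea. split_region and fetch_sequence are hypothetical helpers; the sketch assumes 1-based inclusive coordinates and that a request of exactly 10000000 bases is accepted. The fetch_chunk callback is a stand-in for the real sequence/region call.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_REGION_LENGTH = 10_000_000  # limit noted above for sequence/region


def split_region(start, end, max_len=MAX_REGION_LENGTH):
    """Split [start, end] (1-based, inclusive) into chunks of at most max_len bases."""
    chunks = []
    pos = start
    while pos <= end:
        chunk_end = min(pos + max_len - 1, end)
        chunks.append((pos, chunk_end))
        pos = chunk_end + 1
    return chunks


def fetch_sequence(start, end, fetch_chunk, max_len=MAX_REGION_LENGTH, max_workers=4):
    """Fetch all chunks in parallel and join them in order.

    fetch_chunk takes a (start, end) tuple and returns that chunk's sequence.
    pool.map yields results in input order, so the pieces join correctly.
    """
    chunks = split_region(start, end, max_len)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return "".join(pool.map(fetch_chunk, chunks))


# Offline demonstration with a fake 20-base "chromosome" and a tiny chunk size.
fake_chromosome = "ACGT" * 5
result = fetch_sequence(1, 20, lambda c: fake_chromosome[c[0] - 1:c[1]], max_len=6)
print(split_region(1, 20, 6), result)
```

Keeping max_workers small is a cautious default here, since the server may limit the request rate.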
To get the annotation for a particular region of a chromosome, we think we have to use the API below, but this still needs investigation. Nowlan Freese will be looking into this.
GET lookup/symbol/:species/:symbol
Find the species and database for a symbol in a linked external database: https://rest.ensembl.org/documentation/info/symbol_lookup
Example: https://rest.ensembl.org/lookup/symbol/homo_sapiens/BRCA2?content-type=application/json;expand=1
Jaya Sravani Sirigineedi - Need to look into the existing logic of showing the species list and understand it.
Jaya Sravani Sirigineedi - this endpoint looks like it should be able to give us the annotation (gene) info: https://rest.ensembl.org/documentation/info/overlap_region
I'm not sure which features we will need. We may need several features to get all of the info we need.
Example above can be viewed in IGB for the hg38 (H_sapiens_Dec_2013) genome at chr1:26,169,859-26,170,921
Looked into that API. As discussed with Nowlan Freese, this is the only API that matches our requirements and gives the expected response. The response is a bit different from UCSC's, so it has to be parsed differently; we need a new parser and Symmetry class for it. Closing this ticket, as we have identified all the APIs that we need for development.
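For reference, a small Python sketch of how a request to the overlap region endpoint could be built when several feature types are needed. overlap_url is a hypothetical helper, the choice of gene, transcript, and exon is only an assumption about which features IGB will need, and the sketch assumes "&" is accepted as a parameter separator. Note that Ensembl names the chromosome 1 rather than chr1.

```python
from urllib.parse import urlencode

ENSEMBL_REST = "https://rest.ensembl.org"


def overlap_url(species, region_name, start, end, features=("gene",)):
    """Build an overlap/region URL, repeating the feature parameter once per type."""
    params = [("content-type", "application/json")]
    params += [("feature", feature) for feature in features]
    region = f"{region_name}:{start}-{end}"
    return f"{ENSEMBL_REST}/overlap/region/{species}/{region}?{urlencode(params, safe='/')}"


# Same location as the IGB example (chr1:26,169,859-26,170,921).
url = overlap_url("homo_sapiens", "1", 26169859, 26170921, ("gene", "transcript", "exon"))
print(url)
```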
Sequence: https://rest.ensembl.org/documentation/info/sequence_id