Details
- Type: New Feature
- Status: To-Do
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: None
- Labels:
- Story Points: 1
- Epic Link:
Description
When users open the metadata panel for a file, the species and genome version are shown only if a user has previously set these values.
This works, but it means that users must manually annotate each file with this information. Sometimes that is not even possible, because only the file's owner can edit its metadata.
IGB Quickload sites already have file-based databases that map chromosome names and sizes onto genome versions. Some bioinformatics file types (e.g., BAM) also contain this information, and all genome data files contain location coordinates. Sometimes the structure of the data itself, especially transcriptome data, can suggest which species the original biological material came from. Given all this, it may be possible to build a simple system that guesses which genome assembly the data should be displayed on. I think we could implement a "species and genome version" guesser that fills in these metadata values if the user has not yet provided them.
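To make the chromosome-matching idea concrete, here is a minimal sketch in Python. It assumes we can already extract chromosome names and lengths from a file (for BAM, these are in the header). The `KNOWN_GENOMES` table, the `guess_genome` function, and the `min_fraction` threshold are all hypothetical; the table stands in for whatever a Quickload site provides and holds toy values, not real assembly sizes.

```python
from typing import Dict, Optional

# Hypothetical stand-in for a Quickload-style lookup table:
# genome version -> {chromosome name: length}. Toy values only.
KNOWN_GENOMES: Dict[str, Dict[str, int]] = {
    "toy_genome_v1": {"chr1": 1000, "chr2": 800},
    "toy_genome_v2": {"chr1": 1050, "chr2": 800},
}


def guess_genome(
    observed: Dict[str, int],
    known: Dict[str, Dict[str, int]] = KNOWN_GENOMES,
    min_fraction: float = 0.9,
) -> Optional[str]:
    """Return the genome version that best matches the observed chromosomes.

    `observed` maps chromosome names found in the data file to their lengths.
    A chromosome counts as a match only if both name and length agree.
    Returns None when no genome matches at least `min_fraction` of them.
    """
    if not observed:
        return None
    best_name, best_score = None, 0.0
    for name, chroms in known.items():
        matches = sum(1 for c, size in observed.items() if chroms.get(c) == size)
        score = matches / len(observed)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_fraction else None


if __name__ == "__main__":
    print(guess_genome({"chr1": 1050, "chr2": 800}))
    print(guess_genome({"chrX": 5}))
```

Returning `None` below the threshold means the guesser leaves the metadata blank rather than filling in a weak guess. Files that carry coordinates but no chromosome lengths would need a looser rule, such as matching on names only.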
I think it would be worthwhile to examine the download endpoint in a bit more detail.
Is it possible for a host to initiate a download onto itself but, instead of downloading the entire file, halt the download after some small number of bytes?
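One possible way to do this, assuming the download endpoint is plain HTTP(S) and honors the standard `Range` header, is to request only the first N bytes. The sketch below uses Python's standard library. The function names and the example URL are hypothetical, and I have not checked whether our endpoint actually supports range requests.

```python
import urllib.request


def build_prefix_request(url: str, num_bytes: int) -> urllib.request.Request:
    """Build a GET request asking for only the first `num_bytes` bytes."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    # HTTP byte ranges are inclusive, so the first N bytes are 0..N-1.
    return urllib.request.Request(url, headers={"Range": f"bytes=0-{num_bytes - 1}"})


def fetch_prefix(url: str, num_bytes: int, timeout: float = 10.0) -> bytes:
    """Return at most the first `num_bytes` bytes of the resource at `url`.

    If the server ignores the Range header and starts sending the whole
    file, we still read only `num_bytes` and then close the connection.
    """
    request = build_prefix_request(url, num_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read(num_bytes)


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    req = build_prefix_request("https://example.org/data/sample.bam", 65536)
    print(req.get_header("Range"))
```

A server that supports ranges should answer with `206 Partial Content`. Capping the read at `num_bytes` covers the case where it does not. For compressed formats such as BAM, the prefix would still need to be large enough to decompress the header.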
We could train an AI to recognize the species and genome version for a data file purely on the basis of the bytes. That way, we don't have to come up with all the tortuous business logic needed to figure it out.
I'm pretty sure that AI methods are getting to the point where you can deploy them into production systems.
I'm not sure if this counts as "innovative" though. Has anyone else done anything like this?
Philip Badzuh or Karthik Raveendran - do you know?