[IGBF-2790] Investigate: business logic that attempts to guess species and genome version from the file - JIRA UNCC

Details

Type: New Feature
Status: To-Do
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Labels:
- Advanced

Story Points:
1
Epic Link:
Maintain BioViz Connect

Description

When users open the metadata panel for a file, the species and genome version are shown only if the user has previously set these values.

This is fine, but it means that a user must manually annotate each file with this information. Also, sometimes it is not possible to do so because the metadata can only be edited by the file's owner.

IGB Quickload sites already have file-based databases that map chromosome names and sizes onto genome versions. Some bioinformatics file types (e.g., BAM) also contain this information, and all genome data files contain location coordinates. Sometimes the structure of the data itself, especially transcriptome data, can suggest which species provided the biological material used to generate the data. Given all this, it may be possible to build a simple system that guesses which genome assembly the data should be displayed on. I think we could implement a "species and genome version" guesser that fills in these metadata values if the user has not yet provided them.
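
As a rough illustration of the chromosome name and size matching idea, here is a minimal sketch in Python. The reference table, the genome names in it, the function name `guess_genome`, and the `min_fraction` threshold are all hypothetical; a real table would presumably be loaded from the Quickload chromosome-size files rather than hard-coded.

```python
from typing import Dict, Optional

# Hypothetical reference table: genome version -> {chromosome name: length}.
# The names and sizes below are toy values for illustration only.
KNOWN_GENOMES: Dict[str, Dict[str, int]] = {
    "toy_genome_v1": {"chr1": 1000, "chr2": 2000},
    "toy_genome_v2": {"chr1": 1100, "chr2": 2000},
}


def guess_genome(
    observed: Dict[str, int],
    known: Dict[str, Dict[str, int]] = KNOWN_GENOMES,
    min_fraction: float = 0.9,
) -> Optional[str]:
    """Return the genome version whose chromosome names and sizes best
    match the observed ones, or None if no candidate matches well enough."""
    if not observed:
        return None
    best_name, best_score = None, 0.0
    for name, reference in known.items():
        # Count observed chromosomes whose name AND size agree with the reference.
        matches = sum(
            1 for chrom, size in observed.items() if reference.get(chrom) == size
        )
        score = matches / len(observed)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_fraction else None
```

Returning `None` below the threshold keeps the guesser from filling in a value when the evidence is weak, which matters if the guess is written into metadata the user did not set.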

Attachments

Activity


Ann Loraine added a comment - 18/Feb/21 4:18 AM

I think it would be worthwhile to examine the download endpoint in a bit more detail.

Is it possible for a host to initiate a download onto itself but, instead of downloading the entire file, halt the download after some small number of bytes?

We could train an AI to recognize the species and genome version for a data file purely on the basis of the bytes. That way, we don't have to come up with all the tortuous business logic needed to figure it out.

I'm pretty sure that AI methods are getting to the point where you can deploy them into production systems.

I'm not sure if this counts as "innovative" though. Has anyone else done anything like this?

Philip Badzuh or Karthik Raveendran - do you know?


Nowlan Freese added a comment - 17/Feb/21 9:45 AM

For upload/download we avoid streaming any part of the file through our servers as it would put a significant strain on our resources. So we pass the access token to the user and their client interacts directly with CyVerse. In addition, if the file is not public, there is currently no way to do byte-range requests to retrieve part of the file.

A potential idea we had previously was to support copy/paste of metadata. This is similar to other software such as Photoshop, where you can copy/paste style metadata onto various layers. I would imagine a user could have many files in a single folder, all belonging to the same genome and split between controls and experiments. The user could set the metadata for a single control file and a single experiment file, then copy/paste the metadata (genome, track color, track label) onto the appropriate files, potentially a whole group at once by shift-clicking the files. On the back end we would retrieve the metadata for the selected file and copy it, then create a request for each target file's metadata when pasted.
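
A minimal sketch of the back-end step, assuming metadata can be treated as a simple key/value mapping. The attribute names in `COPYABLE_KEYS`, the function name, and the shape of the returned request payloads are hypothetical and do not reflect the actual Terrain API.

```python
from typing import Dict, Iterable, List

# Hypothetical attribute names for the fields that would be copied.
COPYABLE_KEYS = ("genome", "track_color", "track_label")


def build_paste_requests(
    source_metadata: Dict[str, str],
    target_paths: Iterable[str],
    keys: Iterable[str] = COPYABLE_KEYS,
) -> List[Dict[str, object]]:
    """Copy the selected keys from one file's metadata and build one
    update payload per target file (payload shape is an assumption)."""
    copied = {k: source_metadata[k] for k in keys if k in source_metadata}
    # Each target gets its own dict so later edits do not leak between files.
    return [{"path": path, "metadata": dict(copied)} for path in target_paths]
```

Restricting the copy to a fixed set of keys avoids pasting unrelated metadata from the source file onto the targets.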


Ann Loraine added a comment - 16/Feb/21 8:31 PM

Request for Nowlan Freese:

Please describe the download endpoint here in the comments. Asking because we could grab the first 1000 or so bytes of a file and maybe do something with that.
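
For illustration, grabbing only the first bytes of a file over HTTP is normally done with a `Range` header. The sketch below uses Python's standard `urllib`; the function names and the example URL are hypothetical, and it only works if the server honors range requests, which may not hold for the download endpoint in question.

```python
import urllib.request


def build_partial_request(url: str, n_bytes: int = 1000) -> urllib.request.Request:
    """Build a GET request asking for only the first n_bytes of a resource."""
    if n_bytes < 1:
        raise ValueError("n_bytes must be at least 1")
    request = urllib.request.Request(url)
    # HTTP byte ranges are inclusive, so 0-999 covers the first 1000 bytes.
    request.add_header("Range", f"bytes=0-{n_bytes - 1}")
    return request


def fetch_prefix(url: str, n_bytes: int = 1000) -> bytes:
    """Fetch at most n_bytes from the start of the resource."""
    with urllib.request.urlopen(build_partial_request(url, n_bytes)) as response:
        # A server that ignores Range answers 200 with the full body,
        # so cap the read ourselves instead of trusting the response size.
        return response.read(n_bytes)
```

Note that a BAM file is BGZF-compressed, so a raw prefix would still need to be decompressed before any header text could be inspected.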

Another option would be simpler. We could recommend users save files in folders named for the genome version they are from. We could then attempt to guess the genome version from the file path.
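
A minimal sketch of the path-based guess, assuming folders are named exactly after the genome version. The list of version names and the function name are illustrative assumptions; the real list would come from whatever genome versions the site supports.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional

# Example genome version names in the IGB naming style (illustrative list).
GENOME_VERSIONS = ["A_thaliana_Jun_2009", "H_sapiens_Dec_2013"]


def guess_from_path(
    path: str, versions: Iterable[str] = GENOME_VERSIONS
) -> Optional[str]:
    """Return the genome version named by the closest enclosing folder, if any."""
    lookup = {version.lower(): version for version in versions}
    folders = PurePosixPath(path).parts[:-1]  # drop the file name itself
    # Walk from the nearest folder outward so the closest match wins.
    for folder in reversed(folders):
        match = lookup.get(folder.lower())
        if match is not None:
            return match
    return None
```

For example, `guess_from_path("/iplant/home/user/A_thaliana_Jun_2009/sample.bam")` would return `"A_thaliana_Jun_2009"`, while a path with no matching folder returns `None`.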


Nowlan Freese added a comment - 16/Feb/21 10:27 AM

This may be very difficult to implement. To my knowledge, we cannot directly read a file's contents using the Terrain API. I think this would be especially true for BAM files, as they are binary and would need samtools view to be run to view the header info. A potential option would be to create an app that takes a BAM file as input and outputs the header info as a file. We would then have to read this file (somehow?) and parse out the species/genome. However, this would require that we run many jobs for the user (equal to the number of files they have?). We would also need to keep track of which files have had the metadata set and which have not, or have been altered by the user.
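
If the header text were available as a file, the parsing step could look roughly like the sketch below. It assumes the input is the plain-text SAM header (as printed by `samtools view -H`), where each `@SQ` line carries a sequence name (`SN`) and length (`LN`). The function name is hypothetical.

```python
from typing import Dict


def parse_sq_lines(header_text: str) -> Dict[str, int]:
    """Extract {sequence name: length} from the @SQ lines of a SAM header."""
    sizes: Dict[str, int] = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        if "SN" in fields and "LN" in fields:
            sizes[fields["SN"]] = int(fields["LN"])
    return sizes
```

The SAM format also defines optional `AS` (assembly) and `SP` (species) tags on `@SQ` lines; when present they would give the answer directly, but many files omit them, so the names and lengths are the more dependable signal.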
