[IGBF-3619] Kelsey's quest for unbiased gene distribution - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Minor
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Epic Link:
Support NSF pollen grant
Sprint:
Spring 4, Spring 5

Description

This task is more of a brainstorm based on the details that came out of Kelsey's committee meeting.
In her meeting, an advisor asked about Heinz being both the reference genome and a sample in the experiment. It would make sense that Heinz reads align more easily to the reference genome than reads from the other 3 TMH varieties.

The PCA (attached) reflects this.

The goal would be to come up with an unbiased way to check MDS / PCA / PCoA where everything is not aligned to Heinz.

One idea:

1. Run a de novo assembly with Trinity.
2. Get a set of genes for each variety.
3. Assign each of these genes to a SolyID via BLAST, along with a gene expression count.
4. Run these through MDS / PCA and look at the distribution.

Attachments


Kelsey Pryze.pdf
18 kB
23/Feb/24 11:35 AM
pca_8hr_total.png
160 kB
23/Feb/24 11:27 AM

Activity

Robert Reid added a comment - 27/Feb/24 9:05 AM

A more detailed plan laid out in this email:

I think I can churn something out in time. Let's lay out a plan first. I can start the pipeline and I have a student rotating with me who can help as well. I think we can have it done in time.

But if not, you can show the pipeline, lay out the justification and describe the expected end result.

It will also be a good idea to run the pipeline in Arizona afterwards. I share with you the code, you run it, we make it all publicly available for publication! You running it = repeatable.

Let me know if the following is a good strategy:

GOAL: To create gene counts for each plant variety using the same plant sequences and a de novo assembly approach to avoid HEINZ reference bias.

1. Summarize the sequences available in each variety (TMNH).
2. Pool all sequences in a variety to create a "mega pool". X4 (repeated for each variety)
3. Run Trinity on this mega pool to create a set of de novo contigs. X4
4. BLAST each of these contigs back to the SL5 genome to get an annotation and have a way to compare varieties.
5. Align each set of reads to the appropriate set of de novo contigs to get gene counts.
6. Make a big table of gene counts from step #5 and tie them together via the annotation from step #4.
7. Off to DESeq2 / edgeR / EBSeq-HMM, etc!

Also, I will have Molly run and fine-tune the pipeline, and she will then set it up in BitBucket to make the code publicly available for others.
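
To make step #6 concrete, here is a minimal sketch of how per-contig counts from each variety's own assembly could be collapsed onto shared gene IDs. This is only an illustration under assumptions: the function name, the input layout, and the placeholder IDs (geneA, h_c1, and so on) are made up, and it assumes each contig has already been reduced to a single best BLAST hit in step #4. It is not the actual pipeline code.

```python
from collections import defaultdict


def gene_count_table(contig_to_gene, contig_counts):
    """Collapse per-contig counts into one gene-by-sample table.

    contig_to_gene: {variety: {contig_id: gene_id}}   (step #4, one hit per contig)
    contig_counts:  {sample: (variety, {contig_id: count})}   (step #5)
    Returns (sorted gene IDs, sorted sample names, {gene: {sample: count}}).
    """
    table = defaultdict(lambda: defaultdict(int))
    for sample, (variety, counts) in contig_counts.items():
        annotation = contig_to_gene[variety]
        for contig, n in counts.items():
            gene = annotation.get(contig)
            if gene is None:
                # Contigs with no annotation cannot be compared across varieties.
                continue
            table[gene][sample] += n
    samples = sorted(contig_counts)
    genes = sorted(table)
    # Fill in zeros so every gene has a value for every sample.
    filled = {g: {s: table[g].get(s, 0) for s in samples} for g in genes}
    return genes, samples, filled


# Made-up toy input: two varieties, one sample each.
annotation = {
    "H": {"h_c1": "geneA", "h_c2": "geneB"},
    "M": {"m_c1": "geneA", "m_c2": "geneA"},
}
counts = {
    "H_rep1": ("H", {"h_c1": 10, "h_c2": 5}),
    "M_rep1": ("M", {"m_c1": 3, "m_c2": 4, "m_c3": 100}),
}
genes, samples, table = gene_count_table(annotation, counts)
for g in genes:
    print(g, [table[g][s] for s in samples])
```

Note that contigs without a hit (m_c3 above) drop out of the table, so the share of reads lost that way in each variety is worth tracking, since it could introduce a bias of its own.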

Robert Reid added a comment - 27/Feb/24 9:07 AM

Kelsey presents at the Tomato Group meeting on Feb. 27.

We will get more details then and can plan an attack.
I would like to move this into the current sprint.

I can start tackling it this week.

Ann Loraine added a comment - 27/Feb/24 11:08 AM - edited

More thoughts:

A PCA needs as input a matrix of numbers. In our case, the matrix rows are genes, matrix columns are samples, and values are expression measurements per gene, per sample.
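
As a rough illustration of that input matrix, the sketch below takes a genes-by-samples matrix and computes the first principal component score for each sample. Everything in it is an assumption for illustration: the function name, the log2(count + 1) transform, and the toy counts are made up, and power iteration is used only to keep the example dependency-free. A real analysis would use an established PCA routine.

```python
import math


def pc1_scores(matrix, iterations=200):
    """Return the PC1 score of each sample for a genes-by-samples matrix."""
    # log2(count + 1) tames the wide dynamic range of expression counts.
    logged = [[math.log2(v + 1) for v in row] for row in matrix]
    # Center each gene (row) across samples.
    centered = []
    for row in logged:
        row_mean = sum(row) / len(row)
        centered.append([v - row_mean for v in row])
    n = len(centered[0])
    # Samples-by-samples Gram matrix; its leading eigenvector gives PC1.
    gram = [[sum(row[i] * row[j] for row in centered) for j in range(n)]
            for i in range(n)]
    # Power iteration. The start vector must not be orthogonal to the
    # leading eigenvector; [1, 2, ..., n] works for the toy data below.
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            return [0.0] * n
        vec = [x / norm for x in nxt]
    eigenvalue = sum(vec[i] * sum(gram[i][j] * vec[j] for j in range(n))
                     for i in range(n))
    # Sample scores are the eigenvector scaled by the singular value.
    return [math.sqrt(max(eigenvalue, 0.0)) * x for x in vec]


# Made-up counts: rows are genes, columns are four samples
# (two from one cultivar, then two from another).
toy = [
    [100, 110, 10, 12],
    [5, 6, 80, 90],
    [50, 52, 49, 51],
]
print(pc1_scores(toy))
```

In this toy case the first two samples land on one side of PC1 and the last two on the other. The sign of a principal component is arbitrary, so only the grouping is meaningful.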

We can make PCA plots showing all the data (all samples) currently because we are using a common genome assembly as the alignment target. We have expression measurements for every gene, in every sample.

The gene models are from the reference Heinz assembly, in this case.

If we make RNA-Seq assemblies - a collection of transcript models - how will we get the matrix to use as input to PCA?
We would need a way to make the input matrix that wouldn't exacerbate bias arising from Heinz alignments working best against Heinz sequence.

PCA plots give you a nice, visual overview of the distributions of variables in a data set. We interpret them by making note of two-dimensional relationships between points, where we use color and shape to indicate values of experimental variables - temperature, cultivar, age, etc. In our case, points cluster by cultivar. This means there are large differences in gene expression measurements associated with cultivar in the data set. These differences could arise from many causes, of which alignment quality *could* be one. Heinz sequences may align better to some genes than sequences from the other varieties do.

Is there a way that we can assess the degree to which alignment quality is affecting the results?

Has anyone looked at whether there are any signs in the data of such a bias? How do we do that?
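
One simple first check, sketched here only as a possibility, would be to compare the fraction of reads that aligned in each sample, grouped by cultivar. The function name, the input layout, and the numbers below are all made up for illustration. The real values would have to come from the aligner's summary output for each sample.

```python
from statistics import mean


def mean_rate_by_cultivar(rates):
    """Average the per-sample alignment rate within each cultivar.

    rates: {sample_name: (cultivar, fraction_of_reads_aligned)}
    """
    grouped = {}
    for cultivar, rate in rates.values():
        grouped.setdefault(cultivar, []).append(rate)
    return {cultivar: mean(values) for cultivar, values in grouped.items()}


# Placeholder values only, not real results.
example = {
    "sample1": ("cultivarA", 0.90),
    "sample2": ("cultivarA", 0.80),
    "sample3": ("cultivarB", 0.70),
}
print(mean_rate_by_cultivar(example))
```

If Heinz samples showed a consistently higher rate than the other cultivars, that would be one sign of reference bias. A similar rate across cultivars would not rule it out, because the bias could be confined to a subset of genes.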

Robert Reid added a comment - 29/Feb/24 10:39 AM

To dive further into what Kelsey and Ravi want, we will meet with them next week!

Robert Reid added a comment - 07/Mar/24 10:29 AM

Slides related to this topic:

https://docs.google.com/presentation/d/1DulwQn8qpXMv1E2nDB_BZEVQlZtJo9ZA0Ad_U_1QEWs/edit?usp=sharing

Two potential tasks for us to try:

1. Run Trinity on Kelsey's data. We can then explore how well these de novo contigs align to SL4 or SL5. (SL4 because Kelsey plans to stick with that due to her prior analyses.)
2. Align Kelsey's data to the OTHER available tomato genome.

The other genome:
https://solgenomics.net/organism/Solanum_lycopersicum_var._cerasiforme/genome

Robert Reid added a comment - 07/Mar/24 10:29 AM

Let's discuss and then make tickets and tasks!

Robert Reid added a comment - 13/Mar/24 11:52 AM

I have made a few new tickets related to this. They have not been assigned to any sprint yet.

We can add them whenever we do the next sprint planning session.

https://jira.bioviz.org/browse/IGBF-3647

https://jira.bioviz.org/browse/IGBF-3646

People

Assignee:

Robert Reid

Reporter:

Robert Reid

Votes:

0

Watchers:

2

Dates

Created:

23/Feb/24 11:36 AM

Updated:

13/Mar/24 11:55 AM

Resolved:

13/Mar/24 11:55 AM