  1. IGB
  2. IGBF-3619

Kelsey's quest for unbiased gene distribution

    Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      This task is more of a brainstorm based on the details that came out of Kelsey's committee meeting.
      In her meeting, an advisor asked about the fact that Heinz is both the reference genome and a sample in the experiment. It would make sense that Heinz reads align to the reference genome more easily than reads from the other 3 TMH varieties.

      The PCA (attached) reflects this.

      The goal would be to come up with an unbiased way to check MDS / PCA / PCoA where everything is not aligned to Heinz.

      One idea:

      1. Run a de novo assembly with Trinity.
      2. Get a set of genes for each variety.
      3. Assign each of these genes to a SolyID via BLAST, along with a gene expression count.
      4. Run these as a MDS / PCA and see the distribution.

        Attachments

          Activity

          robofjoy Robert Reid added a comment -

          A more detailed plan laid out in this email:

          I think I can churn something out in time. Let's lay out a plan first. I can start the pipeline and I have a student rotating with me who can help as well. I think we can have it done in time.

          But if not, you can show the pipeline, lay out the justification and describe the expected end result.

          It will also be a good idea to run the pipeline in Arizona afterwards. I share with you the code, you run it, we make it all publicly available for publication! You running it = repeatable.

          Let me know if the following is a good strategy:

          GOAL: To create gene counts for each plant variety using the same plant sequences and a de novo assembly approach to avoid HEINZ reference bias.

          1. Summarize the sequences available in each variety (TMNH).
          2. Pool all sequences in a variety to create a "mega pool". X4 (repeated for each variety)
          3. Run Trinity on this mega pool to create a set of de novo created contigs. X4
          4. Blast each of these contigs back to SL5 genome to get an annotation and have a way to compare varieties.
          5. Align each set of reads to the appropriate set of de novo contigs to get gene counts.
          6. Make a big table of gene counts from step #5 and tie them together via the annotation step #4.
          7. Off to DESeq2 / edgeR / EBSeq-HMM, etc.!
            Also, I can have Molly run and fine-tune the pipeline; she then sets it up in Bitbucket to make the code publicly available for others.
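
Step 6 of the plan above could be sketched roughly as follows. This is a minimal illustration, not the pipeline itself: the function name, the sample, variety, and contig names, and the counts are all hypothetical, and it assumes that step 4 yields one gene ID per annotated contig and that step 5 yields one read count per contig per sample.

```python
from collections import defaultdict


def build_gene_count_table(counts_by_sample, variety_of_sample, annotation_by_variety):
    """Combine per-sample contig counts into one gene-level table.

    counts_by_sample:      {sample: {contig_id: read_count}}   (step 5)
    variety_of_sample:     {sample: variety}
    annotation_by_variety: {variety: {contig_id: gene_id}}     (step 4)

    Returns (samples, rows) where rows is {gene_id: [count per sample]}.
    Contigs with no annotation are skipped, because they cannot be
    compared across varieties; genes missing from a sample get 0.
    """
    table = defaultdict(dict)
    for sample, contig_counts in counts_by_sample.items():
        annotation = annotation_by_variety[variety_of_sample[sample]]
        for contig_id, count in contig_counts.items():
            gene_id = annotation.get(contig_id)
            if gene_id is None:
                continue
            # Several contigs may map to the same gene: sum their counts.
            table[gene_id][sample] = table[gene_id].get(sample, 0) + count
    samples = sorted(counts_by_sample)
    rows = {gene: [table[gene].get(s, 0) for s in samples] for gene in sorted(table)}
    return samples, rows


# Hypothetical toy input, for illustration only.
counts_by_sample = {
    "A_rep1": {"A_c1": 10, "A_c2": 5, "A_c3": 7},
    "B_rep1": {"B_c1": 3, "B_c2": 8},  # B_c2 has no annotation below
}
variety_of_sample = {"A_rep1": "A", "B_rep1": "B"}
annotation_by_variety = {
    "A": {"A_c1": "gene1", "A_c2": "gene1", "A_c3": "gene2"},
    "B": {"B_c1": "gene1"},
}

samples, rows = build_gene_count_table(
    counts_by_sample, variety_of_sample, annotation_by_variety
)
```

The resulting table has genes as rows and samples as columns, which is the layout that count-based tools generally expect. Summing contigs that share a gene ID is a design choice made here for simplicity; other choices (for example, keeping only the best-matching contig) are equally possible.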
          robofjoy Robert Reid added a comment -

          Kelsey presents in Tomato Group meeting on FEB. 27.

          We get more details then and can plan an attack.
          Would like to move this into current sprint.

          I can start tackling this week.

          ann.loraine Ann Loraine added a comment - - edited

          More thoughts:

          A PCA needs as input a matrix of numbers. In our case, the matrix rows are genes, matrix columns are samples, and values are expression measurements per gene, per sample.

          We can make PCA plots showing all the data (all samples) currently because we are using a common genome assembly as the alignment target. We have expression measurements for every gene, in every sample.

          The gene models are from the reference Heinz assembly, in this case.
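
To make the matrix layout concrete, here is a minimal sketch of a PCA on a genes-by-samples matrix, using only the Python standard library. It is not the analysis used for the attached plot: the function name, the log2(x + 1) transform, and the toy values are all assumptions made for illustration. It computes only the first principal component, by power iteration on the sample-by-sample Gram matrix.

```python
import math


def pc1_sample_scores(matrix, n_iter=200):
    """Return the PC1 score of each sample.

    matrix: list of gene rows; each row holds one value per sample.
    """
    n_samples = len(matrix[0])

    # Log-transform, then center each gene across samples.
    centered = []
    for row in matrix:
        logged = [math.log2(value + 1) for value in row]
        gene_mean = sum(logged) / n_samples
        centered.append([value - gene_mean for value in logged])

    # Sample-by-sample Gram matrix of the centered data.
    gram = [
        [sum(row[i] * row[j] for row in centered) for j in range(n_samples)]
        for i in range(n_samples)
    ]

    def gram_times(vec):
        return [sum(gram[i][j] * vec[j] for j in range(n_samples)) for i in range(n_samples)]

    # Power iteration converges to the leading eigenvector of the Gram matrix.
    vec = [float(i + 1) for i in range(n_samples)]
    for _ in range(n_iter):
        product = gram_times(vec)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return [0.0] * n_samples  # no variance in the data
        vec = [x / norm for x in product]

    eigenvalue = sum(v * p for v, p in zip(vec, gram_times(vec)))
    # PC1 scores are the eigenvector scaled by the singular value.
    return [math.sqrt(eigenvalue) * v for v in vec]


# Hypothetical toy matrix: 3 genes (rows) x 4 samples (columns).
# Samples 0-1 stand for one cultivar, samples 2-3 for another.
toy_matrix = [
    [100, 110, 10, 12],
    [5, 6, 50, 55],
    [20, 21, 19, 22],
]
scores = pc1_sample_scores(toy_matrix)
```

In this toy case the first two samples land on one side of PC1 and the last two on the other, which is the "points cluster by cultivar" pattern. The sign of a principal component is arbitrary, so only the grouping is meaningful, not which side is positive.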

          If we make RNA-Seq assemblies - a collection of transcript models - how will we get the matrix to use as input to PCA?
          We would need a way to make the input matrix that wouldn't exacerbate bias arising from Heinz alignments working best against Heinz sequence.

          PCA plots give you a nice visual overview of the distributions of variables in a data set. We interpret them by noting two-dimensional relationships between points, where we use color and shape to indicate values of experimental variables such as temperature, cultivar, or age. In our case, points cluster by cultivar. This means there are large differences in gene expression measurements associated with cultivar in the data set. These differences could arise from many causes, of which alignment quality *could* be one. Heinz sequences may align better to some genes than sequences from the other varieties do.

          Is there a way that we can assess the degree to which alignment quality is affecting the results?

          Has anyone looked at whether there are any signs in the data of such a bias? How do we do that?
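
One possible first check, offered as an assumption rather than something already agreed on in this ticket, is to compare per-sample alignment rates grouped by cultivar. The sketch below assumes mapped and total read counts per sample are available from the aligner's summary output. The function names, the "OtherVariety" label, and all numbers are made up for illustration; they are not results from Kelsey's data.

```python
from statistics import mean


def alignment_rate_by_cultivar(samples):
    """Mean fraction of reads aligned, per cultivar.

    samples: iterable of (cultivar, mapped_reads, total_reads) tuples.
    """
    rates = {}
    for cultivar, mapped, total in samples:
        rates.setdefault(cultivar, []).append(mapped / total)
    return {cultivar: mean(values) for cultivar, values in rates.items()}


def reference_gap(rates, reference):
    """Reference cultivar's mean rate minus the mean rate of all others."""
    others = [rate for cultivar, rate in rates.items() if cultivar != reference]
    return rates[reference] - mean(others)


# Made-up numbers, for illustration only.
toy_samples = [
    ("Heinz", 90, 100),
    ("Heinz", 92, 100),
    ("OtherVariety", 80, 100),
    ("OtherVariety", 82, 100),
]
rates = alignment_rate_by_cultivar(toy_samples)
gap = reference_gap(rates, "Heinz")
```

A gap near zero would suggest that overall alignment rate is not driving the separation. A large gap would not prove bias by itself, but it would justify a closer, per-gene look.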

          robofjoy Robert Reid added a comment -

          To dive further into what Kelsey and Ravi want, we will meet with them next week!

          robofjoy Robert Reid added a comment -

          Slides that are related to this topic.

          https://docs.google.com/presentation/d/1DulwQn8qpXMv1E2nDB_BZEVQlZtJo9ZA0Ad_U_1QEWs/edit?usp=sharing

          Two potential tasks for us to try:

          • Run Trinity on Kelsey's data. We can then explore how well these de novo contigs align to SL4 or SL5. (SL4 because Kelsey plans to stick with that due to her prior analyses)
          • Align Kelsey's data to the OTHER available tomato genome.

          The other genome:
          https://solgenomics.net/organism/Solanum_lycopersicum_var._cerasiforme/genome
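
For the first task, one way to explore how well the de novo contigs match a reference is to summarize BLAST results per contig. The sketch below assumes BLAST was run with tabular output (`-outfmt 6`, default 12 columns) and keeps the highest-bitscore hit for each contig. The function name, contig names, subject names, and values are hypothetical.

```python
def best_hits(blast_lines):
    """Keep the highest-bitscore hit per query from BLAST tabular output.

    Assumes the default 12 columns of -outfmt 6:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore

    Returns {query_id: (subject_id, percent_identity, bitscore)}.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        percent_identity = float(fields[2])
        bitscore = float(fields[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, percent_identity, bitscore)
    return best


# Hypothetical toy lines, for illustration only.
toy_lines = [
    "contig1\tgeneA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t360",
    "contig1\tgeneB\t90.0\t150\t15\t0\t1\t150\t1\t150\t1e-50\t180",
    "contig2\tgeneB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t170",
]
hits = best_hits(toy_lines)
mean_identity = sum(hit[1] for hit in hits.values()) / len(hits)
```

Comparing the distribution of best-hit percent identity between varieties would give a rough sense of how far each variety's contigs diverge from the reference. In practice one would pass an open file handle instead of a list of strings.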

          robofjoy Robert Reid added a comment -

          Let's discuss and then make tickets and tasks!

          robofjoy Robert Reid added a comment -

          I have made a few new tickets related to this. They have not been assigned to any sprint yet.

          We can add them whenever we do the next sprint planning session.

          https://jira.bioviz.org/browse/IGBF-3647

          https://jira.bioviz.org/browse/IGBF-3646


            People

            • Assignee:
              robofjoy Robert Reid
              Reporter:
              robofjoy Robert Reid
            • Votes:
              0
              Watchers:
              2