  1. IGB
  2. IGBF-3047

Investigate: New splice variant annotations for tomato

    Details

    • Type: Task
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Story Points:
      2
    • Sprint:
      Spring 9 2022 May 9, Summer 1 2022 May 23, Summer 2 2022 June 6, Summer 3 2022 June 21, Summer 4 2022 July 4

      Description

      The S. lycopersicum (cultivated tomato) gene annotations include only one gene model per gene. However, visualizing RNA-Seq data in IGB shows that a large number of genes produce multiple splice forms. At least one other group has noticed this as well. In their article "Expanding Alternative Splicing Identification by Integrating Multiple Sources of Transcription Data in Tomato" (https://www.frontiersin.org/articles/10.3389/fpls.2019.00689/full), a group at Youngstown State University led by Prof. Xiangjia (Jack) Min reported using transcriptome data, including ESTs and RNA-Seq data, to assemble new gene models. I downloaded these and deployed them to IGB Quickload; they are one of the available data sets for the next-to-last genome release.

      There may be other groups developing similar datasets for the most recent tomato genome release. To quantify splice variant expression using current methods, it would be extremely helpful to have an up-to-date, accurate-as-possible collection of gene models annotated with functional information. Who else is interested in this and might contribute? Or is this something only our group cares about?

      As part of the pollen NSF project, we are trying to understand how heat stress triggers changes in RNA synthesis in pollen, in pollen tubes, and in other sample types related to plant reproduction, especially in tomato.

      How homogeneous are the RNA-Seq data sets coming from the pollen project? So far, all the data have been from a single cell type: germinating pollen tubes. I do not recall seeing much evidence for alternative splicing in these datasets, at least not as compared with other samples that included many cell types, e.g., root or shoot. Also, are there splice forms that exist mainly in pollen but not in other tissue types? We found some examples of this in the Arabidopsis pollen RNA-Seq data described in our paper "RNA-seq of Arabidopsis pollen uncovers novel transcription and alternative splicing" (https://pubmed.ncbi.nlm.nih.gov/23590974/).

      How many tomato RNA-Seq data sets are there, and how good are they? For the purpose of producing new gene models, the best bulk RNA-Seq data would be paired-end and strand-specific, with very long reads. Are such data available currently, or would we need to create new data to cover the entirety of transcription?

            Activity

            ann.loraine Ann Loraine created issue -
            ann.loraine Ann Loraine made changes -
            Field Original Value New Value
            Epic Link IGBF-2993 [ 21429 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            nfreese Nowlan Freese made changes -
            Rank Ranked lower
            ann.loraine Ann Loraine made changes -
            Sprint Spring 9 2022 May 9 [ 144 ]
            ann.loraine Ann Loraine made changes -
            Sprint Spring 9 2022 May 9 [ 144 ] Spring 9 2022 May 9, Summer 1 2022 May 23 [ 144, 147 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            ann.loraine Ann Loraine made changes -
            Assignee Ann Loraine [ aloraine ]
            ann.loraine Ann Loraine made changes -
            Status In Progress [ 3 ] To-Do [ 10305 ]
            ann.loraine Ann Loraine made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            ann.loraine Ann Loraine added a comment - - edited

            More questions:

            What are the splice variants of tomato?
            What are the most frequently observed transcript models for each gene model?
            What conditions change the balance? Which conditions or treatments push the observed transcript ratios out of balance?
            What is "balance", and how would it manifest in the biological data we've collected?
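One possible way to make "balance" concrete, offered here only as an assumption and not as the project's definition, is the fraction of a gene's total expression contributed by each transcript model. The Python sketch below computes those fractions and the shift between two conditions. The gene and transcript names and all numbers are made up for illustration.

```python
from collections import defaultdict

def isoform_fractions(abundance):
    """Convert {(gene_id, transcript_id): abundance} to per-gene usage fractions.

    `abundance` is a hypothetical input: any non-negative expression
    measure (for example TPM) keyed by (gene_id, transcript_id).
    """
    gene_totals = defaultdict(float)
    for (gene, _tx), value in abundance.items():
        gene_totals[gene] += value
    fractions = {}
    for (gene, tx), value in abundance.items():
        total = gene_totals[gene]
        # A gene with zero total expression has no defined ratio; skip it.
        if total > 0:
            fractions[(gene, tx)] = value / total
    return fractions

def dominant_transcript(fractions):
    """Return {gene_id: transcript_id} for the most-used transcript of each gene."""
    best = {}
    for (gene, tx), frac in fractions.items():
        if gene not in best or frac > fractions[(gene, best[gene])]:
            best[gene] = tx
    return best

# Made-up numbers for illustration only.
control = {("geneA", "geneA.1"): 30.0, ("geneA", "geneA.2"): 10.0}
treated = {("geneA", "geneA.1"): 10.0, ("geneA", "geneA.2"): 30.0}
control_frac = isoform_fractions(control)
treated_frac = isoform_fractions(treated)
# Positive values mean the transcript gained share under treatment.
shift = {key: treated_frac[key] - control_frac[key] for key in control_frac}
```

Under this reading, a treatment pushes a gene "out of balance" when the shift for one of its transcripts is large relative to replicate-to-replicate variation. How large is large enough is a statistical question this sketch does not address.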

            Possible strategies for answering the above questions:

            • Investigate nf-core/rnaseq output. Did it produce output relevant to alternative splicing research?

            Possible ways to do the above:

            • Keyword search of the text output for terms like "alternative splicing" and "rna-binding protein", etc.
            • Ask an RNA biologist to interactively search the corpus of text documents for important words in their discipline

            Then, describe as plainly as possible what was found.
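The keyword search suggested above could be sketched as follows. This is a minimal Python example, not part of nf-core/rnaseq. The function name, the list of file suffixes, and the directory passed in are all assumptions, to be adjusted to whatever the pipeline output actually contains.

```python
from pathlib import Path

def count_keyword_hits(root, keywords,
                       suffixes=(".txt", ".md", ".html", ".tsv", ".csv")):
    """Count case-insensitive keyword occurrences in text files under `root`.

    Returns {keyword: {file_path: count}}, listing only files with hits.
    The suffix list is an assumption about which outputs are plain text.
    """
    hits = {kw: {} for kw in keywords}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # errors="ignore" keeps the scan going past undecodable bytes.
        text = path.read_text(errors="ignore").lower()
        for kw in keywords:
            n = text.count(kw.lower())
            if n:
                hits[kw][str(path)] = n
    return hits
```

For example, `count_keyword_hits("results", ["alternative splicing", "rna-binding protein"])` would scan a hypothetical `results` directory and report which files mention each term and how often.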

            ann.loraine Ann Loraine added a comment -

            Discussed project with Rob and Molly week of May 31. Moving to closed.

            ann.loraine Ann Loraine made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            ann.loraine Ann Loraine made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            ann.loraine Ann Loraine made changes -
            Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
            ann.loraine Ann Loraine made changes -
            Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
            ann.loraine Ann Loraine made changes -
            Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
            ann.loraine Ann Loraine made changes -
            Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
            ann.loraine Ann Loraine made changes -
            Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
            ann.loraine Ann Loraine made changes -
            Resolution Done [ 10000 ]
            Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]
            ann.loraine Ann Loraine added a comment - - edited

            Met with Molly and Rob Wed June 8, 2022 for status report, as follows:

            • Discussed "big picture" - need for positive controls to ensure that our methodology for detecting splicing changes in response to a treatment is working as expected
            • Molly showed a search for a specific SRP identifier SRP252265 (as an example - this is Mark Johnson data)
            • Click the "Send results to Run selector" link shown at the top of the search results
            • Goal: find more data to use in IGB (align and distribute via IGB QuickLoad) with similar properties
            • Molly has identified a few other data sets but needs to make a spreadsheet listing them (SRP accession values, SRR accession values?)
            • Found a video with instructions on how to identify similar datasets: https://youtu.be/Ww_OTe3M_94
            • We discussed using Bioconductor tools to automate extracting potentially useful datasets from the SRA
            • We found one that looked promising: SRAdb. We used Google Scholar to find citing articles for the 2013 paper that introduced SRAdb: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-19
            • We noted that one citing article from 2022 apparently used SRAdb Bioconductor package to develop a curated sequence database for a particular subarea of biology, suggesting that this package might be useful for us.
            • The strategy of using the SRA Web site interfaces to explore the SRA is a good one (it will help us understand what's there and how it is structured)

            To-Do items for us:

            • Send Andrea & James a request for RNA-Seq data set with a specific profile: paired-end data, 150 bases in length, non-strand-specific, VF36 and/or Heinz, Illumina HiSeq 4000
            • Rob: Let's ask during technical meeting today
            • Rob: Let's make a location in Google Drive for us to share information and documents
            • Molly: Continue her work with SRA Web interface and create table for us to look at - first draft only with a few examples of datasets we might be able to use
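A table of candidate runs could be screened against the requested profile (paired-end, 150 bases, a given instrument) along these lines. This is only a sketch: the column names are assumed to follow an SRA Run Selector metadata export, the `tolerance` parameter is hypothetical, and the run names in the example are placeholders, not real accessions.

```python
import csv
import io

def select_runs(rows, read_length=150, tolerance=0, layout="PAIRED",
                instrument=None):
    """Return the run names whose metadata match a sequencing profile.

    Column names ("Run", "LibraryLayout", "AvgSpotLen", "Instrument") are
    assumptions based on an SRA Run Selector export; adjust as needed.
    For paired runs, AvgSpotLen is assumed to be the sum of both mates.
    """
    selected = []
    mates = 2 if layout == "PAIRED" else 1
    for row in rows:
        if row["LibraryLayout"].upper() != layout:
            continue
        if instrument and row["Instrument"] != instrument:
            continue
        try:
            spot_len = float(row["AvgSpotLen"])
        except ValueError:
            continue  # missing or non-numeric length
        if abs(spot_len / mates - read_length) <= tolerance:
            selected.append(row["Run"])
    return selected

# Made-up table; RUN_A etc. are placeholders, not real accessions.
example = io.StringIO(
    "Run,LibraryLayout,AvgSpotLen,Instrument\n"
    "RUN_A,PAIRED,300,Illumina HiSeq 4000\n"
    "RUN_B,SINGLE,150,Illumina HiSeq 4000\n"
    "RUN_C,PAIRED,200,Illumina HiSeq 2500\n"
)
runs = select_runs(csv.DictReader(example), instrument="Illumina HiSeq 4000")
```

Whether the length column really reports the combined length of both mates should be checked against a few known runs before trusting the filter.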
            robofjoy Robert Reid added a comment -

            Rob successfully navigates his way to the comments!

            ann.loraine Ann Loraine made changes -
            Resolution Done [ 10000 ]
            Status Closed [ 6 ] To-Do [ 10305 ]
            ann.loraine Ann Loraine made changes -
            Sprint Spring 9 2022 May 9, Summer 1 2022 May 23 [ 144, 147 ] Spring 9 2022 May 9, Summer 1 2022 May 23, Summer 2 2022 June 6 [ 144, 147, 148 ]
            ann.loraine Ann Loraine added a comment -

            Following the meeting, Molly (email: mdavi258@uncc.edu) created a first-draft spreadsheet listing RNA-Seq datasets from the Sequence Read Archive, assembled using the web interfaces at the SRA Web site.

            Link to work-in-progress spreadsheet: https://docs.google.com/spreadsheets/d/1hg38PoQYAHUjx-H_40emgmpii3z2T1QNTcdSbX5LXBY/edit?usp=sharing

            ann.loraine Ann Loraine added a comment -

            A paper published in 2019 described using RNA-Seq data harvested from the public domain (e.g., SRA) to form a new set of gene models for tomato. The authors aligned the data to the SL3.0 (Feb 2017) assembly and built gene models from the alignments. [~aloraine] set up those gene models for visualization in IGB as described in the README.md here: https://bitbucket.org/hotpollen/rna-seq/src/master/SplicingBackground/

            ann.loraine Ann Loraine made changes -
            Assignee Ann Loraine [ aloraine ]
            ann.loraine Ann Loraine made changes -
            Sprint Spring 9 2022 May 9, Summer 1 2022 May 23, Summer 2 2022 June 6 [ 144, 147, 148 ] Spring 9 2022 May 9, Summer 1 2022 May 23, Summer 2 2022 June 6, Summer 3 2022 June 20 [ 144, 147, 148, 149 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine added a comment - - edited

            Update:

            • We looked at the spreadsheet during a meeting (11:00 am Tues 6/22).
            • Labeled (using color) the experimental (target) dataset (green) and the candidate positive controls (grayish-blue); picked the first positive control dataset (SRP328042)
            • Decided to filter candidates to read lengths within one base (+1 or -1) of the target dataset's read length (already filtered on paired-end)
            • Made folders in Google Drive to track today's products
            • Downloaded automatically-generated file containing 18 "SRR" run accessions in SRP328042.

            Next steps:

            • Download fastq format files from the Sequence Read Archive and put them onto the cluster (using fasterq-dump from the SRA Toolkit)
            • Retrieve data for experimental () and positive control (SRP328042)
            • Make sure that the Unix file permissions are set such that everybody in the group can edit, change names, delete the files
            • Please provide a simple script (nothing fancy) that will do the above
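A minimal sketch of such a script, under stated assumptions: the accession file holds one SRR identifier per line, `prefetch` and `fasterq-dump` from the SRA Toolkit are on the PATH, and the function name `fetch_runs` and the `DRY_RUN` switch are illustrative names, not part of the toolkit. Group write permission on the output directory is what lets other group members rename or delete files in it.

```shell
#!/bin/sh
# Sketch only: fetch each run in an accession list, then open the results to the group.
# Hypothetical helper; DRY_RUN=1 prints the commands instead of running them.

fetch_runs() {
    acc_file=$1   # text file, one SRR accession per line (assumed layout)
    out_dir=$2    # destination directory on the cluster
    run=""
    if [ -n "${DRY_RUN:-}" ]; then run="echo"; fi

    $run mkdir -p "$out_dir"
    while read -r acc || [ -n "$acc" ]; do
        if [ -z "$acc" ]; then continue; fi
        # Download the run archive into out_dir/<accession>/
        $run prefetch --output-directory "$out_dir" "$acc"
        # Convert to fastq; --split-files writes <accession>_1.fastq and <accession>_2.fastq
        $run fasterq-dump --split-files --outdir "$out_dir" "$out_dir/$acc"
    done < "$acc_file"
    # g+rwX: group read/write on everything, plus execute (traverse) on directories
    $run chmod -R g+rwX "$out_dir"
}
```

Calling `DRY_RUN=1 fetch_runs accessions.txt fastq` first shows the commands that would be issued; dropping `DRY_RUN` runs them for real.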

            Probably next steps:

            • Develop configuration files for running the RNA-Seq data analysis pipeline from Nextflow (nf-core/rnaseq)
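One piece of that configuration is the pipeline's sample sheet, a CSV with the columns `sample,fastq_1,fastq_2,strandedness`. The sketch below builds one from a directory of paired files. It assumes the files are named `<run>_1.fastq.gz` and `<run>_2.fastq.gz` (fasterq-dump writes uncompressed `.fastq`, so they would need to be gzipped first), and that one strandedness value applies to every sample. The function name `make_samplesheet` is hypothetical.

```shell
#!/bin/sh
# Sketch only: write an nf-core/rnaseq style sample sheet to stdout.
# Usage: make_samplesheet FASTQ_DIR [STRANDEDNESS]

make_samplesheet() {
    fq_dir=$1
    strand=${2:-unstranded}   # assumed default; set per library prep
    echo "sample,fastq_1,fastq_2,strandedness"
    for r1 in "$fq_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue            # glob matched nothing
        r2=${r1%_1.fastq.gz}_2.fastq.gz     # expected mate file
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: no mate file found" >&2
            continue
        fi
        sample=$(basename "$r1" _1.fastq.gz)
        echo "$sample,$r1,$r2,$strand"
    done
}
```

The output would be redirected to a file, for example `make_samplesheet fastq reverse > samplesheet.csv`, and passed to the pipeline through its `--input` parameter.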

            Molly: Video on working with GEO (not SRA) in R, bioconductor, tidyverse is here: https://youtu.be/dc77edcNp3M

            ann.loraine Ann Loraine made changes -
            ann.loraine Ann Loraine added a comment - - edited

            Today we learned about a new tomato genome release with what appear to be splice variants.
            Loaded GFF file into IGB (w/o the sequence) and clicked on a few things.

            Questions about the gene models:

            • How well represented are genes expressed primarily or exclusively in germinating pollen, as compared to other, more exhaustively sequenced organ or tissue types?
            • How are transcripts assigned to the same parent identifier?

            To-do for the splicing investigation project, now that we have these splice variants (maybe?):

            Start testing splicing detection methodologies, beginning with ArabiTag, as this is code Ann is very familiar with.

            To run ArabiTag:

            • Convert the gene models into bed12 format, use ArabiTag to detect and classify the AS event pairs (will probably use: https://bitbucket.org/lorainelab/genomesource/src/master/ )
            • Run ArabiTag on read1 and then read2 of experimental and positive control datasets to generate counts for S and L forms of each pair (arabitag: https://bitbucket.org/lorainelab/altspliceanalysis/src/master/ )
            • Analyze read1 and read2 results separately, not together, to avoid double-sampling fragments when R1 and R2 overlap
            • Run ArabiTag R code (via Markdown) to calculate %spliced-in, t-statistic for comparing two groups, counts per comparison, false discovery rate (Q) per comparison, and more (for examples, see: https://bitbucket.org/lorainelab/hot-dry-arabidopsis/src/master/ and https://bitbucket.org/lorainelab/ricealtsplice/src/master/ )
            • Select genes with RNA processing or RNA-binding annotations
            • Visually analyze the data by viewing RNA-Seq sequence alignments in an interactive genome browser
            • Compare the amount of splicing difference in the positive control versus the experimental sample. Are they similar? Are they different?

            Methodology question:
            Can we merge overlapping R1 and R2 sequences into a single sequence and align the entire thing, rather than the two ends separately? Somehow, that seems better.
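To make the %spliced-in step concrete, here is a rough sketch of the arithmetic on a table of S and L counts. Everything about the layout is an assumption: a tab-separated file with columns event id, S count, L count, and %spliced-in taken as 100 * L / (S + L). The actual ArabiTag R code may define and filter these differently; `psi_table` and its minimum-support threshold are illustrative only.

```shell
#!/bin/sh
# Sketch only: percent spliced-in per AS event from S and L read counts.
# Usage: psi_table COUNTS.tsv [MIN_TOTAL_READS]

psi_table() {
    awk -F '\t' -v min="${2:-10}" '
        # Skip a header row if the second column is not a number
        NR == 1 && $2 !~ /^[0-9]+$/ { next }
        {
            total = $2 + $3
            # Report only events with enough supporting reads (assumed cutoff)
            if (total > 0 && total >= min)
                printf "%s\t%d\t%d\t%.1f\n", $1, $2, $3, 100 * $3 / total
        }
    ' "$1"
}
```

Running it once on the read1 counts and once on the read2 counts keeps the two sets of results separate, in line with the plan not to combine them.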
            ann.loraine Ann Loraine made changes -
            Link This issue relates to IGBF-3135 [ IGBF-3135 ]
            robofjoy Robert Reid added a comment -

            Assuming the SRA Toolkit has been downloaded locally, fetch a run with prefetch and convert it to fastq with fasterq-dump:

            ~/sw/sratoolkit.3.0.0-mac64/bin/prefetch SRR1572591
            ~/sw/sratoolkit.3.0.0-mac64/bin/fasterq-dump -S SRR1572591

            (The -S option splits the reads into 2 fastq files.)
            Now validate what you pulled:
            ~/sw/sratoolkit.3.0.0-mac64/bin/vdb-validate SRR1572591
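As an extra sanity check on the split output, the two mate files should hold the same number of reads. The sketch below compares them, assuming standard four-line fastq records (which is what fasterq-dump writes by default). The function names are illustrative.

```shell
#!/bin/sh
# Sketch only: confirm that paired fastq files contain the same number of records.

count_reads() {
    # Four lines per record: header, sequence, plus line, qualities
    awk 'END { print int(NR / 4) }' "$1"
}

check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK $n1 read pairs"
    else
        echo "MISMATCH $1=$n1 $2=$n2" >&2
        return 1
    fi
}
```

For example, `check_pairs SRR1572591_1.fastq SRR1572591_2.fastq` prints the pair count on success and returns a non-zero status on a mismatch.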

            ann.loraine Ann Loraine made changes -
            Sprint Spring 9 2022 May 9, Summer 1 2022 May 23, Summer 2 2022 June 6, Summer 3 2022 June 21 [ 144, 147, 148, 149 ] Spring 9 2022 May 9, Summer 1 2022 May 23, Summer 2 2022 June 6, Summer 3 2022 June 21, Summer 4 2022 July 4 [ 144, 147, 148, 149, 150 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Description The S. lycopersicum (cultivated tomato) gene annotations include only one gene model per gene. However, visualizing RNA-Seq data in IGB shows that a large number of genes produce multiple splice forms. At least one other group has noticed this as well. In their article "[Expanding Alternative Splicing Identification by Integrating Multiple Sources of Transcription Data in Tomato|https://www.frontiersin.org/articles/10.3389/fpls.2019.00689/full]", a group at Ohio State University led by Prof. Xiangjia (Jack) Min reported using transcriptome data, including ESTs and RNA-Seq data, to assemble new gene models. I downloaded these and deployed them to IGB Quickload; they are one of the available data sets for the next-to-last genome release.

            There may be other groups developing similar datasets for the most recent genome release for tomato. And in order to quantify splice variant expression using current methods, it would be extremely helpful to have an up-to-date, accurate-as-possible collection of gene models annotated with functional information. Who else is interested in this and would be interested in contributing? Or is this something only our group might care about?

            As part of the pollen NSF project, we are trying to understand and discover how heat stress triggers changes in RNA synthesis in pollen, in pollen tubes, and in other sample types related to reproduction in plants, especially tomato.

            How homogeneous are the RNA-Seq data sets coming from the pollen project? So far, all the data have been from a single cell type: germinating pollen tubes. I do not recall seeing much evidence for alternative splicing in these datasets, at least not as compared with other samples that included many cell types, e.g., root or shoot. Also, are there splice forms that exist mainly in pollen but not other tissue types? We found some examples of this in the Arabidopsis pollen RNA-Seq data described in our paper "[RNA-seq of Arabidopsis pollen uncovers novel transcription and alternative splicing|https://pubmed.ncbi.nlm.nih.gov/23590974/]".

            How many tomato RNA-Seq data sets are there, and how good are they? For the purpose of producing new gene models, the best bulk RNA-Seq data would be paired-end, strand-specific, and have very long read lengths. Are such data available currently, or would we need to create new data to cover the entirety of transcription?
            ann.loraine Ann Loraine added a comment - - edited

            Saw a talk at a meeting that mentioned this new method for transcript assembly: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02700-3
            Slides from the meeting review: https://docs.google.com/presentation/d/1YmZWDT8-COLkEtj3ppA7HadD_Zyy6uD_I3V6K90jmjg/edit#slide=id.p
            ann.loraine Ann Loraine made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            ann.loraine Ann Loraine made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            ann.loraine Ann Loraine made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            ann.loraine Ann Loraine made changes -
            Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
            ann.loraine Ann Loraine made changes -
            Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
            ann.loraine Ann Loraine made changes -
            Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
            ann.loraine Ann Loraine made changes -
            Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
            ann.loraine Ann Loraine made changes -
            Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
            ann.loraine Ann Loraine made changes -
            Resolution Done [ 10000 ]
            Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]
            ann.loraine Ann Loraine made changes -
            Assignee Ann Loraine [ aloraine ]

              People

              • Assignee:
                ann.loraine Ann Loraine
                Reporter:
                ann.loraine Ann Loraine
              • Votes:
                0
                Watchers:
                2
