  1. IGB
  2. IGBF-3189

Annotate SL5 genes with RNA processing related functions

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Annotate the SL5 genes with RNA processing related functions.

      Create a table that reports functions for each of the "SL" identifiers (transcript or gene name is fine).
      Note that the relationship can be many-to-many! For example, one gene or transcript (column 1) could have multiple functional assignments (column 2).

      Predicted:

      • SR protein annotations (most likely to be alternatively spliced in response to a treatment)
      • RNA-binding proteins
      • Protein components of the spliceosome
      • snRNP RNAs (if possible)

      References:

      • Implementing a Rational and Consistent Nomenclature for Serine/Arginine-Rich Protein Splicing Factors (SR Proteins) in Plants = https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2965536/
      • Gene Ontology classifications could be used, but these would probably need to be transferred over from previous annotation releases

            Activity

            robofjoy Robert Reid added a comment -

            SR Proteins
            From Figure 1 of: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2965536/

            Figure 1 shows the various protein / gene symbol names from Arabidopsis for the SR family.

            To get a list of these genes, the image was copied into Google Images and translated into text.
            The text was then copied into a Google Sheet and edited.

            SR splicing factors Sheet:
            https://docs.google.com/spreadsheets/d/1ScfcUkmf74G6eOoGGDV-3ZpQ0iqbOiVqF78SyEiPSoM/edit?usp=sharing

            *A few of the entries were incorrect in the OCR step from Google lens / images and were manually corrected.

            Now we have a list of Arabidopsis gene IDs! (and rice too)

            Finding the relevant GO terms
            To find the relevant GO terms, each of the 17 Arabidopsis IDs was entered into TAIR’s online tool for GO term retrieval (via PANTHER’s DB).

            GO TOOL:
            https://www.arabidopsis.org/tools/go_term_enrichment.jsp

            A suitable GO term was found for each ID. The GO term and the link to the PANTHER result were recorded in the SR splicing factors Google Sheet
            (https://docs.google.com/spreadsheets/d/1ScfcUkmf74G6eOoGGDV-3ZpQ0iqbOiVqF78SyEiPSoM/edit?usp=sharing).

            Tomato genes that match
            For 8 of the 17 Arabidopsis SR genes, a tomato (Sol Lyco) ID has already been identified (NOT from genome version SL5). See column SolyID.

            These are included in the spreadsheet. The other 9 do not have a match.

            Finding matches via Reciprocal Best Hit BLASTs
            Goal: to see what high-fidelity matches we can get by BLASTing the 17 Arabidopsis genes against the tomato SL5 protein sequences.
            Pull the Arabidopsis sequences from TAIR and save as FASTA on the cluster.
            BLAST all Soly protein sequences against these 17 sequences (tblastn).
            BLAST the 17 against the Soly proteins (blastx).
            Identify best matches somehow.
            Identify 1:1, one-to-many, and many-to-one relationships.
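
The "identify best matches" step could be sketched as a reciprocal-best-hit filter. This is only an illustration, assuming both searches are saved in BLAST tabular format (-outfmt 6), where column 1 is the query ID, column 2 the subject ID, and column 12 the bit score. The function names are hypothetical and this is not the script that was run on the cluster.

```shell
# best_hits FILE: print "query<TAB>subject" for the highest bit score per query.
# Assumes BLAST tabular output (-outfmt 6): $1 = query, $2 = subject, $12 = bit score.
# On ties, the first hit listed for a query is kept.
best_hits() {
    awk -F'\t' '
        !($1 in best) || $12 + 0 > best[$1] { best[$1] = $12 + 0; hit[$1] = $2 }
        END { for (q in hit) print q "\t" hit[q] }
    ' "$1" | sort
}

# reciprocal_best_hits A_VS_B B_VS_A: print pairs that are each other's best hit.
reciprocal_best_hits() {
    fwd=$(mktemp) && rev=$(mktemp) || return 1
    best_hits "$1" > "$fwd"
    # Swap columns so the reverse search reads "A<TAB>B" like the forward one.
    best_hits "$2" | awk -F'\t' '{ print $2 "\t" $1 }' | sort > "$rev"
    comm -12 "$fwd" "$rev"    # lines present in both sorted files
    rm -f "$fwd" "$rev"
}
```

Called with the two directions of the same search type, it prints only the 1:1 pairs; queries that drop out are the candidates for one-to-many or many-to-one relationships.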

            Step 1:
            The 17 arabid sequence IDs for SR:

            At1g09140
            At1g02840
            At3g49430
            A14902430
            At1g23860
            At4g31580
            At2g24590
            At5g64200
            At5g18810
            At3g55460
            At3g13570
            At1g55310
            At3g53500
            At2g37340
            At2g46610
            At3g61860
            At4g25500
            At5g52040

            These are fed into TAIR here:
            https://www.arabidopsis.org/tools/bulk/sequences/
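
Because the ID list came from OCR, a quick format check before submitting it to TAIR may catch transcription errors. The sketch below assumes the usual AGI locus pattern ("AT", a chromosome code of 1-5, C, or M, then "G" and five digits, in any letter case); the function name `check_agi_ids` is hypothetical.

```shell
# check_agi_ids FILE: print every line that is NOT a well-formed AGI locus ID.
# Assumed pattern: AT + chromosome (1-5, C, M) + G + five digits, case-insensitive.
# Blank lines are also reported, since they do not match the pattern.
check_agi_ids() {
    grep -v -i -E '^AT[1-5CM]G[0-9]{5}$' "$1"
}
```

With one ID per line in the input file, any line printed by `check_agi_ids` should be compared against the figure by eye.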

            Cluster Location: /nobackup/tomato_genome/alt_splicing/SR-proteins

            Make BLAST DB on the cluster:
            module load blast

            makeblastdb -in SRgenes-arabid.fna -input_type fasta -dbtype nucl

            makeblastdb -in SRproteins.faa -input_type fasta -dbtype prot
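
With the databases built, each search could be run with tabular output so the results are easy to filter with awk. The sketch below is a dry run by default: it only prints the command unless DRY_RUN=0 is set. The query file name `SL5pep.faa`, the output file name, and the e-value default are assumptions for illustration, not the values used on the cluster.

```shell
# run_blast PROGRAM QUERY DB OUT: print (or run) one tabular BLAST search.
# -outfmt 6 is the 12-column tab-delimited format; the e-value cutoff is an assumed default.
# File names must not contain spaces, because the command is built as a plain string.
run_blast() {
    cmd="$1 -query $2 -db $3 -outfmt 6 -evalue ${EVALUE:-1e-5} -out $4"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"    # default: show the command without running it
    else
        $cmd           # requires BLAST+ on PATH (e.g. after "module load blast")
    fi
}

# Hypothetical example: tomato proteins as query against the SR protein database.
run_blast blastp SL5pep.faa SRproteins.faa blastp-SL5pep-vs-SRproteins.txt
```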

            robofjoy Robert Reid added a comment -

            Reran the BLAST script after noticing that OCR had turned a 4 into an "A" in one of the IDs, so one sequence was missing.

            Pulled the sequence
            AT4G02430.4 | Symbols: SR34b, At-SR34b | Serine/Arginine-Rich Protein Splicing Factor 34b | chr4:1069186-1071366 FORWARD LENGTH=292

            Now there are 18 Arabid sequences to explore, instead of 17.

            Blastn produces only 4 hits total. Ignoring this for now, as the protein-level BLAST searches produced far more hits.

            Blast results are on HPC cluster at this location:
            /nobackup/tomato_genome/alt_splicing/SR-proteins

            Results are these files (tab-delimited):
            rw-rw--- 1 rreid2 tomato_genome 296 Oct 12 11:25 blastn-SR17cds-vs-SL5genes.txt
            rw-rw--- 1 rreid2 tomato_genome 296 Oct 12 11:25 blastn-SL5genes-vs-SR17cds.txt
            rw-rw--- 1 rreid2 tomato_genome 17K Oct 12 11:25 blastx-SR17cds-vs-SL5pep.txt
            rw-rw--- 1 rreid2 tomato_genome 29K Oct 12 11:26 tblastn-SL5pep-vs-SR17cds.txt
            rw-rw--- 1 rreid2 tomato_genome 17K Oct 12 11:26 blastp-SR17pep-vs-SL5pep.txt
            rw-rw--- 1 rreid2 tomato_genome 32K Oct 12 11:27 blastp-SL5pep-vs-SR17pep.txt
            rw-rw--- 1 rreid2 tomato_genome 32K Oct 12 11:32 blastx-SL5cds-vs-SR17pep.txt
            rw-rw--- 1 rreid2 tomato_genome 17K Oct 12 11:32 tblastn-SR17pep-vs-SL5cds.txt

            How many hits total via each blast?
            for f in *.txt; do wc -l $f ; done
            4 blastn-SL5genes-vs-SR17cds.txt
            4 blastn-SR17cds-vs-SL5genes.txt
            453 blastp-SL5pep-vs-SR17pep.txt
            235 blastp-SR17pep-vs-SL5pep.txt
            447 blastx-SL5cds-vs-SR17pep.txt
            227 blastx-SR17cds-vs-SL5pep.txt
            397 tblastn-SL5pep-vs-SR17cds.txt
            240 tblastn-SR17pep-vs-SL5cds.txt

            note:
            SR17 = the 18 arabidopsis SR CDS / AA sequences
            SL5 = the tomato genes from reference version SL5

            BLASTP (protein vs. protein) produces 453 hits with the SL5 proteins as the query and 235 hits with the SR proteins as the query.

            Of the 453, number of unique tomato gene (Soly) IDs:
            awk '{ print $1 }' blastp-SL5pep-vs-SR17pep.txt | sort | uniq | wc -l
            114
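
Line counts overstate the number of distinct genes, because one query can hit several subjects. A small helper could report total hits alongside unique query and subject IDs for each file. This is a sketch with a hypothetical function name, assuming tab-delimited BLAST output with the query ID in column 1 and the subject ID in column 2.

```shell
# blast_summary FILE: print "hits unique_queries unique_subjects" for a
# tab-delimited BLAST result (column 1 = query ID, column 2 = subject ID).
blast_summary() {
    awk -F'\t' '
        NF >= 2 { hits++; q[$1] = 1; s[$2] = 1 }
        END {
            nq = 0; for (k in q) nq++
            ns = 0; for (k in s) ns++
            print hits + 0, nq, ns
        }
    ' "$1"
}
```

It could be run over every result file with `for f in *.txt; do echo "$f $(blast_summary "$f")"; done`.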

            robofjoy Robert Reid added a comment -

            BLAST RESULTS
            Located at: /nobackup/tomato_genome/alt_splicing/SR-proteins

            8 BLAST files:
            for f in *.txt; do echo $f; done
            blastn-SL5genes-vs-SR17cds.txt
            blastn-SR17cds-vs-SL5genes.txt
            blastp-SL5pep-vs-SR17pep.txt
            blastp-SR17pep-vs-SL5pep.txt
            blastx-SL5cds-vs-SR17pep.txt
            blastx-SR17cds-vs-SL5pep.txt
            tblastn-SL5pep-vs-SR17cds.txt
            tblastn-SR17pep-vs-SL5cds.txt

            Number of hits per file:
            for f in *.txt; do wc -l $f ; done
            4 blastn-SL5genes-vs-SR17cds.txt
            4 blastn-SR17cds-vs-SL5genes.txt
            453 blastp-SL5pep-vs-SR17pep.txt
            235 blastp-SR17pep-vs-SL5pep.txt
            447 blastx-SL5cds-vs-SR17pep.txt
            227 blastx-SR17cds-vs-SL5pep.txt
            397 tblastn-SL5pep-vs-SR17cds.txt
            240 tblastn-SR17pep-vs-SL5cds.txt

            Protein to Protein Blasts produce the most hits!

            Number of Soly ID hits in Protein vs. Protein = 453

            Could explore all 114 of these Soly IDs?

            Or could explore the top hits:

            Number of unique Soly IDs with a hit where percent identity is greater than 50%, 75%, and 90%:

            awk '{ if ($3 > 50) print $1 }' blastp-SL5pep-vs-SR17pep.txt | sort | uniq | wc -l
            40
            awk '{ if ($3 > 75) print $1 }' blastp-SL5pep-vs-SR17pep.txt | sort | uniq | wc -l
            14
            awk '{ if ($3 > 90) print $1 }' blastp-SL5pep-vs-SR17pep.txt | sort | uniq | wc -l
            0 !!!!!

            Number of unique Soly IDs with a hit where the E-value is less than 1e-10, etc.:
            awk '{ if ($11 < 1e-10) print $1 }' blastp-SL5pep-vs-SR17pep.txt | sort | uniq | wc -l
            114
            awk '{ if ($11 < 1e-50) print $1 }' blastp-SL5pep-vs-SR17pep.txt | sort | uniq | wc -l
            36
            awk '{ if ($11 < 1e-6) print $1 }' blastp-SL5pep-vs-SR17pep.txt | sort | uniq | wc -l
            6
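
The identity and E-value cutoffs could also be applied together rather than one at a time. The sketch below assumes the standard tabular columns (3 = percent identity, 11 = E-value) and forces numeric comparison by adding 0 to each field; the function name and the thresholds in the usage line are illustrative only.

```shell
# count_queries FILE MIN_IDENTITY MAX_EVALUE: count unique query IDs (column 1)
# that have at least one hit with identity > MIN_IDENTITY and E-value < MAX_EVALUE.
count_queries() {
    awk -F'\t' -v min_id="$2" -v max_e="$3" '
        ($3 + 0) > (min_id + 0) && ($11 + 0) < (max_e + 0) { seen[$1] = 1 }
        END { n = 0; for (k in seen) n++; print n }
    ' "$1"
}
```

For example, `count_queries blastp-SL5pep-vs-SR17pep.txt 50 1e-10` would count the Soly IDs that pass both filters. Loosening the E-value cutoff can never lower the count, which is a useful sanity check on the numbers.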

            Would these be good criteria for starting to explore genes of interest?

            ann.loraine Ann Loraine added a comment -

            For Ann-curated Arabidopsis SR protein gene names, symbols, and descriptions, download:

            http://lorainelab-quickload.scidas.org/quickload/A_thaliana_Jun_2009/TAIR10.bed.gz

            To extract the gene identifiers from the uncompressed file, you can do something like:

            grep "SR protein" TAIR10.bed | cut -f4 | cut -f1 -d . | sort | uniq | wc -l

            which confirms there are 18 SR protein genes in Arabidopsis TAIR10 annotations.

            Note:

            The igbquickload.org site is invisible within the UNC Charlotte network due to networking issues.
            ann.loraine Ann Loraine added a comment -

            The TAIR10.bed file contains 41 gene models from 18 genes encoding SR proteins, as identified in the "nomenclature" article mentioned above.

            In this file, the following fields are relevant:

            • field 4 - the gene model identifier, e.g., AT1G02840.1
            • field 13 - the nomenclature-approved gene symbol, e.g., At-SR34
            • field 14 - Ann's easy-to-read, easy-to-search-for human-readable description of the gene, e.g., RNA-binding SR protein At-SR34 subfamily SR

            Note that if there are multiple gene models (splice variants) for a given gene, they each get their own line in the "bed" file. Also, they all should have the same data in their 13th and 14th fields. Also, they should have the same AGI "locus" name (e.g., AT1G02840) but different transcript names (e.g., AT1G02840.1)
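
Given those fields, a per-locus table could be built from the BED file. This is a sketch under the assumptions stated above (tab-delimited file, gene model ID in field 4, symbol in field 13, description in field 14) plus one more: that SR genes can be selected by the text "SR protein" in the description. The function name is hypothetical.

```shell
# sr_gene_table BED: print "locus<TAB>symbol<TAB>n_models<TAB>description",
# one line per SR protein gene, sorted by locus.
sr_gene_table() {
    awk -F'\t' '
        $14 ~ /SR protein/ {
            locus = $4
            sub(/\.[0-9]+$/, "", locus)    # AT1G02840.1 -> AT1G02840
            n[locus]++                     # gene models (splice variants) per locus
            sym[locus] = $13
            desc[locus] = $14
        }
        END { for (g in n) print g "\t" sym[g] "\t" n[g] "\t" desc[g] }
    ' "$1" | sort
}
```

Running `sr_gene_table TAIR10.bed | wc -l` should agree with the gene count from the grep pipeline, and the third column should sum to the number of gene models.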

            ann.loraine Ann Loraine added a comment -

            Note that some of the peptides Rob got from the TAIR (arabidopsis.org) Web site might not be in the above BED file. If so, please notify Ann Loraine. If this turns out to be a problem, we can instead use the data from the "Araport11" BED file, which (probably) contains more gene models for individual SR protein genes.

            Please also see: https://bitbucket.org/lorainelab/arabidopsis-annotation-update/src/master/ - the git repository containing code Ann used to make BED files for IGB which included improved annotations for SR protein genes.

            Mdavis4290 Molly Davis added a comment -

            SR Splicing Factors file location: https://docs.google.com/spreadsheets/d/1ScfcUkmf74G6eOoGGDV-3ZpQ0iqbOiVqF78SyEiPSoM/edit?usp=sharing
            ann.loraine Ann Loraine added a comment -

            Downloaded spreadsheet as Excel file and added to external datasets folder. Committed to main branch.


              People

              • Assignee:
                robofjoy Robert Reid
                Reporter:
                ann.loraine Ann Loraine