IGBF-3145

Transfer SL4.0 gene descriptions to SL5.0 annotations

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels: None
    • Story Points: 0.25
    • Sprint: Summer 5 2022 July 18, Summer 6 2022 Aug 1, Fall 1 2022 Aug 15, Fall 2 2022 Sep 5, Fall 3 2022 Sep 26, Fall 4 2022 Oct 10, Fall 5 2022 Oct 24, Fall 6 2022 Nov 7

      Description

      Git repository for this sub-project: https://bitbucket.org/hotpollen/splicing-analysis/src/master/

      Use the gene mapping table to add gene descriptions to the new bed-detail annotations file for the SL5.0 genome assembly release.

      Specifically:

      The 13th column of this file contains a gene identifier. The 4th column contains a transcript identifier. The final (14th) column contains "NA" for "Not Available." We would like it to contain a description of the gene instead. Since we don't have descriptions for the SL5.0 genes, we would like to insert the description of each gene's counterpart in the SL4.0 annotations.

      To map genes from SL4 onto SL5, you need a mapping file. The people who made SL5.0 gave us that mapping file back in July 2022:

      Dear Ann,
      Yes, we do have tables for the conversion between different versions. They are now available on our website (http://solomics.agis.org.cn/tomato/ftp/ID_convert/).
      Thanks so much for the integration! I would be happy to forward it to my colleagues when it is available.
      Best regards,
      Yao

      Mapping files needed are also available in: /nobackup/tomato_genome/alt_splicing/mappingfiles
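
The transfer described above can be sketched in shell with awk. This is only an illustration under stated assumptions, not the method used in MappingNames.Rmd: the mapping table is assumed to be a two-column, tab-separated file (SL4 gene ID, then SL5 gene ID), and the SL4 bed-detail file is assumed to carry its gene ID in column 13 and its description in column 14. The function name add_descriptions is hypothetical.

```shell
#!/bin/bash
# Sketch only. The mapping layout (column 1 = SL4 gene ID, column 2 = SL5
# gene ID) is an assumption; check the real ID_convert table first.

# add_descriptions MAPPING_TSV SL4_BED SL5_BED
# Prints the SL5 bed-detail rows with column 14 filled in from SL4.
add_descriptions() {
  awk -F'\t' -v OFS='\t' '
    # File 1 (mapping): remember which SL4 gene each SL5 gene maps to.
    FILENAME == ARGV[1] { sl4_for[$2] = $1; next }
    # File 2 (SL4 bed-detail): remember the description of each SL4 gene.
    FILENAME == ARGV[2] { desc[$13] = $14; next }
    # File 3 (SL5 bed-detail): replace "NA" when the SL4 counterpart
    # has a usable description; otherwise leave the row unchanged.
    {
      sl4 = sl4_for[$13]
      if ($14 == "NA" && sl4 != "" && desc[sl4] != "" && desc[sl4] != "NA")
        $14 = desc[sl4]
      print
    }
  ' "$1" "$2" "$3"
}
```

If one SL5 gene maps to several SL4 genes, this sketch keeps only the last mapping it reads; a real implementation would need an explicit rule for such cases.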

            Activity

            ann.loraine Ann Loraine added a comment - - edited

            Potentially useful code is available here:

            https://bitbucket.org/hotpollen/flavonoid-rnaseq/src/main/
            https://bitbucket.org/hotpollen/rna-seq/src/master/

            Check out the code used to read the BED annotation files.

            ann.loraine Ann Loraine added a comment -

            In the "bed-detail" file from IGB Quickload:

            • Each row represents a transcript (also called a "gene model")
            • Field 4 is the transcript identifier, one per row
            • Field 13 is the gene identifier.
            Mdavis4290 Molly Davis added a comment - - edited

            Needs review:

            I committed new code to Bitbucket for the file 'MappingNames.Rmd'.

            https://bitbucket.org/mdavis4290/splicing-analysis/src/master/DescriptionMapping/MappingNames.Rmd

            [~aloraine]

            Currently working on code for checking merge.
            Still learning git commands!

            ann.loraine Ann Loraine added a comment - - edited

            [~molly] - please submit PR from your fork's master branch to the team repository master branch. Use the Bitbucket interface; google "how to submit PR using bitbucket" for tips if required.

            Mdavis4290 Molly Davis added a comment -

            Made a pull request to the master branch for the file 'MappingNames.Rmd'.

            Notes:

            • Added the correct files with all of the data included.
            • Added write commands to save the output files to the local machine.
            • The code creates two files. The first includes the exact columns Ann requested. The second looks like the original SL5 BED file, but now with the descriptions.

            [~aloraine]

            ann.loraine Ann Loraine added a comment - - edited

            Testing "MappingNames.Rmd" in https://bitbucket.org/hotpollen/splicing-analysis/src/master/DescriptionMapping/.

            An issue:

            • The R Markdown file fails to compile (knit) because the input files are not present on my system. Paths are hard-coded to a location that does not exist on my computer:

            SL5.fname="/Users/mollydavis333/Desktop/S_lycopersicum_Jun_2022.bed"
            SL4.fname="/Users/mollydavis333/Desktop/S_lycopersicum_Sep_2019.bed"
            ...
            mapping.fname="/Users/mollydavis333/Desktop/SL4-SL5_covert.tsv"

            This can be fixed by reading the files through relative paths, since they are version-controlled in splicing_analysis/ExternalData.
            R can read gzip-compressed files directly, so there is no need to uncompress them first. Google "read compressed data file into R" to see how this can be done.

            attn: [~molly]

            Mdavis4290 Molly Davis added a comment -

            Link to the updated mapping file on Bitbucket, in Molly's fork:

            https://bitbucket.org/mdavis4290/splicing-analysis/src/master/DescriptionMapping/MappingNames.Rmd

            Not sure why there is a conflict. I have committed my changes, and when I try 'git merge remotes/origin/master' it says that it is already up to date.

            I will need help making the pull request with the conflict in the way.

            [~aloraine]

            ann.loraine Ann Loraine added a comment -

            PR is merged and code is ready for review and testing.

            ann.loraine Ann Loraine added a comment -

            The .Rmd file can't "knit".
            I think this is due to the final line of one of the code chunks in which a large data frame is being printed to the document.
            So I ran the file interactively, chunk by chunk.
            This resulted in creating two data files, one with identifiers and another that tries to be a BED file.
            However, when I opened the BED file in IGB, I got an error:

            09:03:26.741 ERROR c.a.igb.view.load.GeneralLoadUtils - Error in loadOnSequence
            java.lang.NumberFormatException: For input string: ""270"
            at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[na:1.8.0_332]
            at java.lang.Integer.parseInt(Integer.java:569) ~[na:1.8.0_332]
            at java.lang.Integer.parseInt(Integer.java:615) ~[na:1.8.0_332]
            at com.affymetrix.genometry.parsers.BedParser.parseIntArray(BedParser.java:414) ~[genometry-9.1.10.jar:na]
            at org.lorainelab.igb.bed.BedSymloader.parseLine(BedSymloader.java:310) ~[na:na]
            at org.lorainelab.igb.bed.BedSymloader.parseLines(BedSymloader.java:184) ~[na:na]
            at org.lorainelab.igb.bed.BedSymloader.parse(BedSymloader.java:170) ~[na:na]
            at org.lorainelab.igb.bed.BedSymloader.parse(BedSymloader.java:146) ~[na:na]
            at org.lorainelab.igb.bed.BedSymloader.getRegion(BedSymloader.java:139) ~[na:na]
            at com.affymetrix.genometry.quickload.QuickLoadSymLoader.getRegion(QuickLoadSymLoader.java:287) ~[genometry-9.1.10.jar:na]
            at com.affymetrix.genometry.quickload.QuickLoadSymLoader.loadAndAddSymmetries(QuickLoadSymLoader.java:164) ~[genometry-9.1.10.jar:na]
            at com.affymetrix.genometry.quickload.QuickLoadSymLoader.loadSymmetriesThread(QuickLoadSymLoader.java:139) ~[genometry-9.1.10.jar:na]
            at com.affymetrix.genometry.quickload.QuickLoadSymLoader.loadFeatures(QuickLoadSymLoader.java:119) ~[genometry-9.1.10.jar:na]
            at com.affymetrix.igb.view.load.GeneralLoadUtils.loadFeaturesForSym(GeneralLoadUtils.java:749) ~[igb-9.1.10.jar:na]
            at com.affymetrix.igb.view.load.GeneralLoadUtils$1.loadOnSequence(GeneralLoadUtils.java:664) [igb-9.1.10.jar:na]
            at com.affymetrix.igb.view.load.GeneralLoadUtils$1.lambda$multiThreadedLoad$178(GeneralLoadUtils.java:607) [igb-9.1.10.jar:na]
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_332]
            at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_332]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_332]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_332]
            at java.lang.Thread.run(Thread.java:750) ~[na:1.8.0_332]

            I looked at the BED file in Terminal and observed that many of the fields have quotation marks around them.
            This is likely the cause of the error, as IGB reports failing with a NumberFormatException when attempting to parse the string "270 (note the leading quotation mark).

            The table output command needs to indicate that no quotation marks will be used.
            Moving back to "To-Do"

            Next steps:

            • Modify the output file format to not wrap any of the fields with quotation marks.
            • Modify the code chunk to no longer print out an entire data frame to the final knitted document
            • Ensure that the file can "knit" without error
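
The first step can be checked from the command line. In R, write.table wraps character fields in quotation marks by default, and passing quote = FALSE turns this off. The shell sketch below is a way to confirm that an output file is free of wrapping quotes, plus a stopgap that strips them; it is not a replacement for fixing the write command in the .Rmd file. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch only: helpers for inspecting a tab-separated BED file.

# count_quoted_lines BED
# Prints how many lines contain at least one double quote (0 is the goal).
count_quoted_lines() {
  grep -c '"' "$1" || true
}

# strip_field_quotes BED
# Prints the file with quotes removed when they wrap an entire
# tab-separated field; quotes inside a field are left untouched.
strip_field_quotes() {
  awk -F'\t' -v OFS='\t' '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^".*"$/) $i = substr($i, 2, length($i) - 2)
    }
    print
  }' "$1"
}
```

For example, a block-sizes field written as "270," (with the quotes) comes out of strip_field_quotes as 270, which an integer parser can read.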
            ann.loraine Ann Loraine added a comment -

            Moving the changes requested above into a new ticket and closing this one.

            ann.loraine Ann Loraine added a comment - - edited

            Update:

            • Aligned SL4 gene models onto SL5 genome assembly using blat.sh:
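            #!/bin/bash
            # Align SL4 cDNA sequences (query) onto the SL5 genome (2bit database).
            G=S_lycopersicum_Jun_2022
            D=$G.2bit
            Q=S_lycopersicum_Sep_2019_models_cDNA.fa
            PSL=SL42SL5.psl
            # Maximum intron size allowed within an alignment.
            MI=15000
            blat -noTrimA -maxIntron=$MI -noHead -minIdentity=95 -dots=100 "$D" "$Q" "$PSL"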
            
            • Sorted and tabix-indexed PSL output file with:
            # PSL columns: 14 = target name, 16 = target start, 17 = target end
            sort -k14,14 -k16,16n SL42SL5.psl | bgzip -c > SL42SL5.psl.gz
            tabix -s 14 -b 16 -e 17 SL42SL5.psl.gz
            
            • Added output and code to repository - bitbucket.org/hotpollen/splicing-analysis.git
            ann.loraine Ann Loraine added a comment -

            Reference: https://bitbucket.org/lorainelab/affyprobesetsforigb/src/master/ (documents how to make a tabix-indexed file from PSL blat output)

            ann.loraine Ann Loraine added a comment -

            Update:

            Modifying SL5 description field to include SL4 locus identifier, for example:

            hexokinase-1 protein (AHRD V3.3 *** AT1G05205.1)

            becomes:

            hexokinase-1 protein (AHRD V3.3 *** AT1G05205.1) ITAG4.0:Solyc07g052420.3
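
A minimal sketch of this modification with awk, assuming a two-column, tab-separated mapping file (SL4 locus ID, then SL5 gene ID) and a bed-detail file with the gene ID in column 13 and the description in column 14. The mapping layout and the function name append_sl4_locus are assumptions made for illustration; the code in the repository may work differently.

```shell
#!/bin/bash
# Sketch only. Assumed mapping layout: column 1 = SL4 locus ID,
# column 2 = SL5 gene ID.

# append_sl4_locus MAPPING_TSV SL5_BED
# Prints the SL5 bed-detail rows, appending " ITAG4.0:<SL4 locus>" to the
# description (column 14) of every gene that has an SL4 counterpart.
append_sl4_locus() {
  awk -F'\t' -v OFS='\t' '
    # File 1 (mapping): index SL4 locus IDs by SL5 gene ID.
    FILENAME == ARGV[1] { sl4[$2] = $1; next }
    # File 2 (bed-detail): tag the description when a counterpart exists.
    ($13 in sl4) { $14 = $14 " ITAG4.0:" sl4[$13] }
    { print }
  ' "$1" "$2"
}
```

Rows whose gene has no entry in the mapping file pass through unchanged.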

            ann.loraine Ann Loraine added a comment - - edited

            Added new file to svn repository after creating tabix-indexed file with:

            sort -k1,1 -k2,2n ~/src/splicing-analysis2/DescriptionMapping/output/S_lycopersicum_Jun_2022.bed | bgzip > S_lycopersicum_Jun_2022.bed.gz 
            tabix -s 1 -b 2 -e 3 S_lycopersicum_Jun_2022.bed.gz
            

            svn repo info:

            • Browse by visiting https://svn.bioviz.org/viewvc/
            • URL: https://svn.bioviz.org/repos/genomes/quickload/S_lycopersicum_Jun_2022
            • Repository Root: https://svn.bioviz.org/repos/genomes
            • To check out the repo using the read-only user, enter user name "guest" and password "guest"

            ann.loraine Ann Loraine added a comment - - edited

            Updated quickload sites on RENCI and UNCC hosting.

            At UNCC, logged in with:

            ssh -J aloraine@cci-jump.uncc.edu -p 1657 aloraine@igbquickload.org
            

            At RENCI, logged in with:

            ssh -J aloraine@hop.renci.org aloraine@lorainelab-quickload.scidas.org
            

              People

              • Assignee: Molly Davis
              • Reporter: Ann Loraine
              • Votes: 0
              • Watchers: 2
