[IGBF-3145] Transfer SL4.0 gene descriptions to SL5.0 annotations - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
0.25
Epic Link:
Support NSF pollen grant
Sprint:
Summer 5 2022 July 18, Summer 6 2022 Aug 1, Fall 1 2022 Aug 15, Fall 2 2022 Sep 5, Fall 3 2022 Sep 26, Fall 4 2022 Oct 10, Fall 5 2022 Oct 24, Fall 6 2022 Nov 7

Description

Git repository for this sub-project: https://bitbucket.org/hotpollen/splicing-analysis/src/master/

Use the gene mapping table to add gene descriptions to the new BED-detail annotation file for the SL5.0 genome assembly release.

Specifically:

Get the BED-detail gene model file from the IGB Quickload repository: http://lorainelab-quickload.scidas.org/quickload/S_lycopersicum_Jun_2022/S_lycopersicum_Jun_2022.bed.gz

The 13th column of this file contains a gene identifier. The 4th column contains a transcript identifier. The final (14th) column contains "NA" for "Not Available." We would like it to contain a description of the gene instead. Since SL5.0 descriptions are not available, we would like to insert the description of each gene's counterpart from the SL4.0 annotations.
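
To make the column layout concrete, here is a minimal shell sketch that prints the transcript identifier (column 4), gene identifier (column 13), and description (column 14) from a gzip-compressed BED-detail file. The function name `bed_detail_ids` is hypothetical; the column positions are the ones stated above.

```shell
#!/bin/sh
# Print transcript ID (column 4), gene ID (column 13), and description
# (column 14) from a gzip-compressed, tab-delimited BED-detail file.
# Usage: bed_detail_ids FILE.bed.gz
bed_detail_ids() {
  gzip -cd -- "$1" | awk -F'\t' -v OFS='\t' '{ print $4, $13, $14 }'
}
```

For example, `bed_detail_ids S_lycopersicum_Jun_2022.bed.gz | head` should list each transcript with its gene and an "NA" description before any transfer has been done.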

Note that the SL4 annotations can be found here: http://lorainelab-quickload.scidas.org/quickload/S_lycopersicum_Sep_2019/S_lycopersicum_Sep_2019.bed.gz

To map genes from SL4 onto SL5, you need a mapping file. The people who made SL5.0 gave us that mapping file back in July 2022:

Dear Ann,
Yes, we do have tables for the conversion between different versions. They are now available on our website (http://solomics.agis.org.cn/tomato/ftp/ID_convert/).
Thanks so much for the integration! I would be happy to forward it to my colleagues when it is available.
Best regards,
Yao

Mapping files needed are also available in: /nobackup/tomato_genome/alt_splicing/mappingfiles
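
The transfer described above can be sketched as a three-file join in awk. This is only an illustration under stated assumptions, not the project's R code: the mapping table is assumed to be a two-column, tab-delimited file (SL4 gene ID, then SL5 gene ID), which may not match the real conversion tables, and all three inputs are assumed to be uncompressed. The function name `transfer_descriptions` is hypothetical.

```shell
#!/bin/sh
# Usage: transfer_descriptions MAP.tsv SL4.bed SL5.bed
# ASSUMPTION: MAP.tsv has two tab-separated columns: SL4 gene ID, SL5 gene ID.
# Writes the SL5 BED-detail rows to stdout with column 14 replaced by the
# SL4 description whenever a mapped SL4 gene has one.
transfer_descriptions() {
  awk -F'\t' -v OFS='\t' '
    FILENAME == ARGV[1] { sl5_to_sl4[$2] = $1; next }   # mapping table
    FILENAME == ARGV[2] { desc[$13] = $14; next }       # SL4 annotations
    {                                                   # SL5 annotations
      old = sl5_to_sl4[$13]
      if (old != "" && (old in desc) && desc[old] != "NA") $14 = desc[old]
      print
    }
  ' "$1" "$2" "$3"
}
```

Rows whose gene has no SL4 counterpart keep "NA" in column 14, so the output has the same number of rows and columns as the SL5 input.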

Issue Links

is blocked by

IGBF-3135 Add new tomato genome and annotations to IGB Quickload repository

Closed

relates to

IGBF-3229 Complete gene description transfer

Closed

Activity

Ann Loraine added a comment - 26/Aug/22 12:30 PM - edited

Potentially useful code is available here:

https://bitbucket.org/hotpollen/flavonoid-rnaseq/src/main/
https://bitbucket.org/hotpollen/rna-seq/src/master/

Check out the code used to read the BED annotation files.

Ann Loraine added a comment - 01/Sep/22 11:25 AM

In the "bed-detail" file from IGB Quickload:

Each row represents a transcript (also called a "gene model")
Field 4 is a transcript identifier, one per row
Field 13 is the gene identifier.

Molly Davis added a comment - 20/Sep/22 11:50 AM - edited

Needs review:

I committed new code to Bitbucket for the file 'MappingNames.Rmd'.

https://bitbucket.org/mdavis4290/splicing-analysis/src/master/DescriptionMapping/MappingNames.Rmd

[~aloraine]

Currently working on code for checking merge.
Still learning git commands!

Ann Loraine added a comment - 21/Sep/22 12:23 PM - edited

[~molly] - please submit PR from your fork's master branch to the team repository master branch. Use the Bitbucket interface; google "how to submit PR using bitbucket" for tips if required.

Molly Davis added a comment - 29/Sep/22 1:23 PM

Made a pull request for the master branch for file 'MappingNames.Rmd'.

Notes include:

Added the correct files with all of the data included.
Added write commands to save the output files to the local machine.
The code creates two files. The first includes the exact columns Ann requested. The second matches the original SL5 BED file, but with the descriptions added.

[~aloraine]

Ann Loraine added a comment - 20/Oct/22 3:37 PM - edited

Testing "MappingNames.Rmd" in https://bitbucket.org/hotpollen/splicing-analysis/src/master/DescriptionMapping/.

An issue:

Markdown fails to compile (knit) due to files not being present on my system. Paths are hard-coded to a location not on my computer:

SL5.fname="/Users/mollydavis333/Desktop/S_lycopersicum_Jun_2022.bed"
SL4.fname="/Users/mollydavis333/Desktop/S_lycopersicum_Sep_2019.bed"
...
mapping.fname="/Users/mollydavis333/Desktop/SL4-SL5_covert.tsv"

This can be fixed by reading the files through relative paths, since they are version-controlled in splicing-analysis/ExternalData.
R can open gzip-compressed files directly, so there is no need to uncompress them first. Google "read compressed data file into R" or similar to see how this can be done.
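
The same idea at the command line, as a small sketch (the project's code is R; this shell helper is only an illustration, and the name `read_bed` is hypothetical): stream a file whether or not it is compressed, addressed by a path relative to the repository rather than a user's Desktop.

```shell
#!/bin/sh
# Write the contents of a BED file to stdout, decompressing on the fly
# when the name ends in .gz. No uncompressed copy is left on disk.
read_bed() {
  case "$1" in
    *.gz) gzip -cd -- "$1" ;;
    *)    cat -- "$1" ;;
  esac
}
# Example, run from the repository root. The file name inside ExternalData
# is an assumption based on the names used elsewhere in this ticket:
# read_bed ExternalData/S_lycopersicum_Jun_2022.bed.gz | head -n 3
```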

attn: [~molly]

Molly Davis added a comment - 03/Nov/22 12:22 PM

Link to updated mapping file on bitbucket under Molly fork:

https://bitbucket.org/mdavis4290/splicing-analysis/src/master/DescriptionMapping/MappingNames.Rmd

Not sure why there is a conflict. I have committed changes and when I try to 'git merge remotes/origin/master' it says that it is already up to date.

Will need help making pull request with conflict in the way.

[~aloraine]

Ann Loraine added a comment - 15/Nov/22 9:50 AM

PR is merged and code is ready for review and testing.

Ann Loraine added a comment - 17/Nov/22 10:15 AM

The .Rmd file can't "knit".
I think this is due to the final line of one of the code chunks in which a large data frame is being printed to the document.
So I ran the file interactively, chunk by chunk.
This resulted in creating two data files, one with identifiers and another that tries to be a BED file.
However, when I opened the BED file in IGB, I got an error:

09:03:26.741 ERROR c.a.igb.view.load.GeneralLoadUtils - Error in loadOnSequence
java.lang.NumberFormatException: For input string: ""270"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[na:1.8.0_332]
at java.lang.Integer.parseInt(Integer.java:569) ~[na:1.8.0_332]
at java.lang.Integer.parseInt(Integer.java:615) ~[na:1.8.0_332]
at com.affymetrix.genometry.parsers.BedParser.parseIntArray(BedParser.java:414) ~[genometry-9.1.10.jar:na]
at org.lorainelab.igb.bed.BedSymloader.parseLine(BedSymloader.java:310) ~[na:na]
at org.lorainelab.igb.bed.BedSymloader.parseLines(BedSymloader.java:184) ~[na:na]
at org.lorainelab.igb.bed.BedSymloader.parse(BedSymloader.java:170) ~[na:na]
at org.lorainelab.igb.bed.BedSymloader.parse(BedSymloader.java:146) ~[na:na]
at org.lorainelab.igb.bed.BedSymloader.getRegion(BedSymloader.java:139) ~[na:na]
at com.affymetrix.genometry.quickload.QuickLoadSymLoader.getRegion(QuickLoadSymLoader.java:287) ~[genometry-9.1.10.jar:na]
at com.affymetrix.genometry.quickload.QuickLoadSymLoader.loadAndAddSymmetries(QuickLoadSymLoader.java:164) ~[genometry-9.1.10.jar:na]
at com.affymetrix.genometry.quickload.QuickLoadSymLoader.loadSymmetriesThread(QuickLoadSymLoader.java:139) ~[genometry-9.1.10.jar:na]
at com.affymetrix.genometry.quickload.QuickLoadSymLoader.loadFeatures(QuickLoadSymLoader.java:119) ~[genometry-9.1.10.jar:na]
at com.affymetrix.igb.view.load.GeneralLoadUtils.loadFeaturesForSym(GeneralLoadUtils.java:749) ~[igb-9.1.10.jar:na]
at com.affymetrix.igb.view.load.GeneralLoadUtils$1.loadOnSequence(GeneralLoadUtils.java:664) [igb-9.1.10.jar:na]
at com.affymetrix.igb.view.load.GeneralLoadUtils$1.lambda$multiThreadedLoad$178(GeneralLoadUtils.java:607) [igb-9.1.10.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_332]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_332]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_332]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_332]
at java.lang.Thread.run(Thread.java:750) ~[na:1.8.0_332]

I looked at the BED file in Terminal and observed that many of the fields have quotation marks around them.
This is likely the cause of the error, as IGB reports failing due to a NumberFormatException when attempting to parse "270

The table output command needs to indicate that no quotation marks will be used.
Moving back to "To-Do"

Next steps:

Modify the output file format to not wrap any of the fields with quotation marks.
Modify the code chunk to no longer print out an entire data frame to the final knitted document
Ensure that the file can "knit" without error
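
A quick way to confirm the first item from Terminal is to count the rows that still contain a double-quote character. This is a minimal sketch with a hypothetical function name, `count_quoted_lines`. The stack trace shows the failure inside `BedParser.parseIntArray`, which is consistent with a quoted comma-separated field such as the block sizes being split into `"270` and the rest. If the output is written with R's `write.table`, its `quote = FALSE` argument is the usual way to suppress the quotes; whether the Rmd uses that function is an assumption.

```shell
#!/bin/sh
# Print the number of lines in FILE that contain at least one double quote.
# A BED file that IGB can parse should report 0.
# Usage: count_quoted_lines FILE.bed
count_quoted_lines() {
  awk 'index($0, "\"") > 0 { n++ } END { print n + 0 }' "$1"
}
```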

Ann Loraine added a comment - 18/Nov/22 1:09 PM

Moving the changes requested above into a new ticket and closing this one.

Ann Loraine added a comment - 07/Feb/23 2:06 PM - edited

Update:

Aligned SL4 gene models onto SL5 genome assembly using blat.sh:

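#!/bin/bash
# Align SL4 gene model cDNA sequences onto the SL5 genome assembly.
G=S_lycopersicum_Jun_2022
D="$G.2bit"                               # target genome in 2bit format
Q=S_lycopersicum_Sep_2019_models_cDNA.fa  # query: SL4 gene model cDNA sequences
PSL=SL42SL5.psl                           # output alignments in PSL format
MI=15000                                  # maximum intron size
# -noHead omits the PSL header; -minIdentity=95 requires at least 95% identity
blat -noTrimA -maxIntron="$MI" -noHead -minIdentity=95 -dots=100 "$D" "$Q" "$PSL"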

Sorted and tabix-indexed PSL output file with:

sort -k14,14 -k16,16n SL42SL5.psl | bgzip -c > SL42SL5.psl.gz
tabix -s 14 -b 16 -e 17 SL42SL5.psl.gz
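
One way the PSL output could be summarized is by keeping, for each SL4 transcript, the alignment with the most matching bases. The sketch below is an assumption about how the alignments might be used, not a step recorded in this ticket, and `psl_best_hits` is a hypothetical name. It relies on the standard PSL column order: matches in column 1, query name in column 10, and target name, start, and end in columns 14, 16, and 17 (the same columns used by the sort and tabix commands above).

```shell
#!/bin/sh
# For each query sequence in a headerless PSL file, keep the alignment with
# the most matching bases and print: qName, tName, tStart, tEnd, matches.
# Ties keep the first alignment encountered.
# Usage: psl_best_hits FILE.psl
psl_best_hits() {
  awk -F'\t' -v OFS='\t' '
    !($10 in best) || ($1 + 0) > best[$10] {
      best[$10] = $1 + 0
      row[$10] = $10 OFS $14 OFS $16 OFS $17 OFS $1
    }
    END { for (q in row) print row[q] }
  ' "$1" | sort -k1,1
}
```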

Added output and code to repository - bitbucket.org/hotpollen/splicing-analysis.git

Ann Loraine added a comment - 07/Feb/23 2:32 PM

Reference: https://bitbucket.org/lorainelab/affyprobesetsforigb/src/master/ (documents how to make a tabix-indexed file from PSL blat output)

Ann Loraine added a comment - 10/Feb/23 12:55 PM

Update:

Modifying SL5 description field to include SL4 locus identifier, for example:

hexokinase-1 protein (AHRD V3.3 *** AT1G05205.1)

becomes:

hexokinase-1 protein (AHRD V3.3 *** AT1G05205.1) ITAG4.0:Solyc07g052420.3
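
The modification shown above can be sketched in awk as follows. This is an illustration only: the function name `append_sl4_id` and the two-column lookup file (SL5 gene ID, then SL4 locus ID) are assumptions, and the actual change was made in the project's own code.

```shell
#!/bin/sh
# Usage: append_sl4_id MAP.tsv SL5.bed
# ASSUMPTION: MAP.tsv has two tab-separated columns: SL5 gene ID, SL4 locus ID,
# and is not empty. Appends " ITAG4.0:<SL4 locus ID>" to the description
# (column 14) of every row whose gene ID (column 13) is in the lookup.
append_sl4_id() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { sl4[$1] = $2; next }                     # first file: lookup
    ($13 in sl4) { $14 = $14 " ITAG4.0:" sl4[$13] }      # second file: BED rows
    { print }
  ' "$1" "$2"
}
```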

Ann Loraine added a comment - 10/Feb/23 1:05 PM - edited

Added new file to svn repository after creating tabix-indexed file with:

sort -k1,1 -k2,2n ~/src/splicing-analysis2/DescriptionMapping/output/S_lycopersicum_Jun_2022.bed | bgzip > S_lycopersicum_Jun_2022.bed.gz 
tabix -s 1 -b 2 -e 3 S_lycopersicum_Jun_2022.bed.gz

svn repo info:

Browse by visiting https://svn.bioviz.org/viewvc/
svn repo URLs:

URL: https://svn.bioviz.org/repos/genomes/quickload/S_lycopersicum_Jun_2022
Repository Root: https://svn.bioviz.org/repos/genomes
To check out the repository as a read-only user, enter user name "guest" and password "guest".

Ann Loraine added a comment - 10/Feb/23 1:52 PM - edited

Updated quickload sites on RENCI and UNCC hosting.

At UNCC, logged in with:

ssh -J aloraine@cci-jump.uncc.edu -p 1657 aloraine@igbquickload.org

At RENCI, logged in with:

ssh -J aloraine@hop.renci.org aloraine@lorainelab-quickload.scidas.org

People

Assignee:

Molly Davis

Reporter:

Ann Loraine

Votes:

0

Watchers:

2

Dates

Created:

08/Jul/22 8:51 AM

Updated:

10/Feb/23 1:54 PM

Resolved:

18/Nov/22 1:09 PM