As requested, here are the parameters and inputs you should use to align the contig sequences against the SL4 and SL5 genome sequences.
- You need "target" genome sequence files in ".2bit" format. Please use these files as the target sequences:
SL5 genome assembly, in blat-friendly ".2bit" format: http://lorainelab-quickload.scidas.org/quickload/S_lycopersicum_Jun_2022/S_lycopersicum_Jun_2022.2bit
SL4 genome assembly, in blat-friendly ".2bit" format: http://lorainelab-quickload.scidas.org/quickload/S_lycopersicum_Sep_2019/S_lycopersicum_Sep_2019.2bit
- The output format you should use for blat is "pslx". This will ensure that the aligned sequence gets included in the output. If we do that, IGB will be able to display the sequence and we can more easily check that the results are good. To specify this output format, use this option: -out=pslx
- Please run blat using the option that does not include a header. That option is: -noHead
- Do not use the default maximum intron size. It is too big (blat defaults to 750,000 bases). Use 13,000 bases. That option is: -maxIntron=13000
For more information about running blat, see this link: https://genome.ucsc.edu/goldenPath/help/blatSpec.html
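To make the options above concrete, here is a minimal sketch of how the blat command line could be assembled in Python. It assumes the ".2bit" file has already been downloaded to the working directory; the query and output file names (contigs.fa, contigs.SL5.pslx) and the function name are placeholders I made up, not names from this project.

```python
import shlex


def build_blat_command(target_2bit, query_fasta, output_pslx, max_intron=13000):
    """Return the blat argument list for one target/query pair.

    blat expects: blat <database> <query> [options] <output>
    """
    return [
        "blat",
        target_2bit,                  # target genome in .2bit format
        query_fasta,                  # contig sequences to align
        "-out=pslx",                  # include aligned sequence in the output
        "-noHead",                    # suppress the psl header
        f"-maxIntron={max_intron}",   # override the large default
        output_pslx,
    ]


# Hypothetical file names, for illustration only.
cmd = build_blat_command(
    "S_lycopersicum_Jun_2022.2bit", "contigs.fa", "contigs.SL5.pslx"
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run(cmd, check=True) once blat is on the PATH.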
Last but not least, I recommend that you run blat in parallel. If you do that, make sure that you keep track of which sequence came from which collection of assembled contigs. Otherwise, it will be super hard to sanity-check the results.
For example, what I would like to do is open up the blat alignments for a set of transcript contigs where I know exactly which RNA-Seq reads were used to make them. Then, I can easily use the Genome Browser to compare the genome alignments for the input RNA-Seq reads to the alignments of their assembled contigs. If we do this, we can easily get an idea of how the Trinity assembly software performs when there are lots of reads or not so many reads in a given region. It will make more sense when I can show you what I mean in IGB.
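One possible way to run the jobs in parallel without losing track of where each contig set came from is to put the collection name and the genome version into every output file name. The sketch below is only an illustration under assumptions: the collection names, FASTA paths, output directory, and function names are all hypothetical, and the ".2bit" files are assumed to be downloaded locally.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Local copies of the two target assemblies (assumed already downloaded).
TARGETS = {
    "SL4": "S_lycopersicum_Sep_2019.2bit",
    "SL5": "S_lycopersicum_Jun_2022.2bit",
}


def plan_jobs(collections, targets=TARGETS, out_dir="blat_out"):
    """Build one blat job per (contig collection, genome) pair.

    collections maps a collection name to its contig FASTA file.
    Each output file is named <collection>.<genome>.pslx so the
    origin of every alignment stays visible.
    """
    jobs = []
    for name, fasta in sorted(collections.items()):
        for genome, two_bit in sorted(targets.items()):
            output = str(Path(out_dir) / f"{name}.{genome}.pslx")
            command = [
                "blat", two_bit, fasta,
                "-out=pslx", "-noHead", "-maxIntron=13000",
                output,
            ]
            jobs.append({
                "collection": name,
                "genome": genome,
                "output": output,
                "command": command,
            })
    return jobs


def run_jobs(jobs, max_workers=4):
    """Run the planned blat jobs concurrently; raises if any job fails."""
    def run_one(job):
        Path(job["output"]).parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(job["command"], check=True)
        return job

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, jobs))
```

For example, plan_jobs({"sampleA": "sampleA.fa"}) would plan two jobs, writing sampleA.SL4.pslx and sampleA.SL5.pslx under blat_out, and run_jobs(...) would then execute them. Threads are enough here because each worker just waits on an external blat process.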
Please note I am moving this to the next sprint since it seems unlikely we will get around to doing further work on this during the current sprint, which ends tomorrow.