  1. User Support
  2. HELP-126

[Biostars] Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations

    Details

    • Type: Support
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Labels:
      None

      Description

      The Biostars post can be found here:

      https://www.biostars.org/p/147826/

      User kirannbishwa01 wrote Question: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:
      I very much like the IGB tools and their features. While I have been able to make good use of them, I have been facing a problem and can't seem to find a solution however much I try. I am trying to view the aligned tophat output (mapped.bam and junction files) from RNAseq data aligned to the reference A. lyrata genome. When I load the lyrata genome in IGB, I can see the genome coordinates and the TAIRmRNA database (the annotated .gff file). But after I upload a mapped.bam and junction file, I am not able to see the alignment (aligned reads) together with the reference and the annotation.
      However, I noticed that the mapped.bam and junction files create their own set of scaffolds below the default set (a one-to-one copy of the default set, though I am not sure why). If I select a scaffold that the mapped.bam file created, I can see the mapped reads and the junctions, but not the coordinate bases or the annotations. With the A. thaliana genome, however, there is no such problem: I can view the mapped output and junctions from RNAseq data along with the genome coordinates and bases, the TAIR10 mRNA database, and several other databases from other labs.

      Also, I see that an updated version of the Phytozome data is available (V10.2). Is the A. lyrata data available in IGB (V7) the same as V10.2?
      Thanks,
      Bishwa K.

        Attachments

          Activity

          mason Mason Meyer (Inactive) added a comment - - edited

          User Sam wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          If you performed the alignment yourself, it might be a good idea to load the fasta file together with the gtf file you used for the alignment into IGB and try to visualize the mapping. Most of the time, the cause is probably just a naming problem. And just in case, you might want to read this if you want to know how to index your fasta file:

          http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference

          mason Mason Meyer (Inactive) added a comment -

          User alolex wrote Answer: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          IGB has released a new version, v 8.3.4, which does contain A. lyrata, but with gene models from Phytozome v7; however, from the release notes on http://phytozome.jgi.doe.gov/pz/portal.html it seems A. lyrata was not changed and is still at v1.0. You need to make sure the aligned data you are loading uses the same sequence names as the genome version in IGB; it looks like they are named scaffold_1, scaffold_2, etc. The sequence is only viewable when you are zoomed in a substantial amount, but if it doesn't show up when you are zoomed in, try clicking the "Load Sequence" button. If the sequence still doesn't load, then you probably have some mismatching names in the files you are using. You should be able to just drag and drop your bam and junctions files onto IGB, zoom in to the area you want to study, and click on "Load Data" and "Load Sequence".

          mason Mason Meyer (Inactive) added a comment -

          User Ann wrote Answer: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Hello,

          It sounds to me like the reference genome you used for the alignment step is using different names for scaffolds than the version of the sequence IGB is using.

          Some useful info:

          IGB is getting reference genome sequence and gene model annotations from a publicly accessible IGB QuickLoad site located at:
          http://www.igbquickload.org/quickload

          The various genomes we support are contained in folders for each genome, named for the species and the month and year of the genome assembly release.

          It looks like our latest A. lyrata genome is in here:

          http://www.igbquickload.org/quickload/A_lyrata_Apr_2011/

          IGB uses a file called "genome.txt" to populate the list of chromosome/scaffold sequences you see in the "Current Genome" table (right side tab):

          http://igbquickload.org/quickload/A_lyrata_Apr_2011/genome.txt

          If you download that file and open it in Excel or a text editor, you can see all the names of the chromosomes and their sizes.

          The sequence data, which IGB will load when you click the "Load Sequence" button, is in a "2bit" format file called A_lyrata_Apr_2011.2bit. We are using the "2bit" format because it's very compact and there are many utilities for working with it - mostly available from Jim Kent and the UCSC Genome Bioinformatics group. The 2bit format has some nice features that makes accessing sequence data fast and easy for IGB.

          If you load a BAM file into IGB and notice that all-new sequence names are getting added to the Current Sequence table *and* when you click the "Load Sequence" button, no sequence gets loaded, then that usually means: the genome.txt and 2bit files don't contain the sequences you used to run your alignment. This could happen if your genome version is different or if it's the same version but is just using different names.
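          A minimal sketch of that check outside IGB, assuming genome.txt holds one whitespace-separated "name size" pair per line and that the BAM header text has been obtained separately (for example with `samtools view -H mapped.bam`). The function names are hypothetical and only for illustration.

```python
def parse_genome_txt(text):
    """Parse genome.txt content: one 'name size' pair per line (assumed layout)."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            sizes[fields[0]] = int(fields[1])
    return sizes


def parse_sam_header(text):
    """Collect sequence names (SN) and lengths (LN) from @SQ header lines."""
    sizes = {}
    for line in text.splitlines():
        if not line.startswith("@SQ"):
            continue
        tags = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in tags and "LN" in tags:
            sizes[tags["SN"]] = int(tags["LN"])
    return sizes


def compare_names(genome_sizes, bam_sizes):
    """Return names shared by both, names only in the BAM, names only in genome.txt."""
    shared = sorted(set(genome_sizes) & set(bam_sizes))
    only_bam = sorted(set(bam_sizes) - set(genome_sizes))
    only_genome = sorted(set(genome_sizes) - set(bam_sizes))
    return shared, only_bam, only_genome
```

          Any name that appears only in the BAM header is one that IGB would add as a new entry to the sequence table instead of matching it to the loaded genome.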

          If using different names is the problem, then you can give IGB a list of synonyms that IGB can use to match names. So for example, if the reference genome sequence you used to do your alignments contains a sequence called "FooBar" which is the same sequence that IGB calls "foobar123", then you can tell IGB the two names mean the same thing by adding a personal synonyms "chromosomes.txt" file to IGB.

          For more info about that, see: https://wiki.transvar.org/display/igbman/Personal+Synonyms
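          As a rough sketch of how such a synonyms list could be drafted, the code below pairs IGB names with your own names whenever a sequence length occurs exactly once in each set, then writes one tab-separated pair per line. Both the pairing heuristic and the tab-delimited layout are assumptions; equal length does not prove identical sequence, and the Personal Synonyms page above is the place to confirm the file format.

```python
from collections import defaultdict


def pair_by_length(igb_sizes, my_sizes):
    """Pair names whose length is unique in both name->length mappings.

    Returns (pairs, unmatched): pairs is a list of (igb_name, my_name),
    unmatched lists names from my_sizes that could not be paired safely.
    """
    igb_by_len = defaultdict(list)
    for name, size in igb_sizes.items():
        igb_by_len[size].append(name)
    my_by_len = defaultdict(list)
    for name, size in my_sizes.items():
        my_by_len[size].append(name)

    pairs, unmatched = [], []
    for name, size in my_sizes.items():
        candidates = igb_by_len.get(size, [])
        if len(candidates) == 1 and len(my_by_len[size]) == 1:
            pairs.append((candidates[0], name))
        else:
            unmatched.append(name)
    return pairs, unmatched


def synonyms_text(pairs):
    """Format pairs as tab-separated lines, skipping names that already agree."""
    return "".join(f"{a}\t{b}\n" for a, b in pairs if a != b)
```

          Names left in the unmatched list (ambiguous or missing lengths) would need to be resolved by hand, for example by comparing the sequences themselves.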

          Let us know if you need any help with this.

          Also, if there is a more recent version of A. lyrata genome, we'd be happy to add it to the IGB QuickLoad system - including sequence and gene model annotations.

          mason Mason Meyer (Inactive) added a comment -

          User Ann wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          This also works. To open your fasta file in IGB and use it as the reference, select File > Open Genome from File. (Or click the blue and red DNA icon in the toolbar.)

          IGB will then display a window that lets you select a fasta or 2bit format file to use as the reference sequence. (Better to use 2bit - it's much faster to read.)

          You can also enter a genome version and species names. It's optional, but if you do that, then IGB will display the names you selected in the Species and Genome Version menus of the Current Genome tab. Otherwise IGB will assign a default name.

          Then, click OK.

          What happens at that point is that IGB will scan your reference sequence file, make a list of all the chromosomes and their sizes, and then list them in the Sequence table in the Current Genome tab.
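          To see the same list outside IGB, a small sketch like the following scans a fasta file and reports each sequence name with its length. It illustrates the idea only and is not IGB's own code; it assumes the sequence name is the first word of each header line and writes the result in the "name size" layout described for genome.txt.

```python
def fasta_sizes(lines):
    """Return {sequence name: number of bases} for an iterable of fasta lines."""
    sizes = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""  # first word of the header
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    return sizes


def genome_txt(sizes):
    """Format the sizes as tab-separated 'name size' lines, in file order."""
    return "".join(f"{name}\t{size}\n" for name, size in sizes.items())
```

          With a real file this could be called as `genome_txt(fasta_sizes(open(path)))`, where `path` points to your reference fasta file.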

          At that point, you can open your files as you would normally, including your GTF file. IGB can read GTF files produced by Cufflinks.

          It can also read some GFF3 files. However, GFF3 files are sometimes not read correctly because different groups interpret the GFF3 specification differently and it's hard to make sure that all GFF3 files will work with IGB. For this reason, we recommend using BED or BED-Detail to represent gene models in IGB.

          If you use BED-Detail, make a regular BED file. Then add a column 13 with whatever you want the gene title to be (e.g., TP53) and add a column 14 with whatever descriptive text you'd like to see in the Selection Info tab when you click on the gene. For column 4, insert the name of the gene model, e.g., AT1G07350.1 if it's Arabidopsis. For examples, see the "bed" files on the QuickLoad site - there are many examples from many different species. The text you insert into columns 4, 13, and 14 will be available for searching under the Advanced Search tab, so it's useful to add text you think will be helpful for search, like gene name and gene function.
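          A hedged sketch of assembling one such line: the first 12 fields follow the standard BED layout (0-based, half-open coordinates), and columns 13 and 14 carry the title and description as described above. The function name and the coordinates in the usage note are made up for illustration.

```python
def bed_detail_line(chrom, strand, exons, name, title, description,
                    thick_start, thick_end):
    """Build one 14-column BED-Detail line from a list of (start, end) exons."""
    exons = sorted(exons)
    start, end = exons[0][0], exons[-1][1]
    block_sizes = ",".join(str(e - s) for s, e in exons)
    block_starts = ",".join(str(s - start) for s, _ in exons)
    fields = [
        chrom, start, end,
        name,          # column 4: gene model name
        0,             # column 5: score (unused here)
        strand,
        thick_start,   # columns 7-8: translated region
        thick_end,
        0,             # column 9: itemRgb (unused here)
        len(exons),    # column 10: number of blocks (exons)
        block_sizes,
        block_starts,
        title,         # column 13: gene title
        description,   # column 14: descriptive text
    ]
    return "\t".join(str(f) for f in fields)
```

          For example, `bed_detail_line("scaffold_1", "+", [(100, 200), (300, 450)], "AT1G07350.1", "example title", "example description", 120, 400)` yields a two-exon gene model spanning positions 100 to 450.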

          mason Mason Meyer (Inactive) added a comment -

          User kirannbishwa01 wrote Answer: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Thank you all for your help. It all makes sense to me, and I now think there is more than one problem. The first problem I found is that the size of the lyrata genome differs between sites. JGI (Joint Genome Institute) and Phytozome have the lyrata reference fasta file at 199 MB, while the downloads from Ensembl and iPlant have it at 200 MB. I am not sure, but it seems this slight difference in file size (and in the data within the file) could be the first source of variation in the index file that was created. If I am wrong, please correct me.

          IGB has the lyrata reference sequence in 2bit format, so I am not sure which release it matches. But the alignment problem seems to be mainly due to naming: I see that the same number of scaffolds is created when loading the default IGB lyrata genome as when loading the reference fasta file from iPlant (community data folder), but the scaffolds are named differently. So, for now, I will see whether creating a personal synonyms file helps.

          Also, the .gtf file from the Ensembl release (83 MB) differs in size from the one in the iPlant community data folder (67.1 MB), which makes me think that Ensembl now has more annotated genes than the iPlant .gtf file.

          I have just begun analyzing my data and am new to bioinformatics, so some of my assumptions could be wrong, and I would be happy to hear any feedback. Also, it would be great if we could have an updated version of the lyrata genome sequence and gene model annotations in IGB.
          I could share the files that I have downloaded from the various servers. Please let me know.

          Thank you so much all of you.

          • Bishwa K.
          mason Mason Meyer (Inactive) added a comment -

          User alolex wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          You should not conclude that 2 fasta files are different just based on a 1MB difference in file size. The content that matters could be exactly the same, but the larger file may just have more meta information, more details in the fasta headers, or have been generated by a different program etc that could result in a slightly different file size. For example, just adding 1 million spaces will increase the file size by about 1MB. Have you looked at the actual content of the fasta files? Do the headers look the same? The gtf file size does look significant though. Can you give us more details on the sequence of steps and the exact files you are trying to load into IGB?
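          One way to look at the actual content is to compare per-sequence checksums that ignore line wrapping, header details, and sequence names. The sketch below is only an illustration with hypothetical function names; it also treats upper- and lower-case bases as equal, which is an assumption that hides soft-masking differences.

```python
import hashlib


def fasta_digests(lines):
    """Return {sequence name: MD5 of its bases}, ignoring line breaks and case."""
    digests = {}
    name = None
    hasher = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                digests[name] = hasher.hexdigest()
            parts = line[1:].split()
            name = parts[0] if parts else ""
            hasher = hashlib.md5()
        elif hasher is not None:
            hasher.update(line.upper().encode("ascii"))
    if name is not None:
        digests[name] = hasher.hexdigest()
    return digests


def same_content(digests_a, digests_b):
    """True if both files hold the same set of sequences, whatever their names."""
    return sorted(digests_a.values()) == sorted(digests_b.values())
```

          If `same_content` returns True for two files whose names differ, the difference is purely one of naming and formatting, which is the case a synonyms file can handle.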

          mason Mason Meyer (Inactive) added a comment -

          User Ann wrote Answer: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Yes, do please put the files into a folder on Google Drive or into a Public Dropbox folder, along with a short document giving the URL or other information indicating where the file came from.

          I can then take a look at the files and use them to get a better idea of how to update the IGB QuickLoad site.
          Most useful would be:

          • fasta file from Ensembl/iPlant
          • fasta file from JGI/Phytozome
          • annotations file (gtf) from Ensembl
          • annotations file (gtf) from iPlant community folder

          Also helpful would be a few references to recent RNA-Seq (or other *Seq) papers featuring data from A. lyrata - anything where the authors would have run some type of alignment against a reference genome.

          Most genome projects do the same basic pipelines, but with variations. So usually I like to look at recent papers where researchers used the genome - from that, I can get a sense of what resources are available and how (or if) to re-format them for use with IGB.

          Also, when you get to this point, you might want to set up your own IGB QuickLoad site to share data with other people in your lab and/or collaborators in other labs. You can distribute the data on a Web site or use a "Public" Dropbox folder.

          The Dropbox folder approach is a bit of a hack in that I'm almost positive Dropbox has no idea scientists are using their service in this way, but so far it seems to work great. If your lab can afford $100/year for 1 TB of storage, then this could be a great option, as most RNA-Seq data sets (in my experience, at least) are 30 GB or smaller. And since 1 TB = 1,000 GB, you could probably share a LOT of data sets this way.

          Details are here:

          https://wiki.transvar.org/display/igbman/Set+up+a+Quickload+in+Dropbox
          You can also do something similar with iPlant, but IGB connectivity to iPlant is still a bit experimental, and we're planning to work with the iPlant engineering team to make it easier. So hopefully that will also be an option in the near future.

          Let me know when the files are ready and I will take a look.

          Best,
          Ann

          mason Mason Meyer (Inactive) added a comment -

          User Ann wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Good point. The fasta headers could be identical, but the number of bases per line might be different, which would also change the file size because of the different number of newline characters.

          mason Mason Meyer (Inactive) added a comment -

          User kirannbishwa01 wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Hi Ann,

          Thanks a lot for your input.
          I like the way we could set up a QuickLoad server on Dropbox. I have about 20 GB of space on Dropbox, so I think that should suffice for now.
          I am sharing the releases that I downloaded in bulk from Ensembl/iPlant and JGI/Phytozome. They contain fasta, gtf, and several other feature and annotation files that came with the bulk download; I left these in the shared folder in case they are helpful.
          https://www.dropbox.com/sh/0j6ja1je56epgl7/AAAo_-RjnsmZBtrFnUVh46oYa?dl=0
          Regarding research articles, we have not found any labs that have done RNAseq analysis using the lyrata reference. The paper that describes the lyrata genome assembly and its comparative analysis with its relative A. thaliana is Hu et al. (2011): http://www.nature.com/ng/journal/v43/n5/full/ng.807.html?WT.ec_id=NG-201105 I hope this is helpful.
          Please let me know if you have any questions.

          Thanks again,

          Bishwa K.

          mason Mason Meyer (Inactive) added a comment -

          User kirannbishwa01 wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Just a quick note:

          There are two different folders of JGI releases which contain different files, and I shared both of them.

          Thanks

          mason Mason Meyer (Inactive) added a comment -

          User alolex wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Yep! Another possibility. Bottom line is any number of formatting changes could be altering the file size without actually changing the content. For fasta files I found this python script that looks like it will tell you if anything is different in content (https://www.cgat.org/downloads/public/cgat/documentation/scripts/diff_fasta.html), but I've not needed to use it yet. For GTF files I'm not aware of a tool that does a direct comparison, but I found this post (http://r.789695.n4.nabble.com/Comparing-two-gff-gtf-files-de-novo-transcripts-v-s-reference-td3934629.html) that points to rtracklayer in Bioconductor.

          mason Mason Meyer (Inactive) added a comment -

          User kirannbishwa01 wrote Answer: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Thank you all for the inputs. The information you all provided has been helpful.

          Hi Ann,

          Could you please let me know whether the files/folders that I shared were of any help in updating the annotation on IGB? There was a recent IGB release, but for the A. lyrata genome (and its annotation) I don't see any difference. Could you please update me in this regard?

          Thanks much,

          Bishwa K.

          mason Mason Meyer (Inactive) added a comment -

          User Ann wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Yes, the data is very useful - thank you! It looks like the gene annotations in IGB currently are up-to-date, but we need to change the names to make them match the other data sets. Once we do that, it will be much easier to use IGB to view your data. We need to double-check a few things and hope to have a new release ready soon.

          mason Mason Meyer (Inactive) added a comment -

          User Ann wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Phytozome and Ensembl both report 695 scaffolds and 32,667 gene models for genome assembly version 1.0 for A. lyrata. This is also what is present in IGB QuickLoad, so I think IGB QuickLoad is up-to-date. However, I noticed that the gene model names in IGB QuickLoad are numbers like "311229," which are not very informative. IGB provides a Google search feature that lets you search Google using gene model IDs as a query, and I doubt a numeric ID would be a very useful query in this case. I noticed that the files you provided from JGI include a synonyms file that maps these numeric IDs onto IDs like: fgenesh1_pm.C_scaffold_1000009. So I am going to modify the IGB gene models file to use these IDs instead of the numeric IDs that are there now. I'll post again when that's done.
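
The renaming described above could be sketched roughly as follows. This is not the code that was actually used for IGB QuickLoad. It assumes a two-column, tab-delimited synonyms file (numeric ID, then descriptive ID) and GFF3-style `key=value` attributes in column 9; neither layout is confirmed in this thread. The function names are hypothetical, and the pairing of 311229 with fgenesh1_pm.C_scaffold_1000009 is for illustration only.

```python
def load_synonyms(lines):
    """Build {numeric_id: descriptive_id} from tab-delimited lines (assumed layout)."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0]:
            mapping[fields[0]] = fields[1]
    return mapping


def rename_gene_models(gff_lines, mapping):
    """Yield GFF lines with ID, Name, and Parent values replaced where a synonym exists."""
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            yield line  # comments and malformed lines pass through unchanged
            continue
        attrs = []
        for item in cols[8].split(";"):
            key, sep, value = item.partition("=")
            if sep and key in ("ID", "Name", "Parent"):
                value = mapping.get(value, value)
            attrs.append(key + sep + value)
        cols[8] = ";".join(attrs)
        yield "\t".join(cols) + "\n"


# Illustrative pairing only; the real synonyms file defines the actual mapping.
synonyms = load_synonyms(["311229\tfgenesh1_pm.C_scaffold_1000009\n"])
gff = ["scaffold_1\tJGI\tmRNA\t100\t900\t.\t+\t.\tID=311229;Name=311229\n"]
for out in rename_gene_models(gff, synonyms):
    print(out, end="")
```

IDs without an entry in the synonyms mapping are left as they are, so a partial synonyms file does not drop any gene models.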

          mason Mason Meyer (Inactive) added a comment -

          User Ann wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          I also noticed that iPlant and Ensembl are using names 1, 2, 3, ... 9 instead of scaffold_1, scaffold_2, ... , scaffold_9, which are the names JGI and Phytozome appear to be using. These scaffolds (1 through 9) appear to correspond to the nine physical chromosomes of A. lyrata. I think it would be useful for IGB QuickLoad to synchronize with Ensembl and also iPlant, which gets its data from Ensembl. So I am also going to change the genome files in IGB QuickLoad to use names 1, 2, 3, etc. instead of scaffold_1, scaffold_2, etc. for the nine chromosomal sequences.
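
A minimal sketch of this kind of scaffold renaming, assuming plain-text FASTA headers and tab-delimited annotation files whose first column is the sequence name. The function names are hypothetical, and the number of chromosomal sequences is left as a parameter rather than hard-coded. The same idea applies to alignments: the sequence names in a BAM file have to match the names the browser uses for the genome.

```python
def chromosome_name_map(count):
    """Map scaffold_1..scaffold_<count> to the bare numbers 1..<count>."""
    return {f"scaffold_{i}": str(i) for i in range(1, count + 1)}


def rename_fasta_header(line, name_map):
    """Rename the sequence in a FASTA header line; other lines pass through."""
    if not line.startswith(">"):
        return line
    name, sep, rest = line[1:].rstrip("\n").partition(" ")
    return ">" + name_map.get(name, name) + sep + rest + "\n"


def rename_first_column(line, name_map):
    """Rename the sequence in column 1 of a tab-delimited line (GFF, BED, ...)."""
    if line.startswith("#"):
        return line
    name, sep, rest = line.partition("\t")
    return name_map.get(name, name) + sep + rest


name_map = chromosome_name_map(9)  # nine, as proposed above; adjust as needed
print(rename_fasta_header(">scaffold_9\n", name_map), end="")   # >9
print(rename_fasta_header(">scaffold_10\n", name_map), end="")  # >scaffold_10
print(rename_first_column("scaffold_1\tJGI\tgene\t1\t10\n", name_map), end="")
```

Names outside the mapping, such as scaffold_10 and the remaining unplaced scaffolds, are left unchanged.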

          mason Mason Meyer (Inactive) added a comment -

          User kirannbishwa01 wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Hi Ann, thanks a lot for fixing these issues. A. lyrata has 8 physical chromosomes, so scaffolds 1-8 should correspond to Chr 1-8, while scaffolds 9 and 10 should represent the mitochondrial and chloroplast genomes. There are several other unmapped regions (these should be all the other scaffolds).
          Thanks a lot again,

          mason Mason Meyer (Inactive) added a comment -

          User Ann wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Quick followup: Looks like scaffold_9 is bigger than scaffold_10. Also, scaffold_9 has many gene annotations and scaffold_10 has none. The gene models on scaffold_9 look a lot like genes from nuclear chromosomes - they have introns, exons, splicing.

          mason Mason Meyer (Inactive) added a comment -

          User Ann wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          Another followup:

          Code I used to change gene model names and scaffold names is (mostly) here:

          https://bitbucket.org/aloraine/alyrata

          mason Mason Meyer (Inactive) added a comment -

          User Ann wrote Comment: Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations:

          The new files are now available on IGBQuickLoad. IGB is now using the longer, non-numeric names for gene models.


            People

            • Assignee:
              Unassigned
              Reporter:
              mason Mason Meyer (Inactive)