  1. IGB
  2. IGBF-3077

Investigate Visual Analysis of Single Cell RNA seq data

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Story Points:
      2
    • Sprint:
      Spring 3 2022 Jan 31 - Feb 11, Spring 4 2022 Feb 14 - Feb 25, Spring 5 2022 Feb 28 - Mar 11, Spring 6 2022 Mar 14 - Mar 25

      Description

      To improve on existing visualizations of single-cell RNA-seq data, a thorough investigation is needed. The investigation begins with the Aspen Institute lecture by Aviv Regev, which highlights the importance of such visual analyses, and follows the breadcrumbs from there to find how improvements such as interactivity and connectivity to other resources (for example, genome browsers) can be introduced into the visualizations.

      To that end, the first task is to thoroughly understand how PCA and t-SNE are used in the test case (Test Case 3) mentioned in the Aviv Regev video, break down the key features of the visualizations, and find other cases where PCA and t-SNE are used in single-cell RNA-seq analysis, whether for similar or different purposes.

        Attachments

          Activity

          ann.loraine Ann Loraine added a comment -

          Note: Karthik Raveendran mentions that when he wrote code that implements the algorithm, he reached a deeper understanding. A take-home lesson of this is: concrete implementations are ultimately what we understand, and so only by implementing an algorithm can we achieve mastery of it.

          ann.loraine Ann Loraine added a comment -

          Moving to "done".

          karthik Karthik Raveendran added a comment -

          PCA is better described as a data transformation than as dimensionality reduction in itself: it finds new orthogonal axes (the principal components) along which the data vary the most. It is the first step before dimensionality reduction. The percentage of variance explained by each principal component is recorded, and generally the principal components up to a cumulative proportion of variance of about 99% are kept and the rest are discarded (discarding is not required). Principal components are ordered by decreasing variance.
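
The variance bookkeeping described above can be sketched with NumPy alone. This is a minimal illustration, not the method used in the paper: the function name `pca_variance_summary`, the toy data, and the default 99% threshold are assumptions made for the example.

```python
import numpy as np

def pca_variance_summary(X, threshold=0.99):
    """Return per-component variance ratios, their cumulative sum, and the
    number of components needed to reach `threshold` cumulative variance."""
    Xc = X - X.mean(axis=0)                  # center each attribute (column)
    cov = np.cov(Xc, rowvar=False)           # attribute-by-attribute covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh is ascending; reverse it
    ratio = eigvals / eigvals.sum()          # proportion of variance per PC
    cumulative = np.cumsum(ratio)
    # First position where the cumulative proportion reaches the threshold.
    n_keep = int(np.searchsorted(cumulative, threshold)) + 1
    n_keep = min(n_keep, len(eigvals))       # guard against rounding at 1.0
    return ratio, cumulative, n_keep

# Hypothetical toy data: three attributes with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([10.0, 3.0, 0.01])
ratio, cumulative, n_keep = pca_variance_summary(X)
print("variance ratios:", np.round(ratio, 4))
print("components kept:", n_keep)
```

Keeping components up to a cumulative threshold is a convention rather than a rule, which matches the note above that discarding is optional.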

          The one thing I found hard to understand is how to find the correlation of the PCs with the attributes in the dataset. Every principal component is a weighted sum of all the attributes in the dataset; that is, if the dataset is X with n columns, then the equation for the first component is:

          PC1 = w11*X1 + w12*X2 + ... + w1n*Xn

          From this we can find which attributes affect the principal component most strongly, positively or negatively, by looking at the weights. The weights are the entries of the eigenvector (the direction) for that component. The loadings are the eigenvector entries multiplied by the square root of the corresponding eigenvalue; when the attributes are standardized, the loadings equal the correlations between the attributes and the principal components.
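
As a rough sketch of the weighted sum and the loadings, the snippet below standardizes the attributes, takes the eigenvectors of the resulting correlation matrix as the weights, and scales them by the square root of the eigenvalues. The function name `pca_loadings` and the toy data are assumptions for illustration only.

```python
import numpy as np

def pca_loadings(X):
    """Return PC scores, weights (eigenvectors), and loadings for
    standardized data. Column k of each result belongs to PC(k+1)."""
    # Standardize so that the covariance matrix is the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scores: PCk = w_k1*Z1 + w_k2*Z2 + ... + w_kn*Zn for every row.
    scores = Z @ eigvecs
    # Loadings: each eigenvector scaled by sqrt of its eigenvalue.
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return scores, eigvecs, loadings

# Hypothetical toy data with correlated attributes.
rng = np.random.default_rng(1)
mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
X = rng.normal(size=(100, 3)) @ mix
scores, weights, loadings = pca_loadings(X)
print("loadings on PC1:", np.round(loadings[:, 0], 3))
```

The attribute with the largest absolute loading on PC1 is the one most strongly correlated with that component. The sign of an eigenvector is arbitrary, so only relative signs within a component are meaningful.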

          I was also curious about what a gene expression dataset looks like: https://www.youtube.com/watch?v=xh_wpWj0AzM

          karthik Karthik Raveendran added a comment -

          Stochastic Neighbor Embedding
          YouTube: Dimensionality reduction of scRNA-seq data lecture by Paulo Czarnewski on Chipster Tutorials channel
          GeeksforGeeks: Difference between PCA VS t-SNE

          ann.loraine Ann Loraine added a comment -

          Please cite the sources for the statements about the differences between t-SNE and PCA.

          karthik Karthik Raveendran added a comment -

          t-SNE and PCA are both unsupervised learning algorithms. t-SNE focuses on preserving the small distances between neighboring points in the scatterplot and puts much less emphasis on the large distances between clusters. PCA, in contrast, emphasizes the larger distances, so smaller clusters within a larger cluster do not have distinguishable boundaries.
          In Test Case 3, the t-SNE plot is based on the gene expression profiles of the cells, and the key takeaway is that malignant cells group together by the tumor they were extracted from, while non-malignant cells group together by cell type. However, within the larger cell-type clusters, there are distinguishable blobs of cells that can be traced to the tumors they came from. In the PCA, the distinctions between cells that are preserved are based on the PCs alone (explained in the attached slides). Cells from a single tumor are analyzed with this method, so preserving smaller distances is irrelevant, as those distances would contribute negligible variance; distinct boundaries across larger distances are more important for interpreting the scatter plot.
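
One way to see why t-SNE favors small distances is to look at the Gaussian neighbor probabilities that stochastic neighbor embedding builds from the high-dimensional data. The sketch below is a simplification: it uses one fixed bandwidth `sigma`, whereas t-SNE calibrates a bandwidth per point from the perplexity setting. The function name and the toy points are assumptions for illustration.

```python
import numpy as np

def neighbor_probabilities(X, i, sigma=1.0):
    """Conditional probabilities p(j|i) that point i picks point j as its
    neighbor, using a Gaussian kernel centered on point i."""
    sq_dist = np.sum((X - X[i]) ** 2, axis=1)       # squared distances to i
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    weights[i] = 0.0                                # a point is not its own neighbor
    return weights / weights.sum()

# Hypothetical toy data: two points close to point 0 and one far away.
X = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [10.0, 0.0]])
p = neighbor_probabilities(X, i=0)
print(np.round(p, 4))
```

Because the kernel decays exponentially with squared distance, the far point receives a probability that is effectively zero, so its exact position contributes almost nothing to the embedding objective. PCA, by contrast, is driven by variance, to which large distances contribute the most.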

          karthik Karthik Raveendran added a comment - - edited

          Updated the attached PPT with slides explaining the diagrams in the paper and some of the supplementary materials used for the paper (Slides 7-13). Slides 8 and 10 are the major focus of this task, as they deal with the t-SNE and PCA used in the analysis; the rest of the slides cover other visualizations that were used after the conclusions drawn from those plots. The t-SNE diagram is fairly straightforward: it shows how the researchers distinguished between malignant and non-malignant cells, and the supplementary table shows the data that they used.
          The researchers use PCA to conclude that tumors contain cells that probably won't respond to the drugs, and to classify them at the tumor level. The paper goes into detail on how the principal components were defined, which is summarized in Slide 9; tables from the supplementary materials are also attached.

          karthik Karthik Raveendran added a comment - - edited

          Last Week: I reviewed the paper: https://pubmed.ncbi.nlm.nih.gov/27124452/
          I need to make another pass through the paper and look more closely at the data they used. I still need to understand how the researchers use PCA and t-SNE in the paper.

          I am also working through the BINF-3121 course for the fundamentals of R, and I have started working with URD to understand how to create transcriptional trajectories: https://github.com/farrellja/URD


            People

            • Assignee:
              karthik Karthik Raveendran
              Reporter:
              karthik Karthik Raveendran
            • Votes:
              0
              Watchers:
              2
