How do you visualize high-dimensional space?
The size of an embedding varies with the complexity of the underlying model. To visualize this high-dimensional data, it first needs to be reduced to two or three dimensions.
![[Pasted image 20230930143104.png]]
![[Pasted image 20230930143121.png]]
(GNMT interlingua, colored by target language; 2D then 3D)
UMAP (Uniform Manifold Approximation and Projection), t-SNE (t-distributed Stochastic Neighbor Embedding), and PCA (Principal Component Analysis) are all algorithms used to reduce the dimensionality of a dataset and visualize high-dimensional data in a lower-dimensional (usually 2D or 3D) space.
Different dimensionality reduction techniques work better for different datasets, so finding the best one is often a matter of trial and error. TensorFlow provides good support for performing these reductions and visualizing the lower-dimensional data, and the TensorBoard Embedding Projector lets you explore the visualizations interactively.
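As a rough sketch of the latter (the directory name and data here are hypothetical, and the checkpoint pattern follows TensorFlow's embedding-projector tutorial), you can log a set of embeddings so the TensorBoard projector picks them up:

```python
import os
import numpy as np
import tensorflow as tf
from tensorboard.plugins import projector

log_dir = "logs/projector"  # hypothetical log directory
os.makedirs(log_dir, exist_ok=True)

# Hypothetical embeddings: 100 items, 64 dimensions each
embeddings = tf.Variable(np.random.rand(100, 64).astype("float32"))

# One metadata row per embedding; the projector uses this for labels/coloring
with open(os.path.join(log_dir, "metadata.tsv"), "w") as f:
    for i in range(100):
        f.write(f"item_{i}\n")

# Save the variable in a checkpoint and point the projector config at it
checkpoint = tf.train.Checkpoint(embedding=embeddings)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

config = projector.ProjectorConfig()
emb = config.embeddings.add()
emb.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
emb.metadata_path = "metadata.tsv"
projector.visualize_embeddings(log_dir, config)
# Then run: tensorboard --logdir logs/projector
```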
t-SNE, UMAP, and PCA are all unsupervised dimensionality reduction techniques: they find structure in the data without using explicit labels (i.e. without 'supervision'). Parametric t-SNE and UMAP can be extended to a semi-supervised setting by learning a mapping from the high-dimensional space to the low-dimensional space using not just the input features but also the labels as guidance. That mapping can then be applied to new, unlabeled data.
Parametric t-SNE differs from standard t-SNE in that it learns a parametric mapping function from the high-dimensional space to the low-dimensional space. Once the function is learned, it can be applied to new, unseen data, overcoming one of the main limitations of standard t-SNE, which provides no way to embed new points into an existing map.
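As an illustration of the semi-supervised case (the data and label counts here are hypothetical), umap-learn accepts partial labels via the `y` argument to `fit`, with `-1` marking unlabeled points, and the fitted mapper can then embed new data with `transform`:

```python
import numpy as np
import umap  # pip install umap-learn

# Hypothetical data: 1,000 points, only the first 200 labeled
X = np.random.rand(1000, 64)
y = np.full(1000, -1)              # umap-learn treats -1 as "unlabeled"
y[:200] = np.random.randint(0, 5, 200)

# Fit with partial labels as guidance, then embed new, unseen points
mapper = umap.UMAP(n_components=2).fit(X, y=y)
X_new_2d = mapper.transform(np.random.rand(50, 64))
```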
## Principal Component Analysis (PCA):
PCA is a deterministic algorithm that identifies the axes in the dataset that explain the most variance - the principal components. It projects each data point onto only the first few principal components, producing lower-dimensional data while preserving as much of the variance as possible. PCA assumes the principal components are linear combinations of the original features, so it tends to work well only when that linearity assumption roughly holds.
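A minimal sketch with scikit-learn, assuming a NumPy array of hypothetical embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical embeddings: 1,000 vectors of dimension 512
embeddings = np.random.rand(1000, 512)

# Project onto the first two principal components
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Fraction of the total variance each component explains
print(pca.explained_variance_ratio_)
```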
## t-Distributed Stochastic Neighbor Embedding (t-SNE):
t-SNE, unlike PCA, is a probabilistic algorithm. It converts pairwise distances in the high-dimensional space into a probability distribution over neighbors, builds a matching distribution in the low-dimensional space, and minimizes the Kullback-Leibler divergence between the two with respect to the locations of the low-dimensional points. Because of this, t-SNE can capture much more complex structure than PCA and is especially good at preserving local structure, making it great for visualization. It is not a fast algorithm, though, and can take a long time to run on large datasets.
https://distill.pub/2016/misread-tsne/
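A minimal scikit-learn sketch (the embeddings are hypothetical; note that `TSNE` here has no `transform` for unseen points, which is exactly the limitation parametric t-SNE addresses):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical embeddings: 1,000 vectors of dimension 512
embeddings = np.random.rand(1000, 512)

# Perplexity roughly controls how many neighbors each point considers;
# results can change noticeably with it, so it's worth sweeping a few values.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings)
```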
## Uniform Manifold Approximation and Projection (UMAP):
UMAP is a newer method in the same vein as t-SNE, but it seeks to preserve more of the global structure in addition to the local structure, and it is much faster, scaling well with dataset size. It does this by approximating the manifold the data lies on in the high-dimensional space and projecting that manifold down to the low-dimensional space.
https://umap-learn.readthedocs.io/en/latest/how_umap_works.html
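A minimal sketch with the umap-learn package (hypothetical embeddings again; `n_neighbors` and `min_dist` are the main knobs):

```python
import numpy as np
import umap  # pip install umap-learn

# Hypothetical embeddings: 1,000 vectors of dimension 512
embeddings = np.random.rand(1000, 512)

# n_neighbors trades off local vs. global structure; min_dist controls
# how tightly points are packed in the low-dimensional layout.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
embeddings_2d = reducer.fit_transform(embeddings)
```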
Helpful links:
- https://projector.tensorflow.org/
- https://blog.research.google/2016/12/open-sourcing-embedding-projector-tool.html
- https://cookbook.openai.com/examples/visualizing_embeddings_in_2d
- https://youtu.be/wvsE8jm1GzE?si=pwseKi9JqW-fHqnw