Importance of staying on top of trends as a subject matter expert.
I wrote an article discussing the applications of t-distributed stochastic neighbor embedding (t-SNE) as an improved technique for dimensionality reduction and cluster association compared to the overused Principal Component Analysis (PCA). Although t-SNE is an emerging hot topic in the world of environmental forensics, some data scientists already consider it a dated technique.
The word “novel” is highly subjective because different fields, such as environmental science and data science, innovate at very different rates. There is also a large knowledge gap that formal education in one specific field cannot fill on its own. I remember being blown away in my first year of graduate school by a chemometrics course that introduced the concept of multivariate dimensionality reduction using PCA. Fast forward to now, and you’d be hard-pressed to convince me to use PCA to assess similarity in your data, knowing what I know about how it can misrepresent multivariate variability.
As scientists with rigorous formal education, we often confine ourselves to our own expertise when trying to innovate rather than exploring alternative sources of information. To excel as a subject matter expert, it is imperative to be open to information from sources outside one's small technical niche. Confining yourself to your own field can be a dangerous route to follow.
Writing this reminded me of a great example of how confinement in one’s own field can leave experts in the dark about innovation. In a recent webinar between the US Air Force and Elon Musk (the eccentric billionaire behind SpaceX and Tesla), the panel was discussing recent innovations in aviation. Halfway through the discussion, Elon stated that his ventures had already implemented these technologies over a decade ago and that the infrastructure to make all these machines autonomous (with no need for a pilot) was already available. He added that all jobs could soon be obsolete if they implemented his technologies. For the next few moments, all you could hear from the audience was crickets. I bring this up because it reflects what can happen if we are not aware of what lies outside our technical niches.
Let's dive into the technical reasons why we should move on from t-SNE (at this point PCA should be considered primitive) and embrace Uniform Manifold Approximation and Projection (UMAP) as the staple for dimensionality reduction. We should also accept that although UMAP is currently the most up-to-date technique, a time will come when we need to explore the newer methods in circulation.
What is UMAP and when should we use it instead of t-SNE?
Disadvantages of t-SNE:
- t-SNE does not preserve global structure, making it hard to compare associations between different clusters. In the 2D t-SNE example below, the spatial distance between the oil sands (green circles) and the samples (blue triangles) cannot be used to judge how closely the two clusters are related. With the 2D UMAP, by contrast, the clusters can be compared to see how similar they are to each other.
- The algorithm commonly used to calculate t-SNE components (the Barnes-Hut approximation) is limited to 3 dimensions. Although these components represent significantly more of the total variance, further components cannot be explored. UMAP is not limited to 3 dimensions, so further components can be explored in situations where the clustering is better reflected beyond the first 3 components.
- t-SNE performs a non-parametric mapping from high to low dimensions, meaning that it does not expose the features (analogous to PCA loadings) that drive the observed clustering. For environmental data this is extremely important, since the clustering should reflect changes in size to better represent the complexity of the environmental data.
Cool examples of UMAP:
What Chemistry Matters can provide:
At Chemistry Matters we aim to stay current with the evolution of data science within the field of environmental forensics. Not only do we keep up with the trends in our own discipline, but we also seek to adopt the trends set by the “popular kids” and pioneer a data-driven focus in environmental forensics.
- G. Linderman, M. Rachh, J. Hoskins, S. Steinerberger, Y. Kluger. Efficient Algorithms for t-distributed Stochastic Neighborhood Embedding. ArXiv 2017.
- L. McInnes, J. Healy, J. Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv 2018.