In the real world, virtually all data has more than one dimension. You can think of this as data having multiple viewpoints: sales data from a chain of department stores, for example, can be viewed along dimensions such as geography, time, product, and city or state, among many others.
Likewise, the data in a machine learning problem typically has many dimensions and features, and it becomes important to select the ones that actually help solve the given business problem. For example, if I am investigating a sales issue across the stores, the data might be feature rich, but solving the problem depends on selecting the right features.
Some of you may have run into this uncomfortable scenario, and that is where dimensionality reduction techniques come into play. Here we will build intuition for a non-linear technique introduced in 2008 called t-Distributed Stochastic Neighbor Embedding (t-SNE). With 100 or more dimensions, data becomes tough to visualize and make sense of, so we have to reduce the dimensions before we can visualize and explore it. t-SNE is a powerful algorithm that converts high-dimensional data into a lower-dimensional representation.
Using the t-SNE algorithm, you can get away with far fewer plots during exploratory data analysis on high-dimensional data. But how does it work? Curious? Let me walk you through it.
To put it simply, dimensionality reduction is the technique of depicting multidimensional data in just 2 or 3 dimensions so that it can be visualized via scatter plots, histograms, or box plots, which give us a good sense of the patterns in the data. Understanding those patterns helps you better describe the data and work more effectively towards solving the problem statement.
t-SNE builds a probability distribution over pairs of points based on their similarity, and uses these local relationships between data points to create a low-dimensional mapping. Think of it as listing every possible pair of points and picking pairs at random: a pair of similar points is much more likely to be picked than a pair of dissimilar points. t-SNE then arranges the points in the low-dimensional space so that this pairwise distribution is preserved as closely as possible. Because it finds structure through these observed pairwise similarities across many dimensions/features, its output can also be used as input features for clustering and classification problems.
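To make the pairwise-similarity idea concrete, here is a minimal sketch of how distances can be turned into a probability distribution over pairs. The function name and the fixed bandwidth sigma are illustrative only; the real t-SNE algorithm tunes the bandwidth per point via the perplexity parameter and uses symmetrized conditional probabilities, so treat this as intuition, not the actual implementation.

import numpy as np

def similarity_probabilities(X, sigma=1.0):
    ## Squared Euclidean distance between every pair of points
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    ## Gaussian similarity; a point is never its own neighbor
    sims = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(sims, 0.0)
    ## Normalize so the similarities over all pairs sum to 1
    return sims / sims.sum()

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = similarity_probabilities(X)
print(P.round(3))  ## the pair of nearby points dominates the distribution

Running this, almost all of the probability mass lands on the pair of nearby points, which is exactly the "similar pairs are more likely to be chosen" behavior described above.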
Applying t-SNE using Python:
## Importing modules
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

## Read the data ("your_data.csv" is a placeholder for your own dataset)
data = pd.read_csv("your_data.csv")

## Standardize the data so every feature is on the same scale
standard_data = StandardScaler().fit_transform(data)

## Fit t-SNE with default parameters, projecting down to 2 dimensions
## (with very many features, it is advisable to reduce them first, e.g. with PCA)
model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(standard_data)

## Create a new data frame from the embedding and plot it
df = pd.DataFrame(tsne_data, columns=["dim_1", "dim_2"])
plt.scatter(df["dim_1"], df["dim_2"], s=10)
plt.show()
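If you want something you can run immediately without your own CSV file, here is a small end-to-end sketch on scikit-learn's built-in digits dataset; the dataset choice and plot styling are my own, but the t-SNE call mirrors the snippet above.

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

## Load a small labelled dataset (1797 samples, 64 features)
digits = load_digits()

## Standardize, then project the 64 features down to 2 dimensions
standard_data = StandardScaler().fit_transform(digits.data)
model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(standard_data)

## Put the embedding in a data frame and plot it, colored by digit label
df = pd.DataFrame(tsne_data, columns=["dim_1", "dim_2"])
df["label"] = digits.target
plt.scatter(df["dim_1"], df["dim_2"], c=df["label"], cmap="tab10", s=10)
plt.colorbar(label="digit")
plt.title("t-SNE projection of the digits dataset")
plt.show()

In the resulting scatter plot, the ten digits separate into visibly distinct clusters, which is the kind of pattern detection that makes t-SNE so useful for exploratory analysis.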
Hope this gives you a good understanding of t-SNE. There are many other dimensionality reduction techniques, PCA being another powerful one, and we will cover them in future posts. Hit like and leave a comment with your feedback.
If you found this article interesting, why not browse the other articles in our archive?