Support Vector Machines, or SVMs for short, are among the most popular and widely discussed machine learning algorithms. They were extremely popular around the time they were developed and refined in the 1990s, and they remain one of the best choices when you need a high-performing, robust prediction method that requires only a little tuning.
SVM works differently from most other ML algorithms. An SVM training algorithm builds a model that assigns new examples to one of two categories, making it a non-probabilistic binary linear classifier.
SVM is a supervised learning algorithm that can be used for both classification and regression problems, although it is primarily used for classification. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using a technique called the kernel trick, which implicitly maps their inputs into high-dimensional feature spaces. We will look at kernels in detail shortly.
The ideas behind SVM have also been extended to unsupervised learning. When data is unlabelled, supervised learning is not possible, and an unsupervised approach is required, one that attempts to find a natural clustering of the data into groups and then maps new data to those groups.
The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machine algorithm, to categorize unlabelled data, and it has been described as one of the most widely used clustering algorithms in industrial applications.
In this article, however, we will stick to the supervised learning model.
A Support Vector Machine is a discriminative classifier formally defined by a separating hyperplane. In other words, given labelled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space, this hyperplane is a line dividing the plane into two parts, with each class lying on one side.
In simple terms, I would describe SVM as a model that represents the data points in space, mapped so that the points of the separate categories are divided by a clear gap that is as wide as possible.
For 1-dimensional data, the support vector classifier is a point. Similarly, for 2-dimensional data it is a line, and for 3-dimensional data it is a plane. For data with 4 or more dimensions, the support vector classifier is generally referred to as a hyperplane.
Let me walk you through the hyperplane first.
In geometry, a hyperplane is a subspace whose dimension is one less than that of its ambient space.
If a space is 3-dimensional, then its hyperplanes are the 2-dimensional planes, while if the space is 2-dimensional, its hyperplanes are the 1-dimensional lines. This notion can be used in any general space in which the concept of the dimension of a subspace is defined.
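To make the idea concrete, here is a minimal sketch of how a hyperplane splits a space into two sides. It assumes the common notation in which a hyperplane is the set of points x satisfying w·x + b = 0; the names `w`, `b`, and `side_of_hyperplane`, as well as the sample points, are my own choices for illustration and do not come from any library.

```python
import numpy as np

def side_of_hyperplane(w, b, points):
    """Return +1 or -1 for each point, depending on its side of w.x + b = 0."""
    scores = points @ w + b
    return np.where(scores >= 0, 1, -1)

# Illustrative hyperplane in 2-D space: the vertical line x = 2,
# written as w = [1, 0] and b = -2.
w = np.array([1.0, 0.0])
b = -2.0

# Made-up points: one to the left of the line, two to the right.
points = np.array([[0.0, 5.0], [3.0, -1.0], [2.5, 2.5]])

sides = side_of_hyperplane(w, b, points)
print(sides)  # [-1  1  1]
```

The same function works unchanged in 3 or more dimensions, since only the length of `w` and of each point changes.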
Suppose you are given a plot of two labelled classes, blue stars and red circles, as shown in the image below.
How feasible is it to draw a line to separate the two classes?
We can easily draw a line that separates the two classes fairly well. Any point to the left of the line falls into the blue star class, and any point to the right falls into the red circle class. Separating classes is exactly what SVM does.
It finds a line, or in a multidimensional space a hyperplane, that separates the classes.
When the data points can be classified into two classes using a single straight line, as they can here, the data is termed linearly separable, and the classifier used is called a linear SVM classifier.
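As a rough illustration of a linear SVM classifier, the sketch below fits scikit-learn's `SVC` with a linear kernel on a handful of points. The coordinates are made up for this example (they are not the data from the image), with class 0 standing in for the stars on the left and class 1 for the circles on the right.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data: class 0 on the left, class 1 on the right.
X = np.array([[1.0, 2.0], [1.5, 1.0], [2.0, 3.0],
              [6.0, 2.0], [6.5, 3.5], [7.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel asks for a straight-line (hyperplane) decision boundary.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# For a linear kernel, the fitted hyperplane is coef_ . x + intercept_ = 0.
w = clf.coef_[0]
b = clf.intercept_[0]
print("w =", w, "b =", b)

# Points well to the left and well to the right of the gap.
new_points = np.array([[0.5, 2.0], [8.0, 2.0]])
predictions = clf.predict(new_points)
train_accuracy = clf.score(X, y)
print(predictions, train_accuracy)
```

The points listed in `clf.support_vectors_` are the ones closest to the boundary; they alone determine where the line ends up.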
Looks simple, right?
Let us move to the next step and make it a bit more complex.
Now consider the data points shown in the image below, where the two classes are jumbled together (I have just added a bit of complexity).
Clearly, no line can separate the two classes in this x-y plane. So what do we do?
To classify such data, SVM introduces an additional feature.
We apply a transformation and add one more dimension, which we call the z-axis. In this scenario, the new feature is z = x^2 + y^2.
Here z can be interpreted as the squared distance of a point from the origin. If we now plot the data along the z-axis, a clear separation is visible and a line can be drawn.
The plot shows all data points on the x and z axes.
All values on the z-axis are non-negative, because z is the sum of x squared and y squared.
In the original plot, the red circles are close to the origin of the x and y axes, which gives them low values of z, while the stars are exactly the opposite: they lie away from the origin, which gives them high values of z.
When we map this separating boundary back to the original x-y plane, it looks like a circle. Refer to the image below.
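The transformation can be tried out in a few lines of NumPy. This is only a sketch on synthetic data that I generate here (an inner cluster within radius 1 and an outer ring between radius 2 and 3); the radii and the threshold of 2.5 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Synthetic data: class 0 near the origin, class 1 on a ring around it.
angles = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
radii = np.concatenate([rng.uniform(0.0, 1.0, size=n),   # inner cluster
                        rng.uniform(2.0, 3.0, size=n)])  # outer ring
x = radii * np.cos(angles)
y = radii * np.sin(angles)
labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# The new feature: squared distance from the origin.
z = x ** 2 + y ** 2

# In the z dimension a single threshold separates the classes.
threshold = 2.5
predicted = (z > threshold).astype(int)
accuracy = (predicted == labels).mean()
print(accuracy)
```

The threshold z = 2.5 in the new dimension corresponds to the circle x^2 + y^2 = 2.5 in the original x-y plane, which is why the boundary looks circular there.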
Let's make the data more complex by overlapping the classes (red circles and blue stars very close to each other or overlapping) or by placing points of one class inside the area of the other (a blue star very near the red circles). Something similar to the image below.
What do you think? Are we able to solve this using SVM? Yes, we can, and both solutions shown above with green lines are valid. The first one tolerates some outlier points, as two blue stars end up in the class of red circles.
The second one tries to achieve zero tolerance with a perfect partition.
But there is a trade-off. In a real-world application, finding a perfect separation for millions of training points takes a lot of time, as you will see when coding. How much misclassification we are willing to tolerate is controlled by what is called the regularization parameter.
When the data points cannot be classified using a straight line, the data is termed non-linear data, and the classifier used is called a non-linear SVM classifier.
In the SVM algorithm, it is easy to classify using a linear hyperplane between two classes. But the question that arises is whether we need to add such a feature by hand so that SVM can identify the hyperplane. The answer is no: to solve this problem, SVM has a technique commonly known as the kernel trick.
So far we have come across two terms, kernel and regularization, and there is one more important term known as gamma. We will look at all three now.
These are tuning parameters of the SVM classifier. By varying them, we can obtain a good non-linear classification boundary with higher accuracy in a reasonable amount of time.
The SVM algorithm uses a set of mathematical functions that are known as kernels. For some classification problems, it is not possible to find a hyperplane or a linear decision boundary in the original space. If we project the data from the original space into a higher dimension, we may find a hyperplane in the projected space that helps to classify the data.
For example, it may be impossible to find a line that separates the two classes in the input space, but if we project the same data points into a higher dimension, we may be able to separate the two classes using a hyperplane. Refer to the example below: initially it is hard to separate the two classes, but once the data is projected into a higher dimension, a hyperplane separates them easily.
Thus a kernel helps to find a hyperplane in the higher-dimensional space without increasing the computational cost, which would usually grow as the number of dimensions increases.
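A small numerical check shows why the cost does not grow. The sketch below uses a degree-2 polynomial kernel as an example (my choice for illustration): computing (a·b)^2 directly in the original 2-D space gives the same number as first mapping both points into 3-D and taking the dot product there. The helper names `explicit_feature_map` and `quadratic_kernel` are hypothetical.

```python
import numpy as np

def explicit_feature_map(v):
    """Map a 2-D point (v1, v2) to the 3-D point (v1^2, sqrt(2)*v1*v2, v2^2)."""
    v1, v2 = v
    return np.array([v1 ** 2, np.sqrt(2.0) * v1 * v2, v2 ** 2])

def quadratic_kernel(a, b):
    """Degree-2 polynomial kernel, computed without leaving the 2-D space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Route 1: project into the higher dimension, then take the dot product.
via_mapping = np.dot(explicit_feature_map(a), explicit_feature_map(b))

# Route 2: apply the kernel to the original points.
via_kernel = quadratic_kernel(a, b)

print(via_mapping, via_kernel)  # both are 121, up to floating-point rounding
```

SVM only ever needs these dot products, so it can work with the second route and never has to build the higher-dimensional points explicitly.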
A kernel is the function that transforms the data into a suitable form. Various kernel functions are used in the SVM algorithm, e.g. linear, polynomial, Radial Basis Function (RBF), and sigmoid. Using the kernel trick, a low-dimensional input space is implicitly converted into a higher-dimensional space.
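To get a feel for how the choice of kernel matters, here is a hedged sketch that fits three kernels on scikit-learn's synthetic `make_circles` dataset, where one class forms a ring around the other. The dataset settings and the degree of the polynomial kernel are assumptions picked for this illustration, and the scores are training accuracies only.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: a small circle of one class inside a larger ring of the other.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

models = {
    "linear": SVC(kernel="linear"),
    "poly": SVC(kernel="poly", degree=2),
    "rbf": SVC(kernel="rbf"),
}

accuracies = {}
for name, model in models.items():
    model.fit(X, y)
    accuracies[name] = model.score(X, y)  # accuracy on the training data

print(accuracies)
```

On data like this, a straight line cannot do much, while the RBF kernel can wrap a boundary around the inner class.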
The regularization parameter (the C parameter in Python’s sklearn library) tells the SVM optimization how much you want to avoid misclassifying each training example.
For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.
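The effect of C on the margin can be observed directly. The sketch below is an illustration under my own assumptions: two overlapping synthetic blobs from `make_blobs`, a linear kernel, and the margin width computed as 2 / ||w||, which is the standard expression for a linear SVM. The helper `margin_width` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic classes, so a perfect separation is not possible.
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

def margin_width(c_value):
    """Fit a linear SVM with the given C and return its margin width 2 / ||w||."""
    clf = SVC(kernel="linear", C=c_value).fit(X, y)
    return 2.0 / np.linalg.norm(clf.coef_[0])

margin_small_c = margin_width(0.01)   # tolerant of misclassification
margin_large_c = margin_width(100.0)  # penalizes misclassification heavily

print(margin_small_c, margin_large_c)
```

The small value of C should report the wider margin, in line with the description above.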
The gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. In other words, with low gamma, points far away from plausible separation lines are considered in the calculation for the separation line. Whereas high gamma means the points close to plausible lines are considered in the calculation.
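The reach of a training example can be seen in the RBF kernel itself, which has the form exp(-gamma * ||a - b||^2). The sketch below evaluates it for one pair of points with a low and a high gamma; the points and the two gamma values are made up for illustration, and `rbf_similarity` is a hypothetical helper name.

```python
import numpy as np

def rbf_similarity(a, b, gamma):
    """RBF kernel value exp(-gamma * squared distance between a and b)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

training_point = np.array([0.0, 0.0])
other_point = np.array([1.0, 1.0])  # squared distance from training_point is 2

low_gamma_value = rbf_similarity(training_point, other_point, gamma=0.1)
high_gamma_value = rbf_similarity(training_point, other_point, gamma=10.0)

print(low_gamma_value)   # exp(-0.2), about 0.82: the point still "feels" the example
print(high_gamma_value)  # exp(-20), practically zero: the example has no influence here
```

With low gamma the influence fades slowly with distance, while with high gamma it vanishes almost immediately outside the point's close neighbourhood.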
There are other hyperparameters that play an important role as well. Some of them are specific to classification or to regression problems, and some apply only in combination with particular values of other hyperparameters.
One example is the degree parameter: it sets the degree of the polynomial kernel function (‘poly’) and is ignored by all other kernels.
Another parameter is epsilon, which is specific to regression problems. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.
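One way to see the epsilon-tube at work is to count support vectors in scikit-learn's `SVR`: points that fall inside the tube carry no penalty and do not become support vectors. The sketch below uses synthetic noisy sine data and two epsilon values that I picked purely for illustration.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic regression data: a sine curve with a little noise.
X = np.sort(rng.uniform(0.0, 5.0, size=80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# A narrow tube penalizes almost every deviation; a wide tube ignores most of them.
narrow_tube = SVR(kernel="rbf", C=1.0, epsilon=0.01).fit(X, y)
wide_tube = SVR(kernel="rbf", C=1.0, epsilon=0.5).fit(X, y)

n_sv_narrow = len(narrow_tube.support_)
n_sv_wide = len(wide_tube.support_)

print(n_sv_narrow, n_sv_wide)
```

The wider tube should end up with fewer support vectors, giving a simpler but less precise fit.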
We have two choices: we can either use the scikit-learn library to import an SVM model and use it directly, or we can write our own model from scratch.
Creating the model from scratch is really fun and interesting, but it requires a lot of patience and time.
Using the sklearn.svm module, which includes Support Vector Machine algorithms, makes it much easier both to implement a model and to tune its parameters.
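For reference, a minimal end-to-end sketch with the library route might look like the following. The dataset (scikit-learn's built-in iris data), the train/test split, the feature scaling step, and the parameter values are all my own assumptions for the example, not a recommended configuration. Note that iris has three classes, which `SVC` handles internally.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Built-in example dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Scale the features, then fit an SVM classifier with an RBF kernel.
model = make_pipeline(StandardScaler(),
                      SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

The kernel, C, and gamma arguments are the tuning parameters discussed earlier, so experimenting with them only means changing the values passed to `SVC`.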
I will be writing another article dedicated to hands-on work with the SVM algorithms, covering both classification and regression problems, in which we will tweak and play with the tuning parameters. I will also compare the performance of different kernels.
Some use cases of SVM are listed below.
- Face Detection
- Classification of Images
- Remote Homology Detection
- Handwriting Detection
- Generalized Predictive Control
- Text and Hypertext Categorization
SVM is a much vaster topic than what is covered in this article. I have tried to cover the very basics to help you understand SVM and kickstart your learning. I recommend digging deeper, as there are more concepts to learn.
I hope you have learned quite a bit today. Let me know if you want me to cover any specific topic related to data science, Machine learning, SQL, etc. To do so, kindly leave a comment on my blog and I’ll be sure to get back to you.
Also, follow me on DSF & LinkedIn to get my blog updates. Also, follow me on Kaggle as well and subscribe to my notebook dedicated to hands-on experience. If you have any feedback on this article, please comment below!
Stay Happy, Stay Safe, Stay Fit, Stay Humble…!