Crazy Mathematics behind SVM in a Simple Way
Let’s understand the simple maths behind SVM, from the maximum margin to the kernel trick.
The origin story of SVM (Optional)
Vladimir N. Vapnik was confident about his model when it was challenged by multilayer perceptron models on an image classification task. He knew that his model would perform better than those neural networks, but nobody trusted him. He and his colleague Alexey Chervonenkis had invented the method some 30 years before this challenge, and it had been ignored ever since. Thankfully, it performed as well as those perceptron models, and the machine learning community was introduced to a new classification algorithm: the Support Vector Machine.
What is SVM?
Support Vector Machine is a supervised machine learning algorithm used for classification. In simple terms, it classifies data points using a divider called a hyperplane. The shape of this divider (hyperplane) depends on the number of dimensions of the data: in 2D space it is a line, in 3D it is a plane, and so on. It divides the data points into two groups. So is it just for binary (two-class) classification? No, we can use it for multiclass classification as well, for example by combining several binary classifiers. So let’s understand the working of SVM and how we get that hyperplane.
Hyperplane
Consider that we want to separate the blue points from the red ones. There are many ways to draw a line between those points, but let’s say we have the three candidates shown above. Our intuition tells us that line 2 is the best separating line: it does not sit close to the points like line 1, and it is not biased towards one class like line 3. We choose line 2 because it has the maximum margin, i.e. it is at the maximum distance from both the red and the blue points. As humans, we can draw a line that separates the blue points from the red points with ease. But we want to draw the optimal hyperplane so that our algorithm performs well not just on the training data but also on the test data.
So, how do we find this hyperplane?
Let’s point out what exactly we want.
- We need a line that separates the blue and red points,
- that line should have the maximum margin,
- and it should be easy to calculate.
Consider a line (hyperplane) which separates the points optimally as shown below. The points on the inner side of the line are red and on the outer side is the blue once. This line is optimal hyperplane, i.e. it has a maximum margin. We have a point u which is unknown to us and we want to decide whether it is red or blue. A vector wÌ… has drawn such that it is perpendicular to our line while the vector uÌ… is point vector to u. xÌ…r and xÌ…b are vectors for a red and blue point respectively.
The dot product of w̅ and u̅ decides whether the point u is red or blue. If it is greater than a certain value, then u is blue; otherwise it is red.
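In the standard derivation, with the threshold written as a bias term b, this decision rule takes the form:

$$\bar{w} \cdot \bar{u} + b \;\ge\; 0 \;\;\Rightarrow\;\; u \text{ is blue}$$

and u is red otherwise.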
To simplify things, we introduce a new symbol y so that the two class conditions (one for blue points, one for red points) can be merged. Let us take y = +1 for blue and −1 for red. After multiplying both conditions by y, we get the following equation.
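In the standard derivation, the two class conditions are:

$$\bar{w} \cdot \bar{x}_i + b \ge 1 \;\text{ for blue points}, \qquad \bar{w} \cdot \bar{x}_i + b \le -1 \;\text{ for red points} \qquad \text{(Equation 1)}$$

Multiplying each condition by its y collapses both into a single constraint:

$$y_i\,(\bar{w} \cdot \bar{x}_i + b) - 1 \;\ge\; 0 \qquad \text{(Equation 2)}$$

with equality holding for the points that lie exactly on the margin (the support vectors).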
We now have constraints on our points, as shown in Equation 1. The next thing we want to do is to find the width of our margin.
Width
To find the width, we can use the dot product of the unit vector in the direction of w̅ and (x̅b − x̅r), as shown in our 2D space.
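Written out, that projection of the difference vector onto the unit normal is:

$$\text{width} = (\bar{x}_b - \bar{x}_r) \cdot \frac{\bar{w}}{\lVert \bar{w} \rVert}$$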
This expression exactly represents the margin. We can substitute Equation 2 into it and we get the following result.
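For the closest points, Equation 2 holds with equality, i.e. w̅·x̅b + b = 1 and w̅·x̅r + b = −1, so the numerator becomes 2 and the width simplifies to the standard result:

$$\text{width} = \frac{2}{\lVert \bar{w} \rVert}$$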
We want our points to be separated as much as possible, i.e. we want this width to be maximum. Here, minimising ‖w̅‖ increases the width. In other words, we can write this as below.
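Maximising the width is therefore equivalent to the standard reformulation:

$$\max \frac{2}{\lVert \bar{w} \rVert} \;\;\Longleftrightarrow\;\; \min \lVert \bar{w} \rVert \;\;\Longleftrightarrow\;\; \min \tfrac{1}{2} \lVert \bar{w} \rVert^{2}$$

where the ½ and the square are a mathematical convenience that makes the derivatives cleaner later on.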
Recap 1
Let’s just recap what we did. First, we made some assumptions about our points: we considered one red and one blue point which are closest to our hyperplane. After performing vector operations, we found Equation 2, which is satisfied by every point in space (with equality for the closest points). We then focused on finding our margin. The width can be calculated just from the selected points: the dot product of their difference vector and the unit vector along w̅ gives us the width. We are trying to maximise this width so that we get better results out of this algorithm. Nice!
Lagrange Multiplier
We have some rules in the form of Equation 2, and we want to maximise the margin. This is exactly where we can use a constrained optimisation technique called the Lagrange multiplier method. To find the optimal value of any parameters x and y, we can use the following equation.
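In its general form, for an objective R(x, y) subject to a constraint B(x, y) = 0, the Lagrangian reads:

$$L(x, y, \alpha) = R(x, y) - \alpha\, B(x, y)$$

and the constrained optimum is found where all partial derivatives of L are zero.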
Here, we have the width-based objective as the function R and Equation 2 as the function B. So, in the end, we get something like this.
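For our problem the objective is ½‖w̅‖² and there is one constraint (Equation 2) per point, each with its own multiplier αᵢ, which gives the standard Lagrangian:

$$L = \frac{1}{2}\lVert \bar{w} \rVert^{2} \;-\; \sum_{i} \alpha_i \left[\, y_i (\bar{w} \cdot \bar{x}_i + b) - 1 \,\right] \qquad \text{(Equation 4)}$$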
In Equation 4, α represents the Lagrange multiplier and b represents the constant (bias) term. In our case, as the constraint applies to each point in space, we take a summation of the constraint over all points. To know more about Lagrange multipliers, you can follow this course.
To maximise the width we need ∇L = 0, i.e. the gradient of the Lagrangian must vanish at the optimum. After doing some calculation, we get the following interesting results.
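Setting the partial derivatives of L with respect to w̅ and b to zero gives the two standard results referred to below as Equations 5 and 6:

$$\frac{\partial L}{\partial \bar{w}} = 0 \;\Rightarrow\; \bar{w} = \sum_i \alpha_i y_i \bar{x}_i \qquad \text{(Equation 5)}$$

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0 \qquad \text{(Equation 6)}$$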
Great! After putting the corresponding values back into Equation 4, we get the following equation.
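Substituting Equations 5 and 6 into Equation 4 gives the standard dual form:

$$L = \sum_i \alpha_i \;-\; \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \,(\bar{x}_i \cdot \bar{x}_j)$$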
Recap 2
After finding the width and all of the constraints, we wanted to maximise the width. In simple terms, we want to separate the red points from the blue points as far as possible. To do that, we used a technique which helped us maximise the width: ∇L, i.e. the gradient of L, is zero when we achieve that maximum width. While working out ∇L = 0, we found two more equations which helped us simplify Equation 4.
Kernel Trick
Notice our final equation: while obtaining ∇L = 0, i.e. while maximising the margin, the only term that matters is (x̅i · x̅j), i.e. the dot product of a pair of points. In the same way, if we put the values obtained in Equations 5 and 6 into the decision rule, we get the following equation.
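Substituting Equation 5 into the decision rule from earlier gives, in its standard form:

$$\sum_i \alpha_i y_i \,(\bar{x}_i \cdot \bar{u}) + b \;\ge\; 0 \;\;\Rightarrow\;\; u \text{ is blue}$$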
Here too we find the same pattern: the class of u depends only on dot products between u and the training points. We can extend this idea further: because only dot products appear, we can replace them with dot products computed in a higher-dimensional space, and so apply the method not just in 2D space but in multidimensional space. This technique is called the kernel trick.
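Concretely, a kernel function K(x̅i, x̅j) replaces every dot product and computes it in a higher-dimensional feature space without ever constructing that space explicitly. Common choices include:

$$K_{\text{linear}}(\bar{x}_i,\bar{x}_j) = \bar{x}_i \cdot \bar{x}_j, \qquad K_{\text{poly}}(\bar{x}_i,\bar{x}_j) = (\bar{x}_i \cdot \bar{x}_j + 1)^d, \qquad K_{\text{RBF}}(\bar{x}_i,\bar{x}_j) = \exp\!\left(-\gamma \lVert \bar{x}_i - \bar{x}_j \rVert^{2}\right)$$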
We can use this trick in any number of dimensions without needing a lot of extra computation. This is what makes SVM so competitive in tasks like image classification.
There are a lot of kernels available for SVM; in most cases a linear kernel is enough. While working out an SVM by hand is a bit hard, there is a very simple implementation of SVM in Python’s Scikit-learn library.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
#importing the data
bankdata = pd.read_csv("D:/Datasets/bill_authentication.csv")
X = bankdata.drop('Class', axis=1)
y = bankdata['Class']
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
# SVM classifier
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
# prediction
y_pred = svclassifier.predict(X_test)
# The performance report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
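If a linear kernel is not enough, the same code works with a non-linear kernel. A minimal sketch, keeping gamma at scikit-learn's default value 'scale':

# SVM with the RBF kernel instead of the linear one
svclassifier_rbf = SVC(kernel='rbf', gamma='scale')
svclassifier_rbf.fit(X_train, y_train)
print(classification_report(y_test, svclassifier_rbf.predict(X_test)))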
Conclusion
SVM is a simple and very effective algorithm. Its optimisation problem is convex, so there is no issue of getting stuck in a local optimum; it reliably reaches the optimal solution. But it is no longer the best method for image classification: the recent rise of CNN architectures and hardware like GPUs has made modern neural networks far superior to SVM on such tasks. However, when the data has a small number of features, SVM can be more effective than those neural networks.
Machine learning is not just about programming and training models. At its core, it’s pure maths! If you are interested in maths, you will like machine learning as well.