An explanation of the support vector machine algorithm: its types, how it works, and its implementation in Python with the scikit-learn machine learning package
Support vector machine (SVM) is another very popular machine learning algorithm. It belongs to the supervised learning class and can be used for both regression and classification. The SVM's main purpose is to create a line, best known as a hyperplane (decision boundary), that separates the data points in n-dimensional space so that any new data point can be classified into a particular class.
The hyperplane is positioned using the data points closest to it. These closest points are known as support vectors, and they are where the name support vector machine originates. The whole SVM algorithm can be visualized using the figure below.
As can be seen from the figure above, the support vectors are the points closest to the hyperplane, and they determine its position and orientation. In the following sections, the intuition behind the support vector machine and how it actually works will be described, followed by an implementation in Python.
Structure Of The Support Vector Machine
The support vector machine has two distinct types: linear and non-linear. The linear SVM means a straight line can separate the data points into classes; the non-linear SVM means a straight line cannot separate the data points, so something more is needed. The SVM has two main parts: the hyperplane and the support vectors.
In classifying data points, there are several boundaries (lines) that can be drawn to try to separate the data points in n-dimensional space. The decision boundary that separates the data points in the best possible way is the hyperplane.
As can be seen in the figure above, there are three boundary lines, but only one separates the data points well, and that is the hyperplane. The hyperplane's dimension depends on the number of features in the data set. As there are two features in the figure above, the hyperplane is a straight line, but with three features the hyperplane becomes a two-dimensional plane.
The support vectors, as previously mentioned, are the data points closest to the hyperplane, and they determine the position and orientation of the hyperplane.
Support Vector Machine Steps
As mentioned, there are two types of support vector machine, linear and non-linear, and they will be discussed one after the other.
To illustrate the working of the linear SVM, consider the figure below,
We have a dataset of blue circles and red triangles, and we want to separate them into classes such that any new data point that is added can be correctly classified. As there are only two classes and two features, the data can be easily plotted, and many lines (decision boundaries) can be drawn to try to separate the two classes.
As shown in the figure above, there are three decision boundaries, and the support vectors are used to find the best one, which is the hyperplane. The green line is clearly the one that separates the data points best, hence it is the hyperplane. The algorithm finds the points closest to the hyperplane; the distance between these points and the hyperplane is known as the margin, and the algorithm maximizes this margin. The decision boundary with the largest margin is the hyperplane, in our case the green line.
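The margin maximization described above can be sketched with scikit-learn. The toy points below are made up for illustration; a linear-kernel `SVC` with a very large `C` behaves approximately like a hard-margin SVM, and it exposes the fitted support vectors and weights directly:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # class 0
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A linear-kernel SVC with a large C approximates the hard-margin,
# maximum-margin SVM
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# The support vectors are the points closest to the hyperplane
print(clf.support_vectors_)

# The margin width is 2 / ||w||, where w is the weight vector
w = clf.coef_[0]
print(2 / np.linalg.norm(w))
```

Only the closest points from each class end up in `clf.support_vectors_`; moving any other point (without crossing the margin) would not change the fitted hyperplane.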
The dataset for a linear support vector classifier can be represented simply as:

D = {(xᵢ, yᵢ) | xᵢ ∈ ℝᵖ, yᵢ ∈ {−1, 1}}

where D is the set of pairs (xᵢ, yᵢ), p is the dimension, xᵢ is the feature vector, and yᵢ is the class. To classify the points, the positive class lies on the side where w·x − b = 1 and the negative class on the side where w·x − b = −1.
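This classification rule can be sketched directly in code. The weight vector `w` and offset `b` below are made-up values for illustration, not learned from data; a point is assigned to a class by the sign of w·x − b:

```python
import numpy as np

# Hypothetical learned parameters, chosen only for illustration
w = np.array([1.0, 1.0])
b = 3.0

def classify(x):
    # Points with w.x - b >= 0 fall on the positive side of the
    # hyperplane w.x - b = 0; the margins sit at w.x - b = +1 and -1
    return 1 if np.dot(w, x) - b >= 0 else -1

print(classify(np.array([3.0, 3.0])))   # w.x - b = 3  -> prints 1
print(classify(np.array([1.0, 1.0])))   # w.x - b = -1 -> prints -1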
When the data cannot be separated into classes with a straight hyperplane, the data is non-linearly separable, and something more than a straight line is needed for the separation.
If the data points lie on the plane as shown in the figure above, a straight line cannot separate them, so a transformation is needed: adding a third dimension. This transformation is done by a function called the kernel. The kernel transforms the data so that non-linear data becomes linearly separable at a higher dimension. After the transformation to the third dimension, the figure becomes:
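The lift to a third dimension can be sketched with scikit-learn's `make_circles` data (an assumption here, standing in for the figure's points). Adding z = x₁² + x₂² as a third coordinate places the inner and outer circles at different heights, so a flat plane separates them; the RBF kernel performs a comparable transformation implicitly:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by a straight line in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Manual lift to a third dimension: z = x1^2 + x2^2. In this 3-D space
# the inner and outer circles sit at different heights, so a flat
# (linear) hyperplane can separate them
Z = np.c_[X, (X ** 2).sum(axis=1)]
lifted = SVC(kernel="linear").fit(Z, y)
print(lifted.score(Z, y))

# The RBF kernel does an equivalent implicit transformation, without
# ever computing the higher-dimensional coordinates explicitly
rbf = SVC(kernel="rbf").fit(X, y)
print(rbf.score(X, y))
```

Both models separate the circles almost perfectly, while a linear SVM fit on the raw 2-D points could not.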
Now we can find a hyperplane that correctly separates the data points, since the data is now in three dimensions.
To separate the data points, a circle (in the original two-dimensional space) seems like the best boundary: it correctly separates the blue circles from the red triangles and leaves some room for error to prevent overfitting. So the support vector machine has correctly separated the data, and new data points can be classified well.
Implementation Of Support Vector Machine
We use Python to implement the support vector machine algorithm on a dataset you have already seen: the iris dataset. To recap, we use some features of the iris flower, such as its sepal length, petal width, etc., to classify iris flowers into three categories. Python, as we know, has excellent libraries to help us do this quickly and efficiently.
import pandas as pd
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
Pandas is a data wrangling and manipulation library that enables us to preprocess the data into the form the support vector machine can work on. We import the support vector machine as LinearSVC from the scikit-learn library. We evaluate the model using the accuracy score, and train_test_split helps us split our data into training and testing sets.
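The snippets below assume the iris data is already loaded into a DataFrame called `data` with a `Species` column. One way to build it, using sklearn's bundled copy of the dataset rather than a CSV file (that source is an assumption here), is:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Build the DataFrame the later snippets call `data`; the column name
# 'Species' matches the article, but loading from sklearn's bundled
# iris data (rather than a CSV) is an assumption
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data["Species"] = iris.target_names[iris.target]

print(data.shape)                  # (150, 5)
print(data["Species"].nunique())   # 3
```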
X = data.drop(columns=['Species'])
y = data['Species']
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size= 0.15, random_state=0)
We store all the independent features in the X variable, and the y variable contains only the target data. We then split our data into training and testing sets, dedicating 85% to training and the remaining 15% to testing how our model did.
svm = LinearSVC()
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
We instantiate the support vector machine and store it in a variable named svm. We then train it by fitting the training input features to the training targets. Finally, we make predictions by calling the predict function on the testing data to evaluate how the training performed.
We use the accuracy score, which divides the number of correct classifications predicted by the support vector machine by the total number of classifications. The dataset is balanced, meaning all three classes have the same or almost the same number of data points, in this case 50 for each of the three classes (150 in total), so this metric is a great choice. Our model's accuracy was 91.3%, which means that using the features of the iris flower, such as petal length and sepal width, the support vector machine can predict the class of the iris flower with 91.3% accuracy, which is a very good score.
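The accuracy computation the article describes can be sketched end to end. This is a self-contained version of the pipeline (loading iris from sklearn and raising `max_iter` to avoid convergence warnings are choices made here, not from the article), with the `accuracy_score` call that produces the reported metric:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

# max_iter raised so the solver converges cleanly on this data
svm = LinearSVC(max_iter=10000)
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)

# correct predictions / total predictions
print(accuracy_score(y_test, svm_pred))
```

The exact score depends on the train/test split and solver settings, so it may differ slightly from the 91.3% quoted above.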
We looked at the support vector machine algorithm: what it is, the types available, its structure, how the algorithm works, and finally how to implement it using Python. The algorithm is excellent at classifying data, and most importantly, the kernel function, which transforms data that is very difficult to classify into higher-dimensional data for easy classification, makes this algorithm a very good option for classification purposes.