SVM Using CVXPY

A Support Vector Machine (SVM) is a supervised model used for classification and regression. It works by finding a hyperplane in N-dimensional space, where N is the number of features, that separates the classes of data.

Terminologies

Hyperplane – The decision boundary that helps classify the data points. A data point's class is decided by which side of the hyperplane it falls on.

Support Vectors – The data points closest to the hyperplane; these points influence the position and orientation of the hyperplane.

Weights – The coefficients that define the hyperplane. Minimizing the norm of the weight vector maximizes the distance between the support vectors and the hyperplane.

Hard Margin

If the training data is linearly separable then we can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the “margin”, and the maximum-margin hyperplane is the hyperplane that lies halfway between them. 

Consider the example below. Here the data points are linearly separable, since there is no overlap between them.

The condition is that every point must lie at least 1 unit away from the hyperplane, on its own side. We can write the problem formulation as follows.
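The hard-margin problem can be written in the standard form (a sketch using the usual notation, where the x_i are the feature vectors and the y_i ∈ {−1, +1} are the labels):

```latex
\min_{w,\, b} \quad \frac{1}{2} \lVert w \rVert^2
\qquad \text{subject to} \qquad
y_i \left( w^\top x_i + b \right) \ge 1, \quad i = 1, \dots, n
```

Minimizing the norm of w maximizes the margin, and the constraints enforce that every point is classified correctly with a margin of at least 1.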

We will use cvxpy in Python to solve this constrained optimization problem. The code is given below.

Soft Margin

If the training data is not linearly separable, then we cannot simply select two parallel hyperplanes, because doing so might lead to wrong predictions in some cases. So a soft-margin SVM is used to classify this type of data.

Since there is an overlap between the data points, we introduce slack variables into the formulation to tolerate points that violate the margin.

The formulation is given below.
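A sketch of the soft-margin formulation, using ψ for the slack variables as in the text, where C is a hyperparameter that trades off margin width against slack:

```latex
\min_{w,\, b,\, \psi} \quad \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \psi_i
\qquad \text{subject to} \qquad
y_i \left( w^\top x_i + b \right) \ge 1 - \psi_i, \quad \psi_i \ge 0
```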

Here ψ (psi) denotes the slack variables that absorb the overlap between classes. The problem can be solved using Python as shown below.

Check out the other blogs on machine learning algorithms under the ML Algorithms category.

K-Means Clustering Algorithm Without Libraries

K-Means clustering is a method of vector quantization used to partition N observations into K clusters, in which each observation belongs to the cluster with the nearest mean, and that mean serves as a prototype of the cluster.

Applications:

  1. Identifying crime-prone areas
  2. Customer segmentation
  3. Insurance fraud detection
  4. Public transport data analysis
  5. Clustering of IT alerts, etc.

Terminology:

Centroids: A centroid is the point at the center of a cluster. The final centroids act as prototypes used to classify the data and thereby produce the clusters.

Clusters: A cluster refers to a collection of data points aggregated together because of certain similarities.

Steps:

Step-1) Initialize centroids for clusters: We initialize k centroids by picking k random data points from the dataset. The Python code for each function is given under the corresponding step.
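Since the step's code is not shown in this extraction, here is a possible implementation; the function name is an assumption.

```python
import numpy as np

def initialize_centroids(data, k):
    """Pick k distinct random data points as the initial centroids."""
    indices = np.random.choice(len(data), size=k, replace=False)
    return data[indices].astype(float)
```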

Step-2) Assign the data to the corresponding cluster based on the centroids: We compute the distance between each data point and all k centroids, and find the index of the smallest distance. The data point is assigned to that cluster.
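A vectorized sketch of the assignment step (function name assumed):

```python
import numpy as np

def assign_clusters(data, centroids):
    """Assign each point the index of its nearest centroid."""
    # distances[i, j] = Euclidean distance from point i to centroid j
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=1)
```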

Step-3) Update centroids: Since the random initial centroids are not the final centroids, we update each centroid using the cluster assignments found in the previous step, moving it to the mean of the data points assigned to its cluster.
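A possible implementation of the update step; keeping the old centroid when a cluster is empty is an assumption to avoid undefined means.

```python
import numpy as np

def update_centroids(data, labels, k, old_centroids):
    """Move each centroid to the mean of the points assigned to it."""
    centroids = old_centroids.astype(float).copy()
    for j in range(k):
        members = data[labels == j]
        if len(members) > 0:  # keep the old centroid if a cluster is empty
            centroids[j] = members.mean(axis=0)
    return centroids
```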

Step-4) Know the difference: Since we will loop until the change in centroids is approximately zero, we need to compute the distance from the previous centroids to the current centroids. This difference is used as the condition for the while loop.
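A small sketch of the convergence check (function name assumed):

```python
import numpy as np

def centroid_shift(old_centroids, new_centroids):
    """Total Euclidean distance the centroids moved in the last update."""
    return np.linalg.norm(new_centroids - old_centroids, axis=1).sum()
```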

Step-5) Plotting the data points: Now we scatter-plot the data points, split into their k clusters, with a different color for each group so the clusters are easy to visualize and identify. This plot also shows the cluster centers, for both the initial and final state, based on the input given to the main k-means function.
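A possible plotting helper using matplotlib (function name and styling are assumptions); the non-interactive backend is set only so the sketch also runs headless.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show plots
import matplotlib.pyplot as plt

def plot_clusters(data, labels, centroids, title="Clusters"):
    """Scatter-plot the points colored by cluster, with centroids marked."""
    fig, ax = plt.subplots()
    for j in range(len(centroids)):
        members = data[labels == j]
        ax.scatter(members[:, 0], members[:, 1], label=f"cluster {j}")
    ax.scatter(centroids[:, 0], centroids[:, 1],
               c="black", marker="x", s=100, label="centroids")
    ax.set_title(title)
    ax.legend()
    return fig
```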

Step-6) K-Means: Now we integrate all the functions above into a main function called k_means. First we initialize the centroids for the given data and assign each point to a cluster. Then we keep updating the centroids in a while loop, using the difference between old and new centroids as the condition. Only the initial and final states are plotted.
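The steps above can be sketched as one self-contained function (inlining the assign and update logic so the snippet runs on its own; the tolerance and seed defaults are assumptions):

```python
import numpy as np

def k_means(data, k, tol=1e-6, seed=0):
    """Full k-means loop: initialize, then assign and update until converged."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    shift = np.inf
    while shift > tol:
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :],
                                   axis=2)
        labels = np.argmin(distances, axis=1)
        # Step 3: move centroids to the mean of their assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Step 4: total centroid movement drives the stopping condition
        shift = np.linalg.norm(new_centroids - centroids, axis=1).sum()
        centroids = new_centroids
    return labels, centroids
```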

Output:

I have used data with 2 features, generated with make_blobs from sklearn, which can generate N points for k centers and n features. I chose make_blobs because its second output is an array of integers containing the group number of each data point, which is useful for validation later.

Python code to plot only the data:
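The data-generation and plain scatter plot could look like this; the sample counts and random_state are assumptions, and the headless backend is only there so the sketch runs anywhere.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show plots
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate 300 points with 2 features around 3 centers (assumed settings).
# The second output gives the true group number of each point.
X, true_labels = make_blobs(n_samples=300, centers=3, n_features=2,
                            random_state=42)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], s=10)
ax.set_title("Raw data (no cluster coloring)")
```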

Output from the k_means function:

[NOTE: ONLY THE INITIAL AND FINAL PLOTS ARE SHOWN]

Extras:

A ‘color’ class for bold text and colored print output.
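The helper class is not shown in this extraction; a minimal sketch using ANSI escape codes (the attribute names are assumptions):

```python
class color:
    """ANSI escape codes for bold and colored terminal output."""
    BOLD = "\033[1m"
    RED = "\033[91m"
    GREEN = "\033[92m"
    BLUE = "\033[94m"
    END = "\033[0m"  # reset back to the default style

print(color.BOLD + "Final centroids:" + color.END)
print(color.GREEN + "converged" + color.END)
```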

Validation of the code: 1) using the second output from make_blobs; 2) running a library implementation on the same data and checking whether the outputs match.

Validation Method 1: Since the group numbering may differ between runs, we compare cluster memberships using indices and arrays rather than the raw labels.

Validation Method 2: Using sklearn.cluster.KMeans
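A sketch of the library-based check; the data settings mirror the assumed make_blobs call above, and the centroids it prints can be compared (up to label permutation) with those from our own k_means.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same assumed data settings as before
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Fit the reference implementation on the same data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("library centroids:\n", km.cluster_centers_)
```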

Elbow Method: Used to find the optimal number of clusters for the given data, by plotting the within-cluster sum of squares (inertia) against k and looking for the “elbow” where the curve flattens.
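A possible elbow plot, again on the assumed make_blobs data; the range of k values is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show plots
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

ks = range(1, 10)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

fig, ax = plt.subplots()
ax.plot(list(ks), inertias, marker="o")
ax.set_xlabel("number of clusters k")
ax.set_ylabel("inertia")
ax.set_title("Elbow method")
```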

Since the elbow is located at 3 (based on the above output), the optimal number of clusters for this data is 3.

Hope it was helpful.

FEEL FREE TO SEND YOUR FEEDBACK AND COMMENTS ON THIS POST TO HELP US IMPROVE IN THE FUTURE.

THANK YOU,

HAVE A NICE DAY.