1. About SVM
Since the SVM (Support Vector Machine) algorithm is well documented, we will not discuss SVM theory here. We will briefly present the dataset that we will use, and then apply SVM to it. For more details about the SVM algorithm, you can check it here.
To exemplify classification using SVM in Python we will use the wine dataset from the scikit-learn (sklearn) package. Its sklearn.datasets module has many datasets and loaders that can be used in tests (iris, wine, files, digits, breast_cancer, diabetes, linnerud, sample_image, sample_images, svmlight_file, svmlight_files).
2. Things about dataset used
a) To load the wine dataset we will use the load_wine function:
# import section
from sklearn import datasets
#load wine data set
ds_wine=datasets.load_wine()
b) Let's see some things about the wine dataset. Any scikit-learn dataset of this kind is characterized by features and targets (label names). For wine:
#check features of dataset
print("Features: ", ds_wine.feature_names)
# print wine type (i.e labels )
print("Labels: ", ds_wine.target_names)
#output:
Features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Labels: ['class_0' 'class_1' 'class_2']
We see that the wine features are 'alcohol', 'malic_acid', 'ash', etc. The wine labels (or targets, or types) are 'class_0', 'class_1', 'class_2'.
The dataset has 178 samples. The SVM classifier in this example will assign each wine to one of these three classes.
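To see how the 178 samples are spread over the three labels, a minimal sketch like the one below can be used. It assumes NumPy is available (it is installed together with scikit-learn); the variable name class_counts is only illustrative.

```python
import numpy as np
from sklearn import datasets

ds_wine = datasets.load_wine()

# ds_wine.target holds one integer label (0, 1 or 2) per sample;
# bincount counts how many times each integer appears
class_counts = np.bincount(ds_wine.target)

for name, count in zip(ds_wine.target_names, class_counts):
    print(name, count)
```

If one class were much smaller than the others, accuracy alone would be a less reliable measure later on.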
c) Let's print the first 2 records of our wine dataset:
print(ds_wine.data[0:2])
#output
[[1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00
2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]
[1.320e+01 1.780e+00 2.140e+00 1.120e+01 1.000e+02 2.650e+00 2.760e+00
2.600e-01 1.280e+00 4.380e+00 1.050e+00 3.400e+00 1.050e+03]]
Comparing with the feature names printed in b), for the first record we see:
alcohol=1.423e+01
malic_acid=1.710e+00
ash=2.430e+00
etc
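The same pairing of feature names and values can be produced in code rather than by eye. This is a small sketch; the name first_record is an assumption made for the example.

```python
from sklearn import datasets

ds_wine = datasets.load_wine()

# Pair each feature name with the matching value of the first record
first_record = dict(zip(ds_wine.feature_names, ds_wine.data[0]))

for name, value in first_record.items():
    print(f"{name} = {value}")
```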
In general, to determine how many records a dataset has and how many features each record has, we use the shape attribute:
#shape of the dataset
print(ds_wine.data.shape)
#output
(178, 13)
This means the ds_wine dataset has 178 records, and each one has 13 features.
3. Applying SVM classification to dataset
Now that we have seen what the wine dataset looks like, we will apply SVM classification to it.
As with any machine learning algorithm, we need one part of the data for training the model and another part for testing it.
3.1. Train and test dataset
For the SVM model we will choose the train and test datasets from the initial ds_wine dataset, using the train_test_split function.
# Import train_test_split function
from sklearn.model_selection import train_test_split
# split ds_wine
dsx_train, dsx_test, dsy_train, dsy_test = train_test_split(ds_wine.data, ds_wine.target, test_size=0.15,random_state=109) # 85% training and 15% test
print(dsx_train[0:2])
print(dsy_train[0:2])
#output:
[[1.279e+01 2.670e+00 2.480e+00 2.200e+01 1.120e+02 1.480e+00 1.360e+00
2.400e-01 1.260e+00 1.080e+01 4.800e-01 1.470e+00 4.800e+02]
[1.438e+01 1.870e+00 2.380e+00 1.200e+01 1.020e+02 3.300e+00 3.640e+00
2.900e-01 2.960e+00 7.500e+00 1.200e+00 3.000e+00 1.547e+03]]
[2 0]
Above we printed the first two records from the train dataset. Referring to dsy_train, we see that the target for the first record is 2, i.e. the 'class_2' label, and the second record has target 0, i.e. the 'class_0' label.
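To confirm how many records ended up in each part, we can check the shapes of the four arrays returned by the split. A short sketch, using the same parameters as above:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

ds_wine = datasets.load_wine()
dsx_train, dsx_test, dsy_train, dsy_test = train_test_split(
    ds_wine.data, ds_wine.target, test_size=0.15, random_state=109)

# Features: (records, 13); targets: (records,)
print(dsx_train.shape, dsx_test.shape)
print(dsy_train.shape, dsy_test.shape)
```

With 178 records and test_size=0.15, the test part is rounded up to 27 records and the remaining 151 go to training. If the class proportions should be preserved in both parts, train_test_split also accepts a stratify argument (for example stratify=ds_wine.target); that option is not used in this post.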
3.2. Create SVM classifier
To generate the SVM model, we first create an SVM classifier object using the SVC class:
from sklearn import svm
myclassifier = svm.SVC(kernel='linear')
We used the SVC class (the letters come from Support Vector Classification). Its constructor has many parameters; we set only kernel, choosing a linear kernel. The kernel can be 'linear', 'poly', 'rbf' (the default), 'sigmoid' or 'precomputed'.
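To see the other parameters and the values they currently have, the classifier object can be inspected with get_params(), which every scikit-learn estimator provides. A minimal sketch:

```python
from sklearn import svm

myclassifier = svm.SVC(kernel='linear')

# get_params() returns every constructor parameter with its current value
params = myclassifier.get_params()
for name in sorted(params):
    print(name, "=", params[name])
```

Everything except kernel is shown with its default value here, since kernel is the only parameter we set.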
3.3. Train and test SVM model
To train the model, use the "fit" method with the train dataset; then, to test the classification, use the "predict" method with the test dataset:
#train model with fit method using our dsx_train dataset
myclassifier.fit(dsx_train, dsy_train)
#make prediction for test dataset dsx_test
dsy_pred = myclassifier.predict(dsx_test)
print(dsy_pred)
#output:
[0 0 1 2 0 1 0 0 1 0 2 1 2 2 0 1 1 0 0 1 2 1 0 2 0 0 1]
To make this more intuitive, we will take one record from the dataset and predict its target. Let's take the first record from dsx_test; it looks like this:
print(dsx_test[0])
#output
[1.330e+01 1.720e+00 2.140e+00 1.700e+01 9.400e+01 2.400e+00 2.190e+00
2.700e-01 1.350e+00 3.950e+00 1.020e+00 2.770e+00 1.285e+03]
and its real value (target) is dsy_test[0]:
print(dsy_test[0])
#output:
0
Predicted target for dsx_test[0] is:
dsy_pred_0=myclassifier.predict([dsx_test[0]])
print(dsy_pred_0)
#output:
[0]
We see that for this record the predicted value dsy_pred_0 is the same as the real value dsy_test[0].
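The same comparison can be made for every test record at once. The sketch below lists the records where the prediction differs from the real target; the name wrong_idx is only illustrative, and NumPy is assumed to be available.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

ds_wine = datasets.load_wine()
dsx_train, dsx_test, dsy_train, dsy_test = train_test_split(
    ds_wine.data, ds_wine.target, test_size=0.15, random_state=109)

myclassifier = svm.SVC(kernel='linear')
myclassifier.fit(dsx_train, dsy_train)
dsy_pred = myclassifier.predict(dsx_test)

# Positions in the test set where prediction and real target differ
wrong_idx = np.flatnonzero(dsy_pred != dsy_test)

for i in wrong_idx:
    print(f"record {i}: predicted {ds_wine.target_names[dsy_pred[i]]}, "
          f"real {ds_wine.target_names[dsy_test[i]]}")
print("misclassified:", len(wrong_idx), "of", len(dsy_test))
```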
4. SVM model accuracy
To evaluate accuracy for the entire test dataset we use the accuracy_score function:
from sklearn import metrics
print("SVM Accuracy:",metrics.accuracy_score(dsy_test, dsy_pred))
#output:
SVM Accuracy: 0.9259259259259259
Accuracy is good: about 92.6%, i.e. 25 of the 27 test records were classified correctly.
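A single accuracy number does not show which classes are confused with each other. As a possible complement, the sketch below prints a confusion matrix and a per-class report using confusion_matrix and classification_report from sklearn.metrics. The labels=[0, 1, 2] argument is added here only to fix the row and column order.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

ds_wine = datasets.load_wine()
dsx_train, dsx_test, dsy_train, dsy_test = train_test_split(
    ds_wine.data, ds_wine.target, test_size=0.15, random_state=109)

myclassifier = svm.SVC(kernel='linear')
myclassifier.fit(dsx_train, dsy_train)
dsy_pred = myclassifier.predict(dsx_test)

# Rows are real classes, columns are predicted classes
cm = metrics.confusion_matrix(dsy_test, dsy_pred, labels=[0, 1, 2])
print(cm)

# Precision, recall and F1 score for each wine class
print(metrics.classification_report(
    dsy_test, dsy_pred, labels=[0, 1, 2],
    target_names=ds_wine.target_names))
```

The diagonal of the matrix holds the correctly classified records, so the sum of the diagonal divided by the total gives the same accuracy value as accuracy_score.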
5. Full code sample
#import section
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
#load wine data set
ds_wine=datasets.load_wine()
#check features of dataset
print("Features: ", ds_wine.feature_names)
# print wine type (i.e labels )
print("Labels: ", ds_wine.target_names)
#print first 2 records of our wine dataset
print(ds_wine.data[0:2])
#shape of the dataset
print(ds_wine.data.shape)
# split ds_wine
dsx_train, dsx_test, dsy_train, dsy_test = train_test_split(ds_wine.data, ds_wine.target, test_size=0.15,random_state=109) # 85% training and 15% test
print(dsx_train[0:2])
print(dsy_train[0:2])
# Generate SVM model, creating first SVM classifier object
myclassifier = svm.SVC(kernel='linear') # Linear Kernel
#train model with fit method using our dsx_train dataset
myclassifier.fit(dsx_train, dsy_train)
#make prediction for test dataset dsx_test
dsy_pred = myclassifier.predict(dsx_test)
print(dsy_pred)
print(dsx_test[0])
print(dsy_test[0])
dsy_pred_0=myclassifier.predict([dsx_test[0]])
print(dsy_pred_0)
print("SVM Accuracy:",metrics.accuracy_score(dsy_test, dsy_pred))