SVM in Python

In this post:

1. About SVM

Because the SVM (Support Vector Machine) algorithm is well documented, this post does not discuss SVM theory. We will briefly present the dataset we are going to use and then apply SVM to it. For more details about the SVM algorithm, see here.

To illustrate classification with SVM in Python we will use the wine dataset from the scikit-learn (sklearn) package. The sklearn.datasets module has many loaders that can be used in tests (iris, wine, files, digits, breast_cancer, diabetes, linnerud, sample_image, sample_images, svmlight_file, svmlight_files).
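As a quick check, the loaders available in the installed scikit-learn version can be listed directly. This is only a minimal sketch: the variable name loaders is ours, and the exact list depends on the installed version.

```python
from sklearn import datasets

# Collect the names of all public load_* functions in sklearn.datasets
loaders = [name for name in dir(datasets) if name.startswith("load_")]

print(loaders)
```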

2. Things about dataset used

a) To load the wine dataset we will use the load_wine function:

# import section

from sklearn import datasets

#load wine data set

ds_wine=datasets.load_wine()


b) Let's look at the wine dataset. A scikit-learn dataset is characterized by features and targets (label names). For wine:

#check features of dataset

print("Features: ", ds_wine.feature_names)

# print wine type (i.e labels )

print("Labels: ", ds_wine.target_names)


#output:

Features:  ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 

'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

Labels:  ['class_0' 'class_1' 'class_2']

We see that the wine features are 'alcohol', 'malic_acid', 'ash', etc. The wine labels (also called targets or types) are 'class_0', 'class_1', 'class_2'.


The dataset has 178 samples. The SVM classifier in this example will assign each wine to one of these classes.

c) Let's print the first 2 records of our wine dataset:

print(ds_wine.data[0:2])


#output

[[1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00

  2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]

 [1.320e+01 1.780e+00 2.140e+00 1.120e+01 1.000e+02 2.650e+00 2.760e+00

  2.600e-01 1.280e+00 4.380e+00 1.050e+00 3.400e+00 1.050e+03]]

Comparing with the feature names printed in b), we see that for the first record:

alcohol=1.423e+01

malic_acid=1.710e+00

ash=2.430e+00

etc
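The pairing of feature names and values can also be done in code instead of by eye. A small sketch, assuming the dataset is loaded as in a); the name first_record is ours.

```python
from sklearn import datasets

ds_wine = datasets.load_wine()

# Pair every feature name with the corresponding value of the first record
first_record = dict(zip(ds_wine.feature_names, ds_wine.data[0]))

for name, value in first_record.items():
    print(name, "=", value)
```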

In general, to determine how many records a dataset has and how many features each record has, we use the shape attribute:

#shape of the dataset

print(ds_wine.data.shape)

#output

(178, 13)

This means the ds_wine dataset has 178 records, and each one has 13 features.
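Besides the shape, it can be useful to know how the 178 records are distributed over the three classes. A minimal sketch using numpy's bincount on the target array; the names n_records, n_features and class_counts are ours.

```python
import numpy as np
from sklearn import datasets

ds_wine = datasets.load_wine()

# shape is a tuple: (number of records, number of features)
n_records, n_features = ds_wine.data.shape
print("Records:", n_records, "Features:", n_features)

# Count how many records belong to each target (0, 1, 2)
class_counts = np.bincount(ds_wine.target)

for label, count in zip(ds_wine.target_names, class_counts):
    print(label, count)
```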


3. Applying SVM classification to dataset

Now that we have seen what the wine dataset looks like, we will apply SVM classification to it.

As with any machine learning algorithm, we need one part of the data for training the model and another part for testing it.

3.1. Train and test dataset

For the SVM model we will take the train and test datasets from the initial ds_wine dataset, using the train_test_split function.

# Import train_test_split function

from sklearn.model_selection import train_test_split

# split ds_wine

dsx_train, dsx_test, dsy_train, dsy_test = train_test_split(ds_wine.data, ds_wine.target, test_size=0.15,random_state=109) # 85% training and 15% test

print(dsx_train[0:2])

print(dsy_train[0:2])

#output:

[[1.279e+01 2.670e+00 2.480e+00 2.200e+01 1.120e+02 1.480e+00 1.360e+00

  2.400e-01 1.260e+00 1.080e+01 4.800e-01 1.470e+00 4.800e+02]

 [1.438e+01 1.870e+00 2.380e+00 1.200e+01 1.020e+02 3.300e+00 3.640e+00

  2.900e-01 2.960e+00 7.500e+00 1.200e+00 3.000e+00 1.547e+03]]

[2 0]

Above we printed the first two records of the training dataset. Looking at dsy_train, we see that the target of the first record is 2, i.e. the 'class_2' label, and the target of the second record is 0, i.e. the 'class_0' label.
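Because the split is random, the class proportions in the train and test parts can differ from those of the full dataset. One possible variation, not used in the rest of this post, is to pass the stratify parameter to train_test_split so that each part keeps roughly the original class proportions. The names x_train, x_test, y_train and y_test are ours.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

ds_wine = datasets.load_wine()

# Same split sizes as above, but stratified on the target
x_train, x_test, y_train, y_test = train_test_split(
    ds_wine.data, ds_wine.target,
    test_size=0.15, random_state=109, stratify=ds_wine.target)

print("Train records:", x_train.shape[0])
print("Test records:", x_test.shape[0])

# Number of records per class in each part
print("Train class counts:", np.bincount(y_train))
print("Test class counts:", np.bincount(y_test))
```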


3.2. Create SVM classifier

To generate the SVM model, we first create an SVM classifier object using the SVC class:

from sklearn import svm

myclassifier = svm.SVC(kernel='linear')

SVC stands for Support Vector Classification. The class has many parameters; here we set only kernel, choosing a linear kernel. The kernel can be 'linear', 'poly', 'rbf' (the default), 'sigmoid' or 'precomputed'.
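To get a feeling for the kernel parameter, the sketch below trains one classifier per kernel on the same split and compares the test accuracy. It uses fit, predict and accuracy_score, which are explained in the next sections. The features are left unscaled, as in the rest of this post, so the numbers only indicate how the kernels behave on this particular split. The names kernel_scores and clf are ours.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

ds_wine = datasets.load_wine()
dsx_train, dsx_test, dsy_train, dsy_test = train_test_split(
    ds_wine.data, ds_wine.target, test_size=0.15, random_state=109)

# Train one classifier per kernel and record its accuracy on the test part
kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = svm.SVC(kernel=kernel)
    clf.fit(dsx_train, dsy_train)
    kernel_scores[kernel] = metrics.accuracy_score(dsy_test, clf.predict(dsx_test))

for kernel, score in kernel_scores.items():
    print(kernel, score)
```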


3.3. Train and test SVM model

To train the model, use the "fit" method with the training dataset; then, to test the classification, use the "predict" method with the test dataset:

#train model with fit method using our dsx_train dataset

myclassifier.fit(dsx_train, dsy_train)

#make prediction for test dataset dsx_test

dsy_pred = myclassifier.predict(dsx_test)

print(dsy_pred)

#output: 

[0 0 1 2 0 1 0 0 1 0 2 1 2 2 0 1 1 0 0 1 2 1 0 2 0 0 1]


To make this more intuitive, we will take one record from the dataset and predict its target. Let's take the first record from dsx_test; it looks like this:

print(dsx_test[0])

#output

[1.330e+01 1.720e+00 2.140e+00 1.700e+01 9.400e+01 2.400e+00 2.190e+00

 2.700e-01 1.350e+00 3.950e+00 1.020e+00 2.770e+00 1.285e+03]

and its actual target value is dsy_test[0]:

print(dsy_test[0])

#output:

0

Predicted target for dsx_test[0] is:

dsy_pred_0=myclassifier.predict([dsx_test[0]])

print(dsy_pred_0)

#output:

[0]

We see that for this record the predicted value dsy_pred_0 is the same as the real value dsy_test[0].


4. SVM model accuracy

To evaluate the accuracy on the entire test dataset we use the accuracy_score function:

from sklearn import metrics

print("SVM Accuracy:",metrics.accuracy_score(dsy_test, dsy_pred))

#output:

SVM Accuracy: 0.9259259259259259

The accuracy is good: about 92.6%, i.e. 25 of the 27 test records are classified correctly.
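Accuracy is a single number for the whole test dataset. To see which classes are confused with each other, and to get precision and recall per class, sklearn.metrics also provides confusion_matrix and classification_report. The sketch below repeats the steps from the previous sections so that it runs on its own; the names manual_accuracy and conf_matrix are ours.

```python
import numpy as np
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

ds_wine = datasets.load_wine()
dsx_train, dsx_test, dsy_train, dsy_test = train_test_split(
    ds_wine.data, ds_wine.target, test_size=0.15, random_state=109)

myclassifier = svm.SVC(kernel='linear')
myclassifier.fit(dsx_train, dsy_train)
dsy_pred = myclassifier.predict(dsx_test)

# Accuracy computed by hand: share of predictions equal to the real target
manual_accuracy = np.mean(dsy_pred == dsy_test)
print("Manual accuracy:", manual_accuracy)

# Rows are real classes, columns are predicted classes
conf_matrix = metrics.confusion_matrix(dsy_test, dsy_pred, labels=[0, 1, 2])
print(conf_matrix)

# Precision, recall and F1 score for every class
print(metrics.classification_report(
    dsy_test, dsy_pred, labels=[0, 1, 2], target_names=ds_wine.target_names))
```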


5. Full code sample 

#import section

from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn import svm

from sklearn import metrics


#load wine data set

ds_wine=datasets.load_wine()

#check features of dataset

print("Features: ", ds_wine.feature_names)

# print wine type (i.e labels )

print("Labels: ", ds_wine.target_names)

#print first 2 records of our wine dataset

print(ds_wine.data[0:2])

#shape of the dataset

print(ds_wine.data.shape)

# split ds_wine

dsx_train, dsx_test, dsy_train, dsy_test = train_test_split(ds_wine.data, ds_wine.target, test_size=0.15,random_state=109) # 85% training and 15% test

print(dsx_train[0:2])

print(dsy_train[0:2])

# Generate SVM model, creating first SVM classifier object

myclassifier = svm.SVC(kernel='linear') # Linear Kernel

#train model with fit method using our dsx_train dataset

myclassifier.fit(dsx_train, dsy_train)

#make prediction for test dataset dsx_test

dsy_pred = myclassifier.predict(dsx_test)

print(dsy_pred)

print(dsx_test[0])

print(dsy_test[0])

dsy_pred_0=myclassifier.predict([dsx_test[0]])

print(dsy_pred_0)

print("SVM Accuracy:",metrics.accuracy_score(dsy_test, dsy_pred))

Create dataframe in Pandas

One of the simplest ways to create a dataframe in pandas is to create it from a dictionary. The example below creates a dataframe from a dictionary.

import pandas as pd

dict1 = {

    'Ford': [120, 230, 120, 431],

    'Renault': [320, 233, 547, 622],

    'Audi': [230, 123, 457, 232],

    'Toyota': [230, 123, 457, 232],

    'Opel': [230, 123, 457, 232]

}

print(dict1.keys())

print("Ford key is:", end=' ')

print(dict1['Ford'])

sells = pd.DataFrame(dict1) # create sells dataframe from 'dict1' dictionary

print('DataFrame is:')

print(sells)


#output:

dict_keys(['Ford', 'Renault', 'Audi', 'Toyota', 'Opel'])

Ford key is: [120, 230, 120, 431]

DataFrame is:

   Ford  Renault  Audi  Toyota  Opel

0   120      320   230     230   230

1   230      233   123     123   123

2   120      547   457     457   457

3   431      622   232     232   232

Comments about the above code: the "dict1" dictionary contains the car sales of a car dealer for the first 4 months of the year.

The dictionary keys are 'Ford', 'Renault', 'Audi', 'Toyota', 'Opel'.

The dictionary values are lists; for example, the value for the 'Ford' key is the list [120, 230, 120, 431].

Recalling how dictionaries work, print(dict1.keys()) shows:

dict_keys(['Ford', 'Renault', 'Audi', 'Toyota', 'Opel'])

and print(dict1['Ford']) shows [120, 230, 120, 431].

Finally, the statement that creates the sells dataframe from the dictionary is: sells = pd.DataFrame(dict1).
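Since each list position stands for a month, the rows can be labelled with month names instead of 0..3 by passing the index parameter to pd.DataFrame. A minimal sketch with a shortened dictionary; the month labels and the name sells_by_month are our assumptions, not part of the original data.

```python
import pandas as pd

dict1 = {
    'Ford': [120, 230, 120, 431],
    'Renault': [320, 233, 547, 622],
}

# Hypothetical month labels, used as row labels instead of 0..3
months = ['Jan', 'Feb', 'Mar', 'Apr']

sells_by_month = pd.DataFrame(dict1, index=months)
print(sells_by_month)

# Select one value by row label and column name
print(sells_by_month.loc['Feb', 'Renault'])
```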


A dataframe can also be created from a dictionary using the from_dict method of the DataFrame class, which works similarly to the previous example but has more options.

Example: 

import pandas as pd

dict2 = {

    'candy':['80%', '60%', '45%'],

    'chocolate':['12%', '24%', '7%'],

    'wafer':['14%', '18%', '16%']

}


sells2=pd.DataFrame.from_dict(dict2)

print(sells2)


#output: 

  candy chocolate wafer

0   80%       12%   14%

1   60%       24%   18%

2   45%        7%   16%

We see that calling DataFrame.from_dict with only the source dictionary creates a dataframe in which the dictionary keys become the dataframe columns.


Using from_dict with the parameter orient='index' creates a dataframe in which the keys become the row labels (the index), as in the example below:

import pandas as pd

dict2 = {

    'candy':['80%', '60%', '45%'],

    'chocolate':['12%', '24%', '7%'],

    'wafer':['14%', '18%', '16%']

}


sells2i=pd.DataFrame.from_dict(dict2, orient='index')

print(sells2i)


#output:

             0    1    2

candy      80%  60%  45%

chocolate  12%  24%   7%

wafer      14%  18%  16%
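With orient='index' the columns are numbered 0, 1, 2. The columns parameter of from_dict, which is accepted together with orient='index', can give them names. A minimal sketch; the column names 'shop1', 'shop2', 'shop3' and the name sells2n are hypothetical, chosen only for illustration.

```python
import pandas as pd

dict2 = {
    'candy': ['80%', '60%', '45%'],
    'chocolate': ['12%', '24%', '7%'],
    'wafer': ['14%', '18%', '16%']
}

# Keys become row labels; columns= names the three value columns
sells2n = pd.DataFrame.from_dict(
    dict2, orient='index', columns=['shop1', 'shop2', 'shop3'])
print(sells2n)

# The orient='index' result holds the same values as the transposed default result
same_values = bool((sells2n.values == pd.DataFrame.from_dict(dict2).T.values).all())
print(same_values)
```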


Creating a dataframe from a list of tuples.

This can be achieved using the DataFrame.from_records method.

Example: 

import pandas as pd

marks = [('Mike', 9), ('Debora', 10), ('Steve', 9), ('Tim', 8)]

marks_df = pd.DataFrame.from_records(marks, columns=['Student', 'Mark'])

print(marks_df)


#Output:

  Student  Mark

0    Mike     9

1  Debora    10

2   Steve     9

3     Tim     8


Another example, creating a dataframe from a list of tuples with more elements:

import pandas as pd

marks = [('Mike', 9, 8), ('Debora', 10, 9), ('Steve', 9, 10), ('Tim', 8, 9)]

marks_df = pd.DataFrame.from_records(marks, columns=['Student', 'Mark1','Mark2'])

print(marks_df)


#output:

  Student  Mark1  Mark2

0    Mike      9      8

1  Debora     10      9

2   Steve      9     10

3     Tim      8      9
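from_records also has an index parameter. Passing the name of one of the fields turns that field into the row labels, which makes it easy to look up a student by name. A minimal sketch; the name marks_by_student is ours.

```python
import pandas as pd

marks = [('Mike', 9, 8), ('Debora', 10, 9), ('Steve', 9, 10), ('Tim', 8, 9)]

# index='Student' uses that field as row labels instead of 0..3
marks_by_student = pd.DataFrame.from_records(
    marks, columns=['Student', 'Mark1', 'Mark2'], index='Student')
print(marks_by_student)

# Look up one mark by student name and column name
print(marks_by_student.loc['Debora', 'Mark1'])

# Average of the two marks for every student
print(marks_by_student.mean(axis=1))
```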


from_records can be used in a similar way to create a dataframe from a list of dictionaries.

Example:

import pandas as pd

marks = [{'Student': 'Mike', 'Mark': 9},

         {'Student': 'Debora', 'Mark': 10},

         {'Student': 'Steve', 'Mark': 9},

         {'Student': 'Tim', 'Mark': 8}]

marks_df1 = pd.DataFrame.from_records(marks)

print(marks_df1)


#Output

  Student  Mark

0    Mike     9

1  Debora    10

2   Steve     9

3     Tim     8
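The plain DataFrame constructor accepts a list of dictionaries as well. When a key is missing from one of the dictionaries, the corresponding cell is filled with a missing value (NaN). A minimal sketch in which the third record deliberately has no 'Mark'; the name marks_df3 is ours.

```python
import pandas as pd

marks = [{'Student': 'Mike', 'Mark': 9},
         {'Student': 'Debora', 'Mark': 10},
         {'Student': 'Steve'}]  # 'Mark' is missing on purpose

marks_df3 = pd.DataFrame(marks)
print(marks_df3)

# True marks the rows where the mark is missing
print(marks_df3['Mark'].isna())
```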


We can also create a dataframe from a NumPy array.

Example:

import pandas as pd

import numpy as np

data = np.array([('Student','Mark'),('Mike', 9), ('Debora', '10'), ('Steve', 9), ('Tim', 8)])

marks_df2= pd.DataFrame.from_records(data)

print(marks_df2)


#output:

         0     1

0  Student  Mark

1     Mike     9

2   Debora    10

3    Steve     9

4      Tim     8
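In the example above the ('Student', 'Mark') pair is stored as an ordinary data row, so the columns are labelled 0 and 1, and because the array mixes strings and numbers, NumPy stores every value as a string. A possible alternative, sketched below, keeps only the numbers in the array and passes the names through the columns and index parameters of pd.DataFrame. The names marks_array, marks_df4 and mixed are ours.

```python
import numpy as np
import pandas as pd

# Numeric 2D array: one row per student, one column per mark
marks_array = np.array([[9, 8], [10, 9], [9, 10], [8, 9]])

marks_df4 = pd.DataFrame(marks_array,
                         columns=['Mark1', 'Mark2'],
                         index=['Mike', 'Debora', 'Steve', 'Tim'])
print(marks_df4)

# Mixing strings and numbers in one array converts everything to strings
mixed = np.array([('Mike', 9), ('Debora', 10)])
print(mixed.dtype.kind)  # 'U' means Unicode string
```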