Using arff datasets from openml.org – Part 1

Data is one of the most important parts of machine learning (ML), and of AI in general.

In this post we will explore some of the steps needed to use data from a dataset with the .arff extension (an ARFF file).

1) Downloading datasets from openml

openml contains a lot of datasets that can be used in AI or ML projects; these datasets are free and very useful. In the present post we will use steel-plates-fault as an example.

We will use the fetch_openml function.

from sklearn.datasets import fetch_openml

my_steel_plates_fault = fetch_openml(name='steel-plates-fault', version=3, as_frame=False, parser='liac-arff')

Here the name and version parameters are self-explanatory.

The as_frame and parser parameters are set because we are not using the pandas parser: when parsing with liac-arff we also need to specify the as_frame parameter.
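With these parameters, fetch_openml returns a Bunch object whose data and target fields are NumPy arrays. Below is a minimal sketch for inspecting them (the exact printed values depend on the dataset):

# inspect the Bunch returned by fetch_openml

print(my_steel_plates_fault.data.shape)     # (number of samples, number of features)

print(my_steel_plates_fault.target[:5])     # first five target values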


2) Dataset description 

Using:

from sklearn.datasets import fetch_openml

my_steel_plates_fault = fetch_openml(name='steel-plates-fault', version=3, as_frame=False, parser='liac-arff')

print(my_steel_plates_fault.DESCR)

output is:


**Author**: Semeion, Research Center of Sciences of Communication, Rome, Italy.     

**Source**: [UCI](http://archive.ics.uci.edu/ml/datasets/steel+plates+faults)     

**Please cite**: Dataset provided by Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy.  


__Changes w.r.t. version 1: included one target factor with 7 levels as target variable for the classification. Also deleted the previous 7 binary target variables.__


**Steel Plates Faults Data Set**  

A dataset of steel plates' faults, classified into 7 different types. The goal was to train machine learning for automatic pattern recognition.


The dataset consists of 27 features describing each fault (location, size, ...) and 1 feature indicating the type of fault (on of 7: Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, Other_Faults). The target is the type of fault.


### Attribute Information  

* V1: X_Minimum  

* V2: X_Maximum  

* V3: Y_Minimum  

* V4: Y_Maximum  

* V5: Pixels_Areas  

* V6: X_Perimeter  

* V7: Y_Perimeter  

* V8: Sum_of_Luminosity  

* V9: Minimum_of_Luminosity  

* V10: Maximum_of_Luminosity  

* V11: Length_of_Conveyer  

* V12: TypeOfSteel_A300  

* V13: TypeOfSteel_A400  

* V14: Steel_Plate_Thickness  

* V15: Edges_Index  

* V16: Empty_Index  

* V17: Square_Index  

* V18: Outside_X_Index  

* V19: Edges_X_Index  

* V20: Edges_Y_Index  

* V21: Outside_Global_Index  

* V22: LogOfAreas  

* V23: Log_X_Index  

* V24: Log_Y_Index  

* V25: Orientation_Index  

* V26: Luminosity_Index  

* V27: SigmoidOfAreas  

* target: 7 types of fault as classification target  


### Relevant Papers  

1.M Buscema, S Terzi, W Tastle, A New Meta-Classifier,in NAFIPS 2010, Toronto (CANADA),26-28 July 2010, 978-1-4244-7858-6/10 ©2010 IEEE  

2.M Buscema, MetaNet: The Theory of Independent Judges, in Substance Use & Misuse, 33(2), 439-461,1998


Downloaded from openml.org.


3) Printing dataset details

for:

print(my_steel_plates_fault.details)

output is:

{'id': '40982', 'name': 'steel-plates-fault', 'version': '3', 'description_version': '1', 'format': 'ARFF', 'upload_date': '2017-12-04T22:37:56', 'licence': 'Public', 'url': 'https://api.openml.org/data/v1/download/18151921/steel-plates-fault.arff', 'parquet_url': 'https://openml1.win.tue.nl/datasets/0004/40982/dataset_40982.pq', 'file_id': '18151921', 'default_target_attribute': 'target', 'version_label': '3', 'tag': ['Data Science', 'Engineering', 'OpenML-CC18', 'study_135', 'study_98', 'study_99'], 'visibility': 'public', 'minio_url': 'https://openml1.win.tue.nl/datasets/0004/40982/dataset_40982.pq', 'status': 'active', 'processing_date': '2018-10-04 07:21:37', 'md5_checksum': '7ccdabeb01749cce9fa3b1d4a702fb8c'}
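Since details is a plain Python dictionary, individual fields can be read directly by key; a small sketch using keys visible in the output above:

print(my_steel_plates_fault.details['name'])      # steel-plates-fault

print(my_steel_plates_fault.details['version'])   # 3

print(my_steel_plates_fault.details['url'])       # direct ARFF download link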

4) Printing the dataset URL

for

print(my_steel_plates_fault.url)

output is: 

https://www.openml.org/d/40982

5) Using data_id to download dataset

40982 is the data_id of the dataset. We can pass data_id as a parameter to fetch_openml to download the dataset, for example:

from sklearn.datasets import fetch_openml

my_steel_plates_fault_b = fetch_openml(data_id=40982, parser="liac-arff", as_frame=False)

print(my_steel_plates_fault_b.details)

This creates my_steel_plates_fault_b, which is basically the same dataset as the previous my_steel_plates_fault.

For comparison, we will print the details again, this time for my_steel_plates_fault_c, which uses the default pandas parser.

for: 

import pandas

from sklearn.datasets import fetch_openml

my_steel_plates_fault_c = fetch_openml(data_id=40982)

print(my_steel_plates_fault_c.details)

output is:



{'id': '40982', 'name': 'steel-plates-fault', 'version': '3', 'description_version': '1', 'format': 'ARFF', 'upload_date': '2017-12-04T22:37:56', 'licence': 'Public', 'url': 'https://api.openml.org/data/v1/download/18151921/steel-plates-fault.arff', 'parquet_url': 'https://openml1.win.tue.nl/datasets/0004/40982/dataset_40982.pq', 'file_id': '18151921', 'default_target_attribute': 'target', 'version_label': '3', 'tag': ['Data Science', 'Engineering', 'OpenML-CC18', 'study_135', 'study_98', 'study_99'], 'visibility': 'public', 'minio_url': 'https://openml1.win.tue.nl/datasets/0004/40982/dataset_40982.pq', 'status': 'active', 'processing_date': '2018-10-04 07:21:37', 'md5_checksum': '7ccdabeb01749cce9fa3b1d4a702fb8c'}

In this case the pandas parser is used, which is why we need to import pandas before calling fetch_openml.

6) Downloading dataset as a dataframe

Using:

import pandas

from sklearn.datasets import fetch_openml

my_steel_plates_fault_d = fetch_openml(data_id=40982, as_frame=True)

print(my_steel_plates_fault_d.data.head())

output is:

     V1    V2       V3       V4    V5  ...     V23     V24     V25     V26     V27

0    42    50   270900   270944   267  ...  0.9031  1.6435  0.8182 -0.2913  0.5822

1   645   651  2538079  2538108   108  ...  0.7782  1.4624  0.7931 -0.1756  0.2984

2   829   835  1553913  1553931    71  ...  0.7782  1.2553  0.6667 -0.1228  0.2150

3   853   860   369370   369415   176  ...  0.8451  1.6532  0.8444 -0.1568  0.5212

4  1289  1306   498078   498335  2409  ...  1.2305  2.4099  0.9338 -0.1992  1.0000


[5 rows x 27 columns]

These are the first 5 rows.

We see that the column names are V1, V2, V3, ....

The meaning of these names is given in section 2), the dataset description.
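Because as_frame=True returns pandas objects, we can also look at the target column directly, for example its class distribution; a minimal sketch (the counts are not shown here):

print(my_steel_plates_fault_d.target.value_counts())   # samples per fault type

print(my_steel_plates_fault_d.frame.shape)              # full frame: 27 features plus target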

MySQL-Python – 5 Logging and MySQL Connector

This is a continuation of the previous post, MySQL-Python - 4.

In this page:


1. About logging

Logging for MySQL Connector uses the default Python logging features. By default, events with level WARNING and above are logged and printed to the terminal where we run the script, i.e. sys.stderr.

By configuring logging, we can change the level of the logged events and send messages to destinations other than stderr.
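For quick configuration in simple scripts, the standard logging.basicConfig call can change both the level and the destination; a minimal sketch, independent of MySQL Connector (the file name app.log is just an example):

import logging

# log DEBUG and above to a file instead of stderr

logging.basicConfig(filename="app.log", level=logging.DEBUG, format="%(asctime)s - %(levelname)s - %(message)s")

logging.debug("this message goes to app.log")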


2. Sample code for classical logging

Below is sample code for logging in the case where MySQL Connector is used for an INSERT. The usage is similar for other MySQL Connector use cases.

import logging

import datetime

import mysql.connector


#--this is code for loggers

logger = logging.getLogger("mysql.connector")

logger.setLevel(logging.DEBUG)


formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s- %(message)s")


# create console handler named stream_handler

stream_handler = logging.StreamHandler()

# add formatter to stream_handler

stream_handler.setFormatter(formatter)

# add  console handler  stream_handler to logger

logger.addHandler(stream_handler)

# create a file handler

file_handler = logging.FileHandler("conn_log.log")

# add formatter to file handler

file_handler.setFormatter(formatter)

# add  file handler to logger

logger.addHandler(file_handler)


#---this is code where mysql connector is used

connection = mysql.connector.connect(user='u2024', password='password1234', host='localhost', port=3306, database='sakila')

logger.debug('user u2024 logged')

cursor = connection.cursor()

logger.info('mysql.connector created cursor')


SQL_insert_actor = ("INSERT INTO actor (first_name, last_name, last_update)"

            "VALUES  (%s, %s, %s)")

values_for_actor = ('NICK','WALKEN', datetime.date(2023, 12, 16))

cursor.execute(SQL_insert_actor, values_for_actor)

# commit to the database

connection.commit()

logger.info('mysql.connector entry inserted in DB')


cursor.close()

connection.close() 


3. Code comments

First we define a logger using logger = logging.getLogger("mysql.connector"); here mysql.connector is the logger name. It can be anything; we named it mysql.connector for visibility.

We change the log level to DEBUG using logger.setLevel(logging.DEBUG).

The permitted log levels are described here.

The line formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s- %(message)s")

defines how a log record will appear. This line is not mandatory; if we do not use it, a default format like '%(message)s' is applied, which means only the message appears in the log record.

In our example we use a formatter object, so each log line will show the time, the logger name, the log level of the message and the message itself. More about the formatting possibilities can be found under formatter-objects.
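As an illustration, the format string could also include the module and the line number that emitted the record; a small sketch using standard logging attributes:

# example format string with extra fields (module and line number)

detailed_formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(module)s:%(lineno)d - %(message)s")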


Next lines:

# create console handler named stream_handler

stream_handler = logging.StreamHandler()

# add formatter to stream_handler

stream_handler.setFormatter(formatter)

# add  console handler  stream_handler to logger

logger.addHandler(stream_handler)


create a stream handler that is added to the logger. This stream handler is for logging to the console, i.e. the lines that appear if we run the Python script in a command prompt, or the lines that appear on the screen when we run it in PyCharm, as we will see.


Lines:

file_handler = logging.FileHandler("conn_log.log")

# add formatter to file handler

file_handler.setFormatter(formatter)

# add  file handler to logger

logger.addHandler(file_handler)


these create a file handler and add it to the logger; these lines are responsible for what will be written to the log file named "conn_log.log".


All the lines that follow after

#---this is code where mysql connector is used

are there as an example. From the MySQL Connector point of view, they are the lines for inserting into the DB.

Between those lines we added lines for logging; these contain logger.debug or logger.info calls.

For example, a line like

logger.debug('user u2024 logged')

will send the message 'user u2024 logged', with the configured formatting, to the console and to the file.
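If we wanted the console to stay quieter while the file keeps everything, each handler can also have its own level; a minimal sketch on top of the handlers defined above (not part of the original script):

# optional: per-handler levels

stream_handler.setLevel(logging.INFO)   # console shows INFO and above

file_handler.setLevel(logging.DEBUG)    # file also keeps DEBUG messages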


4. Sample logging output 

Below is an image of the sample code run in PyCharm.


Logging in mysql connector, PyCharm


The lines marked in red in the output are the log lines from the console handler defined earlier.

Lines also appear in the file "conn_log.log", this time from the file handler. Below is how the file looks:

Logging in file

SVM in Python

In this post:

1. About SVM

Since the SVM (Support Vector Machine) algorithm is well documented, we will not discuss SVM theory in the current material; we will start by briefly presenting the dataset that we will use with SVM, and then apply SVM to it. Nevertheless, more details about the SVM algorithm can be found here.

To exemplify classification using SVM in Python we will use the wine dataset from scikit-learn (sklearn). The sklearn.datasets package has many datasets that can be used in tests (iris, wine, files, digits, breast_cancer, diabetes, linnerud, sample_image, sample_images, svmlight_file, svmlight_files).

2. Things about dataset used

a) To load the wine dataset we will use the load_wine function:

# import section

from sklearn import datasets

#load wine data set

ds_wine=datasets.load_wine()


b) Let's see some things about the wine dataset. Any scikit-learn dataset is characterized by features and targets (label names). For wine:

#check features of dataset

print("Features: ", ds_wine.feature_names)

# print wine type (i.e labels )

print("Labels: ", ds_wine.target_names)


#output:

Features:  ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 

'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

Labels:  ['class_0' 'class_1' 'class_2']

We see the wine features are: 'alcohol', 'malic_acid', 'ash', etc. The wine labels (or targets, or types) are 'class_0', 'class_1', 'class_2'.


The dataset has 178 samples. Our SVM classification in this example will classify wines.

c) Let's print the first 2 records of our wine dataset:

print(ds_wine.data[0:2])


#output

[[1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00

  2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]

 [1.320e+01 1.780e+00 2.140e+00 1.120e+01 1.000e+02 2.650e+00 2.760e+00

  2.600e-01 1.280e+00 4.380e+00 1.050e+00 3.400e+00 1.050e+03]]

Comparing with the output from b), we observe for the first record:

alcohol=1.423e+01

malic_acid=1.710e+00

ash=2.430e+00

etc

In general, to determine how many records are in a dataset and how many features each record has, we use the shape attribute:

#shape of the dataset

print(ds_wine.data.shape)

#output

(178, 13)

This means the ds_wine dataset has 178 records, and, as we see, each one has 13 features.


3. Applying SVM classification to dataset

Now that we have seen what the wine dataset looks like, we will apply SVM classification to it.

As with any machine learning algorithm, we will need part of the data for training the model and another part for testing it.

3.1. Train and test dataset

For the SVM model we will create the train and test datasets from the initial ds_wine dataset, using the train_test_split function.

# Import train_test_split function

from sklearn.model_selection import train_test_split

# split ds_wine

dsx_train, dsx_test, dsy_train, dsy_test = train_test_split(ds_wine.data, ds_wine.target, test_size=0.15,random_state=109) # 85% training and 15% test

print(dsx_train[0:2])

print(dsy_train[0:2])

#output:

[[1.279e+01 2.670e+00 2.480e+00 2.200e+01 1.120e+02 1.480e+00 1.360e+00

  2.400e-01 1.260e+00 1.080e+01 4.800e-01 1.470e+00 4.800e+02]

 [1.438e+01 1.870e+00 2.380e+00 1.200e+01 1.020e+02 3.300e+00 3.640e+00

  2.900e-01 2.960e+00 7.500e+00 1.200e+00 3.000e+00 1.547e+03]]

[2 0]

Above we printed the first two records from the train dataset. Referring to dsy_train, we see that the target of the first record is 2, i.e. the 'class_2' label, and the second record has target 0, i.e. the 'class_0' label.
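We can also check how many samples ended up in each part of the split; a small sketch (with test_size=0.15 on 178 samples, the test set should contain 27 records and the train set 151):

# check the sizes of the train and test splits

print(dsx_train.shape)

print(dsx_test.shape)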


3.2. Create SVM classifier

To generate the SVM model, we first create an SVM classifier object using the SVC class:

from sklearn import svm

myclassifier = svm.SVC(kernel='linear')

We used the SVC class (the letters come from Support Vector Classification). It has many parameters; we used only kernel, choosing a linear kernel. The kernel can be 'linear', 'poly', 'rbf', 'sigmoid' or 'precomputed'.
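As an illustration of the other kernels, the classifier could just as well be created with an RBF kernel and its usual parameters; this is only a sketch and is not used in the rest of the post:

# alternative classifier with an RBF kernel (illustration only)

myclassifier_rbf = svm.SVC(kernel='rbf', C=1.0, gamma='scale')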


3.3. Train and test SVM model

To train the model we use the fit method with the train dataset; then, to test the classification, we use the predict method with the test dataset:

#train model with fit method using our dsx_train dataset

myclassifier.fit(dsx_train, dsy_train)

dsy_pred = myclassifier.predict(dsx_test)

print(dsy_pred)

#output: 

[0 0 1 2 0 1 0 0 1 0 2 1 2 2 0 1 1 0 0 1 2 1 0 2 0 0 1]


To understand this more intuitively, we will take one record from the dataset and predict which target it belongs to. Let's take the first record from dsx_test; it looks like:

print(dsx_test[0])

#output

[1.330e+01 1.720e+00 2.140e+00 1.700e+01 9.400e+01 2.400e+00 2.190e+00

 2.700e-01 1.350e+00 3.950e+00 1.020e+00 2.770e+00 1.285e+03]

and its actual value (target) is dsy_test[0]:

print(dsy_test[0])

#output:

0

The predicted target for dsx_test[0] is:

dsy_pred_0=myclassifier.predict([dsx_test[0]])

print(dsy_pred_0)

#output:

[0]

We see that, for this record, the predicted value dsy_pred_0 is the same as the real value dsy_test[0].


4. SVM model accuracy

To evaluate the accuracy over the entire test dataset we use the accuracy_score function:

from sklearn import metrics

print("SVM Accuracy:",metrics.accuracy_score(dsy_test, dsy_pred))

#output:

SVM Accuracy: 0.9259259259259259

The accuracy is good, about 92%.
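Beyond the overall accuracy, sklearn.metrics can also give a per-class view of the results; a minimal sketch using the same test targets and predictions:

# per-class metrics for the same predictions

print(metrics.confusion_matrix(dsy_test, dsy_pred))

print(metrics.classification_report(dsy_test, dsy_pred))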


5. Full code sample 

#import section

from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn import svm

from sklearn import metrics


#load wine data set

ds_wine=datasets.load_wine()

#check features of dataset

print("Features: ", ds_wine.feature_names)

# print wine type (i.e labels )

print("Labels: ", ds_wine.target_names)

#print first 2 records of our wine dataset

print(ds_wine.data[0:2])

#shape of the dataset

print(ds_wine.data.shape)

# split ds_wine

dsx_train, dsx_test, dsy_train, dsy_test = train_test_split(ds_wine.data, ds_wine.target, test_size=0.15,random_state=109) # 85% training and 15% test

print(dsx_train[0:2])

print(dsy_train[0:2])

# Generate SVM model, creating first SVM classifier object

myclassifier = svm.SVC(kernel='linear') # Linear Kernel

#train model with fit method using our dsx_train dataset

myclassifier.fit(dsx_train, dsy_train)

#make prediction for test dataset dsx_test

dsy_pred = myclassifier.predict(dsx_test)

print(dsy_pred)

print(dsx_test[0])

print(dsy_test[0])

dsy_pred_0=myclassifier.predict([dsx_test[0]])

print(dsy_pred_0)

print("SVM Accuracy:",metrics.accuracy_score(dsy_test, dsy_pred))

Accessing API in Python

There are many internet APIs which offer data in JSON format. Such data can easily be used in Python with the requests module. In the current post we will use the requests module together with the JSON format to find the forecast maximum temperature per day, for the next 8 days, based on the openweathermap API. Documentation for the requests module can be consulted in many places on the internet,

for example here or on this site.

At the date of writing this short intro to accessing an API and using output in JSON format, the latest requests release can be found, for example, on the community update link.

If it is not yet installed, the Python requests module can, as usual, be installed via pip:

python -m pip install requests

We will exemplify how to use requests and get JSON data from the openweathermap.org API.

openweathermap has many APIs; checking the openweathermap website, we will exemplify with "Current Weather Data", which is documented at API doc.

Before doing anything in Python, for most of the presented endpoints we need to get an API key and understand the API call. For the API key, we need to create an account or profile on the openweathermap.org site; after that, in the profile there is a link where we can obtain an API key. This is easy, so let's presume we obtained it and now have an API key.

To call the current weather data API, based on the documentation, the API call is

https://api.openweathermap.org/data/2.5/weather?lat={lat}&lon={lon}&appid={API key}

In this URL:

  • endpoint is https://api.openweathermap.org/data/2.5/weather
  • parameters are:
  • lat for latitude
  • lon for longitude
  • appid is the API key discussed before
  • the documentation also lists other parameters like mode, units, lang

The lat and lon parameters can be obtained from the internet, for example from this site: https://www.latlong.net/, or from here: https://www.gps-coordinates.net/; there are also some others.

For lat 44.451209, lon 26.13391 and appid 70g04e4613056b159c2761a9d9e664d2, the URL for the call is

https://api.openweathermap.org/data/2.5/onecall?lat=44.451209&lon=26.13391&appid=70g04e4613056b159c2761a9d9e664d2

Note that here we use the One Call endpoint, which also returns the daily forecast we will need later.

Observation: the appid here is a fictitious one; you will need to obtain one from the openweathermap site as mentioned before.

Loading it in a browser we obtain:

Figure 1: Weather API loaded in a browser

In this image there is a bunch of data, and it looks difficult to use effectively. Using Python we can overcome this and extract only the data that we need. A simple call to this API using Python is in the code below:

import requests

Weather_url = "https://api.openweathermap.org/data/2.5/onecall"

api_key = "70g04e4613056b159c2761a9d9e664d2"

weather_params = {

    "lat": 44.451209,

    "lon": 26.13391,

    "appid": api_key,

}

response = requests.get(Weather_url, params=weather_params)

We used the get function from the requests module, i.e. requests.get(...).

As parameters for get I used:

  • the string Weather_url, which contains the endpoint as mentioned in the API doc
  • params, which contains the API parameters according to the documentation. This "params" argument is a dictionary containing the parameters (lat, lon and appid), meaning requests.get accepts params as a dictionary of key:value pairs.

The output of the code is just "Process finished with exit code 0", so at first it may not seem too useful. We can additionally use response.status_code, which gives:

print(response.status_code)

#output 

200

Here status_code gives the well-known response status code, universal for any HTTP request (examples of other status codes: 201, 401, 404, etc.). For a complete list see "List of HTTP status codes".
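In a script we would typically check the status before using the response body; a minimal sketch for the request above:

# minimal error handling for the request above

if response.status_code == 200:

    data = response.json()

else:

    response.raise_for_status()   # raises requests.exceptions.HTTPError for 4xx/5xx codes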

In the instruction response = requests.get(Weather_url, params=weather_params),

response is an object, an instance of the Response class. If, in a GUI environment, we type response followed by a dot, IntelliSense will show a bunch of methods and attributes of this class, like below:

Figure 2: Some methods and attributes of the response object

Using response.json() will give, in the Python output, exactly the JSON data that we see in Figure 1, but displayed on a single line, like:

{'lat': 44.4512, 'lon': 26.1339, 'timezone': 'Europe/Bucharest', 'timezone_offset': 10800.........text omitted ............ }

that is:

Figure 3: JSON output

This data still seems hard to analyze; however, as we can see, it looks similar to a dictionary, and we can inspect its structure in a JSON viewer.

There are many online JSON viewers. I selected the JSON data from Figure 1 with CTRL+A, copied it with CTRL+C, then went to the online JSON viewer https://jsonformatter.org/json-viewer and pasted it with CTRL+V, and there is:

Figure 4: Data in a JSON viewer

From here we easily understand the JSON structure: where we see {} it is a dictionary, where we see [] it is a list.

Looking at Figure 4, if we need to access the daily[0] element we will write:

weather_1=response.json() # this is entire json data output

print(weather_1["daily"][0])

Expanding further, there is:

Figure 5: Maximum temperature for a day in the JSON output

We see the structure, hence for accessing the daily maximum temperature for day 0 we will use:

weather_1=response.json() # this is entire json data output

print(weather_1["daily"][0]["temp"]["max"])

It will display 283.93, the maximum temperature for daily[0].

Observation: the above temperature is on the Kelvin scale (to convert to Celsius degrees, subtract 273.15 from the Kelvin value).
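The conversion mentioned in the observation can be wrapped in a tiny helper function; a minimal sketch:

# small helper for the Kelvin to Celsius conversion

def kelvin_to_celsius(kelvin):

    return round(kelvin - 273.15, 2)

print(kelvin_to_celsius(283.93))   # 10.78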


The code for finding the maximum temperature over the next 8 days is:

import requests

Weather_Endpoint = "https://api.openweathermap.org/data/2.5/onecall"

api_key = "70g04e4613056b159c2761a9d9e664d2"

weather_params = {

    "lat": 44.451209,

    "lon": 26.13391,

    "appid": api_key,

}

response = requests.get(Weather_Endpoint, params=weather_params)

response.raise_for_status()

weather_1=response.json()

weather_2_daily = weather_1["daily"]

daily_max_temp=[]

for i in range(8):

    daily_max_temp.append(float(weather_2_daily[i]["temp"]["max"]))

    print(f"Maximum temperature in day {i} is {round(daily_max_temp[i]-273.15, 2)} Celsius degree")

print(f"Maximum daily temperature in next 8 days is {round(max(daily_max_temp)-273.15,2)} Celsius degree")


#output is: 

Maximum temperature in day 0 is 11.09 Celsius degree

Maximum temperature in day 1 is 14.41 Celsius degree

Maximum temperature in day 2 is 16.2 Celsius degree

Maximum temperature in day 3 is 17.99 Celsius degree

Maximum temperature in day 4 is 13.5 Celsius degree

Maximum temperature in day 5 is 13.01 Celsius degree

Maximum temperature in day 6 is 16.81 Celsius degree

Maximum temperature in day 7 is 16.82 Celsius degree

Maximum daily temperature in next 8 days is 17.99 Celsius degree

About the code:

- weather_1 represents the entire JSON output data

- weather_2_daily = weather_1["daily"] is a list, based on Figure 4

- we defined an empty list with daily_max_temp=[]

- weather_2_daily[i]["temp"]["max"] is basically the maximum daily temperature for day "i", in Kelvin degrees.
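For comparison, the same maximum over the next 8 days could also be computed more compactly with a list comprehension; a sketch that assumes weather_2_daily is defined as above:

# compact alternative: maximum temperature over the next 8 days, in Celsius

daily_max_temp = [float(day["temp"]["max"]) for day in weather_2_daily[:8]]

print(round(max(daily_max_temp) - 273.15, 2))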