Using ARFF datasets from openml.org – Part 1

Data is one of the most important parts of machine learning (ML), and of AI in general.

In this post we will explore some of the steps for using data from a dataset stored in a file with the .arff extension (an ARFF file).
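
For readers who have not opened one, an ARFF file is plain text: a header with @relation and @attribute lines, followed by an @data section with comma-separated rows. The sketch below reads a tiny, invented sample using only the standard library. It is not a full parser (no quoting, sparse rows, or missing values), and the sample content is made up for illustration rather than taken from a real dataset.

```python
# Minimal sketch of the ARFF layout. The sample is invented, and real files
# should be read with a proper parser such as the one behind fetch_openml.
SAMPLE_ARFF = """\
% comment lines start with a percent sign
@relation tiny_example
@attribute V1 numeric
@attribute V2 numeric
@attribute target {1,2}
@data
42,50,1
645,651,2
"""


def read_simple_arff(text):
    """Return (attribute_names, rows) for a simple, dense ARFF string."""
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lowered = line.lower()
        if in_data:
            rows.append(line.split(","))
        elif lowered.startswith("@attribute"):
            # "@attribute <name> <type>": the name is the second token
            names.append(line.split()[1])
        elif lowered.startswith("@data"):
            in_data = True
    return names, rows


names, rows = read_simple_arff(SAMPLE_ARFF)
print(names)
print(rows)
```

All values come back as strings here; converting them to numbers is left out to keep the sketch short.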

1) Download datasets from OpenML

OpenML contains a lot of datasets that can be used in AI or ML projects. These datasets are free and very useful. In this post we will use steel-plates-fault as the example.

We will download it with the fetch_openml function from sklearn.datasets.

from sklearn.datasets import fetch_openml

my_steel_plates_fault = fetch_openml(name='steel-plates-fault', version=3, as_frame=False, parser='liac-arff')

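Here the parameters name and version identify the dataset.

The parameter parser='liac-arff' selects the pure-Python ARFF parser instead of the pandas parser. With it we also set as_frame=False, so the data is returned as NumPy arrays rather than a pandas DataFrame and pandas is not required.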
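
To see what comes back from such a call, it can help to look at a few attributes of the returned object: data, target and feature_names. The helper below is a small sketch. The function name summarize_bunch and the stand-in object are invented for illustration, so the code can be tried without downloading anything.

```python
from types import SimpleNamespace


def summarize_bunch(bunch):
    """Summarize a fetch_openml result that was loaded with as_frame=False."""
    return {
        "n_samples": len(bunch.data),
        "n_features": len(bunch.feature_names),
        "first_targets": list(bunch.target[:3]),
    }


# Stand-in with the same attribute names as the object returned by
# fetch_openml; the values are made up.
fake = SimpleNamespace(
    data=[[42, 50], [645, 651]],
    target=["1", "2"],
    feature_names=["V1", "V2"],
)
summary = summarize_bunch(fake)
print(summary)
```

With the real dataset the call would be summarize_bunch(my_steel_plates_fault).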


2) Dataset description 

Using:

from sklearn.datasets import fetch_openml

my_steel_plates_fault = fetch_openml(name='steel-plates-fault', version=3, as_frame=False, parser='liac-arff')

print(my_steel_plates_fault.DESCR)

output is:

**Author**: Semeion, Research Center of Sciences of Communication, Rome, Italy.     

**Source**: [UCI](http://archive.ics.uci.edu/ml/datasets/steel+plates+faults)     

**Please cite**: Dataset provided by Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy.  


__Changes w.r.t. version 1: included one target factor with 7 levels as target variable for the classification. Also deleted the previous 7 binary target variables.__


**Steel Plates Faults Data Set**  

A dataset of steel plates' faults, classified into 7 different types. The goal was to train machine learning for automatic pattern recognition.


The dataset consists of 27 features describing each fault (location, size, ...) and 1 feature indicating the type of fault (on of 7: Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, Other_Faults). The target is the type of fault.


### Attribute Information  

* V1: X_Minimum  

* V2: X_Maximum  

* V3: Y_Minimum  

* V4: Y_Maximum  

* V5: Pixels_Areas  

* V6: X_Perimeter  

* V7: Y_Perimeter  

* V8: Sum_of_Luminosity  

* V9: Minimum_of_Luminosity  

* V10: Maximum_of_Luminosity  

* V11: Length_of_Conveyer  

* V12: TypeOfSteel_A300  

* V13: TypeOfSteel_A400  

* V14: Steel_Plate_Thickness  

* V15: Edges_Index  

* V16: Empty_Index  

* V17: Square_Index  

* V18: Outside_X_Index  

* V19: Edges_X_Index  

* V20: Edges_Y_Index  

* V21: Outside_Global_Index  

* V22: LogOfAreas  

* V23: Log_X_Index  

* V24: Log_Y_Index  

* V25: Orientation_Index  

* V26: Luminosity_Index  

* V27: SigmoidOfAreas  

* target: 7 types of fault as classification target  


### Relevant Papers  

1.M Buscema, S Terzi, W Tastle, A New Meta-Classifier,in NAFIPS 2010, Toronto (CANADA),26-28 July 2010, 978-1-4244-7858-6/10 ©2010 IEEE  

2.M Buscema, MetaNet: The Theory of Independent Judges, in Substance Use & Misuse, 33(2), 439-461,1998


Downloaded from openml.org.

3) Printing dataset details

for:

print(my_steel_plates_fault.details)

output is:

{'id': '40982', 'name': 'steel-plates-fault', 'version': '3', 'description_version': '1', 'format': 'ARFF', 'upload_date': '2017-12-04T22:37:56', 'licence': 'Public', 'url': 'https://api.openml.org/data/v1/download/18151921/steel-plates-fault.arff', 'parquet_url': 'https://openml1.win.tue.nl/datasets/0004/40982/dataset_40982.pq', 'file_id': '18151921', 'default_target_attribute': 'target', 'version_label': '3', 'tag': ['Data Science', 'Engineering', 'OpenML-CC18', 'study_135', 'study_98', 'study_99'], 'visibility': 'public', 'minio_url': 'https://openml1.win.tue.nl/datasets/0004/40982/dataset_40982.pq', 'status': 'active', 'processing_date': '2018-10-04 07:21:37', 'md5_checksum': '7ccdabeb01749cce9fa3b1d4a702fb8c'}
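
Note that most values in this dictionary are strings, including the id and the version. A short sketch of reading a few fields, using a subset of the output above copied by hand:

```python
from datetime import datetime

# Subset of the details dictionary printed above.
details = {
    "id": "40982",
    "name": "steel-plates-fault",
    "version": "3",
    "upload_date": "2017-12-04T22:37:56",
    "default_target_attribute": "target",
    "tag": ["Data Science", "Engineering", "OpenML-CC18",
            "study_135", "study_98", "study_99"],
}

data_id = int(details["id"])        # stored as a string, converted here
version = int(details["version"])
uploaded = datetime.fromisoformat(details["upload_date"])
target_column = details["default_target_attribute"]

print(data_id, version, uploaded.year, target_column)
```

With the downloaded dataset, my_steel_plates_fault.details can be used in place of the hand-copied dictionary.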

4) Printing dataset url

for

print(my_steel_plates_fault.url)

output is: 

https://www.openml.org/d/40982
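
The number at the end of this URL is the id of the dataset on OpenML. A minimal sketch for extracting it, assuming the URL keeps the .../d/<id> shape shown above (the helper name data_id_from_url is made up for this example):

```python
def data_id_from_url(url):
    """Extract the numeric id from a URL like https://www.openml.org/d/<id>."""
    last_part = url.rstrip("/").rsplit("/", 1)[-1]
    if not last_part.isdigit():
        raise ValueError(f"no numeric id at the end of {url!r}")
    return int(last_part)


data_id = data_id_from_url("https://www.openml.org/d/40982")
print(data_id)  # 40982
```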

5) Using data_id to download dataset

40982 is the data_id of the dataset. We can pass data_id as a parameter to fetch_openml to download the dataset, for example:

from sklearn.datasets import fetch_openml

my_steel_plates_fault_b = fetch_openml(data_id=40982, parser="liac-arff", as_frame=False)

print(my_steel_plates_fault_b.details)

This creates my_steel_plates_fault_b, which is the same dataset as the previous my_steel_plates_fault.

For comparison, we will print the details again, this time for my_steel_plates_fault_c, which is fetched without the parser and as_frame parameters.

for: 

import pandas

from sklearn.datasets import fetch_openml

my_steel_plates_fault_c = fetch_openml(data_id=40982)

print(my_steel_plates_fault_c.details)

output is:

{'id': '40982', 'name': 'steel-plates-fault', 'version': '3', 'description_version': '1', 'format': 'ARFF', 'upload_date': '2017-12-04T22:37:56', 'licence': 'Public', 'url': 'https://api.openml.org/data/v1/download/18151921/steel-plates-fault.arff', 'parquet_url': 'https://openml1.win.tue.nl/datasets/0004/40982/dataset_40982.pq', 'file_id': '18151921', 'default_target_attribute': 'target', 'version_label': '3', 'tag': ['Data Science', 'Engineering', 'OpenML-CC18', 'study_135', 'study_98', 'study_99'], 'visibility': 'public', 'minio_url': 'https://openml1.win.tue.nl/datasets/0004/40982/dataset_40982.pq', 'status': 'active', 'processing_date': '2018-10-04 07:21:37', 'md5_checksum': '7ccdabeb01749cce9fa3b1d4a702fb8c'}
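
Instead of comparing the printed dictionaries by eye, a few identifying fields can be compared in code. The sketch below assumes that equal id, version and md5_checksum values are enough to treat two downloads as the same dataset file; the helper name is invented, and the values are copied from the outputs shown in this post.

```python
def same_openml_dataset(details_a, details_b):
    """Return True if two details dictionaries describe the same dataset file."""
    keys = ("id", "version", "md5_checksum")
    # Direct indexing raises KeyError if a field is missing from either side.
    return all(details_a[key] == details_b[key] for key in keys)


# Fields copied from the two printed outputs (other keys omitted).
details_first = {
    "id": "40982",
    "version": "3",
    "md5_checksum": "7ccdabeb01749cce9fa3b1d4a702fb8c",
}
details_second = {
    "id": "40982",
    "version": "3",
    "md5_checksum": "7ccdabeb01749cce9fa3b1d4a702fb8c",
}

print(same_openml_dataset(details_first, details_second))
```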

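In the call that creates my_steel_plates_fault_c, no parser and no as_frame value are given. With the defaults of recent scikit-learn versions the pandas parser is used, so pandas must be installed before calling fetch_openml. The import pandas line makes this dependency explicit.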
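
If a script has to run both with and without pandas, the arguments can be chosen at run time. This is only a sketch: the helper name fetch_kwargs is made up, and it assumes a scikit-learn version whose fetch_openml accepts the parser parameter.

```python
import importlib.util


def fetch_kwargs():
    """Pick fetch_openml keyword arguments based on whether pandas is installed."""
    has_pandas = importlib.util.find_spec("pandas") is not None
    if has_pandas:
        return {"parser": "pandas", "as_frame": True}
    # Pure-Python fallback: NumPy arrays, no pandas needed.
    return {"parser": "liac-arff", "as_frame": False}


kwargs = fetch_kwargs()
print(kwargs)

# Usage (needs scikit-learn and network access):
# from sklearn.datasets import fetch_openml
# dataset = fetch_openml(data_id=40982, **kwargs)
```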

6) Downloading dataset as a dataframe

Using:

import pandas

from sklearn.datasets import fetch_openml

my_steel_plates_fault_d = fetch_openml(data_id=40982, as_frame=True)

print(my_steel_plates_fault_d.data.head())

output is:

     V1    V2       V3       V4    V5  ...     V23     V24     V25     V26     V27

0    42    50   270900   270944   267  ...  0.9031  1.6435  0.8182 -0.2913  0.5822

1   645   651  2538079  2538108   108  ...  0.7782  1.4624  0.7931 -0.1756  0.2984

2   829   835  1553913  1553931    71  ...  0.7782  1.2553  0.6667 -0.1228  0.2150

3   853   860   369370   369415   176  ...  0.8451  1.6532  0.8444 -0.1568  0.5212

4  1289  1306   498078   498335  2409  ...  1.2305  2.4099  0.9338 -0.1992  1.0000


[5 rows x 27 columns]

These are the first 5 rows.

We see here that the column names are V1, V2, V3, ....

The meaning of these names is given in the dataset description in section 2.
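
The mapping from V1, V2, ... to descriptive names can also be built in code from the description text. The sketch below assumes the description keeps the "* V1: X_Minimum" line format printed in section 2; the helper name is invented, and the sample string contains only a few of those lines.

```python
import re


def attribute_names_from_descr(descr):
    """Map generic column names (V1, V2, ...) to the names listed in DESCR."""
    pattern = re.compile(r"^[ \t]*\*\s*(V\d+):\s*(\S+)", re.MULTILINE)
    return {short: long for short, long in pattern.findall(descr)}


# A few lines copied from the description printed in section 2.
sample_descr = """\
### Attribute Information
* V1: X_Minimum
* V2: X_Maximum
* V27: SigmoidOfAreas
* target: 7 types of fault as classification target
"""

mapping = attribute_names_from_descr(sample_descr)
print(mapping)
```

With the real dataset, attribute_names_from_descr(my_steel_plates_fault_d.DESCR) should give the full mapping, and my_steel_plates_fault_d.data.rename(columns=mapping) returns a DataFrame with the descriptive column names.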