Data is one of the most important parts of machine learning (ML), and of AI in general.
In this post we will explore some of the steps needed to use data
from a dataset with the .arff extension (an ARFF file).
1) Downloading datasets from OpenML
OpenML contains a lot of datasets that can be used in AI
or ML projects; these datasets are free and very useful. In this post we
will use the steel-plates-fault dataset as an example.
We will download it with the fetch_openml function.
from sklearn.datasets import fetch_openml
my_steel_plates_fault = fetch_openml(name='steel-plates-fault', version=3, as_frame=False, parser='liac-arff')
The name and version parameters are self-explanatory.
The as_frame and parser parameters are set because, when we do not use the pandas
parser but liac-arff instead, we also need to specify as_frame=False.
2) Dataset description
Using:
from sklearn.datasets import fetch_openml
my_steel_plates_fault = fetch_openml(name='steel-plates-fault', version=3, as_frame=False, parser='liac-arff')
print(my_steel_plates_fault.DESCR)
output is:
**Author**: Semeion, Research Center of Sciences of Communication, Rome, Italy.
**Source**: [UCI](http://archive.ics.uci.edu/ml/datasets/steel+plates+faults)
**Please cite**: Dataset provided by Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy.
__Changes w.r.t. version 1: included one target factor with 7 levels as target variable for the classification. Also deleted the previous 7 binary target variables.__
**Steel Plates Faults Data Set**
A dataset of steel plates' faults, classified into 7 different types. The goal was to train machine learning for automatic pattern recognition.
The dataset consists of 27 features describing each fault (location, size, ...) and 1 feature indicating the type of fault (on of 7: Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, Other_Faults). The target is the type of fault.
### Attribute Information
* V1: X_Minimum
* V2: X_Maximum
* V3: Y_Minimum
* V4: Y_Maximum
* V5: Pixels_Areas
* V6: X_Perimeter
* V7: Y_Perimeter
* V8: Sum_of_Luminosity
* V9: Minimum_of_Luminosity
* V10: Maximum_of_Luminosity
* V11: Length_of_Conveyer
* V12: TypeOfSteel_A300
* V13: TypeOfSteel_A400
* V14: Steel_Plate_Thickness
* V15: Edges_Index
* V16: Empty_Index
* V17: Square_Index
* V18: Outside_X_Index
* V19: Edges_X_Index
* V20: Edges_Y_Index
* V21: Outside_Global_Index
* V22: LogOfAreas
* V23: Log_X_Index
* V24: Log_Y_Index
* V25: Orientation_Index
* V26: Luminosity_Index
* V27: SigmoidOfAreas
* target: 7 types of fault as classification target
### Relevant Papers
1.M Buscema, S Terzi, W Tastle, A New Meta-Classifier,in NAFIPS 2010, Toronto (CANADA),26-28 July 2010, 978-1-4244-7858-6/10 ©2010 IEEE
2.M Buscema, MetaNet: The Theory of Independent Judges, in Substance Use & Misuse, 33(2), 439-461,1998
Downloaded from openml.org.
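Since the columns in the fetched data are named V1…V27, a small helper that maps them back to the attribute names can be handy. This dict is hand-built from the attribute list in the description above (it is not something fetch_openml returns):

```python
# Mapping from the generic V-column names to the attribute names
# listed in the dataset description (hand-copied from DESCR).
ATTRIBUTE_NAMES = {
    'V1': 'X_Minimum', 'V2': 'X_Maximum', 'V3': 'Y_Minimum', 'V4': 'Y_Maximum',
    'V5': 'Pixels_Areas', 'V6': 'X_Perimeter', 'V7': 'Y_Perimeter',
    'V8': 'Sum_of_Luminosity', 'V9': 'Minimum_of_Luminosity',
    'V10': 'Maximum_of_Luminosity', 'V11': 'Length_of_Conveyer',
    'V12': 'TypeOfSteel_A300', 'V13': 'TypeOfSteel_A400',
    'V14': 'Steel_Plate_Thickness', 'V15': 'Edges_Index', 'V16': 'Empty_Index',
    'V17': 'Square_Index', 'V18': 'Outside_X_Index', 'V19': 'Edges_X_Index',
    'V20': 'Edges_Y_Index', 'V21': 'Outside_Global_Index', 'V22': 'LogOfAreas',
    'V23': 'Log_X_Index', 'V24': 'Log_Y_Index', 'V25': 'Orientation_Index',
    'V26': 'Luminosity_Index', 'V27': 'SigmoidOfAreas',
}

print(ATTRIBUTE_NAMES['V14'])  # Steel_Plate_Thickness
```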
3) Printing dataset details
Using:
print(my_steel_plates_fault.details)
output is:
{'id': '40982', 'name': 'steel-plates-fault', 'version': '3', 'description_version': '1', 'format': 'ARFF', 'upload_date': '2017-12-04T22:37:56', 'licence': 'Public', 'url': 'https://api.openml.org/data/v1/download/18151921/steel-plates-fault.arff', 'parquet_url': 'https://openml1.win.tue.nl/datasets/0004/40982/dataset_40982.pq', 'file_id': '18151921', 'default_target_attribute': 'target', 'version_label': '3', 'tag': ['Data Science', 'Engineering', 'OpenML-CC18', 'study_135', 'study_98', 'study_99'], 'visibility': 'public', 'minio_url': 'https://openml1.win.tue.nl/datasets/0004/40982/dataset_40982.pq', 'status': 'active', 'processing_date': '2018-10-04 07:21:37', 'md5_checksum': '7ccdabeb01749cce9fa3b1d4a702fb8c'}
4) Printing the dataset URL
Using:
print(my_steel_plates_fault.url)
output is:
https://www.openml.org/d/40982
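Note that this URL is just the OpenML dataset page built from the numeric id that also appears in details. A small sketch of that relationship (the details dict below is a minimal stand-in for my_steel_plates_fault.details, not the full dict printed above):

```python
def openml_dataset_url(details):
    # Build the OpenML dataset page URL from the 'id' entry of a details dict.
    return f"https://www.openml.org/d/{details['id']}"

details = {'id': '40982'}  # minimal stand-in for my_steel_plates_fault.details
print(openml_dataset_url(details))  # https://www.openml.org/d/40982
```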
5) Using data_id to download the dataset
40982 is the data_id of this dataset. We can pass data_id as a parameter to
fetch_openml to download the dataset, for example:
from sklearn.datasets import fetch_openml
my_steel_plates_fault_b = fetch_openml(data_id=40982, parser="liac-arff", as_frame=False)
print(my_steel_plates_fault_b.details)
This creates my_steel_plates_fault_b, which is basically the same dataset as the previous my_steel_plates_fault.
For comparison, we will print the details once more, this time for a third copy, my_steel_plates_fault_c.
Using:
import pandas
from sklearn.datasets import fetch_openml
my_steel_plates_fault_c = fetch_openml(data_id=40982)
print(my_steel_plates_fault_c.details)
output is:
{'id': '40982', 'name': 'steel-plates-fault', 'version': '3', 'description_version': '1', 'format': 'ARFF', 'upload_date': '2017-12-04T22:37:56', 'licence': 'Public', 'url': 'https://api.openml.org/data/v1/download/18151921/steel-plates-fault.arff', 'parquet_url': 'https://openml1.win.tue.nl/datasets/0004/40982/dataset_40982.pq', 'file_id': '18151921', 'default_target_attribute': 'target', 'version_label': '3', 'tag': ['Data Science', 'Engineering', 'OpenML-CC18', 'study_135', 'study_98', 'study_99'], 'visibility': 'public', 'minio_url': 'https://openml1.win.tue.nl/datasets/0004/40982/dataset_40982.pq', 'status': 'active', 'processing_date': '2018-10-04 07:21:37', 'md5_checksum': '7ccdabeb01749cce9fa3b1d4a702fb8c'}
In this case the pandas parser is used, and for this reason pandas must be
installed before calling fetch_openml (hence the import pandas line).
6) Downloading the dataset as a dataframe
Using:
import pandas
from sklearn.datasets import fetch_openml
my_steel_plates_fault_d = fetch_openml(data_id=40982, as_frame=True)
print(my_steel_plates_fault_d.data.head())
output is:
V1 V2 V3 V4 V5 ... V23 V24 V25 V26 V27
0 42 50 270900 270944 267 ... 0.9031 1.6435 0.8182 -0.2913 0.5822
1 645 651 2538079 2538108 108 ... 0.7782 1.4624 0.7931 -0.1756 0.2984
2 829 835 1553913 1553931 71 ... 0.7782 1.2553 0.6667 -0.1228 0.2150
3 853 860 369370 369415 176 ... 0.8451 1.6532 0.8444 -0.1568 0.5212
4 1289 1306 498078 498335 2409 ... 1.2305 2.4099 0.9338 -0.1992 1.0000
[5 rows x 27 columns]
These are the first 5 rows.
We can see that the column names are V1, V2, V3, ....
The meaning of these names is given in the dataset description from step 2).