Scikit Machine learning of Car evaluation dataset

I have been working on machine learning for over a month using python, scikit-learn, and pandas. Over 90% of the work is on encoding the data formatting for machine learning, and rest 10% is setting up algorithms for machine learning.

Lets talk about car evaluation dataset and here is how i got 98% accuracy in prediction using RandomForest classifier. This is a 4type classification problem

Things you need

1. Scipy, Numpy, Scikit, Python, Pandas, Matplotlib

2. Car Evaluation Dataset

https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Lets take a look at the dataset:

The car evaluation dataset has comma separated values with about 7 attributes.

7. Attribute Values:

   buying       v-high, high, med, low
   maint        v-high, high, med, low
   doors        2, 3, 4, 5-more
   persons      2, 4, more
   lug_boot     small, med, big
   safety       low, med, high

and the class output. We only need the values unacc,acc,good and v-good, the rest is statistical.

class    
   -----------------------------
   unacc     1210     (70.023 %) 
   acc        384     (22.222 %) 
   good        69     ( 3.993 %) 
   vgood      65     ( 3.762 %) 

If you look at the data in csv, it looks like 1728 rows.

buying,maint,doors,persons,lug_boot,safety,class
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
......

The problem with the above data is it has categorical lablels which is unsuitable for machine learning algorithms. You need to convert them to unique numerical values for machine learning. Lets do it with pandas in python First we will import the csv into pandas

>> df = pd.read_csv('csv.data',header=0)
>> df.info()

Int64Index: 1728 entries, 0 to 1727

Data columns (total 7 columns):

buying      1728 non-null object

maint       1728 non-null object

doors       1728 non-null object

persons     1728 non-null object

lug_boot    1728 non-null object

safety      1728 non-null object

class       1728 non-null object

As you can see from the summary, all are string objects, we want them to be converted to numeric unqiue values. In pandas you can use factorize() function to encode the labels columnwise or a simple replace() will work.

Lets encode the labels in the dataset like this, in a very simple way…

vhigh = 4 high=3 med=2 low=1
5more = 6 more =5
small =1 med=2 big=3
unacc=1 acc=2 good=3 vgood=4

df = df.replace('vhigh',4)
 df = replace('high',3)

After encoding we see the dataset like this

buying,maint,doors,persons,lug_boot,safety,class
0,4,4,2,2,1,1,1
1,4,4,2,2,1,2,1
2,4,4,2,2,1,3,1
3,4,4,2,2,2,1,1
4,4,4,2,2,2,2,1

Now our data is ready for machine learning using scikit. You can download my encoded csv file car csv data

Machine Learning Algorithm

The best and easiest machine learning algorithm i have seen is RandomForests. Pretty much you can throw at it everything and it will work. Naive Bayes or KNN are good algorithms.

Now that our data is numeric, we make setup things for machine learning.

First we convert from pandas to numpy

car = df.values

Then we split the data to X,y which is attributes and output class (small y)

X,y = car[:,:6], car[:,6]

This selects all rows then X holds first 6 column attributes and y the last column as class (1d array)

We make sure that all the values in numpy are int to avoid potential numpy problems

X,y = X.astype(int), y.astype(int)

Lets split the data for 70% train and 30% test test for scikit machine learning

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.3,random_state=0)

Now that all is setup to apply for machine learning algorithms, lets setup the RandomForestClassifier()

>>> clf = ensemble.RandomForestClassifier(n_estimators=500)

>>> clf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, compute_importances=None,

            criterion='gini', max_depth=None, max_features='auto',

            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,

            min_samples_split=2, n_estimators=500, n_jobs=1,

            oob_score=False, random_state=None, verbose=0)

>>> clf.score(X_test,y_test)

0.97880539499036612

97.88% Thats a pretty good accuracy in classification out of the box for Random forest classifier

Lets push a little bit further. Many machine learning algorithms perform very well if the data is scaled between the maximum and minimum. Lets scale the data

from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

Now we apply the random forest classifier.

clf.fit(X_train_scaled,y_train)
clf.score(X_test_scaled,y_test)
>>> clf.score(X_test_scaled,y_test)
0.98265895953757221

98.2% Wow! Not bad, we just pushed the score up!!

Lets predict the test data and compare it with the original y_test and plot a graph.

y_pred = clf.predict(X_test)

As you can see below, the red are original and blue are predicted values. We are not 100%