Vishaal Grizzly · Mar 17, 2023


Car classification is a crucial task in the automotive industry. With the increasing number of car manufacturers, it has become challenging to classify cars based on various parameters. However, with the help of machine learning, car classification has become more efficient and accurate. In this blog, we will explore various machine learning models to classify cars based on their parameters.


Data preprocessing

We begin by importing the required libraries and reading the car classification dataset with Pandas. The dataset contains shape-based measurements for each car, such as compactness, circularity, distance circularity, and skewness.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay,classification_report

Loading the dataset

df=pd.read_csv("C:/Users/Vishaal Grizzly/Downloads/cars_class.csv") 

Checking the head and tail of the loaded data

df.head()
[Output: first five rows of the dataset]
# df.tail()

Data Cleaning

Next, we perform data cleaning by checking for missing values and dropping any duplicate rows. We also drop the ID column, as it won’t serve much use in this scenario. Since the data is already clean, we do not need to perform any other actions.

df.isnull().sum()

ID              0
Comp            0
Circ            0
D.Circ          0
Rad.Ra          0
Pr.Axis.Ra      0
Max.L.Ra        0
Scat.Ra         0
Elong           0
Pr.Axis.Rect    0
Max.L.Rect      0
Sc.Var.Maxis    0
Sc.Var.maxis    0
Ra.Gyr          0
Skew.Maxis      0
Skew.maxis      0
Kurt.maxis      0
Kurt.Maxis      0
Holl.Ra         0
Class           0
dtype: int64

Listing the columns in the dataset

df.columns

Index(['ID', 'Comp', 'Circ', 'D.Circ', 'Rad.Ra', 'Pr.Axis.Ra', 'Max.L.Ra',
       'Scat.Ra', 'Elong', 'Pr.Axis.Rect', 'Max.L.Rect', 'Sc.Var.Maxis',
       'Sc.Var.maxis', 'Ra.Gyr', 'Skew.Maxis', 'Skew.maxis', 'Kurt.maxis',
       'Kurt.Maxis', 'Holl.Ra', 'Class'],
      dtype='object')

Some column names look repeated (for example Skew.Maxis and Skew.maxis), but they differ in case and are distinct features. We still call drop_duplicates() to remove any duplicate rows.

df = df.drop_duplicates()

The column list is unchanged after dropping duplicate rows

df.columns

Index(['ID', 'Comp', 'Circ', 'D.Circ', 'Rad.Ra', 'Pr.Axis.Ra', 'Max.L.Ra',
       'Scat.Ra', 'Elong', 'Pr.Axis.Rect', 'Max.L.Rect', 'Sc.Var.Maxis',
       'Sc.Var.maxis', 'Ra.Gyr', 'Skew.Maxis', 'Skew.maxis', 'Kurt.maxis',
       'Kurt.Maxis', 'Holl.Ra', 'Class'],
      dtype='object')

Checking for the datatypes in each column

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 719 entries, 0 to 718
Data columns (total 20 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   ID            719 non-null    int64
 1   Comp          719 non-null    int64
 2   Circ          719 non-null    int64
 3   D.Circ        719 non-null    int64
 4   Rad.Ra        719 non-null    int64
 5   Pr.Axis.Ra    719 non-null    int64
 6   Max.L.Ra      719 non-null    int64
 7   Scat.Ra       719 non-null    int64
 8   Elong         719 non-null    int64
 9   Pr.Axis.Rect  719 non-null    int64
 10  Max.L.Rect    719 non-null    int64
 11  Sc.Var.Maxis  719 non-null    int64
 12  Sc.Var.maxis  719 non-null    int64
 13  Ra.Gyr        719 non-null    int64
 14  Skew.Maxis    719 non-null    int64
 15  Skew.maxis    719 non-null    int64
 16  Kurt.maxis    719 non-null    int64
 17  Kurt.Maxis    719 non-null    int64
 18  Holl.Ra       719 non-null    int64
 19  Class         719 non-null    int64
dtypes: int64(20)
memory usage: 118.0 KB

We can drop the ID column as it won’t serve much use in this scenario

df = df.drop(columns='ID')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 719 entries, 0 to 718
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Comp          719 non-null    int64
 1   Circ          719 non-null    int64
 2   D.Circ        719 non-null    int64
 3   Rad.Ra        719 non-null    int64
 4   Pr.Axis.Ra    719 non-null    int64
 5   Max.L.Ra      719 non-null    int64
 6   Scat.Ra       719 non-null    int64
 7   Elong         719 non-null    int64
 8   Pr.Axis.Rect  719 non-null    int64
 9   Max.L.Rect    719 non-null    int64
 10  Sc.Var.Maxis  719 non-null    int64
 11  Sc.Var.maxis  719 non-null    int64
 12  Ra.Gyr        719 non-null    int64
 13  Skew.Maxis    719 non-null    int64
 14  Skew.maxis    719 non-null    int64
 15  Kurt.maxis    719 non-null    int64
 16  Kurt.Maxis    719 non-null    int64
 17  Holl.Ra       719 non-null    int64
 18  Class         719 non-null    int64
dtypes: int64(19)
memory usage: 112.3 KB

As the data is already clean, no further preprocessing is needed.

Exploratory Data Analysis

We then perform exploratory data analysis (EDA) to get insights into the data. We use Seaborn and Matplotlib libraries to create scatterplots and histograms to visualize the relationships between the different car parameters and their classes. We also check the class column’s value counts and describe the dataset’s statistical summary. We notice that there are some outliers in the dataset.

plt.figure(figsize=(12, 6))
sns.scatterplot(x='Circ', y='D.Circ', hue='Class', data=df.head(100), s=200)
plt.title("Car Class Classification Visualization", y=1.015, fontsize=23)
plt.xlabel("Circularity")
plt.ylabel("Distance Circularity")
ax = plt.gca()

[Figure: scatter plot of Circularity vs. Distance Circularity, coloured by Class]
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Skew.Maxis', y='Skew.maxis', hue='Class', data=df.head(100), s=200)
plt.title("Car Class Classification Data", y=1.015, fontsize=23)
plt.xlabel("Skew.Maxis")
plt.ylabel("Skew.maxis")
ax = plt.gca()

[Figure: scatter plot of Skew.Maxis vs. Skew.maxis, coloured by Class]
df['Class'].value_counts()

0    189
1    180
3    177
2    173
Name: Class, dtype: int64

The class column seems pretty balanced

df.describe()

[Output: summary statistics of the numeric columns]

plt.boxplot(df)
plt.show()

[Figure: box plots of all features]

The box plots suggest that several features contain outliers.

sns.histplot(df)
plt.show()
[Figure: histograms of all features]
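To get a rough count of those outliers, a quick per-column check based on the usual 1.5 × IQR rule can be run. This is a minimal sketch (the 1.5 multiplier is a conventional choice, not something used in the original run):

# Count values outside the 1.5*IQR whiskers for each feature (rough outlier check)
features = df.drop(columns='Class')
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1
outlier_mask = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)
print(outlier_mask.sum().sort_values(ascending=False))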

Data splitting

We split the data into training and testing datasets and assign the test size.

# Splitting data into features and target
X = df.drop(columns='Class')
Y = df.Class

X.shape        # (719, 18)
Y.shape        # (719,)

# Assigning the test size
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

X_train.shape  # (575, 18)
Y_train.shape  # (575,)
X_test.shape   # (144, 18)
Y_test.shape   # (144,)
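Note that train_test_split does not stratify by default. Since we want each class represented proportionally in the test set, passing the target to stratify is a small optional refinement; a sketch only, not what the reported results below were produced with:

# Optional: preserve the class proportions in both splits
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y)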

Scaling the data

We then scale the data using the StandardScaler from Scikit-Learn to normalize the features. We plot box plots and histograms to check the distribution of the scaled data.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_scale = sc.fit_transform(X_train)
X_test_scale = sc.transform(X_test)

X_train_scale.shape   # (575, 18)
X_test_scale.shape    # (144, 18)
X_train_scale.mean()  # 1.5017509511837866e-16

# Plotting box plot and histogram of the scaled features
plt.boxplot(X_train_scale)
plt.show()

[Figure: box plots of the scaled training features]

sns.histplot(X_train_scale)
plt.show()

[Figure: histograms of the scaled training features]
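As a quick sanity check on the scaling, the per-feature means of the scaled training set should be close to 0 and the standard deviations close to 1; a minimal sketch:

# Per-feature mean and standard deviation after scaling (train set)
print(np.round(X_train_scale.mean(axis=0), 3))  # expected: approximately 0 everywhere
print(np.round(X_train_scale.std(axis=0), 3))   # expected: approximately 1 everywhere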

Building ML models

We then build various machine learning models to classify cars based on their parameters. We begin by building the K-Nearest Neighbors (KNN) model and then move on to the Support Vector Machine (SVM) model, Random Forest Classifier (RFC), and Gradient Boosting Classifier (GBC). We evaluate each model’s accuracy and F1 score and visualize their confusion matrices using the ConfusionMatrixDisplay from Scikit-Learn.
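Since every model below is evaluated the same way (accuracy, macro F1, confusion matrix), a small helper like the following can reduce the repetition. This is an optional sketch using only the imports above; the runs below repeat the steps per model instead:

# Optional helper: fit a model and report accuracy, macro F1 and the confusion matrix
def evaluate(model, X_tr, y_tr, X_te, y_te):
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    print("Accuracy:", model.score(X_te, y_te),
          " f1_score:", f1_score(y_te, preds, average='macro'))
    ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_te, preds)).plot()
    plt.show()
    return model

# Example usage:
# evaluate(KNeighborsClassifier(4), X_train_scale, Y_train, X_test_scale, Y_test)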

K-nearest neighbour

knn = KNeighborsClassifier(4)
knn.fit(X_train_scale,Y_train)
knn_score=knn.score(X_test_scale,Y_test)
print("The Accuracy level = ",knn_score, " ","The f1_score is = ",f1_score(Y_test,knn.predict(X_test_scale),average='macro'))
The Accuracy level = 0.7222222222222222 The f1_score is = 0.6996933621933622


(scikit-learn also emits a FutureWarning here about the default behaviour of scipy.stats.mode changing in SciPy 1.11.0; it can be safely ignored.)

Confusion matrix

cm = confusion_matrix(Y_test,knn.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
[Figure: confusion matrix for the KNN model]
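The choice of k = 4 is somewhat arbitrary. A quick sweep over a few neighbour counts (a sketch, with values chosen purely for illustration) shows how sensitive KNN is to k:

# Try a handful of neighbour counts and compare test-set accuracy
for k in [3, 4, 5, 7, 9, 11]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train_scale, Y_train)
    print(k, knn_k.score(X_test_scale, Y_test))

Strictly, this sweep should be validated with cross-validation on the training data rather than against the test set, but it illustrates the sensitivity to the neighbour count.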

SVM

Next, let’s move on to the Support Vector Machine.

svc=SVC()
svc.fit(X_train_scale,Y_train)
svc_score=svc.score(X_test_scale,Y_test)
print("Accuracy:",svc_score, " ","f1_score:",f1_score(Y_test,svc.predict(X_test_scale),average='macro'))
Accuracy: 0.7847222222222222   f1_score: 0.7621118774268613

cm = confusion_matrix(Y_test, svc.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

[Figure: confusion matrix for the SVM model]

RFC

Next, let’s move on to the Random Forest Classifier.

RND=RandomForestClassifier()
RND.fit(X_train_scale,Y_train)
RND_score=RND.score(X_test_scale,Y_test)
print("Accuracy:",RND_score, " ","f1_score:",f1_score(Y_test,RND.predict(X_test_scale),average='macro'))
Accuracy: 0.7847222222222222   f1_score: 0.7654748024355486

cm = confusion_matrix(Y_test, RND.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

[Figure: confusion matrix for the Random Forest model]

GBC

Finally, let’s move on to the Gradient Boosting Classifier.

GB=GradientBoostingClassifier()
GB.fit(X_train_scale,Y_train)
GB_score=GB.score(X_test_scale,Y_test)
print("Accuracy:",GB_score, " ","f1_score:",f1_score(Y_test,GB.predict(X_test_scale),average='macro'))
Accuracy: 0.7916666666666666   f1_score: 0.7763471096804431

cm = confusion_matrix(Y_test, GB.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

[Figure: confusion matrix for the Gradient Boosting model]

Final step

The Gradient Boosting Classifier has given the best F1 score and confusion matrix so far, so we proceed to tune its hyperparameters.

Hyperparameter Tuning

We tune the hyperparameters of the Gradient Boosting Classifier using GridSearchCV from Scikit-Learn. We define a parameter grid with different learning rates, numbers of estimators, and maximum depths, fit the grid search, and extract the best parameters and estimator. We then build a new GBC model around them (the code below keeps the best max_depth and n_estimators but uses a slightly lower learning rate of 0.07).

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

gb = GradientBoostingClassifier()
print(gb.get_params().keys())

dict_keys(['ccp_alpha', 'criterion', 'init', 'learning_rate', 'loss', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_iter_no_change', 'random_state', 'subsample', 'tol', 'validation_fraction', 'verbose', 'warm_start'])

param_grid = {'learning_rate': [0.1, 0.05, 0.01],
              'n_estimators': [100, 200, 300],
              'max_depth': [3, 4, 5]}

grid = GridSearchCV(GradientBoostingClassifier(), param_grid, refit=True)

Fitting the model for grid search

grid.fit(X_train_scale, Y_train)

GridSearchCV(estimator=GradientBoostingClassifier(),
             param_grid={'learning_rate': [0.1, 0.05, 0.01],
                         'max_depth': [3, 4, 5],
                         'n_estimators': [100, 200, 300]})

grid.best_params_     # {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 300}
grid.best_estimator_  # GradientBoostingClassifier(max_depth=4, n_estimators=300)

GBC_new = GradientBoostingClassifier(learning_rate=0.07, n_estimators=300, max_depth=4)
GBC_new.fit(X_train_scale, Y_train)
GBC_score = GBC_new.score(X_test_scale, Y_test)

print("Accuracy:", GBC_score)
print("f1_score:", f1_score(Y_test, GBC_new.predict(X_test_scale), average='macro'))

Accuracy: 0.7777777777777778
f1_score: 0.7566992168456375
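Interestingly, the tuned model scores slightly lower on the held-out test set than the default GBC. One way to sanity-check this without repeatedly peeking at the test set is to compare the cross-validated scores stored by GridSearchCV; a minimal sketch using the grid object fitted above:

# Mean cross-validated accuracy of the best parameter combination
print("Best CV score:", grid.best_score_)

# Ranked view of every combination tried by the grid search
cv_results = pd.DataFrame(grid.cv_results_)
print(cv_results[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())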
print(classification_report(Y_test, GB.predict(X_test_scale)))  # without tuning

              precision    recall  f1-score   support

           0       0.97      0.95      0.96        39
           1       0.61      0.62      0.62        32
           2       0.61      0.62      0.62        32
           3       0.93      0.90      0.91        41

    accuracy                           0.79       144
   macro avg       0.78      0.78      0.78       144
weighted avg       0.80      0.79      0.79       144

print(classification_report(Y_test, GBC_new.predict(X_test_scale)))  # with tuning

              precision    recall  f1-score   support

           0       1.00      0.95      0.97        39
           1       0.54      0.59      0.57        32
           2       0.59      0.53      0.56        32
           3       0.91      0.95      0.93        41

    accuracy                           0.78       144
   macro avg       0.76      0.76      0.76       144
weighted avg       0.78      0.78      0.78       144

Defining a wider parameter range

Since the first round of tuning did not improve the test-set numbers, we widen the search over the learning rate and the number of estimators.

param_grid = {'learning_rate': [0.1, 0.05, 0.01, 0.09, 0.08, 0.07],
              'n_estimators': [100, 200, 300, 400, 500]}

grid = GridSearchCV(GradientBoostingClassifier(), param_grid, refit=True)

Fitting the model for grid search

grid.fit(X_train_scale, Y_train)

# print best parameter after tuning
print(grid.best_params_)

# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)
{'learning_rate': 0.07, 'n_estimators': 300}
GradientBoostingClassifier(learning_rate=0.07, n_estimators=300)
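Tuning the SVM

We now tune the SVM in the same spirit. The search that led to C=1000 and gamma=0.01 is not shown in the original run; it could be produced by a grid search like the following sketch, where the parameter ranges are an assumption chosen as typical values for an RBF kernel:

# Hypothetical grid search over typical RBF-SVM parameters (not part of the original run)
svc_param_grid = {'C': [0.1, 1, 10, 100, 1000],
                  'gamma': [1, 0.1, 0.01, 0.001],
                  'kernel': ['rbf']}
svc_grid = GridSearchCV(SVC(), svc_param_grid, refit=True)
svc_grid.fit(X_train_scale, Y_train)
print(svc_grid.best_params_)

Using those values, we refit the SVM and evaluate it on the test set: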
svc_new=SVC(C=1000,gamma=0.01,kernel='rbf')
svc_new.fit(X_train_scale,Y_train)
svc_score=svc_new.score(X_test_scale,Y_test)
print("Accuracy:",svc_score, " ","f1_score:",f1_score(Y_test,svc_new.predict(X_test_scale),average='macro'))
Accuracy: 0.8611111111111112   f1_score: 0.8499680849407625

cm = confusion_matrix(Y_test, svc_new.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

[Figure: confusion matrix for the tuned SVM model]

Here are the SVM numbers without tuning

print(classification_report(Y_test, svc.predict(X_test_scale)))

              precision    recall  f1-score   support

           0       0.93      0.95      0.94        39
           1       0.58      0.56      0.57        32
           2       0.68      0.59      0.63        32
           3       0.87      0.95      0.91        41

    accuracy                           0.78       144
   macro avg       0.76      0.76      0.76       144
weighted avg       0.78      0.78      0.78       144

Here are the numbers after tuning

print(classification_report(Y_test, svc_new.predict(X_test_scale)))

              precision    recall  f1-score   support

           0       0.90      0.97      0.94        39
           1       0.76      0.78      0.77        32
           2       0.79      0.72      0.75        32
           3       0.95      0.93      0.94        41

    accuracy                           0.86       144
   macro avg       0.85      0.85      0.85       144
weighted avg       0.86      0.86      0.86       144

Conclusion

In this blog, we explored various machine learning models to classify cars based on their parameters. We performed data cleaning, exploratory data analysis, data splitting and scaling, and built KNN, SVM, RFC, and GBC models. We evaluated each model’s accuracy and F1 score and visualized their confusion matrices.

Finally, we tuned the hyperparameters of the best model using GridSearchCV and built a new model using the best parameters. Machine learning can significantly improve car classification and help the automotive industry classify cars more accurately and efficiently.
