Vishaal Grizzly · Mar 17, 2023


Car classification is a crucial task in the automotive industry. With the increasing number of car manufacturers, it has become challenging to classify cars based on various parameters. However, with the help of machine learning, car classification has become more efficient and accurate. In this blog, we will explore various machine learning models to classify cars based on their parameters.


Data preprocessing

We begin by importing the required libraries and reading the car classification dataset with Pandas. The dataset contains shape-based measurements for each car, such as compactness, circularity, distance circularity, and skewness.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay,classification_report

Loading the dataset

df=pd.read_csv("C:/Users/Vishaal Grizzly/Downloads/cars_class.csv") 

Checking the head and tail of the loaded data

df.head()
[Output: first five rows of the dataset]
# df.tail()

Data Cleaning

Next, we perform data cleaning by checking for missing values and dropping any duplicate rows. We also drop the ID column, as it won’t serve much use in this scenario. Since the data is already clean, we do not need to perform any other actions.

df.isnull().sum()

ID              0
Comp            0
Circ            0
D.Circ          0
Rad.Ra          0
Pr.Axis.Ra      0
Max.L.Ra        0
Scat.Ra         0
Elong           0
Pr.Axis.Rect    0
Max.L.Rect      0
Sc.Var.Maxis    0
Sc.Var.maxis    0
Ra.Gyr          0
Skew.Maxis      0
Skew.maxis      0
Kurt.maxis      0
Kurt.Maxis      0
Holl.Ra         0
Class           0
dtype: int64

Listing the columns in the dataset

df.columns

Index(['ID', 'Comp', 'Circ', 'D.Circ', 'Rad.Ra', 'Pr.Axis.Ra', 'Max.L.Ra',
       'Scat.Ra', 'Elong', 'Pr.Axis.Rect', 'Max.L.Rect', 'Sc.Var.Maxis',
       'Sc.Var.maxis', 'Ra.Gyr', 'Skew.Maxis', 'Skew.maxis', 'Kurt.maxis',
       'Kurt.Maxis', 'Holl.Ra', 'Class'],
      dtype='object')

Some column names look repeated (for example Skew.Maxis and Skew.maxis), but they differ in case and are distinct features. We still call drop_duplicates() to remove any duplicate rows.

df = df.drop_duplicates()

The column list is unchanged after dropping duplicate rows

df.columns

Index(['ID', 'Comp', 'Circ', 'D.Circ', 'Rad.Ra', 'Pr.Axis.Ra', 'Max.L.Ra',
       'Scat.Ra', 'Elong', 'Pr.Axis.Rect', 'Max.L.Rect', 'Sc.Var.Maxis',
       'Sc.Var.maxis', 'Ra.Gyr', 'Skew.Maxis', 'Skew.maxis', 'Kurt.maxis',
       'Kurt.Maxis', 'Holl.Ra', 'Class'],
      dtype='object')

Checking for the datatypes in each column

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 719 entries, 0 to 718
Data columns (total 20 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   ID            719 non-null    int64
 1   Comp          719 non-null    int64
 2   Circ          719 non-null    int64
 3   D.Circ        719 non-null    int64
 4   Rad.Ra        719 non-null    int64
 5   Pr.Axis.Ra    719 non-null    int64
 6   Max.L.Ra      719 non-null    int64
 7   Scat.Ra       719 non-null    int64
 8   Elong         719 non-null    int64
 9   Pr.Axis.Rect  719 non-null    int64
 10  Max.L.Rect    719 non-null    int64
 11  Sc.Var.Maxis  719 non-null    int64
 12  Sc.Var.maxis  719 non-null    int64
 13  Ra.Gyr        719 non-null    int64
 14  Skew.Maxis    719 non-null    int64
 15  Skew.maxis    719 non-null    int64
 16  Kurt.maxis    719 non-null    int64
 17  Kurt.Maxis    719 non-null    int64
 18  Holl.Ra       719 non-null    int64
 19  Class         719 non-null    int64
dtypes: int64(20)
memory usage: 118.0 KB

We can drop the ID column as it won’t serve much use in this scenario

df = df.drop(columns='ID')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 719 entries, 0 to 718
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Comp          719 non-null    int64
 1   Circ          719 non-null    int64
 2   D.Circ        719 non-null    int64
 3   Rad.Ra        719 non-null    int64
 4   Pr.Axis.Ra    719 non-null    int64
 5   Max.L.Ra      719 non-null    int64
 6   Scat.Ra       719 non-null    int64
 7   Elong         719 non-null    int64
 8   Pr.Axis.Rect  719 non-null    int64
 9   Max.L.Rect    719 non-null    int64
 10  Sc.Var.Maxis  719 non-null    int64
 11  Sc.Var.maxis  719 non-null    int64
 12  Ra.Gyr        719 non-null    int64
 13  Skew.Maxis    719 non-null    int64
 14  Skew.maxis    719 non-null    int64
 15  Kurt.maxis    719 non-null    int64
 16  Kurt.Maxis    719 non-null    int64
 17  Holl.Ra       719 non-null    int64
 18  Class         719 non-null    int64
dtypes: int64(19)
memory usage: 112.3 KB

As the data is already clean, no further preprocessing is needed.

Exploratory Data Analysis

We then perform exploratory data analysis (EDA) to get insights into the data. We use Seaborn and Matplotlib libraries to create scatterplots and histograms to visualize the relationships between the different car parameters and their classes. We also check the class column’s value counts and describe the dataset’s statistical summary. We notice that there are some outliers in the dataset.

plt.figure(figsize=(12, 6))
sns.scatterplot(x='Circ', y='D.Circ', hue='Class', data=df.head(100), s=200)
plt.title("Car Class Classification Visualization", y=1.015, fontsize=23)
plt.xlabel("Circularity")
plt.ylabel("Distance Circularity")
ax = plt.gca()

[Figure: scatter plot of Circularity vs. Distance Circularity, coloured by Class]
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Skew.Maxis', y='Skew.maxis', hue='Class', data=df.head(100), s=200)
plt.title("Car Class Classification Data", y=1.015, fontsize=23)
plt.xlabel("Skew.Maxis")
plt.ylabel("Skew.maxis")
ax = plt.gca()

[Figure: scatter plot of Skew.Maxis vs. Skew.maxis, coloured by Class]
df['Class'].value_counts()

0    189
1    180
3    177
2    173
Name: Class, dtype: int64

The class column seems pretty balanced

df.describe()

[Output: summary statistics of the numeric columns]

plt.boxplot(df)
plt.show()

[Figure: box plots of all features]

The box plots suggest that several features contain outliers.

sns.histplot(df)
plt.show()
[Figure: histograms of all features]
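To get a rough count of those outliers, a quick per-column check based on the usual 1.5 × IQR rule can be run. This is a minimal sketch (the 1.5 multiplier is a conventional choice, not something used in the original run):

# Count values outside the 1.5*IQR whiskers for each feature (rough outlier check)
features = df.drop(columns='Class')
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1
outlier_mask = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)
print(outlier_mask.sum().sort_values(ascending=False))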

Data splitting

We split the data into training and testing datasets and assign the test size.

# Splitting data into features and target
X = df.drop(columns='Class')
Y = df.Class

X.shape        # (719, 18)
Y.shape        # (719,)

# Assigning the test size
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

X_train.shape  # (575, 18)
Y_train.shape  # (575,)
X_test.shape   # (144, 18)
Y_test.shape   # (144,)
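Note that train_test_split does not stratify by default. Since we want each class represented proportionally in the test set, passing the target to stratify is a small optional refinement; a sketch only, not what the reported results below were produced with:

# Optional: preserve the class proportions in both splits
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y)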

Scaling the data

We then scale the data using the StandardScaler from Scikit-Learn to normalize the features. We plot box plots and histograms to check the distribution of the scaled data.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_scale = sc.fit_transform(X_train)
X_test_scale = sc.transform(X_test)

X_train_scale.shape   # (575, 18)
X_test_scale.shape    # (144, 18)
X_train_scale.mean()  # 1.5017509511837866e-16

# Plotting box plot and histogram of the scaled features
plt.boxplot(X_train_scale)
plt.show()

[Figure: box plots of the scaled training features]

sns.histplot(X_train_scale)
plt.show()

[Figure: histograms of the scaled training features]
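As a quick sanity check on the scaling, the per-feature means of the scaled training set should be close to 0 and the standard deviations close to 1; a minimal sketch:

# Per-feature mean and standard deviation after scaling (train set)
print(np.round(X_train_scale.mean(axis=0), 3))  # expected: approximately 0 everywhere
print(np.round(X_train_scale.std(axis=0), 3))   # expected: approximately 1 everywhere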

Building ML models

We then build various machine learning models to classify cars based on their parameters. We begin by building the K-Nearest Neighbors (KNN) model and then move on to the Support Vector Machine (SVM) model, Random Forest Classifier (RFC), and Gradient Boosting Classifier (GBC). We evaluate each model’s accuracy and F1 score and visualize their confusion matrices using the ConfusionMatrixDisplay from Scikit-Learn.
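Since every model below is evaluated the same way (accuracy, macro F1, confusion matrix), a small helper like the following can reduce the repetition. This is an optional sketch using only the imports above; the runs below repeat the steps per model instead:

# Optional helper: fit a model and report accuracy, macro F1 and the confusion matrix
def evaluate(model, X_tr, y_tr, X_te, y_te):
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    print("Accuracy:", model.score(X_te, y_te),
          " f1_score:", f1_score(y_te, preds, average='macro'))
    ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_te, preds)).plot()
    plt.show()
    return model

# Example usage:
# evaluate(KNeighborsClassifier(4), X_train_scale, Y_train, X_test_scale, Y_test)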

K-nearest neighbour

knn = KNeighborsClassifier(4)
knn.fit(X_train_scale,Y_train)
knn_score=knn.score(X_test_scale,Y_test)
print("The Accuracy level = ",knn_score, " ","The f1_score is = ",f1_score(Y_test,knn.predict(X_test_scale),average='macro'))
The Accuracy level = 0.7222222222222222 The f1_score is = 0.6996933621933622


(scikit-learn also emits a FutureWarning here about the default behaviour of scipy.stats.mode changing in SciPy 1.11.0; it can be safely ignored.)

Confusion matrix

cm = confusion_matrix(Y_test,knn.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
[Figure: confusion matrix for the KNN model]
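The choice of k = 4 is somewhat arbitrary. A quick sweep over a few neighbour counts (a sketch, with values chosen purely for illustration) shows how sensitive KNN is to k:

# Try a handful of neighbour counts and compare test-set accuracy
for k in [3, 4, 5, 7, 9, 11]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train_scale, Y_train)
    print(k, knn_k.score(X_test_scale, Y_test))

Strictly, this sweep should be validated with cross-validation on the training data rather than against the test set, but it illustrates the sensitivity to the neighbour count.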

SVM

Next, let’s move on to the Support Vector Machine.

svc=SVC()
svc.fit(X_train_scale,Y_train)
svc_score=svc.score(X_test_scale,Y_test)
print("Accuracy:",svc_score, " ","f1_score:",f1_score(Y_test,svc.predict(X_test_scale),average='macro'))
Accuracy: 0.7847222222222222   f1_score: 0.7621118774268613

cm = confusion_matrix(Y_test, svc.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

[Figure: confusion matrix for the SVM model]

RFC

Next, let’s move on to the Random Forest Classifier.

RND=RandomForestClassifier()
RND.fit(X_train_scale,Y_train)
RND_score=RND.score(X_test_scale,Y_test)
print("Accuracy:",RND_score, " ","f1_score:",f1_score(Y_test,RND.predict(X_test_scale),average='macro'))
Accuracy: 0.7847222222222222   f1_score: 0.7654748024355486

cm = confusion_matrix(Y_test, RND.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

[Figure: confusion matrix for the Random Forest model]

GBC

Finally, let’s move on to the Gradient Boosting Classifier.

GB=GradientBoostingClassifier()
GB.fit(X_train_scale,Y_train)
GB_score=GB.score(X_test_scale,Y_test)
print("Accuracy:",GB_score, " ","f1_score:",f1_score(Y_test,GB.predict(X_test_scale),average='macro'))
Accuracy: 0.7916666666666666   f1_score: 0.7763471096804431

cm = confusion_matrix(Y_test, GB.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

[Figure: confusion matrix for the Gradient Boosting model]

Final step

The Gradient Boosting Classifier has given the best F1 score and confusion matrix so far, so we proceed to tune its hyperparameters.

Hyperparameter Tuning

We tune the hyperparameters of the Gradient Boosting Classifier using GridSearchCV from Scikit-Learn. We define a parameter grid with different learning rates, numbers of estimators, and maximum depths, fit the grid search, and extract the best parameters and estimator. We then build a new GBC model around them (the code below keeps the best max_depth and n_estimators but uses a slightly lower learning rate of 0.07).

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

gb = GradientBoostingClassifier()
print(gb.get_params().keys())

dict_keys(['ccp_alpha', 'criterion', 'init', 'learning_rate', 'loss', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_iter_no_change', 'random_state', 'subsample', 'tol', 'validation_fraction', 'verbose', 'warm_start'])

param_grid = {'learning_rate': [0.1, 0.05, 0.01],
              'n_estimators': [100, 200, 300],
              'max_depth': [3, 4, 5]}

grid = GridSearchCV(GradientBoostingClassifier(), param_grid, refit=True)

Fitting the model for grid search

grid.fit(X_train_scale, Y_train)

GridSearchCV(estimator=GradientBoostingClassifier(),
             param_grid={'learning_rate': [0.1, 0.05, 0.01],
                         'max_depth': [3, 4, 5],
                         'n_estimators': [100, 200, 300]})

grid.best_params_     # {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 300}
grid.best_estimator_  # GradientBoostingClassifier(max_depth=4, n_estimators=300)

GBC_new = GradientBoostingClassifier(learning_rate=0.07, n_estimators=300, max_depth=4)
GBC_new.fit(X_train_scale, Y_train)
GBC_score = GBC_new.score(X_test_scale, Y_test)

print("Accuracy:", GBC_score)
print("f1_score:", f1_score(Y_test, GBC_new.predict(X_test_scale), average='macro'))

Accuracy: 0.7777777777777778
f1_score: 0.7566992168456375
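Interestingly, the tuned model scores slightly lower on the held-out test set than the default GBC. One way to sanity-check this without repeatedly peeking at the test set is to compare the cross-validated scores stored by GridSearchCV; a minimal sketch using the grid object fitted above:

# Mean cross-validated accuracy of the best parameter combination
print("Best CV score:", grid.best_score_)

# Ranked view of every combination tried by the grid search
cv_results = pd.DataFrame(grid.cv_results_)
print(cv_results[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())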
print(classification_report(Y_test, GB.predict(X_test_scale)))  # without tuning

              precision    recall  f1-score   support

           0       0.97      0.95      0.96        39
           1       0.61      0.62      0.62        32
           2       0.61      0.62      0.62        32
           3       0.93      0.90      0.91        41

    accuracy                           0.79       144
   macro avg       0.78      0.78      0.78       144
weighted avg       0.80      0.79      0.79       144

print(classification_report(Y_test, GBC_new.predict(X_test_scale)))  # with tuning

              precision    recall  f1-score   support

           0       1.00      0.95      0.97        39
           1       0.54      0.59      0.57        32
           2       0.59      0.53      0.56        32
           3       0.91      0.95      0.93        41

    accuracy                           0.78       144
   macro avg       0.76      0.76      0.76       144
weighted avg       0.78      0.78      0.78       144

Defining a wider parameter range

Since the first round of tuning did not improve the test-set numbers, we widen the search over the learning rate and the number of estimators.

param_grid = {'learning_rate': [0.1, 0.05, 0.01, 0.09, 0.08, 0.07],
              'n_estimators': [100, 200, 300, 400, 500]}

grid = GridSearchCV(GradientBoostingClassifier(), param_grid, refit=True)

Fitting the model for grid search

grid.fit(X_train_scale, Y_train)

# print best parameter after tuning
print(grid.best_params_)

# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)
{'learning_rate': 0.07, 'n_estimators': 300}
GradientBoostingClassifier(learning_rate=0.07, n_estimators=300)
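Tuning the SVM

We now tune the SVM in the same spirit. The search that led to C=1000 and gamma=0.01 is not shown in the original run; it could be produced by a grid search like the following sketch, where the parameter ranges are an assumption chosen as typical values for an RBF kernel:

# Hypothetical grid search over typical RBF-SVM parameters (not part of the original run)
svc_param_grid = {'C': [0.1, 1, 10, 100, 1000],
                  'gamma': [1, 0.1, 0.01, 0.001],
                  'kernel': ['rbf']}
svc_grid = GridSearchCV(SVC(), svc_param_grid, refit=True)
svc_grid.fit(X_train_scale, Y_train)
print(svc_grid.best_params_)

Using those values, we refit the SVM and evaluate it on the test set: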
svc_new=SVC(C=1000,gamma=0.01,kernel='rbf')
svc_new.fit(X_train_scale,Y_train)
svc_score=svc_new.score(X_test_scale,Y_test)
print("Accuracy:",svc_score, " ","f1_score:",f1_score(Y_test,svc_new.predict(X_test_scale),average='macro'))
Accuracy: 0.8611111111111112   f1_score: 0.8499680849407625

cm = confusion_matrix(Y_test, svc_new.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

[Figure: confusion matrix for the tuned SVM model]

Here are the SVM numbers without tuning

print(classification_report(Y_test, svc.predict(X_test_scale)))

              precision    recall  f1-score   support

           0       0.93      0.95      0.94        39
           1       0.58      0.56      0.57        32
           2       0.68      0.59      0.63        32
           3       0.87      0.95      0.91        41

    accuracy                           0.78       144
   macro avg       0.76      0.76      0.76       144
weighted avg       0.78      0.78      0.78       144

Here are the numbers after tuning

print(classification_report(Y_test, svc_new.predict(X_test_scale)))

              precision    recall  f1-score   support

           0       0.90      0.97      0.94        39
           1       0.76      0.78      0.77        32
           2       0.79      0.72      0.75        32
           3       0.95      0.93      0.94        41

    accuracy                           0.86       144
   macro avg       0.85      0.85      0.85       144
weighted avg       0.86      0.86      0.86       144

Conclusion

In this blog, we explored various machine learning models to classify cars based on their parameters. We performed data cleaning, exploratory data analysis, data splitting and scaling, and built KNN, SVM, RFC, and GBC models. We evaluated each model’s accuracy and F1 score and visualized their confusion matrices.

Finally, we tuned the hyperparameters of the best model using GridSearchCV and built a new model using the best parameters. Machine learning can significantly improve car classification and help the automotive industry classify cars more accurately and efficiently.
