Car classification is a crucial task in the automotive industry. With the growing number of car models and manufacturers, classifying vehicles by hand from their measured parameters has become challenging; machine learning makes the task more efficient and accurate. In this blog, we will explore several machine learning models for classifying cars based on their parameters.
Data preprocessing
We begin by importing the required libraries and reading the car classification dataset with Pandas. The dataset contains information about various cars, such as their compactness, circularity, distance circularity, and skewness measures.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay,classification_report
Loading the dataset
df=pd.read_csv("C:/Users/Vishaal Grizzly/Downloads/cars_class.csv")
Checking head and tail of the data loaded
df.head()
# df.tail()
Data Cleaning
Next, we perform data cleaning by checking for missing values and dropping any duplicate rows. We also drop the ID column, as it won't serve much use in this scenario. Beyond that, the data is already clean, so no other actions are needed.
df.isnull().sum()

ID 0
Comp 0
Circ 0
D.Circ 0
Rad.Ra 0
Pr.Axis.Ra 0
Max.L.Ra 0
Scat.Ra 0
Elong 0
Pr.Axis.Rect 0
Max.L.Rect 0
Sc.Var.Maxis 0
Sc.Var.maxis 0
Ra.Gyr 0
Skew.Maxis 0
Skew.maxis 0
Kurt.maxis 0
Kurt.Maxis 0
Holl.Ra 0
Class 0
dtype: int64
Listing the columns in the dataset
df.columns

Index(['ID', 'Comp', 'Circ', 'D.Circ', 'Rad.Ra', 'Pr.Axis.Ra', 'Max.L.Ra',
'Scat.Ra', 'Elong', 'Pr.Axis.Rect', 'Max.L.Rect', 'Sc.Var.Maxis',
'Sc.Var.maxis', 'Ra.Gyr', 'Skew.Maxis', 'Skew.maxis', 'Kurt.maxis',
'Kurt.Maxis', 'Holl.Ra', 'Class'],
dtype='object')
Some column names come in similar-looking pairs (for example, Sc.Var.Maxis and Sc.Var.maxis), but they differ in case and are distinct features, so there are no duplicated columns to drop. We do drop any duplicate rows:
df = df.drop_duplicates()
The columns are unchanged after dropping duplicate rows:
df.columns

Index(['ID', 'Comp', 'Circ', 'D.Circ', 'Rad.Ra', 'Pr.Axis.Ra', 'Max.L.Ra',
'Scat.Ra', 'Elong', 'Pr.Axis.Rect', 'Max.L.Rect', 'Sc.Var.Maxis',
'Sc.Var.maxis', 'Ra.Gyr', 'Skew.Maxis', 'Skew.maxis', 'Kurt.maxis',
'Kurt.Maxis', 'Holl.Ra', 'Class'],
dtype='object')
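If we want to verify this explicitly, a quick check along these lines (a small sketch, not in the original notebook) distinguishes duplicated column names from duplicated rows:

# Confirm that no column names are literally duplicated, and count duplicate rows
print(df.columns.duplicated().any())   # False: the similar-looking names differ in case
print(df.duplicated().sum())           # number of fully duplicated rows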
Checking for the datatypes in each column
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 719 entries, 0 to 718
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 719 non-null int64
1 Comp 719 non-null int64
2 Circ 719 non-null int64
3 D.Circ 719 non-null int64
4 Rad.Ra 719 non-null int64
5 Pr.Axis.Ra 719 non-null int64
6 Max.L.Ra 719 non-null int64
7 Scat.Ra 719 non-null int64
8 Elong 719 non-null int64
9 Pr.Axis.Rect 719 non-null int64
10 Max.L.Rect 719 non-null int64
11 Sc.Var.Maxis 719 non-null int64
12 Sc.Var.maxis 719 non-null int64
13 Ra.Gyr 719 non-null int64
14 Skew.Maxis 719 non-null int64
15 Skew.maxis 719 non-null int64
16 Kurt.maxis 719 non-null int64
17 Kurt.Maxis 719 non-null int64
18 Holl.Ra 719 non-null int64
19 Class 719 non-null int64
dtypes: int64(20)
memory usage: 118.0 KB
We can drop the ID column as it won’t serve much use in this scenario
df = df.drop(columns='ID')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 719 entries, 0 to 718
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Comp 719 non-null int64
1 Circ 719 non-null int64
2 D.Circ 719 non-null int64
3 Rad.Ra 719 non-null int64
4 Pr.Axis.Ra 719 non-null int64
5 Max.L.Ra 719 non-null int64
6 Scat.Ra 719 non-null int64
7 Elong 719 non-null int64
8 Pr.Axis.Rect 719 non-null int64
9 Max.L.Rect 719 non-null int64
10 Sc.Var.Maxis 719 non-null int64
11 Sc.Var.maxis 719 non-null int64
12 Ra.Gyr 719 non-null int64
13 Skew.Maxis 719 non-null int64
14 Skew.maxis 719 non-null int64
15 Kurt.maxis 719 non-null int64
16 Kurt.Maxis 719 non-null int64
17 Holl.Ra 719 non-null int64
18 Class 719 non-null int64
dtypes: int64(19)
memory usage: 112.3 KB
As the data is already clean, no further cleaning steps are needed.
Exploratory Data Analysis
We then perform exploratory data analysis (EDA) to get insights into the data. We use Seaborn and Matplotlib libraries to create scatterplots and histograms to visualize the relationships between the different car parameters and their classes. We also check the class column’s value counts and describe the dataset’s statistical summary. We notice that there are some outliers in the dataset.
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Circ', y='D.Circ', hue='Class', data=df.head(100), s=200)
plt.title("Car Class Classificaton Visualization", y=1.015, fontsize=23)
plt.xlabel("Cirularity")
plt.ylabel("Distance Cirularity")
ax = plt.gca()
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Skew.Maxis', y='Skew.maxis', hue='Class', data=df.head(100), s=200)
plt.title("Car Class classificaton Data", y=1.015, fontsize=23)
plt.xlabel("Skew.Maxis")
plt.ylabel("Skew.maxis")
ax = plt.gca()
df['Class'].value_counts()

0    189
1    180
3    177
2    173
Name: Class, dtype: int64
The class column seems pretty balanced
df.describe()
plt.boxplot(df)
plt.show()
It seems like there are some outliers
sns.histplot(df)
plt.show()
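The boxplot above plots all of the raw features on one axis, so features with large values dominate and it is hard to see where the outliers actually are. As a rough, illustrative check (my own sketch, not part of the original analysis), the 1.5×IQR rule gives a per-feature outlier count:

# Count values outside 1.5*IQR per feature (illustrative only)
features = df.drop(columns='Class')
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1
outlier_counts = ((features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)).sum()
print(outlier_counts.sort_values(ascending=False))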
Data splitting
We split the data into training and testing datasets and assign the test size.
# Splitting the data for processing
X = df.drop(columns='Class')
Y = df.Class

X.shape    # (719, 18)
Y.shape    # (719,)

# Assigning the test size and splitting into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

X_train.shape   # (575, 18)
Y_train.shape   # (575,)
X_test.shape    # (144, 18)
Y_test.shape    # (144,)
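Since the four classes are roughly balanced, a plain random split is reasonable. If we wanted to guarantee that the split preserves the class proportions, train_test_split also accepts a stratify argument; the variation below is a sketch and is not used in the rest of this post:

# Optional variation: stratify on the label so train and test keep the class proportions
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y)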
Scaling the data
We then scale the data using the StandardScaler from Scikit-Learn to standardize the features. We plot boxplots and histograms to check the scaled data's distribution.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_scale = sc.fit_transform(X_train)
X_test_scale = sc.transform(X_test)

X_train_scale.shape    # (575, 18)
X_test_scale.shape     # (144, 18)
X_train_scale.mean()   # 1.5017509511837866e-16

# Plotting a box plot and histogram of the scaled features
plt.boxplot(X_train_scale)
plt.show()
sns.histplot(X_train_scale)
plt.show()
Building ML model
We then build various machine learning models to classify cars based on their parameters. We begin by building the K-Nearest Neighbors (KNN) model and then move on to the Support Vector Machine (SVM) model, Random Forest Classifier (RFC), and Gradient Boosting Classifier (GBC). We evaluate each model’s accuracy and F1 score and visualize their confusion matrices using the ConfusionMatrixDisplay from Scikit-Learn.
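Before walking through each model one by one, here is a compact comparison loop (my own sketch, using the scaled split from above; random_state is set only for reproducibility and is not used in the step-by-step sections below):

# Fit and score each candidate model on the same scaled split
models = {
    "KNN": KNeighborsClassifier(4),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train_scale, Y_train)
    pred = model.predict(X_test_scale)
    print(f"{name}: accuracy={model.score(X_test_scale, Y_test):.3f}, "
          f"macro F1={f1_score(Y_test, pred, average='macro'):.3f}")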
K-nearest neighbour
knn = KNeighborsClassifier(4)
knn.fit(X_train_scale, Y_train)
knn_score = knn.score(X_test_scale, Y_test)
print("The Accuracy level = ", knn_score, " ", "The f1_score is = ",
      f1_score(Y_test, knn.predict(X_test_scale), average='macro'))

The Accuracy level = 0.7222222222222222   The f1_score is = 0.6996933621933622
Confusion matrix
cm = confusion_matrix(Y_test,knn.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
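The choice of 4 neighbours above is somewhat arbitrary. A quick cross-validated sweep over k (a sketch, not part of the original post) would make the choice less ad hoc:

# Cross-validated macro F1 for different numbers of neighbours
from sklearn.model_selection import cross_val_score

for k in range(1, 11):
    scores = cross_val_score(KNeighborsClassifier(k), X_train_scale, Y_train,
                             cv=5, scoring='f1_macro')
    print(f"k={k}: mean macro F1 = {scores.mean():.3f}")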
SVM
Now let’s move on to the Support Vector Machine (SVM).
svc = SVC()
svc.fit(X_train_scale, Y_train)
svc_score = svc.score(X_test_scale, Y_test)
print("Accuracy:", svc_score, " ", "f1_score:",
      f1_score(Y_test, svc.predict(X_test_scale), average='macro'))

Accuracy: 0.7847222222222222   f1_score: 0.7621118774268613

cm = confusion_matrix(Y_test, svc.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
RFC
Now let’s move on to the Random Forest Classifier.
RND = RandomForestClassifier()
RND.fit(X_train_scale, Y_train)
RND_score = RND.score(X_test_scale, Y_test)
print("Accuracy:", RND_score, " ", "f1_score:",
      f1_score(Y_test, RND.predict(X_test_scale), average='macro'))

Accuracy: 0.7847222222222222   f1_score: 0.7654748024355486

cm = confusion_matrix(Y_test, RND.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
GBC
Now let’s move on to the Gradient Boosting Classifier.
GB = GradientBoostingClassifier()
GB.fit(X_train_scale, Y_train)
GB_score = GB.score(X_test_scale, Y_test)
print("Accuracy:", GB_score, " ", "f1_score:",
      f1_score(Y_test, GB.predict(X_test_scale), average='macro'))

Accuracy: 0.7916666666666666   f1_score: 0.7763471096804431

cm = confusion_matrix(Y_test, GB.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
Final step
Gradient boosting has given the best F1 score and confusion matrix so far, so we proceed to tune its hyperparameters.
Hyperparameter Tuning
We tune the Gradient Boosting Classifier’s hyperparameters using GridSearchCV from Scikit-Learn. We define a parameter grid with different learning rates, numbers of estimators, and maximum depths, and fit the grid search. We then extract the best parameters and estimator from GridSearchCV and build a new GBC model with them.
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier()
print(gb.get_params().keys())

dict_keys(['ccp_alpha', 'criterion', 'init', 'learning_rate', 'loss', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_iter_no_change', 'random_state', 'subsample', 'tol', 'validation_fraction', 'verbose', 'warm_start'])

from sklearn.model_selection import GridSearchCV
param_grid = {'learning_rate': [0.1, 0.05, 0.01],
'n_estimators': [100, 200, 300],
'max_depth': [3, 4, 5]}
grid = GridSearchCV(GradientBoostingClassifier(), param_grid, refit = True)
Fitting the model for grid search
grid.fit(X_train_scale, Y_train)

GridSearchCV(estimator=GradientBoostingClassifier(),
             param_grid={'learning_rate': [0.1, 0.05, 0.01],
                         'max_depth': [3, 4, 5],
                         'n_estimators': [100, 200, 300]})

grid.best_params_
{'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 300}

grid.best_estimator_
GradientBoostingClassifier(max_depth=4, n_estimators=300)

(Note that the model below uses a learning rate of 0.07 rather than the 0.1 found here; a finer search over the learning rate, shown further down, settles on 0.07.)

GBC_new = GradientBoostingClassifier(learning_rate=0.07, n_estimators=300, max_depth=4)
GBC_new.fit(X_train_scale, Y_train)
GBC_score = GBC_new.score(X_test_scale, Y_test)
print("Accuracy:", GBC_score)
print("f1_score:", f1_score(Y_test, GBC_new.predict(X_test_scale), average='macro'))

Accuracy: 0.7777777777777778
f1_score: 0.7566992168456375

print(classification_report(Y_test, GB.predict(X_test_scale)))   # without tuning

              precision    recall  f1-score   support

           0       0.97      0.95      0.96        39
           1       0.61      0.62      0.62        32
           2       0.61      0.62      0.62        32
           3       0.93      0.90      0.91        41

    accuracy                           0.79       144
   macro avg       0.78      0.78      0.78       144
weighted avg       0.80      0.79      0.79       144

print(classification_report(Y_test, GBC_new.predict(X_test_scale)))   # with tuning

              precision    recall  f1-score   support

           0       1.00      0.95      0.97        39
           1       0.54      0.59      0.57        32
           2       0.59      0.53      0.56        32
           3       0.91      0.95      0.93        41

    accuracy                           0.78       144
   macro avg       0.76      0.76      0.76       144
weighted avg       0.78      0.78      0.78       144
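One caveat: GridSearchCV optimises accuracy by default, while we have been reporting macro F1. If macro F1 is the metric we care about, we can ask the search to optimise it directly; a minimal sketch, reusing the parameter grid defined earlier:

# Search the same grid, but select the model by cross-validated macro F1
grid_f1 = GridSearchCV(GradientBoostingClassifier(), param_grid,
                       scoring='f1_macro', cv=5, refit=True)
grid_f1.fit(X_train_scale, Y_train)
print(grid_f1.best_params_, grid_f1.best_score_)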
Defining a refined parameter range
param_grid = {'learning_rate': [0.1, 0.05, 0.01, 0.09, 0.08, 0.07],
'n_estimators': [100, 200, 300, 400, 500]}
grid = GridSearchCV(GradientBoostingClassifier(), param_grid, refit=True)
Fitting the model for grid search
grid.fit(X_train_scale, Y_train)
# Print the best parameters after tuning
print(grid.best_params_)

# Print how our model looks after hyperparameter tuning
print(grid.best_estimator_)

{'learning_rate': 0.07, 'n_estimators': 300}
GradientBoostingClassifier(learning_rate=0.07, n_estimators=300)

Tuning the SVM

The SVM is also worth tuning. With an RBF kernel, C=1000, and gamma=0.01, it outperforms every model seen so far:

svc_new = SVC(C=1000, gamma=0.01, kernel='rbf')
svc_new.fit(X_train_scale, Y_train)
svc_score = svc_new.score(X_test_scale, Y_test)
print("Accuracy:", svc_score, " ", "f1_score:",
      f1_score(Y_test, svc_new.predict(X_test_scale), average='macro'))

Accuracy: 0.8611111111111112   f1_score: 0.8499680849407625

cm = confusion_matrix(Y_test, svc_new.predict(X_test_scale))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
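The C and gamma values above were set directly in this post. A grid search analogous to the one used for the GBC is one way to arrive at such values; a minimal sketch, with an illustrative parameter grid:

# Search over C and gamma for an RBF-kernel SVM (illustrative grid)
svm_grid = GridSearchCV(SVC(),
                        {'C': [1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001],
                         'kernel': ['rbf']},
                        refit=True)
svm_grid.fit(X_train_scale, Y_train)
print(svm_grid.best_params_)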
Here are the SVM’s numbers without tuning
print(classification_report(Y_test, svc.predict(X_test_scale)))

              precision    recall  f1-score   support

           0       0.93      0.95      0.94        39
           1       0.58      0.56      0.57        32
           2       0.68      0.59      0.63        32
           3       0.87      0.95      0.91        41

    accuracy                           0.78       144
   macro avg       0.76      0.76      0.76       144
weighted avg       0.78      0.78      0.78       144
Here are the numbers after tuning
print(classification_report(Y_test, svc_new.predict(X_test_scale)))

              precision    recall  f1-score   support

           0       0.90      0.97      0.94        39
           1       0.76      0.78      0.77        32
           2       0.79      0.72      0.75        32
           3       0.95      0.93      0.94        41

    accuracy                           0.86       144
   macro avg       0.85      0.85      0.85       144
weighted avg       0.86      0.86      0.86       144
Conclusion
In this blog, we explored various machine learning models to classify cars based on their parameters. We performed data cleaning, exploratory data analysis, data splitting and scaling, and built KNN, SVM, RFC, and GBC models. We evaluated each model’s accuracy and F1 score and visualized their confusion matrices.
Finally, we tuned hyperparameters using GridSearchCV. The tuned SVM (RBF kernel, C=1000, gamma=0.01) gave the best overall results, reaching about 86% accuracy and a macro F1 of 0.85 on the test set. Machine learning can significantly improve car classification and help the automotive industry classify cars more accurately and efficiently.