Vishaal Grizzly
Mar 17, 2023

Predicting car prices is an important problem that the automobile industry faces. In this project, we will be working with a dataset containing information about different cars and their prices. Our task is to develop a machine learning model that can accurately predict the prices of cars. We will be using a regression approach since the target variable (car prices) is continuous.

Photo by Alessio Lin on Unsplash

Problem statement

The goal of this project is to develop a machine learning model that can accurately predict the prices of cars based on different attributes. We will use the ‘cars_price.csv’ dataset, which contains information about 205 cars, including their make, fuel type, engine size, horsepower, and other features.

Our aim is to develop a model that can predict car prices with the highest possible accuracy. We will start with a linear regression model and later compare it against tree-based ensembles. Why linear regression? It may be one of the oldest and most basic algorithms in machine learning, but it is still a powerful way to model a continuous target, and it is a natural starting point if you are new to machine learning.
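
Concretely, linear regression models the target as a weighted sum of the input features and learns the weights that minimise the squared error. For this dataset, the fitted model takes the form (the feature names below are only illustrative):

price ≈ w0 + w1 * horsepower + w2 * engine-size + ... + wn * highway-mpg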

Moving on to the approach of the project, we begin by importing the necessary packages.

import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)
import matplotlib.pyplot as plt
import matplotlib as mpl
from prettytable import PrettyTable
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

Then we move on to loading the data stored on our local disk.

Data Loading

auto = pd.read_csv("C:/Users/Vishaal Grizzly/Downloads/cars_price.csv")

Checking the dimensions of the data imported

auto.shape
(205, 26)

Checking head and tail of the data

auto.head()
auto.tail()

Data Cleaning

Replacing ‘?’ with NaN

df = auto.replace('?', np.nan)
df
df.describe()

Checking for the null values present in the data

df.isnull().sum()
symboling 0
normalized-losses 41
make 0
fuel-type 0
aspiration 0
num-of-doors 2
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4
dtype: int64

Checking for duplicates in the data

print(df.loc[df.duplicated()].shape)
(0, 26)
df = df.drop_duplicates()
df.shape
(205, 26)

Listing the data types present in each column

df.dtypes
symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object

Replacing null values wherever needed

Dealing with values in the normalized-losses column

n_l_data = df[df['normalized-losses']!= '?'] 
n_l_data['normalized-losses']
0 NaN
1 NaN
2 NaN
3 164
4 164
...
200 95
201 95
202 95
203 95
204 95
Name: normalized-losses, Length: 205, dtype: object
mean = n_l_data['normalized-losses'].astype(float).mean()
df['normalized-losses'] = df['normalized-losses'].replace('?', mean).fillna(mean).astype(int)
#df=df.drop(columns="normalized-losses")
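
The same replace-with-the-mean imputation is repeated for several numeric columns below. As a hedged aside (not part of the original notebook), the pattern could be wrapped in a small helper; fill_with_mean is a hypothetical name, and the sketch assumes the df and imports defined above:

def fill_with_mean(frame, column, as_type=float):
    # Treat '?' as missing, compute the column mean (NaNs are ignored), then impute and cast
    values = frame[column].replace('?', np.nan).astype(float)
    return values.fillna(values.mean()).astype(as_type)

# e.g. df['normalized-losses'] = fill_with_mean(df, 'normalized-losses', int)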

Dealing with values in the price column

price_data = df[df['price']!= '?']
price_data['price']
0 13495
1 16500
2 16500
3 13950
4 17450
...
200 16845
201 19045
202 21485
203 22470
204 22625
Name: price, Length: 205, dtype: object
mean = price_data['price'].astype(float).mean()
df['price'] = df['price'].replace('?', mean).fillna(mean).astype(int)

Dealing with values in the horsepower column

hp_data = df[df['horsepower'] != '?']
hp_data['horsepower']
0 111
1 111
2 154
3 102
4 115
...
200 114
201 160
202 134
203 106
204 114
Name: horsepower, Length: 205, dtype: object
mean = hp_data['horsepower'].astype(float).mean()
df['horsepower'] = df['horsepower'].replace('?', mean).fillna(mean).astype(int)

Dealing with values in the stroke column

stroke_data = df[df['stroke'] != '?']
stroke_data['stroke']
0 2.68
1 2.68
2 3.47
3 3.4
4 3.4
...
200 3.15
201 3.15
202 2.87
203 3.4
204 3.15
Name: stroke, Length: 205, dtype: object
mean = stroke_data['stroke'].astype(float).mean()
df['stroke'] = df['stroke'].replace('?', mean).fillna(mean).astype(float)

Dealing with values in the peak-rpm column

peak_rpm_data = df[df['peak-rpm']!='?']
peak_rpm_data['peak-rpm']
0 5000
1 5000
2 5000
3 5500
4 5500
...
200 5400
201 5300
202 5500
203 4800
204 5400
Name: peak-rpm, Length: 205, dtype: object
mean = peak_rpm_data['peak-rpm'].astype(float).mean()
df['peak-rpm'] = df['peak-rpm'].replace('?', mean).fillna(mean).astype(float)

Dealing with values in the bore column

bore_data = df[df['bore'] != '?']
bore_data['bore']
0 3.47
1 3.47
2 2.68
3 3.19
4 3.19
...
200 3.78
201 3.78
202 3.58
203 3.01
204 3.78
Name: bore, Length: 205, dtype: object
mean = bore_data['bore'].astype(float).mean()
df['bore'] = df['bore'].replace('?', mean).fillna(mean).astype(float)

Dealing with values in the num-of-doors column. We will replace the missing values with ‘four’, as most cars are likely to have four doors.

df['num-of-doors'] = df['num-of-doors'].replace('?', 'four') 
df
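
Since ‘?’ was already converted to NaN during cleaning, the replace call above has no effect here; the remaining nulls are filled further down. As a hedged alternative (assuming the same df), the imputation could use the column's mode instead of a hard-coded value:

most_common = df['num-of-doors'].mode()[0]  # 'four' for this dataset
df['num-of-doors'] = df['num-of-doors'].fillna(most_common)
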
Calling df.describe without parentheses returns the bound method rather than invoking it, so pandas prints the underlying frame instead of the summary statistics:

df.describe
<bound method NDFrame.describe of      symboling  normalized-losses         make fuel-type aspiration  \
0 3 122 alfa-romero gas std
1 3 122 alfa-romero gas std
2 1 122 alfa-romero gas std
3 2 164 audi gas std
4 2 164 audi gas std
.. ... ... ... ... ...
200 -1 95 volvo gas std
201 -1 95 volvo gas turbo
202 -1 95 volvo gas std
203 -1 95 volvo diesel turbo
204 -1 95 volvo gas turbo

num-of-doors body-style drive-wheels engine-location wheel-base ... \
0 two convertible rwd front 88.6 ...
1 two convertible rwd front 88.6 ...
2 two hatchback rwd front 94.5 ...
3 four sedan fwd front 99.8 ...
4 four sedan 4wd front 99.4 ...
.. ... ... ... ... ... ...
200 four sedan rwd front 109.1 ...
201 four sedan rwd front 109.1 ...
202 four sedan rwd front 109.1 ...
203 four sedan rwd front 109.1 ...
204 four sedan rwd front 109.1 ...

engine-size fuel-system bore stroke compression-ratio horsepower \
0 130 mpfi 3.47 2.68 9.0 111
1 130 mpfi 3.47 2.68 9.0 111
2 152 mpfi 2.68 3.47 9.0 154
3 109 mpfi 3.19 3.40 10.0 102
4 136 mpfi 3.19 3.40 8.0 115
.. ... ... ... ... ... ...
200 141 mpfi 3.78 3.15 9.5 114
201 141 mpfi 3.78 3.15 8.7 160
202 173 mpfi 3.58 2.87 8.8 134
203 145 idi 3.01 3.40 23.0 106
204 141 mpfi 3.78 3.15 9.5 114

peak-rpm city-mpg highway-mpg price
0 5000.0 21 27 13495
1 5000.0 21 27 16500
2 5000.0 19 26 16500
3 5500.0 24 30 13950
4 5500.0 18 22 17450
.. ... ... ... ...
200 5400.0 23 28 16845
201 5300.0 19 25 19045
202 5500.0 18 23 21485
203 4800.0 26 27 22470
204 5400.0 19 25 22625

[205 rows x 26 columns]>

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 205 entries, 0 to 204
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 205 non-null int64
1 normalized-losses 205 non-null int32
2 make 205 non-null object
3 fuel-type 205 non-null object
4 aspiration 205 non-null object
5 num-of-doors 203 non-null object
6 body-style 205 non-null object
7 drive-wheels 205 non-null object
8 engine-location 205 non-null object
9 wheel-base 205 non-null float64
10 length 205 non-null float64
11 width 205 non-null float64
12 height 205 non-null float64
13 curb-weight 205 non-null int64
14 engine-type 205 non-null object
15 num-of-cylinders 205 non-null object
16 engine-size 205 non-null int64
17 fuel-system 205 non-null object
18 bore 205 non-null float64
19 stroke 205 non-null float64
20 compression-ratio 205 non-null float64
21 horsepower 205 non-null int32
22 peak-rpm 205 non-null float64
23 city-mpg 205 non-null int64
24 highway-mpg 205 non-null int64
25 price 205 non-null int32
dtypes: float64(8), int32(3), int64(5), object(10)
memory usage: 40.8+ KB

Several columns are still of object type. We can map the binary ones (fuel-type, aspiration, num-of-doors and engine-location) to ‘0’ and ‘1’.

#Fuel column 
df['fuel-type'].value_counts()
gas 185
diesel 20
Name: fuel-type, dtype: int64
df['fuel-type'] = df['fuel-type'].map({'diesel': 0, 'gas': 1})
df['fuel-type'] = df['fuel-type'].astype('int64')
df['fuel-type'].value_counts()
1 185
0 20
Name: fuel-type, dtype: int64
#Aspiration column
df['aspiration'].value_counts()
std 168
turbo 37
Name: aspiration, dtype: int64
df['aspiration'] = df['aspiration'].map({'turbo': 0, 'std': 1})
df['aspiration'] = df['aspiration'].astype('int64')
df['aspiration'].value_counts()
1 168
0 37
Name: aspiration, dtype: int64
#Num-of-doors column
df['num-of-doors'].value_counts()
four 114
two 89
Name: num-of-doors, dtype: int64
df['num-of-doors'].isnull()
0 False
1 False
2 False
3 False
4 False
...
200 False
201 False
202 False
203 False
204 False
Name: num-of-doors, Length: 205, dtype: bool
df['num-of-doors'].unique()
array(['two', 'four', nan], dtype=object)
num_of_na = df['num-of-doors'].isna().sum()
num_of_na
2
mask = df['num-of-doors'].isna()
result = df[mask]
result
df['num-of-doors'] = df['num-of-doors'].fillna('four')
df['num-of-doors'] = df['num-of-doors'].map({'two': 0, 'four': 1})
df['num-of-doors'] = df['num-of-doors'].fillna(0).astype('int64')
df['num-of-doors'].value_counts()
1 116
0 89
Name: num-of-doors, dtype: int64
#engine-location column
df['engine-location'].value_counts()
front 202
rear 3
Name: engine-location, dtype: int64
df['engine-location'] = df['engine-location'].map({'rear': 0, 'front': 1})
#Price column
df['price'].value_counts()
13207 4
8921 2
18150 2
8845 2
8495 2
..
45400 1
16503 1
5389 1
6189 1
22625 1
Name: price, Length: 187, dtype: int64
df['price'].dtypes
dtype('int32')
df['price'][0]
13495
#Body-style
df = df.drop(columns='body-style')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 205 entries, 0 to 204
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 205 non-null int64
1 normalized-losses 205 non-null int32
2 make 205 non-null object
3 fuel-type 205 non-null int64
4 aspiration 205 non-null int64
5 num-of-doors 205 non-null int64
6 drive-wheels 205 non-null object
7 engine-location 205 non-null int64
8 wheel-base 205 non-null float64
9 length 205 non-null float64
10 width 205 non-null float64
11 height 205 non-null float64
12 curb-weight 205 non-null int64
13 engine-type 205 non-null object
14 num-of-cylinders 205 non-null object
15 engine-size 205 non-null int64
16 fuel-system 205 non-null object
17 bore 205 non-null float64
18 stroke 205 non-null float64
19 compression-ratio 205 non-null float64
20 horsepower 205 non-null int32
21 peak-rpm 205 non-null float64
22 city-mpg 205 non-null int64
23 highway-mpg 205 non-null int64
24 price 205 non-null int32
dtypes: float64(8), int32(3), int64(9), object(5)
memory usage: 47.3+ KB
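
For reference, the four manual mapping steps above could be written more compactly with a single loop; a hedged sketch, assuming the same df with the missing num-of-doors values already filled:

binary_maps = {
    'fuel-type': {'diesel': 0, 'gas': 1},
    'aspiration': {'turbo': 0, 'std': 1},
    'num-of-doors': {'two': 0, 'four': 1},
    'engine-location': {'rear': 0, 'front': 1},
}
for col, mapping in binary_maps.items():
    # Map each binary category to 0/1 and cast to integer
    df[col] = df[col].map(mapping).astype('int64')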

Checking the number of unique values in each column to perform one-hot encoding

for col in df:
    print(col, df[col].unique())
symboling [ 3 1 2 0 -1 -2]
normalized-losses [122 164 158 192 188 121 98 81 118 148 110 145 137 101 78 106 85 107
104 113 150 129 115 93 142 161 153 125 128 103 168 108 194 231 119 154
74 186 83 102 89 87 77 91 134 65 197 90 94 256 95]
make ['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
'plymouth' 'porsche' 'renault' 'saab' 'subaru' 'toyota' 'volkswagen'
'volvo']
fuel-type [1 0]
aspiration [1 0]
num-of-doors [0 1]
drive-wheels ['rwd' 'fwd' '4wd']
engine-location [1 0]
wheel-base [ 88.6 94.5 99.8 99.4 105.8 99.5 101.2 103.5 110. 88.4 93.7 103.3
95.9 86.6 96.5 94.3 96. 113. 102. 93.1 95.3 98.8 104.9 106.7
115.6 96.6 120.9 112. 102.7 93. 96.3 95.1 97.2 100.4 91.3 99.2
107.9 114.2 108. 89.5 98.4 96.1 99.1 93.3 97. 96.9 95.7 102.4
102.9 104.5 97.3 104.3 109.1]
length [168.8 171.2 176.6 177.3 192.7 178.2 176.8 189. 193.8 197. 141.1 155.9
158.8 157.3 174.6 173.2 144.6 150. 163.4 157.1 167.5 175.4 169.1 170.7
172.6 199.6 191.7 159.1 166.8 169. 177.8 175. 190.9 187.5 202.6 180.3
208.1 199.2 178.4 173. 172.4 165.3 170.2 165.6 162.4 173.4 181.7 184.6
178.5 186.7 198.9 167.3 168.9 175.7 181.5 186.6 156.9 157.9 172. 173.5
173.6 158.7 169.7 166.3 168.7 176.2 175.6 183.5 187.8 171.7 159.3 165.7
180.2 183.1 188.8]
width [64.1 65.5 66.2 66.4 66.3 71.4 67.9 64.8 66.9 70.9 60.3 63.6 63.8 64.6
63.9 64. 65.2 62.5 66. 61.8 69.6 70.6 64.2 65.7 66.5 66.1 70.3 71.7
70.5 72. 68. 64.4 65.4 68.4 68.3 65. 72.3 66.6 63.4 65.6 67.7 67.2
68.9 68.8]
height [48.8 52.4 54.3 53.1 55.7 55.9 52. 53.7 56.3 53.2 50.8 50.6 59.8 50.2
52.6 54.5 58.3 53.3 54.1 51. 53.5 51.4 52.8 47.8 49.6 55.5 54.4 56.5
58.7 54.9 56.7 55.4 54.8 49.4 51.6 54.7 55.1 56.1 49.7 56. 50.5 55.2
52.5 53. 59.1 53.9 55.6 56.2 57.5]
curb-weight [2548 2823 2337 2824 2507 2844 2954 3086 3053 2395 2710 2765 3055 3230
3380 3505 1488 1874 1909 1876 2128 1967 1989 2191 2535 2811 1713 1819
1837 1940 1956 2010 2024 2236 2289 2304 2372 2465 2293 2734 4066 3950
1890 1900 1905 1945 1950 2380 2385 2500 2410 2443 2425 2670 2700 3515
3750 3495 3770 3740 3685 3900 3715 2910 1918 1944 2004 2145 2370 2328
2833 2921 2926 2365 2405 2403 1889 2017 1938 1951 2028 1971 2037 2008
2324 2302 3095 3296 3060 3071 3139 3020 3197 3430 3075 3252 3285 3485
3130 2818 2778 2756 2800 3366 2579 2460 2658 2695 2707 2758 2808 2847
2050 2120 2240 2190 2340 2510 2290 2455 2420 2650 1985 2040 2015 2280
3110 2081 2109 2275 2094 2122 2140 2169 2204 2265 2300 2540 2536 2551
2679 2714 2975 2326 2480 2414 2458 2976 3016 3131 3151 2261 2209 2264
2212 2319 2254 2221 2661 2563 2912 3034 2935 3042 3045 3157 2952 3049
3012 3217 3062]
engine-type ['dohc' 'ohcv' 'ohc' 'l' 'rotor' 'ohcf' 'dohcv']
num-of-cylinders ['four' 'six' 'five' 'three' 'twelve' 'two' 'eight']
engine-size [130 152 109 136 131 108 164 209 61 90 98 122 156 92 79 110 111 119
258 326 91 70 80 140 134 183 234 308 304 97 103 120 181 151 194 203
132 121 146 171 161 141 173 145]
fuel-system ['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' '4bbl' 'idi' 'spdi']
bore [3.47 2.68 3.19 3.13 3.5 3.31
3.62 2.91 3.03 2.97 3.34 3.6
2.92 3.15 3.43 3.63 3.54 3.08
3.32975124 3.39 3.76 3.58 3.46 3.8
3.78 3.17 3.35 3.59 2.99 3.33
3.7 3.61 3.94 3.74 2.54 3.05
3.27 3.24 3.01 ]
stroke [2.68 3.47 3.4 2.8 3.19 3.39
3.03 3.11 3.23 3.46 3.9 3.41
3.07 3.58 4.17 2.76 3.15 3.25542289
3.16 3.64 3.1 3.35 3.12 3.86
3.29 3.27 3.52 2.19 3.21 2.9
2.07 2.36 2.64 3.08 3.5 3.54
2.87 ]
compression-ratio [ 9. 10. 8. 8.5 8.3 7. 8.8 9.5 9.6 9.41 9.4 7.6
9.2 10.1 9.1 8.1 11.5 8.6 22.7 22. 21.5 7.5 21.9 7.8
8.4 21. 8.7 9.31 9.3 7.7 22.5 23. ]
horsepower [ 111 154 102 115 110 140 160 101 121 182 48 70
68 88 145 58 76 60 86 100 78 90 176 262
135 84 64 120 72 123 155 184 175 116 69 55
97 152 200 95 142 143 207 288 13207 73 82 94
62 56 112 92 161 156 52 85 114 162 134 106]
peak-rpm [5000. 5500. 5800. 4250. 5400.
5100. 4800. 6000. 4750. 4650.
4200. 4350. 4500. 5200. 4150.
5600. 5900. 5750. 5125.36945813 5250.
4900. 4400. 6600. 5300. ]
city-mpg [21 19 24 18 17 16 23 20 15 47 38 37 31 49 30 27 25 13 26 36 22 14 45 28
32 35 34 29 33]
highway-mpg [27 26 30 22 25 20 29 28 53 43 41 38 24 54 42 34 33 31 19 17 23 32 39 18
16 37 50 36 47 46]
price [13495 16500 13950 17450 15250 17710 18920 23875 13207 16430 16925 20970
21105 24565 30760 41315 36880 5151 6295 6575 5572 6377 7957 6229
6692 7609 8558 8921 12964 6479 6855 5399 6529 7129 7295 7895
9095 8845 10295 12945 10345 6785 11048 32250 35550 36000 5195 6095
6795 6695 7395 10945 11845 13645 15645 8495 10595 10245 10795 11245
18280 18344 25552 28248 28176 31600 34184 35056 40960 45400 16503 5389
6189 6669 7689 9959 8499 12629 14869 14489 6989 8189 9279 5499
7099 6649 6849 7349 7299 7799 7499 7999 8249 8949 9549 13499
14399 17199 19699 18399 11900 13200 12440 13860 15580 16900 16695 17075
16630 17950 18150 12764 22018 32528 34028 37028 9295 9895 11850 12170
15040 15510 18620 5118 7053 7603 7126 7775 9960 9233 11259 7463
10198 8013 11694 5348 6338 6488 6918 7898 8778 6938 7198 7788
7738 8358 9258 8058 8238 9298 9538 8449 9639 9989 11199 11549
17669 8948 10698 9988 10898 11248 16558 15998 15690 15750 7975 7995
8195 9495 9995 11595 9980 13295 13845 12290 12940 13415 15985 16515
18420 18950 16845 19045 21485 22470 22625]
one_hot_encoding = pd.get_dummies(df, columns=['drive-wheels', 'engine-type', 'num-of-cylinders', 'fuel-system', 'make'])
df_final = one_hot_encoding.reset_index(drop=True)
df_final
df_final.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 67 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 205 non-null int64
1 normalized-losses 205 non-null int32
2 fuel-type 205 non-null int64
3 aspiration 205 non-null int64
4 num-of-doors 205 non-null int64
5 engine-location 205 non-null int64
6 wheel-base 205 non-null float64
7 length 205 non-null float64
8 width 205 non-null float64
9 height 205 non-null float64
10 curb-weight 205 non-null int64
11 engine-size 205 non-null int64
12 bore 205 non-null float64
13 stroke 205 non-null float64
14 compression-ratio 205 non-null float64
15 horsepower 205 non-null int32
16 peak-rpm 205 non-null float64
17 city-mpg 205 non-null int64
18 highway-mpg 205 non-null int64
19 price 205 non-null int32
20 drive-wheels_4wd 205 non-null uint8
21 drive-wheels_fwd 205 non-null uint8
22 drive-wheels_rwd 205 non-null uint8
23 engine-type_dohc 205 non-null uint8
24 engine-type_dohcv 205 non-null uint8
25 engine-type_l 205 non-null uint8
26 engine-type_ohc 205 non-null uint8
27 engine-type_ohcf 205 non-null uint8
28 engine-type_ohcv 205 non-null uint8
29 engine-type_rotor 205 non-null uint8
30 num-of-cylinders_eight 205 non-null uint8
31 num-of-cylinders_five 205 non-null uint8
32 num-of-cylinders_four 205 non-null uint8
33 num-of-cylinders_six 205 non-null uint8
34 num-of-cylinders_three 205 non-null uint8
35 num-of-cylinders_twelve 205 non-null uint8
36 num-of-cylinders_two 205 non-null uint8
37 fuel-system_1bbl 205 non-null uint8
38 fuel-system_2bbl 205 non-null uint8
39 fuel-system_4bbl 205 non-null uint8
40 fuel-system_idi 205 non-null uint8
41 fuel-system_mfi 205 non-null uint8
42 fuel-system_mpfi 205 non-null uint8
43 fuel-system_spdi 205 non-null uint8
44 fuel-system_spfi 205 non-null uint8
45 make_alfa-romero 205 non-null uint8
46 make_audi 205 non-null uint8
47 make_bmw 205 non-null uint8
48 make_chevrolet 205 non-null uint8
49 make_dodge 205 non-null uint8
50 make_honda 205 non-null uint8
51 make_isuzu 205 non-null uint8
52 make_jaguar 205 non-null uint8
53 make_mazda 205 non-null uint8
54 make_mercedes-benz 205 non-null uint8
55 make_mercury 205 non-null uint8
56 make_mitsubishi 205 non-null uint8
57 make_nissan 205 non-null uint8
58 make_peugot 205 non-null uint8
59 make_plymouth 205 non-null uint8
60 make_porsche 205 non-null uint8
61 make_renault 205 non-null uint8
62 make_saab 205 non-null uint8
63 make_subaru 205 non-null uint8
64 make_toyota 205 non-null uint8
65 make_volkswagen 205 non-null uint8
66 make_volvo 205 non-null uint8
dtypes: float64(8), int32(3), int64(9), uint8(47)
memory usage: 39.2 KB
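
A hedged caveat on the encoding: keeping every dummy column makes each one-hot group perfectly collinear (the “dummy variable trap”), which can destabilise plain least squares. Passing drop_first=True to get_dummies is a common remedy; a minimal sketch, assuming the same df:

one_hot_encoding = pd.get_dummies(
    df,
    columns=['drive-wheels', 'engine-type', 'num-of-cylinders', 'fuel-system', 'make'],
    drop_first=True,  # drop one dummy per category to avoid perfect collinearity
)
df_final = one_hot_encoding.reset_index(drop=True)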

Applying ML techniques

from sklearn.model_selection import train_test_split
X = df_final.drop(columns='price')
Y = df_final.price
print (X.shape)
print (Y.shape)
(205, 66)
(205,)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
#Scaling data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print (X_train.shape)
print (X_test.shape)
(164, 66)
(41, 66)
#Building the Linear Regression model
from sklearn.linear_model import LinearRegression
Lin = LinearRegression()
Lin.fit(X_train, Y_train)
LinearRegression()
Y_pred = Lin.predict(X_test)
from sklearn.metrics import r2_score
r2_score(Y_test, Y_pred)
-3.861645124236883e+22
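
The hugely negative R2 indicates that plain least squares blew up on these features, most likely because of the collinear dummy columns noted earlier. As a hedged sanity check (not part of the original notebook), an L2-regularised linear model can be tried on the same split; a minimal sketch using the X_train, X_test, Y_train and Y_test defined above:

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)  # small L2 penalty to stabilise the collinear dummy features
ridge.fit(X_train, Y_train)
print(r2_score(Y_test, ridge.predict(X_test)))  # expected to be far more reasonable than plain OLS
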
#Gradient boosting regressor
from sklearn.ensemble import GradientBoostingRegressor
gbr= GradientBoostingRegressor(random_state=0)
gbr.fit(X_train,Y_train)
GradientBoostingRegressor(random_state=0)
Y_pred_gbr = gbr.predict(X_test)
r2_score(Y_test,Y_pred_gbr)
0.9241628097439228
#Random forest regressor
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(random_state=0)
regr.fit(X_train, Y_train)
RandomForestRegressor(random_state=0)
Y_pred_rf = regr.predict(X_test)
r2_score(Y_test,Y_pred_rf)
0.94359287053183
n_estimators = [5, 20, 50, 100] # number of trees in the random forest
max_features = [ 'sqrt'] # number of features in consideration at every split
max_depth = [2,4,6,8,10,12] # maximum number of levels allowed in each decision tree
min_samples_split = [2, 6, 10] # minimum sample number to split a node
min_samples_leaf = [1, 3, 4] # minimum sample number that can be stored in a leaf

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(RandomForestRegressor(),
                           param_grid=random_grid)
grid_search.fit(X_train, Y_train)
print(grid_search.best_estimator_)
RandomForestRegressor(max_depth=12, max_features='sqrt', n_estimators=20)
#Random forest regressor refit with manually chosen parameters (these differ from the best estimator found above)
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(random_state=0, bootstrap=False, max_depth=6, max_features='sqrt', min_samples_split=6, n_estimators=20)
regr.fit(X_train, Y_train)

Y_pred_rf= regr.predict(X_test)
r2_score(Y_test,Y_pred_rf)
0.871234128824367
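
For comparison, the model selected by the grid search can also be evaluated directly through best_estimator_ instead of re-typing its parameters; a minimal sketch, assuming the grid_search object fitted above:

best_rf = grid_search.best_estimator_  # GridSearchCV refits the best parameters on the full training split
Y_pred_best = best_rf.predict(X_test)
print(r2_score(Y_test, Y_pred_best))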

Random forest regressor (default parameters)

from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(random_state=0)
regr.fit(X_train, Y_train)
Y_pred_rf= regr.predict(X_test)
r2_score(Y_test,Y_pred_rf)
0.94359287053183

Importing mean absolute error and mean squared error from sklearn metrics

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
print('mae',mean_absolute_error(Y_test,Y_pred_rf))
print('mse',mean_squared_error(Y_test,Y_pred_rf))
print('r2',r2_score(Y_test,Y_pred_rf))
mae 1407.029491869919
mse 4398181.163008079
r2 0.94359287053183
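
The PrettyTable import at the top of the notebook is never used; as a hedged illustration, it could collect the scores reported above into a small summary table (the values are taken from the runs shown in this post):

results = PrettyTable()
results.field_names = ["Model", "R2 on the test set"]
results.add_row(["Linear regression", "-3.86e+22"])
results.add_row(["Gradient boosting", "0.924"])
results.add_row(["Random forest (default)", "0.944"])
results.add_row(["Random forest (manual parameters)", "0.871"])
print(results)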

Final thoughts

After working through the cleaning, encoding and modelling steps, we have built and evaluated linear regression, gradient boosting and random forest models in Python. Comparing the scores, the random forest regressor produces the highest R2 (about 0.94) on the car dataset we used.

Here is a link to the full code on GitHub for better understanding.


Vishaal Grizzly

Aspiring Data scientist with an enthusiasm for marketing and love for writing