In this notebook, I'll explore predicting the auction sale price of a piece of heavy equipment in order to create a "blue book" for bulldozers. The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.
Fast Iron is creating a "blue book for bulldozers" so customers can value what their heavy equipment fleet is worth at auction.
How well can a prediction be made on the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers were sold for?
The data is downloaded from the Kaggle Bluebook for Bulldozers competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data
There are 3 main datasets:
- Train.csv is the training set, which contains data through the end of 2011.
- Valid.csv is the validation set, which contains data from January 1, 2012 to April 30, 2012.
- Test.csv is the test set, which contains data from May 1, 2012 to November 2012.
The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.
For more on the evaluation of this project check: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation
Note: The goal for most regression evaluation metrics is to minimize the error. For example, our goal for this project will be to build a machine learning model which minimizes RMSLE.
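For reference, the standard formula for RMSLE (the textbook definition of the metric Kaggle describes) is:

$$\text{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(\log(p_i + 1) - \log(a_i + 1)\big)^2}$$

where $p_i$ is the predicted price, $a_i$ is the actual price, and $n$ is the number of auction sales. The $+1$ inside the logarithms keeps the metric defined if a value is 0.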
Kaggle provides a data dictionary detailing all of the features of the dataset. You can view this data dictionary on Google Sheets: https://docs.google.com/spreadsheets/d/18ly-bLR8sbDJLITkWG7ozKm8l3RyieQ2Fpgix-beSYI/edit?usp=sharing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
# Import training and validation sets
# Working with time series data, we want to enrich the time & date components as much as possible. We can do this by telling pandas which columns to parse as dates (converting them to a datetime dtype) via the `parse_dates` parameter.
df = pd.read_csv("data/TrainAndValid.csv",
                 low_memory=False,
                 parse_dates=["saledate"])
df.sample(5)
df.info()
df.isna().sum()
df.columns
fig, ax = plt.subplots()
ax.scatter(df["saledate"][:1000], df["SalePrice"][:1000]);
df.SalePrice.plot.hist()
plt.xlabel('USD ($)');
sns.histplot(df["SalePrice"], kde=True, color="y");  # histplot supersedes the deprecated distplot
Working with time series data, it's a good idea to sort it by date.
# Sort DataFrame in date order
df.sort_values(by=["saledate"], inplace=True, ascending=True)
df.saledate.head(20)
df.head().T
Adding datetime parameters from the saledate column

df["saleYear"] = df.saledate.dt.year
df["saleMonth"] = df.saledate.dt.month
df["saleDay"] = df.saledate.dt.day
df["saleDayOfWeek"] = df.saledate.dt.dayofweek
df["saleDayOfYear"] = df.saledate.dt.dayofyear
# After enriching the DataFrame with date time features, let's remove the original 'saledate' column
df.drop("saledate", axis=1, inplace=True)
Following the changes to saledate and the addition of the datetime parameters, I'll make a copy of the original DataFrame to keep the raw data intact.
# Make a copy of the original DataFrame to perform edits on
df1 = df.copy()
df1.to_csv("data/df1.csv")
# Check the values of different columns
df1.state.value_counts()
df1.head().T
List of ALL columns with missing data
df1.isna().sum()
[col for col in df1.columns if df1[col].isnull().any()]
🔍 Check which columns contain strings.
Using df1.isna().sum() we can see the data has plenty of missing values that will need to be addressed before the modeling stage.
# These columns contain strings - a DataFrame consists of keys (columns) and values
for label, content in df1.items():  # label (column), content (values)
    if pd.api.types.is_string_dtype(content):
        print(label)  # prints all columns with a 'string' datatype
✅ Convert the list of string values into categorical (numerical)
# This will turn all of the string values into category values
for label, content in df1.items():
    if pd.api.types.is_string_dtype(content):
        df1[label] = content.astype("category").cat.as_ordered()
# Under the hood, Python now treats previously 'string' data as categories
df1.UsageBand.cat.codes, df1.state.cat.codes
# All (categorical + numerical) missing data in terms of percentages
df1.isnull().sum()/len(df1)
✅ Encode categorical columns (with binary missing-value indicators)
# Check columns which *aren't* numeric
for label, content in df1.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)
# Turn categorical variables into numbers
for label, content in df1.items():
    # Check columns which *aren't* numeric
    if not pd.api.types.is_numeric_dtype(content):
        # Add a binary column to indicate whether the sample had a missing value
        df1[label+"_is_missing"] = pd.isnull(content)
        # We add +1 because pandas encodes missing categories as -1,
        # so previously missing values become 0
        df1[label] = pd.Categorical(content).codes+1
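To see why the +1 matters, here's a quick toy example (hypothetical values, not from the dataset) of how pandas encodes missing categories:

# Toy example: pandas encodes NaN/None categories as -1
s = pd.Series(["low", None, "high"])
print(pd.Categorical(s).codes)      # [ 1 -1  0] (categories sorted: ['high', 'low'])
print(pd.Categorical(s).codes + 1)  # [ 2  0  1] -> missing values map to 0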
df1.head().T
df1.isna().sum()[:20]
# Sanity check: after the conversion, no non-numeric columns should remain
for label, content in df1.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)
✅ Categorical columns filled and encoded (with binary missing-value indicators). Let's now fill the missing values in the numerical columns with their median values.
# Make a copy of the original DataFrame to perform edits on
df2 = df1.copy()
df2.to_csv("data/df2.csv")
# List the numerical columns
for label, content in df2.items():
    if pd.api.types.is_numeric_dtype(content):
        print(label)
# Check which numeric columns have null values
for label, content in df2.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)
# Fill numeric rows with the median
for label, content in df2.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # Add a binary column which tells if the data was missing or not
            df2[label+"_is_missing"] = pd.isnull(content)
            # Fill missing numeric values with the median since it's more robust than the mean
            df2[label] = content.fillna(content.median())
Why add a binary column indicating whether the data was missing or not?
We can easily fill all of the missing numeric values in our dataset with the median. However, a numeric value may be missing for a reason. Adding a binary column which indicates whether the value was missing or not helps to retain this information. I also chose to fill missing numerical values with the median rather than the mean due to its robustness against outliers. For example, if values in either of the numerical columns ['auctioneerID'] or ['MachineHoursCurrentMeter'] contained extremely high/low values (outliers), the mean could become heavily skewed, while the median would still retain a representative value for imputation.
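A quick toy illustration of that robustness (hypothetical meter-hour readings, not from the dataset):

# Hypothetical meter-hour readings with one extreme outlier
hours = pd.Series([120, 150, 160, 175, 99999])
print(hours.mean())    # 20120.8 -> dragged up by the outlier
print(hours.median())  # 160.0   -> barely affected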
# Check if there are any null values - sanity check
for label, content in df2.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)
# If there are still any numeric columns with null values,
# it should print out those column names
No columns were printed, thus all NaN (missing) values have been sufficiently filled. Let's identify how many null numerical values were replaced by the median.
# Check to see how many examples were missing
df2.auctioneerID_is_missing.value_counts()
# Means it filled 20,136 previously null values with the median
✅ Fill missing values in all numerical columns (impute with median values)
♻️ All missing values (categorical + numerical) should now be adequately identified and filled.
df2.isna().sum()
# Make a copy of the original DataFrame to perform edits on
df3 = df2.copy()
df3.to_csv("data/df3.csv")
All of the data is now numeric and free of missing values. Since this is a regression problem, I like to start the modeling stage with RandomForestRegressor. Because this is also a time-series problem, the data needs to be split into training, validation, and test samples according to date.
According to the Kaggle data page, the validation set and test set are split according to dates (E.g. using past events to try and predict future events.)
In this case:
- The training set covers all sales up to and including 2011.
- The validation set covers sales from 2012 (January 1 to April 30, per the Kaggle split).
For more on making good training, validation and test sets, check out the post How (and why) to create a good validation set by Rachel Thomas.
df3.saleYear.value_counts()
# Split data into training and validation
df_val = df3[df3.saleYear == 2012]
df_train = df3[df3.saleYear != 2012]
len(df_val), len(df_train)
# Split data into X & y
X_train, y_train = df_train.drop("SalePrice", axis=1), df_train.SalePrice
X_valid, y_valid = df_val.drop("SalePrice", axis=1), df_val.SalePrice
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape
Instantiate the initial model, RandomForestRegressor. Due to the amount of data, processing time, and upcoming hyperparameter tuning, I'll be setting max_samples to 10,000, meaning each of the n_estimators trees (default 100) in RandomForestRegressor will only see 10,000 random samples from the DataFrame instead of the entire 401,125. Each tree therefore evaluates 40x fewer samples, which drastically increases computation speed, but since the model learns from fewer samples, the initial results shouldn't be expected to be optimal.
%%time
from sklearn.ensemble import RandomForestRegressor
# Instantiate model
model = RandomForestRegressor(n_jobs=-1, max_samples=10000)
# Fit the model
model.fit(X_train, y_train)
According to Kaggle, the evaluation function for the Bluebook for Bulldozers competition is root mean squared log error (RMSLE).
With RMSLE you care more about ratios than absolute differences: generally, being off by 10% matters more than being off by $10. MAE (mean absolute error), by contrast, is about exact differences.
I'll be taking the square root of Scikit-Learn's mean_squared_log_error (MSLE).
# Create evaluation function (root mean squared log error)
from sklearn.metrics import mean_squared_log_error, r2_score, mean_absolute_error

def rmsle(y_true, y_preds):
    """
    Calculates root mean squared log error between predictions and true labels.
    """
    return np.sqrt(mean_squared_log_error(y_true, y_preds))
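As a quick sanity check on the ratio intuition (toy numbers, not model output): a 10% miss contributes the same RMSLE whether a machine sells for $10,000 or $100,000, while MAE penalises the pricier machine 10x more.

# A 10% error on a cheap vs. an expensive machine: equal RMSLE, 10x different MAE
print(rmsle([10_000], [11_000]), rmsle([100_000], [110_000]))  # ~0.0953 each
print(mean_absolute_error([10_000], [11_000]),
      mean_absolute_error([100_000], [110_000]))               # 1000.0 vs 10000.0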
# Create function to evaluate our model on the coefficient of determination, MAE, and the Kaggle metric
def show_scores(model):
    """
    Evaluate how the model does on training and validation data.
    """
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_valid)  # model performing much better on training than validation = overfitting
    scores = {"Training MAE": round(mean_absolute_error(y_train, train_preds), 5),
              "Valid MAE": round(mean_absolute_error(y_valid, val_preds), 5),
              "Training RMSLE": round(rmsle(y_train, train_preds), 5),
              "Valid RMSLE": round(rmsle(y_valid, val_preds), 5),
              "Training R^2": round(r2_score(y_train, train_preds), 5),
              "Valid R^2": round(r2_score(y_valid, val_preds), 5)}
    return scores
show_scores(model)
%%time
from sklearn.model_selection import RandomizedSearchCV
# Different RandomForestRegressor hyperparameters
rf_grid = {"n_estimators": np.arange(10, 100, 10),
"max_depth": [None, 3, 5, 10],
"min_samples_split": np.arange(2, 20, 2),
"min_samples_leaf": np.arange(1, 20, 2),
"max_features": [0.5, 1, "sqrt", "auto"],
"max_samples": [10000]}
# Instantiate RandomizedSearchCV model
rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1,
                                                    random_state=42),
                              param_distributions=rf_grid,
                              n_iter=2,
                              cv=5,
                              verbose=True)
# Fit the RandomizedSearchCV model
rs_model.fit(X_train, y_train)
# Find the best model hyperparameters
rs_model.best_params_
rs_model.best_score_
# Evaluate the RandomizedSearch model
show_scores(rs_model)
Note: These hyperparameters were found after 100 iterations of RandomizedSearchCV.
%%time
# Ideal hyperparameters
ideal_model = RandomForestRegressor(n_estimators=40,
                                    min_samples_leaf=1,
                                    min_samples_split=14,
                                    max_features=0.5,
                                    n_jobs=-1,
                                    max_samples=None,
                                    random_state=42)  # random state so our results are reproducible
# Fit the ideal model
ideal_model.fit(X_train, y_train)
# Scores for ideal_model (trained on all the data)
show_scores(ideal_model)
# Scores on rs_model (only trained on ~10,000 examples)
show_scores(rs_model)
# Scores on initial `RandomForestRegressor()` model
show_scores(model)
XGBoost has become a widely used and really popular tool among Kaggle competitors and data scientists in industry, as it has been battle-tested in production on large-scale problems. It is a highly flexible and versatile tool that can work through most regression, classification, and ranking problems, as well as user-built objective functions.
# %%time
# from xgboost.sklearn import XGBRegressor
# # Instantiate model
# xg_reg = XGBRegressor(n_jobs=-1, random_state=42, learning_rate=0.15)
# # Fit the model
# xg_reg.fit(X_train, y_train)
%%time
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats as st
# Different XGBoost hyperparameters
xgb_grid = {
    "n_estimators": st.randint(3, 40),
    "max_depth": st.randint(3, 40),
    "learning_rate": st.uniform(0.05, 0.4),
    "colsample_bytree": st.beta(10, 1),
    "subsample": st.beta(10, 1),
    "gamma": st.uniform(0, 10),
    "reg_alpha": st.expon(0, 50),
    "min_child_weight": st.expon(0, 50),
}
# Instantiate RandomizedSearchCV model
xg_boost = RandomizedSearchCV(XGBRegressor(n_jobs=-1,
                                           random_state=42),
                              param_distributions=xgb_grid,
                              n_iter=2,
                              cv=5,
                              verbose=True)
# Fit the RandomizedSearchCV model
xg_boost.fit(X_train, y_train)
# show_scores(xg_boost)
xg_param = xg_boost.best_params_
print(xg_param)
xg_boost.best_score_
%%time
# Most ideal hyperparameters: unpack the best parameters found by RandomizedSearchCV
xgb_ideal = XGBRegressor(n_jobs=-1,
                         random_state=42,  # for reproducible results
                         **xg_param)
# Fit the ideal model
xgb_ideal.fit(X_train, y_train)
Comparing the evaluation metrics for the RandomForestRegressor() and XGBRegressor() models as they stand, they're quite comparable, specifically on 'Valid RMSLE', the Kaggle scoring metric. XGBoost, given ideal hyperparameters, should outperform most models. In this case, either model is suitable given the accuracy measures and the bulldozer price prediction problem definition. With a 'Valid RMSLE' of 0.24692, the model would place around the top 30 on the leaderboard (93.5th percentile).
# Scores for ideal_model
show_scores(ideal_model)
# Scores on xgboost_model
show_scores(xg_boost)
# Import the test data
df_test = pd.read_csv("data/Test.csv",
low_memory=False,
parse_dates=["saledate"])
df_test.head()
def preprocess_data(df):
    """
    Performs transformations on df and returns the transformed df.
    """
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    df["saleDayOfWeek"] = df.saledate.dt.dayofweek
    df["saleDayOfYear"] = df.saledate.dt.dayofyear
    df.drop("saledate", axis=1, inplace=True)

    for label, content in df.items():
        # Fill the numeric rows with the median
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                # Add a binary column which tells us if the data was missing or not
                df[label+"_is_missing"] = pd.isnull(content)
                # Fill missing numeric values with the median
                df[label] = content.fillna(content.median())
        # Fill categorical missing data and turn categories into numbers
        if not pd.api.types.is_numeric_dtype(content):
            df[label+"_is_missing"] = pd.isnull(content)
            # We add +1 to the category code because pandas encodes missing categories as -1
            df[label] = pd.Categorical(content).codes+1

    return df
# Process the test data
df_test = preprocess_data(df_test)
df_test.head()
# Making predictions on the test data initially raised:
# ValueError: Number of features of the model must match the input.
#             Model n_features is 102 and input n_features is 101
# test_preds = ideal_model.predict(df_test)

# Find how the columns differ using sets
set(X_train.columns) - set(df_test.columns)
# Manually adjust df_test to have auctioneerID_is_missing column
df_test["auctioneerID_is_missing"] = False
df_test.head()
💣 The test and train DataFrames now match in features, so predictions on the test set can be made.
# Make predictions on the test data
test_preds = ideal_model.predict(df_test)
test_preds
Predictions look good but they're not in the same format Kaggle is asking for: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation
# Format predictions into the Kaggle format
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalesPrice"] = test_preds
df_preds
# Export prediction data for Kaggle submission
df_preds.to_csv("data/test_predictions.csv", index=False)
import pickle
# Save the ideal_model `RandomForestRegressor` model to file
pickle.dump(ideal_model, open("rs_RandomForestRegressor.pkl", "wb"))
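As a quick sanity check (assuming the file saved above), the pickled model can be reloaded and should reproduce the same predictions:

# Reload the saved model and confirm it predicts the same values as before
loaded_model = pickle.load(open("rs_RandomForestRegressor.pkl", "rb"))
np.allclose(loaded_model.predict(df_test), test_preds)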
Feature importance seeks to figure out which attributes of the data were most important when it comes to predicting the target variable (SalePrice).
# Find feature importance of our best model
ideal_model.feature_importances_
# Helper function for plotting feature importance
def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({"features": columns,
                        "feature_importance": importances})
          .sort_values("feature_importance", ascending=False)
          .reset_index(drop=True))
    sns.barplot(x="feature_importance",
                y="features",
                data=df[:n],
                orient="h")
plot_features(X_train.columns, ideal_model.feature_importances_)
The two features that stand out are YearMade and ProductSize. I would ask the company's subject matter experts whether those features make sense as drivers of the predictions, and the company could also focus on these features to guide future production.
Conclusive notes: Why might knowing the feature importances of a trained machine learning model be helpful? By gaining a better understanding of the model's logic, the model can be improved by verifying and focusing on the most important variables. Discussing the details and intricacies of the production year and size (YearMade and ProductSize) with management and industry experts could lead to breakthroughs and changes in production analysis, or at the least, to advice on monitoring these variables as time passes. The reward of this project was working through the challenges of uncleaned data and getting it ready for modelling. Though I spent a great deal of time fixating on hyperparameter tuning, I am beginning to understand the correlation between model performance and data preparation. Next, I would like to iteratively fine-tune the XGBoost parameters to see how much the accuracy can be improved.