Online travel agencies are scrambling to meet the AI-driven personalization standard set by companies like Amazon and Netflix. Online travel has also become a highly competitive space in which brands try to capture our attention (and wallets) by recommending, comparing, matching, and sharing.
Create optimal hotel recommendations for Expedia users who are searching for a hotel to book. Specifically, predict which "hotel cluster" a user is likely to book, given their search details.
Split train.csv into a training set and a test set (feel free to use a smaller random subset of train.csv). A second file, destinations.csv, contains information related to hotel reviews made by users. Build at least two prediction models from the training set and report their accuracies on the test set.
Download Location: https://www.kaggle.com/c/expedia-hotel-recommendations/data
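The workflow described above (split, fit two models, report test accuracy) can be sketched on synthetic data before touching the real files. This is only an illustration under stated assumptions: the arrays `X_demo` and `y_demo`, the 75/25 split, and the model settings are hypothetical stand-ins, not the notebook's actual features or tuning.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 600 rows, 5 numeric features, 3 classes.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int) + (X_demo[:, 2] > 1).astype(int)

# Hold out 25% of the rows as a test set, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

# Two candidate models, matching the classifiers imported in this notebook.
demo_models = {
    'decision_tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

# Fit each model on the training set and score it on the held-out test set.
demo_scores = {}
for name, model in demo_models.items():
    model.fit(X_train, y_train)
    demo_scores[name] = accuracy_score(y_test, model.predict(X_test))
    print('{}: test accuracy = {:.3f}'.format(name, demo_scores[name]))
```

Scoring both models on the same held-out rows keeps the comparison fair; accuracy on the real data will depend on the features and the number of hotel clusters.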
import random
import numpy as np
import pandas as pd
import seaborn as sb
import datetime as dt
import pandas_profiling as pp
from scipy.stats import norm
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import train_test_split, GridSearchCV
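seed = 42  # random seed for reproducibility
filename = 'Expedia_Hotel_Data/train.csv'  # path to the training data
# Count the data rows (total lines minus the header)
with open(filename) as f:
    n = sum(1 for line in f) - 1
s = 75000  # desired sample size
random.seed(seed)
# Line numbers to skip so that s randomly chosen rows remain (line 0 is the header)
skip = sorted(random.sample(range(1, n + 1), n - s))
# Read the sampled rows of train.csv and drop rows with missing (NaN) values
hotelData = pd.read_csv(filename, skiprows=skip).dropna().reset_index(drop=True)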
print('Dataset shape: {:,} columns and {:,} rows'.format(hotelData.shape[1], hotelData.shape[0]))
hotelData.head()
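The `skiprows` sampling used in the cell above can be checked on a small in-memory CSV. This is a minimal sketch: `csv_text`, `n_rows`, and `sample_size` are hypothetical names and values chosen for illustration.

```python
import io
import random

import pandas as pd

# Hypothetical in-memory CSV with a header line and 100 data rows (assumption).
csv_text = 'id,value\n' + ''.join('{},{}\n'.format(i, i * 10) for i in range(100))

n_rows = 100       # number of data rows, header excluded
sample_size = 10   # number of rows to keep
random.seed(42)

# Line 0 is the header, so only lines 1..n_rows are candidates for skipping.
skip_rows = sorted(random.sample(range(1, n_rows + 1), n_rows - sample_size))

# Skipped lines are never parsed, so only the sampled rows reach the DataFrame.
sample_df = pd.read_csv(io.StringIO(csv_text), skiprows=skip_rows)
print(sample_df.shape)  # (10, 2)
```

Because the range starts at 1, the header line is never skipped and the column names survive. Any later `dropna()` can still reduce the row count below the requested sample size.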
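# Left-join the destination features onto the sampled searches, then drop
# rows whose destination has no entry in destinations.csv (NaN after the join)
destData = hotelData.merge(pd.read_csv('expedia-hotel-recommendations/destinations.csv'),
                           how='left', on='srch_destination_id').dropna().reset_index(drop=True)
# Move the target column, hotel_cluster, to the front of the DataFrame
tmp = destData['hotel_cluster']
destData = destData.drop(['hotel_cluster'], axis=1)
destData.insert(0, 'hotel_cluster', tmp)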
print('Merged Destination Dataset shape: {:,} columns and {:,} rows'.format(destData.shape[1], destData.shape[0]))
destData.head()
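A left merge followed by `dropna()` silently discards searches whose destination has no feature row. The toy example below shows one way to count those rows before dropping them. The frames `searches` and `dest_features` and the feature column `d1` are hypothetical stand-ins; `indicator=True` is a standard `DataFrame.merge` option that adds a `_merge` column.

```python
import pandas as pd

# Toy search log (assumption): destination 3 has no matching feature row.
searches = pd.DataFrame({
    'srch_destination_id': [1, 2, 3, 2],
    'hotel_cluster': [10, 20, 30, 20],
})
# Toy destination features (assumption): only destinations 1 and 2 are present.
dest_features = pd.DataFrame({
    'srch_destination_id': [1, 2],
    'd1': [-2.1, -2.3],
})

# indicator=True marks each row as 'both' or 'left_only' in a '_merge' column.
merged = searches.merge(dest_features, how='left',
                        on='srch_destination_id', indicator=True)
unmatched = int((merged['_merge'] == 'left_only').sum())

# Dropping NaN rows removes exactly the unmatched searches in this toy case.
kept = merged.drop(columns='_merge').dropna().reset_index(drop=True)
print('unmatched rows:', unmatched, '| rows kept:', len(kept))
```

Comparing the two counts gives a quick sense of how much data the merge costs before any modelling starts.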
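# Balance the classes by downsampling every hotel cluster to the size of the rarest one
# (DataFrameGroupBy.sample requires pandas 1.1 or later)
minCount = int(destData['hotel_cluster'].value_counts().min())
balData = destData.groupby('hotel_cluster').sample(n=minCount, random_state=seed).reset_index(drop=True)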
print('Merged Balanced Dataset shape: {:,} columns and {:,} rows'.format(balData.shape[1], balData.shape[0]))
balData.head()
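Downsampling every class to the size of the rarest one, as in the balancing cell above, can be verified on toy data by comparing class counts afterwards. This sketch assumes pandas 1.1 or later for `DataFrameGroupBy.sample`; the frame `toy` and its `feature` column are hypothetical.

```python
import pandas as pd

# Toy imbalanced data (assumption): clusters 0, 1, 2 with 5, 3, and 8 rows.
toy = pd.DataFrame({
    'hotel_cluster': [0] * 5 + [1] * 3 + [2] * 8,
    'feature': range(16),
})

# The rarest class sets the per-class sample size.
min_count = int(toy['hotel_cluster'].value_counts().min())

# Draw min_count rows from each cluster without replacement.
balanced = (toy.groupby('hotel_cluster')
               .sample(n=min_count, random_state=42)
               .reset_index(drop=True))

print(balanced['hotel_cluster'].value_counts().sort_index().to_dict())
# {0: 3, 1: 3, 2: 3}
```

The balanced frame has `min_count` rows per class, so a single very rare cluster can shrink the dataset considerably. It is worth checking the resulting shape before training.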
pp.ProfileReport(balData[balData.columns[:24]]).to_notebook_iframe()