Big Data Project: Yelp Rating Regression Predictor

1 Introduction

When deciding where to eat, I’ll often use Yelp: a crowd-sourced review service where users can rate restaurants on a scale from 1 to 5 stars (5 being the best possible rating). Since a restaurant’s success is highly correlated with its reputation, it is useful to understand the underlying features that affect its online perception.

In this project, I will use a Multiple Linear Regression model to investigate the features that most directly affect a restaurant’s Yelp rating and consequently use these features to predict Yelp ratings of hypothetical restaurants.

1.1 Goal:

  • Demonstrate how a Multiple Linear Regression model can be used to predict a restaurant’s Yelp rating

1.2 Approach:

  • Perform statistical analysis on a real Yelp dataset made up of six JSON files.
    • yelp_business.json: establishment data regarding location and attributes for all businesses in the dataset
    • yelp_review.json: Yelp review metadata by business
    • yelp_user.json: user profile metadata by business
    • yelp_checkin.json: online checkin metadata by business
    • yelp_tip.json: tip metadata by business
    • yelp_photo.json: photo metadata by business

1.3 Imports

Import the required libraries and set the pandas display options.

# Data manipulation
import pandas as pd
import numpy as np

# Options for pandas
pd.options.display.max_columns = 60
pd.options.display.max_rows = 500

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Visualizations
%matplotlib inline
import matplotlib.pyplot as plt

2 Data Cleaning

2.1 Load the Data

First, let’s use Pandas to investigate the data in DataFrame form.

businesses = pd.read_json('yelp_business.json',lines=True)
reviews = pd.read_json('yelp_review.json',lines=True)
users = pd.read_json('yelp_user.json',lines=True)
checkins = pd.read_json('yelp_checkin.json',lines=True)
tips = pd.read_json('yelp_tip.json',lines=True)
photos = pd.read_json('yelp_photo.json',lines=True)

Let’s preview the first five rows of each DataFrame.

businesses.head()
[Output: first five rows of businesses, with columns address, alcohol?, attributes, business_id, categories, city, good_for_kids, has_bike_parking, has_wifi, hours, is_open, latitude, longitude, name, neighborhood, postal_code, price_range, review_count, stars, state, take_reservations, takes_credit_cards]
reviews.head()
[Output: first five rows of reviews, with columns average_review_age, average_review_length, average_review_sentiment, business_id, number_cool_votes, number_funny_votes, number_useful_votes]
users.head()
[Output: first five rows of users, with columns average_days_on_yelp, average_number_fans, average_number_friends, average_number_years_elite, average_review_count, business_id]
checkins.head()
[Output: first five rows of checkins, with columns business_id, time, weekday_checkins, weekend_checkins]
tips.head()
[Output: first five rows of tips, with columns average_tip_length, business_id, number_tips]

2.2 Merge the Data

At the moment, all of our DataFrames are separate. However, each DataFrame contains the column business_id, and we can use this commonality to merge the multiple DataFrames into a single DataFrame.

Since we have six DataFrames, we will need to perform five merges to combine all of the data into one DataFrame. If the merges are done correctly, df will have the same length as businesses, and df.columns will contain all the unique columns from each of the six initial DataFrames.

print(len(businesses))
188593
df = pd.merge(businesses, reviews, how='left', on='business_id')
df = pd.merge(df, users, how='left', on='business_id')
df = pd.merge(df, checkins, how='left', on='business_id')
df = pd.merge(df, tips, how='left', on='business_id')
df = pd.merge(df, photos, how='left', on='business_id')
print(len(df))
188593
print(df.columns)
Index(['address', 'alcohol?', 'attributes', 'business_id', 'categories',
       'city', 'good_for_kids', 'has_bike_parking', 'has_wifi', 'hours',
       'is_open', 'latitude', 'longitude', 'name', 'neighborhood',
       'postal_code', 'price_range', 'review_count', 'stars', 'state',
       'take_reservations', 'takes_credit_cards', 'average_review_age',
       'average_review_length', 'average_review_sentiment',
       'number_cool_votes', 'number_funny_votes', 'number_useful_votes',
       'average_days_on_yelp', 'average_number_fans', 'average_number_friends',
       'average_number_years_elite', 'average_review_count', 'time',
       'weekday_checkins', 'weekend_checkins', 'average_tip_length',
       'number_tips', 'average_caption_length', 'number_pics'],
      dtype='object')

2.3 Clean the Data

Before we can use a Linear Regression model, we need to remove any columns in the dataset that are not continuous or binary.

features_to_remove = ['address','attributes','business_id','categories','city','hours','is_open','latitude','longitude','name','neighborhood','postal_code','state','time']
df.drop(labels=features_to_remove, axis=1, inplace=True)

Now let’s check whether our data contains missing values (i.e., NaNs).

df.isna().any()
alcohol?                      False
good_for_kids                 False
has_bike_parking              False
has_wifi                      False
price_range                   False
review_count                  False
stars                         False
take_reservations             False
takes_credit_cards            False
average_review_age            False
average_review_length         False
average_review_sentiment      False
number_cool_votes             False
number_funny_votes            False
number_useful_votes           False
average_days_on_yelp          False
average_number_fans           False
average_number_friends        False
average_number_years_elite    False
average_review_count          False
weekday_checkins               True
weekend_checkins               True
average_tip_length             True
number_tips                    True
average_caption_length         True
number_pics                    True
dtype: bool

We still have a few columns with missing values. To fix this, we can use the .fillna() method to replace any missing values in df with 0.

df.fillna({'weekday_checkins':0,
           'weekend_checkins':0,
           'average_tip_length':0,
           'number_tips':0,
           'average_caption_length':0,
           'number_pics':0},
          inplace=True)

Let’s check once again to see if our data still contains missing values (i.e., NaNs).

df.isna().any()
alcohol?                      False
good_for_kids                 False
has_bike_parking              False
has_wifi                      False
price_range                   False
review_count                  False
stars                         False
take_reservations             False
takes_credit_cards            False
average_review_age            False
average_review_length         False
average_review_sentiment      False
number_cool_votes             False
number_funny_votes            False
number_useful_votes           False
average_days_on_yelp          False
average_number_fans           False
average_number_friends        False
average_number_years_elite    False
average_review_count          False
weekday_checkins              False
weekend_checkins              False
average_tip_length            False
number_tips                   False
average_caption_length        False
number_pics                   False
dtype: bool

3 Exploratory Analysis

3.1 Correlation Analysis

Now that our data has been merged and cleaned, let’s perform some analysis! Our ultimate goal is to create a Multiple Linear Regression model. We can use the .corr() method to see the correlation coefficients for each pair of our different features.

df.corr()
[Output: the full 26 x 26 correlation matrix of df. The row of interest is stars, whose strongest correlations (by absolute value) are with average_review_sentiment (0.782), average_review_length (-0.277), and average_review_age (-0.126).]
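
Because the full matrix is hard to scan, a small helper (a sketch, not part of the original notebook) can rank the features by the absolute value of their correlation with stars:

# correlation of every feature with stars, ordered by absolute strength
corr_with_stars = df.corr()['stars'].drop('stars')
print(corr_with_stars.reindex(corr_with_stars.abs().sort_values(ascending=False).index).head())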

3.2 Data Visualization: Yelp Rating Scatterplots

From the previous correlation analysis, we determined that the three features with the strongest correlations to Yelp rating (the stars column) are average_review_sentiment, average_review_length, and average_review_age.

Let’s visualize these relationships by creating three separate scatterplots, plotting our Yelp rating (stars) against average_review_sentiment, average_review_length, and average_review_age, respectively.

# plot stars against average_review_sentiment here
plt.scatter(df['average_review_sentiment'],df['stars'],alpha=0.1)
plt.xlabel('average_review_sentiment')
plt.ylabel('Yelp Rating')
plt.show()

[Figure: scatterplot of Yelp Rating vs average_review_sentiment]

# plot stars against average_review_length here
plt.scatter(df['average_review_length'],df['stars'],alpha=0.1)
plt.xlabel('average_review_length')
plt.ylabel('Yelp Rating')
plt.show()

[Figure: scatterplot of Yelp Rating vs average_review_length]

# plot stars against average_review_age here
plt.scatter(df['average_review_age'],df['stars'],alpha=0.1)
plt.xlabel('average_review_age')
plt.ylabel('Yelp Rating')
plt.show()

[Figure: scatterplot of Yelp Rating vs average_review_age]

3.3 Data Selection

Again, the three features with the strongest correlations to Yelp rating are average_review_sentiment, average_review_length, and average_review_age.

Let’s use this knowledge to create our first model with average_review_sentiment, average_review_length, and average_review_age as features.

features = df[['average_review_sentiment','average_review_length','average_review_age']]
ratings = df['stars']

3.4 Split the Data into Training and Testing Sets

Before we can create a model, our data must be separated into a training set and a test set.

X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)

3.5 Create and Train the Model

We already imported LinearRegression from scikit-learn’s linear_model module in the imports section.

To train our model, we create an instance of the LinearRegression class and then call the .fit() method on that instance with our training data.

model = LinearRegression()
model.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

3.6 Evaluate Model

The effectiveness of our model can be assessed with the .score() method, which returns the R^2 value for the model. R^2 values range from 0 to 1, with 0 indicating that none of the variability in y can be explained by x, and 1 indicating that 100% of the variability in y can be explained by x. Let’s use the .score() method on our training and testing sets.

model.score(X_train,y_train)
0.6520510292564032
model.score(X_test,y_test)
0.6495675480094902
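
As a quick sanity check (not part of the original notebook), the test score can be reproduced directly from the definition of R^2:

from sklearn.metrics import r2_score

# R^2 = 1 - (residual sum of squares / total sum of squares), computed on the test set
y_pred_test = model.predict(X_test)
ss_res = np.sum((y_test - y_pred_test) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print(1 - ss_res / ss_tot)
print(r2_score(y_test, y_pred_test))  # should match model.score(X_test, y_test)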

We can use .coef_ to get the array of feature coefficients determined by fitting our model to the training data. Let’s list the feature coefficients in descending order of absolute value.

sorted(list(zip(['average_review_sentiment','average_review_length','average_review_age'],model.coef_)),key = lambda x: abs(x[1]),reverse=True)
[('average_review_sentiment', 2.243030310441708),
 ('average_review_length', -0.0005978300178804348),
 ('average_review_age', -0.00015209936823152394)]
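
To see how these coefficients are used, a single prediction can be reconstructed by hand as the intercept plus the weighted sum of the feature values (a minimal sketch, not part of the original notebook):

# reconstruct the first test prediction: intercept + sum(coefficient * feature value)
first_row = X_test.iloc[0]
print(model.intercept_ + np.dot(model.coef_, first_row.values))
print(model.predict(X_test.iloc[[0]])[0])  # should match the manual calculation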

3.7 Data Visualization Pt 2: Scatterplot Predictions

Another way to determine the reliability of the model is to calculate the predicted Yelp ratings for our testing data and compare them to their actual Yelp ratings. We will use a scatterplot to plot Predicted Yelp Rating against the actual Yelp Rating.

We can use the .predict() method, which applies the model’s coefficients, to calculate the predicted Yelp ratings.

y_predicted = model.predict(X_test)
plt.scatter(y_test,y_predicted)
plt.xlabel('Yelp Rating')
plt.ylabel('Predicted Yelp Rating')
plt.ylim(1,5)
plt.show()

[Figure: Predicted Yelp Rating vs actual Yelp Rating for the three-feature model]

3.8 Future Modeling

Let’s explore the previous process with new sets of features. Rather than re-doing the entire workflow every time we’d like to change our list of features, we can wrap it in a function:

# take a list of features to model as a parameter
def model_these_features(feature_list):

    # define ratings and features, with the features limited to our chosen subset of data
    ratings = df.loc[:,'stars']
    features = df.loc[:,feature_list]

    # perform train, test, split on the data
    X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)

    # if only one feature is modeled, reshape data to prevent errors
    if len(X_train.shape) < 2:
        X_train = np.array(X_train).reshape(-1,1)
        X_test = np.array(X_test).reshape(-1,1)

    # create and fit the model to the training data
    model = LinearRegression()
    model.fit(X_train,y_train)

    # print the train and test scores
    print('Train Score:', model.score(X_train,y_train))
    print('Test Score:', model.score(X_test,y_test))

    # print the model features and their corresponding coefficients, from most predictive to least predictive
    print(sorted(list(zip(feature_list,model.coef_)),key = lambda x: abs(x[1]),reverse=True))

    # calculate the predicted Yelp ratings from the test data
    y_predicted = model.predict(X_test)

    # plot the actual Yelp Ratings vs the predicted Yelp ratings for the test data
    plt.scatter(y_test,y_predicted)
    plt.xlabel('Yelp Rating')
    plt.ylabel('Predicted Yelp Rating')
    plt.ylim(1,5)
    plt.show()

Let’s use this function on a new set of features.

# subset of all features that have a response range [0,1]
binary_features = ['alcohol?','has_bike_parking','takes_credit_cards','good_for_kids','take_reservations','has_wifi']

# create a model on all binary features here
model_these_features(binary_features)
Train Score: 0.012223180709591164
Test Score: 0.010119542202269072
[('has_bike_parking', 0.19003008208039676), ('alcohol?', -0.14549670708138332), ('has_wifi', -0.13187397577762547), ('good_for_kids', -0.08632485990337231), ('takes_credit_cards', 0.07175536492195614), ('take_reservations', 0.04526558530451594)]

[Figure: Predicted Yelp Rating vs actual Yelp Rating for the binary-feature model]

# subset of all features that vary on a greater range than [0,1]
numeric_features = ['review_count','price_range','average_caption_length','number_pics','average_review_age','average_review_length','average_review_sentiment','number_funny_votes','number_cool_votes','number_useful_votes','average_tip_length','number_tips','average_number_friends','average_days_on_yelp','average_number_fans','average_review_count','average_number_years_elite','weekday_checkins','weekend_checkins']

# create a model on all numeric features here
model_these_features(numeric_features)
Train Score: 0.673499259376666
Test Score: 0.6713318798120138
[('average_review_sentiment', 2.2721076642097686), ('price_range', -0.0804608096270259), ('average_number_years_elite', -0.07190366288054195), ('average_caption_length', -0.00334706600778316), ('number_pics', -0.0029565028128950613), ('number_tips', -0.0015953050789039144), ('number_cool_votes', 0.0011468839227082779), ('average_number_fans', 0.0010510602097444858), ('average_review_length', -0.0005813655692094847), ('average_tip_length', -0.0005322032063458541), ('number_useful_votes', -0.00023203784758702592), ('average_review_count', -0.00022431702895061526), ('average_review_age', -0.0001693060816507226), ('average_days_on_yelp', 0.00012878025876700503), ('weekday_checkins', 5.918580754475574e-05), ('weekend_checkins', -5.518176206986478e-05), ('average_number_friends', 4.826992111594799e-05), ('review_count', -3.48348376378989e-05), ('number_funny_votes', -7.884395674183897e-06)]

[Figure: Predicted Yelp Rating vs actual Yelp Rating for the numeric-feature model]

# all features
all_features = binary_features + numeric_features

# create a model on all features here
model_these_features(all_features)
Train Score: 0.6807828861895333
Test Score: 0.6782129045869245
[('average_review_sentiment', 2.280845699662378), ('alcohol?', -0.14991498593470778), ('has_wifi', -0.12155382629262777), ('good_for_kids', -0.11807814422012647), ('price_range', -0.06486730150041178), ('average_number_years_elite', -0.0627893971389538), ('has_bike_parking', 0.027296969912285574), ('takes_credit_cards', 0.02445183785362615), ('take_reservations', 0.014134559172970311), ('number_pics', -0.0013133612300815713), ('average_number_fans', 0.0010267986822657448), ('number_cool_votes', 0.000972372273441118), ('number_tips', -0.0008546563320877247), ('average_caption_length', -0.0006472749798191067), ('average_review_length', -0.0005896257920272376), ('average_tip_length', -0.00042052175034057535), ('number_useful_votes', -0.00027150641256160215), ('average_review_count', -0.00023398356902509327), ('average_review_age', -0.00015776544111326904), ('average_days_on_yelp', 0.00012326147662885747), ('review_count', 0.00010112259377384992), ('weekend_checkins', -9.239617469645031e-05), ('weekday_checkins', 6.1539091231461e-05), ('number_funny_votes', 4.8479351025072536e-05), ('average_number_friends', 2.0695840373717654e-05)]

[Figure: Predicted Yelp Rating vs actual Yelp Rating for the all-feature model]

3.9 Prediction of a Hypothetical Restaurant: Adrian’s Taco Shop

Let’s create a hypothetical restaurant and predict its Yelp Rating. First, let’s recall what our features are.

print(all_features)
['alcohol?', 'has_bike_parking', 'takes_credit_cards', 'good_for_kids', 'take_reservations', 'has_wifi', 'review_count', 'price_range', 'average_caption_length', 'number_pics', 'average_review_age', 'average_review_length', 'average_review_sentiment', 'number_funny_votes', 'number_cool_votes', 'number_useful_votes', 'average_tip_length', 'number_tips', 'average_number_friends', 'average_days_on_yelp', 'average_number_fans', 'average_review_count', 'average_number_years_elite', 'weekday_checkins', 'weekend_checkins']

For some perspective on existing restaurants, let’s calculate the mean, minimum, and maximum values for each feature.

pd.DataFrame(list(zip(features.columns,features.describe().loc['mean'],features.describe().loc['min'],features.describe().loc['max'])),columns=['Feature','Mean','Min','Max'])
   Feature                          Mean          Min          Max
0  average_review_sentiment         0.554935    -0.995200     0.996575
1  average_review_length          596.463567    62.400000  4229.000000
2  average_review_age            1175.501021    71.555556  4727.333333
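
Note that features here still refers to the three-feature subset defined in section 3.3, which is why only those three rows appear. To get the same summary for every feature in all_features, a one-liner along these lines should work (a sketch using the merged df from earlier):

# mean, minimum, and maximum for every feature used in the full model
df[all_features].describe().loc[['mean', 'min', 'max']].T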

Let’s call our hypothetical restaurant Adrian's Taco Shop and assign this taco shop reasonable values for each feature.

adrians_taco_shop = np.array([1,1,1,1,1,1,75,2,3,10,10,1200,0.95,3,6,10,50,3,50,500,20,100,1,0,0]).reshape(1,-1)
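
The raw array above has to follow the exact column order of all_features, which makes it easy to mix up values. A sketch of a more explicit way to build the same input is to key the (hypothetical) values by feature name and order them programmatically:

# hypothetical feature values for Adrian's Taco Shop, keyed by feature name
adrians_values = {
    'alcohol?': 1, 'has_bike_parking': 1, 'takes_credit_cards': 1,
    'good_for_kids': 1, 'take_reservations': 1, 'has_wifi': 1,
    'review_count': 75, 'price_range': 2, 'average_caption_length': 3,
    'number_pics': 10, 'average_review_age': 10, 'average_review_length': 1200,
    'average_review_sentiment': 0.95, 'number_funny_votes': 3,
    'number_cool_votes': 6, 'number_useful_votes': 10,
    'average_tip_length': 50, 'number_tips': 3, 'average_number_friends': 50,
    'average_days_on_yelp': 500, 'average_number_fans': 20,
    'average_review_count': 100, 'average_number_years_elite': 1,
    'weekday_checkins': 0, 'weekend_checkins': 0,
}

# order the values to match all_features, the column order used to retrain the model below
adrians_taco_shop = np.array([adrians_values[f] for f in all_features]).reshape(1, -1)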

Before we make a prediction, let’s retrain our model on all our features.

#retrain model on all features
features = df.loc[:,all_features]
ratings = df.loc[:,'stars']
X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)
model = LinearRegression()
model.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Finally, let’s make our Yelp rating prediction on Adrian's Taco Shop!

model.predict(adrians_taco_shop)
array([3.82929417])

3.8 stars, huh... Not too bad, I guess.

4 Discussion & Conclusion

We were able to build a Multiple Linear Regression model that can somewhat predict a restaurant’s Yelp rating. Although we obtained our highest test score of roughly 0.678 when modeling all available features, this was not much higher than the 0.671 we got from the numeric features alone or the 0.650 from our top three features.

This project demonstrated that even when a plethora of data is available, it can still be difficult to make accurate predictions. Additionally, I learned how initial analysis can provide valuable insight for future projects.

For example, we determined that average_review_sentiment has the strongest correlation with Yelp rating; it might be worth further investigating how “sentiment” is determined by using Natural Language Processing techniques. (More on NLP soon!)