Use the lecture's notebook as a starting point. The easiest way to solve these problems is using `pandas` and `scikit-learn`. But feel free to use Spark if you're up for a challenge.
Model Evaluation and Tuning
- Head over to the Somerville Happiness Survey data and download it.
- Inspect the data set. Your target variable will be `How.satisfied.are.you.with.Somerville.as.a.place.to.live.`.
- Find at least one categorical variable that you want to include in your model. A good place to start is `What.is.your.annual.household.income.`.
- Note that the data set has a lot of missing values. Choose at least one variable with missing values that you want to include in your model. Replace the missing values with the mean (for continuous variables) or the mode (for categorical variables).
- Make a pipeline that includes your data processing steps from the previous two tasks.
- Decide on a simple model you want to use to predict the satisfaction. A decision tree would be a good place to start.
- Using grid search, tune the most important model parameters, e.g. tree depth. What accuracy can you achieve?
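
The steps above could be wired together roughly as follows. This is only a sketch under stated assumptions, not a reference solution: it runs on a small synthetic data frame instead of the real survey file, the `age` column is a hypothetical stand-in for any continuous variable with missing values, and the parameter grid is an arbitrary starting choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

TARGET = "How.satisfied.are.you.with.Somerville.as.a.place.to.live."
INCOME = "What.is.your.annual.household.income."
AGE = "age"  # hypothetical continuous column, not taken from the survey

# Synthetic stand-in for the survey; swap in pd.read_csv(...) on your download.
rng = np.random.default_rng(0)
n = 200
income = pd.Series(rng.choice(["low", "medium", "high"], size=n), dtype=object)
df = pd.DataFrame({
    INCOME: income.mask(rng.random(n) < 0.2),             # ~20% missing
    AGE: rng.choice([25.0, 40.0, 60.0, np.nan], size=n),  # some missing
    TARGET: rng.integers(1, 6, size=n),                   # made-up 1..5 scores
})

# Categorical column: fill with the mode, then one-hot encode.
# Continuous column: fill with the mean.
preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), [INCOME]),
    ("num", SimpleImputer(strategy="mean"), [AGE]),
])

# Preprocessing and model in one pipeline, so imputation is refit per CV fold.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Grid over tree depth and leaf size; the values are illustrative guesses.
param_grid = {
    "tree__max_depth": [2, 3, 5, 8],
    "tree__min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")

X = df[[INCOME, AGE]]
y = df[TARGET]
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```

Because the target here is random noise, the reported accuracy says nothing about the real survey. Keeping the imputers inside the pipeline means the mean and mode are computed only from each training fold, which avoids leaking information from the validation fold into the preprocessing.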