Use the lecture's Jupyter Notebook as a starting point.
Anomaly Detection
Regression Methods Using Trees
As last week, use the Oslo city bike data.
- Make a Spark DataFrame with the hourly counts as the target column and the lagged counts (lags of 1 hour, 1 day, 7 days, 14 days, and 28 days) as the feature column.
- Split the data into training and test sets, using a date about two thirds of the way through the period your data spans as the split point.
- For a number of different tree depths, fit a DecisionTreeRegressor to the training data. Plot the test error vs. the depth. What seems to be a good value for the tree depth?
- Make a histogram of the normalized deviations from your test data using a tree of optimal depth.
- Fit a GBTRegressor (Gradient Boosted Tree) to the data. How does its performance compare to the single tree models? Try a number of different values for the number of boosting steps and compare.
- Make a histogram of the normalized deviations from your test data.
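
The data-preparation logic behind the steps above (lagged features, a chronological split, and normalized deviations) can be sketched without Spark, using only the Python standard library. This is an illustration under stated assumptions, not the intended Spark solution: the hourly counts are synthetic, the helper names (make_lagged_rows, chronological_split, normalized_deviations) are hypothetical, a "same hour yesterday" baseline stands in for the fitted tree, and "normalized deviation" is assumed to mean the residual centred and scaled by the standard deviation of the residuals.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev

# Lags from the exercise, in hours: 1 hour, 1 day, 7 days, 14 days, 28 days.
LAGS_HOURS = [1, 24, 7 * 24, 14 * 24, 28 * 24]


def make_lagged_rows(counts):
    """Turn {timestamp: count} into sorted rows of (timestamp, features, target).

    A row is kept only if every lagged timestamp is present in `counts`.
    """
    rows = []
    for ts in sorted(counts):
        lagged = [counts.get(ts - timedelta(hours=h)) for h in LAGS_HOURS]
        if all(v is not None for v in lagged):
            rows.append((ts, lagged, counts[ts]))
    return rows


def chronological_split(rows, fraction=2 / 3):
    """Split rows at the date lying `fraction` of the way through the period."""
    first, last = rows[0][0], rows[-1][0]
    split_point = first + (last - first) * fraction
    train = [r for r in rows if r[0] < split_point]
    test = [r for r in rows if r[0] >= split_point]
    return train, test, split_point


def normalized_deviations(targets, predictions):
    """Residuals centred and scaled by their standard deviation (assumed definition)."""
    residuals = [y - p for y, p in zip(targets, predictions)]
    mu, sigma = mean(residuals), pstdev(residuals)
    if sigma == 0:
        raise ValueError("all residuals are identical; cannot normalize")
    return [(r - mu) / sigma for r in residuals]


# Synthetic hourly counts for 60 days (NOT the Oslo data): a daily pattern
# plus a weekday bonus, so that the baseline below makes some errors.
start = datetime(2019, 1, 1)
counts = {}
for i in range(60 * 24):
    ts = start + timedelta(hours=i)
    counts[ts] = 10 + ts.hour + (5 if ts.weekday() < 5 else 0)

rows = make_lagged_rows(counts)
train, test, split_point = chronological_split(rows)

# Stand-in for a fitted model: predict the count from the same hour yesterday
# (the second feature, i.e. the 24-hour lag).
targets = [target for _, _, target in test]
predictions = [features[1] for _, features, _ in test]
z = normalized_deviations(targets, predictions)
```

In the Spark version, the values in z would come from the predictions of the fitted DecisionTreeRegressor or GBTRegressor on the test set, and the histogram would be drawn from them. Splitting by date rather than at random keeps every test observation later than every training observation, which matters because the features are lagged values of the target.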