Use the lecture's Jupyter Notebook as a starting point.
Anomaly Detection
- Head over to the Oslo City Bike webpage and download their 2016 data.
- Load the data into Spark and parse the timestamps of the starts of the trips.
- We'll now repeat the steps from the lecture.
- Calculate the per-day-and-hour trip counts.
- For each day, construct a feature vector containing the trip counts for each hour.
- Using the elbow method, find a good \(k\), such that dividing the days in \(k\) clusters seems to make sense.
- Produce the elbow plot along the way.
- Cluster the days according to the \(k\) you chose.
- Which weekdays occur most commonly in your clusters? Make plot.
- Which months occur most commonly in your clusters? Make a plot.
- Calculate the per-cluster z-values and find the most anomalous days.
- Advanced. Repeat the above analysis, using counts per-day-and-station. Try to identify anomalous days for a given station.