Use the code templates to get started!
Classification in Spark
- Download the cover type data set.
- Load the data into a Spark data frame and compute summary statistics such as mean, standard deviation, and quantiles on some of the columns.
- Investigate how the means of some of the columns vary with the cover type using the `groupBy` method.
- Compute the group means \(\mu_l\), the class priors \(\pi_l\), and the pooled covariance matrix \(\Sigma\) for a linear discriminant analysis (LDA) on the data using Spark.
- Evaluate how well LDA performs per cover type.
- How can you tweak the prediction algorithm to improve classification performance for one specific cover type, possibly at the cost of decreased performance for the other types?
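The loading, summary-statistics, and `groupBy` steps above can be sketched in PySpark. This is a minimal sketch: the column names (`elevation`, `cover_type`), the file name, and the inlined stand-in rows are assumptions, not the actual cover type data.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").appName("covtype-sketch").getOrCreate()

# A real run would load the downloaded file instead, e.g.:
# df = spark.read.csv("covtype.csv", header=True, inferSchema=True)
# Tiny inlined stand-in so this sketch is self-contained:
df = spark.createDataFrame(
    [(2596, 1), (2590, 1), (2804, 2), (2785, 2)],
    ["elevation", "cover_type"],
)

# Summary statistics (mean, stddev, quantiles) on a column.
df.select("elevation").summary("mean", "stddev", "25%", "50%", "75%").show()

# How the means vary with the cover type, via groupBy.
means = df.groupBy("cover_type").agg(F.mean("elevation").alias("mean_elevation"))
means.show()
```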
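The LDA quantities \(\mu_l\), \(\pi_l\), and \(\Sigma\) can be made concrete with a small NumPy sketch; in Spark the same aggregates would come from `groupBy`/`agg`. The demo data and function names here are illustrative, not the cover type set.

```python
import numpy as np

def lda_fit(X, y):
    """Group means mu_l, priors pi_l, and pooled covariance Sigma for LDA."""
    classes = np.unique(y)
    n, d = X.shape
    mus = np.array([X[y == c].mean(axis=0) for c in classes])
    pis = np.array([(y == c).mean() for c in classes])
    Sigma = np.zeros((d, d))
    for c, mu in zip(classes, mus):
        Xc = X[y == c] - mu          # center each class at its own mean
        Sigma += Xc.T @ Xc
    Sigma /= n - len(classes)        # pooled within-class covariance
    return classes, mus, pis, Sigma

def lda_predict(X, classes, mus, pis, Sigma):
    """Pick the class maximizing the linear discriminant score
    delta_l(x) = x^T Sigma^{-1} mu_l - (1/2) mu_l^T Sigma^{-1} mu_l + log pi_l."""
    Sinv = np.linalg.inv(Sigma)
    quad = np.einsum("ld,de,le->l", mus, Sinv, mus)   # mu_l^T Sigma^{-1} mu_l
    scores = X @ Sinv @ mus.T - 0.5 * quad + np.log(pis)
    return classes[scores.argmax(axis=1)]

# Demo on two well-separated synthetic classes (placeholder data, not covtype):
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y = np.repeat([0, 1], 50)
classes, mus, pis, Sigma = lda_fit(X, y)
pred = lda_predict(X, classes, mus, pis, Sigma)
```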
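For the last question, one standard knob is the prior \(\pi_l\): inflating \(\log \pi_l\) for a target class (equivalently, adding a constant bonus to its discriminant score) pulls more predictions toward that class, raising its per-class recall while typically lowering the others'. A sketch with made-up scores (in the exercise these would be the LDA scores \(\delta_l(x)\)):

```python
import numpy as np

def per_class_recall(y_true, y_pred, classes):
    """Fraction of each class's examples that were labeled correctly."""
    return {int(c): float(np.mean(y_pred[y_true == c] == c)) for c in classes}

# Placeholder discriminant scores for 6 examples over 3 classes.
scores = np.array([
    [2.0, 1.9, 0.1],
    [1.5, 0.5, 0.9],
    [0.2, 0.5, 0.1],
    [0.0, 1.0, 1.2],
    [0.5, 0.4, 2.1],
    [1.0, 0.9, 1.1],
])
y_true = np.array([0, 0, 1, 1, 2, 2])
classes = np.arange(3)

base = per_class_recall(y_true, classes[scores.argmax(axis=1)], classes)

# Bias toward class 1: add a bonus to its score, i.e. inflate log(pi_1).
boosted_scores = scores.copy()
boosted_scores[:, 1] += 0.3
tweaked = per_class_recall(y_true, classes[boosted_scores.argmax(axis=1)], classes)
```

The bonus trades recall on the other classes for recall on the target class; tuning its size controls where that trade-off lands.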