Use the templates repository for boilerplate code.

Apache Spark

  1. Download Apache Spark from the [Spark website][sprk]. You want the latest Spark version (currently 2.1.0), pre-built for Hadoop 2.7 and later, as a direct download. This should give you a file named spark-2.1.0-bin-hadoop2.7.tgz. Extract this file.
  2. Open the pyspark shell by issuing the command

    /path/to/spark-2.1.0-bin-hadoop2.7/bin/pyspark

Make sure you have the Spark context sc available and that you can issue simple commands like sc.parallelize(range(10)) (a quick check is sketched below).

  3. Now try to run Spark in a Jupyter Notebook by issuing in your shell (Mac/Linux; if you're on Windows, the best option is to run a VM)

    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS=notebook
    /path/to/spark-2.1.0-bin-hadoop2.7/bin/pyspark
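
As a quick sanity check (assuming the shell has created the Spark context and bound it to sc, as pyspark does by default), something like this should work in both the plain shell and the notebook:

    # `sc` is the SparkContext that the pyspark shell creates for you.
    rdd = sc.parallelize(range(10))
    rdd.count()   # expected: 10
    rdd.sum()     # expected: 45
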
  1. Write a distance metric for points represented by arbitrary-length arrays, e.g. using the 2-norm \(d(x, y) = \|x - y\|_2\) (a sketch follows this list).
  2. Write a function that, given an RDD containing records of the form [k, (X1, X2, ...)], returns a transformed RDD such that all columns Xi have zero mean and unit variance. Assume that each value (X1, X2, ...) of the RDD is a numpy array (sketched after this list).
  3. Implement a K-Nearest Neighbor classifier in Spark. Your input is an RDD containing records of the form [l, y], an integer k and a point x.
     1. Write a function that finds the k closest y to x in the RDD according to the metric passed as argument and returns the average value of l over those (see the sketch after this list).
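
A minimal sketch for the distance metric (exercise 1), assuming points arrive as numpy arrays or anything np.asarray can convert; the function name is a placeholder:

    import numpy as np

    def euclidean(x, y):
        """2-norm distance between two arbitrary-length points."""
        return np.linalg.norm(np.asarray(x) - np.asarray(y))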
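
One possible sketch for exercise 2: compute the column means and standard deviations with driver-side reductions, then rescale every record. The function name and the two-pass approach are assumptions, not the only way to do it:

    import numpy as np

    def standardize(rdd):
        """Rescale an RDD of (k, numpy array) records to zero mean and unit variance per column."""
        n = rdd.count()
        values = rdd.map(lambda kv: kv[1])
        mean = values.reduce(lambda a, b: a + b) / n
        var = values.map(lambda x: (x - mean) ** 2).reduce(lambda a, b: a + b) / n
        std = np.sqrt(var)
        # mean and std are small numpy arrays, so shipping them in the closure is cheap.
        return rdd.mapValues(lambda x: (x - mean) / std)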
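
A sketch for exercise 3, assuming records are (l, y) pairs with numpy-array points y and that metric is a distance function such as the one above (both names are placeholders):

    def knn(rdd, x, k, metric):
        """Average label l of the k records whose point y is closest to x."""
        # takeOrdered brings back only the k nearest records to the driver.
        nearest = rdd.takeOrdered(k, key=lambda rec: metric(rec[1], x))
        return sum(l for l, _ in nearest) / float(k)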

Python (if covered in class)

  1. Write a function returning the unique elements from an iterable (e.g. a list, tuple or similar); a sketch follows this list.
  2. Write a generator returning the first \(N + 1\) elements of the Fibonacci series, starting from given values \(k_0\) and \(k_1\), such that \[k_i = k_{i-1} + k_{i-2},\quad i = 2, \ldots, N.\] A sketch follows this list.
  3. Write a function that calls another function f, passed to it as an argument. It should try to call f and return the result, or None if f raises a ValueError. Any other error should be ignored (a sketch follows this list).
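
A possible sketch for exercise 1; preserving the order of first appearance is an extra assumption, and a plain set(iterable) would also satisfy the statement:

    def unique(iterable):
        """Return the unique elements of an iterable, in order of first appearance."""
        seen = set()
        result = []
        for item in iterable:
            if item not in seen:
                seen.add(item)
                result.append(item)
        return result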
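
A sketch for exercise 2, assuming the generator should yield \(k_0\) and \(k_1\) themselves as the first two elements:

    def fibonacci(N, k0, k1):
        """Yield the N + 1 elements k_0, ..., k_N with k_i = k_{i-1} + k_{i-2}."""
        a, b = k0, k1
        yield a
        for _ in range(N):
            a, b = b, a + b
            yield a

    # list(fibonacci(10, 0, 1)) -> [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]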
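
A sketch for exercise 3, reading "any other error should be ignored" as: only ValueError is handled and everything else propagates to the caller. That reading, the function name, and passing extra arguments through to f are assumptions:

    def call_safely(f, *args, **kwargs):
        """Call f and return its result, or None if f raises ValueError."""
        try:
            return f(*args, **kwargs)
        except ValueError:
            return None
        # Any other exception is deliberately left uncaught and propagates.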