This web page contains the homework assignments for the course STK-INF4000 'Selected Topics in Data Science', taught at the University of Oslo in the spring term of 2017.
Weekly Assignments
Syllabus
The course is meant to be very interactive, and I'll adapt the syllabus to some degree to students' requests. As soon as I can foresee the content of the upcoming lectures, it will be published here and on UiO's course page.
- Week 1 (Jan. 16th):
- Welcome.
- What is data science?
- Why should you take this course?
- What you'll learn.
- Practical information.
- Python, part 1
- Basics.
- Data types.
- Lists and tuples.
- Functions.
- Plotting.
- Week 2 (Jan. 23rd):
- Python, part 2
- Dictionaries
- Objects
- Generators
- Web Scraping 101
- Scraping etiquette.
- Extracting information from HTML pages.
- Spiders and Crawlers in Python.
- Week 3 (Jan. 30th):
- Python, part 3
- Databases 101
- RESTful APIs
- Week 4 (Feb. 6th):
- Introduction to Machine Learning.
- Aims of ML.
- Interpretability vs. accuracy.
- Just enough theory.
- K-Nearest Neighbors.
- scikit-learn: Machine Learning in Python.
- K-Nearest Neighbors on real-world data.
- Week 5 (Feb. 13th):
- Project discussion.
- Small data EDA in Python: The pandas package.
- What is EDA?
- What should I look for?
- Week 6 (Feb. 20th)
- Linear regression.
- Introduction to linear methods.
- Linear regression in scikit-learn.
- Linear regression in scipy.
- Regularized linear regression.
- Python leftovers.
- scipy and numpy.
- Generators.
- Error handling and logging.
- Week 7 (Feb. 27th)
- Big data strategies.
- MapReduce.
- Apache Spark.
- Week 8 (Mar. 6th)
- Linear classification.
- The classification problem.
- Discriminant Analysis.
- Logistic regression.
- Applied classification
- Classifying by hand.
- scikit-learn and Statsmodels.
- Week 9 (Mar. 13th)
- Applied classification in Spark
- Spark data frames
- Logistic regression in Spark.
- Unsupervised Learning.
- Anomaly detection.
- Clustering.
- K-Means in Spark.
- Hierarchical clustering.
- Week 10 (Mar. 20th)
- Tree Methods.
- Decision trees, pros and cons.
- Trees for classification.
- Trees for regression.
- Trees in scikit-learn.
- Week 11 (Mar. 27th)
- Project delivery discussion (projects due Apr. 3rd).
- Regression methods for anomaly detection.
- Ensemble Methods.
- Boosted decision trees in Spark.
- Random forests in Spark.
- Week 12 (Apr. 3rd)
- Week 13 (Apr. 24th)
- Model Evaluation.
- Train error, test error, and all that.
- Optimism and AIC.
- Bootstrapping and bagging.
- Random Forests.
- Clean Modeling: Data pipelines.
- Week 14 (May 8th)
- Data pipelines in Apache Spark.
- Natural Language Processing 101.
- What is NLP?
- Basic methods of NLP.
- Building a simple sentiment analysis tool.
- Week 15 (May 15th)
- Presenting results - the open source way.
- JavaScript 101
- jQuery, Vue.js and Google Charts
- Serving data with Flask.
- Week 16 (May 22nd)
- What we missed.
- Neural Networks and deep learning.
- Support vector machines.
- Heuristics.
- Review and preparation for the exam.
- Q&A.