Foundations of Data Mining
Machine learning is the science of making computers act without being explicitly programmed. Instead, algorithms are used to find patterns in data. It is so pervasive today that you probably use it dozens of times a day without knowing it, for instance in web search, speech recognition, and (soon) self-driving cars. It is also a crucial component of data-driven industry (Big Data), scientific discovery, and modern healthcare. In this class, you will learn how data mining and machine learning work internally, understand when and how to use key concepts and techniques, and gain hands-on experience in getting them to work for yourself. You will also learn the theoretical underpinnings of data analysis and use that understanding to tackle new problems quickly and effectively.
- dr. ir. Joaquin Vanschoren (firstname.lastname@example.org) MF 7.104a - Responsible Lecturer
- dr. Vlado Menkovski (email@example.com) MF 7.097b - Lecturer
- dr. Anne Driemel (firstname.lastname@example.org) MF 7.073 - Lecturer
Upon completion of the course, students should be able to:
- write a program that builds a predictive model from training data
- evaluate a predictive model using test/training splits
- compare the performance of different types of predictive models
- reason about the mathematical foundations of data mining techniques
- recognize when a predictive model is overfitting
- understand and exploit the fundamental bias-variance tradeoff
- combine the above with dimension-reduction techniques
- visualize and explore data sets using embeddings and clustering
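As an illustration of the first two objectives, here is a minimal sketch that builds a predictive model from training data and evaluates it on a held-out test split. It is not course material: the data is synthetic, the helper names (`train_test_split`, `knn_predict`, `accuracy`, `make_toy_data`) are hypothetical, and it uses only the Python standard library rather than scikit-learn.

```python
import math
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and split them into a training part and a test part."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, point, k=3):
    """Predict the majority label among the k training points nearest to `point`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def make_toy_data(n_per_class=50, seed=0):
    """Two well-separated 2D Gaussian blobs (synthetic, for illustration only)."""
    rng = random.Random(seed)
    data = []
    for label, center in ((0, (0.0, 0.0)), (1, (5.0, 5.0))):
        for _ in range(n_per_class):
            x = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            data.append((x, label))
    return data


data = make_toy_data()
train, test = train_test_split(data)
test_accuracy = accuracy(train, test, k=3)
print(f"test accuracy: {test_accuracy:.2f}")
```

The model never sees the test points while predicting, so the reported accuracy estimates performance on new data rather than on the data used for fitting.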
Required prior knowledge: While there are no strict requirements, it is highly recommended to have a working knowledge of statistics, and to have programming experience. Programming is part of the assignments. The course will mostly feature examples using Python.
The course runs in Q3 and has the following weekly contact hours:
- Mondays, 10:45 - 12:30: Plenary Lectures (Flux 1.02)
- Thursdays, 13:45 - 15:30: Tutorials and Feedback (Flux 1.02)
- Thursdays, 15:45 - 17:30: Plenary Lectures (Flux 1.02)
Materials, Assignments, Questions
We use Canvas for all course activities:
- Check the 'Assignments' page for the assignments. You will have to submit your assignments here.
- Pose (course related) questions under 'Discussions'. You are encouraged to answer each other's questions. The lecturers will answer open questions as soon as possible.
- Check 'Files' and 'Pages' for other resources. The materials will also be available in a GitHub repo.
Please don't email the lecturers directly, except for personal questions. Even in those cases, please use Canvas to send a direct message.
It is your responsibility to keep up to date with postings and activities, but these will also be clearly announced in class. It is highly recommended that you adjust your email settings so that you get notified of all discussions happening on Canvas.
This schedule is preliminary. The order may change and parts of lectures may be removed (or added).
Course guidelines. Introduction to machine learning. k-Nearest Neighbors.
Tutorial: Linear Algebra (A. Driemel)
Basics of linear algebra (matrix operations, projections,...).
Tutorial: Introduction to ML in Python
Installation and environment setup. NumPy, SciPy, scikit-learn, Jupyter Notebooks.
Introduction Assignment 1
Linear regression (least squares), ridge regression, lasso, logistic regression, linear SVMs.
Evaluation and model selection
Evaluating predictive models. Avoiding overfitting. Cross-validation. ROC analysis, Bias-Variance analysis. Optimizing hyperparameters.
Tutorial: Introduction to ML in Python (2)
Matplotlib, OpenML, feature engineering with scikit-learn.
Tutorial: Pipelines - data preprocessing and feature engineering
Constructing pipelines to include basic data preprocessing techniques: scaling, feature encoding, missing value imputation, feature selection,...
Decision trees, Bagging, Random Forests, Boosting, Gradient Boosting, Stacking.
Deadline Assignment 1. Introduction Assignment 2
Support Vector Machines, maximal margin, Kernel methods.
Bayes' rule, Naive Bayes, Gaussian processes
Machine Learning Pipelines
Preprocessing, pipelines, practical considerations, Q&A
Feedback Assignment 1
Dimensionality Reduction I
PCA, Multi-dimensional scaling, Isomap
Dimensionality Reduction II
Random Projections, Locality-sensitive hashing
Deadline Assignment 2. Introduction Assignment 3
Locality-sensitive hashing, Jaccard similarity, MinHashing
Lloyd's algorithm, k-means++, Gonzalez's algorithm
Feedback Assignment 2
Introduction Individual Assignment
Introduction to Learning Deep Representations
Artificial Neuron, Gradient Descent, Back-propagation
Feedback and Q&A
Deadline Assignment 3. Introduction Assignment 4
Convolutional Neural Networks
Recurrent Neural Networks
Feedback and Q&A
Feedback Assignment 3
Deadline Assignment 4
Feedback Assignment 4
Deadline Individual Assignment
There is no exam. Students are evaluated using a series of 5 problem sets, containing both theoretical and practical assignments. Students work in teams of 2 people for the first 4 problem sets, and have to complete the final (larger) problem set individually.
To pass, you need to score at least 50% on the individual assignment and 60% overall.
Assignments: Deadlines and grade breakdown:
- Assignment 1 (15pt): Thursday March 1, 12:00 (noon)
- Assignment 2 (15pt): Thursday March 15, 12:00 (noon)
- Assignment 3 (15pt): Thursday March 29, 12:00 (noon)
- Assignment 4 (15pt): Thursday April 12, 12:00 (noon)
- Individual Assignment (30pt): Thursday April 20, 23:55
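The pass rule and the point breakdown above can be combined into a small check. This is only a sketch of one reading of the rule: it assumes that "60% overall" means 60% of the summed maximum points (the listed maxima add up to 90), and that a missing assignment counts as 0 points. The function name `passes` is hypothetical.

```python
# Maximum points per assignment, copied from the list above.
MAX_POINTS = {
    "Assignment 1": 15,
    "Assignment 2": 15,
    "Assignment 3": 15,
    "Assignment 4": 15,
    "Individual Assignment": 30,
}


def passes(scores):
    """Return True if `scores` meets both thresholds.

    Assumption: "60% overall" means 60% of the summed maximum points,
    and a missing assignment counts as 0 points.
    """
    total_max = sum(MAX_POINTS.values())
    total = sum(scores.get(name, 0) for name in MAX_POINTS)
    individual = scores.get("Individual Assignment", 0)
    individual_ok = individual >= 0.5 * MAX_POINTS["Individual Assignment"]
    overall_ok = total >= 0.6 * total_max
    return individual_ok and overall_ok
```

Under these assumptions, a student with full marks on the four team assignments but only 14 points on the individual assignment would still fail, because both conditions must hold.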
Participation. As this class endeavors to teach professional skills, we ask that students act professionally and treat all course participants with respect. We also encourage you to offer your ideas and thoughts to the class and to question the material presented.
Assignments. Assignments are due at the time and in the manner specified in the assignment description. Late work will lose 33% of its original point-value for each day late, and once solutions are posted or discussed, late submissions will not be accepted.
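To make the late penalty concrete, here is a rough sketch of the arithmetic. The policy text does not specify the details, so the following are assumptions: every started day counts as a full day late, the 33% is taken from the points the submission originally earned, and the result never drops below zero. The function name `late_score` is hypothetical, and the sketch does not model the cutoff after solutions are posted.

```python
import math


def late_score(points, days_late, penalty_per_day=0.33):
    """Points left after the late penalty.

    Assumptions: every started day counts as a full day late, the penalty is
    taken from the originally earned points, and the result never drops below 0.
    """
    if days_late <= 0:
        return points
    days = math.ceil(days_late)
    return max(0.0, points * (1 - penalty_per_day * days))
```

With these assumptions, a submission worth 12 points keeps about 8.04 points after one day and almost nothing after three days.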
Plagiarism. Plagiarism and cheating will not be tolerated. University policy will be adhered to in all such cases. There is a difference between collaboration and plagiarism. Plagiarism is the act of using another’s work without giving them credit for it. Collaboration is the exchange of ideas, the debate of issues and the examination of readings among each other that enables you to arrive at your own independent thoughts and designs.