Foundations of Data Mining
Machine learning is the science of making computers act without being explicitly programmed. Instead, algorithms are used to find patterns in data. It is so pervasive today that you probably use it dozens of times a day without knowing it, for instance in web search, speech recognition, and (soon) self-driving cars. It is also a crucial component of data-driven industry (Big Data), scientific discovery, and modern healthcare. In this class, you will learn how data mining and machine learning work internally, understand when and how to use key concepts and techniques, and gain hands-on experience in getting them to work for yourself. You'll learn the theoretical underpinnings of data analysis and use them to tackle new problems quickly and effectively.
- dr. ir. Joaquin Vanschoren (email@example.com) MF 7.104a - Responsible Lecturer
- dr. Vlado Menkovski (firstname.lastname@example.org) MF 7.097b - Lecturer
- dr. Anne Driemel (email@example.com) MF 7.073 - Lecturer
Upon completion of the course, students should be able to:
- write a program that builds a predictive model from training data
- evaluate a predictive model using test/training splits
- compare the performance of different types of predictive models
- reason about the mathematical foundations of data mining techniques
- recognize when a predictive model is overfitting
- understand and exploit the fundamental bias-variance tradeoff
- combine the above with dimension-reduction techniques
- visualize and explore data sets using embeddings and clustering
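To make the first few goals concrete, here is a minimal sketch of a train/test evaluation in plain Python. It is an illustration only, not course material: the synthetic data set, the helper names (`make_data`, `knn_predict`, `accuracy`), and the choice of k-Nearest Neighbors are assumptions made for this example, and in the tutorials you would normally use scikit-learn instead of hand-written code.

```python
import random

def make_data(n, rng):
    """Synthetic 2-D points: label is 1 if x + y > 1, with about 10% of labels flipped as noise."""
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        label = int(x + y > 1)
        if rng.random() < 0.1:
            label = 1 - label  # label noise
        data.append(((x, y), label))
    return data

def knn_predict(train, point, k):
    """Majority vote among the k training points closest to `point` (k should be odd)."""
    def squared_distance(item):
        (x, y), _ = item
        return (x - point[0]) ** 2 + (y - point[1]) ** 2
    nearest = sorted(train, key=squared_distance)[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)

def accuracy(train, eval_set, k):
    """Fraction of points in `eval_set` that a k-NN model built on `train` labels correctly."""
    correct = sum(knn_predict(train, point, k) == label for point, label in eval_set)
    return correct / len(eval_set)

rng = random.Random(0)                  # fixed seed so the run is repeatable
data = make_data(300, rng)              # points are generated in random order
train, test = data[:200], data[200:]    # simple hold-out split

for k in (1, 15):
    print(f"k={k}: train accuracy={accuracy(train, train, k):.2f}, "
          f"test accuracy={accuracy(train, test, k):.2f}")
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect even though a tenth of the labels are noise. Comparing that number with the accuracy on the held-out points is one simple way to spot overfitting.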
Required prior knowledge: While there are no strict requirements, it is highly recommended to have a working knowledge of statistics, and to have programming experience. Programming is part of the assignments. The course will mostly feature examples using Python.
The course runs in Q3 and has the following weekly contact hours:
- Mondays, 10:45 - 12:30: Plenary Lectures (Flux 1.02)
- Thursdays, 13:45 - 15:30: Tutorials and Feedback (Flux 1.02)
- Thursdays, 15:45 - 17:30: Plenary Lectures (Flux 1.02)
Materials, Assignments, Questions
We use Canvas for all course activities:
- Check the 'Assignments' page for the assignments. You will have to submit your assignments here.
- Post course-related questions under 'Discussions'. You are encouraged to answer each other's questions. The lecturers will answer open questions as soon as possible.
- Check 'Files' and 'Pages' for other resources. The materials will also be available in a GitHub repo.
Please don't email the lecturers directly, except for personal questions. Even in those cases, please use Canvas to send a direct message.
It is your responsibility to keep up to date with postings and activities, but these will also be clearly announced in class. It is highly recommended that you adjust your email settings so that you are notified of all discussions happening on Canvas.
This schedule is preliminary. The order may change and parts of lectures may be removed (or added).
Course guidelines. Introduction to machine learning. k-Nearest Neighbors.
Tutorial: Linear Algebra (A. Driemel)
Basics of linear algebra (matrix operations, projections, ...).
Tutorial: Introduction to ML in Python
Installation and environment setup. NumPy, SciPy, scikit-learn, Jupyter Notebooks.
Introduction Assignment 1
Linear regression (least squares), ridge regression, lasso, logistic regression, linear SVMs.
Evaluation and model selection
Evaluating predictive models. Avoiding overfitting. Cross-validation. ROC analysis, Bias-Variance analysis. Optimizing hyperparameters.
Tutorial: Introduction to ML in Python (2)
Matplotlib, OpenML, feature engineering with scikit-learn.
Data preprocessing and feature engineering
Basic data preprocessing techniques: scaling, feature encoding, missing value imputation, feature selection, ...
Decision trees, Bagging, Random Forests, Boosting, Gradient Boosting, Stacking.
Q&A Assignment 1, Introduction Assignment 2
Tutorial: More Machine Learning Pipelines
More data processing techniques, practical considerations, OpenML, Q&A
Support Vector Machines, maximal margin, Kernel methods.
Deadline Assignment 1.
Bayes' rule, Naive Bayes, Gaussian processes
Dimensionality Reduction I
PCA, Multi-dimensional scaling, Isomap
Dimensionality Reduction II
Random Projections, Locality-sensitive hashing
Introduction Assignment 3. Feedback Assignment 1.
Locality-sensitive hashing, Jaccard similarity, MinHashing
Deadline Assignment 2
Lloyd's algorithm, k-means++, Gonzalez's algorithm
Feedback Assignment 2
Introduction Individual Assignment
Introduction to Learning Deep Representations
Artificial Neuron, Gradient Descent, Back-propagation
Deep neural networks, activation functions, output functions, loss functions for classification and regression
Deadline Assignment 3. Introduction Assignment 4
Tutorial: Backpropagation, Keras MLP implementation
Simple Python API for backpropagation, Keras API for MLPs
Convolutional Neural Networks
Neural network models for spatially correlated data
Recurrent Neural Networks
Neural network models for temporally correlated data (time-series)
Tutorial: Convolutional Neural Networks, Recurrent Neural Networks
Keras implementation for CNN and RNN
Feedback Assignment 3
Deadline Assignment 4
Feedback Assignment 4
Deadline Individual Assignment
There is no exam. Students are evaluated using a series of 5 problem sets, containing both theoretical and practical assignments. Students work in teams of 2 people for the first 4 problem sets, and have to complete the final (larger) problem set individually.
To pass, you need to score at least 50% on the individual assignment and 60% overall.
Assignment deadlines and grade breakdown:
- Assignment 1 (15pt): Thursday March 5, 17:00
- Assignment 2 (15pt): Thursday March 17, 23:55
- Assignment 3 (15pt): Thursday March 29, 12:00 (noon)
- Assignment 4 (15pt): Thursday April 12, 12:00 (noon)
- Individual Assignment (30pt): Sunday April 22, 23:55
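The pass rule and the point values listed above can be checked with a few lines of Python. This is a minimal sketch, not an official grading tool: the function name `passes` and the score dictionary are hypothetical, and it assumes that the 60% overall threshold is measured against the sum of the points listed above.

```python
# Maximum points per component, copied from the grade breakdown in this syllabus.
MAX_POINTS = {"A1": 15, "A2": 15, "A3": 15, "A4": 15, "individual": 30}

def passes(scores):
    """Return True if `scores` meets both pass conditions.

    Assumption: "60% overall" is taken relative to the sum of the listed points.
    Integer arithmetic is used so that borderline totals are compared exactly.
    """
    total_max = sum(MAX_POINTS.values())
    total = sum(scores[name] for name in MAX_POINTS)
    individual_ok = 100 * scores["individual"] >= 50 * MAX_POINTS["individual"]
    overall_ok = 100 * total >= 60 * total_max
    return individual_ok and overall_ok

# Strong team work does not compensate for a weak individual assignment.
print(passes({"A1": 12, "A2": 12, "A3": 12, "A4": 12, "individual": 14}))
# Both thresholds met.
print(passes({"A1": 9, "A2": 9, "A3": 9, "A4": 9, "individual": 18}))
```

The first call fails on the individual threshold alone, which shows that the two conditions are checked independently.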
Students who do not pass the class can do a resit in the next quarter. The resit consists of an individual assignment, much like the normal individual assignment, but it replaces the entire course grade. It will be released before the middle of Q4, and the deadline is June 24 (just before the Q4 exam week).
Participation. As this class endeavors to teach professional skills, we ask that students act professionally and treat all course participants with respect. We also encourage you to offer your ideas and thoughts to the class and to question the material presented.
Assignments. Assignments are due at the time and in the manner specified in the assignment description. Late work will lose 33% of its original point-value for each day late, and once solutions are posted or discussed, late submissions will not be accepted.
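As a worked example of the late penalty, the sketch below computes what remains of a score after a number of days. It rests on two assumptions that the policy text does not spell out: lateness is counted in whole days, and the 33% is taken from the score the work would have earned on time. The function name `late_score` is hypothetical.

```python
def late_score(earned, days_late):
    """Score left after the late penalty.

    Assumptions: `days_late` is a whole number of days, and each day removes
    33% of the score earned on time. The result never drops below zero.
    """
    remaining_percent = max(0, 100 - 33 * days_late)
    return earned * remaining_percent / 100

for days in range(5):
    print(days, late_score(12, days))
```

Under these assumptions almost nothing is left after three days, and from the fourth day on the score is zero.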
Plagiarism. Plagiarism and cheating will not be tolerated. University policy will be adhered to in all such cases. There is a difference between collaboration and plagiarism. Plagiarism is the act of using another’s work without giving them credit for it. Collaboration is the exchange of ideas, the debate of issues and the examination of readings among each other that enables you to arrive at your own independent thoughts and designs.