Course syllabus

Foundations of Data Mining

Machine learning is the science of making computers act without being explicitly programmed. Instead, algorithms are used to find patterns in data. It is so pervasive today that you probably use it dozens of times a day without knowing it, for instance in web search, speech recognition, and (soon) self-driving cars. It is also a crucial component of data-driven industry (Big Data), scientific discovery, and modern healthcare. In this class, you will learn the foundations of how data mining and machine learning work internally, understand when and how to use key concepts and techniques, and gain hands-on experience in getting them to work for yourself. You'll learn about the theoretical underpinnings of data analysis, and leverage that knowledge to tackle new problems quickly and effectively.


Lecturers

  • dr. ir. Joaquin Vanschoren (MF 7.104a) - Responsible Lecturer
  • dr. Vlado Menkovski (MF 7.097b) - Lecturer
  • dr. Anne Driemel (MF 7.073) - Lecturer

Learning objectives

Upon completion of the course, students should be able to:

  • write a program that builds a predictive model from training data
  • evaluate a predictive model using training/test splits
  • compare the performance of different types of predictive models
  • reason about the mathematical foundations of data mining techniques
  • recognize when a predictive model is overfitting
  • understand and exploit the fundamental bias-variance tradeoff
  • combine the above with dimension-reduction techniques
  • visualize and explore data sets using embeddings and clustering

Required prior knowledge: While there are no strict requirements, it is highly recommended to have a working knowledge of statistics, and to have programming experience. Programming is part of the assignments. The course will mostly feature examples using Python.

Course Structure

The course runs in Q3 and has the following weekly contact hours:

  • Mondays, 10:45 - 12:30: Plenary Lectures (Flux 1.02)
  • Thursdays, 13:45 - 15:30: Tutorials and Feedback (Flux 1.02)
  • Thursdays, 15:45 - 17:30: Plenary Lectures (Flux 1.02)

Materials, Assignments, Questions

We use Canvas for all course activities:

  • Check the 'Assignments' page for the assignments. You will have to submit your assignments here.
  • Pose (course related) questions under 'Discussions'. You are encouraged to answer each other's questions. The lecturers will answer open questions as soon as possible.
  • Check 'Files' and 'Pages' for other resources. The materials will also be available in a GitHub repo.

Please don't email the lecturers directly; for personal questions, use Canvas to send a direct message.

It is your responsibility to keep up to date with postings and activities, although important announcements will also be made in class. It is highly recommended to adjust your email settings so that you are notified of all discussions happening on Canvas.


Schedule

This schedule is preliminary. The order of topics may change, and parts of lectures may be removed (or added).

Feb 5


Course guidelines. Introduction to machine learning. k-Nearest Neighbors.
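As a taste of what's ahead, here is a minimal k-Nearest Neighbors classifier in scikit-learn (the library used in this course). The dataset and the choice of k are illustrative only, not course material:

```python
# k-NN sketch on the classic iris dataset (illustrative example)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # predict via the 5 nearest neighbors
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)        # fraction of correct test predictions
```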

Feb 8

Tutorial: Linear Algebra (A. Driemel)

Basics of linear algebra (matrix operations, projections,...). 

Tutorial: Introduction to ML in Python

Installation and environment setup. NumPy, SciPy, scikit-learn, Jupyter Notebooks.

Introduction Assignment 1

Linear models

Linear regression (least squares), ridge regression, lasso, logistic regression, linear SVMs.
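A hedged sketch of three of the linear models from this lecture, fit with scikit-learn; the synthetic dataset and alpha values are illustrative choices, not course defaults:

```python
# Least squares vs. L2- and L1-penalized linear regression (illustrative)
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)   # ordinary least squares
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty favors sparse coefficients
r2_ols = ols.score(X, y)             # R^2 on the training data
```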


Spring break

Feb 19

Evaluation and model selection

Evaluating predictive models. Avoiding overfitting. Cross-validation. ROC analysis, Bias-Variance analysis. Optimizing hyperparameters.
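The evaluation workflow above can be sketched in a few lines of scikit-learn; the dataset and parameter grid here are illustrative, not prescribed by the course:

```python
# Cross-validation and grid search (illustrative example)
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: five different train/test splits, one score each
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

# hyperparameter optimization: pick k by cross-validated grid search
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 7]}, cv=5)
grid.fit(X, y)
best_k = grid.best_params_["n_neighbors"]
```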

Feb 22

Tutorial: Introduction to ML in Python (2)

Matplotlib, OpenML, feature engineering with scikit-learn.

Data preprocessing and feature engineering

Basic data preprocessing techniques: scaling, feature encoding, missing value imputation, feature selection,...
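Preprocessing steps are typically chained into a single pipeline; a minimal sketch with toy data (the imputation strategy and scaler are illustrative choices):

```python
# Impute missing values, then standardize, in one scikit-learn pipeline
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],     # missing value to be imputed
              [3.0, 6.0]])

pipe = make_pipeline(SimpleImputer(strategy="mean"),  # fill NaN with column mean
                     StandardScaler())                # zero mean, unit variance
X_clean = pipe.fit_transform(X)
```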

Feb 26

Ensemble learning

Decision trees, Bagging, Random Forests, Boosting, Gradient Boosting, Stacking.
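Two of these ensembles side by side, as a hedged sketch; the dataset and parameters are illustrative:

```python
# Bagging of randomized trees vs. stage-wise boosting of shallow trees
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
gb = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
rf_acc = rf.score(X_test, y_test)
gb_acc = gb.score(X_test, y_test)
```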

Mar 1

Q&A Assignment 1, Introduction Assignment 2

Tutorial: More Machine Learning Pipelines

More data processing techniques, practical considerations, OpenML, Q&A

Kernel methods

Support Vector Machines, maximal margin, Kernel methods.
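A sketch of why kernels matter: on data that is not linearly separable, an RBF-kernel SVM can fit a nonlinear boundary that a linear SVM cannot. The dataset and gamma value are illustrative only:

```python
# Linear vs. RBF-kernel SVM on the two-moons toy dataset
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)  # gamma chosen for illustration
linear_acc = linear_svm.score(X, y)               # training accuracy
rbf_acc = rbf_svm.score(X, y)
```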

Mar 5

Deadline Assignment 1.

Bayesian Learning

Bayes' rule, Naive Bayes, Gaussian processes
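A minimal sketch of one of these models: Gaussian Naive Bayes applies Bayes' rule under the "naive" assumption that features are independent given the class. The dataset is illustrative:

```python
# Gaussian Naive Bayes on iris (illustrative example)
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)
accuracy = nb.score(X, y)   # training accuracy, for illustration only
```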

Mar 8

Dimensionality Reduction I

PCA, Multi-dimensional scaling, Isomap
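As a sketch of dimensionality reduction, PCA projects the 4-dimensional iris data onto its two principal components (the dataset is an illustrative choice):

```python
# PCA: 4-D iris data -> 2-D embedding
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                       # 150 x 2 embedding
explained = pca.explained_variance_ratio_.sum()   # variance kept by 2 components
```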

Mar 12

Dimensionality Reduction II 

Random Projections, Locality-sensitive hashing
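A hedged sketch of a random projection: by the Johnson-Lindenstrauss lemma, projecting onto enough random directions approximately preserves pairwise distances. The dimensions below are illustrative:

```python
# Gaussian random projection: 1000 dimensions -> 200
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1000))          # high-dimensional points

proj = GaussianRandomProjection(n_components=200, random_state=0)
X_low = proj.fit_transform(X)

# pairwise distances are approximately preserved
d_high = np.linalg.norm(X[0] - X[1])
d_low = np.linalg.norm(X_low[0] - X_low[1])
ratio = d_low / d_high
```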

Mar 15

Introduction Assignment 3. Feedback Assignment 1.

Locality-sensitive hashing

Locality-sensitive hashing, Jaccard similarity, MinHashing
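The MinHash idea can be sketched in pure Python: the fraction of positions where two MinHash signatures agree estimates the Jaccard similarity of the underlying sets. The hash family h(x) = (a*x + b) mod p is a standard illustrative choice:

```python
# Estimating Jaccard similarity with MinHash signatures
import random

def jaccard(a, b):
    """Exact Jaccard similarity |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

def minhash_signature(s, hashes):
    """Signature = the minimum hash value of the set under each hash function."""
    return [min(h(x) for x in s) for h in hashes]

random.seed(0)
p = 2_147_483_647          # a large prime
hashes = [(lambda a, b: (lambda x: (a * x + b) % p))(
              random.randrange(1, p), random.randrange(p))
          for _ in range(200)]

A = set(range(0, 60))
B = set(range(30, 90))     # overlap of 30 out of 90 -> Jaccard 1/3
exact = jaccard(A, B)

sig_a = minhash_signature(A, hashes)
sig_b = minhash_signature(B, hashes)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(hashes)
```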

Mar 16

Deadline Assignment 2

Mar 19


Clustering

Lloyd's algorithm, k-means++, Gonzalez's algorithm
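In scikit-learn, k-means++ seeding followed by Lloyd's iterations is a single call; the blob dataset below is an illustrative choice:

```python
# k-means++ seeding + Lloyd's algorithm via scikit-learn
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster assignment for each point
centers = km.cluster_centers_  # one centroid per cluster
```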

Mar 22

Feedback Assignment 2

Introduction Individual Assignment

Introduction to Learning Deep Representations

Artificial Neuron, Gradient Descent, Back-propagation 
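A hedged sketch of the first two of these ideas: training a single artificial neuron (a logistic unit) with batch gradient descent. The data and learning rate are illustrative:

```python
# One neuron, sigmoid activation, trained by gradient descent
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # linearly separable labels

w = np.zeros(2)
b = 0.0
lr = 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid activation
    grad_w = X.T @ (p - y) / len(y)         # gradient of cross-entropy loss
    grad_b = float(np.mean(p - y))
    w -= lr * grad_w                        # one gradient-descent step
    b -= lr * grad_b

accuracy = float(np.mean((p > 0.5) == y))
```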


Mar 26

Multilayer Perceptron

Deep neural networks, activation functions, output functions, loss functions for classification and regression 
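A minimal sketch of a multilayer perceptron. The deep learning lectures use Keras; this illustration uses scikit-learn's MLPClassifier instead, with illustrative layer sizes and dataset:

```python
# Two-hidden-layer MLP with ReLU activations (illustrative example)
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(32, 32),  # two hidden layers
                    activation="relu",
                    max_iter=2000, random_state=0).fit(X, y)
accuracy = mlp.score(X, y)   # training accuracy
```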

Mar 29

Deadline Assignment 3. Introduction Assignment 4

Tutorial: Backpropagation, Keras MLP implementation

Simple Python API for backpropagation; Keras API for MLPs.

Convolutional Neural Networks

Neural network models for spatially correlated data (e.g., images)

Apr 2

Easter Monday

Apr 5

Recurrent Neural Networks

Neural network models for temporally correlated data (time-series)

Tutorial: Convolutional Neural Networks, Recurrent Neural Networks 

Keras implementations of CNNs and RNNs

Apr 9


Feedback Assignment 3 

Apr 12

Deadline Assignment 4

Apr 19

Feedback Assignment 4



Deadline Individual Assignment



There is no exam. Students are evaluated using a series of 5 problem sets, containing both theoretical and practical assignments. Students work in teams of 2 people for the first 4 problem sets, and have to complete the final (larger) problem set individually.

To pass, you need to score at least 50% on the individual assignment and 60% overall.

Assignment deadlines and grade breakdown:

  • Assignment 1 (15pt): Thursday March 5, 17:00
  • Assignment 2 (15pt): Thursday March 17, 23:55
  • Assignment 3 (15pt): Thursday March 29, 12:00 (noon)
  • Assignment 4 (15pt): Thursday April 12, 12:00 (noon)
  • Individual Assignment (30pt): Sunday April 22, 23:55



Students who do not pass the class can do a resit in the next quarter. The resit consists of an individual assignment, much like the regular individual assignment, but it replaces the entire course grade. It will be released before the middle of Q4, and the deadline is the 1st of July.


Course Policies

Participation. As this class endeavors to teach professional skills, we ask that students act professionally and treat all course participants with respect. We also encourage you to offer your ideas and thoughts to the class and to question the material presented.

Assignments. Assignments are due at the time and in the manner specified in the assignment description. Late work will lose 33% of its original point-value for each day late, and once solutions are posted or discussed, late submissions will not be accepted.

Plagiarism. Plagiarism and cheating will not be tolerated. University policy will be adhered to in all such cases. There is a difference between collaboration and plagiarism. Plagiarism is the act of using another’s work without giving them credit for it. Collaboration is the exchange of ideas, the debate of issues and the examination of readings among each other that enables you to arrive at your own independent thoughts and designs.
