Course syllabus

Foundations of Data Mining

Machine learning is the science of getting computers to act without being explicitly programmed: instead, algorithms find patterns in data. It is so pervasive that you probably use it dozens of times a day without knowing it, for instance in web search, speech recognition, and (soon) self-driving cars. It is also a crucial component of data-driven industry (Big Data), scientific discovery, and modern healthcare. In this class, you will learn how data mining and machine learning work internally, when and how to use the key concepts and techniques, and how to get them to work yourself through hands-on practice. You will learn the theoretical underpinnings of data analysis and use them to tackle new problems quickly and effectively.

People

  • dr. ir. Joaquin Vanschoren (j.vanschoren@tue.nl) MF 7.104a - Responsible Lecturer
  • dr. Vlado Menkovski (v.menkovski@tue.nl) MF 7.097b - Lecturer
  • dr. Anne Driemel (a.driemel@tue.nl) MF 7.073 - Lecturer

Learning objectives

Upon completion of the course, students should be able to

  • write a program that builds a predictive model from training data
  • evaluate a predictive model using test/training splits
  • compare the performance of different types of predictive models
  • reason about the mathematical foundations of data mining techniques
  • recognize when a predictive model is overfitting
  • understand and exploit the fundamental bias-variance tradeoff
  • combine the above with dimension-reduction techniques
  • visualize and explore data sets using embeddings and clustering

Required prior knowledge: While there are no strict requirements, it is highly recommended to have a working knowledge of statistics, and to have programming experience. Programming is part of the assignments. The course will mostly feature examples using Python.
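To give a feel for the first learning objectives (building a predictive model from training data and evaluating it on a held-out split), here is a minimal sketch of a k-Nearest Neighbors classifier. It is an illustration only, not course material: the toy data and the helper names `make_blobs`, `knn_predict` and `accuracy` are assumptions made for this example, and it uses only the Python standard library (Python 3.8+) so that it runs without any installation.

```python
import math
import random
from collections import Counter


def make_blobs(n_per_class, rng):
    """Generate a toy 2-D data set: two Gaussian blobs with labels 0 and 1."""
    data = []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (3.0, 3.0)]):
        for _ in range(n_per_class):
            point = (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            data.append((point, label))
    return data


def knn_predict(train, point, k):
    """Predict a label by majority vote among the k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def accuracy(train, samples, k):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(knn_predict(train, p, k) == label for p, label in samples)
    return correct / len(samples)


rng = random.Random(0)          # fixed seed so the split is reproducible
data = make_blobs(100, rng)     # 200 labeled points in total
rng.shuffle(data)
split = int(0.75 * len(data))   # 75% training, 25% test
train, test = data[:split], data[split:]

for k in (1, 15):
    print(f"k={k}: train accuracy={accuracy(train, train, k):.2f}, "
          f"test accuracy={accuracy(train, test, k):.2f}")
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect by construction. Only the accuracy on the held-out test split says something about how well the model generalizes, which is why the two are reported side by side.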

Course Structure

The course runs in Q3 and has the following weekly contact hours:

  • Mondays, 10:45 - 12:30: Plenary Lectures (Flux 1.02)
  • Thursdays, 13:45 - 15:30: Tutorials and Feedback (Flux 1.02)
  • Thursdays, 15:45 - 17:30: Plenary Lectures (Flux 1.02)

Materials, Assignments, Questions

We use Canvas for all course activities:

  • Check the 'Assignments' page for the assignments. You will have to submit your assignments here.
  • Pose (course related) questions under 'Discussions'. You are encouraged to answer each other's questions. The lecturers will answer open questions as soon as possible.
  • Check 'Files' and 'Pages' for other resources. The materials will also be available in a GitHub repo.

Please don't email the lecturers directly. For personal questions, send a direct message through Canvas instead.

It is your responsibility to keep up to date with postings and activities, although these will also be clearly announced in class. We highly recommend adjusting your notification settings so that you are notified of all discussions on Canvas.

Schedule

This schedule is preliminary. The order may change and parts of lectures may be removed (or added).

Feb 5

Introduction

Course guidelines. Introduction to machine learning. k-Nearest Neighbors.

Vanschoren
Feb 8

Tutorial: Linear Algebra (A. Driemel)

Basics of linear algebra (matrix operations, projections,...). 

Tutorial: Introduction to ML in Python

Installation and environment setup. NumPy, SciPy, scikit-learn, Jupyter Notebooks.

Introduction Assignment 1

Linear models

Linear regression (least squares), ridge regression, lasso, logistic regression, linear SVMs.

Vanschoren

Spring break

Feb 19

Evaluation and model selection

Evaluating predictive models. Avoiding overfitting. Cross-validation. ROC analysis, Bias-Variance analysis. Optimizing hyperparameters.

Vanschoren
Feb 22

Tutorial: Introduction to ML in Python (2)

Matplotlib, OpenML, feature engineering with scikit-learn.

Tutorial: Pipelines - data preprocessing and feature engineering

Constructing pipelines to include basic data preprocessing techniques: scaling, feature encoding, missing value imputation, feature selection,...

Vanschoren
Feb 26

Ensemble learning

Decision trees, Bagging, Random Forests, Boosting, Gradient Boosting, Stacking.

Vanschoren
Mar 1

Q&A Assignment 1, Introduction Assignment 2

Tutorial: More Machine Learning Pipelines

More data processing techniques, practical considerations, OpenML, Q&A

Kernel methods

Support Vector Machines, maximal margin, Kernel methods.

Vanschoren
Mar 5

Deadline Assignment 1.

Bayesian Learning

Bayes' rule, Naive Bayes, Gaussian processes

Vanschoren
Mar 8

Dimensionality Reduction I

PCA, Multi-dimensional scaling, Isomap

Driemel
Mar 12

Dimensionality Reduction II 

Random Projections, Locality-sensitive hashing

Driemel
Mar 15

Introduction Assignment 3. Feedback Assignment 1.

Locality-sensitive hashing

Locality-sensitive hashing, Jaccard similarity, MinHashing

Driemel
Mar 16

Deadline Assignment 2

Mar 19

Clustering

Lloyd's algorithm, k-means++, Gonzalez' algorithm

Driemel
Mar 22

Feedback Assignment 2

Introduction Individual Assignment

Introduction to Learning Deep Representations

Artificial Neuron, Gradient Descent, Back-propagation 

Menkovski

Mar 26

Multilayer Perceptron

Deep neural networks, activation functions, output functions, loss functions for classification and regression 

Menkovski
Mar 29

Deadline Assignment 3. Introduction Assignment 4

Tutorial: Backpropagation, Keras MLP implementation

Simple Python API for backpropagation, Keras API for MLPs

Convolutional Neural Networks

Neural network models for spatially correlated data  

Menkovski
Apr 2

Easter Monday

Apr 5

Recurrent Neural Networks

Neural network models for temporally correlated data (time-series)

Tutorial: Convolutional Neural Networks, Recurrent Neural Networks 

Keras implementations of CNNs and RNNs

Menkovski
Apr 9

Feedback Assignment 3

Apr 12

Deadline Assignment 4

Apr 19

Feedback Assignment 4

Apr 22

Deadline Individual Assignment

Evaluation

There is no exam. Students are evaluated using a series of 5 problem sets, containing both theoretical and practical assignments. Students work in teams of 2 people for the first 4 problem sets, and have to complete the final (larger) problem set individually.

To pass, you need to score at least 50% on the individual assignment and 60% overall.

Assignment deadlines and grade breakdown:

  • Assignment 1 (15pt): Monday March 5, 17:00
  • Assignment 2 (15pt): Thursday March 17, 23:55
  • Assignment 3 (15pt): Thursday March 29, 12:00 (noon)
  • Assignment 4 (15pt): Thursday April 12, 12:00 (noon)
  • Individual Assignment (30pt): Sunday April 22, 23:55

Course Policies

Participation. As this class endeavors to teach professional skills, we ask that students act professionally and treat all course participants with respect. We also encourage you to offer your ideas and thoughts to the class and to question the material presented.

Assignments. Assignments are due at the time and in the manner specified in the assignment description. Late work will lose 33% of its original point-value for each day late, and once solutions are posted or discussed, late submissions will not be accepted.

Plagiarism. Plagiarism and cheating will not be tolerated. University policy will be adhered to in all such cases. There is a difference between collaboration and plagiarism. Plagiarism is the act of using another’s work without giving them credit for it. Collaboration is the exchange of ideas, the debate of issues and the examination of readings among each other that enables you to arrive at your own independent thoughts and designs.
