Lectures

Here you can find information on the lectures, as well as information on when you should have enough information to be able to attempt the various practical and theoretical python notebook assignments. The project should be started as soon as possible, and you should try different machine learning algorithms on the data set as you learn them in the course.

There is a menu item for each lecture where you can find a reading guide for the textbook, links to additional material and links to slides.

Please note that the dates still could change, and the dates in timeedit are always correct.

Lecture Date Topic
1 1/15 Introduction and Overview of the Course
2 18/1 Linear Regression as Machine Learning

You should now have all the knowledge you need to attempt practical notebooks 1 and 2 (P1 & P2)

Lecture Date Topic
Help Session 18/1
3 22/1 Probability and Naive Bayes Classification

You now should have the knowledge to attempt the the first theoretical notebook (T1).

Lecture Date Topic
4 24/1 Logistic Regression and Regularisation
Help Session 25/1

You should now be able to attempt the second theoretical notebook (T2) and the third practical notebook (P3).

Lecture Date Topic
5 30/1 Support Vector Machines
Help Session 1/2
6 2/2 Guest Lecture Madhushanka Padmal Cross Validation and Feature Encoding

You should now be able to attempt the final practical notebook (P4).

Lecture Date Topic
7 6/2 Clustering and Nearest Neighbours
8 8/2 Decision Trees
Help Session 8/2

You should now be able to attempt theoretical notebook 3 (T3) and (T4)

Lecture Date Topic
9 12/2 Principle Component Analysis and Preprocessing
Help Session 15/2
10 20/2 Gradient Boosting & AdaBoost
Help Session 22/2
11 27/2 Ethics and Bias in Machine Learning (Slides on Studium)
Exam 12/3

Practical and Theoretical Notebooks

There are 4 practical and theoretical notebooks that are done individually as assignments. Links will be provided but the topics covered in the notebooks are as follows.

Practical Notebooks

  • P1 : Basic programming with Python, lists, sets and an introduction to NumPy.
  • P2 : Introduction to Pandas
  • P3 : Linear and Logistic regression with scikit-learn
  • P4 : Preprocessing, feature engineering and cross validation.

Theoretical Notebooks

  • T1 - Logistic Regression
  • T2 - Using naive Bayes for classifying tweets.
  • T3 - Using K-means classifiers
  • T4 - The Entropy of the normal distribution and binary decision trees using ID3.

Deadlines

These should be the same deadlines as are in Studium. If there is any discrepancy then please inform me.

What When?
P1 & P2 29/1 13:00
T1 & T2 5/2 13:00
P3 & P4 19/2 13:00
T3 & T4 26/2 13:00
Project 8/3 13:00

Subsections of Lectures

Lecture 1: Introduction and Overview of the Course

Today’s Topic

Overview of the course and introduction to machine learning. What is supervised learning and what is unsupervised learning.

I use these slides.

Reading Guide

What should I know by the end of this lecture?

  • How is the course organised?
  • What is machine learning?
  • What is supervised learning?
  • What is unsupervised learning?

Lecture 2: Linear Regression as Machine Learning

Today’s Topics

Linear Regression

Linear Regression as a machine learning algorithm. Machine learning algorithms and hypothesises. In short a machine learning find the best hypothesis that explains that data. A cost function (or an error function, or a loss function) measure how far way a hypothesis is from explaining the data: the smaller the cost, the better the hypothesis.

Ideally you want an algorithm takes the training data and gives you the hypothesis that with the smallest cost value. With linear regression this is possible (using linear algebra), but in general it is not possible.

If you have been reading about neural networks, then in a neural network the weight roughly corresponds to the set of all possible hypothesis.

Training and Test Sets

As the course moves along we will learn more about best practices with machine learning. The first important idea is that you should split your data into a test set and a training set. If do not do this then there is a possibility that you will over fit to your data set, and when you meet new examples your machine learning system will not perform that well. It is also important to keep in mind that when you are using gradient descent to find the best hypothesis you use the training set, but when you are evaluating the performance of the learning algorithm you should use the test set. Later on when we look at cross validation we will look at more advance ways to divide up your data.

  • The slides can be found here..

Reading Guide

What should I know by the end of this lecture?

  • How does linear regression work with one variable?
  • How does linear regression work with many variables?
  • What is a hypothesis in a machine learning algorithm?
  • What is a cost function? Note that in machine learning there is no standard terminology. This is because machine learning comes from many different disciplines. Other names for the cost function are the Error function and the loss function.
  • What is the goal of a machine learning algorithm with the hypothesis and the cost function? What does the cost function measure? Why is a low value of the cost function desirable?
  • How does gradient descent work for linear regression? Can you derive it from first principles?
  • Why is it necessary to split the data up into a training and a test set?

Lecture 3: Probability and Naive Bayes Classification

Today’s Topic

Using Bayes’ theorem for machine learning. You should do some revision on the use of Bayes’ theorem in general. In this lecture you will look at how to use Bayes’ theorem to build a spam detector. One important idea to take away from this lecture is that there are a variety of ways implementing spam detection: in particular there are different feature models that you can use that give you different ways of calculating the relevant probabilities. It is important that you understand the difference between the different ways of implementing spam detection.

Reading Guide

What should I know by the end of this lecture?

  • How do I use Bayes’ theorem?
  • How do I use Bayes’ theorem to build a simple spam detector that only uses one word?
  • What is independence assumption in the naive Bayes’ algorithm?
  • What are the various models estimating the probabilities for spam detection?
  • What is Laplacian smoothing and how does it work in the context of spam detection?
  • When do you need to use logarithms to calculate the relevant probabilities, and how do you use them?

Subsections of Lecture 3: Probability and Naive Bayes Classification

Naive Bayes for Spam Classification

There are a lot of tutorials and youtube videos out there on using Naive Bayes for document classification. None of these tutorials are wrong, but they often hide some subtle points that if you think too hard about you will get confused. In this posts I want to explain what is really going on in Naive Bayes for spam classification.

This post assumes that you are already familiar with Bayes’ theorem.

Rather foolishly I did all the calculations in the post by hand. If you find any errors the please report them to me.

Our data set

To make things more concrete we will work on a very small data set where we can do the calculations directly. We are classifying micro-tweets of exactly 3 words. Our training set is as follows. $S$ indicates that a message is spam and $\neg S$ indicates that a message is not spam. $\neg S$
Number Tweet Spam ( $S$ or $\neg S$ )
1 money aardvark boondoggle $S$
2 money money money $S$
3 money money world $S$
4 money world world $S$
5 viagra money back $S$
6 viagra heart honey $S$
7 aardvark boondoggle world $\neg S$ (not spam)
8 honey honey honey $\neg S$ (not spam)
9 viagra heart money $\neg S$ (not spam)
10 money honey now $\neg S$ (not spam)

Background: Classifying with only one word.

As a warm up let's just build a classifier that uses one particular word $w = \mathrm{money}$. Bayes' Theorem should be familiar to you by now: $$ P(S|w) = \frac{P(w|S)P(S)}{P(w)} $$ $P(S|w)$ is the even that an email is Spam given that the word $w$ occurs in it. Using Bayes theorem we can calculate the probability that a message containing the word $w$ is spam. We can estimate the values $P(w|S)$, $P(S)$ and $P(w)$ from out data set. $$ P(S) = \frac{\mathrm{number\ spam}}{\mathrm{total\ messages}} = \frac{6}{10} $$ To estimate $P(w|S)$ we have to count the number of times that a particular word occurs in a spam message. So $$ P(\mathrm{money}|S) = \frac{5}{6} $$ When you are only considering a single word then estimating $P(\mathrm{money})$ is easy. It is the ratio of the number of tweets that contain the word 'money' and the total number of tweets. The word money appears in $7$ tweets. So $$ P(\mathrm{money}) = \frac{7}{10} $$ So if we get a message that contains the word 'money' we can calculate the probability that it is a spam message. $$ P(S|\mathrm{money}) = \frac{P(w|S)P(S)}{P(w)} = \frac{\frac{5}{6}\frac{6}{10}}{\frac{7}{10}} = \frac{5}{10}\frac{10}{7} = \frac{5}{7} \approx 0.71 $$ There is an important identity that will become useful later on $$ P(\mathrm{money}) = P(\mathrm{money}|S)P(S) + P(\mathrm{money}|\neg S)P(\neg S) $$ So $$ P(\mathrm{money})= \frac{5}{6}\frac{6}{10} + \frac{2}{4}\frac{4}{10} = \frac{7}{10} $$

First Pitfall: Estimating the probabilities

So how to we estimate the probabilities $P(w|S)$ , $P(S)$ and $P(w)$? What do they really mean? The probabilities $P(S)$ and $P(\neg S$) are unambiguous. They are just the probability that a tweet is spam or not. But $P(w|S)$, $P(w|\neg S)$, and $P(w)$ can mean different things depending exactly which model we use to calculate the probabilities.

There are two models:

  • (A): To calculate $P(\mathrm{money}|S)$. There are 6 messages that are spam and in those 6 messages 5 of them (1,2,3,4,5) contain the word money so $P(\mathrm{money}|S) = 5/6$, and of the 10 messages 7 of them (1,2,3,4,5,9,10) contain the word 'money' so $P(\mathrm{money}) = 7/10$. This is exactly what we did above.

  • (B): To calculate $P(\mathrm{money})$ there are $10\times 3 = 30$ words in our training set and the word money appears 10 times so $$P(\mathrm{money}) = 10/30.$$ To calculate $P(\mathrm{money}|S)$ there are 6 spam messages each of 3 words long. In the words of the spam messages the word 'money' appears 8 times. So $$ P(\mathrm{money} | S) = \frac{8}{3 \times 6} = \frac{8}{18} = \frac{4}{9} $$ The probability a message being spam is still $6/10$. So if I get the message 'money for nothing' then the probability that it is spam is calculated as before $$ P(S|\mathrm{money}) = \frac{P(\mathrm{money}|S)P(S)}{P(\mathrm{money})} = \frac{{\frac{8}{3 \times 6}\times \frac{6}{10}}}{ \frac{10}{30}} = \frac{8}{10} $$ It seems that if spammers are prone to repeat words in their message then this increases the probability that a message containing that word is spam.

So how do you calculate the probability that the message 'money money mammon' is spam? In model (A) it does not matter how many times 'money' appears in a message: you only count the number of messages 'money' appears in. While in model (B) there is some weighting of the number of times that a word appears. But to calculate $P(S|\mathrm{money}^2)$ (where is short hand for 'money appearing twice) we have calculate $P(\mathrm{money}^2|S)$. How you do this depends a bit on your model and the assumptions underlying the model. We'll get to that later.
Take Home Message

Take home message

So the first take home message is be careful how you count the words and how you calculate the probabilities. If you confuse model (A) and model (B) while writing your code you will get strange answers (as I did at one point).

Naive Bayes: first version

We are going to use model (A). That is we going to ignore how many times a word appears in a message. We are only interested if the word appears on the message or not. One word is not going to much of a spam classifier. Even in our little data set above, the word ‘money’ can appear in spam and non spam messages. We will get a better classifier if we take into account more words. Our data set is quite small and for each word we can count the number of times it appears in a spam tweet and the number of times it appears in a non-spam tweet.

Word occurrences in spam occurrences in non spam
$w_1 = \mathrm{money}$ 5 2
$w_2 = \mathrm{world}$ 2 1
$w_3 = \mathrm{viagra}$ 2 1
$w_4 = \mathrm{aardvark}$ 1 1
$w_5 = \mathrm{heart}$ 1 1
$w_6 = \mathrm{boondoggle}$ 1 1
$w_7 = \mathrm{honey}$ 1 2
$w_8 = \mathrm{back}$ 1 0
$w_9 = \mathrm{now}$ 0 1
You can turn these counts into probabilities, and thus you can calculate quantities like $P(\mathrm{money}|S) = 5/6$, $P(\mathrm{money}|\neg S) = 2/4$. Suppose I receive a message 'viagra money boondoggle' what is the probability that it is spam message? When we use Bayes' theorem we have to calculate $$P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle}|S)$$ where $$\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle}$$ is the event that the words 'viagra', 'money' and 'boondoggle' appears in a message.

The Naive in Naive Bayes

We need to make an independence assumption. In a spam or non spam message the probability of words are independent. That is $$ P(w_1 \land w_2 \land \cdots \land w_n | S) = P(w_1|S)P(w_1|S) \cdots P(w_n|S) $$ and $$ P(w_1 \land w_2 \land \cdots \land w_n | \neg S) = P(w_1|\neg S)P(w_1|\neg S) \cdots P(w_n|\neg S) $$ Note this is a weaker assumption than simply saying $$ P(w_1 \cdots w_n) = \prod_{1\leq i \leq n} P(w_i) $$

Take home message

Note that because we have made the assumptions that $$P(w_1 \land w_2 \land \cdots \land w_n | S) = \prod_{i=1}^n P(w_i|S)$$ and $$P(w_1 \land w_2 \land \cdots \land w_n | S) = \prod_{i=1}^n P(w_i|S)$$ it does not make sense to directly estimate the probabilities of $P(w_i)$ directly from the data set. Later on we will see that you do actually need the probabilities $P(w_1 \cdots w_n)$ to decide if a message is spam or not. If you want to calculate the probability $P(w_1 \cdots w_n)$ then you must use the identity $P(w_1 \cdots w_n)$ equals $$ P(w_1 \cdots w_n|S)P(S) + P(w_1 \cdots w_n|\neg S)P(\neg S) $$

The independence assumption is why Naive Bayes is referred to as naive. Although this model could be improved. It seems that the probability of one word appearing in a message should not be independent of another word. If a spammer write ‘money’ then he is likely to also include ‘viagra’ in the message. Even so, assuming independence works very well in practice.

Calculating the spam probability.

We can now apply Bayes' theorem $P(S | \mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle})$ equals $$ \frac{ P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle}|S) P(S)}{P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle})} $$ From the independence assumption we have that $P(S | \mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle})$ equals $$ \frac{ P(\mathrm{viagra}|S)P(\mathrm{money}|S)P(\mathrm{boondoggle}|S)P(S)} {P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle})} $$ To calculate $P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle})$ we use the identity above. Taking product $$P(\mathrm{viagra})P(\mathrm{money})P(\mathrm{boondoggle})$$ is the wrong answer. So instead we get that $P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle})$ equals $P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle}|S)P(S)$ plus $P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle}|\neg S)P(\neg S)$. The by the independence assumption $P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle}|S)$ equals $$ P(\mathrm{viagra}|S)P(\mathrm{money}|S)P(\mathrm{boondoggle}|S) = \frac{2}{6}\frac{5}{6}\frac{1}{6} = \frac{5}{108} $$ Putting the numbers in we get $P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle})$ equals $$ \left(\frac{2}{6}\cdot \frac{5}{6}\cdot \frac{1}{6}\right)\frac{6}{10} + \left(\frac{1}{4}\cdot \frac{2}{4}\cdot \frac{1}{4}\right)\frac{4}{10} \approx 0.08 $$ So $P(S|\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle})$ equals $$ \frac{\frac{5}{108}\frac{6}{10}}{0.08} \approx 0.35 $$

Not calculating the whole probability.

When implementing the spam filter we do not actually need to calculate the denominators. We just compare the expressions $P(\mathrm{viagra}|S)P(\mathrm{money}|S)P(\mathrm{boondoggle}|S)P(S)$ and $P(\mathrm{viagra}|\neg S)P(\mathrm{money}|\neg S)P(\mathrm{boondoggle}|\neg S)P(\neg S)$ and see which one is bigger. This is also important because some of the numbers start getting smaller and smaller and you end up with floating point underflow errors. If the numbers get too small then you have to calculate with the logarithm of the probability and do additions rather than subtraction.

Laplacian Smoothing

What if we have the message 'now money viagra', if we look at our data set the word 'now' has not appeared in a spam message. There could be two reasons for this, one is that a spam message will never contain the word 'now' (unlikely), or that we just do not have a spam message with 'now' appearing in our training set. If we use model (A) and calculate the probability that our message is spam we get $P(S|\mathrm{now}\land\mathrm{money}\land{viagra})$ equals $$ \frac{P(\mathrm{now}|S)P(\mathrm{money}|S)P(\mathrm{viagra}|S)P(S)}{P(\mathrm{now}\land\mathrm{money}\land\mathrm{viagra})} $$ which equals $$\frac{0 \cdot \frac{5}{6} \cdot \frac{2}{6}\frac{6}{10}}{P(\mathrm{now}\land\mathrm{money}\land\mathrm{viagra})} $$ So even though the words 'money' and 'viagra' are pretty good indicators of a message being spam we get probability 0. To get around this we add one to all our counts to avoid probability $0$ estimates and adjust the total count so as to avoid any probabilities greater than $1$. So in model (A) if we are considering $9$ words as a above then we estimate $P(\mathrm{now}|S)$ to be $$ \frac{0 + 1}{6 + 1} $$ instead of $$ \frac{0}{6} $$ If you had a word that appeared in all 6 of the spam tweets then you would get an estimate $\frac{6+1}{6+1}$ which would be $1$. I leave it an exercise to work out the correct thing to do in model (B).

Feature Modelling

All most all machine learning algorithms require numbers and vectors as inputs. Our Naive Bayes classifier does not really work with words, but feature vectors. There are different possible models, but we use something similar to model (A). First we take our data set and find the first $n$ most popular words. The most popular word a data set consisting of English messages is typically the word 'the'. You can improve things by filtering out the most popular words that don't contribute much a message being (referred to as stop words). We will not worry about that here. But a word like 'the' is equally likely to appear in a spam tweet or a non spam tweet, so it is better to ignore it. Then we turn each message into a vector of length $n$ where each entry is $1$ or $0$, and the $i$th entry is $1$ if the message contains the $i$th most popular word. So in a typical English message data set 'the' is the most popular and 'to' is the second most popular word. So if our message contained the words 'to the' then the first two entries of its feature vector would have the value $1$. $$ f = (f_1, \ldots , f_i, \ldots, f_n) $$ Where $f_i$ is $1$ if the $i$th most popular word $w_i$ occurs in the message and $0$ otherwise. It is easy to write a function takes a message and turns it into our feature vector. Given our training set we can estimate two probabilities for our two classes $S$ and $\neg S$ , $P(w_i|S)$, $P(\overline{w}_i|S)$, $P(w_i| \neg S)$ and $P(\overline{w}_i | \neg S)$, where $w_i$ is the even that word $i$ occurs in the message and $\overline{w}_i$ is the event that word $i$ does not occur in the message. In our example above we only have $9$ words in our data set and they appear in order of popularity as 'money', 'world', 'viagra', 'aardvark', 'heart', 'boondoogle' , 'honey', 'back' , 'now'. You have to break ties (words that are equally popular) and you have to do it consistently. So give the message 'aardvark money now' its feature vector would be $$ f = (1,0,0,1,0,0,0,0,1) $$ This vector $f$ corresponds to the event $$ w_1\overline{w}_2\overline{w}_3w_4\overline{w}_5\overline{w}_6\overline{w}_7w_8 $$ So to use Bayes' theorem to work on the probability that the tweet is is spam we have to calculate the quantity $$ \frac{P(w_1\overline{w}_2\overline{w}_3w_4\overline{w}_5\overline{w}_6\overline{w}_7\overline{w}_8|S)P(S)}{P(w_1\overline{w}_2\overline{w}_3w_4\overline{w}_5\overline{w}_6\overline{w}_7\overline{w}_8 w_9)} $$ Calculating $P(w_1\overline{w}_2\overline{w}_3w_4\overline{w}_5\overline{w}_6\overline{w}_7\overline{w}_8|S)$ is easily done by the independence assumption. It is the product of the terms $P(w_1|S)$,$P(\overline{w}_2|S)$,$P(\overline{w}_3|S)$,$P(w_4|S)$,$P(\overline{w}_5|S)$,$P(\overline{w}_6|S)$,$P(\overline{w}_7|S)$ and $P(\overline{w}_8|S)$. All these values are easily estimated from our data set. For example $P(\overline{w}_3|S)$ is the probability that the word 'money' does not appear in a spam tweet. We had 6 spam tweets and 5 of them contained the word money and so we get that $P(\overline{w}_3|S)$ equals $1/6$.

Model (A) and Feature Vectors

If you go back to model (A) and you try to estimate if a message is spam or not, then using the same message we would only need to calculate $$ \frac{P(w_1|S)(w_4|S)P(w_9|S)P(S)}{P(w_1\land w_w \land w_3)} $$ since $w_1$ is 'money', $w_4$ is 'aardvark' and $w_9$ is 'now'. We are throwing away information about the words that do not occur in the tweet. Does this matter? More importantly is this calculation incorrect. To simply things imagine that we only had two words in our feature vector $W_1$ and $W_2$. Then given a message there are 4 possible atomic events: $$ W_1 W_2, W_1 \overline{W}_2, \overline{W}_1 W_2, \overline{W}_1 \overline{W_2}$$ What do we mean when we write $P(W_1|S)$? Looking at our atomic events we actually mean $$ P( W_1 W_2 \lor W_1\overline{W}_2|S) $$ Any event is a union of the atomic events in your probability model. Using the independence assumption for $W_1$ and $W_2$ and the basic rule of probability that $P(A\lor B)$ equals $P(A)+P(B)$ when then events $A$ and $B$ are independent atomic events we get that $P(W_1|S)$ equals $P( W_1 W_2 \lor W_1\overline{W}_2|S)$ which equals $$ P(W_1|S)P(W_2|S) + P(W_1|S)P(\overline{W}_2|S)$$ refactoring gives $$P(W_1|S)(P(W_2|S) + P(\overline{W}_2|S)) $$ and since $P(W_2|S) + P(\overline{W}_2|S)$ equals $1$ we get $P(W_1|S)$. So if we ignore any information about $W_2$ then we get a factor of $1$. If we use what information we have about $W_2$ then we get a better estimate for the probability. You can show that the same applies if you have lots of words in your feature vector. So our original model (A) is not wrong, but the feature vector model where we take into account if a word appears or not when we are estimating the probabilities and gives a better estimate if the word is spam or not. Note the above argument depends on our independence assumption (the naive in Naive Bayes). If you did not look at any words then the only information that you would have is $P(S)$ or $P(\neg S)$ as you look at more words in your feature you get a better estimate of the probability that the message is spam or not.

Model (B) with multiple words

How do you calculate the probability that the message 'money money boondoggle' is spam? We have already assumed that the probabilities of a words occurring in a spam or non-spam tweet are independent. If we also assume that the probability of a word appearing $k$ times is $$ P(w^k|S) = P(w|S)^k $$ That is each occurrence is independent, then we can calculate the probability that a message containing the multiple occurrences of a word is spam or not, but only if you use model (B) to calculate the probabilities. You should not mix up model (A) and model (B). It does not make sense in model (A) to ask what the probability of a word occurring $k$ times in a spam message is. We can only ask if a m message contains the word or not.
Take Home Message

If you want to take into account multiple words then do not use model (A) to calculate your probabilities.

Model (B) with multiple words and feature vectors.

The feature vector approach for model (A) considered vectors where the entries are $0$ or $1$. Given a feature vector $f$ the entry $i$th entry is $1$ if the word appears in the message and $0$ otherwise. Instead we would have a feature vector where the $ith$ entry tells you how many times the word appears in the message. Thus for our message 'money money boondoggle' its feature vector would be (using the same ordering as above): $$ (2,0,0,0,0,1,0,0,0) $$ Again it is not hard to use the information that a word appears $0$ times.

Take home messages

  • Are you using model (A) or model (B). Don’t get them confused, especially when you are estimating the probabilities $P(w_i|S)$ from the training data.

  • Are you using the negative information, the information that a word does not occur? It does not matter your maths is correct, but you are not using all the information that you have.

  • To understand how this is related to other machine learning algorithms, then you have to understand that we take a message and construct a feature vector. Depending on if you are using model (A) or (B) your feature vector either has $0$ or $1$ entries or positive integer that tells you how many times a word occurs in a message. Feature vectors is an example modelling that you often have to do in machine learning. Your data set and package does not always come as ready packaged as a set of vectors.

  • If you watch some random video on the internet, it is not always clear which model they are using when they calculate the probabilities.

The documentation to scikit-learn has a nice entry on naive Bayes, which discusses the various options on modelling as well as links to various interesting articles on the different modelling approaches to naive Bayes.

Lecture 4: Logistic Regression as Machine Learning

Today’s Topics

Today’s slides.

Logistic Regression

Logistic regression, overfitting and regularisation. Again Logistic regression is an algorithm that comes from statistics, but it can also seen as a machine learning algorithm. The hypothesis is very similar to linear regression is it a set of values that defines a linear function. The difference between logistic regression and linear regression is the linear function goes through a logistic function that works as a threshold function. Unlike linear regression it is not possible to solve the model exactly, and gradient descent is necessary.

There are a lot of ways of thinking about how logistic regression works.

  • As a modification of linear regression to get $0$ or $1$ values to divide the data set into two halves, or two find a separating hyperplane between the two classes.
  • As an estimator of the probability that point begins to one class or another.
  • As a single neuron. You can see logistic regression as the beginning of neural networks.

Overfitting and Regularisation

Both linear and logistic regression can be improved with a regularisation term that avoids overfitting. You should try to begin to understand why overfitting is a problem and some strategies for avoiding it.

Reading Guide

Logistic Regression

Overfitting and Regularisation

Multiclass classification.

Confusion Matrices

What should I know by the end of this lecture?

  • What is logistic regression and how does it differ from linear regression?
  • What is the cost function? What does the logistic function do?
  • How do I implement gradient descent for logistic regression?
  • How does logistic regression relate to log-odds and what is it relationship with probability.
  • What is overfitting?
  • How does the regularisation term work in linear and logistic regression and how does it avoid overfitting.
  • How do you use a binary classifier for multi-class classification? What is one-vs-all classification?
  • What is a confusion matrix?

Lecture 5: Support Vector Machines

Today’s topic Support Vector Machines (SVM)

Support vector machines (SVMs) are used for classification, and use some of the ideas from logistic regression. Support vector machines deal with noisy data, where some labels are miss-classified, with large-margin classifiers.

Support vector machines can also do non-linear classification using kernels. A kernel is a non-linear transformation of the input data into a higher dimensional space. Kernels transform your input data into a space where it is possible to do linear separation as in logistic regression.

This sounds very abstract, but the basic mathematics is not very complicated. Support vector machines are very powerful, and before you try neural networks try SVMs with different kernels.

These are the slides that I use.

Reading Guide

What should I know by the end of this lecture?

  • What are Support Vector Machines?
  • What is a large margin classifier?
  • What is the hinge loss function?
  • What are Kernels? You should know some common kernels including the polynomial kernel and the Gaussian kernel (or radial bases kernel).
  • What is the Kernel trick?
  • How does the learning algorithm work for SVMs?

Lecture 6: Cross Validation and Feature Engineering

Today’s Topics

In today’s lecture you will look at various techniques to deal with and understand overfitting. Dealing with overfitting leads nicely to the model selection problem. First and foremost, how do you decide which machine learning algorithm to use. Further, in many machine learning algorithms there are hyper-parameters that are not decided by the training data. Choosing which model to use or values of the models hyper-parameters is a difficult task and can greatly affect the performance of you algorithm. Cross validation is a useful technique from statistics that allows you to partition your data up into many combinations of training, test and validation sets. You can then use cross validation to help you decide which machine learning model to use and what values to set the hyper-parameters.

Finally we will cover various techniques for data-normalisation, and dealing with categorical data. One important thing to remember is that a little bit of data prepossessing can often improve the performance of your learning algorithm significantly.

Slides

I used these slides in the lecture.

Reading Guide

What should I know by the end of this lecture?

  • What is over fitting? what are some strategies to avoid it?
  • What is the bias
  • What is the model selection problem?
  • What are hyper-parameters?
  • Why do you need to split data in training, test and validation sets?
  • What is cross validation?
  • What is k-fold cross validation?
  • What are the different encoding strategies for categorical data? One-hot encoding, dummy encoding?
  • What is data normalisation and when is it necessary?

Lecture 7: Clustering and Nearest Neighbours

Today’s Topics

In this segment you are looking at some unsupervised algorithms, as well as one supervised learning method (K-nearest neighbours) The main unsupervised algorithms are Hierarchical clustering, K-means clustering and DBSCAN. It is important not to get K-nearest neighbours and K-means clustering confused.

The K-means algorithm works by gradient descent. Unlike a lot of the algorithms that we have been looking at, K-means often suffers from the problem of many local minima. In Andrew Ng’s lectures you will meet various ways of dealing with local minima.

If you have taken Algorithms and Data Structures II(1DL231) or AD3 (1DL481), then you will have met the concept of NP-hardness. K-means clustering is NP-hard (see the reference below). This means that the problem is not easy to solve. If you could guarantee that there would only be one global minimum then gradient descent would be an efficient algorithm. This implies that there will always often be local minima in K-means clustering.

Slides

I used these slides in the lecture.

Reading Guide

K-Means Clustering

K-Nearest Neighbours

What should I know by the end of this lecture?

  • What are some of the applications of clustering?
  • What is hierarchical clustering and what algorithms are there?
  • How does the K-means algorithm work? What is the cost function?
  • What is a local optima and why is it a problem with the K-means algorithm?
  • What are some approaches to choosing the number of clusters in K-means?
  • How does the K-nearest neighbour algorithm work and what are some of its applications?
  • What is DBSCAN and how does it work?

Lecture 8: Decision Trees

Today’s Topics

Decision trees are another powerful machine learning technique, and has the advantage that it is easy to see what the algorithm has learned. Constructing the optimal decision tree for an arbitrary data set is yet another NP-hard question, and there are many algorithms available. In this lecture you will look at one of the simpler algorithm ID3. ID3 uses information theory to decide how to construct a tree.

Slides and Notebooks

Reading Guide

Information theory.

You do not need to know much about information theory to apply the ID3 algorithm, but the first five chapters of An introduction to information theory: symbols, signals & noise by Pierce, John R will give some background to people who are interested. The book is available online at the University library.

Decision Trees

What should I know by the end of this lecture?

  • What are decision trees?
  • What is Shannon entropy and how does it measure information?
  • How do I calculate the Shannon entropy of a distribution?
  • How does the ID3 learning algorithm work?

Lecture 9: Principle Component Analysis and Preprocessing

Today’s Topics

As we have seen before it is often a good idea to do some pre-processing of your input data. Sometimes there are linear relationships hidden in the data, and if you do not remove or discover them then your machine learning algorithm will be trying to learn these linear relationships. In linear algebra, principle component analysis (PCA) is a well established method of reducing the dimension of a data set. It uses eigenvalues and eigenvectors to uncover hidden linear relationships in your data-set. For this course you do not need to know how to actually compute eigenvalues and eigenvectors by hand, but you should understand the definition and what is going on.

An interesting application of PCA are Eigenfaces.

Slides

I use these slide. In previous years, I have done the lecture on the blackboard, and these are some notes that I used. They contain some extra derivations.

Reading Guide

What should I know by the end of this lecture?

  • What is principle competent analysis (PCA)?
  • What is the co-variance matrix and what do the entries mean?
  • What do the eigenvalues of the co-variance matrix tell you about the data?
  • How do you choose the number of dimensions in PCA?
  • What are some applications of PCA in machine learning?

Lecture 10: Ensemble Learning

Slides

Link for the slides.

Today’s Topics: Ensemble Learning

Ensemble learning is a simple idea. Instead of training one model, we train multiple models and combine the results. A popular ensemble learning algorithm is random forests that combines many decision trees that are trained on random subsets of the features. To build a classifier a majority vote is taken. There are various techniques to improve the system including boosting, bagging and gradient boosting. As you will see below these are not limited to random forests, but to other combinations of learning algorithms. There is a lot extensions and refinements of ensemble methods including AdaBoost. We will only cover gradient boosting and not go too much into the mathematics behind ensemble learning (that is reserved for a more advanced course). The aim of this lecture is to give you access to another powerful machine learning technique.

One thing that we did not cover in Lecture 8 was using decision trees for regression. Although the idea is not that complicated you should spend some time understand how Regression trees can be used for regression. Although you will not be examined on other algorithms for constructing decisions trees (we covered ID3 in Lecture 8 ) you should be aware that there are other algorithms.

Constructing the perfect decision tree is a computationally hard problem (in fact it is NP-complete). For machine learning we want fast and efficient algorithms for learning, and most decision tree algorithms in the literature are approximations.

Reading Guide

What should I know by the end of this lecture?

  • How do I use decision trees for regression?
  • What is ensemble learning?
  • How do random forest work?
  • What is boosting and bagging?
  • What is gradient boosting?