Lectures
Here you can find information on the lectures, as well as information
on when you should have enough information to be able to attempt the
various practical and theoretical python notebook assignments. The
project should be started as soon as possible, and you should try
different machine learning algorithms on the data set as you learn
them in the course.
There is a menu item for each lecture where you can find a reading
guide for the textbook, links to additional material and links to
slides.
Please note that the dates still could change, and the dates in
timeedit
are always correct. If you find any errors, then please email me.
You should now have all the knowledge you need to attempt practical
notebooks 1 and 2 (P1 & P2)
You now should have the knowledge to attempt the the first theoretical
notebook (T1).
You should now be able to attempt the second theoretical notebook (T2)
and the third practical notebook (P3).
You should now be able to attempt the final practical notebook (P4).
You should now be able to attempt theoretical notebook 3 (T3) and (T4)
Practical and Theoretical Notebooks
There are 4 practical and theoretical notebooks that are done
individually as assignments. Links will be provided but the topics
covered in the notebooks are as follows.
Practical Notebooks
- P1 : Basic programming with Python, lists, sets and an introduction
to NumPy.
- P2 : Introduction to Pandas
- P3 : Linear and Logistic regression with
scikit-learn
- P4 : Preprocessing, feature engineering and cross validation.
Theoretical Notebooks
- T1 - Logistic Regression
- T2 - Using naive Bayes for classifying tweets.
- T3 - Using K-means classifiers
- T4 - The Entropy of the normal distribution and binary decision
trees using ID3.
Deadlines
These should be the same deadlines as are in Studium. If there is any
discrepancy then please inform me.
What | When? |
---|
P1 & P2 | 31/1 13:00 |
T1 & T2 | 10/2 13:00 |
P3 & P4 | 17/2 13:00 |
T3 & T4 | 24/2 13:00 |
Project | 7/3 13:00 |
Subsections of Lectures
Lecture 1: Introduction and Overview of the Course
Today’s Topic
Overview of the course and introduction to machine learning. What is
supervised learning and what is unsupervised learning.
Link to the slides.
I use
these
slides.
Reading Guide
What should I know by the end of this lecture?
- How is the course organised?
- What is machine learning?
- What is supervised learning?
- What is unsupervised learning?
Lecture 2: Linear Regression as Machine Learning
Today’s Topics
Linear Regression
Linear Regression as a machine learning algorithm. Machine learning
algorithms and hypothesises. In short a machine learning find the best
hypothesis that explains that data. A cost function (or an error
function, or a loss function) measure how far way a hypothesis is from
explaining the data: the smaller the cost, the better the hypothesis.
Ideally you want an algorithm takes the training data and gives you
the hypothesis that with the smallest cost value. With linear
regression this is possible (using linear algebra), but in general it
is not possible.
If you have been reading about neural networks, then in a neural
network the weight roughly corresponds to the set of all possible
hypothesis.
Training and Test Sets
As the course moves along we will learn more about best practices with
machine learning. The first important idea is that you should split
your data into a test set and a training set. If do not do this then
there is a possibility that you will over fit to your data set, and
when you meet new examples your machine learning system will not
perform that well. It is also important to keep in mind that when you
are using gradient descent to find the best hypothesis you use the
training set, but when you are evaluating the performance of the
learning algorithm you should use the test set. Later on when we look
at cross validation we will look at more advance ways to divide up
your data.
Links to Slides
- The slides can be found
here..
Reading Guide
What should I know by the end of this lecture?
- How does linear regression work with one variable?
- How does linear regression work with many variables?
- What is a hypothesis in a machine learning algorithm?
- What is a cost function? Note that in machine learning there is no
standard terminology. This is because machine learning comes from
many different disciplines. Other names for the cost function are
the Error function and the loss function.
- What is the goal of a machine learning algorithm with the hypothesis
and the cost function? What does the cost function measure? Why is a
low value of the cost function desirable?
- How does gradient descent work for linear regression? Can you derive
it from first principles?
- Why is it necessary to split the data up into a training and a test
set?
Lecture 3: Probability and Naive Bayes Classification
Today’s Topic
Using Bayes’ theorem for machine learning. You should do some revision
on the use of Bayes’ theorem in general. In this lecture you will look
at how to use Bayes’ theorem to build a spam detector. One important
idea to take away from this lecture is that there are a variety of
ways implementing spam detection: in particular there are different
feature models that you can use that give you different ways of
calculating the relevant probabilities. It is important that you
understand the difference between the different ways of implementing
spam detection.
Reading Guide
What should I know by the end of this lecture?
- How do I use Bayes’ theorem?
- How do I use Bayes’ theorem to build a simple spam detector that
only uses one word?
- What is independence assumption in the naive Bayes’ algorithm?
- What are the various models estimating the probabilities for spam
detection?
- What is Laplacian smoothing and how does it work in the context of
spam detection?
- When do you need to use logarithms to calculate the relevant
probabilities, and how do you use them?
Subsections of Lecture 3
Naive Bayes for Spam Classification
There are a lot of tutorials and youtube videos out there on using
Naive Bayes for document classification. None of these tutorials are
wrong, but they often hide some subtle points that if you think too
hard about you will get confused. In this posts I want to explain what
is really going on in Naive Bayes for spam classification.
This post assumes that you are already familiar with Bayes’ theorem.
Rather foolishly I did all the calculations in the post by hand. If
you find any errors the please report them to me.
Our data set
To make things more concrete we will work on a very small data set
where we can do the calculations directly. We are classifying
micro-tweets of exactly 3 words. Our training set is as follows. $S$
indicates that a message is spam and $\neg S$ indicates that a
message is not spam.
$\neg S$Number | Tweet | Spam ( $S$ or $\neg S$ ) |
---|
1 | money aardvark boondoggle | $S$ |
2 | money money money | $S$ |
3 | money money world | $S$ |
4 | money world world | $S$ |
5 | viagra money back | $S$ |
6 | viagra heart honey | $S$ |
7 | aardvark boondoggle world | $\neg S$ (not spam) |
8 | honey honey honey | $\neg S$ (not spam) |
9 | viagra heart money | $\neg S$ (not spam) |
10 | money honey now | $\neg S$ (not spam) |
Background: Classifying with only one word.
As a warm up let's just build a classifier that uses one particular
word $w = \mathrm{money}$.
Bayes' Theorem should be familiar to you by now:
$$
P(S|w) = \frac{P(w|S)P(S)}{P(w)}
$$
$P(S|w)$ is the even that an email is Spam given that the word $w$
occurs in it. Using Bayes theorem we can calculate the probability
that a message containing the word $w$ is spam. We can estimate the
values $P(w|S)$, $P(S)$ and $P(w)$ from out data set.
$$
P(S) = \frac{\mathrm{number\ spam}}{\mathrm{total\ messages}} = \frac{6}{10}
$$
To estimate $P(w|S)$ we have to count the number of times that a
particular word occurs in a spam message. So
$$
P(\mathrm{money}|S) = \frac{5}{6}
$$
When you are only considering a single word then estimating $P(\mathrm{money})$
is easy. It is the ratio of the number of tweets that contain the
word 'money' and the total number of tweets. The word money appears in
$7$ tweets. So
$$
P(\mathrm{money}) = \frac{7}{10}
$$
So if we get a message that contains the word 'money' we can calculate
the probability that it is a spam message. $$ P(S|\mathrm{money}) =
\frac{P(w|S)P(S)}{P(w)} = \frac{\frac{5}{6}\frac{6}{10}}{\frac{7}{10}}
= \frac{5}{10}\frac{10}{7} = \frac{5}{7} \approx 0.71 $$ There is an
important identity that will become useful later on $$
P(\mathrm{money}) = P(\mathrm{money}|S)P(S) +
P(\mathrm{money}|\neg S)P(\neg S) $$ So $$
P(\mathrm{money})= \frac{5}{6}\frac{6}{10} + \frac{2}{4}\frac{4}{10} =
\frac{7}{10} $$First Pitfall: Estimating the probabilities
So how to we estimate the probabilities $P(w|S)$ , $P(S)$ and $P(w)$?
What do they really mean? The probabilities $P(S)$ and $P(\neg S$) are
unambiguous. They are just the probability that a tweet is spam or
not. But $P(w|S)$, $P(w|\neg S)$, and $P(w)$ can mean different things
depending exactly which model we use to calculate the probabilities.There are two models:
(A): To calculate $P(\mathrm{money}|S)$. There are 6
messages that are spam and in those 6 messages 5 of them (1,2,3,4,5)
contain the word money so $P(\mathrm{money}|S) = 5/6$, and of the 10
messages 7 of them (1,2,3,4,5,9,10) contain the word 'money' so
$P(\mathrm{money}) = 7/10$. This is exactly what we did above.
(B): To calculate $P(\mathrm{money})$ there are $10\times 3 = 30$
words in our training set and the word money appears 10 times so
$$P(\mathrm{money}) = 10/30.$$ To calculate $P(\mathrm{money}|S)$
there are 6 spam messages each of 3 words long. In the words of the
spam messages the word 'money' appears 8 times. So $$
P(\mathrm{money} | S) = \frac{8}{3 \times 6} = \frac{8}{18} =
\frac{4}{9} $$
The probability a message being spam is still
$6/10$. So
if I get the message 'money for nothing' then the probability that
it is spam is calculated as before $$ P(S|\mathrm{money}) =
\frac{P(\mathrm{money}|S)P(S)}{P(\mathrm{money})} =
\frac{{\frac{8}{3 \times 6}\times \frac{6}{10}}}{ \frac{10}{30}} =
\frac{8}{10} $$
It seems that if spammers are prone to repeat words in their
message then this increases the probability that a message
containing that word is spam.
So how do you calculate the probability that the message 'money money
mammon' is spam? In model (A) it does not matter how many times 'money'
appears in a message: you only count the number of messages 'money'
appears in. While in model (B) there is some weighting of the number
of times that a word appears. But to calculate
$P(S|\mathrm{money}^2)$ (where is short hand for 'money appearing
twice) we have calculate $P(\mathrm{money}^2|S)$. How you do this
depends a bit on your model and the assumptions underlying the
model. We'll get to that later.
Take Home Message
Take home message
So the first take home message is be careful how you count the words
and how you calculate the probabilities. If you confuse model (A) and
model (B) while writing your code you will get strange answers (as I
did at one point).
Naive Bayes: first version
We are going to use model (A). That is we going to ignore how many
times a word appears in a message. We are only interested if the word
appears on the message or not.
One word is not going to much of a spam classifier. Even in our little
data set above, the word ‘money’ can appear in spam and non spam
messages. We will get a better classifier if we take into account more
words. Our data set is quite small and for each word we can count the
number of times it appears in a spam tweet and the number of times it
appears in a non-spam tweet.
Word | occurrences in spam | occurrences in non spam |
---|
$w_1 = \mathrm{money}$ | 5 | 2 |
$w_2 = \mathrm{world}$ | 2 | 1 |
$w_3 = \mathrm{viagra}$ | 2 | 1 |
$w_4 = \mathrm{aardvark}$ | 1 | 1 |
$w_5 = \mathrm{heart}$ | 1 | 1 |
$w_6 = \mathrm{boondoggle}$ | 1 | 1 |
$w_7 = \mathrm{honey}$ | 1 | 2 |
$w_8 = \mathrm{back}$ | 1 | 0 |
$w_9 = \mathrm{now}$ | 0 | 1 |
You can turn these counts into probabilities, and thus you can
calculate quantities like $P(\mathrm{money}|S) = 5/6$,
$P(\mathrm{money}|\neg S) = 2/4$.
Suppose I receive a message 'viagra money boondoggle' what is the
probability that it is spam message? When we use Bayes' theorem we
have to calculate $$P(\mathrm{viagra} \land \mathrm{money} \land
\mathrm{boondoggle}|S)$$ where $$\mathrm{viagra} \land \mathrm{money}
\land \mathrm{boondoggle}$$ is the event that the words 'viagra',
'money' and 'boondoggle' appears in a message.The Naive in Naive Bayes
We need to make an independence assumption. In a spam or non spam
message the probability of words are independent. That is
$$
P(w_1 \land w_2 \land \cdots \land w_n | S) = P(w_1|S)P(w_1|S) \cdots
P(w_n|S)
$$
and
$$
P(w_1 \land w_2 \land \cdots \land w_n | \neg S) = P(w_1|\neg S)P(w_1|\neg S) \cdots
P(w_n|\neg S)
$$
Note this is a weaker assumption than simply saying
$$
P(w_1 \cdots w_n) = \prod_{1\leq i \leq n} P(w_i)
$$Take home message
Note that because we have made the assumptions that
$$P(w_1 \land w_2
\land \cdots \land w_n | S) =
\prod_{i=1}^n P(w_i|S)$$
and
$$P(w_1 \land w_2 \land \cdots \land w_n | S) =
\prod_{i=1}^n P(w_i|S)$$
it does not make sense to directly estimate the probabilities of
$P(w_i)$ directly from the data set. Later on we will see that you do
actually need the probabilities $P(w_1 \cdots w_n)$ to decide if a
message is spam or not. If you want to calculate the probability
$P(w_1 \cdots w_n)$ then you must use the identity $P(w_1 \cdots
w_n)$ equals
$$
P(w_1 \cdots w_n|S)P(S) + P(w_1 \cdots w_n|\neg S)P(\neg S)
$$The independence assumption is why Naive Bayes is referred to as
naive. Although this model could be improved. It seems that the
probability of one word appearing in a message should not be
independent of another word. If a spammer write ‘money’ then he is
likely to also include ‘viagra’ in the message. Even so, assuming
independence works very well in practice.
Calculating the spam probability.
We can now apply Bayes' theorem $P(S | \mathrm{viagra} \land
\mathrm{money} \land \mathrm{boondoggle})$ equals $$ \frac{
P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle}|S)
P(S)}{P(\mathrm{viagra} \land \mathrm{money} \land
\mathrm{boondoggle})} $$
From the independence assumption we have that $P(S | \mathrm{viagra}
\land \mathrm{money} \land \mathrm{boondoggle})$ equals $$ \frac{
P(\mathrm{viagra}|S)P(\mathrm{money}|S)P(\mathrm{boondoggle}|S)P(S)}
{P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle})} $$
To calculate $P(\mathrm{viagra} \land \mathrm{money} \land
\mathrm{boondoggle})$ we use the identity above.
Taking product $$P(\mathrm{viagra})P(\mathrm{money})P(\mathrm{boondoggle})$$
is the wrong answer.
So instead we get that $P(\mathrm{viagra} \land \mathrm{money} \land
\mathrm{boondoggle})$ equals
$P(\mathrm{viagra} \land \mathrm{money} \land
\mathrm{boondoggle}|S)P(S)$ plus $P(\mathrm{viagra} \land \mathrm{money} \land
\mathrm{boondoggle}|\neg S)P(\neg S)$.
The by the independence assumption
$P(\mathrm{viagra} \land \mathrm{money} \land \mathrm{boondoggle}|S)$ equals
$$
P(\mathrm{viagra}|S)P(\mathrm{money}|S)P(\mathrm{boondoggle}|S) =
\frac{2}{6}\frac{5}{6}\frac{1}{6} = \frac{5}{108}
$$
Putting the numbers in we get $P(\mathrm{viagra} \land \mathrm{money} \land
\mathrm{boondoggle})$ equals
$$ \left(\frac{2}{6}\cdot \frac{5}{6}\cdot
\frac{1}{6}\right)\frac{6}{10} + \left(\frac{1}{4}\cdot
\frac{2}{4}\cdot \frac{1}{4}\right)\frac{4}{10} \approx 0.08
$$
So $P(S|\mathrm{viagra} \land
\mathrm{money} \land \mathrm{boondoggle})$ equals
$$
\frac{\frac{5}{108}\frac{6}{10}}{0.08} \approx 0.35
$$Not calculating the whole probability.
When implementing the spam filter we do not actually need to calculate
the denominators. We just compare the expressions
$P(\mathrm{viagra}|S)P(\mathrm{money}|S)P(\mathrm{boondoggle}|S)P(S)$
and
$P(\mathrm{viagra}|\neg S)P(\mathrm{money}|\neg S)P(\mathrm{boondoggle}|\neg S)P(\neg S)$
and see which one is bigger. This is also important because some of
the numbers start getting smaller and smaller and you end up with
floating point underflow errors. If the numbers get too small then you
have to calculate with the logarithm of the probability and do
additions rather than subtraction.Laplacian Smoothing
What if we have the message 'now money viagra', if we look at our data
set the word 'now' has not appeared in a spam message. There could be
two reasons for this, one is that a spam message will never contain
the word 'now' (unlikely), or that we just do not have a spam message
with 'now' appearing in our training set. If we use model (A) and
calculate the probability that our message is spam we get
$P(S|\mathrm{now}\land\mathrm{money}\land{viagra})$ equals
$$
\frac{P(\mathrm{now}|S)P(\mathrm{money}|S)P(\mathrm{viagra}|S)P(S)}{P(\mathrm{now}\land\mathrm{money}\land\mathrm{viagra})}
$$
which equals
$$\frac{0 \cdot \frac{5}{6} \cdot
\frac{2}{6}\frac{6}{10}}{P(\mathrm{now}\land\mathrm{money}\land\mathrm{viagra})}
$$
So even though the words 'money' and 'viagra' are pretty good
indicators of a message being spam we get probability 0.
To get around this we add one to all our counts to avoid probability
$0$ estimates and adjust the total count so as to avoid any
probabilities greater than $1$. So in model (A) if we are considering
$9$ words as a above then we estimate $P(\mathrm{now}|S)$ to be
$$
\frac{0 + 1}{6 + 1}
$$
instead of
$$
\frac{0}{6}
$$
If you had a word that appeared in all 6 of the spam tweets then you
would get an estimate $\frac{6+1}{6+1}$ which would be $1$.
I leave it an exercise to work out the correct thing to do in model
(B).Feature Modelling
All most all machine learning algorithms require numbers and vectors as
inputs. Our Naive Bayes classifier does not really work with words,
but feature vectors. There are different possible models, but we use
something similar to model (A). First we take our data set and find
the first $n$ most popular words. The most popular word a data set
consisting of English messages is typically the word 'the'. You can
improve things by filtering out the most popular words that don't
contribute much a message being (referred to as stop words). We will
not worry about that here. But a word like 'the' is equally likely to
appear in a spam tweet or a non spam tweet, so it is better to ignore
it.
Then we turn each message into a vector of length $n$ where each entry
is $1$ or $0$, and the $i$th entry is $1$ if the message contains the
$i$th most popular word. So in a typical English message data set
'the' is the most popular and 'to' is the second most popular word. So
if our message contained the words 'to the' then the first two entries
of its feature vector would have the value $1$. $$ f = (f_1, \ldots ,
f_i, \ldots, f_n) $$ Where $f_i$ is $1$ if the $i$th most popular word
$w_i$ occurs in the message and $0$ otherwise.
It is easy to write a function takes a message and turns it into our
feature vector. Given our training set we can estimate two
probabilities for our two classes $S$ and $\neg S$ , $P(w_i|S)$, $P(\overline{w}_i|S)$, $P(w_i| \neg S)$ and $P(\overline{w}_i |
\neg S)$, where $w_i$ is the even that word $i$ occurs in the
message and $\overline{w}_i$ is the event that word $i$ does not occur
in the message.
In our example above we only have $9$ words in our data set and they
appear in order of popularity as 'money', 'world', 'viagra',
'aardvark', 'heart', 'boondoogle' , 'honey', 'back' , 'now'. You have
to break ties (words that are equally popular) and you have to do it
consistently. So give the message 'aardvark money now' its feature
vector would be $$ f = (1,0,0,1,0,0,0,0,1) $$
This vector $f$ corresponds to the event
$$
w_1\overline{w}_2\overline{w}_3w_4\overline{w}_5\overline{w}_6\overline{w}_7w_8
$$
So to use Bayes' theorem to work on the probability that the tweet
is is spam we have to
calculate the quantity
$$
\frac{P(w_1\overline{w}_2\overline{w}_3w_4\overline{w}_5\overline{w}_6\overline{w}_7\overline{w}_8|S)P(S)}{P(w_1\overline{w}_2\overline{w}_3w_4\overline{w}_5\overline{w}_6\overline{w}_7\overline{w}_8
w_9)}
$$
Calculating
$P(w_1\overline{w}_2\overline{w}_3w_4\overline{w}_5\overline{w}_6\overline{w}_7\overline{w}_8|S)$
is easily done by the independence assumption. It is the product of
the terms
$P(w_1|S)$,$P(\overline{w}_2|S)$,$P(\overline{w}_3|S)$,$P(w_4|S)$,$P(\overline{w}_5|S)$,$P(\overline{w}_6|S)$,$P(\overline{w}_7|S)$
and $P(\overline{w}_8|S)$. All these values are easily estimated from
our data set. For example $P(\overline{w}_3|S)$ is the probability
that the word 'money' does not appear in a spam tweet. We had 6 spam
tweets and 5 of them contained the word money and so we get that
$P(\overline{w}_3|S)$ equals $1/6$.Model (A) and Feature Vectors
If you go back to model (A) and you try to estimate if a message is
spam or not, then using the same message we would only need to
calculate
$$
\frac{P(w_1|S)(w_4|S)P(w_9|S)P(S)}{P(w_1\land w_w \land w_3)}
$$
since $w_1$ is 'money', $w_4$ is 'aardvark' and $w_9$ is 'now'. We
are throwing away information about the words that do not occur in the
tweet. Does this matter? More importantly is this calculation
incorrect.
To simply things imagine that we only had two words in our feature
vector $W_1$ and $W_2$. Then given a message there are 4 possible
atomic events:
$$ W_1 W_2, W_1 \overline{W}_2, \overline{W}_1 W_2, \overline{W}_1
\overline{W_2}$$
What do we mean when we write $P(W_1|S)$? Looking at our atomic
events we actually mean
$$
P( W_1 W_2 \lor W_1\overline{W}_2|S)
$$
Any event is a union of the atomic events in your probability model.
Using the independence assumption for $W_1$ and $W_2$ and the basic
rule of probability that $P(A\lor B)$ equals $P(A)+P(B)$ when then
events $A$ and $B$ are independent atomic events we get that
$P(W_1|S)$ equals $P( W_1 W_2 \lor W_1\overline{W}_2|S)$ which equals
$$
P(W_1|S)P(W_2|S) +
P(W_1|S)P(\overline{W}_2|S)$$ refactoring gives
$$P(W_1|S)(P(W_2|S) +
P(\overline{W}_2|S))
$$
and since $P(W_2|S) + P(\overline{W}_2|S)$ equals $1$ we get
$P(W_1|S)$.
So if we ignore any information about $W_2$ then we get a factor of
$1$. If we use what information we have about $W_2$ then we get a
better estimate for the probability. You can show that the same
applies if you have lots of words in your feature vector. So our
original model (A) is not wrong, but the feature vector model where we
take into account if a word appears or not when we are estimating the
probabilities and gives a better estimate if the word is spam or
not. Note the above argument depends on our independence assumption
(the naive in Naive Bayes).
If you did not look at any words then the only information that you
would have is $P(S)$ or $P(\neg S)$ as you look at more words in
your feature you get a better estimate of the probability that the
message is spam or not.Model (B) with multiple words
How do you calculate the probability that the message 'money money
boondoggle' is spam? We have already assumed that the probabilities of
a words occurring in a spam or non-spam tweet are independent. If we
also assume that the probability of a word appearing $k$ times is
$$
P(w^k|S) = P(w|S)^k
$$
That is each occurrence is independent, then we can calculate the
probability that a message containing the multiple occurrences of a
word is spam or not, but only if you use model (B) to calculate the
probabilities. You should not mix up model (A) and model (B). It does
not make sense in model (A) to ask what the probability of a word
occurring $k$ times in a spam message is. We can only ask if a m
message contains the word or not.
Take Home Message
If you want to take into account multiple words then do not use model
(A) to calculate your probabilities.
Model (B) with multiple words and feature vectors.
The feature vector approach for model (A) considered vectors where the
entries are $0$ or $1$. Given a feature vector $f$ the entry $i$th
entry is $1$ if the word appears in the message and $0$ otherwise.
Instead we would have a feature vector where the $ith$ entry tells you
how many times the word appears in the message. Thus for our message 'money money
boondoggle' its feature vector would be (using the same ordering as
above):
$$
(2,0,0,0,0,1,0,0,0)
$$
Again it is not hard to use the information that a word appears $0$
times.Take home messages
Are you using model (A) or model (B). Don’t get them confused,
especially when you are estimating the probabilities $P(w_i|S)$ from
the training data.
Are you using the negative information, the information that a word
does not occur? It does not matter your maths is correct, but you
are not using all the information that you have.
To understand how this is related to other machine learning
algorithms, then you have to understand that we take a message and
construct a feature vector. Depending on if you are using model (A)
or (B) your feature vector either has $0$ or $1$ entries or positive
integer that tells you how many times a word occurs in a
message. Feature vectors is an example modelling that you often have
to do in machine learning. Your data set and package does not
always come as ready packaged as a set of vectors.
If you watch some random video on the internet, it is not always
clear which model they are using when they calculate the
probabilities.
The documentation to scikit-learn
has a nice entry on naive
Bayes, which
discusses the various options on modelling as well as links to various
interesting articles on the different modelling approaches to naive
Bayes.
Lecture 4: Logistic Regression as Machine Learning
Today’s Topics
Today’s
slides.
Logistic Regression
Logistic regression, overfitting and regularisation. Again Logistic
regression is an algorithm that comes from statistics, but it can also
seen as a machine learning algorithm. The hypothesis is very similar
to linear regression is it a set of values that defines a linear
function. The difference between logistic regression and linear
regression is the linear function goes through a logistic function
that works as a threshold function. Unlike linear regression it is
not possible to solve the model exactly, and gradient descent is
necessary.
There are a lot of ways of
thinking about how logistic regression works.
- As a modification of linear regression to get $0$ or $1$ values to
divide the data set into two halves, or two find a separating
hyperplane between the two classes.
- As an estimator of the probability that point begins to one class
or another.
- As a single neuron. You can see logistic regression as the beginning
of neural networks.
Overfitting and Regularisation
Both linear and logistic regression can be improved with a
regularisation term that avoids overfitting. You should try to begin
to understand why overfitting is a problem and some strategies for
avoiding it.
Reading Guide
Logistic Regression
Overfitting and Regularisation
Multiclass classification.
Confusion Matrices
What should I know by the end of this lecture?
- What is logistic regression and how does it differ from linear
regression?
- What is the cost function? What does the logistic function do?
- How do I implement gradient descent for logistic regression?
- How does logistic regression relate to log-odds and what is it
relationship with probability.
- What is overfitting?
- How does the regularisation term work in linear and logistic
regression and how does it avoid overfitting.
- How do you use a binary classifier for multi-class classification?
What is one-vs-all classification?
- What is a confusion matrix?
Lecture 5: Support Vector Machines
Today’s topic Support Vector Machines (SVM)
Support vector machines (SVMs) are used for classification, and use
some of the ideas from logistic regression. Support vector machines
deal with noisy data, where some labels are miss-classified, with
large-margin classifiers.
Support vector machines can also do non-linear classification using
kernels. A kernel is a non-linear transformation of the input data
into a higher dimensional space. Kernels transform your input data
into a space where it is possible to do linear separation as in
logistic regression.
This sounds very abstract, but the basic mathematics is not very
complicated. Support vector machines are very powerful, and before
you try neural networks try SVMs with different kernels.
These are the
slides
that I use.
Reading Guide
What should I know by the end of this lecture?
- What are Support Vector Machines?
- What is a large margin classifier?
- What is the hinge loss function?
- What are Kernels? You should know some common kernels including the
polynomial kernel and the Gaussian kernel (or radial bases kernel).
- What is the Kernel trick?
- How does the learning algorithm work for SVMs?
Lecture 6: Cross Validation and Feature Engineering
Today’s Topics
In today’s lecture you will look at various techniques to deal with
and understand overfitting. Dealing with overfitting leads nicely to
the model selection problem. First and foremost, how do you decide
which machine learning algorithm to use. Further, in many machine
learning algorithms there are hyper-parameters that are not decided by
the training data. Choosing which model to use or values of the models
hyper-parameters is a difficult task and can greatly affect the
performance of you algorithm. Cross validation is a useful technique
from statistics that allows you to partition your data up into many
combinations of training, test and validation sets. You can then use
cross validation to help you decide which machine learning model to
use and what values to set the hyper-parameters.
Finally we will cover various techniques for data-normalisation, and
dealing with categorical data. One important thing to remember is that
a little bit of data prepossessing can often improve the performance
of your learning algorithm significantly.
Slides
I used these
slides
in the lecture.
Reading Guide
What should I know by the end of this lecture?
- What is over fitting? what are some strategies to avoid it?
- What is the bias
- What is the model selection problem?
- What are hyper-parameters?
- Why do you need to split data in training, test and validation sets?
- What is cross validation?
- What is k-fold cross validation?
- What are the different encoding strategies for categorical data?
One-hot encoding, dummy encoding?
- What is data normalisation and when is it necessary?
Lecture 7: Clustering and Nearest Neighbours
Today’s Topics
In this segment you are looking at some unsupervised algorithms, as
well as one supervised learning method (K-nearest neighbours) The
main unsupervised algorithms are Hierarchical clustering, K-means
clustering and DBSCAN. It is important not to get K-nearest neighbours
and K-means clustering confused.
The K-means algorithm works by gradient descent. Unlike a lot of the
algorithms that we have been looking at, K-means often suffers from
the problem of many local minima. In Andrew Ng’s lectures you will
meet various ways of dealing with local minima.
If you have taken Algorithms and Data Structures II(1DL231)
or AD3
(1DL481),
then you will have met the concept of NP-hardness. K-means clustering is NP-hard
(see the reference below). This means that the problem is not easy to
solve. If you could guarantee that there would only be one global
minimum then gradient descent would be an efficient algorithm. This
implies that there will always often be local minima in K-means
clustering.
Slides
I used these
slides
in the lecture.
Reading Guide
K-Means Clustering
K-Nearest Neighbours
What should I know by the end of this lecture?
- What are some of the applications of clustering?
- What is hierarchical clustering and what algorithms are there?
- How does the K-means algorithm work? What is the cost function?
- What is a local optima and why is it a problem with the K-means
algorithm?
- What are some approaches to choosing the number of clusters in
K-means?
- How does the K-nearest neighbour algorithm work and what are some of
its applications?
- What is DBSCAN and how does it work?
Lecture 8: Decision Trees
Today’s Topics
Decision trees are another powerful machine learning technique, and
has the advantage that it is easy to see what the algorithm has
learned. Constructing the optimal decision tree for an arbitrary data
set is yet another NP-hard question, and there are many algorithms
available. In this lecture you will look at one of the simpler
algorithm ID3. ID3 uses information theory to decide how to construct
a tree.
Slides and Notebooks
Reading Guide
You do not need to know much about information theory to apply the
ID3 algorithm, but the first five chapters of
An introduction to information theory: symbols, signals & noise
by Pierce, John R will give some
background to people who are interested. The book is available
online at the University library.
Decision Trees
What should I know by the end of this lecture?
- What are decision trees?
- What is Shannon entropy and how does it measure information?
- How do I calculate the Shannon entropy of a distribution?
- How does the ID3 learning algorithm work?
Lecture 9: Principle Component Analysis and Preprocessing
Today’s Topics
As we have seen before it is often a good idea to do some
pre-processing of your input data. Sometimes there are linear
relationships hidden in the data, and if you do not remove or discover
them then your machine learning algorithm will be trying to learn
these linear relationships. In linear algebra, principle component
analysis (PCA) is a well established method of reducing the dimension
of a data set. It uses eigenvalues and eigenvectors to uncover hidden
linear relationships in your data-set. For this course you do not need
to know how to actually compute eigenvalues and eigenvectors by hand,
but you should understand the definition and what is going on.
An interesting application of PCA are
Eigenfaces.
Slides
I use these
slide. In
previous years, I have done the lecture on the blackboard, and these
are some
notes
that I used. They contain some extra derivations.
Reading Guide
What should I know by the end of this lecture?
- What is principle competent analysis (PCA)?
- What is the co-variance matrix and what do the entries mean?
- What do the eigenvalues of the co-variance matrix tell you about the
data?
- How do you choose the number of dimensions in PCA?
- What are some applications of PCA in machine learning?
Lecture 10: Ensemble Learning
Slides
Link for the
slides.
Today’s Topics: Ensemble Learning
Ensemble learning is a simple idea. Instead of training one model, we
train multiple models and combine the results.
A popular ensemble learning
algorithm is random forests that combines many decision trees that are
trained on random subsets of the features. To build a classifier a
majority vote is taken. There are various techniques to improve the
system including boosting, bagging and gradient boosting. As you will
see below these are not limited to random forests, but to other
combinations of learning algorithms. There is a lot extensions and
refinements of ensemble methods including AdaBoost. We will only
cover gradient boosting and not go too much into the mathematics
behind ensemble learning (that is reserved for a more advanced
course). The aim of this lecture is to give you access to another
powerful machine learning technique.
One thing that we did not cover in Lecture 8 was using decision trees for
regression. Although the idea is not that complicated you should spend
some time understand how Regression trees can be used for
regression. Although you will not be examined on other algorithms for
constructing decisions trees (we covered ID3 in Lecture 8 ) you should be aware that
there are other algorithms.
Constructing the perfect decision tree is
a computationally hard problem (in fact it is NP-complete). For
machine learning we want fast and efficient algorithms for learning,
and most decision tree algorithms in the literature are
approximations.
Reading Guide
What should I know by the end of this lecture?
- How do I use decision trees for regression?
- What is ensemble learning?
- How do random forest work?
- What is boosting and bagging?
- What is gradient boosting?
Assignment
This course has four theoretical and four practical assignments that
are Python notebooks that you should complete. The practical notebooks
introduce you to tools such as pandas,
NumPy and
scikit-learn. The theoretical
notebooks help you to deepen you knowledge of some of the topics
covered in the course.
The assignments are available publicly on
github.
Project
The project will be hosted on kaggle and
will only available to students registered on the course. All
information about the project can be found on
studium.
Subsections of Resources
Recommended Books and other study material
All the material in this course is standard material. There are a lot
of books on machine learning, some more theoretical and some more
practical. It is difficult to recommend a book that fits the
background for all students on this course. Some books are too
mathematical and some books are too practical. The course has two
recommended text books. One is more theoretical, and one is more hands
on with Python. Both of them cover most of the material in the course.
Course Text Book: The Hundred-Page Machine Learning Book
.
It can be found here. You can buy an
electronic version of the book for as little as $20. Unfortunately the
book is not available the university library. If you follow the link
above you can access individual chapters of the book. In the lecture
notes I will give references to the book. The book being a little over
100 pages (141) does not go into all the topics in detail, and omit
some background material that is necessary for some students. My
advice is to start there, and move onto other references for more
background or more detail.
Course Text Book: A Hands-On Introduction to Machine Learning
This is a much more practical book that has lots of Python code.
Other Books and resources
In the course we
are going to use Python and
scikit-learn for a lot of the
assignments. The user
guide not only
contains a lot of examples, but also a lot of the background
mathematics. I strongly recommend the dive into
scikit-learn’s website. You are
also going to use pandas to parse input
data, numpy for fast manipulation and processing
of arrays, and matplotlib to plot data. All
these libraries have extensive documentation with examples.
Andrew Ng’s material.
Andrew Ng is one of the
giants in machine learning. He has written numerous very important
papers, and it one of the founders of
Coursera.
He has made of videos on machine learning of various formats. He is a
far better lecturer than me, and his lectures will be of much higher
quality than I can record. Some of his lectures are on a whiteboard
and some of them are him talking over his slides. I prefer him talking
over slides and the playlist on youtube can be found
here.
Somebody
has put the slides that he used on
here,
and here you can find notes on
the lectures. The following Github
repository
has his course notes, although they are a bit too advanced for this course.
Other Online Lectures
If you find some particular good online lectures, then please email
them to me.
So far:
- There is a good channel
DataCamp
that has lots of machine learning videos with some real Python
examples in that illustrate many of the concepts in this course.
Machine Learning Terminology
There are a lot of different terms in machine learning that refer to
the same thing. A good resource is
Other recommended text books
There are a lot of books on machine learning. Some are very
theoretical that go into great depth, and some contain a lot of hands
on material that you can use in your own projects. The following list
contains some books that I have found useful, together with links to
the online text at the university
library. The university library has a lot of
online resources on machine learning, as well as access to
O’Reilly’s book catalogue.
I used to put up direct links to the books at the University Library,
but these links are never stable. To find the books below such for
them. You can access the contents of books by logging in via
CAS. The system is a bit flaky, and sometimes you
might have to refresh the book’s page for the system to notice that
you have logged on.
More theoretical books
More practical books
Creating and using Python Virtual Environments
Windows users
These instructions are mostly for Linux or MacOS. If you are windows
there are other options such as AnaConda. You
also have problems with integrating with various windows IDEs. If you
are using windows then maybe these instructions are useful.
In a lot of my courses I encourage students to use python virtual
environments. Virtual environments are a great way of making sure
that you have the correct version of packages installed.
This is very short cheat sheet on how to set them up. I
will assume that we are using python 3. Luckily python 3 has virtual
environments set up. It is all in the
documentation, but then
sometimes people are too lazy to google, or do not know what to google
for.
To create a virtual environment in the current directory do the
following
If you look in the directory you’ll see a sub-directory env
this
contains all files that drive the virtual environment. In particular
there is a sub-directory env/bin/
Justins-MacBook-Air-2:Example justin$ ls env/bin/
Activate.ps1 easy_install pip3.9
activate easy_install-3.9 python
activate.csh pip python3
activate.fish pip3 python3.9
The most important file is the activate
script. To start your
virtual environment you simply do
source ./env/bin/activate
This executes the activate script. Notice that you prompt has changed
to (env)
You are now free to install what ever packages want. For exampe
pip3 install panda
pip3 install numpy
Once you have done all your python goodness you should leave your
virtual environment.
(env) Justins-MacBook-Air-2:Example justin$ deactivate
Setting up Python notebooks locally for Machine Learning
While there are many options for hosting python notebooks
for free remotely, but I prefer to do things locally. If you wish to do
the same then please read on. Note that I did home some problems
installing skikit-image
on a new Apple with M1 silicon, but after
upgrading the latest version of the operating system and doing and
updating my homebrew
setup everything works. Note that these
instructions are for using pip
. If you are using an alternative
package manager such as Anaconda
then you are on your own. I know
nothing about setting up Python on a windows machine.
Setting up the Virtual Environment
First it is better set up a virtual environment (see Creating and
Using Virtual Environments in Python . In a suitable place on your
machine do:
python3 -m venv env
source ./env/bin/activate
Note that you might be prompted to upgrade pip
. If you are, then
follow the instructions. You are only upgrading the version in your
local virtual environment.
Installing the relevant packages
When you prompt changes to (env)
you are inside your virtual
environment. You can now use pip
to install packages. For reasons
that I’m not going to explain your life is made a little easier if you
install wheel
You can now install various useful packages for machine learning
pip install numpy scipy matplotlib scikit-learn pandas scikit-image
treelib nltk
This might take a while. The package jupyter
drives the
notebooks, so install that
Emacs key bindings
I have been using the editor
Emacs for over 30 years. The
key binding are hardwired into my brain. If you are like me then
installing the jupyter-emacskeys
package will make your life a
little less frustrating.
pip install jupyter-emacskeys
Launching notebooks
You talk to Python notebooks using the web browser when you are
running them locally you start your own little server
If you have thing configured then a web browser might open up. If not
then look at the output on the terminal and you will see something
like:
[I 09:47:31.185 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 09:47:31.216 NotebookApp]
To access the notebook, open this file in a browser:
file:///Users/justin/Library/Jupyter/runtime/nbserver-49899-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=83e4ad6deece18a89f8c244f28bb3268d115a4a047764f36
or http://127.0.0.1:8888/?token=83e4ad6deece18a89f8c244f28bb3268d115a4a047764f36
So in this case you can point your browser at
http://127.0.0.1:8888/?token=83e4ad6deece18a89f8c244f28bb3268d115a4a047764f36
and you are ready to go.
Using Anaconda with Windows.
Disclaimer
These instructions come with no warranty. I do not use Windows. If you
have any questions about Windows then I cannot answer them. If you
have any constructive comments the please contact me.
Anaconda is a distribution of Python and R that includes many popular Data Science packages, such as Jupyter Notebook, NumPy, pandas, and scikit-learn. For numerous reasons, it is suggested for working with Jupyter Notebook.
Ease of installation: Anaconda comes with a built-in package
manager called conda, which makes it easy to install and manage
packages, including Jupyter Notebook. Anaconda also automatically
manages dependencies between packages, so you do not have to worry
about conflicts between different versions of packages.
Integrated development environment (IDE): Anaconda includes an
IDE called Anaconda Navigator, which provides a user-friendly
interface for managing your Python environment and launching
Jupyter Notebook.
Large collection of pre-installed packages: Anaconda comes with
a large collection of pre-installed packages, which can save you a
lot of time compared to installing each package individually.
Virtual environment management: Anaconda comes with conda which
is a powerful virtual environment manager that allows you to create
and manage multiple isolated environments, each with its own set of
packages. This can help you to avoid conflicts between packages and
maintain reproducibility of your projects.
Now that we know the advantages of using Anaconda, we proceed with a
detailed explanation of how to install anaconda and subsequently open
Jupyter notebooks through Anaconda. There are two options to do this.
First option
First, download the Anaconda distribution for Windows from
https://www.anaconda.com/products/distribution.
Once the download is complete, double-click the installer file to
begin the installation process.
Follow the prompts in the installer to complete the
installation. During this process, you will be prompted to select
the installation location. By default, Anaconda will be installed
in the C:\ProgramData\Anaconda3 folder, but you can choose a
different location if you prefer. The next screen is the Advanced
Installation Options. Here you have some additional options like if
you want to add anaconda to your PATH variable, and if you want to
register Anaconda as your default Python. Make sure you check the
“Add Anaconda to my PATH environment variable” option, it will make
your life easier.
Once the installation is complete, open the Start menu and search
for “Anaconda Prompt”. Click on the Anaconda Prompt shortcut to
open a command prompt window.
In the command prompt window, type the following command to verify
that Anaconda is installed correctly: conda –version
Next, we need to create a virtual environment in which to install
Jupyter Notebook. In the Anaconda prompt, type the following
command: conda create -n env_name , replacing env_name with the
name you want to give to your virtual environment.
After the environment is created, activate it by typing the
following command: conda activate env_name
Once the virtual environment is active, we can install Jupyter
Notebook by typing the following command: conda install jupyter
After Jupyter Notebook is installed, we can start the Jupyter
Notebook server by typing the following command in the Anaconda
prompt: jupyter notebook
This will open a web browser window with the Jupyter Notebook
interface. From here, you can create new notebooks, open existing
notebooks, and run code cells.
This is a step by step explanation of how to install Anaconda and how
to open Jupyter Notebook through the Anaconda prompt, however, with
the installation of Anaconda it comes Anaconda Navigator. This is a
graphical user interface (GUI) that allows you to easily manage your
Anaconda environments and packages, as well as launch Jupyter Notebook
and other applications. Let us see how to open Jupyter Notebook with
Anaconda Navigator (more intuitive than the first option).
Second option
Open the Anaconda Navigator application. You can find it by
searching for “Anaconda Navigator” in the Start menu (Windows) or
Spotlight (Mac).
Once the Anaconda Navigator window is open, you should see Jupyter
Notebook in the list of applications. Click on the “Launch” button
next to Jupyter Notebook.
A new tab will open in your default web browser, and you will be
taken to the Jupyter Notebook interface.
You should see the list of files and directories in the current
working directory, you can navigate to the desired folder and
create a new notebook by clicking the “New” button on the right
corner and select “Python 3”.