#+TITLE: Machine Learning
#+AUTHOR: Mueen Nawaz
#+OPTIONS: toc:2 todo:nil
#+OPTIONS: H:10
#+LaTeX_HEADER: \let\iint\relax
#+LaTeX_HEADER: \usepackage{amsmath}
#+LaTeX_HEADER: \usepackage{mathrsfs}
#+LaTeX_HEADER: \newcommand{\union}{\bigcup}
#+LaTeX_HEADER: \newcommand{\inter}{\bigcap}
#+LaTeX_HEADER: \usepackage{color}
#+LaTeX_HEADER: \usepackage{soul}
#+LaTeX_HEADER: \newcommand{\q}[1]{\hl{#1}}
#+LaTeX_HEADER: \usepackage{dsfont}
#+LaTeX_HEADER: \DeclareMathOperator{\tr}{tr}
#+LaTeX_CLASS_OPTIONS: [integrals, nointegrals]
* Ideas to Try :noexport:
1. Othello training
2. Confirmation Bias
3. Solar Cell optimization
4. Classification: Is a baby's name Muslim or not?
* Tips from Caltech :noexport:
 Criteria for using a ML algorithm:
 A pattern should exist.
 Hard to pin this pattern down mathematically (hence the need for
data).
 Data is available.
 In a sense, ML is the inverse of a model. A model takes inputs and
using well defined equations (based on theory) gives you an output. ML
looks at the inputs/outputs, and is trying to learn what the model is.
 He uses $g$ for the /final/ hypothesis. $h$ is used for any hypothesis
in the set of hypotheses.
 He uses $N$ for the size of the training set.
 $t$ is the target function. It is the "true" function and is what $g$
approximates.
 $\mathcal{H}$ is the set of all hypotheses. Our algorithm will select
one from it.
 The *learning model* is the combination of the /algorithm/ and the
/hypothesis set/.
 *Reinforcement Learning*: You have inputs, and /graded/ outputs. Kind
of like the 246 confirmation bias problem. It queries and gets a
response on whether it is correct.
 Often used for playing games.
 Try /not/ to look at or analyze the training set. You may end up with
an algorithm that's tailored to your data.
 Let's say we have a container of red and green balls. We want to
estimate the fraction of red balls by taking a sample of size $N$. Let
the proportion in the sample be $\nu$ and in the population be $\mu$.
 *Hoeffding's Inequality*: $P(|\nu-\mu|>\epsilon)\le2e^{-2\epsilon^2N}$
\q{Not proven.}
 This gives you an idea of what $N$ you need for your estimate to be
accurate within $\epsilon$.
 Note that the bound on the RHS does not depend on $\mu$ !
 P.A.C = Probably Approximately Correct.
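 A quick simulation (mine, not from the lectures; the function names are made up) makes the bound concrete:

```python
import math
import random

def hoeffding_bound(eps, N):
    """RHS of Hoeffding's inequality: 2*exp(-2*eps^2*N)."""
    return 2 * math.exp(-2 * eps ** 2 * N)

def sample_deviation(mu, N, eps=0.1, trials=20000, seed=0):
    """Empirical P(|nu - mu| > eps): draw `trials` samples of N balls
    from a bin whose red fraction is mu, and count how often the
    sample fraction nu strays from mu by more than eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        nu = sum(rng.random() < mu for _ in range(N)) / N
        if abs(nu - mu) > eps:
            bad += 1
    return bad / trials
```

 Note that the bound never references $\mu$, matching the remark above: the same RHS covers any bin.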
 *Regression* just means real-valued.
 He also advocates computing the pseudoinverse.
 His notation:
 $[\![a<1]\!]$ means return 1 if the "argument" is true, 0 otherwise.
 When he does a nonlinear transformation to the inputs (e.g. $x^2$),
he calls the new feature $z_i$
 He uses $\tilde{\mathbf{w}}$ to denote the weights in the $Z$-space.
 Error measure (how "close" one function is to another):
 Nothing to do with measure theory.
 *Pointwise definition*: $E(h,f)$ is built from a pointwise error
$e(h(\mathbf{x}),f(\mathbf{x}))$, e.g. the squared error
$(h(\mathbf{x})-f(\mathbf{x}))^2$.
 What you use to measure the error depends on the problem (which the
/customer/ needs to provide). As an example, the exact same machine
learning problem can require different false accepts/rejects for
different applications.
 If the customer doesn't supply it, then pick one. Perhaps one that
makes convergence easy (convex optimization, etc).
** Noisy Targets
 This is why we use $P(y|\mathbf{x})$ instead of $y=f(\mathbf{x})$
where $f$ is the "true" function we're trying to learn, because in a
real-world problem you won't have a real function (i.e. it'll be
multivalued).
 Can model a noisy target as a deterministic function $+$ some noise.
 So $f(\mathbf{x})=E(y|\mathbf{x})$ plus some noise
($y-f(\mathbf{x})$).
 If you really have a deterministic target, you can do
$P(y|\mathbf{x})=0$ except when $y=f(\mathbf{x})$.
 $P(y|\mathbf{x})$ and $P(\mathbf{x})$ are both unknown. We are trying
to learn the former; we don't care too much about the latter.
 Merge $P(\mathbf{x})P(y|\mathbf{x})$ into the joint
$P(\mathbf{x},y)$. Keep in mind we're /not/ trying to learn the latter!
 As you make your model more complex, your in-sample error goes down,
but the discrepancy between the in-sample and out-of-sample error goes
up.
 Regularization helps.
** More Theory
 Hoeffding's Inequality with the union bound (used in training):
$P(|\nu-\mu|>\epsilon)\le2Me^{-2\epsilon^2N}$.
 $M$ is the number of hypotheses.
 $M$ could be infinite, in which case this is a useless bound.
 The reason the bound is so poor is that it comes from assuming the
"worst" case, where all the events are disjoint. In reality, the
intersection (overlap) of "bad" events is often /huge/.
 We get around this limitation by not looking at the whole hypothesis
set (which can be infinite), but by sampling a finite set of points
and looking at those.
 Think of the set as being "covered", and those points are holes we
peer through and see the classification endowed by a hypothesis.
 If you change the hypothesis slightly such that those points remain
of the same classification, then you as an observer will not be able
to detect it.
 So small changes in the hypothesis are irrelevant (and those are
also the ones with large overlaps).
 Use these sample points to form a bound on the error.
 A *dichotomy* is a hypothesis whose domain is not $\mathcal{X}$, but
is instead a /finite/ set of points, say $N$ of them.
 It is simply a restriction of a hypothesis from the original domain
to a finite domain.
 The number of hypotheses can be infinite.
 The number of dichotomies is at most $2^N$; at most because it's not
necessary that the hypothesis set really did cover all $2^N$ cases...
 Denote the set of dichotomies to be
$\mathcal{H}(\mathbf{x}_1,\dots,\mathbf{x}_N)$.
 The *growth function* counts the most dichotomies one can get from a
hypothesis set on any $N$ points.
 So it is a function of both $N$ and $\mathcal{H}$.
 Definition:
$m_{\mathcal{H}}(N)=\max_{\mathbf{x}_1,\dots,\mathbf{x}_N\in
\mathcal{X}}|\mathcal{H}(\mathbf{x}_1,\dots,\mathbf{x}_N)|$
 $m_{\mathcal{H}}(N)\le2^N$
 The goal is to replace $M$ in the inequality above with
$m_{\mathcal{H}}$.
 In his lecture notes he gives several examples of calculating the
growth function for special cases.
 If we can bound the growth function to a polynomial, then that's a
great bound (as the exponential will kill it off with a sufficiently
large $N$).
 If no data set of size $k$ can be shattered by $\mathcal{H}$, then $k$
is a *break point* of $\mathcal{H}$.
 Another way of saying it: the smallest break point is the smallest
$k$ for which you cannot get all $2^k$ dichotomies.
 A break point reduces the bound strictly below $2^N$, and it doesn't
matter if you choose $N$ to be greater than $k$.
 _Key result_: If a finite break point exists, the growth function is
guaranteed to be polynomial.
 This means that the growth function is either $2^N$, or it is
polynomial! Nothing in between.
 All the discussion in this section is about binary
classification. There are counterparts for real-valued functions.
 If your hypothesis set shatters the set, it's good because it means
you can fit the data exactly. It's bad in terms of the bound it
provides as well as for generalization. You may overfit.
 Determining the break point (or a bound for it) for a given problem
may be challenging, but it has been done for the famous algorithms.
*** $m_{\mathcal{H}}(N)$ is polynomial if $k$ is finite
 Define $B(N,k)$ as the maximum number of dichotomies on $N$ points
with a break point of $k$.
 By pattern he means an $N$-tuple of $+1$'s and $-1$'s (positive and
negative class).
 If you have $N=3$ and $k=2$, then how many $N$-tuples can you have
without breaking the limitation set by $k$? In general a truth table
will have $2^N$ /without/ the break point constraint. So how many
entries can a truth table have without violating that constraint?
 \q{Does the definition of $B$ pertain only to binary
classification?}
 Does /not/ depend on any $\mathcal{H}$ or $\mathcal{X}$ ! It is
purely a combinatorial quantity!
 We want to show that $B$ has a polynomial bound.
 The proof involves recursion.
 Write the truth table with $\mathbf{x}_i$ as your columns from
$i=1,\dots,N$, keeping the constraint in mind.
 Isolate $\mathbf{x}_N$
 Construct a "maximal" truth table and partition it into 3 sets:
 $S_2$ and $S_3$ are the rows where $\mathbf{x}_N$ is $+1$ and $-1$
respectively /and/ where for any row in $S_2$, you have the exact
same row in $S_3$ with $\mathbf{x}_N$ reversed (and vice
versa). Let $\beta=|S_2|=|S_3|$
 Let $S_1$ be the set of leftovers. Let $|S_1|=\alpha$
 Then $B=\alpha+2\beta$
 Now look at $S_1$ and $S_2$ and omit $\mathbf{x}_N$
 All the entries are unique. Convince yourself of this.
 The $k$ constraint still holds. If it didn't, then this set with
$\mathbf{x}_{N}$ added would violate the constraint. Convince
yourself of this.
 Thus $\alpha+\beta\le B(N-1,k)$. Note we can't claim the equality
sign as we haven't shown this "subtable" is maximal.
 Now focus on $S_2$ while omitting $\mathbf{x}_N$.
 The break point on it must be at most $k-1$. If not, one could
combine this with $S_3$ and get a shattered set of size $k$; that
would be a violation.
 Hence $\beta\le B(N-1,k-1)$
 This gives you a recursive statement.
 $B(N,k)\le\sum_{i=0}^{k-1}\dbinom{N}{i}$
 This is a _polynomial_ bound.
 Note that $m_{\mathcal{H}}(N)\le B(N,k)$
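 A quick numerical check (my code) that the recursion, taken with equality at the base cases $B(N,1)=1$ and $B(1,k\ge2)=2$, reproduces the binomial sum exactly:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B(N, k):
    """Recursion B(N,k) = B(N-1,k) + B(N-1,k-1).
    Base cases: B(N,1) = 1 (no single point may be shattered, so only
    one dichotomy is allowed) and B(1,k>=2) = 2."""
    if k == 1:
        return 1
    if N == 1:
        return 2
    return B(N - 1, k) + B(N - 1, k - 1)
```

 For every $N$ and $k$ this matches $\sum_{i=0}^{k-1}\binom{N}{i}$, a polynomial in $N$ of degree $k-1$.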
*** $m_{\mathcal{H}}(N)$ is a substitute for $M$
 See his sketch of the proof.
 New bound:
$P(|\nu-\mu|>\epsilon)\le4m_{\mathcal{H}}(2N)e^{-\frac{1}{8}\epsilon^2N}$
 This is the *Vapnik-Chervonenkis* inequality.
 *Important*: Keep in mind that this bound is for binary
classification! It has to be modified for real-valued regression, and
is not all that useful a result.
** VC Dimension
 His notation: $d_{VC}(\mathcal{H})$
 It is the largest value of $N$ for which $m_{\mathcal{H}}(N)=2^N$
 $m_{\mathcal{H}}(N)\le\sum_{i=0}^{d_{VC}}\dbinom{N}{i}$
 Note that the maximum power on the RHS is $d_{VC}$
 VC Dimension always gives you an upper bound. It can be very
pessimistic.
 However, empirically it has been observed that the actual value
somewhat tracks the VC bound. If your bound increases, so does your
value.
 Rule of thumb: Need $N\ge10d_{VC}$ ($N$ is the number of training
samples). This is /really/ a heuristic (no dependence on
$\delta,\epsilon$).
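 For comparison, one can back out the $N$ the VC bound itself demands. A sketch (my code; it rearranges $\epsilon=\sqrt{(8/N)\ln(4m_{\mathcal{H}}(2N)/\delta)}$ for $N$ and uses the polynomial bound $m_{\mathcal{H}}(N)\le N^{d_{VC}}+1$):

```python
from math import log

def vc_sample_size(d_vc, eps, delta, n0=1000.0, iters=200):
    """Fixed-point iteration of N = (8/eps^2) * ln(4*((2N)^d_vc + 1)/delta):
    the N at which the VC bound guarantees generalization error <= eps
    with probability >= 1 - delta."""
    N = n0
    for _ in range(iters):
        N = 8.0 / eps ** 2 * log(4.0 * ((2.0 * N) ** d_vc + 1.0) / delta)
    return N
```

 For $d_{VC}=3$, $\epsilon=0.1$, $\delta=0.1$ this gives roughly $3\times10^4$ examples, far above the $10\,d_{VC}$ heuristic, which illustrates how pessimistic the bound is.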
** Bias/Variance Tradeoff
 He has a really good lecture on this.
 He showed if you have a sine function (I think over one period), and
you're given only two points to learn from, then on average you'll do
better fitting it with a constant vs a linear function!
 Match the model complexity to the data resources you have, and /not/
to the target complexity!
* Linear Regression
** Definitions
 *Supervised Learning*: Informally it means you gave the algorithm a
data set with the "correct" answers specified. There's also
*unsupervised* learning as well as *reinforcement* learning.
 *Regression*: Predict continuous valued output.
 _Notation_:
 $m$ = Number of training examples.
 $x$ = Input, also called the input *feature*
 $y$ = Output, also called the *target*
 $(x^{(i)},y^{(i)})$ is the $i$ th *training example*.
 The full set of training examples is called the *training set*.
 $\mathcal{X}$ is the space of inputs, $\mathcal{Y}$ the space of
outputs.
 $p(y|x;\theta)$ vs $p(y|x,\theta)$. In the former, $\theta$ is not a
random variable, it is just a parameter. In the latter, $\theta$
/is/ a random variable, and we're saying "the probability of $y$
given $x$ /and/ $\theta$". The two cases are differentiated with the
use of a comma versus a semicolon.
 $h$ is the *hypothesis*: It takes $x$ as the input and computes
$y$. Formally, $h:\mathcal{X}\mapsto\mathcal{Y}$
 *Linear regression* with one variable:
$h(x)=\theta_{0}+\theta_{1}x$. Also called *univariate* linear regression.
 If you have multiple input variables (denoted by $x_{j}^{(i)}$ where
the subscript denotes another feature of the input), then
$h(x)=\sum_{i=0}^{n}\theta_{i}x_{i}=\Theta^{T}\mathbf{x}$. Note that
$x_{0}=1$ always; it was introduced for notational convenience and is
called the *intercept term*.
 $\theta_{i}$'s are called the *parameters*, or *weights*.
 Want to minimize
$\frac{1}{2m}\sum_{i=1}^{m}\left(h(x^{(i)})-y^{(i)}\right)^{2}$ over
$(\theta_{0},\theta_{1})$. The leading fraction is just to simplify
the math later on.
 More general notation:
$\min_{\theta_{0},\theta_{1}}J(\theta_{0},\theta_{1})$ where $J$ is
the *cost function*.
 In our case $J$ is the squared error function.
** Gradient Descent
 *Gradient Descent* algorithm:
$\theta_{j}^{n+1}=\theta_{j}^{n}-\alpha\frac{\partial}{\partial\theta_{j}}J$
or $\Theta^{n+1}=\Theta^{n}-\alpha\nabla J$
 $\alpha$ is the *learning rate*. It is always positive.
 It controls the step size in the direction of maximum change. Too
small and the convergence is slow. Too large and it will overshoot
and possibly diverge.
 This "rule", when applied to the linear cost function, is also
called the *LMS* (least mean squares) update, or the *Widrow-Hoff*
learning rule. Note that the update is proportional to the error.
 For linear regression, the cost function is always convex, so any
minimum you converge to is the global minimum. \q{Why?}
 *Batch* gradient descent: Each iteration uses all the training
examples.
 There is also the *stochastic* gradient descent. Here we update
$\Theta$ every time we come across an input $x^{(i)}$.
 The difference between stochastic and batch is that the latter
updates all the $\theta_{j}$'s with all the $x^{(i)}$'s (i.e. each
time each $\theta_{j}$ is updated, we have to calculate over all
inputs, and then we calculate the new cost function). The stochastic
approach iterates over the inputs. When it encounters an input, it
updates all the parameters, and then calculates the cost
function. Then it moves on to the next input.
 Stochastic gradient can get closer to the minimum much more quickly
but is not guaranteed to hit a minimum; it may oscillate around
it. \q{Why?} This can be remedied by decreasing $\alpha$ as
it approaches the minimum.
 For the simple case of linear regression, you can always use least
squares method which is noniterative. However, I think Andrew Ng
claims that for large data sets, gradient descent is quicker?
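 A sketch of both variants for univariate linear regression (my code, synthetic data; function names are mine):

```python
import random

def batch_gd(xs, ys, alpha=0.5, iters=2000):
    """Batch gradient descent for h(x) = t0 + t1*x with the squared-error
    cost J = (1/2m) * sum (h(x)-y)^2; every update sweeps all examples."""
    t0 = t1 = 0.0
    m = len(xs)
    for _ in range(iters):
        errs = [t0 + t1 * x - y for x, y in zip(xs, ys)]   # h(x) - y
        t0 -= alpha / m * sum(errs)
        t1 -= alpha / m * sum(e * x for e, x in zip(errs, xs))
    return t0, t1

def stochastic_gd(xs, ys, alpha=0.05, epochs=200, seed=0):
    """Stochastic gradient descent: the parameters are updated after
    every single training example."""
    rng = random.Random(seed)
    t0 = t1 = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            e = t0 + t1 * xs[i] - ys[i]
            t0 -= alpha * e
            t1 -= alpha * e * xs[i]
    return t0, t1
```

 On noiseless data $y=1+2x$ both recover $\theta\approx(1,2)$; on noisy data the stochastic version would hover around the minimum unless $\alpha$ is decayed, as noted above.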
*** Feature Scaling & Normalizing
 If the features $x_{j}$ are of widely different scales (e.g. one of
them is over 1000, and the other is between 0 and 1), you'll have
convergence issues (or slow convergence).
 To get around this, scale all your features so that they are all
around unity.
 It is also desirable to offset your features such that the mean is
roughly 0. This way after scaling your features will (roughly) be
between $-1$ and $1$. \q{Why is this important?}
 More generally, let $x_{j}'=\frac{x_{j}-\mu}{s}$ where $s$ is the
_range_. One could also use the standard deviation.
 Don't get too dogmatic about these transformations.
 If you're solving directly using the normal equations (explained
later), then scaling is unnecessary as no iteration is
involved. \q{I strongly suspect scaling can be an issue even
for direct approaches!}
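 A sketch of the transformation (mine), one feature column at a time:

```python
def scale_features(column):
    """Mean-normalize one feature: subtract the mean, divide by the
    range (the standard deviation would also work)."""
    mu = sum(column) / len(column)
    s = max(column) - min(column)
    return [(x - mu) / s for x in column]
```

 After this, each feature has roughly zero mean and lies (roughly) between $-1$ and $1$.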
*** Convergence Heuristics
 Plot $J(\Theta)$ vs the number of iterations (can do this while you're
iterating; output $J(\Theta)$ to a file).
 Ideally we'd like to decrease fast and monotonically.
 One could use some automatic criterion for convergence (e.g. percent
change in $J$). However, Andrew has not had good experience with it
and often just looks at the plot and kills the iteration after he
feels confident it has converged.
 If your $J$ shoots up sharply with the number of iterations, reduce
your $\alpha$.
 Similarly if it oscillates, use a lower $\alpha$.
 There should be some $\alpha_{0}$ below which you always get a
decrease. However, it's not obvious what that is and too low an
$\alpha$ will lead to slow convergence.
 Andrew starts from a low $\alpha=0.001$, and tries it. He increases it
each time by a factor of 3 until he hits a sweet spot.
** Alternatives to Gradient Descent
 Alternatives:
 Conjugate Gradient
 BFGS
 LBFGS
 Advantages to the above:
 No need to pick $\alpha$
 Usually faster
 Disadvantages:
 More complex
 Don't write your own implementations.
** Gradient Checking
 Many optimization algorithms require a gradient. If you supply one
(and not an approximation), also write up an approximation
routine.
 Start a run with gradient checking, and verify that it is
approximately close to your claimed gradient.
 Then kill the run, disable the checking, and run the algorithm.
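 A minimal sketch of the check (my code; central finite differences):

```python
def check_gradient(f, grad, theta, eps=1e-5, tol=1e-6):
    """Compare an analytic gradient `grad` against central finite
    differences of `f` at the point `theta`, coordinate by coordinate."""
    for j in range(len(theta)):
        tp = list(theta); tp[j] += eps
        tm = list(theta); tm[j] -= eps
        numeric = (f(tp) - f(tm)) / (2 * eps)
        if abs(numeric - grad(theta)[j]) > tol:
            return False
    return True
```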
** Matrix Derivatives
 Let $f:\mathds{R}^{m\times n}\mapsto\mathds{R}$. Define the derivative
of $f$ with respect to matrix $A$ as:
\begin{equation}
\nabla_{A}f(A)=\begin{pmatrix}\frac{\partial f}{\partial A_{11}} &
\cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots
\\ \frac{\partial f}{\partial A_{m1}} &
\cdots & \frac{\partial f}{\partial A_{mn}} \end{pmatrix}
\end{equation}
 $\tr(AB)=\tr(BA)$ assuming $AB$ is square. This extends to a product
of three or more matrices (cyclic rotation).
 $|A|$ is the determinant of $A$.
 $\nabla_{A}\tr(AB)=B^{T}$
 $\nabla_{A^{T}}f(A)=\left(\nabla_{A}f(A)\right)^{T}$
 $\nabla_{A}\tr(ABA^{T}C)=CAB+C^{T}AB^{T}$ \q{Show!}
 $\nabla_{A}|A|=|A|\left(A^{-1}\right)^{T}$ assuming $A^{-1}$
exists. \q{Show! To prove it, write $A^{-1}$ in terms of its
adjoint.}
** Least Squares Revisited
 Given a training set, define the *design matrix* $X$ to be the
$m\times n$ matrix that contains the training examples' input values
in its rows:
\begin{equation}
X=\begin{pmatrix}(x^{(1)})^{T} \\ \vdots \\ (x^{(m)})^{T}\end{pmatrix}
\end{equation}
This of course assumes $n$ features.
 Let $\mathbf{y}$ be the $m$ dimensional vector with the outputs.
 As $h_{\theta}(x^{(i)})=\left(x^{(i)}\right)^{T}\Theta$, then
$X\Theta$ has as its elements $h_{\theta}(x^{(i)})$ and
$X\Theta-\mathbf{y}$ is the vector of errors.
 It is straightforward to show that
$J(\Theta)=\frac{1}{2}(X\Theta-\mathbf{y})^{T}(X\Theta-\mathbf{y})$. Note
that $z^{T}z$ is the sum of the squares of the elements.
 Minimize $J$ by calculating $\nabla_{\Theta}J(\Theta)$ and setting it
to 0. The result is the *normal equation*:
$X^{T}X\Theta=X^{T}\mathbf{y}$. The solution is
$\Theta=(X^{T}X)^{-1}X^{T}\mathbf{y}$.
 $X^TX$ can be singular, but that is rare. If it happens, use the
pseudoinverse. If this happens:
 Check that you don't have duplicate training examples. This can
happen if you have linear dependence (e.g. same inputs but in
different units).
 Check that you don't have more features than the size of your data
set. If you do, either delete some or use regularization (covered
later).
 If you're solving using the normal equation (without iterating), then
feature scaling is unnecessary. \q{I have my doubts...}
 The normal method doesn't scale well with respect to
$n$. \q{Andrew's argument for this is the cost of inverting a
matrix. However, one should not have to explicitly invert the matrix
to solve a linear system! Is his argument valid if one simply does a
simple solution without inversion?}
 Andrew's heuristic. $n=1000$ is OK. $n=10^4$ is borderline. More
than that and he'd use gradient descent.
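 Related to the doubt above: a sketch (mine) that solves the normal equation by Gaussian elimination, never forming $(X^{T}X)^{-1}$ explicitly:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting;
    no explicit inverse is ever formed."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[p] = M[p], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def normal_equation(X, y):
    """Least-squares fit: solve (X^T X) theta = X^T y directly."""
    n = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(n)]
           for a in range(n)]
    Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(n)]
    return solve(XtX, Xty)
```

 Forming and solving the $n\times n$ system is still cubic in $n$, so the scaling concern stands even without an explicit inverse.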
** Probabilistic Interpretation
 We can assume $y^{(i)}=\Theta^{T}\mathbf{x}^{(i)}+\epsilon^{(i)}$
where $\epsilon^{(i)}$ is the error term.
 Assume all the error terms (across all $i$'s) are independent and
identically distributed Gaussians, and that the mean of the error
is 0.
 It can be shown that the maximum-likelihood estimate of $\Theta$ is
the one that solves the normal equations.
 $L(\Theta;X,\mathbf{y})=p(\mathbf{y}|X;\Theta)$
 $L(\theta)=\prod_{i=1}^mp(y^{(i)}|x^{(i)};\theta)$. The product is
because it is assumed each training example is independent of the
others.
 This solution is independent of the $\sigma$ of our distributions. Of
course, it is assumed that they are identical.
** Locally Weighted Linear Regression
 In the usual linear regression, we're trying to minimize
$\sum_{i}\left(y^{(i)}-\Theta^{T}\mathbf{x}^{(i)}\right)^2$. A modification is
to minimize
$\sum_{i}w^{(i)}\left(y^{(i)}-\Theta^{T}\mathbf{x}^{(i)}\right)^2$. The
*weights* are nonnegative.
 Intuitively, the weights give some training examples more preference
over others.
 A standard choice for the weights is
$w^{(i)}=\exp\left(-\frac{\left(x^{(i)}-x\right)^2}{2\tau^2}\right)$.
 Note that this makes the weight a function of $x$.
 One benefit of this is that if $x$ is close to one of the inputs in
the training set, that input's weight is close to 1, so the fit is
pulled towards matching that training sample well.
 $\tau$ controls how quickly the weight declines away from a
training point. It is called the *bandwidth* parameter.
 If $x$ is actually $\mathbf{x}$, then the numerator is
$(\mathbf{x}^{(i)}-\mathbf{x})^{T}(\mathbf{x}^{(i)}-\mathbf{x})$. Sometimes
$(\mathbf{x}^{(i)}-\mathbf{x})^{T}\Sigma^{-1}(\mathbf{x}^{(i)}-\mathbf{x})$
is used for some choice of $\Sigma$
 Similarities with the Gaussian (or probability in general) are
superficial. Don't look for a deeper connection. One can even use
weighting functions that integrate to infinity.
 This algorithm is an example of a *nonparametric* one. A *parametric*
algorithm uses the training data to get the parameters, and once
trained, we can discard the training data. In this algorithm the
training data is part of the algorithm (at least if it depends on $x$).
 Every time you make a prediction, you need to retrain
everything. \q{Why?}
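 A univariate sketch (mine; the weighted $2\times2$ normal equations are solved in closed form). Note the training set is an argument of the prediction itself, which is exactly the nonparametric point above:

```python
from math import exp

def lwr_predict(xs, ys, x, tau=0.5):
    """Locally weighted linear regression prediction at query point x.
    Fits h(t) = t0 + t1*t by weighted least squares with weights
    w_i = exp(-(x_i - x)^2 / (2 tau^2))."""
    w = [exp(-(xi - x) ** 2 / (2 * tau ** 2)) for xi in xs]
    s = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, xs))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, xs))
    sy = sum(wi * yi for wi, yi in zip(w, ys))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
    # Solve [[s, sx], [sx, sxx]] [t0, t1] = [sy, sxy]
    det = s * sxx - sx * sx
    t0 = (sxx * sy - sx * sxy) / det
    t1 = (s * sxy - sx * sy) / det
    return t0 + t1 * x
```

 Every call refits from scratch, i.e. you "retrain" per prediction.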
** Random Tips
 Make sure your hypothesis function's form makes sense.
 Don't pick a quadratic hypothesis if extrapolating gives you
nonsensical values (e.g. a large enough input causes a decrease when
you know from reality that never happens). Pick a cubic
instead. \q{Although frankly most cubics will have a drop
(two extrema).}
 Can convert a polynomial of one feature into a linear function of many
features. Just let $x_{2}=x^2$ and so on.
* Classification
*Classification* is another task ML is used for. It is used for
discrete outputs, usually of finite size and few in number.
 *Binary classification* is where the output takes only two values. Let
it be 0 and 1.
 0 is called the *negative class* and 1 the *positive class*. Sometimes
they are denoted by $-$ and $+$.
 Given $x^{(i)}$ the $y^{(i)}$ is called the *label* for the training
example.
 The *decision boundary* is the "line" (or plane or surface) that
separates the two classes. It is also called the *separating
hyperplane*.
 The boundary is a property of our hypothesis, not of the data set.
** Logistic Regression
 Of course, one could try using a linear regression algorithm for a
classification problem as it is simply a special case, but it's easy
to construct problems where linear regression has poor
performance.
 An example is where an outlier in the training set will
significantly alter your hypothesis (compared to when you didn't
have an outlier).
 Try
\begin{equation}
h_{\Theta}(x)=g(\Theta^{T}\mathbf{x})=\frac{1}{1+e^{-\Theta^{T}\mathbf{x}}}
\end{equation}
 $g(z)$ is called the *logistic* or *sigmoid* function.
 \q{Insert plot here}.
 $g(z)$ tends to 0 as $z\rightarrow-\infty$ and tends to 1 as
$z\rightarrow\infty$. It's monotonic and smooth, and bounded
between 0 and 1. This is important as it is a binary
classification problem.
 Note that for this form of the hypothesis, the decision boundary is
given by $z=\Theta^T\mathbf{x}=0$. Examining this can give you
insights.
 Can also use other functions that smoothly go from 0 to 1.
 Note that $g'(z)=g(z)(1-g(z))$
 Using maximum likelihood we can compute the update rule. Let
$P(y=1|x;\theta)=h_{\theta}(x)$ and $P(y=0|x;\theta)=1-h_{\theta}(x)$
for a given training example.
 This can be written as $p(y|x;\theta)=(h_{\theta}(x))^y(1-h_{\theta}(x))^{1-y}$.
 Then we want to /maximize/ $\sum_{i=1}^m(y^{(i)}\log
h(x^{(i)})+(1-y^{(i)})\log(1-h(x^{(i)})))$
 You can make your cost function to be the negative of the
above. Divide by $m$ for some kind of normalization.
 This cost function makes sense if you consider its values near 1 and
0 (when correct or when wrong)
 Had we used the same cost function as for linear regression, our cost
function would be nonconvex.
 Using maximum likelihood we get the update rule:
$\theta_j^{n+1}=\theta_{j}^n+\alpha\sum_{i=1}^m\left(y^{(i)}-h_{\theta}(x^{(i)})\right)x_{j}^{(i)}$
This is the _same_ as for linear regression, but our hypothesis is no
longer linear.
 Note that this approach is essentially assuming a Bernoulli distribution.
 Note that this method does not output $0$ or $1$; it can output
anything in between.
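 A sketch of the update rule in code (mine; univariate, batch gradient ascent with a $1/m$ normalization, synthetic data in the usage below):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def logistic_fit(xs, ys, alpha=0.5, iters=5000):
    """Logistic regression h(x) = g(t0 + t1*x), trained by gradient
    ascent on the log-likelihood:
    theta_j += alpha * (1/m) * sum_i (y_i - h(x_i)) * x_ij."""
    t0 = t1 = 0.0
    m = len(xs)
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            e = y - sigmoid(t0 + t1 * x)   # same (y - h) form as LMS
            g0 += e
            g1 += e * x
        t0 += alpha * g0 / m
        t1 += alpha * g1 / m
    return t0, t1
```

 The decision boundary is $t_0+t_1x=0$; the output $h(x)$ itself is a probability, not a hard 0/1 label.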
** The Perceptron Learning Algorithm
 What if we want it to output either 0 or 1? Make $g(z)$ be the step
function and you get the *perceptron learning algorithm* if we use the
same update rule.
 \q{Does this update rule follow from maximum likelihood?}
** Another way to maximize the maximum likelihood function
 Generalization of Newton's Method to multiple dimensions:
$\Theta^{n+1}=\Theta^n-H^{-1}\nabla_{\Theta}l(\Theta)$
 $H$ is the *Hessian*
matrix: $H_{ij}=\frac{\partial^2l}{\partial\theta_{i}\partial\theta_j}$
 Faster convergence than batch gradient descent.
 Each iteration is more expensive due to inverting a matrix. Not that
much of a downside if the matrix is not large.
 *Fisher scoring* is when this method is applied to the
maximum-likelihood function.
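 A sketch for the univariate logistic case (mine; the $2\times2$ Hessian is inverted in closed form, which is the "expensive inversion" step in general; the data in the usage below is synthetic):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def newton_logistic(xs, ys, iters=25):
    """Newton's method for logistic regression with h(x) = g(t0 + t1*x):
    theta <- theta - H^{-1} grad l, where H = -[[h00, h01], [h01, h11]],
    so -H^{-1} grad has a closed 2x2 form."""
    t0 = t1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(t0 + t1 * x)
            g0 += y - p                 # gradient of the log-likelihood
            g1 += (y - p) * x
            w = p * (1 - p)             # contributions to -H
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        t0 += (h11 * g0 - h01 * g1) / det
        t1 += (h00 * g1 - h01 * g0) / det
    return t0, t1
```

 On a small non-separable set this converges in a handful of iterations, versus thousands for batch gradient descent.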
** Classifying into $k$ Categories
*one-vs-all*:
 Take one category, and do a logistic regression on it vs everything
else.
 Do the same for all the other categories.
 Then when you have a data point, try all $k$ categories and see which
one fits it best.
* Generalized Linear Models
** The Exponential Family
 $p(y;\eta)=b(y)\exp\left(\eta^TT(y)-a(\eta)\right)$
 $\eta$ is called the *natural* or *canonical* parameter.
 $T(y)$ is the *sufficient statistic*. It will often simply be $y$.
 $a(\eta)$ is the *log partition* function. $e^{-a(\eta)}$ essentially
plays the role of a normalization constant s.t. the integral of $p$
over the whole space is 1.
 Fixing $T,a,b$ gives a /family/ of distributions parametrized by
$\eta$.
 Bernoulli and Gaussian distributions are just special cases.
 Bernoulli: $b(y)=1$, $\eta=\log(\phi/(1-\phi))$, $T(y)=y$,
$a(\eta)=-\log(1-\phi)$
 Gaussian: We're free to use any $\sigma$ (it did not impact the
linear regression) so for convenience we set it to 1. $\eta=\mu$,
$T(y)=y$, $a(\eta)=\eta^{2}/2$, $b(y)=\left(1/\sqrt{2\pi}\right)\exp\left(-y^2/2\right)$
 Can also derive the multinomial, Poisson, gamma, exponential, beta,
Dirichlet distributions from this.
** Constructing GLMs
 GLM means *generalized linear models*
 3 assumptions:
1. $y|x;\theta$ is a member of some exponential family.
2. Given $x$, the goal is to predict $T(y)$. This is normally $y$. In
other words we'd like $h(x)=E[y|x]$. This is satisfied for both the
logistic and the linear regression.
3. $\eta$ and $\mathbf{x}$ are linearly related:
$\eta=\Theta^T\mathbf{x}$
 The third assumption is a design choice of ours.
*** Ordinary Least Squares
 This is a special case of the GLM family.
 Assume $y$ is continuous, and the distribution of $y$ given $x$
follows a Gaussian where $\mu$ may be a function of $x$.
 From the previous section's "equivalence" of the exponential family
and the Gaussian, we have $h_{\theta}(x)=\Theta^{T}\mathbf{x}$
*** Logistic Regression
 Assuming a Bernoulli distribution gives us
$h_{\theta}(x)=1/(1+\exp(-\Theta^T\mathbf{x}))$
 $g(\eta)=E[T(y);\eta]$ is called the *canonical response
function*.
 Its inverse is called the *canonical link function*.
 Thus for the Gaussian family $g$ is just the identity function
$g(z)=z$ and for the Bernoulli it is the logistic function.
*** Softmax Regression
 Assume we need to classify into $k$ categories. We use a multinomial
distribution.
 Let the output be from $\{1,2,\cdots,k\}$
 Define $k-1$ parameters $\phi_i$, which is the probability of getting
$i$ as the outcome. We have $k-1$ of them as we have the constraint
that $\sum_{i=1}^k\phi_i=1$
 Let:
\begin{equation}
T(i)=\begin{pmatrix}0\\\vdots\\1\\\vdots\\0\end{pmatrix}
\end{equation}
where the $i$ th row is 1 and $T(y)\in\mathds{R}^{k-1}$
 For convenience, $T(k)$ is the vector of 0's.
 Notation. Let $1\{\cdot\}$ be the _indicator_ function whose value is
1 if the argument is true and 0 if false.
 Thus $(T(y))_i=1\{y=i\}$
 This is shown to be a member of the exponential family, parametrized as:
 $\eta$ is a $k-1$ sized vector whose $i$ th element is
$\log(\phi_{i}/\phi_k)$
 $a(\eta)=-\log(\phi_k)$
 $b(y)=1$
 The link function is given by $\eta_i=\log\frac{\phi_i}{\phi_k}$ for
$i=1,\dots,k$ ($\eta_k$ trivially is 0). \q{Why?}
 Inverting it to get $g$, the response function,
$\phi_i=\frac{e^{\eta_i}}{\sum_{j=1}^ke^{\eta_j}}$.
 The denominator is merely a normalization constant. Its product with
$\phi_k$ gives 1.
 This function is called the *softmax* function.
 Apply the 3rd assumption of a GLM.
 Let $\theta_k=0$ for convenience. Note that
$\theta_i\in\mathds{R}^{n+1}$, with $i=1,\dots,k-1$. \q{Why?}
 This model is called the *softmax regression*
 The hypothesis outputs the estimated probability $p(y=i|x;\theta)$
for all $i$.

$l(\theta)=\sum_{i=1}^m\log\prod_{l=1}^k\left(\frac{e^{\theta_l^Tx^{(i)}}}{\sum_{j=1}^ke^{\theta_j^Tx^{(i)}}}\right)^{1\{y^{(i)}=l\}}$
is the log-likelihood function. Maximize it over $\theta$ using some
technique (Newton's, etc). \q{Verify if I ever use it.}
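 A sketch of the response function itself (my code; subtracting the max before exponentiating is a common numerical-stability trick, and it cancels in the ratio):

```python
from math import exp

def softmax(etas):
    """Softmax response function: phi_i = e^{eta_i} / sum_j e^{eta_j}."""
    m = max(etas)
    exps = [exp(e - m) for e in etas]
    s = sum(exps)
    return [v / s for v in exps]
```

 The outputs are nonnegative and sum to 1, so they can be read as the $k$ class probabilities.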
* The Problem of Overfitting
 When you have a poor fit (doesn't match many points well), it is
called *underfitting* or *high bias*. In a sense you're saying that
the hypothesis function is biased despite the evidence to the
contrary.
 When you have a fit that matches the training data really well but
performs poorly otherwise, you are *overfitting*also known as *high
variance*.
 Overfitting is essentially "memorizing" your training data. You don't
really learn much about the model!
** Addressing overfitting
1. Reduce number of features:
 Manually select features.
 Model selection algorithms (it'll choose for you what features to
keep).
 This entails potentially losing valuable information.
 On the flip side, too many features require a larger data set?
2. Regularization:
 Keep all the features but reduce magnitudes of $\theta_j$.
 Works well when you have a lot of features, each of which
contributes a bit to predicting $y$.
** Regularization
 Choose small values for $\theta_i$.
 You get a simpler hypothesis. \q{Why?}
 Get a "smoother" function. \q{Why?}
 Less prone to overfitting. \q{Why?}
 To force smaller $\theta_i$, modify the cost function:
 $J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^{n}\theta_j^2\right]$
 Note that the sum over the features starts from $j=1$. This is by
convention: $\theta_0$ is not shrunk. In practice, it makes little
difference.
 $\lambda$ is called the *regularization parameter* and the new sum
is called the *regularization term*.
 $\lambda$ controls the tradeoff between fitting the values and keeping
the hypothesis simple.
 \q{Too large a $\lambda$ will result in an almost straight horizontal
line?}
** Regularized Linear Regression
 The new update rule for gradient descent becomes:

$\theta_j^{k+1}=\theta_{j}^k\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_{j}^{(i)}$
for $j\ne0$.
 For $j=0$, the update rule is the same as before.
\q{Verify!}
 The normal equation solution is:
 $\Theta=(X^{T}X+\lambda K)^{-1}X^{T}\mathbf{y}$ where $K$ is like the
identity matrix but with its first element as 0. \q{Verify}. \q{Put
K explicitly!}
 If $\lambda>0$, then the matrix you are inverting will _not_ be
singular! \q{Not proven}
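 A univariate sketch of the regularized normal equation (my code; here $K=\mathrm{diag}(0,1)$, so the intercept is not shrunk, and the $2\times2$ system is solved in closed form):

```python
def ridge_fit(xs, ys, lam):
    """Regularized normal equation for h(x) = t0 + t1*x:
    solves (X^T X + lam*K) theta = X^T y with K = diag(0, 1)."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[m, sx], [sx, sxx]]; lam is added only to the slope entry
    a, b, c, d = m, sx, sx, sxx + lam
    det = a * d - b * c
    t0 = (d * sy - b * sxy) / det
    t1 = (a * sxy - c * sy) / det
    return t0, t1
```

 With $\lambda=0$ this is ordinary least squares; increasing $\lambda$ shrinks the slope $t_1$ while leaving $t_0$ free.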
** Regularized Logistic Regression
 The update rule is the same as for gradient descent, but with the
logistic hypothesis.
* Generative Learning Algorithms
 Thus far the content has been about modeling $p(y|x;\theta)$. For
example in classification, we give it a training set, and it tries to
come up with a boundary between the two categories. When trying to
classify a new item, it will check which end of the boundary the new
element is.
 A different approach is to build a model of each category
(\q{independent of other categories?}) Then when a new item
comes along, we see which model fits it better.
 *Discriminative* learning algorithms are ones that try to learn
$p(y|x)$ directly.
 *Generative* algorithms try to model $p(x|y)$ and $p(y)$.
 What is the probability of getting this item assuming it is of a
certain category?
 $p(y)$ is called the *class prior*.
 Then use Bayes' Rule to get
$p(y\mid x)=\frac{p(x\mid y)p(y)}{p(x)}$. Note that
$p(x)=p(x\mid y=0)p(y=0)+p(x\mid y=1)p(y=1)$.
 Of course, this makes sense only if $x$ cannot be in multiple
categories!
 As $p(x)$ is a fixed number, we don't really need it. Just find the
$y$ that maximizes the numerator.
** The Multivariate Normal Distribution
 $\mu\in\mathds{R}^n$
 *Covariance matrix*: $\Sigma\in\mathds{R}^{n\times n}$. $\Sigma\ge
0$. It is symmetric and positive semidefinite. \q{Verify the
mean.}
 $p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
 $|A|$ is the determinant of $A$.
 *Covariance* of a random variable $Z$ (vector valued) is defined by
$\mathrm{Cov}(Z)=E[(Z-E[Z])(Z-E[Z])^T]=E[ZZ^T]-(E[Z])(E[Z])^T$. \q{Show
the second equality from the first.}
 It is a generalization of the variance.
 *Standard normal distribution*: $\mu=\mathbf{0},\Sigma=I$
** Gaussian Discriminant Analysis
*** The Gaussian Discriminant Analysis Model
 If the input features $x$ are continuous valued random variables, then
we can use the *Gaussian Discriminant Analysis* (GDA) model:
 $y$ is $\mathrm{Bernoulli}(\phi)$. So $p(y)=\phi^y(1-\phi)^{1-y}$
 $x\mid y=0$ is $\mathcal{N}(\mu_0,\Sigma)$
 $x\mid y=1$ is $\mathcal{N}(\mu_1,\Sigma)$
 Note that $\Sigma$ is the same in both cases! \q{This is a
choice?}
 By maximizing the log-likelihood function, we get: (\q{verify!})
 $\phi=\frac{1}{m}\sum_{i=1}^m1\{y^{(i)}=1\}$
 $\mu_0=\frac{\sum_{i=1}^m1\{y^{(i)}=0\}x^{(i)}}{\sum_{i=1}^m1\{y^{(i)}=0\}}$
 $\mu_1=\frac{\sum_{i=1}^m1\{y^{(i)}=1\}x^{(i)}}{\sum_{i=1}^m1\{y^{(i)}=1\}}$
 $\Sigma=\frac{1}{m}\sum_{i=1}^m\left(x^{(i)}-\mu_{y^{(i)}}\right)\left(x^{(i)}-\mu_{y^{(i)}}\right)^{T}$
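A minimal sketch of these estimates in NumPy (the function name and test data are mine):

```python
import numpy as np

def gda_fit(X, y):
    """Maximum-likelihood estimates for GDA with a shared covariance."""
    m = len(y)
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Center each example at its own class mean, then average outer products.
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma
```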
*** GDA and Logistic Regression
 If you look at $p(y=1\mid\mathbf{x};\phi,\mu_0,\mu_{1},\Sigma)$ as a function of
$x$, you'll get
$p(y=1\mid\mathbf{x};\phi,\Sigma,\mu_0,\mu_1)=\frac{1}{1+\exp(-\Theta^T\mathbf{x})}$
 \q{Verify!}
 $\Theta$ is some function of $\phi,\Sigma,\mu_0,\mu_1$
 Note that this matches what we had earlier.
 GDA and logistic regression will give different boundaries on the same
data set.
 This is because while GDA reduces to logistic regression in the
case of a multivariate Gaussian with common $\Sigma$, the converse
is not true: $p(y\mid x)$ being logistic does not imply $p(x\mid y)$ is
multivariate Gaussian.
 Thus GDA makes stronger assumptions (e.g. $p(x\mid y)$ is Gaussian or
near it).
 GDA does better when these assumptions are valid.
 If the assumptions are correct exactly (i.e. really a Gaussian),
then GDA is *asymptotically efficient*. In the limit of large
training sets, no algorithm will beat the GDA.
 Logistic regression is more robust; it is less sensitive to errors in
your assumptions about the model.
 Hence it is used more frequently.
 \q{Similar discussion on Naive Bayes}
 On the flip side, because the assumptions GDA makes are stronger, you
need fewer training examples. The model is exploiting the extra
information it has (i.e. the assumptions).
** Naive Bayes
 Assume $x_i$'s are discrete.
*** Spam Filtering
 Example of email spam classification. Given an email, let $\mathbf{x}$
be the vector such that $x_i=1$ if and only if the $i$ th word in the
dictionary appears in the email.
 The set of all known words is the *vocabulary*.
 So $\mathbf{x}$ is a huge, mostly sparse vector.
 In practice they don't use a dictionary, but use the union of the
words that appear in the training set. This has the benefit of
including words not in the dictionary.
 Sometimes some high frequency words ("and", "the") are excluded as
they rarely help in determining the spam status. These are called
*stop words*.
 The dictionary is large, and if we use a multinomial distribution
(\q{Why would we do that?}) over all possible vectors, we would have a
$2^N$-dimensional outcome space where $N$ is the number of words in
the dictionary. Need to reduce the space.
 An assumption is made that the $x_i$'s are conditionally independent
/given/ $y$. This means that $p(x_i\mid y)=p(x_i\mid y,x_j)$.
 This does *not* mean that $x_i$ and $x_j$ are independent!
 This is called the *Naive Bayes (NB) assumption* and the algorithm
is called the *Naive Bayes Classifier*.
 Then $p(x_1,\dots,x_n\mid y)=\prod_{i=1}^np(x_i\mid y)$
 _Parametrization_: Let $\phi_{i\mid y=1}$ be the probability
$p(x_i=1\mid y=1)$: the probability that given the message is spam, the
$i$th word was present. Let $\phi_y=p(y=1)$.
 Using the maximum likelihood method, we get (\q{Show!}):
 $\phi_{j\mid y=1}=\frac{\sum_{i=1}^m1\{x_j^{(i)}=1\wedge y^{(i)}=1\}}{\sum_{i=1}^m1\{y^{(i)}=1\}}$
 $\phi_{j\mid y=0}=\frac{\sum_{i=1}^m1\{x_j^{(i)}=1\wedge y^{(i)}=0\}}{\sum_{i=1}^m1\{y^{(i)}=0\}}$
 $\phi_y=\frac{\sum_{i=1}^m1\{y^{(i)}=1\}}{m}$
 These are just the empirical frequencies of the corresponding events.
 Then to determine if a new message is spam:
 $p(y=1\mid x)=\frac{p(x\mid y=1)p(y=1)}{p(x)}$
 Note: $p(x\mid y=1)=\prod_{i=1}^np(x_i\mid y=1)$ and
$p(x)=p(x\mid y=1)p(y=1)+p(x\mid y=0)p(y=0)$.
 Can generalize this to cases where the value of $x_i$ is one of $k$
choices (we had binary above). Instead of using a Bernoulli for
$p(x_i\mid y)$, we use a multinomial.
 If our data set were continuous, just discretize it to make it
multinomial.
 Discretized Naive Bayes often works well when GDA doesn't (i.e. when
the continuous distribution does not follow a multivariate normal;
just discretize and apply NB).
 The Naive Bayes as presented in this section is called the
*multivariate Bernoulli event model*.
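A sketch of the model above as fit-and-predict functions (names and test data are mine; no smoothing yet, which the next section adds):

```python
import numpy as np

def nb_fit(X, y):
    """MLE parameters for multivariate Bernoulli Naive Bayes."""
    phi_y = np.mean(y == 1)
    phi_j1 = X[y == 1].mean(axis=0)  # p(x_j = 1 | y = 1)
    phi_j0 = X[y == 0].mean(axis=0)  # p(x_j = 1 | y = 0)
    return phi_y, phi_j1, phi_j0

def nb_posterior(x, phi_y, phi_j1, phi_j0):
    """p(y = 1 | x) via Bayes' rule under the NB independence assumption."""
    p_x_y1 = np.prod(np.where(x == 1, phi_j1, 1 - phi_j1))
    p_x_y0 = np.prod(np.where(x == 1, phi_j0, 1 - phi_j0))
    num = p_x_y1 * phi_y
    return num / (num + p_x_y0 * (1 - phi_y))
```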
*** Laplace Smoothing
 The problem with the spam filtering approach just described is that if
you now receive an email with a new word, you'll have
$\phi_{j\mid y=1}=0=\phi_{j\mid y=0}$ where $j$ represents our new word.
 To fix this, change the estimate to
$\phi_{j\mid y=1}=\frac{1+\sum_{i=1}^m1\{x_j^{(i)}=1\wedge
y^{(i)}=1\}}{k+\sum_{i=1}^m1\{y^{(i)}=1\}}$
 $k$ is the number of allowed values ($k=2$ here, since each $x_j$ is
binary).
 In general, we have $\phi_j=\frac{\sum_{i=1}^m1\{z^{(i)}=j\}}{m}$ when
the observations are independent and $j$ is one of the $k$ allowed
values. The *Laplace smoothing* is where you add 1 to the numerator
and $k$ to the denominator.
 Note that this still sums to 1! \q{Verify}
 Under certain conditions, Laplace smoothing gives the optimal
estimator. \q{What does this even mean?}
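As a sketch, for observations $z^{(i)}\in\{1,\dots,k\}$ (the function name is mine):

```python
import numpy as np

def laplace_estimate(z, k):
    """Smoothed estimate of p(z = j) for j = 1..k: add 1 to each count
    and k to the total, so unseen values get nonzero probability."""
    z = np.asarray(z)
    counts = np.array([(z == j).sum() for j in range(1, k + 1)])
    return (counts + 1) / (len(z) + k)
```

The result still sums to 1, since the numerators gain $k$ in total and the denominator gains $k$.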
*** Event Models for Text Classification
 Another way to do spam filtering is called the *multinomial event model*.
 Let $x_i$ represent the $i$th word in the email. The possible values
for $x_i$ are $\{1,2,\dots,V\}$ where $V$ is the size of the
vocabulary.
 So an email of $n$ words is represented by a vector of size $n$.
 $n$ is no longer fixed across emails.
 Define $\phi_y=p(y),\ \phi_{i\mid y=1}=p(x_j=i\mid y=1),\ \phi_{i\mid y=0}=p(x_j=i\mid y=0)$.
 Note that we assume $p(x_j\mid y)$ is the same for all values of
$j$: the likelihood of a word being present is invariant to its
position in the text.
 Training set: $x^{(i)}$ is now the vector whose elements are
$x_k^{(i)}$.
 Using likelihood techniques, we have (\q{Verify!}):

$\phi_{k\mid y=1}=\frac{1+\sum_{i=1}^m\sum_{j=1}^{n_i}1\{x_j^{(i)}=k\wedge
y^{(i)}=1\}}{V+\sum_{i=1}^m1\{y^{(i)}=1\}n_i}$

$\phi_{k\mid y=0}=\frac{1+\sum_{i=1}^m\sum_{j=1}^{n_i}1\{x_j^{(i)}=k\wedge
y^{(i)}=0\}}{V+\sum_{i=1}^m1\{y^{(i)}=0\}n_i}$
 Note that I've applied Laplace smoothing.
 Andrew claims the multinomial event model works better than the
multivariate Bernoulli event model.
* Support Vector Machines
Andrew claims many consider it the best supervised learning algorithm.
** Notation
 For convenience, let the negative class be $y=-1$ instead of $y=0$.
 Instead of using $\Theta$, we'll parametrize using $\mathbf{w}$ and $b$
s.t. $h_{\mathbf{w},b}(x)=g(\mathbf{w}^T\mathbf{x}+b)$
 Note that $w$ is a vector and $b$ is a scalar.
 We no longer need $x_0=1$; $b$ fills the intercept term role now.
 No longer endow a probability interpretation to this (\q{Why
not?}). We say that $g(z)=1$ if $z\ge 0$ and $-1$ otherwise (not a
continuous function)
** Functional and Geometric Margins
 Given a training example, the *functional margin* of $(w,b)$ with
respect to the training example is given by
$\hat{\gamma}^{(i)}=y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)}+b)$.
 If the functional margin is positive, we predicted correctly.
 For the functional margin to be large, we need a high
$\mathbf{w}^T\mathbf{x}^{(i)}+b$ (positive for positive class, and
negative for negative class).
 A large functional margin implies a correct and confident prediction.
 Note that we can artificially make the functional margin large by
scaling $\mathbf{w}$ and $b$. So we may want to enforce a
normalization constraint.
 The functional margin is the distance from a training point to the
separating hyperplane. \q{I don't think so; true for the geometric
margin!}
 Can also define a functional margin with respect to the training set
$S$: $\hat{\gamma}=\min_{i=1,\dots,m}\hat{\gamma}^{(i)}$
 This just means it is the "worst case" functional margin.
 Note that $\mathbf{w}$ will always be orthogonal to the separating
hyperplane.
 Let the *geometric margin* be
$\gamma^{(i)}=y^{(i)}\left(\left(\frac{\mathbf{w}}{\|\mathbf{w}\|}\right)^T\mathbf{x}^{(i)}+\frac{b}{\|\mathbf{w}\|}\right)$
 When $y^{(i)}=1$, then this represents the distance of
$\mathbf{x}^{(i)}$ to the decision boundary.
 $\gamma^{(i)}=\hat{\gamma}^{(i)}$ when $\|\mathbf{w}\|=1$ (normalized).
 Hence the geometric margin is invariant to scaling.
 This property is utilized by allowing us to scale $\mathbf{w}$ any
way that is convenient for us without impacting the outcome. For
example, we can freely impose the constraint $|w_1+b|+|w_2|=4$
 Can also define a geometric margin with respect to the training set
$S$: $\gamma=\min_{i=1,\dots,m}\gamma^{(i)}$ like we did for the
functional margin.
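A short sketch of the worst-case geometric margin over a labeled set (names and data are mine), which also illustrates the scale invariance:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """min_i y^(i) ((w/||w||) . x^(i) + b/||w||) over the set."""
    w = np.asarray(w, dtype=float)
    return np.min(y * (X @ w + b) / np.linalg.norm(w))
```

Scaling $(\mathbf{w},b)$ by any positive constant leaves the result unchanged.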
** The Optimal Margin Classifier
 The goal is to maximize the geometric margin.
 Assume we have a linearly separable data set. This means that a
hyperplane exists that can separate _all_ the training points.
 The optimal margin classifier will fail for data that is not
linearly separable.
 We want to find $\gamma,\mathbf{w},b$ such that for all $i=1,\dots,m$,
$y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)}+b)\ge\gamma$, and also keep
$\|\mathbf{w}\|=1$.
 We want to find the largest possible $\gamma$.
 In words, this means that each training example has a
geometric/functional margin greater than $\gamma$, and we want to
maximize that $\gamma$.
 The problem is that $\|\mathbf{w}\|=1$ is a nonconvex constraint.
 Not a general problem you can plug into an optimization routine.
 We could try replacing $\gamma$ with $\hat{\gamma}$ and maximizing
$\frac{\hat{\gamma}}{\|\mathbf{w}\|}$, but our objective function is now
nonconvex.
 To get around this, utilize the freedom to scale $(\mathbf{w},b)$
without changing the geometric margin.
 Constrain $\hat{\gamma}=1$. This finally will lead to optimizing
$\min_{\gamma,\mathbf{w},b}\frac{1}{2}\|\mathbf{w}\|^2$ such that
$y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)}+b)\ge 1\ \forall i$
 This has a convex quadratic objective, with linear constraints.
 The solution is called the *optimal margin classifier*.
 Can use a generic quadratic programming (QP) optimizer.
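As a sketch, the primal problem can be handed to a generic constrained optimizer (here SciPy's SLSQP; the toy data is mine):

```python
import numpy as np
from scipy.optimize import minimize

# Linearly separable toy data in R^2, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):           # p = [w1, w2, b]
    return 0.5 * np.dot(p[:2], p[:2])

# One inequality constraint per example: y_i (w . x_i + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
```

A dedicated QP solver would be used in practice; this just shows the shape of the problem.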
** Lagrange Duality
 Optimizing using Lagrange multipliers works when the constraints
involve equalities. We need something that can handle inequalities.
 *Primal* optimization problem:
$\min_w f(w)$ such that $g_i(w)\le0, i=1,\dots,k$ and $h_i(w)=0,\
i=1,\dots,l$.
 Define the *generalized Lagrangian*:
$\mathcal{L}(w,\alpha,\beta)=f(w)+\sum_{i=1}^k\alpha_ig_i(w)+\sum_{i=1}^l\beta_ih_i(w)$
 $\alpha_i,\beta_i$ are the Lagrange multipliers.
 Consider the quantity
$\theta_{\mathcal{P}}(w)=\max_{\alpha,\beta:\alpha_i\ge0}\mathcal{L}(w,\alpha,\beta)$.
 If a $w$ violates the primal constraints (e.g. $h_i(w)\ne0$ for some
$i$), then $\theta_{\mathcal{P}}(w)=\infty$
 If $w$ satisfies the constraints, then $\theta_{\mathcal{P}}(w)=f(w)$
 Try to get $\min_w\theta_{\mathcal{P}}(w)=\min_w\max_{\alpha,\beta:\alpha_i\ge0}\mathcal{L}(w,\alpha,\beta)$
 This is identical to the original primal problem.
 The optimal value of the objective is denoted by $p^{\ast}$. It is
called the *value* of the primal problem.
 The *dual* optimization problem:
 Define $\theta_{\mathcal{D}}(\alpha,\beta)=\min_w\mathcal{L}(w,\alpha,\beta)$
 Want to get $\max_{\alpha,\beta:\alpha_i\ge0}\theta_{\mathcal{D}}(\alpha,\beta)$
 This is similar to the primal problem except the order of the min
and the max's have been reversed.
 Denote the result with $d^{\ast}$; it is called the *value* as
well.
 Note that $d^{\ast}\le p^{\ast}$
 This follows from the general principle that the "max min" of a
function is never greater than the "min max".
 Under certain conditions we'll have $d^{\ast}=p^{\ast}$. So we can
solve the dual problem instead. The (sufficient) conditions for this
to occur are:
 Let $f$ and the $g_i$'s be convex.
 If $f$ has a Hessian, it is convex iff the Hessian is positive
semidefinite. \q{Verify.}
 All linear and affine functions are convex. \q{Verify.}
 Let the $h_i$'s be affine.
 Affine means that there exists $a_i,b_i$ such that
$h_i(\mathbf{w})=a_i^T\mathbf{w}+b_i$
 Affine really means linear but allowing for an intercept term.
 Assume the $g_i$ are strictly feasible.
 Strictly feasible means that there exists some $w$ so that $g_i(w)<0\
\forall i$.
 Then there exists $w^{\ast},\alpha^{\ast},\beta^{\ast}$ such that
$w^{\ast}$ is the solution to the primal problem,
$\alpha^{\ast},\beta^{\ast}$ are the solution to the dual problem,
and
$p^{\ast}=d^{\ast}=\mathcal{L}(w^{\ast},\alpha^{\ast},\beta^{\ast})$. \q{Stated
without proof.}
 The *KarushKuhnTucker (KKT) conditions* are satisfied: (\q{Stated
without proof})
 $\frac{\partial}{\partial
w_i}\mathcal{L}(w^{\ast},\alpha^{\ast},\beta^{\ast})=0,\ i=1,\dots,n$
 $\frac{\partial}{\partial
\beta_i}\mathcal{L}(w^{\ast},\alpha^{\ast},\beta^{\ast})=0,\ i=1,\dots,l$
 $\alpha_i^{\ast}g_i(w^{\ast})=0,\ i=1,\dots,k$ (The KKT *dual
complementarity condition*)
 $g_i(w^{\ast})\le0,\ i=1,\dots,k$
 $\alpha_i^{\ast}\ge0,\ i=1,\dots,k$
 If any $w^{\ast},\alpha^{\ast},\beta^{\ast}$ satisfy the KKT
conditions, then they are a solution to the primal and dual
problems.
 The dual complementarity condition shows that if $\alpha_i^{\ast}>0$,
then $g_i(w^{\ast})=0$.
** Optimal Margin Classifiers
 The primal problem for the optimal margin classifier has a constraint
that can be written as
$g_i(\mathbf{w})=1-y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)}+b)\le0$
 From KKT dual complementarity, we'll have $\alpha_i>0$ only when
$g_i(\mathbf{w})=0$
 There will be only a few such points. They are called the *support
vectors*.
 Note that these support vectors have a functional margin
of 1. They're the ones closest to the separating hyperplane.
 Setting it up as a Lagrange multiplier problem, can look at the dual
form. All conditions required for $p^{\ast}=d^{\ast}$ are met, and so
the KKT conditions hold. So solve the dual problem instead.
 The optimal value for $\mathbf{w}^{\ast}$ is
$\sum_{i=1}^m\alpha_iy^{(i)}\mathbf{x}^{(i)}$
 We also will have the constraint that $\sum_{i=1}^m\alpha_iy^{(i)}=0$
(in the optimal case).
 The optimal value for the intercept term is:
 $b^{\ast}=-\frac{1}{2}\left(\max_{i:y^{(i)}=-1}\mathbf{w}^{\ast
T}\mathbf{x}^{(i)}+\min_{i:y^{(i)}=1}\mathbf{w}^{\ast
T}\mathbf{x}^{(i)}\right)$ \q{Verify!}
 This should be easy to derive. $\mathbf{w}^{\ast}$ dictates the
normal vector of your plane. $b$ is just the intercept. Sweep it to
find the maximum.
 To make a prediction at a new input point $\mathbf{x}$:
 Can use $\mathbf{w}^{T}\mathbf{x}+b$ and pick 1 or $-1$ accordingly.
 Look at $b+\sum_{i=1}^m\alpha_iy^{(i)}\langle\mathbf{x}^{(i)},\mathbf{x}\rangle$
(and see if it is positive or negative).
 Note that this is straightforward as only the support vectors have
nonzero $\alpha_i$'s
 \q{Is any of this valid if the data is /not/ linearly separable?}
** Kernels
 It had been mentioned in linear regression that given a feature $x$,
we can make new features from it (e.g. $x^2$).
 New terminology: Let the "base" feature $x$ be called an *attribute*,
and all features derived from it (including $x$, if used) are the
*features*.
 Let $\phi$ denote the *feature mapping*
(e.g. $\phi(x)=\begin{pmatrix}x\\x^2 \end{pmatrix}$)
 Given a $\phi$, the corresponding *kernel* is
$K(x,z)=\phi(x)^T\phi(z)$.
 This is an extension of the inner product $\langle x,z\rangle$.
 Not all kernels $K(x,z)$ can be written in the above form. I think we
need it to be writable in that form if we want to use them for SVM's.
 Computing the kernel can be very fast if you /don't/ compute the
individual $\phi(x),\phi(z)$ (e.g. $O(n)$ vs $O(n^2)$).
 \q{Or so Andrew claims. I see it in his examples, but I don't know
if it's a general truism}.
 *Gaussian kernel*:
$K(x,z)=\exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)$
 It is a valid kernel for an SVM (i.e. it can be written as
$\phi(x)^T\phi(z)$).
 Assume $K$ is "valid". Then given $m$ points $\{x^{(1)},\dots,x^{(m)}\}$,
denote by $K$ the $m\times m$ matrix such that
$K_{ij}=K(x^{(i)}, x^{(j)})$. It is called the *Kernel matrix*.
 For a valid kernel, $K$ is a symmetric, positive semidefinite matrix.
 It is also a sufficiency criterion for validity. \q{Not shown.}
 *Mercer's Theorem* (for $\mathds{R}^n$): Let
$K:\mathds{R}^{n}\times\mathds{R}^{n}\mapsto\mathds{R}$. Then for $K$
to be valid, it is necessary and sufficient that for all
$\{x^{(1)},\dots,x^{(m)}\}$ where $m$ is finite, the corresponding
kernel matrix is symmetric positive semidefinite. \q{I think the
phrasing is "for all" and not "there exists".}
 Here $m$ is any number; it has no bearing on the training set.
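A quick numerical check of this property for the Gaussian kernel (points and $\sigma$ chosen arbitrarily):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

pts = np.array([[0.0], [1.0], [3.0]])  # any finite set of points
K = np.array([[gaussian_kernel(a, b) for b in pts] for a in pts])

symmetric = np.allclose(K, K.T)
eigs = np.linalg.eigvalsh(K)  # all eigenvalues should be >= 0
```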
 Kernels can be used for things other than SVM's. In general, if you
have any machine learning algorithm that can be written in terms of
$\langle x,z\rangle$, then you can replace it with $K(x,z)$ and get the
effect of a higher dimensional feature space cheaply. Generally
referred to as the *kernel trick*.
 So you can use this trick even for linear regression, logistic
regression, perceptron, etc.
 Alternatively, think of it as a "free" way of going to higher
dimensions. Your problem may not be linearly separable in your
initial space, but it may be so in the higher dimensional one.
 Although you often won't know in advance if it will be separable in
the higher dimensional space.
** Regularization and the Non-separable Case
 As a general rule of thumb, the higher the number of dimensions
(e.g. mapped using $\phi$), the more likely the data will be
separable. (It makes sense!). But you shouldn't assume it's the case a
priori.
 A separating hyperplane is susceptible to outliers. In some cases
you'd want to get those outliers wrong.
 $l_1$ *regularization*: $\min_{\gamma,\mathbf{w},b}\frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^m\xi_i$ such that
$y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)}+b)\ge 1-\xi_{i}\ \forall i$ and
$\xi_i\ge0\ \forall i$
 Note that if $\xi_i>1$, then for that point, your classification will
be _incorrect_ (you need it to be positive for correctness).
 Hence this algorithm is allowing for misclassifications, and I
suppose this is how nonseparability ceases to be a factor.
 This makes one less sensitive to outliers. \q{Why?}
 Can now have functional margins less than 1.
 Writing the Lagrangian and going through the same procedure as before,
we get the same form for $\mathbf{w}$ in terms of $\alpha_i$ as before
and the same equation with inner products for making predictions to
new data. \q{I did not rederive.}
 The only difference is $0\le\alpha_i\le C$ \q{Not verified.}
 $b^{\ast}$ is now different.
 Dual complementarity conditions are now (\q{Not verified}):
 $\alpha_i=0\implies
y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)}+b\right)\ge1$
 $\alpha_i=C\implies
y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)}+b\right)\le1$
 $0<\alpha_i<C\implies
y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)}+b\right)=1$
 The dual problem is to maximize $W(\alpha)$
such that $0\le\alpha_i\le C,i=1,\dots,m$ and $\sum_{i=1}^m\alpha_iy^{(i)}=0$
 Can't directly use coordinate ascent because one of the constraints
would be violated. Fixing all but one $\alpha$ fixes that one as well.
 So we need to update two $\alpha$'s at a time.
 The SMO algorithm:
1. Select $\alpha_i,\alpha_j$ to update next.
2. Reoptimize $W(\alpha)$ with respect to $\alpha_i,\alpha_j$ holding
all others constant.
3. Repeat until convergence.
 Test for convergence can be done by seeing if KKT conditions are
satisfied to within some tolerance.
 Typically the tolerance is 0.001 to 0.01.
 The update can be done efficiently.
 \q{Read his notes and add the details here; may need to refer to
Platt's paper!}
** SVM & Kernel Method Examples
1. Say you want to recognize a digit; you scan it into an $n\times n$
grid. Then using a kernel like
$K(\mathbf{x},\mathbf{y})=(\mathbf{x}^T\mathbf{y})^d$ or the Gaussian
kernel, you can do as well as the best neural networks.
 This occurs in a $\mathds{R}^{n^2}$ space, but kernel methods
reduce the complexity.
 You don't need to know which pixel is adjacent to which one.
2. A typical protein sequence can be represented by
BADEMDIMENGNIQ... etc. Each letter represents an amino acid. The
sequence length is variable. Need to categorize a protein into one of
a number of classes.
 Represent $\phi(\mathbf{x})$ by looking at all possible 4-letter
sequences and counting how many times each of them appears in the given
sequence.
 So you end up with a vector of dimension $26^4$.
 There are clever ways using dynamic programming for taking the
scalar product, so the algorithm is fast.
* Support Vector Machines: Coursera Treatment
 Let the classes be 0,1 as we originally had them.
 The cost function is now:
 $C\sum_{i=1}^m\left[y^{(i)}cost_1(\Theta^T\mathbf{x}^{(i)})+(1-y^{(i)})cost_0(\Theta^T\mathbf{x}^{(i)})\right]+\frac{1}{2}\sum_{j=1}^n\theta_j^2$
 By convention we don't have the $m$ term; it's a scaling factor anyway.
 $C$ is essentially $1/\lambda$; another convention.
 The cost functions are essentially 0 when the argument is greater
than 1 (for the positive class) and increase linearly as you decrease
the argument. For the negative class it is just the reflection about
the vertical axis; the "corner" point is now at $-1$.
 \q{Draw cost functions!}
 The cost functions are a crude approximation of the one you have for
logistic regression.
 If $y=1$, we want $\Theta^T\mathbf{x}\ge1$ (and not just 0).
 If $y=0$, we want $\Theta^T\mathbf{x}\le-1$ (and not just 0).
 Hypothesis: If $\Theta^T\mathbf{x}\ge0$, predict 1. Else predict 0. It
is now very binary (not a probability).
 So note that even if you predicted correctly, you may still incur a cost.
 \q{Not clear how this formalism ties to the one in the previous
chapter. Here we're allowing $\Theta$ to vary in length; this is
essentially not constraining $\|\mathbf{w}\|$ to be 1.}
 Assume we have a separating hyperplane. We're minimizing $\Theta$,
which will in effect move/rotate the separating hyperplane to an ideal
point.
 This works because if you have a poor hyperplane, then
$\Theta^T\mathbf{x}^{(i)}$ may be small for a given point, and you'd
need a large $\Theta$ to compensate for it to make
$\Theta^T\mathbf{x}^{(i)}\ge1$ (for positive class) so that you have
no contribution to the cost function by the first sum.
** Kernels
 Define $\mathbf{l}^{(i)}=\mathbf{x}^{(i)}$ for $i=1,\dots,m$.
 Note that using all $m$ training inputs is merely one way to do it.
 Given $\mathbf{x}$, Define $f_i=K(\mathbf{x},\mathbf{l}^{(i)})$ where
$K$ is a kernel as discussed in the previous chapter. It could be the
Gaussian kernel, for example.
 Consider this a mapping of $\mathbf{x}^{(i)}\mapsto \mathbf{f}^{(i)}$
where $\mathbf{f}$ is the vector of all $f_j$ associated with
$\mathbf{x}^{(i)}$. Its dimension is $m$ or $m+1$.
 Clearly $f^{(i)}_i=1$ for the Gaussian kernel.
 Can also define $f_0^{(i)}=1$
 To make a hypothesis, evaluate $\Theta^T\mathbf{f}$ and see if it's
greater than 0.
 So we've transformed the "base" features $\mathbf{x}$ into $m$ new
features.
 To train, minimize the cost function as before, but use $\Theta^T\mathbf{f}^{(i)}$
 In many/most implementations, for regularization, instead of
minimizing $\|\Theta\|^2$, they minimize $\Theta^TM\Theta$ where $M$
is a matrix tied to your choice of kernel.
 This is done for computational efficiency to offset the fact that
your $\Theta$ is large ($m$ entries).
 Andrew suggests not writing code to minimize the cost function but
instead using an off the shelf SVM code.
 Examples are liblinear and libsvm.
 To pick $\sigma$:
 A large $\sigma$ causes $f_i$ to vary smoothly. Leads to a higher
bias and lower variance.
** Choice of Kernel
 Linear kernel:
 The same as no kernel. Use this if your $n$ is large and $m$ is
small.
 Gaussian kernel:
 Use if $n$ is small and $m$ is large.
 Perform feature scaling /before/ using the Gaussian kernel.
 Polynomial kernel: $K(\mathbf{x},
\mathbf{l})=(\mathbf{x}^T\mathbf{l}+a)^{d}$
 Usually used when all $x_i, l_i$ are nonnegative.
 Other kernels:
 String kernel
 Chisquare kernel
 Histogram Intersection kernel.
 Always try linear/Gaussian kernel first!
 If you implement your own kernel, you /must/ satisfy Mercer's
Theorem. Most SVM algorithms will rely on this.
** Logistic Regression vs SVM
 If $n\ge m$, use logistic regression or a linear kernel.
 If $n$ is 1 to 10,000 and $m$ is 10 to 10,000, use a Gaussian kernel.
 If $n$ is 1 to 1,000 and $m\ge50000$, then create/add more features and then use
logistic regression or a linear kernel. A Gaussian kernel may be very
slow to run for large $m$.
 Neural network likely to work well, but slower to train.
* Neural Networks
 Say you have $n$ features. The way to form a neural network is to pass
all $n$ features into $k$ sigmoid functions, each of which outputs an
$a_i=g(\Theta_i^T\mathbf{x})$ where $\Theta_i$ is a vector of
dimension $n$ and is the set of parameters that correspond to the
$i$th sigmoid function. These $a_i$ in turn are passed to another
sigmoid function which outputs the final hypothesis
$h_{\Theta}(\mathbf{x})=g(\Theta^T\mathbf{a})$, where this $\Theta$ is
its own vector of $k+1$ weights.
 To train it, use the objective function
$J(\Theta)=\frac{1}{2}\sum_{i=1}^m\left(h_{\Theta}(\mathbf{x}^{(i)})y^{(i)}\right)^2$
 Can apply gradient descent. *Back propagation* is the formal term
used, but it's just gradient descent.
 Almost always a /non/convex optimization problem. Can get stuck in a
local optimum.
 Considered the most effective learning algorithm before SVM's (and
that claim is still contested).
 $x_0=1$ is called the *bias unit*. Not always drawn in the diagrams.
 *Activation function* is usually just another name for the sigmoid
function.
 *Weights* are just another way of saying "parameters".
 The first layer is called the *input layer*. The final layer is the
*output layer*. Intermediate layers are called *hidden layers*.
 $a_i^{(j)}$: Activation of unit $i$ in layer $j$.
 $\Theta^{(j)}$: Matrix of weights controlling function mapping from
layer $j$ to layer $j+1$.
 $\Theta_{mn}^{(j)}$ is the $n$th weight of the $m$th activation unit in
layer $j+1$.
 \q{Have a diagram of a neural network!}
 Let $z_i^{(j+1)}=\sum_{k=0}^{s_j}\Theta_{ik}^{(j)}a_k^{(j)}$
 Then $a_i^{(j)}=g(z_i^{(j)})$
 In vector notation, this becomes:
$\mathbf{z}^{(j+1)}=\Theta^{(j)}\mathbf{a}^{(j)}$ where by convention
we have $\mathbf{a}^{(1)}=\mathbf{x}$
 $\mathbf{a}^{(j)}=g(\mathbf{z}^{(j)})$ where $g$ is the sigmoid
function and this is merely shorthand notation for saying we apply $g$
to all the elements of $\mathbf{z}$
 This is *forward propagation*.
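A minimal sketch of forward propagation (the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """a^(1) = x; then repeatedly prepend the bias unit and apply
    a^(j+1) = g(Theta^(j) a^(j))."""
    a = np.asarray(x, dtype=float)
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))  # bias unit a_0 = 1
        a = sigmoid(Theta @ a)
    return a
```

For example, a single unit with weights $(-30, 20, 20)$ computes a close approximation of the logical AND of two binary inputs.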
 One benefit of neural networks over linear regression is that the
latter explodes combinatorially if you want all combinations of your
inputs (e.g. you have $k$ base features and you want all 4fold
products like $x_1x_3x_4x_6$ and $x_2^3x_5$)
** Multiclass Classification
 You now have multiple outputs, with your output being a vector
(e.g. [1 0 0 0] for class A, [0 1 0 0] for class B, etc). Thus your
training set label will also be a vector.
** Cost Function
 Let $L$ be the number of layers in the network.
 Let $s_l$ be the number of units in layer $l$, excluding the bias
unit.
 Let $k$ be the number of output units ($s_L$).
 Cost function for a neural network:
$$J(\Theta)=-\frac{1}{m}\left[\sum_{i=1}^m\sum_{k=1}^Ky_k^{(i)}\log(h_{\Theta}(x^{(i)}))_k+(1-y^{(i)}_k)\log(1-(h_{\Theta}(x^{(i)}))_k)\right]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2$$
 $h_{\Theta}(x)\in\mathds{R}^K$ where $K$ is the number of outputs.
 Note: No sum over the bias units for regularization. It doesn't
really make a big difference, though.
 \q{I'm being sloppy about vector notation.}
 The cost function is typically nonconvexin practice it's not a
serious problem, though.
** Backpropagation Algorithm
 Notation: Let $\delta_j^{(l)}$ be the "error" of node $j$ in layer $l$.
 For the output node, $\delta_j^{(L)}=a_j^{(L)}-y_j$. Can write it in
vector notation.
 \q{I'm being sloppy about vector notation}
 Then for the next to last layer,
$\mathbf{\delta}^{(L-1)}=\left(\left(\Theta^{(L-1)}\right)^T\mathbf{\delta}^{(L)}\right)\circ g'(\mathbf{z}^{(L-1)})$,
where $\circ$ is the elementwise product.
 $g'(\mathbf{z}^{(k)})=\mathbf{a}^{(k)}\circ(\mathbf{1}-\mathbf{a}^{(k)})$ \q{Verify!}
 No $\mathbf{\delta}^{(1)}$ term.
 Can show that if you ignore regularization:
 $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)=a_j^{(l)}\delta_i^{(l+1)}$ \q{Show it!}
 Algorithm:
 Set $\Delta_{ij}^{(l)}=0$ for all $i,j,l$. These will be a proxy for
the partial derivatives (gradient needed for optimization).
 For $i=1$ to $m$ (number of training examples):
 Set $\mathbf{a}^{(1)}=\mathbf{x}^{(i)}$
 Perform forward propagation to compute $\mathbf{a}^{(l)}$ for
$l=2,\dots,L$
 Using $y^{(i)}$, compute $\delta^{(L)}=\mathbf{a}^{(L)}-y^{(i)}$
 Compute $\delta^{(L-1)},\dots,\delta^{(2)}$
 Set
$\Delta_{ij}^{(l)}:=\Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}$
In matrix notation this is
$\Delta^{(l)}=\Delta^{(l)}+\delta^{(l+1)}(\mathbf{a}^{(l)})^T$
\q{Verify!}
 Compute
$D_{ij}^{(l)}:=\frac{1}{m}\Delta_{ij}^{(l)}+\frac{\lambda}{m}\Theta_{ij}^{(l)}$
if $j\ne0$ and $D_{ij}^{(l)}:=\frac{1}{m}\Delta_{ij}^{(l)}$ if
$j=0$. Recall that $j=0$ is a bias term.
 Can show that $D_{ij}^{(l)}$ is the partial derivative of the cost
function with respect to $\Theta_{ij}^{(l)}$ \q{Show!}
 So now you have the cost function and the derivative. Can use
conjugate gradient, etc.
*** Implementation Details
 As backpropagation has a lot of minute details you can screw up, use
gradient checking.
 Don't use a vector of 0's as your initial guess.
 Instead initialize $\Theta_{ij}^{(l)}$ to a random value between
$-\epsilon$ and $\epsilon$. This is just to ensure symmetry breaking.
 One strategy for picking $\epsilon$ is to relate it to the number of
units in the network.
 $\epsilon=\frac{\sqrt{6}}{\sqrt{L_{in}+L_{out}}}$
 $L_{in}=s_l$ and $L_{out}=s_{l+1}$ for $\Theta^{(l)}$
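A sketch of gradient checking for a one-hidden-layer network with the cross-entropy cost (all names and sizes are mine; one training example, no regularization): the backpropagated partial derivative should agree with a symmetric difference quotient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(T1, T2, x, y):
    """Cross-entropy cost for one example, one hidden layer."""
    a1 = np.concatenate(([1.0], x))
    a2 = np.concatenate(([1.0], sigmoid(T1 @ a1)))
    a3 = sigmoid(T2 @ a2)
    return -np.sum(y * np.log(a3) + (1 - y) * np.log(1 - a3))

def backprop(T1, T2, x, y):
    a1 = np.concatenate(([1.0], x))
    a2 = np.concatenate(([1.0], sigmoid(T1 @ a1)))
    a3 = sigmoid(T2 @ a2)
    d3 = a3 - y                        # delta^(3)
    d2 = (T2.T @ d3) * a2 * (1 - a2)   # delta^(2), bias entry included
    return np.outer(d2[1:], a1), np.outer(d3, a2)

rng = np.random.default_rng(0)
T1 = rng.uniform(-0.5, 0.5, (3, 3))  # 2 inputs + bias -> 3 hidden units
T2 = rng.uniform(-0.5, 0.5, (1, 4))  # 3 hidden + bias -> 1 output
x, y = np.array([0.3, -0.7]), np.array([1.0])

G1, G2 = backprop(T1, T2, x, y)

# Gradient check on one weight of T1 via a symmetric difference quotient.
eps, i, j = 1e-5, 1, 2
Tp, Tm = T1.copy(), T1.copy()
Tp[i, j] += eps
Tm[i, j] -= eps
numeric = (cost(Tp, T2, x, y) - cost(Tm, T2, x, y)) / (2 * eps)
```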
** Putting All The Pieces Together
 Pick an architecture.
 By default, choose 1 hidden layer. You can try cross-validation to
figure out how many layers to use.
 If you have more than 1 hidden layer, have the same number of hidden
units per layer.
 More units is better. Anywhere from the number of inputs to
2--4$\times$ that.
 Randomly initialize weights
 Implement forward propagation to calculate hypothesis.
 Compute cost function.
 Implement backward propagation.
 Use gradient checking to confirm your gradients.
 Use gradient descent or similar algorithm to minimize the cost
function.
 A larger neural network will result in overfitting and you'll have
to regularize. However, this usually works out better than a smaller
neural network.
* Learning Theory
 *Generalization error* of a hypothesis is its expected error on
examples not necessarily in the training set.
 *Bias* of a model is the expected generalization error even when fit
to a very large or infinite training set. I suppose this means it is
the error from the model, as opposed to from your data.
 \q{Definition of variance?}
 *Union Bound*: $P(\union_iA_i)\le\sum_iP(A_i)$ as long as the union is
of a countable number of sets.
 *Hoeffding Inequality*: Let $Z_1,\dots,Z_m$ be independent and
identically distributed random variables from a Bernoulli distribution
($P(Z_i=1)=\phi$). Let $\hat{\phi}=(1/m)\sum_{i=1}^mZ_i$ be the mean
of these random variables. Let $\gamma>0$. Then
$P(|\phi-\hat{\phi}|>\gamma)\le2\exp(-2\gamma^2m)$ \q{Not proven.}
 Also called the *Chernoff bound*.
 Informally, it is saying that if you estimate $\phi$ by taking the
average of a number of observations, then your estimate will be good
as long as the number of observations is large.
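A quick simulation of this (parameters arbitrary): estimate $\hat{\phi}$ from $m$ Bernoulli draws many times, and compare the frequency of large deviations to $2\exp(-2\gamma^2m)$.

```python
import numpy as np

rng = np.random.default_rng(0)
phi, m, gamma, trials = 0.5, 100, 0.1, 2000

# Each row is one experiment: m Bernoulli(phi) draws averaged into phi_hat.
samples = rng.binomial(1, phi, size=(trials, m))
phi_hat = samples.mean(axis=1)
empirical = np.mean(np.abs(phi_hat - phi) > gamma)

bound = 2 * np.exp(-2 * gamma**2 * m)  # Hoeffding bound, about 0.271
```

The observed deviation frequency sits well under the (loose) bound.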
** Definitions
 Restrict ourselves to binary classification, but the results will
apply to regression, etc.
 Let the training set $S$ be of size $m$ and the training examples
$(x^{(i)},y^{(i)})$ be drawn i.i.d. from a distribution
$\mathcal{D}$.
 The *training error* of a hypothesis $h$ is
$\hat{\epsilon}(h)=\frac{1}{m}\sum_{i=1}^m1\{h(x^{(i)})\ne y^{(i)}\}$
 This is just the fraction of cases we misclassify.
 Also known as the *empirical risk* or the *empirical error*.
 Note that it depends on the training set $S$.
 Define the *generalization error* as $\epsilon(h)=P_{(x,y)\sim
\mathcal{D}}(h(x)\ne y)$
 It is the probability of misclassifying a new sample (not
necessarily from the training set).
 Note that it is assumed that this is coming from the same
distribution $\mathcal{D}$. This is one of the *PAC* (probably
approximately correct) assumptions.
 One way to minimize the training error is
$\hat{\theta}=\arg\min_{\theta}\hat{\epsilon}(h_{\theta})$ for the
linear classification case.
 This is called the *empirical risk minimization* (ERM) and the
resulting hypothesis is $\hat{h}=h_{\hat{\theta}}$.
 ERM is one of the most basic learning algorithms.
 This is a non-convex optimization problem. It is NP-hard.
 Logistic regression can be thought of as an approximation to ERM
that turns it into a convex optimization problem.
 The *hypothesis class* $\mathcal{H}$ /used by a learning algorithm/ is
the set of all classifiers considered by it.
 ERM is a minimization over $\mathcal{H}$: the learning algorithm
picks the hypothesis
$\hat{h}=\arg\min_{h\in\mathcal{H}}\hat{\epsilon}(h)$.
** Finite $\mathcal{H}$
 Let $\mathcal{H}$ be a set of $k$ hypotheses (for linear
/classification/!).
 ERM will select one of these hypotheses with respect to a training
sample.
 Can show that the probability that there does /not/ exist a hypothesis
$h_i\in\mathcal{H}$ with $|\epsilon(h_i)-\hat{\epsilon}(h_i)|>\gamma$
(i.e. every hypothesis is within $\gamma$) is greater than or equal to
$1-2k\exp(-2\gamma^2m)$.
 This is called the *uniform convergence* result.
 If $k$ is large (which it normally is), then this isn't a very
useful result.
 An algorithm's *sample complexity* is the $m$ needed to get a certain
level of performance.
 Let $h^{\ast}$ be the best possible hypothesis in $\mathcal{H}$. Then
$\epsilon(\hat{h})\le\epsilon(h^{\ast})+2\gamma$.
 Let $|\mathcal{H}|=k$ and fix $m,\delta$. Then with probability at
least $1-\delta$, we have $\epsilon(\hat{h})\le\left(\min_{h\in
\mathcal{H}}\epsilon(h)\right)+2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$
 Let $|\mathcal{H}|=k$ and fix $\delta,\gamma$. Then for
$\epsilon(\hat{h})\le\min_{h\in \mathcal{H}}\epsilon(h)+2\gamma$ to
hold with probability at least $1-\delta$, it suffices that
$m\ge\frac{1}{2\gamma^2}\log\frac{2k}{\delta}$
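The last bound turns directly into a sample-complexity calculator; the values of $k$, $\gamma$, and $\delta$ below are arbitrary:

```python
import math

def sample_complexity(k, gamma, delta):
    """Smallest m guaranteeing eps(h_hat) <= min_h eps(h) + 2*gamma
    with probability at least 1 - delta, for |H| = k."""
    return math.ceil((1 / (2 * gamma ** 2)) * math.log(2 * k / delta))

# Hypothetical setting: 10,000 hypotheses, gamma = delta = 0.05.
m = sample_complexity(k=10000, gamma=0.05, delta=0.05)
```

Note the dependence on $k$ is only logarithmic, so even large hypothesis classes need a modest number of examples.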
** Infinite $\mathcal{H}$
 All results below are for ERM for linear classification; they may not
carry over exactly to other algorithms.
 Given a set $S$ of $d$ points $\mathbf{x}$ (not necessarily the training set),
we say that $\mathcal{H}$ *shatters* $S$ if $\mathcal{H}$ can realize
any labeling on $S$.
 In other words, no matter what labels you assign to the $x$'s, there
will be some hypothesis that classifies all of them correctly (a
different hypothesis for each labeling).
 Let $\mathcal{H}$ be the set of linear binary classifiers in 2-D.
 It can shatter any set of 2 points.
 But not all sets of 3 points (it cannot handle 3 collinear points).
 It cannot shatter even a single set of 4 points.
 Given $\mathcal{H}$, its *Vapnik-Chervonenkis dimension*
$VC(\mathcal{H})$ is the size of the largest set it can shatter.
 Its value could be $\infty$.
 This does /not/ mean it can shatter any set of that size; just that
there exists at least one set of that size that it can shatter.
 For linear classifiers in $n$ dimensions,
$VC(\mathcal{H})=n+1$. \q{Not proven}. Yaser proves it for the
perceptron algorithm.
 Given $\mathcal{H}$, let $d=VC(\mathcal{H})$. Then with probability at
least $1-\delta$, we have for all $h\in \mathcal{H}$,
$|\epsilon(h)-\hat{\epsilon}(h)|\le
O\left(\sqrt{\frac{d}{m}\log\frac{m}{d}+\frac{1}{m}\log\frac{1}{\delta}}\right)$
\q{Not shown!}
 Thus, with probability at least $1-\delta$,
$\epsilon(\hat{h})\le\epsilon(h^{\ast})+O\left(\sqrt{\frac{d}{m}\log\frac{m}{d}+\frac{1}{m}\log\frac{1}{\delta}}\right)$
\q{Haven't verified!}
 If you do the math, this means that you need $m=O(d)$ \q{Did not
verify.}
 It is also a lower bound. \q{So it should be $m=\Theta(d)$?}
 For most ordinary problems, $d$ will be roughly similar to the number
of parameters in your model. So as a heuristic, if you increase the
number of parameters, then you correspondingly should increase
(roughly linearly) the number of training examples you use.
* Model Selection
The main idea is: How can we decide which algorithm will work best for
our problem? This could be linear regression vs neural networks vs SVM,
or even hypotheses within each of these models (how many features, etc).
 Assume we have a finite set of models $\mathcal{M}$.
** Cross Validation
 Assume you have a training set $S$.
*** Bad Method
 Poor algorithm:
 Train each $M_i$ on S to get $h_i$.
 Pick the hypothesis with the smallest training error.
 It's poor because you can always overfit to get a small training
error.
*** Hold-Out Cross Validation
 Algorithm:
1. Randomly split $S$ into $S_{\mathrm{train}}$ (usually 70% of the
data) and $S_{cv}$. cv stands for cross validation.
2. Train each $M_i$ on $S_{\mathrm{train}}$ to get $h_i$.
3. Choose the hypothesis with the smallest $\hat{\epsilon}_{S_{cv}}(h_i)$
 Usually the hold-out cross validation set has between 1/4 and 1/3 of
the data.
 One could use step 3 to select the /model/ $M_i$, and then retrain
over the whole $S$.
 Often a good idea unless your algorithm is sensitive to small
perturbations in the data.
 Downside: You "waste" 30% of the data.
 If you want to report the /generalization error/ (the expected error
on untested samples), do /not/ report the error on the cross
validation set as the generalization error. Do it on a "test" set that
has not been used for either training or cross validation.
 Sometimes people use 60% for training, 20% for cross validation,
and 20% to report the generalization error.
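A minimal sketch of the 70/30 split in Python (the synthetic data is hypothetical):

```python
import random

def holdout_split(S, frac_train=0.7, seed=0):
    """Randomly split a dataset into S_train and S_cv."""
    rng = random.Random(seed)
    idx = list(range(len(S)))
    rng.shuffle(idx)
    cut = int(frac_train * len(S))
    return [S[i] for i in idx[:cut]], [S[i] for i in idx[cut:]]

# Hypothetical data: y = 2x + noise.
rng = random.Random(42)
S = [(x, 2 * x + rng.gauss(0, 0.1)) for x in range(100)]
S_train, S_cv = holdout_split(S)
```

Each model would then be trained on `S_train` and the one with the smallest error on `S_cv` selected.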
*** $k$-fold Cross Validation
 Algorithm:
1. Randomly split $S$ into $k$ disjoint subsets of $m/k$ training
examples.
2. For each $M_i$:
 For each $j=1,\dots,k$, train $M_i$ on all the subsets /except/
$S_j$ to get $h_{ij}$.
 Test $h_{ij}$ on $S_j$ to get $\hat{\epsilon}_{S_j}(h_{ij})$
 Take the average of the $\hat{\epsilon}_{S_j}(h_{ij})$ over $j$.
3. Pick the $M_i$ with the lowest averaged error, and then retrain
$M_i$ over $S$.
 $k=10$ is typically used.
 Generally this method is more expensive than hold-out cross
validation.
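The fold construction and the averaged error can be sketched as follows; the mean-predictor model and MSE error function are hypothetical stand-ins for a real learning algorithm:

```python
import random

def kfold_indices(m, k, seed=0):
    """Randomly split indices 0..m-1 into k disjoint folds."""
    idx = list(range(m))
    random.Random(seed).shuffle(idx)
    return [idx[j::k] for j in range(k)]

def kfold_error(fit, err, S, k=10, seed=0):
    """Average held-out error of one model: train on all folds except
    S_j, test on S_j, and average the k resulting errors."""
    folds = kfold_indices(len(S), k, seed)
    total = 0.0
    for j in range(k):
        train = [S[i] for f in range(k) if f != j for i in folds[f]]
        test = [S[i] for i in folds[j]]
        total += err(fit(train), test)
    return total / k

# Hypothetical model: predict the mean of the training y's; err = MSE.
fit = lambda train: sum(y for _, y in train) / len(train)
err = lambda h, test: sum((y - h) ** 2 for _, y in test) / len(test)
S = [(x, 5.0) for x in range(100)]
avg_err = kfold_error(fit, err, S)
```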
*** Leave-One-Out Cross Validation
 At the extreme, people take $k=m$ in the above.
 This is essentially training on all but one example in $S$, and using
that one as the test.
 This method is called *leave-one-out cross validation*.
*** Comments
Can use these techniques to evaluate a /single/ model/algorithm: the
held-out (or averaged fold) error serves as an estimate of that
model's generalization error, even when there is no competing model to
compare against.
** Feature Selection
 You may have a large $n$ (possibly even larger than $m$). You want to
pare it down to the main features.
 Use heuristics to reduce the number of features.
*** Forward Search
 Algorithm:
1. Initialize $\mathcal{F}=\emptyset$.
2. Repeat:
 For $i=1,\dots,n$, if $i\notin \mathcal{F}$, let
$\mathcal{F}_i=\mathcal{F}\union\{i\}$, and use some cross
validation technique to evaluate $\mathcal{F}_i$
 Pick the $\mathcal{F}_j$ that worked best in the above step and
set $\mathcal{F}$ to it.
3. Select the best feature set of all the ones evaluated.
 Have to decide when to stop the outer loop.
 One method is to stop when $\mathcal{F}$ hits some size (which
could be $n$).
 This algorithm is an instance of a *wrapper model feature
selection*. They work well but are expensive.
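A sketch of forward search, where the cross-validation step is abstracted into a `score` function (the toy score below is hypothetical):

```python
def forward_search(features, score, max_size=None):
    """Greedy wrapper feature selection. `score(F)` should return a
    cross-validation error for feature subset F (lower is better)."""
    F, best_sets = set(), []
    max_size = max_size or len(features)
    while len(F) < max_size:
        candidates = [F | {i} for i in features if i not in F]
        F = min(candidates, key=score)      # grow by the best feature
        best_sets.append(set(F))
    return min(best_sets, key=score)        # best set of all evaluated

# Hypothetical score: features 0 and 2 are informative; extras add noise.
useful = {0, 2}
score = lambda F: len(useful - F) + 0.1 * len(F - useful)
best = forward_search(range(5), score)
```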
*** Backward Search
 Start with the full set of features, and delete one at a time until
you have the empty set.
*** Filter Feature Selection
 Computationally cheaper
 Assign a score $S(i)$ that tells you how informative each $x_i$ is
about the class labels $y$. Then pick the $k$ features with the
largest scores.
 Choose $S(i)$ to be the *mutual information* $MI(x_i,y)$ between $x_i$
and $y$:
$MI(x_i,y)=\sum_{x_i\in\{0,1\}}\sum_{y\in\{0,1\}}p(x_i,y)\log\frac{p(x_i,y)}{p(x_i)p(y)}$.
 This equation assumes both $x_i$ and $y$ are binary. More generally
the sum would be over the domain of each.
 The probabilities in the equation can be estimated based on their
distributions in the training set.
 The mutual information can be expressed as a Kullback-Leibler (KL)
divergence: $MI(x_i,y)=KL\left(p(x_i,y)\,\|\,p(x_i)p(y)\right)$
 This gives a measure of how different $p(x_i,y)$ and $p(x_i)p(y)$
are.
 If $x_i$ and $y$ were independent random variables, then
$p(x_i,y)=p(x_i)p(y)$ and the divergence would be 0.
 In the independent case the score should be low: knowing $x_i$
doesn't help us know $y$.
 Once you have ranked the features, how do you pick $k$? One way is to
use cross validation to select the $k$.
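The mutual-information score can be computed directly from empirical counts; the example features below are contrived to show one maximally informative feature and one independent of the labels:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI(x, y) for two discrete feature/label sequences, with the
    probabilities estimated from their empirical distribution."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        mi += (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
    return mi

ys = [0, 0, 1, 1] * 25
xs = ys[:]          # feature identical to the label: maximal MI
zs = [0, 1] * 50    # feature empirically independent of the label
```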
** Bayesian Statistics and Regularization
 *Frequentist* view: $\theta$ is a constant, but unknown quantity. It
is not random. We use statistics to estimate it.
 *Bayesian* view: $\theta$ is a random variable. Specify a *prior
distribution* $p(\theta)$ that expresses "prior beliefs".
 Given a training set $S$, compute the posterior distribution:
 $p(\theta|S)=\frac{p(S|\theta)p(\theta)}{p(S)}$
 Which is equal to
$\frac{\left(\prod_{i=1}^mp(y^{(i)}|x^{(i)},\theta)\right)p(\theta)}{\int_{\theta}\left(\prod_{i=1}^mp(y^{(i)}|x^{(i)},\theta)\right)p(\theta)\ d\theta}$
 $p(y^{(i)}|x^{(i)},\theta)$ comes from your model (e.g. the one we
used for logistic regression, etc).
 Note the switch from $p(y|x;\theta)$ to $p(y|x,\theta)$.
 Given a new test example $x$, compute the posterior distribution on
$y$ using the posterior one on $\theta$:
$p(y|x,S)=\int_{\theta}p(y|x,\theta)p(\theta|S)\ d\theta$
 To get the expected value of $y$ given $x$, calculate
$E[y|x,S]=\int_yy\,p(y|x,S)\ dy$
 Doing all those integrals is expensive, especially if $\theta$ has a
lot of features. So we approximate $p(\theta|S)$.
 One approximation is to use the *maximum a posteriori* (MAP) estimate:
$\theta_{MAP}=\arg\max_{\theta}\prod_{i=1}^mp(y^{(i)}|x^{(i)},\theta)p(\theta)$
 This is /almost/ the same as the maximum likelihood estimate for
$\theta$. \q{Verify!}
 In practice, a common choice for $p(\theta)$ is $\theta\sim
\mathcal{N}(0,\tau^2I)$
 With this choice, $\theta_{MAP}$ will have smaller norm than what
you get from maximum likelihood. \q{Not shown!} \q{So what does that
mean?}
 This will cause the estimate to be less susceptible to overfitting
than maximum likelihood. \q{Why?}
 Bayesian logistic regression works well for text classification.
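For linear regression with Gaussian noise (an assumption made for this illustration; the section's models are more general), the MAP estimate under the prior $\theta\sim\mathcal{N}(0,\tau^2I)$ has a closed form: ridge regression with $\lambda=\sigma^2/\tau^2$. That makes the norm-shrinking effect easy to see numerically on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 5
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + rng.normal(scale=0.5, size=m)

# Maximum likelihood (ordinary least squares) estimate.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# MAP estimate under theta ~ N(0, tau^2 I), i.e. ridge regression
# with lam = sigma^2 / tau^2 (lam = 1.0 chosen arbitrarily here).
lam = 1.0
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```

In the eigenbasis of $X^TX$ each component of the MAP solution is shrunk by a factor less than 1, which is why its norm is smaller than the maximum likelihood norm.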
* The Perceptron and Large Margin Classifiers
 *Batch Learning*: First given the training set, then evaluate $h$ on
separate test data.
 *Online Learning*: Algorithm has to make predictions while it's
learning.
 For convenience, let the negative class be $-1$ in this section and
we'll consider the perceptron algorithm.
 If, given $(\mathbf{x},y)$, the hypothesis correctly classifies the input, no
updates to the weights are needed.
 If it misclassifies, then we update $\Theta:=\Theta+y\mathbf{x}$. \q{He
claims this is the same as the earlier update rule with $-1$ instead
of 0 and with $\alpha$ dropped (not useful in perceptron), but I can't
reproduce that claim.}
 *Theorem*: Let's say you have a sequence of
$(\mathbf{x}^{(1)},y^{(1)}),\dots,(\mathbf{x}^{(m)},y^{(m)})$, and that
$\|\mathbf{x}^{(i)}\|\le D\ \forall i$. If there exists a unit-length
vector $\mathbf{u}$ such that
$y^{(i)}\cdot(\mathbf{u}^T\mathbf{x}^{(i)})\ge\gamma$ for all
examples (this implies $\mathbf{u}$ separates the data with a margin
of at least $\gamma$), then the total number of mistakes the online
perceptron algorithm makes is at most $(D/\gamma)^2$.
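A sketch of the online, mistake-driven update with labels in $\{-1,+1\}$; the data stream is a hypothetical linearly separable example:

```python
def perceptron(stream, n):
    """Online perceptron. On a mistake, update Theta := Theta + y*x;
    on a correct prediction, leave the weights alone."""
    theta = [0.0] * n
    mistakes = 0
    for x, y in stream:
        pred = 1 if sum(t * xi for t, xi in zip(theta, x)) >= 0 else -1
        if pred != y:
            mistakes += 1
            theta = [t + y * xi for t, xi in zip(theta, x)]
    return theta, mistakes

# Hypothetical separable stream: label = sign of the first coordinate,
# with margin gamma >= 0.7 for u = (1, 0) and D = sqrt(5).
data = [((1.0, 0.3), 1), ((-1.0, 0.5), -1), ((2.0, -1.0), 1),
        ((-1.5, -0.2), -1), ((0.8, 0.1), 1), ((-0.7, 0.9), -1)] * 3
theta, mistakes = perceptron(data, 2)
```

The mistake bound $(D/\gamma)^2\approx 10.2$ holds here regardless of the order in which the examples arrive.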
* The k-means Clustering Algorithm
 This is an unsupervised learning algorithm for clustering a set of $m$
points.
 Algorithm:
1. Initialize *cluster centroids* $\mu_1,\dots,\mu_k\in\mathds{R}^n$
randomly, assuming we want $k$ clusters.
 One option is to pick $k$ of the $m$ points, and use those as
your initial $\mu_i$.
2. Repeat until convergence:
 For every $i$, set
$c^{(i)}=\arg\min_j\|\mathbf{x}^{(i)}-\mu_j\|^2$
 This is basically finding the centroid closest to each point.
 For each $j$, set
$\mu_j:=\frac{\sum_{i=1}^m1\{c^{(i)}=j\}\mathbf{x}^{(i)}}{\sum_{i=1}^m1\{c^{(i)}=j\}}$
 This is moving the centroid to the mean of the points assigned
to it.
3. If at any point you have a $\mu_j$ that has no points assigned to
it, can do one of two things:
 Eliminate that cluster (so you'll have $k-1$ of them). This is more
common.
 Randomly assign it elsewhere.
 $k$ is a parameter of the algorithm. It is the number of clusters we
desire. $\mu_j$ will represent the centroid of a cluster.
 The algorithm is guaranteed to converge. \q{Not shown.}
 Let the *distortion function* be
$J(c,\mu)=\sum_{i=1}^m\|\mathbf{x}^{(i)}-\mu_{c^{(i)}}\|^2$
 $k$-means is coordinate descent on $J$: the assignment step minimizes
$J$ over the $c^{(i)}$ while holding the $\mu_j$ fixed, and the update
step minimizes over the $\mu_j$ while holding the $c^{(i)}$ fixed.
 In principle, it is possible to have oscillations between different
clusterings, but it is very unlikely.
 $J$ is non-convex, so a global minimum is not guaranteed.
 However, in practice, it tends to work quite well.
 To get around this, you can run the algorithm many times (50--1000)
with different initial values. Then pick the one with the lowest
$J$.
 If $K=2$--$10$, then running it several times can really help. If $K$
is large, you don't get much benefit and you might as well use your
first guess.
 To choose the number of clusters, the /most common/ approach is to pick
one through visualization of the data. You can try other methods, but
keep this in mind.
 One approach is to plot $J$ vs $K$. Pick the "elbow" point beyond
which $J$ doesn't reduce much.
 You often don't get a clear elbow.
 Sometimes your application domain will suggest a $K$.
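The two alternating steps above can be sketched as follows; for determinism this sketch initializes the centroids from the first $k$ points rather than a random sample, and the two well-separated blobs are synthetic:

```python
def kmeans(points, k, iters=20):
    """Alternate between assigning each point to its nearest centroid
    and moving each centroid to the mean of its assigned points."""
    mu = [tuple(p) for p in points[:k]]   # init from k of the m points
    c = [0] * len(points)
    for _ in range(iters):
        # Assignment step: c[i] = index of the closest centroid.
        c = [min(range(k),
                 key=lambda j: sum((a - b) ** 2 for a, b in zip(p, mu[j])))
             for p in points]
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, ci in zip(points, c) if ci == j]
            if members:                   # keep empty clusters where they are
                mu[j] = tuple(sum(d) / len(members) for d in zip(*members))
    return mu, c

# Two synthetic blobs around (0, 0) and (10, 10).
pts = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
pts += [(10 + 0.1 * i, 10 + 0.1 * j) for i in range(3) for j in range(3)]
mu, c = kmeans(pts, 2)
```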
* Mixtures of Gaussians and the EM Algorithm
 *EM* is the *Expectation Maximization* algorithm.
 Let's say we assume that each $z^{(i)}$ (unknown to us) is an
integer from 1 to $k$. We assume that
$\mathbf{x}^{(i)}\mid z^{(i)}=j\sim \mathcal{N}(\mu_j,\Sigma_j)$
 $z^{(i)}$ indicates which of the $k$ Gaussians each $x^{(i)}$ came
from.
 This is the *mixture of Gaussians* model.
 The $z^{(i)}$'s are *latent* random variables as they are hidden.
 If we try to solve this using log likelihood, then we don't get a
closed form expression.
 \q{Is $k$ known?}
 The EM algorithm:
 Repeat until convergence:
1. _E-step_: For each $i,j$, set $w_j^{(i)}=p(z^{(i)}=j|x^{(i)};\phi,\mu,\Sigma)$
2. _M-step_: Update the parameters:
 $\phi_j=\frac{1}{m}\sum_{i=1}^mw_j^{(i)}$
 $\mu_j=\frac{\sum_{i=1}^mw_j^{(i)}x^{(i)}}{\sum_{i=1}^mw_j^{(i)}}$
 $\Sigma_j=\frac{\sum_{i=1}^mw_j^{(i)}(x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^mw_j^{(i)}}$
 The E-step tries to guess $z^{(i)}$. The M-step updates the parameters
based on the guess.
 Susceptible to local optima, so repeat several times with different
initializations.
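A sketch of EM for a 1-D mixture of two Gaussians; the deterministic min/max initialization and the synthetic two-cluster data are choices made for this example (random initialization is the usual practice):

```python
import math
import random

def em_gmm_1d(xs, iters=100):
    """EM for a 1-D mixture of k=2 Gaussians. Returns the mixing
    weights phi_j, means mu_j, and variances var_j."""
    k = 2
    phi = [1.0 / k] * k
    mu = [min(xs), max(xs)]        # deterministic init for this sketch
    var = [1.0] * k
    m = len(xs)
    for _ in range(iters):
        # E-step: w[i][j] = p(z_i = j | x_i; phi, mu, var).
        w = []
        for x in xs:
            p = [phi[j] / math.sqrt(2 * math.pi * var[j])
                 * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                 for j in range(k)]
            s = sum(p)
            w.append([pj / s for pj in p])
        # M-step: re-estimate the parameters from the soft guesses.
        for j in range(k):
            wj = sum(w[i][j] for i in range(m))
            phi[j] = wj / m
            mu[j] = sum(w[i][j] * xs[i] for i in range(m)) / wj
            var[j] = sum(w[i][j] * (xs[i] - mu[j]) ** 2
                         for i in range(m)) / wj
    return phi, mu, var

# Synthetic data: two Gaussian clusters centered at 0 and 10.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(200)]
xs += [rng.gauss(10, 1) for _ in range(200)]
phi, mu, var = em_gmm_1d(xs)
```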
* Dimensionality Reduction
** Motivation
 One motivation is to compress the data to speed up the algorithm.
 Can do this if, for example, data in 3-D seems to lie on a
plane. Then parametrize it with two new features (instead of the
original three).
 This works well if some features have a strong correlation. Often
the case with a lot of features and large data sets.
 Another motivation is to aid in visualization (e.g. to understand the
system, diagnostics, etc).
 The new features need not have a meaning.
** Principal Component Analysis
 First perform feature normalization and scaling.
 Problem statement: to reduce from $n$ dimensions to $k$ dimensions,
find $\mathbf{u}_1,\dots,\mathbf{u}_k\in\mathds{R}^n$
such that the projection error is minimized.
 PCA is *not* linear regression. In the latter you're minimizing
squared prediction (vertical) errors, not the orthogonal projection
error.
** Principal Component Analysis Algorithm
 Compute the *covariance matrix*:
$\Sigma=\frac{1}{m}\sum_{i=1}^m\mathbf{x}^{(i)}(\mathbf{x}^{(i)})^T$
 The sum runs over the $m$ training examples.
 This matrix is symmetric positive semidefinite.
 Compute its eigenvectors.
 Andrew prefers using SVD to do so; he claims it is more numerically
stable for these kinds of matrices than the usual eigenvector
routines.
 Pick the first $k$ eigenvectors from the SVD result.
 The singular values are sorted from largest to smallest, so the
first $k$ columns capture the most variance.
 Let $\mathbf{z}=[\mathbf{u}^{(1)}\dots \mathbf{u}^{(k)}]^T\mathbf{x}$ (matrix
whose columns are the eigenvectors). This is the projection in the
reduced space.
 This procedure minimizes the projection error:
$\sum_{i=1}^m\|\mathbf{x}^{(i)}-\mathbf{x}_{approx}^{(i)}\|^2$ \q{Not shown.}
 To choose $k$, one heuristic is to pick the minimum $k$ that
satisfies:
$$\frac{\sum_{i=1}^m\|\mathbf{x}^{(i)}-\mathbf{x}_{approx}^{(i)}\|^2}{\sum_{i=1}^m\|\mathbf{x}^{(i)}\|^2}\le0.01$$
 The denominator is the total variance. So in effect 99% of the
variance is "retained".
 Can use 0.05 or 0.1 if 0.01 won't work. Don't go above 0.15.
 The quantity above is equivalent to:
$$1-\frac{\sum_{i=1}^kS_{ii}}{\sum_{i=1}^nS_{ii}}$$
 where $S_{ii}$ are the singular values.
 This is much less expensive to calculate!
 The covariance matrix is $n\times n$, so it has $n$ singular
values; the sum in the denominator is over $n$, not $m$.
 To get back $\mathbf{x}$ from $\mathbf{z}$, do
$\mathbf{x}_{approx}=U^T\mathbf{z}$ where $U$ is the truncated
$k\times n$ matrix of eigenvectors you got from SVD.
 This is only an approximation of the original $\mathbf{x}$ unless
the data lay exactly in the $k$-dimensional subspace.
 Run PCA /only/ on the training set! So $m$ is the number of entries in
the training set; exclude the cross-validation set.
 For visualization, only $k=2$ or 3 would be effective. It's mostly
used for data compression.
 Do /not/ try to use PCA to prevent overfitting!
 Only use PCA if you can't do without it (i.e. if memory or speed is a
constraint).
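A sketch of the algorithm with NumPy, including the retained-variance computation; the near-collinear 2-D data is synthetic:

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the covariance matrix. X is m x n and assumed
    already normalized (zero mean, scaled features)."""
    m = X.shape[0]
    Sigma = (X.T @ X) / m                 # covariance matrix, n x n
    U, S, _ = np.linalg.svd(Sigma)        # singular values sorted desc.
    Uk = U[:, :k]                         # first k eigenvectors, n x k
    Z = X @ Uk                            # projection, m x k
    retained = S[:k].sum() / S.sum()      # fraction of variance kept
    return Z, Uk, retained

# Synthetic data lying close to a line in 2-D.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 3 * t + 0.01 * rng.normal(size=(100, 1))])
X = X - X.mean(axis=0)
Z, Uk, retained = pca(X, 1)
```

Reconstructing with `Z @ Uk.T` gives the approximation $\mathbf{x}_{approx}$, which here is very close to `X` because the data is nearly one-dimensional.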
* Anomaly Detection
 Examples:
 Detecting outliers.
 Detecting fraud.
 You usually have a data set of non-anomalous examples.
** Algorithm
 Assume $x_j\sim \mathcal{N}(\mu_j,\sigma_j^2)$
 Define $p(\mathbf{x})=\prod_{j=1}^np(x_j;\mu_j,\sigma_j^2)$.
 This implicitly assumes the features are independent. This is rarely
true but Andrew claims the algorithm is robust nevertheless.
 This is called *density estimation*.
 Algorithm:
 Choose features $x_i$ that you think may be indicative of anomalous
behavior.
 Fit parameters $\mu_j, \sigma_j$:
$$\mu_j=\frac{1}{m}\sum_{i=1}^mx_j^{(i)}$$
$$\sigma_j^2=\frac{1}{m}\sum_{i=1}^m\left(x_j^{(i)}-\mu_j\right)^2$$
 Given a new example $\mathbf{x}$, compute $p(\mathbf{x})$
 It is an anomaly if $p(\mathbf{x})<\epsilon$ ($\epsilon$ is a
threshold chosen separately).
 To understand the rationale, consider the extreme case where all
features are far along the tail of the Gaussian. That's clearly
anomalous, and $p$ will be very low.
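A sketch of the per-feature density estimate; the data, the threshold $\epsilon$, and the test points are all hypothetical:

```python
import math
import random

def fit_gaussians(X):
    """Per-feature Gaussian fit: returns (mu_j, var_j) for each feature."""
    m, n = len(X), len(X[0])
    mu = [sum(x[j] for x in X) / m for j in range(n)]
    var = [sum((x[j] - mu[j]) ** 2 for x in X) / m for j in range(n)]
    return mu, var

def density(x, mu, var):
    """p(x) = product of the per-feature Gaussian densities."""
    p = 1.0
    for xj, mj, vj in zip(x, mu, var):
        p *= math.exp(-(xj - mj) ** 2 / (2 * vj)) / math.sqrt(2 * math.pi * vj)
    return p

# Synthetic non-anomalous data with two independent features.
rng = random.Random(0)
X = [(rng.gauss(5, 1), rng.gauss(0, 2)) for _ in range(500)]
mu, var = fit_gaussians(X)
eps = 1e-6   # hypothetical threshold
is_anomaly = lambda x: density(x, mu, var) < eps
```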
** Evaluating the Algorithm
 Assume we have labeled data of anomalous and non-anomalous examples.
 Build a training set from the non-anomalous examples and drop the
labels (although it's OK if it contains a few anomalous
ones). Perhaps use 60% of the labeled data for this.
 Create a test and cross-validation set; include anomalous examples
in them. Perhaps 20% each. Put half of the anomalous examples in
each.
 Fit model $p(\mathbf{x})$ on training set.
 On a cross validation example, predict anomalousness.
 Pick your evaluation metric carefully. It's likely your data is
heavily skewed (few anomalies).
 True positive, true negative, false positive, false negative.
 Precision/recall
 $F_1$ score.
 Can use the crossvalidation set to choose $\epsilon$.
** Anomaly Detection vs Supervised Learning Algorithm
Prefer anomaly detection when:
 Very small number of positive examples (0--20) and large number of
negative examples.
 If you have many different types of anomalies. It is hard for any
algorithm to learn from the positive examples; new anomalies may look
nothing like previously known ones.
** Choosing What Features to Use
 Make a plot (e.g. histogram) to see if the data is Gaussian.
 Although it often works OK if it's not Gaussian.
 If it's not Gaussian, try a transformation that makes it look
Gaussian.
 Try $\log(x+c)$, or $x^c$.
 A common problem is where both your normal and anomalous examples end
up with a comparable $p$. That means you may need to add another
feature. Examine the anomalous example to see if you can find anything
that sets it apart from the others.
 Try to choose features that take very large or small values on an
anomaly.
** Using the Multivariate Gaussian Distribution
 It can happen that when you examine the data, you can clearly see
anomalous examples, but the individual features don't seem anomalous
when you look at each one independently of the others. The per-feature
algorithm above may fail in this case.
 Model $p(\mathbf{x})$ as a multivariate Gaussian all in one go.
 $\mu\in\mathds{R}^n$
 $\Sigma\in\mathds{R}^{n\times n}$ is the covariance matrix.
 To fit:
 $\mu=\frac{1}{m}\sum_{i=1}^m\mathbf{x}^{(i)}$
 $\Sigma=\frac{1}{m}\sum_{i=1}^m(\mathbf{x}^{(i)}-\mu)(\mathbf{x}^{(i)}-\mu)^T$
 The original model using a Gaussian for each feature is a special case
of the multivariate one. So why use it?
 Use the original if performance is a concern; it tends to scale
better with $n$.
 Use the original model if $m$ is small relative to $n$: fitting the
multivariate model requires $m>n$ for $\Sigma$ to be invertible.
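A sketch of the multivariate fit on synthetic correlated features, where a point that looks normal per-coordinate is jointly anomalous:

```python
import numpy as np

def fit_multivariate(X):
    """Fit mu and the full covariance Sigma (features may be correlated)."""
    mu = X.mean(axis=0)
    D = X - mu
    Sigma = (D.T @ D) / len(X)
    return mu, Sigma

def multivariate_density(x, mu, Sigma):
    """Multivariate Gaussian density p(x; mu, Sigma)."""
    n = len(mu)
    d = x - mu
    norm = ((2 * np.pi) ** (n / 2)) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm)

# Synthetic correlated features: x2 tracks x1. The point (3, -3) is
# unremarkable in each coordinate separately (about 1.5 sigma) but
# falls far off the correlation ridge, so it is jointly anomalous.
rng = np.random.default_rng(0)
x1 = rng.normal(0, 2, size=1000)
X = np.column_stack([x1, x1 + rng.normal(0, 0.3, size=1000)])
mu, Sigma = fit_multivariate(X)
```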