Logistic Regression
Logistic regression is another linear model, but one that outputs a probability (confidence score) between 0 and 1, so its output is real-valued and bounded. Instead of a hard binary classification, logistic regression can attach a probability to a prediction, for example the probability of diabetes occurring in a person given their blood pressure, height, weight, age, etc.
Hypothesis Function¶
The output of logistic regression is obtained by plugging the original linear prediction function into a sigmoid function:
\(h(x) = \sigma(\theta^T x + \theta_0) = \cfrac{\exp(\theta^T x + \theta_0)}{1 + \exp(\theta^T x + \theta_0)}\)
Here, \(P(y | x) = \begin{cases} h(x) & y = 1 \\ 1 - h(x) & y = -1\end{cases}\)
Due to a property of the sigmoid function, \(\sigma(-s) = 1 - \sigma(s)\), we can rewrite the probability as:
\(P(y | x) = \sigma(y (\theta^T x + \theta_0))\)
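A minimal sketch of this hypothesis in NumPy (the function and variable names here are illustrative assumptions, not from the original notes):

```python
import numpy as np

def sigmoid(s):
    # sigma(s) = exp(s) / (1 + exp(s)) = 1 / (1 + exp(-s))
    return 1.0 / (1.0 + np.exp(-s))

def h(x, theta, theta_0):
    # Hypothesis h(x) = sigma(theta^T x + theta_0): the model's P(y = +1 | x).
    return sigmoid(x @ theta + theta_0)

def p_y_given_x(y, x, theta, theta_0):
    # P(y | x) = sigma(y * (theta^T x + theta_0)), using sigma(-s) = 1 - sigma(s).
    return sigmoid(y * (x @ theta + theta_0))
```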
Maximum Likelihood Estimation¶
The likelihood of correctly predicting all the training examples in a set \(S_n\) is thus given by \(p = \prod\limits_{i} P(y_i|x_i)\). This is the function that we will be maximising.
Maximising this is equivalent to maximising \(\cfrac{\ln(p)}{n}\), because \(\ln\) is a monotonically increasing function (and dividing by the constant \(n\) does not change the maximiser):
Maximise: \(\cfrac{1}{n} \sum\limits_{i = 1}^{n} \ln(P(y_i | x_i))\)
Minimise: \(J(\theta) = \cfrac{1}{n} \sum\limits_{i = 1}^{n} \ln(1/P(y_i | x_i))\)
Since \(P(y_i | x_i) = \sigma(y_i (\theta^T x_i + \theta_0)) = \cfrac{1}{1 + \exp(-y_i (\theta^T x_i + \theta_0))}\), this becomes:
Minimise: \(J(\theta) = \cfrac{1}{n} \sum\limits_{i = 1}^{n} \ln(1 + \exp(-y_i (\theta^T x_i + \theta_0 )))\)
The benefit of doing this is that differentiating a sum is much easier than differentiating a product.
We can use either regular (batch) gradient descent or stochastic gradient descent to find the minimum of this cost.
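As a sketch, this cost can be computed directly (here `X`, `y`, `theta` and `theta_0` are assumed NumPy arrays/scalars; the names are illustrative):

```python
import numpy as np

def cost(theta, theta_0, X, y):
    # J(theta) = (1/n) * sum_i ln(1 + exp(-y_i (theta^T x_i + theta_0)))
    margins = y * (X @ theta + theta_0)
    # logaddexp(0, -m) = ln(exp(0) + exp(-m)) = ln(1 + exp(-m)), computed stably
    return np.logaddexp(0.0, -margins).mean()
```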
Stochastic GD¶
For a single training example \((x_t, y_t)\), absorbing \(\theta_0\) into \(\theta\) for brevity, the per-example loss is:
\(e_t(\theta) = \ln(1 + e^{\displaystyle -y_t \theta^T x_t})\)
\(\nabla_{\theta} e_t(\theta) = \cfrac{1}{1 + e^{\displaystyle -y_t \theta^T x_t}} \times e^{\displaystyle -y_t \theta^T x_t} \times (-y_t x_t)\)
\(\nabla_{\theta} e_t(\theta) = \cfrac{-y_t x_t}{1 + e^{\displaystyle y_t \theta^T x_t}}\)
Then the weight update is:
\(\theta_{t+1} = \theta_t - \eta \nabla_{\theta} e_t (\theta)\)
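A minimal SGD loop under these update rules might look like the following sketch (the learning rate, epoch count, and random ordering are assumptions for illustration; the bias is assumed absorbed into `theta`):

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, epochs=100, seed=0):
    # Labels y are assumed to be in {-1, +1}; the bias theta_0 is assumed
    # absorbed into theta by appending a constant-1 feature to each row of X.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for t in rng.permutation(n):                      # random example order
            margin = y[t] * (X[t] @ theta)                 # y_t theta^T x_t
            grad = -y[t] * X[t] / (1.0 + np.exp(margin))   # nabla_theta e_t(theta)
            theta = theta - eta * grad                     # theta_{t+1} = theta_t - eta * grad
    return theta
```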
Prediction¶
Our prediction is \(+1\) iff:
- \(P(y = +1|x) \geq 0.5\)
- \(P(y = +1|x) \geq P(y = -1|x)\), i.e. \(P(y = +1|x)/P(y = -1|x) \geq 1\)
- \(\theta^T x + \theta_0 \geq 0\) (by taking \(\ln\) of the ratio above, which equals \(\exp(\theta^T x + \theta_0)\))
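Expressed as code, the decision rule is just a sign check on the linear score (a sketch; the names are assumptions):

```python
import numpy as np

def predict(X, theta, theta_0=0.0):
    # Predict +1 iff theta^T x + theta_0 >= 0, i.e. iff P(y = +1 | x) >= 0.5.
    scores = X @ theta + theta_0
    return np.where(scores >= 0, 1, -1)
```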