# Logistic Regression as a Neural Network

## Introduction

This post continues my blog series on Neural Networks and Deep Learning with an R twist, motivated by my current enrollment in Andrew Ng's Deep Learning Specialisation on Coursera. One of the very first things I picked up in this course is that the familiar logistic regression classifier can be viewed as a (very small) neural network. In fact, it turns out that the logistic regression classifier is a good example with which to illustrate and motivate the basics of neural networks. I start by motivating the binary classification problem.

## The Binary Classification Problem

The binary classification problem is one in which, given a set of input features $$X$$, we want to output a binary prediction $$y$$. A fascinating example of this type of problem I saw recently: given a set of Nike and Adidas shoe pictures, can we learn a binary classifier that tells whether a shoe was made by Nike or by Adidas?

$X \rightarrow y$

$(shoe\ image\ pixel\ values) \rightarrow shoe\ is\ Nike\ or\ Adidas$

In this setting, the two possible outputs (Nike/Adidas) of $$y$$ are denoted by $$0$$ and $$1$$, and the logistic regression classifier is typically used on this type of problem because it produces an output (prediction) which is the probability that $$y = 1$$ given the input features $$X$$: $\hat{y} = P(y=1 \mid X); \ \ \ \ \ \ 0 \leq \hat{y} \leq 1$

## Why The Logistic Regression Classifier?

The logistic regression classifier can easily be motivated from the linear regression model given by

$\hat{y} = w_1x_1 + w_2 x_2 + \cdots + w_{n_x} x_{n_x} + b = w^Tx + b$ where $$n_x$$ is the number of features (or predictors, or columns) of $$x$$, $$x^T = [x_1\ x_2\ \cdots\ x_{n_x}]$$, and $$w^T = [w_1\ w_2\ \cdots\ w_{n_x}]$$. However, we want $$0 \leq \hat{y} \leq 1$$, and the linear regression model does not satisfy that: its output is unbounded. Consequently, we pass the linear combination $$w^Tx + b$$ through the sigmoid function $$\sigma$$, given by:

$\sigma(z) = \frac{1}{1+e^{-z}}$

Note that if $$z$$ is a large positive number, $$\sigma(z)$$ is close to $$1$$, and if $$z$$ is a large negative number, $$\sigma(z)$$ is close to $$0$$. If we let $$z = w^Tx + b$$, the output of the logistic regression classifier can then be written as: $\hat{y} = \sigma(z) = \sigma(w^Tx + b) = \frac{1}{1+e^{-(w^Tx + b)}}$
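Since this series has an R twist, here is a minimal sketch of the sigmoid function and the resulting prediction in R. The function and variable names (`sigmoid`, `predict_logistic`, `w`, `b`, `x`) and the numeric values are my own, chosen purely for illustration.

```r
# Sigmoid activation: squashes any real number into the interval (0, 1)
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

# Logistic regression prediction: y_hat = sigmoid(w'x + b)
# w and x are numeric vectors of length n_x; b is a scalar
predict_logistic <- function(w, b, x) {
  z <- sum(w * x) + b  # the linear part, equivalent to t(w) %*% x + b
  sigmoid(z)
}

# Made-up example with n_x = 3 features
w <- c(0.5, -1.2, 0.3)
b <- 0.1
x <- c(1.0, 0.5, 2.0)
predict_logistic(w, b, x)  # a probability strictly between 0 and 1
```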

## Learning The Parameters of The Logistic Regression Classifier

This section details the steps needed to train the logistic regression classifier. First, we set up the problem with the appropriate notation.

### Setup

Given the set of $$m$$ training examples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$ where $$x^{(i)} \in \mathbb{R}^{n_x}$$ and $$y^{(i)} \in \{0, 1\}$$, with the superscript indices referring to training examples, we want, for any single training example $$(x^{(i)}, y^{(i)})$$, a prediction that is as close to the actual value as possible, i.e. we want: $\hat{y}^{(i)} \approx {y}^{(i)}$ where $\hat{y}^{(i)} = \sigma({z}^{(i)}) = \sigma(w^T{x}^{(i)} + b) = \frac{1}{1+e^{-(w^T{x}^{(i)} + b)}}$
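To make the notation concrete, the toy sketch below (my own made-up data, not from the course) stores the $$m$$ training examples as the columns of an $$n_x \times m$$ matrix, so that $$x^{(i)}$$ is the $$i$$-th column, and computes $$\hat{y}^{(i)}$$ for every example at once using the `sigmoid()` defined earlier:

```r
set.seed(1)
n_x <- 3  # number of features
m   <- 5  # number of training examples

# Column i of X is the training example x^(i); y holds the labels y^(i)
X <- matrix(rnorm(n_x * m), nrow = n_x, ncol = m)
y <- sample(0:1, m, replace = TRUE)

w <- rep(0, n_x)  # parameters initialised to zero
b <- 0

# y_hat^(i) = sigmoid(w'x^(i) + b), computed for all i in one step
y_hat <- sigmoid(as.vector(t(w) %*% X + b))
y_hat  # with w = 0 and b = 0, every prediction is 0.5
```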

Next, we proceed by defining the logistic regression loss function.

### The Logistic Regression Loss (or Error) Function

To assess how good the current values of the parameters $$w$$ and $$b$$ are, we need a metric for how close a single prediction $$\hat{y}^{(i)}$$ (on the training example $$x^{(i)}$$, given $$w$$ and $$b$$) is to the actual value $$y^{(i)}$$. This metric is called the loss (or error) on a training example, and it measures the discrepancy between $$\hat{y}^{(i)}$$ and $$y^{(i)}$$. For logistic regression, the preferred loss function is $L(\hat{y}^{(i)}, y^{(i)}) = -\left[y^{(i)} \cdot \log(\hat{y}^{(i)}) + (1-y^{(i)})\cdot\log(1-\hat{y}^{(i)})\right]$ Note that if $$y^{(i)} = 1$$, then $$L(\hat{y}^{(i)}, y^{(i)}) = -\log(\hat{y}^{(i)})$$. Since we want to minimise the loss (we want $$\hat{y}^{(i)}$$ to be as close to $$y^{(i)}$$ as possible), $$\log(\hat{y}^{(i)})$$ must be large, which implies that $$\hat{y}^{(i)}$$ must be large and hence close to $$1$$ (since the sigmoid function ensures that $$\hat{y}^{(i)} \leq 1$$).

Likewise, if $$y^{(i)} = 0$$, then $$L(\hat{y}^{(i)}, y^{(i)}) = -\log(1 - \hat{y}^{(i)})$$, and to minimise the loss, $$\log(1- \hat{y}^{(i)})$$ must be large. This implies that $$1 - \hat{y}^{(i)}$$ must be large, which in turn means that $$\hat{y}^{(i)}$$ must be small and, given the constraint of the sigmoid function ($$0 \leq \hat{y}^{(i)}$$), close to $$0$$.

Therefore, minimising $$L(\hat{y}^{(i)}, y^{(i)})$$ corresponds to making the prediction $$\hat{y}^{(i)}$$ as close to the actual value $$y^{(i)}$$ as possible.
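The loss function translates directly into R; the quick checks below (with made-up predictions) confirm the two cases just discussed:

```r
# Cross-entropy loss on a single training example
loss <- function(y_hat, y) {
  -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

# If y = 1, the loss is small when y_hat is close to 1 ...
loss(0.99, 1)  # ~0.01
loss(0.01, 1)  # ~4.61: a confident wrong prediction is heavily penalised

# ... and if y = 0, the loss is small when y_hat is close to 0
loss(0.01, 0)  # ~0.01
loss(0.99, 0)  # ~4.61
```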

### The Logistic Regression Cost Function

The loss function defined above measures the error between the prediction $$\hat{y}^{(i)}$$ and the actual value $$y^{(i)}$$ on a single training example $$i$$. To assess the parameters $$w$$ and $$b$$ on the entire training data, we define the cost function, which averages the loss over all $$m$$ training examples. Consequently, the cost function given $$w$$ and $$b$$ is defined as $J(w,b) = \frac{1}{m}\sum_{i = 1}^m L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i = 1}^m \left[y^{(i)} \cdot \log(\hat{y}^{(i)}) + (1-y^{(i)})\cdot\log(1-\hat{y}^{(i)})\right]$ Therefore, to get the values of $$w$$ and $$b$$ that make $$\hat{y}^{(i)}$$ as close to $$y^{(i)}$$ as possible for all $$i = 1, \ldots, m$$, we need to minimise $$J(w,b)$$. That is, we want to find the $$w$$ and $$b$$ which minimise $$J$$!
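Putting the pieces together, here is a minimal sketch of the cost function in R, reusing the `sigmoid()` and the toy `X`, `y`, `w`, `b` from the earlier snippets:

```r
# Cost J(w, b): the average cross-entropy loss over all m training examples
cost <- function(w, b, X, y) {
  m     <- ncol(X)
  y_hat <- sigmoid(as.vector(t(w) %*% X + b))
  -(1 / m) * sum(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

cost(w, b, X, y)  # with w = 0 and b = 0, every y_hat is 0.5, so J = log(2) ~ 0.693
```

Training the classifier then amounts to searching for the $$w$$ and $$b$$ that drive this number down.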