Theory and Code for Building First Neural Network

Abstract:

Deep learning models are derived from, or at least inspired by, neural networks (NN). I derive the theory and then implement it in Python to build a first extensible NN model.

Content:

First I give the conventions used in the derivation, then derive the math under those conventions, then give the code implementation, and finally an application.

Structure

For this network, we assume the following structure:

$\begin{align*} x.shape = A^{[0]}.shape &= (n^{[0]}, m) \\ W^{[1]}.shape &= (n^{[1]}, n^{[0]}) \\ b^{[1]}.shape &= (n^{[1]}, 1) \\ Z^{[1]}.shape &= (n^{[1]}, m) \\ A^{[1]}.shape &= (n^{[1]}, m) \\ W^{[2]}.shape &= (n^{[2]}, n^{[1]}) \\ b^{[2]}.shape &= (n^{[2]}, 1) \\ Z^{[2]}.shape &= (n^{[2]}, m)\\ \hat{y}.shape = A^{[2]}.shape &= (n^{[2]}, m) \end{align*}$

where $n^{[i]}$ is the number of units in layer $i$; $x$ is the input data with $n^{[0]}$ features and $m$ examples in total; $\hat{y}$ is the estimated output. Usually $n^{[2]} = 1$ for a binary classification system.
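The shape bookkeeping above can be checked mechanically. The following is a minimal sketch, assuming arbitrary illustrative sizes (`n0`, `n1`, `n2`, `m` are hypothetical values, not taken from any dataset in this post):

```python
import numpy as np

# Hypothetical layer sizes and example count, chosen only for illustration.
n0, n1, n2, m = 3, 4, 1, 5

rng = np.random.default_rng(0)
X = rng.standard_normal((n0, m))            # A0: one column per example
W1 = rng.standard_normal((n1, n0)) * 0.01
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((n2, n1)) * 0.01
b2 = np.zeros((n2, 1))

Z1 = np.dot(W1, X) + b1                     # b1 is broadcast across the m columns
A1 = 1.0 / (1.0 + np.exp(-Z1))              # sigmoid keeps the shape of Z1
Z2 = np.dot(W2, A1) + b2
A2 = 1.0 / (1.0 + np.exp(-Z2))

shapes = {"X": X.shape, "W1": W1.shape, "b1": b1.shape, "Z1": Z1.shape,
          "A1": A1.shape, "W2": W2.shape, "b2": b2.shape, "Z2": Z2.shape,
          "A2": A2.shape}
for name, shape in shapes.items():
    print(name, shape)
```

Every activation keeps `m` as its second dimension, which is what lets the same code handle one example or many.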

Conventions

We apply the sigmoid function $\sigma(z)=\frac{1}{1+e^{-z}}$ as the activation function for all layers.
$y$ is the real output/label for the data, with $y.shape=(n^{[2]},m)$.
The broadcasting convention of Python (NumPy) is applied automatically when needed (such as broadcasting $1$ in $1-y$ or broadcasting $b^{[1]}$ in $W^{[1]}A^{[0]}+b^{[1]}$).
We use:
- $\cdot$ for the matrix dot product,
- $\times$ for the product of real values,
- $*$ for the element-wise matrix product.
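To make the three products and the broadcasting rule concrete, here is a small NumPy sketch. The matrices are made-up values for illustration only:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[10.0, 20.0],
              [30.0, 40.0]])
b = np.array([[100.0],
              [200.0]])                 # shape (2, 1), like a bias vector

matrix_product = np.dot(A, B)           # the "·" of the text: [[70, 100], [150, 220]]
elementwise = A * B                     # the "*" of the text: [[10, 40], [90, 160]]
scaled = 0.5 * A                        # the "×" of the text: scalar times each entry
broadcast_sum = matrix_product + b      # b is copied across columns: [[170, 200], [350, 420]]
one_minus = 1 - A                       # the scalar 1 is broadcast: [[0, -1], [-2, -3]]
```

Note that `np.dot(A, B)` and `A * B` give different results even for square matrices of the same shape, so mixing them up will not raise an error.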

Forward Propagation

Given the input $A^{[0]} = x$ and randomly initialized parameters $W^{[1]}$, $b^{[1]}$, $W^{[2]}$, $b^{[2]}$, we can calculate $Z^{[1]}$, $A^{[1]}$, $Z^{[2]}$, $A^{[2]}$.

$\begin{align*} Z^{[1]} &= W^{[1]} \cdot A^{[0]}+b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ Z^{[2]} &= W^{[2]} \cdot A^{[1]}+b^{[2]} \\ A^{[2]} &= \sigma(Z^{[2]}) \\ \end{align*}$

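The final cost function is:

$\mathcal{J}=-\frac{1}{m}\times\sum_{i=1}^{m}[y^{(i)} \log\hat{y}^{(i)} + (1-y^{(i)}) \log(1-\hat{y}^{(i)})]$

where $y^{(i)}$ and $\hat{y}^{(i)}$ are the label and the estimate for example $i$ (both scalars when $n^{[2]} = 1$).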
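As a quick sanity check of the forward pass and the cost, consider the special case where all parameters are zero: every activation is then $\sigma(0)=0.5$ and the cost must equal $\log 2$ regardless of the labels. The sketch below assumes a two-layer network with hypothetical sizes and made-up data; the function names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_two_layer(X, W1, b1, W2, b2):
    """Forward pass of the two-layer sigmoid network."""
    A1 = sigmoid(np.dot(W1, X) + b1)
    A2 = sigmoid(np.dot(W2, A1) + b2)
    return A1, A2

def cost(A2, Y):
    """Cross-entropy cost J averaged over the m examples (columns)."""
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

# Made-up data: 2 features, 3 examples, binary labels.
X = np.array([[0.5, -1.0, 2.0],
              [1.5, 0.0, -0.5]])
Y = np.array([[1.0, 0.0, 1.0]])

# All-zero parameters, used only to obtain a value we can verify by hand.
W1, b1 = np.zeros((4, 2)), np.zeros((4, 1))
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

A1, A2 = forward_two_layer(X, W1, b1, W2, b2)
J = cost(A2, Y)
print(J, np.log(2))   # both are approximately 0.6931
```

A cost close to $\log 2$ at the start of training is therefore what one would expect when the initial weights are small.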

Backward Propagation

In order to do gradient descent optimization, we need to calculate $\frac{\partial \mathcal{J}}{\partial W^{[2]}}$, $\frac{\partial \mathcal{J}}{\partial b^{[2]}}$, $\frac{\partial \mathcal{J}}{\partial W^{[1]}}$, $\frac{\partial \mathcal{J}}{\partial b^{[1]}}$. Then we can apply them to update $W^{[1]}$, $W^{[2]}$, $b^{[1]}$, $b^{[2]}$.

chain rule

We apply the chain rule for the gradient calculation as follows:

$\begin{align*} \frac{\partial \mathcal{J}}{\partial b^{[2]}} &= \frac{\partial \mathcal{J}}{\partial A^{[2]}}\cdot\frac{\partial A^{[2]}}{\partial Z^{[2]}}\cdot\frac{\partial Z^{[2]}}{\partial b^{[2]}} \\ \frac{\partial \mathcal{J}}{\partial W^{[2]}} &= \frac{\partial \mathcal{J}}{\partial A^{[2]}}\cdot\frac{\partial A^{[2]}}{\partial Z^{[2]}}\cdot\frac{\partial Z^{[2]}}{\partial W^{[2]}} \\ \frac{\partial \mathcal{J}}{\partial b^{[1]}} &= \frac{\partial \mathcal{J}}{\partial A^{[2]}}\cdot\frac{\partial A^{[2]}}{\partial Z^{[2]}}\cdot\frac{\partial Z^{[2]}}{\partial A^{[1]}}\cdot\frac{\partial A^{[1]}}{\partial Z^{[1]}}\cdot\frac{\partial Z^{[1]}}{\partial b^{[1]}} \\ \frac{\partial \mathcal{J}}{\partial W^{[1]}} &= \frac{\partial \mathcal{J}}{\partial A^{[2]}}\cdot\frac{\partial A^{[2]}}{\partial Z^{[2]}}\cdot\frac{\partial Z^{[2]}}{\partial A^{[1]}}\cdot\frac{\partial A^{[1]}}{\partial Z^{[1]}}\cdot\frac{\partial Z^{[1]}}{\partial W^{[1]}} \\ \end{align*}$

derivative for sigmoid function and loss function

First, we derive the derivatives of the scalar activation function and loss function. We consider only one example, so that $z$ and $a$ are both real values rather than vectors:

$\begin{align*} \frac{d\sigma(z)}{dz} &= (\frac{1}{1+e^{-z}})' \\ &= -(\frac{1}{1+e^{-z}})^{2}\times(e^{-z})\times(-1) \\ &= \frac{1}{1+e^{-z}}\times\frac{(1+e^{-z})-1}{1+e^{-z}} \\ &= \sigma(z)[1-\sigma(z)] \\ \frac{d\mathcal{L}(a)}{da} &= \{-[ylog(a) + (1-y)log(1-a)]\}' \\ &=-[\frac{y}{a}+\frac{1-y}{1-a}\times(-1)] \\ &=-[\frac{y}{a}-\frac{1-y}{1-a}] \end{align*}$
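Both closed forms can be checked numerically with a central difference. The sketch below uses arbitrary test points (`z = 0.3`, `a = 0.7`, `y = 1.0` are assumptions chosen for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(a, y):
    """Per-example cross-entropy loss L(a) for label y."""
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def central_difference(f, x, h=1e-6):
    """Numerical derivative of the scalar function f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

z, a, y = 0.3, 0.7, 1.0   # arbitrary test points

sigmoid_numeric = central_difference(sigmoid, z)
sigmoid_analytic = sigmoid(z) * (1 - sigmoid(z))

loss_numeric = central_difference(lambda t: loss(t, y), a)
loss_analytic = -(y / a - (1 - y) / (1 - a))

print(sigmoid_numeric, sigmoid_analytic)
print(loss_numeric, loss_analytic)
```

The two numbers on each printed line should agree to many decimal places; a visible mismatch would point to an algebra slip in the derivation.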

matrix derivative

Now we are ready to calculate the matrix derivatives. For the matrix calculation, we apply the Einstein summation convention.

We introduce a symbol $\epsilon_{ij}$ (the Kronecker delta), with $\epsilon_{ij}= 1$ when $j=i$ and $\epsilon_{ij} = 0$ when $j\neq i$. We denote the element in row $i$, column $j$ of matrix $M$ as $M_{ij}$.

$\begin{align*} \frac{\partial \mathcal{J}}{\partial b^{[2]}_{ij}} &= \frac{\partial \mathcal{J}}{\partial A^{[2]}_{kl}}\times\frac{\partial A^{[2]}_{kl}}{\partial Z^{[2]}_{mn}}\times\frac{\partial Z^{[2]}_{mn}}{\partial b^{[2]}_{ij}} \\ &=-\frac{1}{m}(\frac{y_{kl}}{A^{[2]}_{kl}}-\frac{1-y_{kl}}{1-A^{[2]}_{kl}}) \times A^{[2]}_{kl}(1-A^{[2]}_{kl})\epsilon_{km}\epsilon_{ln} \times \epsilon_{mi}\epsilon_{nj} \\ &=-\frac{1}{m}(\frac{y_{ij}}{A^{[2]}_{ij}}-\frac{1-y_{ij}}{1-A^{[2]}_{ij}}) \times A^{[2]}_{ij}(1-A^{[2]}_{ij}) \\ &=-\frac{1}{m}[y_{ij}(1-A^{[2]}_{ij})-(1-y_{ij})A^{[2]}_{ij}] \\ &=\frac{1}{m}(A^{[2]}_{ij}-y_{ij}) \\ \end{align*}$ $\begin{align*} \frac{\partial \mathcal{J}}{\partial W^{[2]}_{ij}} &= \frac{\partial \mathcal{J}}{\partial A^{[2]}_{kl}}\times\frac{\partial A^{[2]}_{kl}}{\partial Z^{[2]}_{mn}}\times\frac{\partial Z^{[2]}_{mn}}{\partial W^{[2]}_{ij}} \\ &=-\frac{1}{m}(\frac{y_{kl}}{A^{[2]}_{kl}}-\frac{1-y_{kl}}{1-A^{[2]}_{kl}}) \times A^{[2]}_{kl}(1-A^{[2]}_{kl})\epsilon_{km}\epsilon_{ln} \times A^{[1]}_{jn}\epsilon_{mi} \\ &=-\frac{1}{m}(\frac{y_{in}}{A^{[2]}_{in}}-\frac{1-y_{in}}{1-A^{[2]}_{in}}) \times A^{[2]}_{in}(1-A^{[2]}_{in}) \times A^{[1]}_{jn} \\ &=-\frac{1}{m}[y_{in}(1-A^{[2]}_{in})-(1-y_{in})A^{[2]}_{in}] \times A^{[1]}_{jn} \\ &=\frac{1}{m}(A^{[2]}_{in}-y_{in}) \times A^{[1]}_{jn} \\ \end{align*}$ $\begin{align*} \frac{\partial \mathcal{J}}{\partial b^{[1]}_{ij}} &= \frac{\partial \mathcal{J}}{\partial A^{[2]}_{kl}}\times\frac{\partial A^{[2]}_{kl}}{\partial Z^{[2]}_{mn}} \times \frac{\partial Z^{[2]}_{mn}}{\partial A^{[1]}_{gh}} \times \frac{\partial A^{[1]}_{gh}}{\partial Z^{[1]}_{op}} \times \frac{\partial Z^{[1]}_{op}}{\partial b^{[1]}_{ij}}\\ &=-\frac{1}{m}(\frac{y_{kl}}{A^{[2]}_{kl}}-\frac{1-y_{kl}}{1-A^{[2]}_{kl}}) \times A^{[2]}_{kl}(1-A^{[2]}_{kl})\epsilon_{km}\epsilon_{ln} \times W^{[2]}_{mg}\epsilon_{nh} \times A^{[1]}_{gh}(1-A^{[1]}_{gh})\epsilon_{go}\epsilon_{hp} \times \epsilon_{io}\epsilon_{jp}\\ 
&=-\frac{1}{m}(\frac{y_{kj}}{A^{[2]}_{kj}}-\frac{1-y_{kj}}{1-A^{[2]}_{kj}}) \times A^{[2]}_{kj}(1-A^{[2]}_{kj}) \times W^{[2]}_{ki} \times A^{[1]}_{ij}(1-A^{[1]}_{ij})\\ &=-\frac{1}{m}[y_{kj}(1-A^{[2]}_{kj})-(1-y_{kj})A^{[2]}_{kj}] \times W^{[2]}_{ki} \times A^{[1]}_{ij}(1-A^{[1]}_{ij}) \\ &=\frac{1}{m}(A^{[2]}_{kj}-y_{kj}) \times W^{[2]}_{ki} \times A^{[1]}_{ij}(1-A^{[1]}_{ij}) \\ \end{align*}$ $\begin{align*} \frac{\partial \mathcal{J}}{\partial W^{[1]}_{ij}} &= \frac{\partial \mathcal{J}}{\partial A^{[2]}_{kl}}\times\frac{\partial A^{[2]}_{kl}}{\partial Z^{[2]}_{mn}} \times \frac{\partial Z^{[2]}_{mn}}{\partial A^{[1]}_{gh}} \times \frac{\partial A^{[1]}_{gh}}{\partial Z^{[1]}_{op}} \times \frac{\partial Z^{[1]}_{op}}{\partial W^{[1]}_{ij}}\\ &=-\frac{1}{m}(\frac{y_{kl}}{A^{[2]}_{kl}}-\frac{1-y_{kl}}{1-A^{[2]}_{kl}}) \times A^{[2]}_{kl}(1-A^{[2]}_{kl})\epsilon_{km}\epsilon_{ln} \times W^{[2]}_{mg}\epsilon_{nh} \times A^{[1]}_{gh}(1-A^{[1]}_{gh})\epsilon_{go}\epsilon_{hp} \times A^{[0]}_{jp}\epsilon_{io}\\ &=-\frac{1}{m}(\frac{y_{kl}}{A^{[2]}_{kl}}-\frac{1-y_{kl}}{1-A^{[2]}_{kl}}) \times A^{[2]}_{kl}(1-A^{[2]}_{kl}) \times W^{[2]}_{ki} \times A^{[1]}_{il}(1-A^{[1]}_{il}) \times A^{[0]}_{jl}\\ &=-\frac{1}{m}[y_{kl}(1-A^{[2]}_{kl})-(1-y_{kl})A^{[2]}_{kl}] \times W^{[2]}_{ki} \times A^{[1]}_{il}(1-A^{[1]}_{il}) \times A^{[0]}_{jl} \\ &=\frac{1}{m}(A^{[2]}_{kl}-y_{kl}) \times W^{[2]}_{ki} \times A^{[1]}_{il}(1-A^{[1]}_{il}) \times A^{[0]}_{jl} \\ \end{align*}$

Since we apply broadcasting, we have $b^{[l]}_{ij} = b^{[l]}_{ik}$ for any $i, j, k, l$, so we can write $b^{[l]}_i = b^{[l]}_{ij}$. All the derivatives $db^{[l]}_{ij}$ then contribute (add up) to $db^{[l]}_{i}$. In other words, $db^{[l]}_{i} = \sum_{j=1}^{m}db^{[l]}_{ij}$
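The row-sum rule for the bias can be verified on a single sigmoid layer: compute the per-entry derivatives as if each broadcast copy of the bias were independent, sum them over the columns, and compare with a finite-difference gradient taken with respect to the actual bias vector. This is a sketch with random made-up data; the one-layer setup and all sizes are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, X, Y):
    """Cross-entropy cost of a single sigmoid layer; b has shape (n, 1)."""
    A = sigmoid(np.dot(W, X) + b)        # b is broadcast over the m columns
    m = Y.shape[1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 6))
Y = (rng.random((2, 6)) > 0.5).astype(float)
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))

# Per-entry derivatives db_ij, one for every broadcast copy of the bias.
A = sigmoid(np.dot(W, X) + b)
per_entry = (A - Y) / Y.shape[1]                     # shape (2, 6)
db = np.sum(per_entry, axis=1, keepdims=True)        # shape (2, 1)

# Finite-difference gradient with respect to the real (2, 1) bias.
h = 1e-6
db_numeric = np.zeros_like(b)
for i in range(b.shape[0]):
    step = np.zeros_like(b)
    step[i, 0] = h
    db_numeric[i, 0] = (cost(W, b + step, X, Y) - cost(W, b - step, X, Y)) / (2 * h)

print(np.max(np.abs(db - db_numeric)))
```

Using `keepdims=True` keeps `db` in the same `(n, 1)` shape as `b`, so the update `b - learning_rate * db` needs no reshaping.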

Based on the Einstein summation convention, the above equations can be written as matrix equations:

$\begin{align*} \frac{\partial \mathcal{J}}{\partial W^{[2]}} &= \frac{1}{m}(A^{[2]}-Y) \cdot (A^{[1]})^T \\ \frac{\partial \mathcal{J}}{\partial b^{[2]}} &= \frac{1}{m}np.sum(A^{[2]}-Y, axis=1, keepdims=True)\\ \frac{\partial \mathcal{J}}{\partial W^{[1]}} &= \frac{1}{m}\{[(W^{[2]})^T \cdot (A^{[2]}-Y)] * [A^{[1]} * (1-A^{[1]})]\} \cdot (A^{[0]})^T\\ \frac{\partial \mathcal{J}}{\partial b^{[1]}} &= \frac{1}{m}np.sum([(W^{[2]})^T \cdot (A^{[2]}-Y)] * [A^{[1]} * (1-A^{[1]})], axis=1, keepdims=True) \\ \end{align*}$

python code

For convenience in naming the Python variables, we denote:

$\begin{align*} W2 &= W^{[2]} \\ b2 &= b^{[2]} \\ W1 &= W^{[1]} \\ b1 &= b^{[1]} \\ dW2 &= \frac{\partial \mathcal{J}}{\partial W^{[2]}} \\ db2 &= \frac{\partial \mathcal{J}}{\partial b^{[2]}} \\ dW1 &= \frac{\partial \mathcal{J}}{\partial W^{[1]}} \\ db1 &= \frac{\partial \mathcal{J}}{\partial b^{[1]}} \\ \end{align*}$

Then the Python code to compute them is:

db2 = 1.0/m * np.sum(A2 - Y, axis=1, keepdims=True)
dW2 = 1.0/m * np.dot(A2 - Y, A1.T)
db1 = 1.0/m * np.sum(np.dot(W2.T, A2 - Y) * (A1 * (1 - A1)), axis=1, keepdims=True)
dW1 = 1.0/m * np.dot(np.dot(W2.T, A2 - Y) * (A1 * (1 - A1)), X.T)
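A common way to gain confidence in hand-derived gradients is a numerical gradient check: perturb each parameter entry by a small `h`, recompute the cost, and compare the finite difference with the analytic value. The sketch below does this for the four gradient formulas of the two-layer network; the helper `numeric_grad`, the random data and all sizes are assumptions made for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W1, b1, W2, b2, X, Y):
    """Cross-entropy cost J of the two-layer sigmoid network."""
    A1 = sigmoid(np.dot(W1, X) + b1)
    A2 = sigmoid(np.dot(W2, A1) + b2)
    m = Y.shape[1]
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m

def numeric_grad(f, P, h=1e-6):
    """Central-difference gradient of the scalar f() with respect to array P.

    P is perturbed in place and restored, so f must read P by reference.
    """
    grad = np.zeros_like(P)
    for idx in np.ndindex(P.shape):
        old = P[idx]
        P[idx] = old + h
        plus = f()
        P[idx] = old - h
        minus = f()
        P[idx] = old
        grad[idx] = (plus - minus) / (2 * h)
    return grad

# Small random problem; all sizes are arbitrary illustration values.
rng = np.random.default_rng(1)
m = 5
X = rng.standard_normal((3, m))
Y = (rng.random((1, m)) > 0.5).astype(float)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

# Analytic gradients from the matrix formulas.
A1 = sigmoid(np.dot(W1, X) + b1)
A2 = sigmoid(np.dot(W2, A1) + b2)
db2 = 1.0/m * np.sum(A2 - Y, axis=1, keepdims=True)
dW2 = 1.0/m * np.dot(A2 - Y, A1.T)
db1 = 1.0/m * np.sum(np.dot(W2.T, A2 - Y) * (A1 * (1 - A1)), axis=1, keepdims=True)
dW1 = 1.0/m * np.dot(np.dot(W2.T, A2 - Y) * (A1 * (1 - A1)), X.T)

f = lambda: cost(W1, b1, W2, b2, X, Y)
analytic = {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}
params = {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
max_error = max(np.max(np.abs(analytic[k] - numeric_grad(f, params[k])))
                for k in params)
print("largest analytic/numeric difference:", max_error)
```

The check is far too slow for training, since it needs two forward passes per parameter entry, but it is useful once after deriving or changing the formulas.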

cache for computational benefit

The code above contains many repeated calculations, such as $A^{[2]} - Y$, which appears everywhere. We can cache such frequently used values to save computation. In fact, from the equations above it is easy to see that $\frac{\partial \mathcal{J}}{\partial Z^{[1]}}$ and $\frac{\partial \mathcal{J}}{\partial Z^{[2]}}$ are the quantities worth caching, because:

$\begin{align*} \frac{\partial \mathcal{J}}{\partial W^{[2]}} &= \frac{\partial \mathcal{J}}{\partial Z^{[2]}} \cdot \frac{\partial Z^{[2]}}{\partial W^{[2]}} \\ \frac{\partial \mathcal{J}}{\partial b^{[2]}} &= \frac{\partial \mathcal{J}}{\partial Z^{[2]}} \cdot \frac{\partial Z^{[2]}}{\partial b^{[2]}} \\ \frac{\partial \mathcal{J}}{\partial W^{[1]}} &= \frac{\partial \mathcal{J}}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[1]}}{\partial W^{[1]}} \\ \frac{\partial \mathcal{J}}{\partial b^{[1]}} &= \frac{\partial \mathcal{J}}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[1]}}{\partial b^{[1]}} \\ \end{align*}$

It is easy to see that:

$\begin{align*} \frac{\partial \mathcal{J}}{\partial Z^{[2]}} &= \frac{1}{m}(A^{[2]}-Y) \\ \frac{\partial \mathcal{J}}{\partial Z^{[1]}} &= \frac{1}{m}[(W^{[2]})^T \cdot (A^{[2]}-Y)] * [A^{[1]} * (1-A^{[1]})] \\ &= [(W^{[2]})^T \cdot \frac{\partial \mathcal{J}}{\partial Z^{[2]}}] * [A^{[1]} * (1-A^{[1]})] \\ \end{align*}$

So:

$\begin{align*} \frac{\partial \mathcal{J}}{\partial b^{[2]}} &= np.sum(\frac{\partial \mathcal{J}}{\partial Z^{[2]}}, axis=1, keepdims=True)\\ \frac{\partial \mathcal{J}}{\partial W^{[2]}} &= \frac{\partial \mathcal{J}}{\partial Z^{[2]}}\cdot (A^{[1]})^T \\ \frac{\partial \mathcal{J}}{\partial b^{[1]}} &= np.sum(\frac{\partial \mathcal{J}}{\partial Z^{[1]}}, axis=1, keepdims=True) \\ \frac{\partial \mathcal{J}}{\partial W^{[1]}} &= \frac{\partial \mathcal{J}}{\partial Z^{[1]}} \cdot (A^{[0]})^T\\ \end{align*}$

So the Python code with caching is:

dZ2 = (1.0/m) * (A2 - Y)
db2 = np.sum(dZ2, axis=1, keepdims=True)
dW2 = np.dot(dZ2, A1.T)
dZ1 = np.dot(W2.T, dZ2) * (A1 * (1 - A1))
db1 = np.sum(dZ1, axis=1, keepdims=True)
dW1 = np.dot(dZ1, X.T)
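To confirm that caching changes only the amount of work and not the result, the direct and cached formulas can be evaluated side by side. In this sketch the activations `A1` and `A2` are random stand-ins rather than the output of a real forward pass, which is enough for comparing the two sets of expressions:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 8
X = rng.standard_normal((3, m))
Y = (rng.random((1, m)) > 0.5).astype(float)
W2 = rng.standard_normal((1, 4))
# Stand-ins for forward-pass activations; any values in (0, 1) will do here.
A1 = rng.random((4, m))
A2 = rng.random((1, m))

# Direct formulas: A2 - Y and the W2.T product are recomputed each time.
db2_direct = 1.0/m * np.sum(A2 - Y, axis=1, keepdims=True)
dW2_direct = 1.0/m * np.dot(A2 - Y, A1.T)
db1_direct = 1.0/m * np.sum(np.dot(W2.T, A2 - Y) * (A1 * (1 - A1)), axis=1, keepdims=True)
dW1_direct = 1.0/m * np.dot(np.dot(W2.T, A2 - Y) * (A1 * (1 - A1)), X.T)

# Cached formulas: dZ2 and dZ1 are computed once and reused.
dZ2 = (1.0/m) * (A2 - Y)
db2 = np.sum(dZ2, axis=1, keepdims=True)
dW2 = np.dot(dZ2, A1.T)
dZ1 = np.dot(W2.T, dZ2) * (A1 * (1 - A1))
db1 = np.sum(dZ1, axis=1, keepdims=True)
dW1 = np.dot(dZ1, X.T)

same = all(np.allclose(direct, cached) for direct, cached in [
    (db2_direct, db2), (dW2_direct, dW2), (db1_direct, db1), (dW1_direct, dW1)])
print("cached and direct gradients agree:", same)
```

The cached version performs the `W2.T` product once instead of twice, and the saving grows with the number of layers because each `dZ` feeds the one below it.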

Extend layer number and hidden unit number

unit number

In the derivation above, we did not actually limit (hard-code) the number of units in any layer; we only used $n^{[l]}$ to represent the number of units in layer $l$, even for the input layer $X$ and the output layer $Y$. So you can choose any number of units you like, and the equations still apply.

Note that if you choose more than one unit for the output layer $Y$, the model can be a multi-class classifier rather than a binary one. In that case you may need to redefine the cost function and loss function (the original ones may not work well). If you extend the output layer and redefine the loss function $\mathcal{L}$, remember to replace $\frac{\partial \mathcal{J}}{\partial Z^{[n]}}$ accordingly in the equations; everything else then carries over.

layer number

In the derivation above, we limited the number of layers to 2 (excluding the input layer). But we can do the same calculation for a 3-, 4- or $n$-layer system. It is not hard to get the formulas for an $n$-layer network by mathematical induction. In the following, we denote $A^{[0]}=X$, $A^{[n]}=\hat Y$, and $Y$ the real labels from the data. We assume only one unit in the output layer, so the original $\mathcal{L}$ and $\mathcal{J}$ apply.

formula

forward propagation:

$\begin{align*} Z^{[1]} &= W^{[1]} \cdot A^{[0]}+b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ Z^{[2]} &= W^{[2]} \cdot A^{[1]}+b^{[2]} \\ A^{[2]} &= \sigma(Z^{[2]}) \\ ... \\ Z^{[n]} &= W^{[n]} \cdot A^{[n-1]}+b^{[n]} \\ A^{[n]} &= \sigma(Z^{[n]}) \end{align*}$

The final cost function is:

$\mathcal{J}=-\frac{1}{m}\times[Y \cdot (\log A^{[n]})^T + (1-Y) \cdot (\log(1-A^{[n]}))^T]$

backward propagation:

$\begin{align*} \frac{\partial \mathcal{J}}{\partial Z^{[n]}} &= \frac{1}{m}(A^{[n]}-Y) \\ \frac{\partial \mathcal{J}}{\partial b^{[n]}} &= np.sum(\frac{\partial \mathcal{J}}{\partial Z^{[n]}}, axis=1, keepdims=True)\\ \frac{\partial \mathcal{J}}{\partial W^{[n]}} &= \frac{\partial \mathcal{J}}{\partial Z^{[n]}}\cdot (A^{[n-1]})^T \\ \frac{\partial \mathcal{J}}{\partial Z^{[n-1]}} &= [(W^{[n]})^T \cdot \frac{\partial \mathcal{J}}{\partial Z^{[n]}}] * [A^{[n-1]} * (1-A^{[n-1]})] \\ \frac{\partial \mathcal{J}}{\partial b^{[n-1]}} &= np.sum(\frac{\partial \mathcal{J}}{\partial Z^{[n-1]}}, axis=1, keepdims=True)\\ \frac{\partial \mathcal{J}}{\partial W^{[n-1]}} &= \frac{\partial \mathcal{J}}{\partial Z^{[n-1]}}\cdot (A^{[n-2]})^T \\ ... \\ \frac{\partial \mathcal{J}}{\partial Z^{[k]}} &= [(W^{[k+1]})^T \cdot \frac{\partial \mathcal{J}}{\partial Z^{[k+1]}}] * [A^{[k]} * (1-A^{[k]})] \\ \frac{\partial \mathcal{J}}{\partial b^{[k]}} &= np.sum(\frac{\partial \mathcal{J}}{\partial Z^{[k]}}, axis=1, keepdims=True)\\ \frac{\partial \mathcal{J}}{\partial W^{[k]}} &= \frac{\partial \mathcal{J}}{\partial Z^{[k]}}\cdot (A^{[k-1]})^T \\ ... \\ \frac{\partial \mathcal{J}}{\partial Z^{[1]}} &= [(W^{[2]})^T \cdot \frac{\partial \mathcal{J}}{\partial Z^{[2]}}] * [A^{[1]} * (1-A^{[1]})] \\ \frac{\partial \mathcal{J}}{\partial b^{[1]}} &= np.sum(\frac{\partial \mathcal{J}}{\partial Z^{[1]}}, axis=1, keepdims=True) \\ \frac{\partial \mathcal{J}}{\partial W^{[1]}} &= \frac{\partial \mathcal{J}}{\partial Z^{[1]}} \cdot (A^{[0]})^T\\ \end{align*}$

python code

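We assume the parameters are provided as lists W=[0, W1, W2, …, Wn] and b=[0, b1, b2, …, bn], where the leading 0 is a placeholder so that W[i] and b[i] belong to layer i.

import packages: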

import numpy as np
import matplotlib.pyplot as plt

# IPython/Jupyter magic; remove this line when running as a plain script
%matplotlib inline
np.random.seed(1) # set a seed so that the results are consistent

forward propagation:

# sigmoid function
def sigmoid(x):
    return 1/(1+np.exp(-x))
# forward to get A
def forward(X, b, W, layer_number): 
    A0 = X
    n = layer_number
    A = [A0]
    for i in range(1, n+1): 
        A.append(sigmoid(np.dot(W[i], A[i-1]) + b[i])) # A[i] = sigmoid(W[i] . A[i-1] + b[i])
    return A # A = [A0, A1, A2 ... An]
# calculate cost based on A
def cal_cost(An, Y):
    m = Y.shape[1] # number of examples
    cost = -(1.0)/m * np.sum(np.multiply(np.log(An), Y) + np.multiply((1 - Y), np.log(1 - An)))
    cost = np.squeeze(cost)     # makes sure cost is the dimension we expect. like [[17]] into 17
    return cost

backward propagation:

# backward to get db, dW
def backward(W, A, Y, layer_number):
    dZ = None #for keeping the latest dZi
    db = [0]
    dW = [0]
    n = layer_number
    m = Y.shape[1]
    for i in range(n, 0, -1):
        if i == n:
            dZ = 1.0/m * (A[i] - Y) # since A=[A0, A1, ..An], so An = A[n]
        else:
            dZ = np.dot(W[i+1].T, dZ) * (A[i] * (1 - A[i])) # giving W=[0, W1, W2, ... Wn], so Wn = W[n]
        db.insert(1, np.sum(dZ, axis=1, keepdims=True)) # db=[0, dbi, dbi+1, ..., dbn]
        dW.insert(1, np.dot(dZ, A[i-1].T)) # dW=[0, dWi, dWi+1, ..., dWn]
        
    return db, dW # now db=[0, db1, db2, ..., dbn], dW=[0, dW1, dW2, ..., dWn]
# update b and W
def update_params(db, dW, b, W, learning_rate, layer_number):
    n = layer_number
    for i in range(1, n+1):
        b[i] = b[i] - learning_rate * db[i]
        W[i] = W[i] - learning_rate * dW[i]
    return b, W

build model:

# initialize b and W
def inital_params(X, Y, hidden_layer_units):
    np.random.seed(2)
    b = [0]
    W = [0]
    n_last = X.shape[0]
    
    # for all hidden layer, saying n[1], n[2], ... n[l-1]
    for unit_number in hidden_layer_units: # hidden_layer_units = [n[1], n[2], ..., n[l-1]]
        n_current = unit_number      
        b.append(np.zeros((n_current, 1)))
        W.append(np.random.randn(n_current, n_last)*0.01)
        n_last = n_current
    
    # for last output layer n[l]
    b.append(np.zeros((Y.shape[0], 1)))
    W.append(np.random.randn(Y.shape[0], n_last)*0.01)
    
    return b, W

# build model by put all together
def model(X, Y, hidden_layer_units=[4], learning_rate=0.005, iterate_times=1000, print_cost=True):
    layer_number = len(hidden_layer_units)+1
    b, W = inital_params(X, Y, hidden_layer_units)
    costs = []

    for i in range(iterate_times):
        A = forward(X, b, W, layer_number)   
        db, dW = backward(W, A, Y, layer_number)
        b, W = update_params(db, dW, b, W, learning_rate, layer_number)
        if print_cost and i%1000 == 0:
            cost = cal_cost(A[layer_number], Y)
            costs.append(cost)
            print("cost after " + str(i) + " iterations: " + str(cost))
            
    model_paras = {'b':b,
                   'W':W,
                   'costs':costs}
            
    return model_paras
# predict Y based on model, Y is either 0 or 1
def predict(X, b, W):
    layer_number = len(b)-1
    A = forward(X, b, W, layer_number)
    Y = np.round(A[-1])
    return Y
# evaluate model: accuracy in percent
def evaluate(X, Y, b, W):    
    predictions = predict(X, b, W)
    # share of entries where the prediction equals the label
    accuracy = float(np.mean(predictions == Y) * 100)
    return accuracy

Case Application

The model we built is suitable for any binary classification problem, and its structure is configurable through the hidden_layer_units parameter. Assuming we can get training data from a load_dataset() function, the usual way to apply our model is:

# set the parameters for learning; these are the hyperparameters you can tune.
hidden_layer_units = [4]
learning_rate = 1.2
iterate_times = 10000

# load training data
X, Y = load_dataset()

# build model
model_paras = model(X, Y, hidden_layer_units, learning_rate, iterate_times)
b = model_paras['b']
W = model_paras['W']

# evaluate model
accuracy = evaluate(X, Y, b, W)
print ('Accuracy: %d ' % accuracy + '%')
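The load_dataset() function used above is assumed, not defined in this post. For experimenting, a hypothetical stand-in that returns data in the expected layout (X of shape (n_features, m), Y of shape (1, m) with 0/1 labels) could look like the sketch below. The two-blob data, the default size and the seed are all made up for illustration:

```python
import numpy as np

def load_dataset(m=400, seed=0):
    """Hypothetical stand-in: two Gaussian blobs in 2-D, one per class."""
    rng = np.random.default_rng(seed)
    half = m // 2
    # Class 0 is centered at (-2, -2), class 1 at (2, 2).
    class0 = rng.standard_normal((2, half)) + np.array([[-2.0], [-2.0]])
    class1 = rng.standard_normal((2, m - half)) + np.array([[2.0], [2.0]])
    X = np.hstack([class0, class1])                               # shape (2, m)
    Y = np.hstack([np.zeros((1, half)), np.ones((1, m - half))])  # shape (1, m)
    return X, Y

X, Y = load_dataset()
print(X.shape, Y.shape)
```

The examples are not shuffled here, which does not matter for full-batch gradient descent because every update uses all columns at once.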

Other framework

You can easily implement an equivalent model with a framework. Two such frameworks are TensorFlow and Keras.

tensorflow

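The key functions for implementing a comparable model in TensorFlow (1.x API) look like the following. Note that this example uses a three-layer network with ReLU activations and a softmax cross-entropy cost, rather than the two-layer sigmoid network above. As the code shows, you only need to write the forward part; TensorFlow takes care of the backward part automatically, because it records the operations in a computation graph and derives the gradients from it.

import tensorflow as tf  # TensorFlow 1.x API

def initialize_parameters():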
    
        tf.set_random_seed(1)                   # so that your "random" numbers match ours
        
        W1 = tf.get_variable("W1", [25, 12288], initializer=tf.contrib.layers.xavier_initializer(seed=1))
        b1 = tf.get_variable("b1", [25, 1], initializer=tf.zeros_initializer())
        W2 = tf.get_variable("W2", [12, 25], initializer=tf.contrib.layers.xavier_initializer(seed=1))
        b2 = tf.get_variable("b2", [12, 1], initializer=tf.zeros_initializer())
        W3 = tf.get_variable("W3", [6, 12], initializer=tf.contrib.layers.xavier_initializer(seed=1))
        b3 = tf.get_variable("b3", [6, 1], initializer=tf.zeros_initializer())

        parameters = {"W1": W1,"b1": b1,"W2": W2,"b2": b2,"W3": W3,"b3": b3}
    
        return parameters

def forward_propagation(X, parameters):
        # Retrieve the parameters from the dictionary "parameters" 
        W1 = parameters['W1']
        b1 = parameters['b1']
        W2 = parameters['W2']
        b2 = parameters['b2']
        W3 = parameters['W3']
        b3 = parameters['b3']
    
        # Forward pass: linear -> relu -> linear -> relu -> linear   # Numpy Equivalents:
        Z1 = tf.add(tf.matmul(W1, X), b1)                                              # Z1 = np.dot(W1, X) + b1
        A1 = tf.nn.relu(Z1)                                              # A1 = relu(Z1)
        Z2 = tf.add(tf.matmul(W2, A1), b2)                                              # Z2 = np.dot(W2, A1) + b2
        A2 = tf.nn.relu(Z2)                                              # A2 = relu(Z2)
        Z3 = tf.add(tf.matmul(W3, A2), b3)                                              # Z3 = np.dot(W3, A2) + b3
        
        return Z3

def compute_cost(Z3, Y):
        # to fit the tensorflow requirement for tf.nn.softmax_cross_entropy_with_logits(...,...)
        logits = tf.transpose(Z3)
        labels = tf.transpose(Y)
    
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
    
        return cost

def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001,
          num_epochs = 1500, minibatch_size = 32, print_cost = True):    
    
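        # Create placeholders: X of shape (n_x, None), Y of shape (n_y, None)
        n_x, n_y = X_train.shape[0], Y_train.shape[0]
        X = tf.placeholder(tf.float32, shape=[n_x, None], name="X")
        Y = tf.placeholder(tf.float32, shape=[n_y, None], name="Y")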

        # Initialize parameters
        parameters = initialize_parameters()
    
        # Forward propagation: Build the forward propagation in the tensorflow graph
        Z3 = forward_propagation(X, parameters)
    
        # Cost function: Add cost function to tensorflow graph
        cost = compute_cost(Z3, Y)
    
        # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer.
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
    
        # Initialize all the variables
        init = tf.global_variables_initializer()

        # Start the session to compute the tensorflow graph
        with tf.Session() as sess:        
            # Run the initialization
            sess.run(init)
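            # Full-batch training loop (minibatch_size is not used in this sketch)
            for epoch in range(num_epochs):
                _, epoch_cost = sess.run([optimizer, cost], feed_dict={X: X_train, Y: Y_train})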
            
                
            # lets save the parameters in a variable
            parameters = sess.run(parameters)
            print ("Parameters have been trained!")
        
        return parameters

keras

Keras is a higher-level framework than TensorFlow; it provides many more convenience functions for implementing CNNs and RNNs. Basically, you can build a model with only a few lines of code, as it takes care of both the forward and backward parts for common models. Its basic building block is the layer; you can stack the provided layers to assemble your model quickly.

The key code for building the same model we did before is:

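from keras.layers import Input, Dense
from keras.models import Model

# input_shape is the per-example feature shape, e.g. (n_x,);
# Keras expects examples along the first axis, i.e. X_train of shape (m, n_x)
X_input = Input(input_shape)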
X_hidden = Dense(4, activation='sigmoid', name='fullycon1')(X_input)
Y = Dense(1, activation='sigmoid', name='fullycon2')(X_hidden)
model = Model(inputs=X_input, outputs=Y, name='fullycon_model')

# binary_crossentropy matches the cross-entropy cost J derived above
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=["accuracy"])
model.fit(x=X_train, y=Y_train, epochs=10, batch_size=100)

Questions:

History:

2019-02-06: create post and draft based on my notebook.