Table of Contents
Model description
Let’s say we have a simple feed-forward neural network with 4 layers:
Initially all our 35 weights and biases are set randomly. In our example their typical values are somewhere between $-3.0 … +3.0$.
We also have some training data like:
- for input $a_0^{I} = 0.3, a_1^{I} = 0.5$ we expect output $y_0 = 1.0, y_1 = 0.0$
- for input $a_0^{I} = 0.2, a_1^{I} = 0.7$ we expect output $y_0 = 0.0, y_1 = 1.0$
- and so on
Our aim is to find such weights and biases that the cost function over our training data is as low as possible. This minimization of the cost function is what is called training the neural network.
For simplicity, let’s say we only optimize for one exact input-output pair. In reality we optimize over a batch of input-output pairs.
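As a sanity check on the number 35: each neuron has one weight per input plus one bias, so the parameter count follows directly from the 2-4-3-2 layer sizes. A minimal sketch:

```python
# Layer sizes of the example network: input, hidden-1, hidden-2, output
sizes = [2, 4, 3, 2]

# Each layer contributes n_out * n_in weights plus n_out biases
n_params = sum(n_out * (n_in + 1) for n_in, n_out in zip(sizes, sizes[1:]))
print(n_params)  # 12 + 15 + 8 = 35
```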
Forward propagation
At this stage we have all 35 weights and biases set (randomly, if it’s our 1st pass). Our aim is to see what our NN will output on the 2 neurons of its output layer for given values of the input neurons.
Input Layer: 2 neurons are set, for example:
$\textcolor{red}{a_0^{I}} = 0.3, \textcolor{red}{a_1^{I}} = 0.5$
Hidden-1 Layer: 4 neurons are set
$\textcolor{grey}{z_0^{H1}} = (\textcolor{red}{a_0^{I}}w_{00}^{H1} + \textcolor{red}{a_1^{I}}w_{01}^{H1}) + b_0^{H1}$
$\textcolor{grey}{z_1^{H1}} = (\textcolor{red}{a_0^{I}}w_{10}^{H1} + \textcolor{red}{a_1^{I}}w_{11}^{H1}) + b_1^{H1}$
$\textcolor{grey}{z_2^{H1}} = (\textcolor{red}{a_0^{I}}w_{20}^{H1} + \textcolor{red}{a_1^{I}}w_{21}^{H1}) + b_2^{H1}$
$\textcolor{grey}{z_3^{H1}} = (\textcolor{red}{a_0^{I}}w_{30}^{H1} + \textcolor{red}{a_1^{I}}w_{31}^{H1}) + b_3^{H1}$

And then activated:
$\textcolor{teal}{a_0^{H1}} = Sigmoid(\textcolor{grey}{z_0^{H1}})$
$\textcolor{teal}{a_1^{H1}} = Sigmoid(\textcolor{grey}{z_1^{H1}})$
$\textcolor{teal}{a_2^{H1}} = Sigmoid(\textcolor{grey}{z_2^{H1}})$
$\textcolor{teal}{a_3^{H1}} = Sigmoid(\textcolor{grey}{z_3^{H1}})$
Hidden-2 Layer: 3 neurons are set
$\textcolor{magenta}{z_0^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{00}^{H2} + \textcolor{teal}{a_1^{H1}}w_{01}^{H2} + \textcolor{teal}{a_2^{H1}}w_{02}^{H2} + \textcolor{teal}{a_3^{H1}}w_{03}^{H2}) + b_0^{H2}$
$\textcolor{magenta}{z_1^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{10}^{H2} + \textcolor{teal}{a_1^{H1}}w_{11}^{H2} + \textcolor{teal}{a_2^{H1}}w_{12}^{H2} + \textcolor{teal}{a_3^{H1}}w_{13}^{H2}) + b_1^{H2}$
$\textcolor{magenta}{z_2^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{20}^{H2} + \textcolor{teal}{a_1^{H1}}w_{21}^{H2} + \textcolor{teal}{a_2^{H1}}w_{22}^{H2} + \textcolor{teal}{a_3^{H1}}w_{23}^{H2}) + b_2^{H2}$

And then activated:
$\textcolor{goldenrod}{a_0^{H2}} = Sigmoid(\textcolor{magenta}{z_0^{H2}})$
$\textcolor{goldenrod}{a_1^{H2}} = Sigmoid(\textcolor{magenta}{z_1^{H2}})$
$\textcolor{goldenrod}{a_2^{H2}} = Sigmoid(\textcolor{magenta}{z_2^{H2}})$
Output Layer: 2 neurons are set
$\textcolor{lime}{z_0^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{00}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{01}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{02}^{O}) + b_0^{O}$
$\textcolor{lime}{z_1^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{10}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{11}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{12}^{O}) + b_1^{O}$

And then activated:
$\textcolor{blue}{a_0^{O}} = Sigmoid(\textcolor{lime}{z_0^{O}})$
$\textcolor{blue}{a_1^{O}} = Sigmoid(\textcolor{lime}{z_1^{O}})$
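The whole forward pass above is the same layer rule applied three times. A minimal sketch in plain Python; the weight and bias values are made up for illustration (`W[i][j]` corresponds to $w_{ij}$):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(a_prev, W, b):
    # z_i = sum_j w_ij * a_j + b_i, then a_i = Sigmoid(z_i)
    return [sigmoid(sum(w * a for w, a in zip(row, a_prev)) + b_i)
            for row, b_i in zip(W, b)]

# Made-up weights/biases for the 2-4-3-2 network (illustrative values only)
W_h1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.7, -0.1]]
b_h1 = [0.1, -0.3, 0.2, 0.0]
W_h2 = [[0.2, -0.1, 0.5, 0.3], [-0.4, 0.1, 0.2, -0.2], [0.3, 0.6, -0.1, 0.4]]
b_h2 = [0.0, 0.1, -0.2]
W_o = [[0.5, -0.3, 0.2], [-0.1, 0.4, 0.6]]
b_o = [0.1, -0.1]

a_in = [0.3, 0.5]                # input layer
a_h1 = layer(a_in, W_h1, b_h1)   # hidden-1: 4 activations
a_h2 = layer(a_h1, W_h2, b_h2)   # hidden-2: 3 activations
a_out = layer(a_h2, W_o, b_o)    # output: 2 activations
print(a_out)
```

Because every activation goes through the sigmoid, all intermediate and output values land strictly between 0 and 1.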
So, let’s say that with our current weights and biases we got: $a_0^{O} = 0.8, a_1^{O} = 0.3$
And we know that for our input ($a_0^{I} = 0.3, a_1^{I} = 0.5$) we were ideally expecting $y_0 = 1.0, y_1 = 0.0$
The cost function measures how far we are from the desired result. For the previous example:
$C = (\textcolor{blue}{a_0^{O}} - y_0)^2 + (\textcolor{blue}{a_1^{O}} - y_1)^2 = (0.8 - 1.0)^2 + (0.3 - 0.0)^2 = 0.13$
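The cost computation above, as a couple of lines of Python (the values 0.8/0.3 and the targets come from the running example):

```python
def cost(a_out, y):
    # Sum of squared differences between actual and desired outputs
    return sum((a - t) ** 2 for a, t in zip(a_out, y))

C = cost([0.8, 0.3], [1.0, 0.0])
print(C)  # 0.13, up to floating-point rounding
```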
Backward propagation
Math theory
Our task is now to change weights and biases such that cost function $C$ will be smaller than $0.13$.
As a mathematical function, $C$ is a function of 35 variables (the weights and biases of the neural network).
So now we need to find a minimum (global, or at least local) of our mathematical function of 35 variables. How do we do it? We find all 35 partial derivatives of $C$ (which are $∂C/∂w_{ij}$ and $∂C/∂b_k$). Those 35 partial derivatives form the gradient, a vector in the 35-dimensional parameter space of $C$. The direction along the gradient is a good candidate for a direction towards some reasonable minimum of $C$.
Once we have this direction, we update all our weights and biases as (example for $w_{21}$):
$w_{21}^{H1_{upd}} = w_{21}^{H1_{org}} + r*∂C/∂w_{21}^{H1_{org}}$
- $r$ is the learning rate — mathematically it is a scaling coefficient for the gradient vector
- make it too small — and you’ll need more training steps, plus you won’t be able to get out of local minima
- make it too big — and you will be overshooting your minimum back and forth
- $r$ is taken to be $< 0$ (usually of a value around $-0.01$), because we search for a minimum, not a maximum
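The update rule can be demonstrated on a toy one-dimensional cost $C(w) = (w-2)^2$: repeatedly adding $r \cdot dC/dw$ with a negative $r$ walks $w$ toward the minimum at $w = 2$. This is a sketch of the idea only, not the full 35-parameter update:

```python
def dC_dw(w):
    # Derivative of the toy cost C(w) = (w - 2)**2
    return 2.0 * (w - 2.0)

r = -0.1   # negative learning rate, following the convention above
w = 0.0    # arbitrary starting point
for _ in range(100):
    w = w + r * dC_dw(w)   # w_upd = w_org + r * dC/dw
print(w)  # close to 2.0, the minimum of the toy cost
```

With $|r|$ larger than 1 for this cost, the iterates would diverge: that is the "overshooting back and forth" failure mode.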
The question is now how to efficiently find those 35 partial derivatives.
Deriving backward propagation formulas
Backpropagation starts at output layer and finishes at input layer.
We will further need the sigmoid derivative.
The sigmoid function $y(x) = 1/(1+e^{-x})$ has the derivative $dy/dx = y(1-y)$.
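The identity $dy/dx = y(1-y)$ is easy to check numerically against a finite-difference derivative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.7
y = sigmoid(x)
analytic = y * (1.0 - y)                # the claimed derivative y*(1-y)

eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)
print(abs(analytic - numeric))          # tiny: the two expressions agree
```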
Output layer
To find the derivatives for the output layer we will need the following formulas (given above):
$C = (\textcolor{blue}{a_0^{O}} - y_0)^2 + (\textcolor{blue}{a_1^{O}} - y_1)^2$
Output layer:
$\textcolor{blue}{a_0^{O}} = Sigmoid(\textcolor{lime}{z_0^{O}})$
$\textcolor{blue}{a_1^{O}} = Sigmoid(\textcolor{lime}{z_1^{O}})$

$\textcolor{lime}{z_0^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{00}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{01}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{02}^{O}) + b_0^{O}$
$\textcolor{lime}{z_1^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{10}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{11}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{12}^{O}) + b_1^{O}$
Derivation process:
Step 1: Let’s trace how one particular derivative ($∂C/∂w_{00}^O$) is found:
There is only one path from $C$ to $w_{00}^O$:
- $C → \textcolor{blue}{a_0^{O}} → \textcolor{lime}{z_0^{O}} → w_{00}^O$
Following this path we get:
$$ \frac{∂C}{∂w_{00}^O} = \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \frac{\textcolor{lime}{∂z_0^O}}{∂w_{00}^O} = \underbrace{2(\textcolor{blue}{a_0^O} - y_0)}_{\text{by cost function}} * \underbrace{\textcolor{blue}{a_0^O} (1 - \textcolor{blue}{a_0^O})}_{\text{by activation}} * \underbrace{\textcolor{goldenrod}{a_0^{H2}} }_{\text{by layer}} $$

Notice that the final derivatives are only true for our particular choice of cost, activation and layer types.
Step 2: repeat the same logic for all other weights:
$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{00}^O} \\ \frac{∂C}{∂w_{01}^O} \\ \frac{∂C}{∂w_{02}^O} \\ \frac{∂C}{∂b_0^O} \end{pmatrix} = \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \begin{pmatrix} \frac{\textcolor{lime}{∂z_0^O}}{∂w_{00}^O} \\ \frac{\textcolor{lime}{∂z_0^O}}{∂w_{01}^O} \\ \frac{\textcolor{lime}{∂z_0^O}}{∂w_{02}^O} \\ \frac{\textcolor{lime}{∂z_0^O}}{∂b_0^O} \end{pmatrix} = 2(\textcolor{blue}{a_0^O} - y_0) * \textcolor{blue}{a_0^O} (1 - \textcolor{blue}{a_0^O}) * \begin{pmatrix} \textcolor{goldenrod}{a_0^{H2}} \\ \textcolor{goldenrod}{a_1^{H2}} \\ \textcolor{goldenrod}{a_2^{H2}} \\ 1 \end{pmatrix} $$
$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{10}^O} \\ \frac{∂C}{∂w_{11}^O} \\ \frac{∂C}{∂w_{12}^O} \\ \frac{∂C}{∂b_1^O} \end{pmatrix} = \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} * \begin{pmatrix} \frac{\textcolor{lime}{∂z_1^O}}{∂w_{10}^O} \\ \frac{\textcolor{lime}{∂z_1^O}}{∂w_{11}^O} \\ \frac{\textcolor{lime}{∂z_1^O}}{∂w_{12}^O} \\ \frac{\textcolor{lime}{∂z_1^O}}{∂b_1^O} \end{pmatrix} = 2(\textcolor{blue}{a_1^O} - y_1) * \textcolor{blue}{a_1^O} (1 - \textcolor{blue}{a_1^O}) * \begin{pmatrix} \textcolor{goldenrod}{a_0^{H2}} \\ \textcolor{goldenrod}{a_1^{H2}} \\ \textcolor{goldenrod}{a_2^{H2}} \\ 1 \end{pmatrix} $$
Step 3: above can be condensed in nice matrix formulas.
General formula (it doesn’t depend on exact cost/activation/layers type):
$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^O} & \frac{∂C}{∂w_{01}^O} & \frac{∂C}{∂w_{02}^O} & \frac{∂C}{∂b_0^O} \\ \frac{∂C}{∂w_{10}^O} & \frac{∂C}{∂w_{11}^O} & \frac{∂C}{∂w_{12}^O} & \frac{∂C}{∂b_1^O} \\ \end{vmatrix} = \begin{vmatrix} \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} & 0 \\ 0 & \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} \end{vmatrix} * \begin{vmatrix} \frac{\textcolor{lime}{∂z_0^O}}{∂w_{00}^O} & \frac{\textcolor{lime}{∂z_0^O}}{∂w_{01}^O} & \frac{\textcolor{lime}{∂z_0^O}}{∂w_{02}^O} & \frac{\textcolor{lime}{∂z_0^O}}{∂b_0^O} \\ \frac{\textcolor{lime}{∂z_1^O}}{∂w_{10}^O} & \frac{\textcolor{lime}{∂z_1^O}}{∂w_{11}^O} & \frac{\textcolor{lime}{∂z_1^O}}{∂w_{12}^O} & \frac{\textcolor{lime}{∂z_1^O}}{∂b_1^O} \end{vmatrix} $
Matrix sizes: $(2 \times 4) = (2 \times 2 \text{, diagonal}) * (2 \times 4)$

Formula for our case of cost/activation/layer types:
$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^O} & \frac{∂C}{∂w_{01}^O} & \frac{∂C}{∂w_{02}^O} & \frac{∂C}{∂b_0^O} \\ \frac{∂C}{∂w_{10}^O} & \frac{∂C}{∂w_{11}^O} & \frac{∂C}{∂w_{12}^O} & \frac{∂C}{∂b_1^O} \\ \end{vmatrix} = \begin{pmatrix} 2(\textcolor{blue}{a_0^O} - y_0) * \textcolor{blue}{a_0^O} (1 - \textcolor{blue}{a_0^O}) \\ 2(\textcolor{blue}{a_1^O} - y_1) * \textcolor{blue}{a_1^O} (1 - \textcolor{blue}{a_1^O}) \end{pmatrix} * \begin{pmatrix} \textcolor{goldenrod}{a_0^{H2}} \\ \textcolor{goldenrod}{a_1^{H2}} \\ \textcolor{goldenrod}{a_2^{H2}} \\ 1 \end{pmatrix} ^ T $
Matrix sizes: $(2 \times 4) = (2 \times 1) * (1 \times 4)$
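The condensed formula is a column vector times a transposed column, i.e. an outer product. A NumPy sketch using the running example's output values and made-up H2 activations:

```python
import numpy as np

a_O = np.array([0.8, 0.3])        # output activations from the example
y = np.array([1.0, 0.0])          # target values
a_H2 = np.array([0.6, 0.2, 0.9])  # made-up H2 activations for illustration

# Column vector: 2(a - y) * a * (1 - a), one entry per output neuron
delta_O = 2.0 * (a_O - y) * a_O * (1.0 - a_O)

# Outer product with (a_H2, 1): row i holds dC/dw_i0..dC/dw_i2 and dC/db_i
grad_O = np.outer(delta_O, np.append(a_H2, 1.0))
print(grad_O.shape)  # (2, 4)
```

The appended 1 in the second factor is what makes the last column come out as the bias gradients.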
Notice that derivatives for output layer weight/biases depend only on:
- activated values of O-layer neurons
- activated values of H2-layer neurons
- target values $y_0, y_1$
Hidden-2 Layer
Remember that:
$C = (\textcolor{blue}{a_0^{O}} - y_0)^2 + (\textcolor{blue}{a_1^{O}} - y_1)^2$
Output layer:
$\textcolor{blue}{a_0^{O}} = Sigmoid(\textcolor{lime}{z_0^{O}})$
$\textcolor{blue}{a_1^{O}} = Sigmoid(\textcolor{lime}{z_1^{O}})$

$\textcolor{lime}{z_0^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{00}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{01}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{02}^{O}) + b_0^{O}$
$\textcolor{lime}{z_1^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{10}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{11}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{12}^{O}) + b_1^{O}$

Hidden layer:
$\textcolor{goldenrod}{a_0^{H2}} = Sigmoid(\textcolor{magenta}{z_0^{H2}})$
$\textcolor{goldenrod}{a_1^{H2}} = Sigmoid(\textcolor{magenta}{z_1^{H2}})$
$\textcolor{goldenrod}{a_2^{H2}} = Sigmoid(\textcolor{magenta}{z_2^{H2}})$

$\textcolor{magenta}{z_0^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{00}^{H2} + \textcolor{teal}{a_1^{H1}}w_{01}^{H2} + \textcolor{teal}{a_2^{H1}}w_{02}^{H2} + \textcolor{teal}{a_3^{H1}}w_{03}^{H2}) + b_0^{H2}$
$\textcolor{magenta}{z_1^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{10}^{H2} + \textcolor{teal}{a_1^{H1}}w_{11}^{H2} + \textcolor{teal}{a_2^{H1}}w_{12}^{H2} + \textcolor{teal}{a_3^{H1}}w_{13}^{H2}) + b_1^{H2}$
$\textcolor{magenta}{z_2^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{20}^{H2} + \textcolor{teal}{a_1^{H1}}w_{21}^{H2} + \textcolor{teal}{a_2^{H1}}w_{22}^{H2} + \textcolor{teal}{a_3^{H1}}w_{23}^{H2}) + b_2^{H2}$
Derivation process:
$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{00}^{H2}} \\ \frac{∂C}{∂w_{01}^{H2}} \\ \frac{∂C}{∂w_{02}^{H2}} \\ \frac{∂C}{∂w_{03}^{H2}} \\ \frac{∂C}{∂b_0^{H2}} \end{pmatrix} = \underbrace{( \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_0^{H2}} } + \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} * \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_0^{H2}} } )}_{\text{there are 2 paths (= neurons on outp layer) \newline to reach derivative}} * \frac{\textcolor{goldenrod}{∂a_0^{H2}} }{\textcolor{magenta}{∂z_0^{H2}} } * \begin{pmatrix} \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂w_{00}^{H2}} \\ \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂w_{01}^{H2}} \\ \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂w_{02}^{H2}} \\ \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂w_{03}^{H2}} \\ \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂b_0^{H2}} \end{pmatrix} $$
$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{10}^{H2}} \\ \frac{∂C}{∂w_{11}^{H2}} \\ \frac{∂C}{∂w_{12}^{H2}} \\ \frac{∂C}{∂w_{13}^{H2}} \\ \frac{∂C}{∂b_1^{H2}} \end{pmatrix} = ( \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_1^{H2}} } + \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} * \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_1^{H2}} } ) * \frac{\textcolor{goldenrod}{∂a_1^{H2}} }{\textcolor{magenta}{∂z_1^{H2}} } * \begin{pmatrix} \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂w_{10}^{H2}} \\ \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂w_{11}^{H2}} \\ \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂w_{12}^{H2}} \\ \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂w_{13}^{H2}} \\ \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂b_1^{H2}} \end{pmatrix} $$
$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{20}^{H2}} \\ \frac{∂C}{∂w_{21}^{H2}} \\ \frac{∂C}{∂w_{22}^{H2}} \\ \frac{∂C}{∂w_{23}^{H2}} \\ \frac{∂C}{∂b_2^{H2}} \end{pmatrix} = ( \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_2^{H2}} } + \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} * \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_2^{H2}} } ) * \frac{\textcolor{goldenrod}{∂a_2^{H2}} }{\textcolor{magenta}{∂z_2^{H2}} } * \begin{pmatrix} \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂w_{20}^{H2}} \\ \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂w_{21}^{H2}} \\ \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂w_{22}^{H2}} \\ \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂w_{23}^{H2}} \\ \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂b_2^{H2}} \end{pmatrix} $$
Or for our choice of cost/layer/activation:
$$ \def\A0{\textcolor{blue}{a_0^O}} \def\B0{\textcolor{blue}{a_1^O}} \def\C0{\textcolor{goldenrod}{a_0^{H2}} } \def\X0{\colorbox{lightblue}{$2(\A0 - y_0) * \A0(1 - \A0)$}} \def\Y0{\colorbox{lightblue}{$2(\B0 - y_1) * \B0(1 - \B0)$}} \def\C0{\textcolor{goldenrod}{a_0^{H2}} } \def\arraystretch{1.6} \footnotesize \begin{pmatrix} \frac{∂C}{∂w_{00}^{H2}} \\ \frac{∂C}{∂w_{01}^{H2}} \\ \frac{∂C}{∂w_{02}^{H2}} \\ \frac{∂C}{∂w_{03}^{H2}} \\ \frac{∂C}{∂b_0^{H2}} \end{pmatrix} = (\X0 * w_{00}^O + \Y0 * w_{10}^O) * \C0(1 - \C0) * \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} \newline \def\C0{\textcolor{goldenrod}{a_1^{H2}} } \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{10}^{H2}} \\ \frac{∂C}{∂w_{11}^{H2}} \\ \frac{∂C}{∂w_{12}^{H2}} \\ \frac{∂C}{∂w_{13}^{H2}} \\ \frac{∂C}{∂b_1^{H2}} \end{pmatrix} = (\X0 * w_{01}^O + \Y0 * w_{11}^O) * \C0(1 - \C0) * \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} \newline \def\C0{\textcolor{goldenrod}{a_2^{H2}} } \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{20}^{H2}} \\ \frac{∂C}{∂w_{21}^{H2}} \\ \frac{∂C}{∂w_{22}^{H2}} \\ \frac{∂C}{∂w_{23}^{H2}} \\ \frac{∂C}{∂b_2^{H2}} \end{pmatrix} = (\X0 * w_{02}^O + \Y0 * w_{12}^O) * \C0(1 - \C0) * \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} $$
Above can be summarized into matrix formula:
$$ \def\arraystretch{1.6} \small \begin{vmatrix} \frac{∂C}{∂w_{00}^{H2}} & \frac{∂C}{∂w_{01}^{H2}} & \frac{∂C}{∂w_{02}^{H2}} & \frac{∂C}{∂w_{03}^{H2}} & \frac{∂C}{∂b_0^{H2}} \\ \frac{∂C}{∂w_{10}^{H2}} & \frac{∂C}{∂w_{11}^{H2}} & \frac{∂C}{∂w_{12}^{H2}} & \frac{∂C}{∂w_{13}^{H2}} & \frac{∂C}{∂b_1^{H2}} \\ \frac{∂C}{∂w_{20}^{H2}} & \frac{∂C}{∂w_{21}^{H2}} & \frac{∂C}{∂w_{22}^{H2}} & \frac{∂C}{∂w_{23}^{H2}} & \frac{∂C}{∂b_2^{H2}} \end{vmatrix} = \begin{vmatrix} \frac{\textcolor{goldenrod}{∂a_0^{H2}} }{\textcolor{magenta}{∂z_0^{H2}} } & 0 & 0 \\ 0 & \frac{\textcolor{goldenrod}{∂a_1^{H2}} }{\textcolor{magenta}{∂z_1^{H2}} } & 0 \\ 0 & 0 & \frac{\textcolor{goldenrod}{∂a_2^{H2}} }{\textcolor{magenta}{∂z_2^{H2}} } \end{vmatrix} \begin{vmatrix} \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_0^{H2}} } & \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_0^{H2}} } \\ \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_1^{H2}} } & \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_1^{H2}} } \\ \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_2^{H2}} } & \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_2^{H2}} } \end{vmatrix} \begin{pmatrix} \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} \\ \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} \end{pmatrix} \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} ^ T = \newline = \def\AA{\textcolor{goldenrod}{a_0^{H2}} } \def\BB{\textcolor{goldenrod}{a_1^{H2}} } \def\CC{\textcolor{goldenrod}{a_2^{H2}} } \begin{vmatrix} \AA(1 - \AA) & 0 & 0 \\ 0 & \BB(1 - \BB) & 0 \\ 0 & 0 & \CC(1 - \CC) \end{vmatrix} \begin{vmatrix} w_{00}^O & w_{10}^O \\ w_{01}^O & w_{11}^O \\ w_{02}^O & w_{12}^O \end{vmatrix} \def\A0{\textcolor{blue}{a_0^O}} \def\B0{\textcolor{blue}{a_1^O}} \def\C0{\textcolor{goldenrod}{a_0^{H2}} } 
\def\X0{2(\A0 - y_0) * \A0(1 - \A0)} \def\Y0{2(\B0 - y_1) * \B0(1 - \B0)} \begin{pmatrix} \X0 \\ \Y0 \end{pmatrix} \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} ^ T $$
Matrix sizes: $(3 \times 5) = (3 \times 3 \text{, diagonal}) * (3 \times 2) * (2 \times 1) * (1 \times 5)$
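The same matrix chain in NumPy, with made-up forward-pass values and output-layer weights (`W_O[i][j]` corresponds to $w_{ij}^O$; note the formula uses its transpose):

```python
import numpy as np

# Made-up forward-pass values for illustration
a_H1 = np.array([0.5, 0.1, 0.8, 0.4])   # hidden-1 activations
a_H2 = np.array([0.6, 0.2, 0.9])        # hidden-2 activations
a_O = np.array([0.8, 0.3])
y = np.array([1.0, 0.0])
W_O = np.array([[0.5, -0.3, 0.2],       # output-layer weights, w_ij^O
                [-0.1, 0.4, 0.6]])

delta_O = 2.0 * (a_O - y) * a_O * (1.0 - a_O)           # (2,) column
M_h2 = np.diag(a_H2 * (1.0 - a_H2)) @ W_O.T @ delta_O   # (3,) column
grad_H2 = np.outer(M_h2, np.append(a_H1, 1.0))          # (3, 5) gradient matrix
print(grad_H2.shape)  # (3, 5)
```

Row $i$ of `grad_H2` holds $∂C/∂w_{i0}^{H2} … ∂C/∂w_{i3}^{H2}$ and $∂C/∂b_i^{H2}$, matching the expanded per-neuron formulas above.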
Hidden-1 Layer
We can derive derivatives by induction.
Derivatives on Output layer: $$ \def\arraystretch{1.6} M_{outp} = \begin{pmatrix} 2(\textcolor{blue}{a_0^O} - y_0) * \textcolor{blue}{a_0^O} (1 - \textcolor{blue}{a_0^O}) \\ 2(\textcolor{blue}{a_1^O} - y_1) * \textcolor{blue}{a_1^O} (1 - \textcolor{blue}{a_1^O}) \end{pmatrix} $$
$$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^O} & \frac{∂C}{∂w_{01}^O} & \frac{∂C}{∂w_{02}^O} & \frac{∂C}{∂b_0^O} \\ \frac{∂C}{∂w_{10}^O} & \frac{∂C}{∂w_{11}^O} & \frac{∂C}{∂w_{12}^O} & \frac{∂C}{∂b_1^O} \\ \end{vmatrix} = M_{outp} * \begin{pmatrix} \textcolor{goldenrod}{a_0^{H2}} \\ \textcolor{goldenrod}{a_1^{H2}} \\ \textcolor{goldenrod}{a_2^{H2}} \\ 1 \end{pmatrix} ^ T $$
Derivatives on Hidden-2 Layer: $$ M_{h2} = \def\AA{\textcolor{goldenrod}{a_0^{H2}} } \def\BB{\textcolor{goldenrod}{a_1^{H2}} } \def\CC{\textcolor{goldenrod}{a_2^{H2}} } \def\arraystretch{1.6} \begin{vmatrix} \AA(1 - \AA) & 0 & 0 \\ 0 & \BB(1 - \BB) & 0 \\ 0 & 0 & \CC(1 - \CC) \end{vmatrix} \begin{vmatrix} w_{00}^O & w_{10}^O \\ w_{01}^O & w_{11}^O \\ w_{02}^O & w_{12}^O \end{vmatrix} * M_{outp} $$
$$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^{H2}} & \frac{∂C}{∂w_{01}^{H2}} & \frac{∂C}{∂w_{02}^{H2}} & \frac{∂C}{∂w_{03}^{H2}} & \frac{∂C}{∂b_0^{H2}} \\ \frac{∂C}{∂w_{10}^{H2}} & \frac{∂C}{∂w_{11}^{H2}} & \frac{∂C}{∂w_{12}^{H2}} & \frac{∂C}{∂w_{13}^{H2}} & \frac{∂C}{∂b_1^{H2}} \\ \frac{∂C}{∂w_{20}^{H2}} & \frac{∂C}{∂w_{21}^{H2}} & \frac{∂C}{∂w_{22}^{H2}} & \frac{∂C}{∂w_{23}^{H2}} & \frac{∂C}{∂b_2^{H2}} \end{vmatrix} = M_{h2} * \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} ^ T $$
By induction, derivatives on Hidden-1 layer:
$$ \small \def\arraystretch{1.6} \def\AA{\textcolor{teal}{a_0^{H1}} } \def\BB{\textcolor{teal}{a_1^{H1}} } \def\CC{\textcolor{teal}{a_2^{H1}} } \def\DD{\textcolor{teal}{a_3^{H1}} } M_{h1} = \begin{vmatrix} \AA(1 - \AA) & 0 & 0 & 0 \\ 0 & \BB(1 - \BB) & 0 & 0 \\ 0 & 0 & \CC(1 - \CC) & 0 \\ 0 & 0 & 0 & \DD(1 - \DD) \end{vmatrix} \begin{vmatrix} w_{00}^{H2} & w_{10}^{H2} & w_{20}^{H2} \\ w_{01}^{H2} & w_{11}^{H2} & w_{21}^{H2} \\ w_{02}^{H2} & w_{12}^{H2} & w_{22}^{H2} \\ w_{03}^{H2} & w_{13}^{H2} & w_{23}^{H2} \\ \end{vmatrix} * M_{h2} $$
$$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^{H1}} & \frac{∂C}{∂w_{01}^{H1}} & \frac{∂C}{∂b_{0}^{H1}} \\ \frac{∂C}{∂w_{10}^{H1}} & \frac{∂C}{∂w_{11}^{H1}} & \frac{∂C}{∂b_{1}^{H1}} \\ \frac{∂C}{∂w_{20}^{H1}} & \frac{∂C}{∂w_{21}^{H1}} & \frac{∂C}{∂b_{2}^{H1}} \\ \frac{∂C}{∂w_{30}^{H1}} & \frac{∂C}{∂w_{31}^{H1}} & \frac{∂C}{∂b_{3}^{H1}} \\ \end{vmatrix} = M_{h1} * \begin{pmatrix} \textcolor{red}{a_0^{I}} \\ \textcolor{red}{a_1^{I}} \\ 1 \end{pmatrix} ^ T $$
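The whole $M_{outp} → M_{h2} → M_{h1}$ recursion fits in a short loop, and the induction can be sanity-checked numerically: the analytic gradient of any one weight should match a finite-difference estimate of $∂C/∂w$. A sketch with random weights, using the sizes and conventions of this document (`W[i][j]` corresponds to $w_{ij}$):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [2, 4, 3, 2]
Ws = [rng.uniform(-1.0, 1.0, (n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
bs = [rng.uniform(-1.0, 1.0, n_out) for n_out in sizes[1:]]
x = np.array([0.3, 0.5])
y = np.array([1.0, 0.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Ws, bs, x):
    acts = [x]
    for W, b in zip(Ws, bs):
        acts.append(sigmoid(W @ acts[-1] + b))
    return acts

def cost(Ws, bs):
    return float(np.sum((forward(Ws, bs, x)[-1] - y) ** 2))

# Backward pass: M starts as M_outp, then is pushed back layer by layer
acts = forward(Ws, bs, x)
M = 2.0 * (acts[-1] - y) * acts[-1] * (1.0 - acts[-1])   # M_outp
grads = [None] * len(Ws)
for l in range(len(Ws) - 1, -1, -1):
    # Gradient matrix of layer l: M times (previous activations, 1)^T
    grads[l] = np.outer(M, np.append(acts[l], 1.0))
    if l > 0:
        # M_prev = diag(a * (1 - a)) * W_l^T * M
        M = acts[l] * (1.0 - acts[l]) * (Ws[l].T @ M)

# Finite-difference check on one hidden-1 weight
eps = 1e-6
Ws[0][1, 0] += eps
c_plus = cost(Ws, bs)
Ws[0][1, 0] -= 2.0 * eps
c_minus = cost(Ws, bs)
Ws[0][1, 0] += eps
numeric = (c_plus - c_minus) / (2.0 * eps)
print(abs(grads[0][1, 0] - numeric))  # small: the recursion checks out
```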
This is indeed the correct induction: at every step the new $M$ matrix is obtained from the previous one by multiplying with the diagonal matrix of activation derivatives and the transposed weight matrix of the next layer. It can also be verified numerically by comparing analytic gradients against finite differences.
Updating weights and biases
As described in the math chapter on backward propagation, we now update all our 35 weights and biases according to the found gradient and the selected learning rate.
Then we repeat the optimization cycle as many times as required to reach a reasonably small cost function.
Optimizing on batch
Essentially, yes: for a batch we run forward and backward propagation for each input-output pair separately, sum (or average) the per-pair costs and gradients, and then perform a single weight update with the combined gradient. Averaging keeps the effective learning rate independent of the batch size.
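The batching idea on a toy example: per-example costs $C_k(w) = (w - t_k)^2$, whose averaged batch cost is minimized at the mean of the targets. Summing the per-example gradients and dividing by the batch size is the whole trick (toy values, for illustration only):

```python
targets = [1.0, 2.0, 6.0]   # one toy "training example" per target
r = -0.1                    # negative learning rate, as above
w = 0.0

for _ in range(200):
    # Per-example gradients of C_k(w) = (w - t)^2, summed then averaged
    grad = sum(2.0 * (w - t) for t in targets) / len(targets)
    w = w + r * grad

print(w)  # close to mean(targets) = 3.0
```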