01 Fully-connected network

Forward and back propagations

Model description

Let’s say we have a simple feed-forward neural network with 4 layers:

    Cost function: C = (a₀ - y₀)² + (a₁ - y₁)²

    * *     Output Layer   | 2 neurons, 2*3 weights + 2 biases, Sigmoid activation
   /\/\\                   |
   * * *    Hidden-2 Layer | 3 neurons, 3*4 weights + 3 biases, Sigmoid activation
  /\//\\\                  |
  * * * *   Hidden-1 Layer | 4 neurons, 4*2 weights + 4 biases, Sigmoid activation
   \/\\/                   |
    * *     Input Layer    | 2 neurons, 2 input values (say 0.0 .. 0.1)

Initially we have all our 35 weights and biases set randomly. In our example their typical values are somewhere between $-3.0 … +3.0$.

We also have some training data, e.g. the input-output pair used throughout this text: input $a_0^I = 0.3, a_1^I = 0.5$ with expected output $y_0 = 1.0, y_1 = 0.0$.

Our aim is to find weights and biases for which the cost function on our training data is as low as possible. This minimization of the cost function is what is called training the neural network.

For simplicity let’s say we only optimize for one exact input-output pair. In reality we optimize over a batch of input-output pairs.

Forward propagation

At this stage we have the 35 weights and biases set (randomly, if it’s our 1st pass). Our aim is to see what our NN will output on its 2 output-layer neurons for given values of the input neurons.

Input Layer: 2 neurons are set, for example:

$\textcolor{red}{a_0^{I}} = 0.3, \textcolor{red}{a_1^{I}} = 0.5$

Hidden-1 Layer: 4 neurons are set

$\textcolor{grey}{z_0^{H1}} = (\textcolor{red}{a_0^{I}}w_{00}^{H1} + \textcolor{red}{a_1^{I}}w_{01}^{H1}) + b_0^{H1}$
$\textcolor{grey}{z_1^{H1}} = (\textcolor{red}{a_0^{I}}w_{10}^{H1} + \textcolor{red}{a_1^{I}}w_{11}^{H1}) + b_1^{H1}$
$\textcolor{grey}{z_2^{H1}} = (\textcolor{red}{a_0^{I}}w_{20}^{H1} + \textcolor{red}{a_1^{I}}w_{21}^{H1}) + b_2^{H1}$
$\textcolor{grey}{z_3^{H1}} = (\textcolor{red}{a_0^{I}}w_{30}^{H1} + \textcolor{red}{a_1^{I}}w_{31}^{H1}) + b_3^{H1}$

And then activated:

$\textcolor{teal}{a_0^{H1}} = Sigmoid(\textcolor{grey}{z_0^{H1}})$
$\textcolor{teal}{a_1^{H1}} = Sigmoid(\textcolor{grey}{z_1^{H1}})$
$\textcolor{teal}{a_2^{H1}} = Sigmoid(\textcolor{grey}{z_2^{H1}})$
$\textcolor{teal}{a_3^{H1}} = Sigmoid(\textcolor{grey}{z_3^{H1}})$

Hidden-2 Layer: 3 neurons are set

$\textcolor{magenta}{z_0^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{00}^{H2} + \textcolor{teal}{a_1^{H1}}w_{01}^{H2} + \textcolor{teal}{a_2^{H1}}w_{02}^{H2} + \textcolor{teal}{a_3^{H1}}w_{03}^{H2}) + b_0^{H2}$
$\textcolor{magenta}{z_1^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{10}^{H2} + \textcolor{teal}{a_1^{H1}}w_{11}^{H2} + \textcolor{teal}{a_2^{H1}}w_{12}^{H2} + \textcolor{teal}{a_3^{H1}}w_{13}^{H2}) + b_1^{H2}$
$\textcolor{magenta}{z_2^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{20}^{H2} + \textcolor{teal}{a_1^{H1}}w_{21}^{H2} + \textcolor{teal}{a_2^{H1}}w_{22}^{H2} + \textcolor{teal}{a_3^{H1}}w_{23}^{H2}) + b_2^{H2}$

And then activated:

$\textcolor{goldenrod}{a_0^{H2}} = Sigmoid(\textcolor{magenta}{z_0^{H2}})$
$\textcolor{goldenrod}{a_1^{H2}} = Sigmoid(\textcolor{magenta}{z_1^{H2}})$
$\textcolor{goldenrod}{a_2^{H2}} = Sigmoid(\textcolor{magenta}{z_2^{H2}})$

Output Layer: 2 neurons are set

$\textcolor{lime}{z_0^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{00}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{01}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{02}^{O}) + b_0^{O}$
$\textcolor{lime}{z_1^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{10}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{11}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{12}^{O}) + b_1^{O}$

And then activated:

$\textcolor{blue}{a_0^{O}} = Sigmoid(\textcolor{lime}{z_0^{O}})$
$\textcolor{blue}{a_1^{O}} = Sigmoid(\textcolor{lime}{z_1^{O}})$
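The whole forward pass above can be sketched in a few lines of NumPy. This is a minimal sketch: the weight values here are hypothetical random draws from the $-3.0 … +3.0$ range, not the ones from the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes of the example network: 2 inputs -> 4 (H1) -> 3 (H2) -> 2 outputs.
sizes = [2, 4, 3, 2]

# Randomly initialized weights W[l] (rows = neurons, cols = inputs) and biases b[l].
W = [rng.uniform(-3.0, 3.0, (n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
b = [rng.uniform(-3.0, 3.0, n_out) for n_out in sizes[1:]]

def forward(a):
    """Propagate input activations layer by layer: z = W a + b, then Sigmoid."""
    for Wl, bl in zip(W, b):
        a = sigmoid(Wl @ a + bl)
    return a

out = forward(np.array([0.3, 0.5]))  # two output activations, each in (0, 1)
```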

So, let’s say we got with our current weights and biases: $a_0^{O} = 0.8, a_1^{O} = 0.3$

And we know that for our input ($a_0^{I} = 0.3, a_1^{I} = 0.5$) we were ideally expecting $y_0 = 1.0, y_1 = 0.0$

The cost function measures how far we are from the desired result. In our example:

$C = (\textcolor{blue}{a_0^{O}} - y_0)^2 + (\textcolor{blue}{a_1^{O}} - y_1)^2 = (0.8 - 1.0)^2 + (0.3 - 0.0)^2 = 0.13$
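The same arithmetic as a one-liner, using the outputs and targets from the example:

```python
# Cost for the example: outputs (0.8, 0.3), expected (1.0, 0.0).
a_out = [0.8, 0.3]
y = [1.0, 0.0]
C = sum((a - t) ** 2 for a, t in zip(a_out, y))  # ≈ 0.13
```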

Backward propagation

Math theory

Our task is now to change weights and biases such that cost function $C$ will be smaller than $0.13$.

As a mathematical function, $C$ is a function of 35 variables (the weights and biases of the neural network).

So, now we need to find a minimum (global, or at least local) of our mathematical function of 35 variables. How do we do it? We find all 35 partial derivatives of $C$ (which are $∂C/∂w_{ij}$ and $∂C/∂b_k$). Together they form the gradient, a vector in the 35-dimensional parameter space. Moving against the gradient is a good candidate direction towards some reasonable minimum of $C$.

Once we have this direction, we update all our weights and biases by stepping against the gradient with learning rate $r$ (example for $w_{21}^{H1}$):
$w_{21}^{H1_{upd}} = w_{21}^{H1_{org}} - r*∂C/∂w_{21}^{H1_{org}}$
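In code, one such update step looks like this (all values hypothetical, chosen only for illustration):

```python
# One gradient-descent step for a single parameter.
r = 0.5        # learning rate (hypothetical)
w = 1.2        # current value of some weight, e.g. w_21^H1 (hypothetical)
dC_dw = 0.04   # its partial derivative found by backpropagation (hypothetical)
w = w - r * dC_dw  # step against the gradient so that C decreases
```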

The question is now how to efficiently find those 35 partial derivatives.

Deriving backward propagation formulas

Backpropagation starts at output layer and finishes at input layer.

We will further need the sigmoid derivative:

The sigmoid function $y(x) = 1/(1+e^{-x})$ has derivative $dy/dx = y*(1-y)$
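This identity is easy to verify numerically by comparing the analytic derivative with a central finite difference:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Compare dy/dx = y*(1-y) with a central finite difference at a few points.
h = 1e-6
for x in (-2.0, 0.0, 1.5):
    y = sigmoid(x)
    analytic = y * (1.0 - y)
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
    assert abs(analytic - numeric) < 1e-8
```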

Output layer

To find the derivatives for the output layer we will need the following formulas (given above):

$C = (\textcolor{blue}{a_0^{O}} - y_0)^2 + (\textcolor{blue}{a_1^{O}} - y_1)^2$

Output layer:

$\textcolor{blue}{a_0^{O}} = Sigmoid(\textcolor{lime}{z_0^{O}})$
$\textcolor{blue}{a_1^{O}} = Sigmoid(\textcolor{lime}{z_1^{O}})$

$\textcolor{lime}{z_0^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{00}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{01}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{02}^{O}) + b_0^{O}$
$\textcolor{lime}{z_1^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{10}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{11}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{12}^{O}) + b_1^{O}$

Derivation process:

Step 1: Let’s trace how one particular derivative ($∂C/∂w_{00}^O$) is found:

There is only one path from $C$ to $w_{00}^O$:

Following this path we get:

$$ \frac{∂C}{∂w_{00}^O} = \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \frac{\textcolor{lime}{∂z_0^O}}{∂w_{00}^O} = \underbrace{2(\textcolor{blue}{a_0^O} - y_0)}_{\text{by cost function}} * \underbrace{\textcolor{blue}{a_0^O} (1 - \textcolor{blue}{a_0^O})}_{\text{by activation}} * \underbrace{\textcolor{goldenrod}{a_0^{H2}} }_{\text{by layer}} $$

Notice that the final derivatives are only valid for our particular choice of cost, activation and layer types.

Step 2: repeat the same logic for all other weights and biases:

$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{00}^O} \\ \frac{∂C}{∂w_{01}^O} \\ \frac{∂C}{∂w_{02}^O} \\ \frac{∂C}{∂b_0^O} \end{pmatrix} = \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \begin{pmatrix} \frac{\textcolor{lime}{∂z_0^O}}{∂w_{00}^O} \\ \frac{\textcolor{lime}{∂z_0^O}}{∂w_{01}^O} \\ \frac{\textcolor{lime}{∂z_0^O}}{∂w_{02}^O} \\ \frac{\textcolor{lime}{∂z_0^O}}{∂b_0^O} \end{pmatrix} = 2(\textcolor{blue}{a_0^O} - y_0) * \textcolor{blue}{a_0^O} (1 - \textcolor{blue}{a_0^O}) * \begin{pmatrix} \textcolor{goldenrod}{a_0^{H2}} \\ \textcolor{goldenrod}{a_1^{H2}} \\ \textcolor{goldenrod}{a_2^{H2}} \\ 1 \end{pmatrix} $$

$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{10}^O} \\ \frac{∂C}{∂w_{11}^O} \\ \frac{∂C}{∂w_{12}^O} \\ \frac{∂C}{∂b_1^O} \end{pmatrix} = \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} * \begin{pmatrix} \frac{\textcolor{lime}{∂z_1^O}}{∂w_{10}^O} \\ \frac{\textcolor{lime}{∂z_1^O}}{∂w_{11}^O} \\ \frac{\textcolor{lime}{∂z_1^O}}{∂w_{12}^O} \\ \frac{\textcolor{lime}{∂z_1^O}}{∂b_1^O} \end{pmatrix} = 2(\textcolor{blue}{a_1^O} - y_1) * \textcolor{blue}{a_1^O} (1 - \textcolor{blue}{a_1^O}) * \begin{pmatrix} \textcolor{goldenrod}{a_0^{H2}} \\ \textcolor{goldenrod}{a_1^{H2}} \\ \textcolor{goldenrod}{a_2^{H2}} \\ 1 \end{pmatrix} $$

Step 3: the above can be condensed into nice matrix formulas.

General formula (it doesn’t depend on the exact cost/activation/layer types):

$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^O} & \frac{∂C}{∂w_{01}^O} & \frac{∂C}{∂w_{02}^O} & \frac{∂C}{∂b_0^O} \\ \frac{∂C}{∂w_{10}^O} & \frac{∂C}{∂w_{11}^O} & \frac{∂C}{∂w_{12}^O} & \frac{∂C}{∂b_1^O} \\ \end{vmatrix} = \begin{vmatrix} \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} & 0 \\ 0 & \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} \end{vmatrix} * \begin{vmatrix} \frac{\textcolor{lime}{∂z_0^O}}{∂w_{00}^O} & \frac{\textcolor{lime}{∂z_0^O}}{∂w_{01}^O} & \frac{\textcolor{lime}{∂z_0^O}}{∂w_{02}^O} & \frac{\textcolor{lime}{∂z_0^O}}{∂b_0^O} \\ \frac{\textcolor{lime}{∂z_1^O}}{∂w_{10}^O} & \frac{\textcolor{lime}{∂z_1^O}}{∂w_{11}^O} & \frac{\textcolor{lime}{∂z_1^O}}{∂w_{12}^O} & \frac{\textcolor{lime}{∂z_1^O}}{∂b_1^O} \end{vmatrix} $

     Matrix sizes:

     x x x x _ x 0 * x x x x
     x x x x ‾ 0 x   x x x x
Formula for our choice of cost/activation/layer types:

$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^O} & \frac{∂C}{∂w_{01}^O} & \frac{∂C}{∂w_{02}^O} & \frac{∂C}{∂b_0^O} \\ \frac{∂C}{∂w_{10}^O} & \frac{∂C}{∂w_{11}^O} & \frac{∂C}{∂w_{12}^O} & \frac{∂C}{∂b_1^O} \\ \end{vmatrix} = \begin{pmatrix} 2(\textcolor{blue}{a_0^O} - y_0) * \textcolor{blue}{a_0^O} (1 - \textcolor{blue}{a_0^O}) \\ 2(\textcolor{blue}{a_1^O} - y_1) * \textcolor{blue}{a_1^O} (1 - \textcolor{blue}{a_1^O}) \end{pmatrix} * \begin{pmatrix} \textcolor{goldenrod}{a_0^{H2}} \\ \textcolor{goldenrod}{a_1^{H2}} \\ \textcolor{goldenrod}{a_2^{H2}} \\ 1 \end{pmatrix} ^ T $

     Matrix sizes:

     x x x x _ x * x x x x
     x x x x ‾ x
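The column-times-row product above is exactly an outer product. A minimal NumPy sketch, using the outputs and targets from the example and assumed (hypothetical) Hidden-2 activations:

```python
import numpy as np

# Values from the example; Hidden-2 activations are hypothetical stand-ins.
a_O = np.array([0.8, 0.3])            # output activations
y = np.array([1.0, 0.0])              # expected values
a_H2 = np.array([0.6, 0.4, 0.9])      # assumed Hidden-2 activations

# Column: per-output-neuron factor 2(a-y) * a(1-a); row: (a_H2, 1).
delta_O = 2.0 * (a_O - y) * a_O * (1.0 - a_O)
grads_O = np.outer(delta_O, np.append(a_H2, 1.0))  # 2x4: weight + bias gradients
```

The last column of `grads_O` is just `delta_O`, because the bias enters $z$ with coefficient 1.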

Notice that the derivatives for output-layer weights/biases depend only on the output activations $\textcolor{blue}{a^{O}}$, the expected values $y$, and the activations of the previous layer $\textcolor{goldenrod}{a^{H2}}$.

Hidden-2 Layer

Remember that:

$C = (\textcolor{blue}{a_0^{O}} - y_0)^2 + (\textcolor{blue}{a_1^{O}} - y_1)^2$

Output layer:

$\textcolor{blue}{a_0^{O}} = Sigmoid(\textcolor{lime}{z_0^{O}})$
$\textcolor{blue}{a_1^{O}} = Sigmoid(\textcolor{lime}{z_1^{O}})$

$\textcolor{lime}{z_0^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{00}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{01}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{02}^{O}) + b_0^{O}$
$\textcolor{lime}{z_1^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{10}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{11}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{12}^{O}) + b_1^{O}$

Hidden-2 layer:

$\textcolor{goldenrod}{a_0^{H2}} = Sigmoid(\textcolor{magenta}{z_0^{H2}})$
$\textcolor{goldenrod}{a_1^{H2}} = Sigmoid(\textcolor{magenta}{z_1^{H2}})$
$\textcolor{goldenrod}{a_2^{H2}} = Sigmoid(\textcolor{magenta}{z_2^{H2}})$

$\textcolor{magenta}{z_0^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{00}^{H2} + \textcolor{teal}{a_1^{H1}}w_{01}^{H2} + \textcolor{teal}{a_2^{H1}}w_{02}^{H2} + \textcolor{teal}{a_3^{H1}}w_{03}^{H2}) + b_0^{H2}$
$\textcolor{magenta}{z_1^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{10}^{H2} + \textcolor{teal}{a_1^{H1}}w_{11}^{H2} + \textcolor{teal}{a_2^{H1}}w_{12}^{H2} + \textcolor{teal}{a_3^{H1}}w_{13}^{H2}) + b_1^{H2}$
$\textcolor{magenta}{z_2^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{20}^{H2} + \textcolor{teal}{a_1^{H1}}w_{21}^{H2} + \textcolor{teal}{a_2^{H1}}w_{22}^{H2} + \textcolor{teal}{a_3^{H1}}w_{23}^{H2}) + b_2^{H2}$

Derivation process:

$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{00}^{H2}} \\ \frac{∂C}{∂w_{01}^{H2}} \\ \frac{∂C}{∂w_{02}^{H2}} \\ \frac{∂C}{∂w_{03}^{H2}} \\ \frac{∂C}{∂b_0^{H2}} \end{pmatrix} = \underbrace{( \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_0^{H2}} } + \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} * \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_0^{H2}} } )}_{\text{2 paths (one per output neuron) lead to this derivative}} * \frac{\textcolor{goldenrod}{∂a_0^{H2}} }{\textcolor{magenta}{∂z_0^{H2}} } * \begin{pmatrix} \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂w_{00}^{H2}} \\ \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂w_{01}^{H2}} \\ \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂w_{02}^{H2}} \\ \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂w_{03}^{H2}} \\ \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂b_0^{H2}} \end{pmatrix} $$

$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{10}^{H2}} \\ \frac{∂C}{∂w_{11}^{H2}} \\ \frac{∂C}{∂w_{12}^{H2}} \\ \frac{∂C}{∂w_{13}^{H2}} \\ \frac{∂C}{∂b_1^{H2}} \end{pmatrix} = ( \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_1^{H2}} } + \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} * \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_1^{H2}} } ) * \frac{\textcolor{goldenrod}{∂a_1^{H2}} }{\textcolor{magenta}{∂z_1^{H2}} } * \begin{pmatrix} \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂w_{10}^{H2}} \\ \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂w_{11}^{H2}} \\ \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂w_{12}^{H2}} \\ \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂w_{13}^{H2}} \\ \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂b_1^{H2}} \end{pmatrix} $$

$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{20}^{H2}} \\ \frac{∂C}{∂w_{21}^{H2}} \\ \frac{∂C}{∂w_{22}^{H2}} \\ \frac{∂C}{∂w_{23}^{H2}} \\ \frac{∂C}{∂b_2^{H2}} \end{pmatrix} = ( \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_2^{H2}} } + \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} * \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_2^{H2}} } ) * \frac{\textcolor{goldenrod}{∂a_2^{H2}} }{\textcolor{magenta}{∂z_2^{H2}} } * \begin{pmatrix} \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂w_{20}^{H2}} \\ \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂w_{21}^{H2}} \\ \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂w_{22}^{H2}} \\ \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂w_{23}^{H2}} \\ \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂b_2^{H2}} \end{pmatrix} $$

Or for our choice of cost/layer/activation:

$$ \def\A0{\textcolor{blue}{a_0^O}} \def\B0{\textcolor{blue}{a_1^O}} \def\C0{\textcolor{goldenrod}{a_0^{H2}} } \def\X0{\colorbox{lightblue}{$2(\A0 - y_0) * \A0(1 - \A0)$}} \def\Y0{\colorbox{lightblue}{$2(\B0 - y_1) * \B0(1 - \B0)$}} \def\C0{\textcolor{goldenrod}{a_0^{H2}} } \def\arraystretch{1.6} \footnotesize \begin{pmatrix} \frac{∂C}{∂w_{00}^{H2}} \\ \frac{∂C}{∂w_{01}^{H2}} \\ \frac{∂C}{∂w_{02}^{H2}} \\ \frac{∂C}{∂w_{03}^{H2}} \\ \frac{∂C}{∂b_0^{H2}} \end{pmatrix} = (\X0 * w_{00}^O + \Y0 * w_{10}^O) * \C0(1 - \C0) * \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} \newline \def\C0{\textcolor{goldenrod}{a_1^{H2}} } \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{10}^{H2}} \\ \frac{∂C}{∂w_{11}^{H2}} \\ \frac{∂C}{∂w_{12}^{H2}} \\ \frac{∂C}{∂w_{13}^{H2}} \\ \frac{∂C}{∂b_1^{H2}} \end{pmatrix} = (\X0 * w_{01}^O + \Y0 * w_{11}^O) * \C0(1 - \C0) * \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} \newline \def\C0{\textcolor{goldenrod}{a_2^{H2}} } \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{20}^{H2}} \\ \frac{∂C}{∂w_{21}^{H2}} \\ \frac{∂C}{∂w_{22}^{H2}} \\ \frac{∂C}{∂w_{23}^{H2}} \\ \frac{∂C}{∂b_2^{H2}} \end{pmatrix} = (\X0 * w_{02}^O + \Y0 * w_{12}^O) * \C0(1 - \C0) * \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} $$

The above can be summarized into a matrix formula:

$$ \def\arraystretch{1.6} \small \begin{vmatrix} \frac{∂C}{∂w_{00}^{H2}} & \frac{∂C}{∂w_{01}^{H2}} & \frac{∂C}{∂w_{02}^{H2}} & \frac{∂C}{∂w_{03}^{H2}} & \frac{∂C}{∂b_0^{H2}} \\ \frac{∂C}{∂w_{10}^{H2}} & \frac{∂C}{∂w_{11}^{H2}} & \frac{∂C}{∂w_{12}^{H2}} & \frac{∂C}{∂w_{13}^{H2}} & \frac{∂C}{∂b_1^{H2}} \\ \frac{∂C}{∂w_{20}^{H2}} & \frac{∂C}{∂w_{21}^{H2}} & \frac{∂C}{∂w_{22}^{H2}} & \frac{∂C}{∂w_{23}^{H2}} & \frac{∂C}{∂b_2^{H2}} \end{vmatrix} = \begin{vmatrix} \frac{\textcolor{goldenrod}{∂a_0^{H2}} }{\textcolor{magenta}{∂z_0^{H2}} } & 0 & 0 \\ 0 & \frac{\textcolor{goldenrod}{∂a_1^{H2}} }{\textcolor{magenta}{∂z_1^{H2}} } & 0 \\ 0 & 0 & \frac{\textcolor{goldenrod}{∂a_2^{H2}} }{\textcolor{magenta}{∂z_2^{H2}} } \end{vmatrix} \begin{vmatrix} \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_0^{H2}} } & \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_0^{H2}} } \\ \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_1^{H2}} } & \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_1^{H2}} } \\ \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_2^{H2}} } & \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_2^{H2}} } \end{vmatrix} \begin{pmatrix} \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} \\ \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} \end{pmatrix} \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} ^ T = \newline = \def\AA{\textcolor{goldenrod}{a_0^{H2}} } \def\BB{\textcolor{goldenrod}{a_1^{H2}} } \def\CC{\textcolor{goldenrod}{a_2^{H2}} } \begin{vmatrix} \AA(1 - \AA) & 0 & 0 \\ 0 & \BB(1 - \BB) & 0 \\ 0 & 0 & \CC(1 - \CC) \end{vmatrix} \begin{vmatrix} w_{00}^O & w_{10}^O \\ w_{01}^O & w_{11}^O \\ w_{02}^O & w_{12}^O \end{vmatrix} \def\A0{\textcolor{blue}{a_0^O}} \def\B0{\textcolor{blue}{a_1^O}} \def\C0{\textcolor{goldenrod}{a_0^{H2}} } 
\def\X0{2(\A0 - y_0) * \A0(1 - \A0)} \def\Y0{2(\B0 - y_1) * \B0(1 - \B0)} \begin{pmatrix} \X0 \\ \Y0 \end{pmatrix} \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} ^ T $$

     Matrix sizes:

     x x x x x   x 0 0 * x x * x * x x x x x
     x x x x x = 0 x 0   x x * x
     x x x x x   0 0 x   x x
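The diag-times-weights-times-column chain can be sketched directly in NumPy. All input values here are hypothetical stand-ins; the point is the shapes and the order of multiplication:

```python
import numpy as np

# Hypothetical values for everything the formula consumes.
a_O = np.array([0.8, 0.3]); y = np.array([1.0, 0.0])
a_H2 = np.array([0.6, 0.4, 0.9])               # assumed Hidden-2 activations
a_H1 = np.array([0.2, 0.7, 0.5, 0.1])          # assumed Hidden-1 activations
W_O = np.arange(6, dtype=float).reshape(2, 3)  # assumed output-layer weights w_ij^O

M_out = 2.0 * (a_O - y) * a_O * (1.0 - a_O)      # output-layer column (2)
M_h2 = (a_H2 * (1.0 - a_H2)) * (W_O.T @ M_out)   # diag(a(1-a)) * W^T * M_out (3)
grads_H2 = np.outer(M_h2, np.append(a_H1, 1.0))  # 3x5 matrix of dC/dw, dC/db
```

Multiplying elementwise by `a_H2 * (1 - a_H2)` is the same as left-multiplying by the diagonal matrix in the formula, just cheaper.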

Hidden-1 Layer

We can derive the derivatives for the remaining layers by induction.

Derivatives on Output layer: $$ \def\arraystretch{1.6} M_{outp} = \begin{pmatrix} 2(\textcolor{blue}{a_0^O} - y_0) * \textcolor{blue}{a_0^O} (1 - \textcolor{blue}{a_0^O}) \\ 2(\textcolor{blue}{a_1^O} - y_1) * \textcolor{blue}{a_1^O} (1 - \textcolor{blue}{a_1^O}) \end{pmatrix} $$

$$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^O} & \frac{∂C}{∂w_{01}^O} & \frac{∂C}{∂w_{02}^O} & \frac{∂C}{∂b_0^O} \\ \frac{∂C}{∂w_{10}^O} & \frac{∂C}{∂w_{11}^O} & \frac{∂C}{∂w_{12}^O} & \frac{∂C}{∂b_1^O} \\ \end{vmatrix} = M_{outp} * \begin{pmatrix} \textcolor{goldenrod}{a_0^{H2}} \\ \textcolor{goldenrod}{a_1^{H2}} \\ \textcolor{goldenrod}{a_2^{H2}} \\ 1 \end{pmatrix} ^ T $$

Derivatives on Hidden-2 Layer: $$ M_{h2} = \def\AA{\textcolor{goldenrod}{a_0^{H2}} } \def\BB{\textcolor{goldenrod}{a_1^{H2}} } \def\CC{\textcolor{goldenrod}{a_2^{H2}} } \def\arraystretch{1.6} \begin{vmatrix} \AA(1 - \AA) & 0 & 0 \\ 0 & \BB(1 - \BB) & 0 \\ 0 & 0 & \CC(1 - \CC) \end{vmatrix} \begin{vmatrix} w_{00}^O & w_{10}^O \\ w_{01}^O & w_{11}^O \\ w_{02}^O & w_{12}^O \end{vmatrix} * M_{outp} $$

$$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^{H2}} & \frac{∂C}{∂w_{01}^{H2}} & \frac{∂C}{∂w_{02}^{H2}} & \frac{∂C}{∂w_{03}^{H2}} & \frac{∂C}{∂b_0^{H2}} \\ \frac{∂C}{∂w_{10}^{H2}} & \frac{∂C}{∂w_{11}^{H2}} & \frac{∂C}{∂w_{12}^{H2}} & \frac{∂C}{∂w_{13}^{H2}} & \frac{∂C}{∂b_1^{H2}} \\ \frac{∂C}{∂w_{20}^{H2}} & \frac{∂C}{∂w_{21}^{H2}} & \frac{∂C}{∂w_{22}^{H2}} & \frac{∂C}{∂w_{23}^{H2}} & \frac{∂C}{∂b_2^{H2}} \end{vmatrix} = M_{h2} * \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} ^ T $$

By induction, derivatives on Hidden-1 layer:

$$ \small \def\arraystretch{1.6} \def\AA{\textcolor{teal}{a_0^{H1}} } \def\BB{\textcolor{teal}{a_1^{H1}} } \def\CC{\textcolor{teal}{a_2^{H1}} } \def\DD{\textcolor{teal}{a_3^{H1}} } M_{h1} = \begin{vmatrix} \AA(1 - \AA) & 0 & 0 & 0 \\ 0 & \BB(1 - \BB) & 0 & 0 \\ 0 & 0 & \CC(1 - \CC) & 0 \\ 0 & 0 & 0 & \DD(1 - \DD) \end{vmatrix} \begin{vmatrix} w_{00}^{H2} & w_{10}^{H2} & w_{20}^{H2} \\ w_{01}^{H2} & w_{11}^{H2} & w_{21}^{H2} \\ w_{02}^{H2} & w_{12}^{H2} & w_{22}^{H2} \\ w_{03}^{H2} & w_{13}^{H2} & w_{23}^{H2} \\ \end{vmatrix} * M_{h2} $$

$$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^{H1}} & \frac{∂C}{∂w_{01}^{H1}} & \frac{∂C}{∂b_{0}^{H1}} \\ \frac{∂C}{∂w_{10}^{H1}} & \frac{∂C}{∂w_{11}^{H1}} & \frac{∂C}{∂b_{1}^{H1}} \\ \frac{∂C}{∂w_{20}^{H1}} & \frac{∂C}{∂w_{21}^{H1}} & \frac{∂C}{∂b_{2}^{H1}} \\ \frac{∂C}{∂w_{30}^{H1}} & \frac{∂C}{∂w_{31}^{H1}} & \frac{∂C}{∂b_{3}^{H1}} \\ \end{vmatrix} = M_{h1} * \begin{pmatrix} \textcolor{red}{a_0^{I}} \\ \textcolor{red}{a_1^{I}} \\ 1 \end{pmatrix} ^ T $$
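As a sanity check of the full recursion, a small NumPy script can compute all the gradients via these matrix formulas and spot-check one of them against a finite difference. Weights are hypothetical random values; the network shape matches the example.

```python
import numpy as np

rng = np.random.default_rng(42)
sizes = [2, 4, 3, 2]
W = [rng.uniform(-3.0, 3.0, (o, i)) for i, o in zip(sizes, sizes[1:])]
b = [rng.uniform(-3.0, 3.0, o) for o in sizes[1:]]
x = np.array([0.3, 0.5]); y = np.array([1.0, 0.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def activations():
    """Forward pass, keeping every layer's activations."""
    acts = [x]
    for Wl, bl in zip(W, b):
        acts.append(sigmoid(Wl @ acts[-1] + bl))
    return acts

def cost():
    return float(np.sum((activations()[-1] - y) ** 2))

# Backward pass: M starts as the output-layer column, then is pulled back
# through each layer via M <- diag(a(1-a)) W^T M.
acts = activations()
M = 2.0 * (acts[-1] - y) * acts[-1] * (1.0 - acts[-1])
dW = [None] * len(W); db = [None] * len(b)
for l in range(len(W) - 1, -1, -1):
    dW[l] = np.outer(M, acts[l])  # weight gradients for layer l
    db[l] = M.copy()              # bias gradients for layer l
    if l > 0:
        M = (acts[l] * (1.0 - acts[l])) * (W[l].T @ M)

# Spot-check one weight gradient against a central finite difference.
h = 1e-6
W[0][1, 0] += h; c_plus = cost()
W[0][1, 0] -= 2.0 * h; c_minus = cost()
W[0][1, 0] += h
assert abs(dW[0][1, 0] - (c_plus - c_minus) / (2.0 * h)) < 1e-6
```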

The induction is indeed correct: each layer’s matrix is obtained from the next layer’s via the same recursion $M_{l} = \mathrm{diag}\big(a^{l}(1 - a^{l})\big) \, (W^{l+1})^{T} \, M_{l+1}$, which is the standard backpropagation rule.

Updating weights and biases

As described in the math chapter on backward propagation, we now update all 35 weights and biases according to the gradient we found and the selected learning rate.

Then we repeat the optimization cycle as many times as required to reach a reasonably small cost function.

Optimizing on batch

Yes, essentially: the batch cost is the sum (or mean) of per-pair costs, so the batch gradient is the sum (or mean) of per-pair gradients. We run forward and backward propagation for each pair, accumulate the gradients, and apply a single weight update per batch.
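Since the batch cost is a sum of per-pair costs, the batch gradient is the sum (or mean) of per-pair gradients. A minimal sketch of one batch update, on a toy single-layer sigmoid net with hypothetical weights (kept small for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
W1 = rng.uniform(-3.0, 3.0, (2, 2)); b1 = rng.uniform(-3.0, 3.0, 2)

def pair_grads(x, y):
    """Gradients of the (a - y)^2 cost for a single input-output pair."""
    a = sigmoid(W1 @ x + b1)
    delta = 2.0 * (a - y) * a * (1.0 - a)
    return np.outer(delta, x), delta

batch = [(np.array([0.3, 0.5]), np.array([1.0, 0.0])),
         (np.array([0.9, 0.1]), np.array([0.0, 1.0]))]

# Averaging instead of summing only rescales the effective learning rate.
gW = sum(pair_grads(x, y)[0] for x, y in batch) / len(batch)
gb = sum(pair_grads(x, y)[1] for x, y in batch) / len(batch)
r = 0.5
W1 = W1 - r * gW; b1 = b1 - r * gb  # one update for the whole batch
```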