Table of Contents
Model description
Let’s say we have a simple feed-forward neural network with 4 layers:
Initially all our 35 weights and biases are set randomly. In our example their typical values are somewhere between $-3.0 … +3.0$.
We also have some training data like:
- for input $a_0^{I} = 0.3, a_1^{I} = 0.5$ we expect output $y_0 = 1.0, y_1 = 0.0$
- for input $a_0^{I} = 0.2, a_1^{I} = 0.7$ we expect output $y_0 = 0.0, y_1 = 1.0$
- and so on
Our aim is to find such weights and biases that the cost function over our training data is as low as possible. This minimization of the cost function is what is called training the neural network.
For simplicity, let’s say we only optimize for one exact input-output pair. In reality we optimize over a batch of input-output pairs.
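As a sanity check on the number 35: each neuron has one weight per input plus one bias, so the parameter count follows directly from the 2-4-3-2 layer sizes. A minimal sketch:

```python
# Layer sizes of the example network: input, hidden-1, hidden-2, output
sizes = [2, 4, 3, 2]

# Each layer contributes n_out * n_in weights plus n_out biases
n_params = sum(n_out * (n_in + 1) for n_in, n_out in zip(sizes, sizes[1:]))
print(n_params)  # 12 + 15 + 8 = 35
```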
Forward propagation
At this stage we have all 35 weights and biases set (randomly, if it’s our 1st pass). Our aim is to see what our NN will output on the 2 neurons of its output layer for given values of the input neurons.
Input Layer: 2 neurons are set, for example:
$\textcolor{red}{a_0^{I}} = 0.3, \textcolor{red}{a_1^{I}} = 0.5$
Hidden-1 Layer: 4 neurons are set
$\textcolor{grey}{z_0^{H1}} = (\textcolor{red}{a_0^{I}}w_{00}^{H1} + \textcolor{red}{a_1^{I}}w_{01}^{H1}) + b_0^{H1}$
$\textcolor{grey}{z_1^{H1}} = (\textcolor{red}{a_0^{I}}w_{10}^{H1} + \textcolor{red}{a_1^{I}}w_{11}^{H1}) + b_1^{H1}$
$\textcolor{grey}{z_2^{H1}} = (\textcolor{red}{a_0^{I}}w_{20}^{H1} + \textcolor{red}{a_1^{I}}w_{21}^{H1}) + b_2^{H1}$
$\textcolor{grey}{z_3^{H1}} = (\textcolor{red}{a_0^{I}}w_{30}^{H1} + \textcolor{red}{a_1^{I}}w_{31}^{H1}) + b_3^{H1}$

And then activated:
$\textcolor{teal}{a_0^{H1}} = Sigmoid(\textcolor{grey}{z_0^{H1}})$
$\textcolor{teal}{a_1^{H1}} = Sigmoid(\textcolor{grey}{z_1^{H1}})$
$\textcolor{teal}{a_2^{H1}} = Sigmoid(\textcolor{grey}{z_2^{H1}})$
$\textcolor{teal}{a_3^{H1}} = Sigmoid(\textcolor{grey}{z_3^{H1}})$
Hidden-2 Layer: 3 neurons are set
$\textcolor{magenta}{z_0^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{00}^{H2} + \textcolor{teal}{a_1^{H1}}w_{01}^{H2} + \textcolor{teal}{a_2^{H1}}w_{02}^{H2} + \textcolor{teal}{a_3^{H1}}w_{03}^{H2}) + b_0^{H2}$
$\textcolor{magenta}{z_1^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{10}^{H2} + \textcolor{teal}{a_1^{H1}}w_{11}^{H2} + \textcolor{teal}{a_2^{H1}}w_{12}^{H2} + \textcolor{teal}{a_3^{H1}}w_{13}^{H2}) + b_1^{H2}$
$\textcolor{magenta}{z_2^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{20}^{H2} + \textcolor{teal}{a_1^{H1}}w_{21}^{H2} + \textcolor{teal}{a_2^{H1}}w_{22}^{H2} + \textcolor{teal}{a_3^{H1}}w_{23}^{H2}) + b_2^{H2}$

And then activated:
$\textcolor{goldenrod}{a_0^{H2}} = Sigmoid(\textcolor{magenta}{z_0^{H2}})$
$\textcolor{goldenrod}{a_1^{H2}} = Sigmoid(\textcolor{magenta}{z_1^{H2}})$
$\textcolor{goldenrod}{a_2^{H2}} = Sigmoid(\textcolor{magenta}{z_2^{H2}})$
Output Layer: 2 neurons are set
$\textcolor{lime}{z_0^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{00}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{01}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{02}^{O}) + b_0^{O}$
$\textcolor{lime}{z_1^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{10}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{11}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{12}^{O}) + b_1^{O}$

And then activated:
$\textcolor{blue}{a_0^{O}} = Sigmoid(\textcolor{lime}{z_0^{O}})$
$\textcolor{blue}{a_1^{O}} = Sigmoid(\textcolor{lime}{z_1^{O}})$
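The whole forward pass above is the same layer rule applied three times. A minimal sketch in plain Python; the weight and bias values are made up for illustration (`W[i][j]` corresponds to $w_{ij}$):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(a_prev, W, b):
    # z_i = sum_j w_ij * a_j + b_i, then a_i = Sigmoid(z_i)
    return [sigmoid(sum(w * a for w, a in zip(row, a_prev)) + b_i)
            for row, b_i in zip(W, b)]

# Made-up weights/biases for the 2-4-3-2 network (illustrative values only)
W_h1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.7, -0.1]]
b_h1 = [0.1, -0.3, 0.2, 0.0]
W_h2 = [[0.2, -0.1, 0.5, 0.3], [-0.4, 0.1, 0.2, -0.2], [0.3, 0.6, -0.1, 0.4]]
b_h2 = [0.0, 0.1, -0.2]
W_o = [[0.5, -0.3, 0.2], [-0.1, 0.4, 0.6]]
b_o = [0.1, -0.1]

a_in = [0.3, 0.5]                # input layer
a_h1 = layer(a_in, W_h1, b_h1)   # hidden-1: 4 activations
a_h2 = layer(a_h1, W_h2, b_h2)   # hidden-2: 3 activations
a_out = layer(a_h2, W_o, b_o)    # output: 2 activations
print(a_out)
```

Because every activation goes through the sigmoid, all intermediate and output values land strictly between 0 and 1.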
So, let’s say that with our current weights and biases we got: $a_0^{O} = 0.8, a_1^{O} = 0.3$
And we know that for our input ($a_0^{I} = 0.3, a_1^{I} = 0.5$) we were ideally expecting $y_0 = 1.0, y_1 = 0.0$
The cost function measures how far we are from the desired result. For the previous example:
$C = (\textcolor{blue}{a_0^{O}} - y_0)^2 + (\textcolor{blue}{a_1^{O}} - y_1)^2 = (0.8 - 1.0)^2 + (0.3 - 0.0)^2 = 0.13$
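The cost computation above, as a couple of lines of Python (the values 0.8/0.3 and the targets come from the running example):

```python
def cost(a_out, y):
    # Sum of squared differences between actual and desired outputs
    return sum((a - t) ** 2 for a, t in zip(a_out, y))

C = cost([0.8, 0.3], [1.0, 0.0])
print(C)  # 0.13, up to floating-point rounding
```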
Backward propagation
Math theory
Our task is now to change weights and biases such that cost function $C$ will be smaller than $0.13$.
As a mathematical function, $C$ is a function of 35 variables (the weights and biases of the neural network).
So now we need to find a minimum (global, or at least local) of our mathematical function of 35 variables. How do we do it? We find all 35 partial derivatives of $C$ (which are $∂C/∂w_{ij}$ and $∂C/∂b_k$). Those 35 partial derivatives form the gradient, a vector in the 35-dimensional parameter space of $C$. The direction along the gradient is a good candidate for a direction towards some reasonable minimum of $C$.
Once we have this direction, we update all our weights and biases as (example for $w_{21}$):
$w_{21}^{H1_{upd}} = w_{21}^{H1_{org}} + r*∂C/∂w_{21}^{H1_{org}}$
- $r$ is the learning rate — mathematically it is a scaling coefficient for the gradient vector
- make it too small — and you’ll need more training steps, plus you won’t be able to get out of local minima
- make it too big — and you will be overshooting your minimum back and forth
- $r$ is taken to be $< 0$ (usually of a value around $-0.01$), because we search for a minimum, not a maximum
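The update rule can be demonstrated on a toy one-dimensional cost $C(w) = (w-2)^2$: repeatedly adding $r \cdot dC/dw$ with a negative $r$ walks $w$ toward the minimum at $w = 2$. This is a sketch of the idea only, not the full 35-parameter update:

```python
def dC_dw(w):
    # Derivative of the toy cost C(w) = (w - 2)**2
    return 2.0 * (w - 2.0)

r = -0.1   # negative learning rate, following the convention above
w = 0.0    # arbitrary starting point
for _ in range(100):
    w = w + r * dC_dw(w)   # w_upd = w_org + r * dC/dw
print(w)  # close to 2.0, the minimum of the toy cost
```

With $|r|$ larger than 1 for this cost, the iterates would diverge: that is the "overshooting back and forth" failure mode.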
The question is now how to efficiently find those 35 partial derivatives.
Deriving backward propagation formulas
Backpropagation starts at output layer and finishes at input layer.
We will further need the sigmoid derivative.
The sigmoid function $y(x) = 1/(1+e^{-x})$ has the derivative $dy/dx = y(1-y)$.
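The identity $dy/dx = y(1-y)$ is easy to check numerically against a finite-difference derivative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.7
y = sigmoid(x)
analytic = y * (1.0 - y)                # the claimed derivative y*(1-y)

eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)
print(abs(analytic - numeric))          # tiny: the two expressions agree
```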
Output layer
To find the derivatives for the output layer we will need the following formulas (given above):
$C = (\textcolor{blue}{a_0^{O}} - y_0)^2 + (\textcolor{blue}{a_1^{O}} - y_1)^2$
Output layer:
$\textcolor{blue}{a_0^{O}} = Sigmoid(\textcolor{lime}{z_0^{O}})$
$\textcolor{blue}{a_1^{O}} = Sigmoid(\textcolor{lime}{z_1^{O}})$

$\textcolor{lime}{z_0^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{00}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{01}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{02}^{O}) + b_0^{O}$
$\textcolor{lime}{z_1^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{10}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{11}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{12}^{O}) + b_1^{O}$
Derivation process:
Step 1: Let’s trace how one particular derivative ($∂C/∂w_{00}^O$) is found:
There is only one path from $C$ to $w_{00}^O$:
- $C → \textcolor{blue}{a_0^{O}} → \textcolor{lime}{z_0^{O}} → w_{00}^O$
Following this path we get:
$$ \frac{∂C}{∂w_{00}^O} = \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \frac{\textcolor{lime}{∂z_0^O}}{∂w_{00}^O} = \underbrace{2(\textcolor{blue}{a_0^O} - y_0)}_{\text{by cost function}} * \underbrace{\textcolor{blue}{a_0^O} (1 - \textcolor{blue}{a_0^O})}_{\text{by activation}} * \underbrace{\textcolor{goldenrod}{a_0^{H2}} }_{\text{by layer}} $$

Notice that the final derivatives are only true for our particular choice of cost, activation and layer types.
Step 2: repeat the same logic for all other weights:
$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{00}^O} \\ \frac{∂C}{∂w_{01}^O} \\ \frac{∂C}{∂w_{02}^O} \\ \frac{∂C}{∂b_0^O} \end{pmatrix} = \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \begin{pmatrix} \frac{\textcolor{lime}{∂z_0^O}}{∂w_{00}^O} \\ \frac{\textcolor{lime}{∂z_0^O}}{∂w_{01}^O} \\ \frac{\textcolor{lime}{∂z_0^O}}{∂w_{02}^O} \\ \frac{\textcolor{lime}{∂z_0^O}}{∂b_0^O} \end{pmatrix} = 2(\textcolor{blue}{a_0^O} - y_0) * \textcolor{blue}{a_0^O} (1 - \textcolor{blue}{a_0^O}) * \begin{pmatrix} \textcolor{goldenrod}{a_0^{H2}} \\ \textcolor{goldenrod}{a_1^{H2}} \\ \textcolor{goldenrod}{a_2^{H2}} \\ 1 \end{pmatrix} $$
$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{10}^O} \\ \frac{∂C}{∂w_{11}^O} \\ \frac{∂C}{∂w_{12}^O} \\ \frac{∂C}{∂b_1^O} \end{pmatrix} = \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} * \begin{pmatrix} \frac{\textcolor{lime}{∂z_1^O}}{∂w_{10}^O} \\ \frac{\textcolor{lime}{∂z_1^O}}{∂w_{11}^O} \\ \frac{\textcolor{lime}{∂z_1^O}}{∂w_{12}^O} \\ \frac{\textcolor{lime}{∂z_1^O}}{∂b_1^O} \end{pmatrix} = 2(\textcolor{blue}{a_1^O} - y_1) * \textcolor{blue}{a_1^O} (1 - \textcolor{blue}{a_1^O}) * \begin{pmatrix} \textcolor{goldenrod}{a_0^{H2}} \\ \textcolor{goldenrod}{a_1^{H2}} \\ \textcolor{goldenrod}{a_2^{H2}} \\ 1 \end{pmatrix} $$
Step 3: above can be condensed in nice matrix formulas.
General formula (it doesn’t depend on exact cost/activation/layers type):
$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^O} & \frac{∂C}{∂w_{01}^O} & \frac{∂C}{∂w_{02}^O} & \frac{∂C}{∂b_0^O} \\ \frac{∂C}{∂w_{10}^O} & \frac{∂C}{∂w_{11}^O} & \frac{∂C}{∂w_{12}^O} & \frac{∂C}{∂b_1^O} \\ \end{vmatrix} = \begin{vmatrix} \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} & 0 \\ 0 & \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} \end{vmatrix} * \begin{vmatrix} \frac{\textcolor{lime}{∂z_0^O}}{∂w_{00}^O} & \frac{\textcolor{lime}{∂z_0^O}}{∂w_{01}^O} & \frac{\textcolor{lime}{∂z_0^O}}{∂w_{02}^O} & \frac{\textcolor{lime}{∂z_0^O}}{∂b_0^O} \\ \frac{\textcolor{lime}{∂z_1^O}}{∂w_{10}^O} & \frac{\textcolor{lime}{∂z_1^O}}{∂w_{11}^O} & \frac{\textcolor{lime}{∂z_1^O}}{∂w_{12}^O} & \frac{\textcolor{lime}{∂z_1^O}}{∂b_1^O} \end{vmatrix} $
Matrix sizes: $(2 \times 4) = (2 \times 2 \text{, diagonal}) * (2 \times 4)$

Formula for our case of cost/activation/layer types:
$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^O} & \frac{∂C}{∂w_{01}^O} & \frac{∂C}{∂w_{02}^O} & \frac{∂C}{∂b_0^O} \\ \frac{∂C}{∂w_{10}^O} & \frac{∂C}{∂w_{11}^O} & \frac{∂C}{∂w_{12}^O} & \frac{∂C}{∂b_1^O} \\ \end{vmatrix} = \begin{pmatrix} 2(\textcolor{blue}{a_0^O} - y_0) * \textcolor{blue}{a_0^O} (1 - \textcolor{blue}{a_0^O}) \\ 2(\textcolor{blue}{a_1^O} - y_1) * \textcolor{blue}{a_1^O} (1 - \textcolor{blue}{a_1^O}) \end{pmatrix} * \begin{pmatrix} \textcolor{goldenrod}{a_0^{H2}} \\ \textcolor{goldenrod}{a_1^{H2}} \\ \textcolor{goldenrod}{a_2^{H2}} \\ 1 \end{pmatrix} ^ T $
Matrix sizes: $(2 \times 4) = (2 \times 1) * (1 \times 4)$
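The condensed formula is a column vector times a transposed column, i.e. an outer product. A NumPy sketch using the running example's output values and made-up H2 activations:

```python
import numpy as np

a_O = np.array([0.8, 0.3])        # output activations from the example
y = np.array([1.0, 0.0])          # target values
a_H2 = np.array([0.6, 0.2, 0.9])  # made-up H2 activations for illustration

# Column vector: 2(a - y) * a * (1 - a), one entry per output neuron
delta_O = 2.0 * (a_O - y) * a_O * (1.0 - a_O)

# Outer product with (a_H2, 1): row i holds dC/dw_i0..dC/dw_i2 and dC/db_i
grad_O = np.outer(delta_O, np.append(a_H2, 1.0))
print(grad_O.shape)  # (2, 4)
```

The appended 1 in the second factor is what makes the last column come out as the bias gradients.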
Notice that derivatives for output layer weight/biases depend only on:
- activated values of O-layer neurons
- activated values of H2-layer neurons
- target values $y_0, y_1$
Hidden-2 Layer
Remember that:
$C = (\textcolor{blue}{a_0^{O}} - y_0)^2 + (\textcolor{blue}{a_1^{O}} - y_1)^2$
Output layer:
$\textcolor{blue}{a_0^{O}} = Sigmoid(\textcolor{lime}{z_0^{O}})$
$\textcolor{blue}{a_1^{O}} = Sigmoid(\textcolor{lime}{z_1^{O}})$

$\textcolor{lime}{z_0^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{00}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{01}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{02}^{O}) + b_0^{O}$
$\textcolor{lime}{z_1^{O}} = (\textcolor{goldenrod}{a_0^{H2}}w_{10}^{O} + \textcolor{goldenrod}{a_1^{H2}}w_{11}^{O} + \textcolor{goldenrod}{a_2^{H2}}w_{12}^{O}) + b_1^{O}$

Hidden layer:
$\textcolor{goldenrod}{a_0^{H2}} = Sigmoid(\textcolor{magenta}{z_0^{H2}})$
$\textcolor{goldenrod}{a_1^{H2}} = Sigmoid(\textcolor{magenta}{z_1^{H2}})$
$\textcolor{goldenrod}{a_2^{H2}} = Sigmoid(\textcolor{magenta}{z_2^{H2}})$

$\textcolor{magenta}{z_0^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{00}^{H2} + \textcolor{teal}{a_1^{H1}}w_{01}^{H2} + \textcolor{teal}{a_2^{H1}}w_{02}^{H2} + \textcolor{teal}{a_3^{H1}}w_{03}^{H2}) + b_0^{H2}$
$\textcolor{magenta}{z_1^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{10}^{H2} + \textcolor{teal}{a_1^{H1}}w_{11}^{H2} + \textcolor{teal}{a_2^{H1}}w_{12}^{H2} + \textcolor{teal}{a_3^{H1}}w_{13}^{H2}) + b_1^{H2}$
$\textcolor{magenta}{z_2^{H2}} = (\textcolor{teal}{a_0^{H1}}w_{20}^{H2} + \textcolor{teal}{a_1^{H1}}w_{21}^{H2} + \textcolor{teal}{a_2^{H1}}w_{22}^{H2} + \textcolor{teal}{a_3^{H1}}w_{23}^{H2}) + b_2^{H2}$
Derivation process:
$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{00}^{H2}} \\ \frac{∂C}{∂w_{01}^{H2}} \\ \frac{∂C}{∂w_{02}^{H2}} \\ \frac{∂C}{∂w_{03}^{H2}} \\ \frac{∂C}{∂b_0^{H2}} \end{pmatrix} = \underbrace{( \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_0^{H2}} } + \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} * \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_0^{H2}} } )}_{\text{there are 2 paths (= neurons on outp layer) \newline to reach derivative}} * \frac{\textcolor{goldenrod}{∂a_0^{H2}} }{\textcolor{magenta}{∂z_0^{H2}} } * \begin{pmatrix} \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂w_{00}^{H2}} \\ \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂w_{01}^{H2}} \\ \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂w_{02}^{H2}} \\ \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂w_{03}^{H2}} \\ \frac{\textcolor{magenta}{∂z_0^{H2}} }{∂b_0^{H2}} \end{pmatrix} $$
$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{10}^{H2}} \\ \frac{∂C}{∂w_{11}^{H2}} \\ \frac{∂C}{∂w_{12}^{H2}} \\ \frac{∂C}{∂w_{13}^{H2}} \\ \frac{∂C}{∂b_1^{H2}} \end{pmatrix} = ( \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_1^{H2}} } + \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} * \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_1^{H2}} } ) * \frac{\textcolor{goldenrod}{∂a_1^{H2}} }{\textcolor{magenta}{∂z_1^{H2}} } * \begin{pmatrix} \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂w_{10}^{H2}} \\ \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂w_{11}^{H2}} \\ \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂w_{12}^{H2}} \\ \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂w_{13}^{H2}} \\ \frac{\textcolor{magenta}{∂z_1^{H2}} }{∂b_1^{H2}} \end{pmatrix} $$
$$ \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{20}^{H2}} \\ \frac{∂C}{∂w_{21}^{H2}} \\ \frac{∂C}{∂w_{22}^{H2}} \\ \frac{∂C}{∂w_{23}^{H2}} \\ \frac{∂C}{∂b_2^{H2}} \end{pmatrix} = ( \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} * \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_2^{H2}} } + \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} * \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_2^{H2}} } ) * \frac{\textcolor{goldenrod}{∂a_2^{H2}} }{\textcolor{magenta}{∂z_2^{H2}} } * \begin{pmatrix} \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂w_{20}^{H2}} \\ \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂w_{21}^{H2}} \\ \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂w_{22}^{H2}} \\ \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂w_{23}^{H2}} \\ \frac{\textcolor{magenta}{∂z_2^{H2}} }{∂b_2^{H2}} \end{pmatrix} $$
Or for our choice of cost/layer/activation:
$$ \def\A0{\textcolor{blue}{a_0^O}} \def\B0{\textcolor{blue}{a_1^O}} \def\C0{\textcolor{goldenrod}{a_0^{H2}} } \def\X0{\colorbox{lightblue}{$2(\A0 - y_0) * \A0(1 - \A0)$}} \def\Y0{\colorbox{lightblue}{$2(\B0 - y_1) * \B0(1 - \B0)$}} \def\C0{\textcolor{goldenrod}{a_0^{H2}} } \def\arraystretch{1.6} \footnotesize \begin{pmatrix} \frac{∂C}{∂w_{00}^{H2}} \\ \frac{∂C}{∂w_{01}^{H2}} \\ \frac{∂C}{∂w_{02}^{H2}} \\ \frac{∂C}{∂w_{03}^{H2}} \\ \frac{∂C}{∂b_0^{H2}} \end{pmatrix} = (\X0 * w_{00}^O + \Y0 * w_{10}^O) * \C0(1 - \C0) * \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} \newline \def\C0{\textcolor{goldenrod}{a_1^{H2}} } \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{10}^{H2}} \\ \frac{∂C}{∂w_{11}^{H2}} \\ \frac{∂C}{∂w_{12}^{H2}} \\ \frac{∂C}{∂w_{13}^{H2}} \\ \frac{∂C}{∂b_1^{H2}} \end{pmatrix} = (\X0 * w_{01}^O + \Y0 * w_{11}^O) * \C0(1 - \C0) * \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} \newline \def\C0{\textcolor{goldenrod}{a_2^{H2}} } \def\arraystretch{1.6} \begin{pmatrix} \frac{∂C}{∂w_{20}^{H2}} \\ \frac{∂C}{∂w_{21}^{H2}} \\ \frac{∂C}{∂w_{22}^{H2}} \\ \frac{∂C}{∂w_{23}^{H2}} \\ \frac{∂C}{∂b_2^{H2}} \end{pmatrix} = (\X0 * w_{02}^O + \Y0 * w_{12}^O) * \C0(1 - \C0) * \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} $$
Above can be summarized into matrix formula:
$$ \def\arraystretch{1.6} \small \begin{vmatrix} \frac{∂C}{∂w_{00}^{H2}} & \frac{∂C}{∂w_{01}^{H2}} & \frac{∂C}{∂w_{02}^{H2}} & \frac{∂C}{∂w_{03}^{H2}} & \frac{∂C}{∂b_0^{H2}} \\ \frac{∂C}{∂w_{10}^{H2}} & \frac{∂C}{∂w_{11}^{H2}} & \frac{∂C}{∂w_{12}^{H2}} & \frac{∂C}{∂w_{13}^{H2}} & \frac{∂C}{∂b_1^{H2}} \\ \frac{∂C}{∂w_{20}^{H2}} & \frac{∂C}{∂w_{21}^{H2}} & \frac{∂C}{∂w_{22}^{H2}} & \frac{∂C}{∂w_{23}^{H2}} & \frac{∂C}{∂b_2^{H2}} \end{vmatrix} = \begin{vmatrix} \frac{\textcolor{goldenrod}{∂a_0^{H2}} }{\textcolor{magenta}{∂z_0^{H2}} } & 0 & 0 \\ 0 & \frac{\textcolor{goldenrod}{∂a_1^{H2}} }{\textcolor{magenta}{∂z_1^{H2}} } & 0 \\ 0 & 0 & \frac{\textcolor{goldenrod}{∂a_2^{H2}} }{\textcolor{magenta}{∂z_2^{H2}} } \end{vmatrix} \begin{vmatrix} \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_0^{H2}} } & \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_0^{H2}} } \\ \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_1^{H2}} } & \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_1^{H2}} } \\ \frac{\textcolor{lime}{∂z_0^O}}{\textcolor{goldenrod}{∂a_2^{H2}} } & \frac{\textcolor{lime}{∂z_1^O}}{\textcolor{goldenrod}{∂a_2^{H2}} } \end{vmatrix} \begin{pmatrix} \frac{∂C}{\textcolor{blue}{∂a_0^O}} * \frac{\textcolor{blue}{∂a_0^O}}{\textcolor{lime}{∂z_0^O}} \\ \frac{∂C}{\textcolor{blue}{∂a_1^O}} * \frac{\textcolor{blue}{∂a_1^O}}{\textcolor{lime}{∂z_1^O}} \end{pmatrix} \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} ^ T = \newline = \def\AA{\textcolor{goldenrod}{a_0^{H2}} } \def\BB{\textcolor{goldenrod}{a_1^{H2}} } \def\CC{\textcolor{goldenrod}{a_2^{H2}} } \begin{vmatrix} \AA(1 - \AA) & 0 & 0 \\ 0 & \BB(1 - \BB) & 0 \\ 0 & 0 & \CC(1 - \CC) \end{vmatrix} \begin{vmatrix} w_{00}^O & w_{10}^O \\ w_{01}^O & w_{11}^O \\ w_{02}^O & w_{12}^O \end{vmatrix} \def\A0{\textcolor{blue}{a_0^O}} \def\B0{\textcolor{blue}{a_1^O}} \def\C0{\textcolor{goldenrod}{a_0^{H2}} } 
\def\X0{2(\A0 - y_0) * \A0(1 - \A0)} \def\Y0{2(\B0 - y_1) * \B0(1 - \B0)} \begin{pmatrix} \X0 \\ \Y0 \end{pmatrix} \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} ^ T $$
Matrix sizes: $(3 \times 5) = (3 \times 3 \text{, diagonal}) * (3 \times 2) * (2 \times 1) * (1 \times 5)$
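The same matrix chain in NumPy, with made-up forward-pass values and output-layer weights (`W_O[i][j]` corresponds to $w_{ij}^O$; note the formula uses its transpose):

```python
import numpy as np

# Made-up forward-pass values for illustration
a_H1 = np.array([0.5, 0.1, 0.8, 0.4])   # hidden-1 activations
a_H2 = np.array([0.6, 0.2, 0.9])        # hidden-2 activations
a_O = np.array([0.8, 0.3])
y = np.array([1.0, 0.0])
W_O = np.array([[0.5, -0.3, 0.2],       # output-layer weights, w_ij^O
                [-0.1, 0.4, 0.6]])

delta_O = 2.0 * (a_O - y) * a_O * (1.0 - a_O)           # (2,) column
M_h2 = np.diag(a_H2 * (1.0 - a_H2)) @ W_O.T @ delta_O   # (3,) column
grad_H2 = np.outer(M_h2, np.append(a_H1, 1.0))          # (3, 5) gradient matrix
print(grad_H2.shape)  # (3, 5)
```

Row $i$ of `grad_H2` holds $∂C/∂w_{i0}^{H2} … ∂C/∂w_{i3}^{H2}$ and $∂C/∂b_i^{H2}$, matching the expanded per-neuron formulas above.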
Hidden-1 Layer
We can derive derivatives by induction.
Derivatives on Output layer: $$ \def\arraystretch{1.6} M_{outp} = \begin{pmatrix} 2(\textcolor{blue}{a_0^O} - y_0) * \textcolor{blue}{a_0^O} (1 - \textcolor{blue}{a_0^O}) \\ 2(\textcolor{blue}{a_1^O} - y_1) * \textcolor{blue}{a_1^O} (1 - \textcolor{blue}{a_1^O}) \end{pmatrix} $$
$$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^O} & \frac{∂C}{∂w_{01}^O} & \frac{∂C}{∂w_{02}^O} & \frac{∂C}{∂b_0^O} \\ \frac{∂C}{∂w_{10}^O} & \frac{∂C}{∂w_{11}^O} & \frac{∂C}{∂w_{12}^O} & \frac{∂C}{∂b_1^O} \\ \end{vmatrix} = M_{outp} * \begin{pmatrix} \textcolor{goldenrod}{a_0^{H2}} \\ \textcolor{goldenrod}{a_1^{H2}} \\ \textcolor{goldenrod}{a_2^{H2}} \\ 1 \end{pmatrix} ^ T $$
Derivatives on Hidden-2 Layer: $$ M_{h2} = \def\AA{\textcolor{goldenrod}{a_0^{H2}} } \def\BB{\textcolor{goldenrod}{a_1^{H2}} } \def\CC{\textcolor{goldenrod}{a_2^{H2}} } \def\arraystretch{1.6} \begin{vmatrix} \AA(1 - \AA) & 0 & 0 \\ 0 & \BB(1 - \BB) & 0 \\ 0 & 0 & \CC(1 - \CC) \end{vmatrix} \begin{vmatrix} w_{00}^O & w_{10}^O \\ w_{01}^O & w_{11}^O \\ w_{02}^O & w_{12}^O \end{vmatrix} * M_{outp} $$
$$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^{H2}} & \frac{∂C}{∂w_{01}^{H2}} & \frac{∂C}{∂w_{02}^{H2}} & \frac{∂C}{∂w_{03}^{H2}} & \frac{∂C}{∂b_0^{H2}} \\ \frac{∂C}{∂w_{10}^{H2}} & \frac{∂C}{∂w_{11}^{H2}} & \frac{∂C}{∂w_{12}^{H2}} & \frac{∂C}{∂w_{13}^{H2}} & \frac{∂C}{∂b_1^{H2}} \\ \frac{∂C}{∂w_{20}^{H2}} & \frac{∂C}{∂w_{21}^{H2}} & \frac{∂C}{∂w_{22}^{H2}} & \frac{∂C}{∂w_{23}^{H2}} & \frac{∂C}{∂b_2^{H2}} \end{vmatrix} = M_{h2} * \begin{pmatrix} \textcolor{teal}{a_0^{H1}} \\ \textcolor{teal}{a_1^{H1}} \\ \textcolor{teal}{a_2^{H1}} \\ \textcolor{teal}{a_3^{H1}} \\ 1 \end{pmatrix} ^ T $$
By induction, derivatives on Hidden-1 layer:
$$ \small \def\arraystretch{1.6} \def\AA{\textcolor{teal}{a_0^{H1}} } \def\BB{\textcolor{teal}{a_1^{H1}} } \def\CC{\textcolor{teal}{a_2^{H1}} } \def\DD{\textcolor{teal}{a_3^{H1}} } M_{h1} = \begin{vmatrix} \AA(1 - \AA) & 0 & 0 & 0 \\ 0 & \BB(1 - \BB) & 0 & 0 \\ 0 & 0 & \CC(1 - \CC) & 0 \\ 0 & 0 & 0 & \DD(1 - \DD) \end{vmatrix} \begin{vmatrix} w_{00}^{H2} & w_{10}^{H2} & w_{20}^{H2} \\ w_{01}^{H2} & w_{11}^{H2} & w_{21}^{H2} \\ w_{02}^{H2} & w_{12}^{H2} & w_{22}^{H2} \\ w_{03}^{H2} & w_{13}^{H2} & w_{23}^{H2} \\ \end{vmatrix} * M_{h2} $$
$$ \def\arraystretch{1.6} \begin{vmatrix} \frac{∂C}{∂w_{00}^{H1}} & \frac{∂C}{∂w_{01}^{H1}} & \frac{∂C}{∂b_{0}^{H1}} \\ \frac{∂C}{∂w_{10}^{H1}} & \frac{∂C}{∂w_{11}^{H1}} & \frac{∂C}{∂b_{1}^{H1}} \\ \frac{∂C}{∂w_{20}^{H1}} & \frac{∂C}{∂w_{21}^{H1}} & \frac{∂C}{∂b_{2}^{H1}} \\ \frac{∂C}{∂w_{30}^{H1}} & \frac{∂C}{∂w_{31}^{H1}} & \frac{∂C}{∂b_{3}^{H1}} \\ \end{vmatrix} = M_{h1} * \begin{pmatrix} \textcolor{red}{a_0^{I}} \\ \textcolor{red}{a_1^{I}} \\ 1 \end{pmatrix} ^ T $$
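The whole $M_{outp} → M_{h2} → M_{h1}$ recursion fits in a short loop, and the induction can be sanity-checked numerically: the analytic gradient of any one weight should match a finite-difference estimate of $∂C/∂w$. A sketch with random weights, using the sizes and conventions of this document (`W[i][j]` corresponds to $w_{ij}$):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [2, 4, 3, 2]
Ws = [rng.uniform(-1.0, 1.0, (n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
bs = [rng.uniform(-1.0, 1.0, n_out) for n_out in sizes[1:]]
x = np.array([0.3, 0.5])
y = np.array([1.0, 0.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Ws, bs, x):
    acts = [x]
    for W, b in zip(Ws, bs):
        acts.append(sigmoid(W @ acts[-1] + b))
    return acts

def cost(Ws, bs):
    return float(np.sum((forward(Ws, bs, x)[-1] - y) ** 2))

# Backward pass: M starts as M_outp, then is pushed back layer by layer
acts = forward(Ws, bs, x)
M = 2.0 * (acts[-1] - y) * acts[-1] * (1.0 - acts[-1])   # M_outp
grads = [None] * len(Ws)
for l in range(len(Ws) - 1, -1, -1):
    # Gradient matrix of layer l: M times (previous activations, 1)^T
    grads[l] = np.outer(M, np.append(acts[l], 1.0))
    if l > 0:
        # M_prev = diag(a * (1 - a)) * W_l^T * M
        M = acts[l] * (1.0 - acts[l]) * (Ws[l].T @ M)

# Finite-difference check on one hidden-1 weight
eps = 1e-6
Ws[0][1, 0] += eps
c_plus = cost(Ws, bs)
Ws[0][1, 0] -= 2.0 * eps
c_minus = cost(Ws, bs)
Ws[0][1, 0] += eps
numeric = (c_plus - c_minus) / (2.0 * eps)
print(abs(grads[0][1, 0] - numeric))  # small: the recursion checks out
```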
This is indeed the correct induction: at every step the new $M$ matrix is obtained from the previous one by multiplying with the diagonal matrix of activation derivatives and the transposed weight matrix of the next layer. It can also be verified numerically by comparing analytic gradients against finite differences.
Updating weights and biases
As described in the math chapter on backward propagation, we now update all our 35 weights and biases according to the found gradient and the selected learning rate.
Then we repeat the optimization cycle as many times as required to reach a reasonably small cost function.
Optimizing on batch
Essentially, yes: for a batch we run forward and backward propagation for each input-output pair separately, sum (or average) the per-pair costs and gradients, and then perform a single weight update with the combined gradient. Averaging keeps the effective learning rate independent of the batch size.
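The batching idea on a toy example: per-example costs $C_k(w) = (w - t_k)^2$, whose averaged batch cost is minimized at the mean of the targets. Summing the per-example gradients and dividing by the batch size is the whole trick (toy values, for illustration only):

```python
targets = [1.0, 2.0, 6.0]   # one toy "training example" per target
r = -0.1                    # negative learning rate, as above
w = 0.0

for _ in range(200):
    # Per-example gradients of C_k(w) = (w - t)^2, summed then averaged
    grad = sum(2.0 * (w - t) for t in targets) / len(targets)
    w = w + r * grad

print(w)  # close to mean(targets) = 3.0
```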