02 Convolutional NN

Convolution, Forward pass, Backpropagation

Table of Contents

Convolution

1D-convolution

Definitions (calculation examples)

Non-full convolution (referred to as “valid” in numpy):

Output size = Input size - Kernel size + 1

$$ Conv_{valid}\Big( \begin{pmatrix} a \\ b \\ c \\ d \\ e \end{pmatrix}, \begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) = \begin{vmatrix} k_1 & k_2 & 0 & 0 & 0 \\ 0 & k_1 & k_2 & 0 & 0 \\ 0 & 0 & k_1 & k_2 & 0 \\ 0 & 0 & 0 & k_1 & k_2 \end{vmatrix} * \begin{pmatrix} a \\ b \\ c \\ d \\ e \end{pmatrix} = \begin{pmatrix} a * k_1 + b * k_2 \\ b * k_1 + c * k_2 \\ c * k_1 + d * k_2 \\ d * k_1 + e * k_2 \end{pmatrix} $$
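The matrix form above can be checked numerically. A minimal sketch with numpy, using illustrative values a..e = 1..5 and k1 = 10, k2 = 100 (the numbers are assumptions for demonstration only); note that `np.convolve` flips the kernel, so reproducing this exact sliding-window formula requires a reversed kernel (or `np.correlate`):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # (a, b, c, d, e)
k = np.array([10.0, 100.0])                # (k1, k2)

# Sliding-window form from the matrix above: out[i] = a[i]*k1 + a[i+1]*k2
valid = np.array([a[i] * k[0] + a[i + 1] * k[1]
                  for i in range(len(a) - len(k) + 1)])
print(valid)  # [210. 320. 430. 540.]

# np.convolve flips the kernel, so pass it reversed to match the matrix;
# np.correlate slides without flipping and matches directly.
assert np.allclose(valid, np.convolve(a, k[::-1], mode="valid"))
assert np.allclose(valid, np.correlate(a, k, mode="valid"))
```

The output has 5 - 2 + 1 = 4 elements, matching the formula above.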

Full convolution (referred to as “full” in numpy).
In full convolution the kernel is flipped before sliding.

Output size = Input size + Kernel size - 1

$$ Conv_{full}\Big( \begin{pmatrix} a \\ b \\ c \\ d \\ e \end{pmatrix}, \begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) = \begin{vmatrix} k_2 & k_1 & 0 & 0 & 0 & 0 & 0 \\ 0 & k_2 & k_1 & 0 & 0 & 0 & 0 \\ 0 & 0 & k_2 & k_1 & 0 & 0 & 0 \\ 0 & 0 & 0 & k_2 & k_1 & 0 & 0 \\ 0 & 0 & 0 & 0 & k_2 & k_1 & 0 \\ 0 & 0 & 0 & 0 & 0 & k_2 & k_1 \end{vmatrix} * \begin{pmatrix} 0 \\ a \\ b \\ c \\ d \\ e \\ 0\end{pmatrix} $$
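The same matrix can be built by hand: flip the kernel, zero-pad the input on both sides, and slide. A minimal sketch (same illustrative values as before); `np.convolve` in `"full"` mode matches this directly since it flips the kernel itself:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([10.0, 100.0])                 # (k1, k2)

padded = np.concatenate(([0.0], a, [0.0]))  # (0, a, b, c, d, e, 0)
flipped = k[::-1]                           # (k2, k1)
full = np.array([padded[i:i + 2] @ flipped
                 for i in range(len(padded) - 1)])
print(full)  # output size = 5 + 2 - 1 = 6

assert np.allclose(full, np.convolve(a, k, mode="full"))
```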

Related to convolution: cross-correlation.
Similar to full convolution, but without the kernel flip.

$$ CrossCorr_{full}\Big( \begin{pmatrix} a \\ b \\ c \\ d \\ e \end{pmatrix}, \begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) = \begin{vmatrix} k_1 & k_2 & 0 & 0 & 0 & 0 & 0 \\ 0 & k_1 & k_2 & 0 & 0 & 0 & 0 \\ 0 & 0 & k_1 & k_2 & 0 & 0 & 0 \\ 0 & 0 & 0 & k_1 & k_2 & 0 & 0 \\ 0 & 0 & 0 & 0 & k_1 & k_2 & 0 \\ 0 & 0 & 0 & 0 & 0 & k_1 & k_2 \end{vmatrix} * \begin{pmatrix} 0 \\ a \\ b \\ c \\ d \\ e \\ 0\end{pmatrix} $$
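The no-flip version can be verified the same way (illustrative values again); `np.correlate` in `"full"` mode implements exactly this:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([10.0, 100.0])

padded = np.concatenate(([0.0], a, [0.0]))
# Same sliding as full convolution, but the kernel is NOT flipped
xcorr = np.array([padded[i:i + 2] @ k
                  for i in range(len(padded) - 1)])

assert np.allclose(xcorr, np.correlate(a, k, mode="full"))
```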

Bias

Let’s say we calculated the convolution of vector a and kernel k:

$$Conv(\overrightarrow{a}, \overrightarrow{k}) = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix}$$

If bias is a scalar (in NN application, this is usually the case): $$ \overrightarrow{r} + b = \begin{pmatrix} r_1 + b \\ r_2 + b \\ r_3 + b \end{pmatrix}$$

If bias is a vector: $$ \overrightarrow{r} + \overrightarrow{b} = \begin{pmatrix} r_1 + b_1 \\ r_2 + b_2 \\ r_3 + b_3 \end{pmatrix}$$
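Both cases fall out of numpy broadcasting, shown here with illustrative values:

```python
import numpy as np

r = np.array([1.0, 2.0, 3.0])            # convolution result (r1, r2, r3)

print(r + 0.5)                            # scalar bias broadcasts: [1.5 2.5 3.5]
print(r + np.array([0.1, 0.2, 0.3]))      # per-element vector bias: [1.1 2.2 3.3]
```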

Stride

Stride is essentially the kernel step size (the number of elements the kernel is moved ahead on each step).

Formula for stride = 1:

$$ Conv_{valid}^{stride1}\Big( \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}, \begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) = \begin{vmatrix} k_1 & k_2 & 0 & 0\\ 0 & k_1 & k_2 & 0\\ 0 & 0 & k_1 & k_2\\ \end{vmatrix} * \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} $$

Formula for stride = 2:

$$ Conv_{valid}^{stride2}\Big( \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}, \begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) = \begin{vmatrix} k_1 & k_2 & 0 & 0\\ 0 & 0 & k_1 & k_2 \end{vmatrix} * \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} $$

Usually the stride is selected to fit perfectly into the current combination of vector and kernel sizes. When it doesn’t fit, the usual approach is to simply take the last kernel position that still fits.

For example for stride = 3:

$$ Conv_{valid}^{stride3}\Big( \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}, \begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) = \begin{vmatrix} k_1 & k_2 & 0 & 0 \end{vmatrix} * \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} $$
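The three stride matrices above correspond to taking every `stride`-th kernel position that still fits. A minimal sketch with the hypothetical helper `conv_valid_strided` (same illustrative values, no kernel flip, matching the matrices above):

```python
import numpy as np

def conv_valid_strided(a, k, stride=1):
    """Valid convolution that keeps every `stride`-th kernel position
    that still fits inside the input (no kernel flip)."""
    n, m = len(a), len(k)
    return np.array([a[i:i + m] @ k for i in range(0, n - m + 1, stride)])

a = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([10.0, 100.0])
print(conv_valid_strided(a, k, 1))  # [210. 320. 430.]
print(conv_valid_strided(a, k, 2))  # [210. 430.]
print(conv_valid_strided(a, k, 3))  # [210.]
```

With stride = 3 only the first position fits, which is exactly the one-row matrix above.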

Padding

Padding is adding zeroes to the left and right of the input vector:

Then the kernel is applied to the padded vector. That’s all.
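In numpy this is a one-liner with `np.pad` (illustrative values):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
padded = np.pad(a, pad_width=1)   # one zero on each side
print(padded)                     # [0. 1. 2. 3. 0.]

# A valid convolution of the padded vector keeps the input length
k = np.array([10.0, 100.0])
print(np.correlate(padded, k, mode="valid"))
```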

2D-convolution

All concepts are the same as for 1D convolution.

Example picture for:
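A direct 2D analogue of the 1D sliding-window form can be sketched as follows (the input and kernel values are illustrative assumptions; no kernel flip, matching the 1D matrices above):

```python
import numpy as np

def conv2d_valid(x, k):
    """2D valid convolution without kernel flip: slide the kernel
    over every position where it fully fits and sum elementwise products."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(16.0).reshape(4, 4)
k = np.array([[1.0, 0.0],
              [0.0, 1.0]])
print(conv2d_valid(x, k))   # 3x3 output: out[i, j] = x[i, j] + x[i+1, j+1]
```

The output size formula applies per dimension: (4 - 2 + 1) x (4 - 2 + 1) = 3x3.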

3D-convolution

Common usage is convolution over color layers.

Idea visualization:
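The idea can be sketched in code: the kernel spans all input channels and the channel dimension is summed out, producing a single 2D output map. The shapes and values below are illustrative assumptions (a 4x4 “RGB” input of ones):

```python
import numpy as np

def conv_over_channels(x, k):
    """Convolution over color layers: kernel covers all channels,
    and the channel dimension is summed into one output map."""
    c, h, w = x.shape        # channels, height, width
    kc, kh, kw = k.shape     # kernel has the same channel count
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[:, i:i + kh, j:j + kw] * k)
    return out

x = np.ones((3, 4, 4))       # e.g. a 4x4 image with 3 color channels
k = np.ones((3, 2, 2))
print(conv_over_channels(x, k))   # every element is 3*2*2 = 12
```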

Pooling

Pooling on 2D data aims to downsample and aggregate the information in a convolution layer.

Example of max pooling (max = takes the maximum element of each submatrix):

Obviously, there are other kinds of pooling, such as average, square-root and others.
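Non-overlapping 2x2 max pooling can be written compactly with a reshape trick (the input values are illustrative):

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling over non-overlapping 2x2 submatrices.
    Reshape splits the matrix into 2x2 blocks, max reduces each block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1.0, 2.0, 5.0, 6.0],
              [3.0, 4.0, 7.0, 8.0],
              [9.0, 1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0, 7.0]])
print(max_pool_2x2(x))   # [[4. 8.] [9. 7.]]
```

Average pooling is the same with `.mean(axis=(1, 3))` instead of `.max`.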

Simple CNN

How a CNN works can be understood through a very simple example.

Consider a NN with the following architecture:

Trainable parameters in this NN:

  1. From conv layer: 2x2 kernel + 1 bias = 4 + 1 params
  2. From FCN layer: 3x4 weights + 3 biases = 12 + 3 params
  3. From output layer: 2x3 weights + 2 biases = 6 + 2 params
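A quick sanity check of the totals, 28 trainable parameters overall:

```python
conv_params = 2 * 2 + 1   # 2x2 kernel + scalar bias
fcn_params = 3 * 4 + 3    # weights matrix + bias vector
out_params = 2 * 3 + 2    # weights matrix + bias vector

print(conv_params + fcn_params + out_params)  # 28
```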

Forward pass:

Input layer:

  1. 5x5 input matrix
    • represents 5x5 gray-scale image
    • element values are from 0.0 to 1.0

Convolution layer:

  1. Convolution
    • kernel 2x2, no padding, stride=1
    • outputs 4x4 matrix
  2. Bias (a scalar value, the same for all elements) is added after the convolution
  3. ReLU activation (per element)

Pooling layer:

  1. Pooling
    • Max pooling on 2x2 submatrices
    • Outputs 2x2 matrix
  2. Matrix 2x2 is flattened into vector of len 4

No activation is required since max pooling is already non-linear.

Still, in other networks it is common to add an activation to this layer.

Fully-connected layer:

  1. Forward pass: $\overrightarrow{a^{FCN}} = |WeightsMatr^{FCN}|*\overrightarrow{a^{Pooling}} + \overrightarrow{b^{FCN}}$
    • 3x4 matrix for weights
    • biases vector of len 3
  2. Sigmoid activation

Output layer:

  1. Forward pass: $\overrightarrow{a^{outp}} = |WeightsMatr^{outp}|*\overrightarrow{a^{FCN}} + \overrightarrow{b^{outp}}$
    • 2x3 matrix for weights
    • biases vector of len 2
  2. Sigmoid activation

Cost function:

$Loss = (z_0^{outp} - y_0)^2 + (z_1^{outp} - y_1)^2$

Where {y₀, y₁} is target output vector.
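The whole forward pass above can be sketched end to end. All parameter values below are random placeholders with the shapes from the text (the helper names `sigmoid` and `conv2d_valid` are assumptions, not part of any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_valid(x, k):
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Hypothetical random parameters matching the shapes in the text
x = rng.random((5, 5))                        # 5x5 gray-scale input
kernel, b_conv = rng.random((2, 2)), 0.1      # 2x2 kernel + scalar bias
W1, b1 = rng.random((3, 4)), rng.random(3)    # FCN layer
W2, b2 = rng.random((2, 3)), rng.random(2)    # output layer

a = np.maximum(0.0, conv2d_valid(x, kernel) + b_conv)  # conv + bias + ReLU -> 4x4
pooled = a.reshape(2, 2, 2, 2).max(axis=(1, 3))        # 2x2 max pool -> 2x2
flat = pooled.reshape(4)                               # flatten -> len 4
z1 = sigmoid(W1 @ flat + b1)                           # FCN + sigmoid -> len 3
z2 = sigmoid(W2 @ z1 + b2)                             # output + sigmoid -> len 2

y = np.array([1.0, 0.0])                               # example target
loss = np.sum((z2 - y) ** 2)
print(z2.shape, loss)
```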

Backpropagation:

Achieved by calculating gradients for all params. Then, as usual, the gradient matrices are added to all params (multiplied by a negative learning rate such as -0.1).
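The update step itself is one line per parameter. A minimal sketch with a hypothetical weight matrix and gradient (illustrative values):

```python
import numpy as np

# Hypothetical parameter matrix and its gradient from backprop
W = np.array([[1.0, 2.0], [3.0, 4.0]])
grad_W = np.array([[0.5, -0.5], [1.0, 0.0]])

lr = 0.1
W = W - lr * grad_W    # same as adding the gradient times a negative rate
print(W)               # [[0.95 2.05] [2.9  4.  ]]
```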