02 Convolutional NN

Convolution, Forward pass, Backpropagation

Table of Contents

Convolution

1D-convolution

Definitions (calculation examples)

Non-full convolution (referred to as “valid” in numpy):

Output size = Input size - Kernel size + 1

$$ Conv_{valid}\Big( \begin{pmatrix} a \\ b \\ c \\ d \\ e \end{pmatrix}, \begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) = \begin{vmatrix} k_1 & k_2 & 0 & 0 & 0 \\ 0 & k_1 & k_2 & 0 & 0 \\ 0 & 0 & k_1 & k_2 & 0 \\ 0 & 0 & 0 & k_1 & k_2 \end{vmatrix} * \begin{pmatrix} a \\ b \\ c \\ d \\ e \end{pmatrix} = \begin{pmatrix} a * k_1 + b * k_2 \\ b * k_1 + c * k_2 \\ c * k_1 + d * k_2 \\ d * k_1 + e * k_2 \end{pmatrix} $$
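The matrix form above can be checked numerically. A minimal sketch with numpy, using illustrative values a..e = 1..5 and k1 = 10, k2 = 100 (the numbers are assumptions for demonstration only); note that `np.convolve` flips the kernel, so reproducing this exact sliding-window formula requires a reversed kernel (or `np.correlate`):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # (a, b, c, d, e)
k = np.array([10.0, 100.0])                # (k1, k2)

# Sliding-window form from the matrix above: out[i] = a[i]*k1 + a[i+1]*k2
valid = np.array([a[i] * k[0] + a[i + 1] * k[1]
                  for i in range(len(a) - len(k) + 1)])
print(valid)  # [210. 320. 430. 540.]

# np.convolve flips the kernel, so pass it reversed to match the matrix;
# np.correlate slides without flipping and matches directly.
assert np.allclose(valid, np.convolve(a, k[::-1], mode="valid"))
assert np.allclose(valid, np.correlate(a, k, mode="valid"))
```

The output has 5 - 2 + 1 = 4 elements, matching the formula above.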

Full convolution (referred to as “full” in numpy).
In full convolution the kernel is flipped before sliding.

Output size = Input size + Kernel size - 1

$$ Conv_{full}\Big( \begin{pmatrix} a \\ b \\ c \\ d \\ e \end{pmatrix}, \begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) = \begin{vmatrix} k_2 & k_1 & 0 & 0 & 0 & 0 & 0 \\ 0 & k_2 & k_1 & 0 & 0 & 0 & 0 \\ 0 & 0 & k_2 & k_1 & 0 & 0 & 0 \\ 0 & 0 & 0 & k_2 & k_1 & 0 & 0 \\ 0 & 0 & 0 & 0 & k_2 & k_1 & 0 \\ 0 & 0 & 0 & 0 & 0 & k_2 & k_1 \end{vmatrix} * \begin{pmatrix} 0 \\ a \\ b \\ c \\ d \\ e \\ 0\end{pmatrix} $$
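The same matrix can be built by hand: flip the kernel, zero-pad the input on both sides, and slide. A minimal sketch (same illustrative values as before); `np.convolve` in `"full"` mode matches this directly since it flips the kernel itself:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([10.0, 100.0])                 # (k1, k2)

padded = np.concatenate(([0.0], a, [0.0]))  # (0, a, b, c, d, e, 0)
flipped = k[::-1]                           # (k2, k1)
full = np.array([padded[i:i + 2] @ flipped
                 for i in range(len(padded) - 1)])
print(full)  # output size = 5 + 2 - 1 = 6

assert np.allclose(full, np.convolve(a, k, mode="full"))
```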

Related to convolution: cross-correlation.
Similar to full convolution, but without the kernel flip.

$$ CrossCorr_{full}\Big( \begin{pmatrix} a \\ b \\ c \\ d \\ e \end{pmatrix}, \begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) = \begin{vmatrix} k_1 & k_2 & 0 & 0 & 0 & 0 & 0 \\ 0 & k_1 & k_2 & 0 & 0 & 0 & 0 \\ 0 & 0 & k_1 & k_2 & 0 & 0 & 0 \\ 0 & 0 & 0 & k_1 & k_2 & 0 & 0 \\ 0 & 0 & 0 & 0 & k_1 & k_2 & 0 \\ 0 & 0 & 0 & 0 & 0 & k_1 & k_2 \end{vmatrix} * \begin{pmatrix} 0 \\ a \\ b \\ c \\ d \\ e \\ 0\end{pmatrix} $$
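The no-flip version can be verified the same way (illustrative values again); `np.correlate` in `"full"` mode implements exactly this:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([10.0, 100.0])

padded = np.concatenate(([0.0], a, [0.0]))
# Same sliding as full convolution, but the kernel is NOT flipped
xcorr = np.array([padded[i:i + 2] @ k
                  for i in range(len(padded) - 1)])

assert np.allclose(xcorr, np.correlate(a, k, mode="full"))
```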

Bias

Let’s say we calculated the convolution of vector a and kernel k:

$$Conv(\overrightarrow{a}, \overrightarrow{k}) = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix}$$

If bias is a scalar (in NN application, this is usually the case): $$ \overrightarrow{r} + b = \begin{pmatrix} r_1 + b \\ r_2 + b \\ r_3 + b \end{pmatrix}$$

If bias is a vector: $$ \overrightarrow{r} + \overrightarrow{b} = \begin{pmatrix} r_1 + b_1 \\ r_2 + b_2 \\ r_3 + b_3 \end{pmatrix}$$
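Both cases fall out of numpy broadcasting, shown here with illustrative values:

```python
import numpy as np

r = np.array([1.0, 2.0, 3.0])            # convolution result (r1, r2, r3)

print(r + 0.5)                            # scalar bias broadcasts: [1.5 2.5 3.5]
print(r + np.array([0.1, 0.2, 0.3]))      # per-element vector bias: [1.1 2.2 3.3]
```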

Stride

Stride is essentially the kernel step size (the number of elements the kernel is moved ahead on each step).

Formula for stride = 1:

$$ Conv_{valid}^{stride1}\Big( \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}, \begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) = \begin{vmatrix} k_1 & k_2 & 0 & 0\\ 0 & k_1 & k_2 & 0\\ 0 & 0 & k_1 & k_2\\ \end{vmatrix} * \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} $$

Formula for stride = 2:

$$ Conv_{valid}^{stride2}\Big( \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}, \begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) = \begin{vmatrix} k_1 & k_2 & 0 & 0\\ 0 & 0 & k_1 & k_2 \end{vmatrix} * \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} $$

Usually the stride is selected to fit perfectly into the current combination of vector and kernel sizes. When it doesn’t fit, the usual approach is to simply take the last kernel position that still fits.

For example for stride = 3:

$$ Conv_{valid}^{stride3}\Big( \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}, \begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) = \begin{vmatrix} k_1 & k_2 & 0 & 0 \end{vmatrix} * \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} $$
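The three stride matrices above correspond to taking every `stride`-th kernel position that still fits. A minimal sketch with the hypothetical helper `conv_valid_strided` (same illustrative values, no kernel flip, matching the matrices above):

```python
import numpy as np

def conv_valid_strided(a, k, stride=1):
    """Valid convolution that keeps every `stride`-th kernel position
    that still fits inside the input (no kernel flip)."""
    n, m = len(a), len(k)
    return np.array([a[i:i + m] @ k for i in range(0, n - m + 1, stride)])

a = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([10.0, 100.0])
print(conv_valid_strided(a, k, 1))  # [210. 320. 430.]
print(conv_valid_strided(a, k, 2))  # [210. 430.]
print(conv_valid_strided(a, k, 3))  # [210.]
```

With stride = 3 only the first position fits, which is exactly the one-row matrix above.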

Padding

Padding is adding zeroes to the left and right of the input vector:

Then the kernel is applied to the padded vector. That’s all.
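In numpy this is a one-liner with `np.pad` (illustrative values):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
padded = np.pad(a, pad_width=1)   # one zero on each side
print(padded)                     # [0. 1. 2. 3. 0.]

# A valid convolution of the padded vector keeps the input length
k = np.array([10.0, 100.0])
print(np.correlate(padded, k, mode="valid"))
```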

2D-convolution

All concepts are the same as for 1D convolution.

Example picture for:
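A direct 2D analogue of the 1D sliding-window form can be sketched as follows (the input and kernel values are illustrative assumptions; no kernel flip, matching the 1D matrices above):

```python
import numpy as np

def conv2d_valid(x, k):
    """2D valid convolution without kernel flip: slide the kernel
    over every position where it fully fits and sum elementwise products."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(16.0).reshape(4, 4)
k = np.array([[1.0, 0.0],
              [0.0, 1.0]])
print(conv2d_valid(x, k))   # 3x3 output: out[i, j] = x[i, j] + x[i+1, j+1]
```

The output size formula applies per dimension: (4 - 2 + 1) x (4 - 2 + 1) = 3x3.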

3D-convolution

Common usage is convolution over color layers.

Idea visualization:
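The idea can be sketched in code: the kernel spans all input channels and the channel dimension is summed out, producing a single 2D output map. The shapes and values below are illustrative assumptions (a 4x4 “RGB” input of ones):

```python
import numpy as np

def conv_over_channels(x, k):
    """Convolution over color layers: kernel covers all channels,
    and the channel dimension is summed into one output map."""
    c, h, w = x.shape        # channels, height, width
    kc, kh, kw = k.shape     # kernel has the same channel count
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[:, i:i + kh, j:j + kw] * k)
    return out

x = np.ones((3, 4, 4))       # e.g. a 4x4 image with 3 color channels
k = np.ones((3, 2, 2))
print(conv_over_channels(x, k))   # every element is 3*2*2 = 12
```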

Pooling

Pooling on 2D data aims to downsample and aggregate the information in a convolution layer.

Example of max pooling (max = takes the maximum element of each submatrix):

Obviously, there are other kinds of pooling, such as average, square-root and others.
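Non-overlapping 2x2 max pooling can be written compactly with a reshape trick (the input values are illustrative):

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling over non-overlapping 2x2 submatrices.
    Reshape splits the matrix into 2x2 blocks, max reduces each block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1.0, 2.0, 5.0, 6.0],
              [3.0, 4.0, 7.0, 8.0],
              [9.0, 1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0, 7.0]])
print(max_pool_2x2(x))   # [[4. 8.] [9. 7.]]
```

Average pooling is the same with `.mean(axis=(1, 3))` instead of `.max`.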

Simple CNN

How a CNN works can be understood through a very simple example.

Consider a NN with the following architecture:

Trainable parameters in this NN:

  1. From conv layer: 2x2 kernel + 1 bias = 4 + 1 params
  2. From FCN layer: 3x4 weights + 3 biases = 12 + 3 params
  3. From output layer: 2x3 weights + 2 biases = 6 + 2 params
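A quick sanity check of the totals, 28 trainable parameters overall:

```python
conv_params = 2 * 2 + 1   # 2x2 kernel + scalar bias
fcn_params = 3 * 4 + 3    # weights matrix + bias vector
out_params = 2 * 3 + 2    # weights matrix + bias vector

print(conv_params + fcn_params + out_params)  # 28
```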

Forward pass:

Input layer:

  1. 5x5 input matrix
    • represents 5x5 gray-scale image
    • element values are from 0.0 to 1.0

Convolution layer:

  1. Convolution
    • kernel 2x2, no padding, stride=1
    • outputs 4x4 matrix
  2. Bias (a scalar value, the same for all elements) is added after the convolution
  3. ReLU activation (per element)

Pooling layer:

  1. Pooling
    • Max pooling on 2x2 submatrices
    • Outputs 2x2 matrix
  2. Matrix 2x2 is flattened into vector of len 4

No activation is required since max pooling is already non-linear.

Still, in other networks it is common to add an activation to this layer.

Fully-connected layer:

  1. Forward pass: $\overrightarrow{a^{FCN}} = |WeightsMatr^{FCN}|*\overrightarrow{a^{Pooling}} + \overrightarrow{b^{FCN}}$
    • 3x4 matrix for weights
    • biases vector of len 3
  2. Sigmoid activation

Output layer:

  1. Forward pass: $\overrightarrow{a^{outp}} = |WeightsMatr^{outp}|*\overrightarrow{a^{FCN}} + \overrightarrow{b^{outp}}$
    • 2x3 matrix for weights
    • biases vector of len 2
  2. Sigmoid activation

Cost function:

$Loss = (z_0^{outp} - y_0)^2 + (z_1^{outp} - y_1)^2$

Where {y₀, y₁} is target output vector.
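The whole forward pass above can be sketched end to end. All parameter values below are random placeholders with the shapes from the text (the helper names `sigmoid` and `conv2d_valid` are assumptions, not part of any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_valid(x, k):
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Hypothetical random parameters matching the shapes in the text
x = rng.random((5, 5))                        # 5x5 gray-scale input
kernel, b_conv = rng.random((2, 2)), 0.1      # 2x2 kernel + scalar bias
W1, b1 = rng.random((3, 4)), rng.random(3)    # FCN layer
W2, b2 = rng.random((2, 3)), rng.random(2)    # output layer

a = np.maximum(0.0, conv2d_valid(x, kernel) + b_conv)  # conv + bias + ReLU -> 4x4
pooled = a.reshape(2, 2, 2, 2).max(axis=(1, 3))        # 2x2 max pool -> 2x2
flat = pooled.reshape(4)                               # flatten -> len 4
z1 = sigmoid(W1 @ flat + b1)                           # FCN + sigmoid -> len 3
z2 = sigmoid(W2 @ z1 + b2)                             # output + sigmoid -> len 2

y = np.array([1.0, 0.0])                               # example target
loss = np.sum((z2 - y) ** 2)
print(z2.shape, loss)
```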

Backpropagation:

Achieved by calculating gradients for all params. Then, as usual, the gradient matrices are added to all params (multiplied by a negative learning rate such as -0.1).
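The update step itself is one line per parameter. A minimal sketch with a hypothetical weight matrix and gradient (illustrative values):

```python
import numpy as np

# Hypothetical parameter matrix and its gradient from backprop
W = np.array([[1.0, 2.0], [3.0, 4.0]])
grad_W = np.array([[0.5, -0.5], [1.0, 0.0]])

lr = 0.1
W = W - lr * grad_W    # same as adding the gradient times a negative rate
print(W)               # [[0.95 2.05] [2.9  4.  ]]
```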