If the bias is a scalar (in NN applications this is the usual case):
$$ \overrightarrow{r} + b = \begin{pmatrix} r_1 + b \\ r_2 + b \\ r_3 + b \end{pmatrix}$$
If bias is a vector:
$$ \overrightarrow{r} + \overrightarrow{b} = \begin{pmatrix} r_1 + b_1 \\ r_2 + b_2 \\ r_3 + b_3 \end{pmatrix}$$
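A minimal NumPy sketch of both cases (the values are arbitrary; a scalar bias is broadcast to every component, a vector bias is added element-wise):

```python
import numpy as np

r = np.array([1.0, 2.0, 3.0])

# Scalar bias: broadcast to every component of the vector.
b_scalar = 0.5
print(r + b_scalar)   # [1.5 2.5 3.5]

# Vector bias: element-wise addition.
b_vec = np.array([0.1, 0.2, 0.3])
print(r + b_vec)      # [1.1 2.2 3.3]
```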
Stride
Stride is essentially the kernel step size: the number of elements the kernel is moved forward on each step.
Formula for stride = 1:
$$
Conv_{valid}^{stride1}\Big(
\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix},
\begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) =
\begin{pmatrix}
k_1 & k_2 & 0 & 0\\
0 & k_1 & k_2 & 0\\
0 & 0 & k_1 & k_2
\end{pmatrix} *
\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}
$$
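This matrix form can be checked numerically. A sketch with assumed example values (`K` is the banded matrix above; each row is the kernel shifted one position to the right):

```python
import numpy as np

a, b, c, d = 1.0, 2.0, 3.0, 4.0
k1, k2 = 10.0, 20.0

# Banded matrix for stride 1: each row is the kernel shifted by one.
K = np.array([
    [k1, k2, 0, 0],
    [0, k1, k2, 0],
    [0, 0, k1, k2],
])
x = np.array([a, b, c, d])
print(K @ x)  # [ 50.  80. 110.]

# Same result via np.convolve (which flips the kernel, hence [k2, k1]):
print(np.convolve(x, [k2, k1], mode="valid"))  # [ 50.  80. 110.]
```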
Formula for stride = 2:
$$
Conv_{valid}^{stride2}\Big(
\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix},
\begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) =
\begin{pmatrix}
k_1 & k_2 & 0 & 0\\
0 & 0 & k_1 & k_2
\end{pmatrix} *
\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}
$$
Usually the stride is chosen so that the kernel fits the input vector exactly. When it does not fit, the usual approach is to simply stop at the last kernel position that still fits.
For example for stride = 3:
$$
Conv_{valid}^{stride3}\Big(
\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix},
\begin{pmatrix} k_1 \\ k_2 \end{pmatrix}\Big) =
\begin{pmatrix}
k_1 & k_2 & 0 & 0
\end{pmatrix} *
\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}
$$
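The three formulas above can be collapsed into one small function. A sketch (the name `conv_valid` and the example values are my own); the floor division implements the "last position that still fits" rule:

```python
import numpy as np

def conv_valid(x, k, stride=1):
    """Valid 1D cross-correlation; kernel positions that no longer
    fit inside x are simply dropped."""
    n_out = (len(x) - len(k)) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + len(k)], k)
                     for i in range(n_out)])

x = np.array([1.0, 2.0, 3.0, 4.0])  # [a, b, c, d]
k = np.array([10.0, 20.0])          # [k1, k2]
print(conv_valid(x, k, stride=1))   # [ 50.  80. 110.]
print(conv_valid(x, k, stride=2))   # [ 50. 110.]
print(conv_valid(x, k, stride=3))   # [50.]
```

Note how stride 2 keeps only the first and third rows of the stride-1 matrix, and stride 3 keeps only the first, matching the formulas above.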
Padding
Padding is adding zeros to the left and right of the input vector:
padding = 0 means no zeros: $[a, b, c, d]$
padding = 1 means one zero on the left plus one zero on the right: $[0, a, b, c, d, 0]$
and so on
Then the kernel is applied to the padded vector as usual. That's all.
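A short sketch of padding = 1 followed by a valid convolution (example values assumed, same kernel as before):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # [a, b, c, d]

# padding = 1: one zero on each side.
padded = np.pad(x, pad_width=1)
print(padded)  # [0. 1. 2. 3. 4. 0.]

# Then the kernel slides over the padded vector as usual.
k = np.array([10.0, 20.0])
out = np.array([np.dot(padded[i:i + 2], k) for i in range(len(padded) - 1)])
print(out)     # [ 20.  50.  80. 110.  40.]
```

With padding the output is longer than the valid-only result, since the kernel also covers positions hanging over the original edges.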
2D-convolution
All the concepts are the same as for 1D convolution.
Example picture for:
valid conv (non-full)
stride: x=2, y=2
no padding
input size: 5x5
kernel size: 3x3
output size: 2x2
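The pictured configuration can be sketched directly. A minimal 2D valid convolution (the function name and input values are my own; the kernel of ones just sums each window), reproducing the 5x5 input / 3x3 kernel / stride 2 → 2x2 output example:

```python
import numpy as np

def conv2d_valid(x, k, stride=(1, 1)):
    # Valid 2D cross-correlation with separate strides along y and x.
    sy, sx = stride
    ky, kx = k.shape
    oy = (x.shape[0] - ky) // sy + 1
    ox = (x.shape[1] - kx) // sx + 1
    out = np.empty((oy, ox))
    for i in range(oy):
        for j in range(ox):
            out[i, j] = np.sum(x[i*sy:i*sy+ky, j*sx:j*sx+kx] * k)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3))
out = conv2d_valid(x, k, stride=(2, 2))
print(out.shape)  # (2, 2)
print(out)        # [[ 54.  72.]
                  #  [144. 162.]]
```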
3D-convolution
A common usage is convolution over the color channels of an image.
Idea visualization:
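A sketch of this idea, assuming the common NN convention: the kernel spans all color channels, and the channel dimension is summed out, so a (H, W, 3) image convolved with a (kh, kw, 3) kernel produces a single 2D feature map (function name and random inputs are my own):

```python
import numpy as np

def conv3d_over_channels(img, k):
    # Valid convolution over a color image: the kernel covers all
    # channels, and the sum runs over height, width and channels.
    kh, kw, _ = k.shape
    oh = img.shape[0] - kh + 1
    ow = img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw, :] * k)
    return out

img = np.random.rand(5, 5, 3)   # e.g. a tiny RGB image
k = np.random.rand(3, 3, 3)     # kernel spanning all 3 channels
out = conv3d_over_channels(img, k)
print(out.shape)  # (3, 3) -- the channel dimension is gone
```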
Pooling
Pooling in 2D aims to downsample and aggregate the information in a convolution layer.
Example of max pooling (max = take the maximum element of each submatrix):
Obviously, there are other kinds of pooling, such as average, square-root and others.
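A sketch of non-overlapping 2x2 max pooling (the function name and input matrix are my own):

```python
import numpy as np

def max_pool2d(x, size=2):
    # Non-overlapping max pooling: each size x size block collapses
    # to its maximum element.
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [4, 5, 6, 7]], dtype=float)
print(max_pool2d(x))  # [[4. 8.]
                      #  [9. 7.]]
```

Average pooling would be the same code with `.mean(axis=(1, 3))` instead of `.max(...)`.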
Simple CNN
How a CNN works can be understood from a very simple example.
Training is achieved by calculating gradients for all parameters.
Then, as usual, the gradient matrices are added to all parameters, multiplied by a learning rate such as -0.1 or similar.
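The update step itself is one line. A minimal sketch with a hypothetical kernel and gradient matrix (in a real CNN the gradient would come from backpropagation):

```python
import numpy as np

lr = 0.1  # learning rate; the sign is applied in the update below
kernel = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
grad = np.array([[0.5, -0.2],
                 [0.1, 0.3]])  # assumed gradient from backprop

# Gradient descent step: add the gradient scaled by -lr.
kernel += -lr * grad
print(kernel)  # [[ 0.95  0.02]
               #  [-0.01  0.97]]
```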