A Single-Layered Network
The Math
First, let's examine the math behind a single-layered neural network.
Suppose we are given $n$ inputs (scalars) and $m$ neurons. Then, the output $y_j$ of neuron $j$ is a linear combination of the $n$-dimensional input $\mathbf{x}$ with the neuron's $n$-dimensional weight vector $\mathbf{w}_j$, plus a scalar bias $b_j$. An activation function $\phi$ is then applied to the resulting sum:

$$y_j = \phi\left(\mathbf{w}_j \cdot \mathbf{x} + b_j\right) = \phi\left(\sum_{i=1}^{n} w_{j,i}\, x_i + b_j\right)$$
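To make this concrete, here is a minimal NumPy sketch of a single neuron's output under this definition; the names (`neuron_output`, `phi`, `w`, `b`) simply mirror the symbols above and are illustrative, not taken from any library.

```python
import numpy as np

def neuron_output(x, w, b, phi):
    """Output of one neuron: apply phi to the weighted sum of the inputs plus the bias."""
    return phi(np.dot(w, x) + b)

# Example: n = 3 inputs, logistic activation
x = np.array([0.5, -1.0, 2.0])   # inputs x_1 ... x_n
w = np.array([0.1, 0.4, -0.2])   # weights w_j for this neuron
b = 0.3                          # scalar bias b_j
logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
print(neuron_output(x, w, b, logistic))
```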
Alternatively, we can compute all $m$ outputs simultaneously with a single matrix multiplication, where the $j$-th row of the $m \times n$ weight matrix $W$ holds the weights of neuron $j$ and $\mathbf{b}$ collects the biases:

$$\mathbf{y} = \phi\left(W \mathbf{x} + \mathbf{b}\right)$$
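The vectorized form looks like this in the same illustrative NumPy style (one row of $W$ per neuron):

```python
import numpy as np

def layer_output(x, W, b, phi):
    """Outputs of all m neurons at once: y = phi(W x + b), with W of shape (m, n)."""
    return phi(W @ x + b)

# Example: n = 3 inputs, m = 2 neurons
x = np.array([0.5, -1.0, 2.0])
W = np.array([[0.1, 0.4, -0.2],    # weights of neuron 1
              [0.7, -0.3, 0.05]])  # weights of neuron 2
b = np.array([0.3, -0.1])
relu = lambda z: np.maximum(z, 0.0)
print(layer_output(x, W, b, relu))
```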
As we saw in Chapter 1, multiple linear transforms applied back-to-back can be collapsed into a single linear transform. Therefore, non-terminal layers use a non-linear $\phi$. Depending on the function, this allows us to warp the space (e.g. the logistic function squishes all outputs into the range $(0, 1)$), apply gating (e.g. the ReLU function "turns off" an output if the weighted input sum is negative), and more.
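The two activation functions mentioned above can be written directly; this is a small illustrative snippet rather than code from any particular framework:

```python
import numpy as np

def logistic(z):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Gating: passes positive sums through unchanged, "turns off" negative ones."""
    return np.maximum(z, 0.0)

z = np.array([-2.0, 0.0, 3.0])
print(logistic(z))  # [0.119..., 0.5, 0.952...]  -- all inside (0, 1)
print(relu(z))      # [0.0, 0.0, 3.0]            -- the negative sum is zeroed out
```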
The feedforward block of a transformer
The feedforward block consists of two neuron layers. It takes as input the residual stream, which is an $n_{\text{tokens}} \times d_{\text{model}}$ matrix. (That is, $n_{\text{tokens}}$ tokens/sequence positions, each represented by a $d_{\text{model}}$-dimensional vector.) The first layer consists of $d_{\text{ff}}$ neurons, each with a non-linear activation function; it is followed by a second layer of $d_{\text{model}}$ neurons with an identity activation function, so the block's output has the same shape as the residual stream.
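Putting the two layers together, here is a minimal sketch of such a feedforward block. The sizes and the choice of ReLU as the non-linearity are illustrative assumptions; real implementations differ in activation function, initialization, and dimensions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def feedforward_block(X, W1, b1, W2, b2):
    """Apply the two-layer feedforward block to the residual stream.

    X  : (n_tokens, d_model) residual stream -- one d_model-dim vector per token
    W1 : (d_model, d_ff), b1 : (d_ff,)       -- first layer, non-linear activation
    W2 : (d_ff, d_model), b2 : (d_model,)    -- second layer, identity activation
    """
    hidden = relu(X @ W1 + b1)   # (n_tokens, d_ff)
    return hidden @ W2 + b2      # (n_tokens, d_model)

# Example with assumed, illustrative sizes
n_tokens, d_model, d_ff = 4, 8, 32
rng = np.random.default_rng(0)
X  = rng.normal(size=(n_tokens, d_model))
W1 = rng.normal(size=(d_model, d_ff));  b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model));  b2 = np.zeros(d_model)
print(feedforward_block(X, W1, b1, W2, b2).shape)  # (4, 8) -- same shape as X
```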