A Single-Layered Network

The Math

First, let's examine the math behind a single-layered neural network.

Suppose we are given $n$ inputs (scalars) and $m$ neurons. Then, the output $y_i$ of neuron $i$ is a linear combination of the $n$-dimensional input $\mathbf{x}$ with the neuron's $n$-dimensional weights $\mathbf{w}_i$, plus a scalar bias $b_i$. An activation function $f$ is then applied to the resulting sum:

$$y_i = f\left(\sum_{j=1}^{n} w_{ij} x_j + b_i\right) = f(\mathbf{w}_i \cdot \mathbf{x} + b_i)$$

Alternatively, we can compute all $m$ outputs simultaneously with a single matrix multiplication:

$$\mathbf{y} = f(W\mathbf{x} + \mathbf{b})$$

where $W$ is the $m \times n$ matrix whose rows are the weight vectors $\mathbf{w}_i$, and $\mathbf{b}$ is the vector of biases.
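As a quick sanity check, here is a minimal NumPy sketch (with made-up dimensions and random weights) showing that the neuron-by-neuron sum and the single matrix multiplication produce the same outputs:

```python
import numpy as np

# Hypothetical single layer: n = 3 inputs, m = 2 neurons.
rng = np.random.default_rng(0)
n, m = 3, 2
x = rng.normal(size=n)        # input vector
W = rng.normal(size=(m, n))   # one weight row per neuron
b = rng.normal(size=m)        # one bias per neuron

f = np.tanh                   # any activation function works here

# Neuron by neuron: y_i = f(w_i . x + b_i)
y_loop = np.array([f(W[i] @ x + b[i]) for i in range(m)])

# All at once: y = f(Wx + b)
y_mat = f(W @ x + b)

assert np.allclose(y_loop, y_mat)
```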

As we saw in Chapter 1, multiple linear transforms applied back-to-back can be collapsed into a single linear transform. Therefore, non-terminal layers use a non-linear $f$. Depending on the function, this allows us to warp the space (e.g. the logistic function squishes all outputs into the range $(0, 1)$), apply gating (e.g. the ReLU function "turns off" an output if the weighted input sum is negative), and more.
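The two behaviors above are easy to see numerically. This sketch applies both functions to a handful of example pre-activation values:

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])  # example weighted sums

# Logistic (sigmoid): every output lands strictly inside (0, 1).
logistic = 1.0 / (1.0 + np.exp(-z))
assert np.all((logistic > 0) & (logistic < 1))

# ReLU: negative weighted sums are gated off to exactly zero.
relu = np.maximum(z, 0.0)
assert np.all(relu[z < 0] == 0.0)
```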

The Feedforward Block of a Transformer

The feedforward block consists of two neuron layers, where the first layer has some larger number $d_{\text{ff}}$ of neurons (commonly $d_{\text{ff}} = 4 d_{\text{model}}$). It takes as input the residual stream, which is an $N \times d_{\text{model}}$ matrix. (That is, $N$ tokens/sequence positions, each represented by a $d_{\text{model}}$-dimensional vector.) The first layer consists of $d_{\text{ff}}$ neurons, each with a non-linear activation function, followed by a second layer of $d_{\text{model}}$ neurons with an identity activation function, so that the output again matches the shape of the residual stream.
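Putting the two layers together, here is a minimal NumPy sketch of the feedforward block. The dimensions, the ReLU choice, and the names `W1`/`W2` are illustrative assumptions, not a specific model's parameters:

```python
import numpy as np

# Illustrative sizes: N tokens, d_model residual width,
# d_ff first-layer neurons (the common 4x expansion).
N, d_model, d_ff = 5, 8, 32

rng = np.random.default_rng(0)
X = rng.normal(size=(N, d_model))      # residual stream: N x d_model

W1 = rng.normal(size=(d_model, d_ff))  # first-layer weights
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model))  # second-layer weights
b2 = np.zeros(d_model)

relu = lambda z: np.maximum(z, 0.0)

hidden = relu(X @ W1 + b1)             # non-linear first layer: N x d_ff
out = hidden @ W2 + b2                 # identity second layer: N x d_model

assert out.shape == (N, d_model)
```

Note that each row (token position) is processed independently; the block applies the same two-layer network to every position in the sequence.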