A transformer takes as input a list of tokens. In our case, the ciphertext letters comprise our tokens. That said, a token does not have to be a single character: it can represent a chunk of characters or even a more abstract concept, such as the start or end of the text.

The first transformer operation -- the embedding -- takes each token and replaces it with a vector of numbers. The positions within this vector are often called channels, since each conveys information.

The residual stream

The output of the embedding stage is a matrix with each row representing a token, and each column representing a channel. This gives the resulting matrix a shape of (number of tokens) × (number of channels). This matrix begins the transformer's residual stream, which is read from and written to as it passes through each stage of the transformer.
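The shape above can be made concrete with a small sketch. The vocabulary, channel count, and the names `vocab` and `W_embed` below are illustrative assumptions, not the chapter's actual setup: embedding is simply a row lookup into a (vocabulary × channels) matrix.

```python
import numpy as np

# Hypothetical toy setup: a 3-token vocabulary {a, b, c}
# embedded into 4 channels with an arbitrary embedding matrix.
vocab = {"a": 0, "b": 1, "c": 2}
n_channels = 4
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(len(vocab), n_channels))  # (vocab size, channels)

tokens = [vocab[ch] for ch in "abc"]  # token ids for the input string
x = W_embed[tokens]                   # embedding = row lookup

print(x.shape)  # (number of tokens, number of channels) -> (3, 4)
```

The resulting matrix `x` is what enters the residual stream: one row per input token, one column per channel.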

One-hot encoding

How should we represent each of the possible tokens? Recall that our objective is to compute the ciphertext letter frequencies and that each letter is a token. So, a reasonable start would be to dedicate a single channel to each token. If a token's channel contains a 1, it indicates that token; if the channel contains a 0, it indicates it is not that token.

This encoding scheme is called a one-hot encoding.

For example, suppose we one-hot encode three tokens a, b, and c using three channels. We'll map a = [1, 0, 0], b = [0, 1, 0], and c = [0, 0, 1]. (Stacked, these rows comprise the embedding matrix, which here is the 3 × 3 identity matrix.)
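The mapping above can be sketched directly; the names `vocab` and `W_embed` are illustrative. With one channel per token, the embedding matrix is exactly the identity:

```python
import numpy as np

# One-hot embedding of the 3-token vocabulary {a, b, c}:
# one channel per token, so the embedding matrix is the 3x3 identity.
vocab = {"a": 0, "b": 1, "c": 2}
W_embed = np.eye(len(vocab))

print(W_embed[vocab["a"]])  # a -> [1. 0. 0.]
print(W_embed[vocab["b"]])  # b -> [0. 1. 0.]
print(W_embed[vocab["c"]])  # c -> [0. 0. 1.]
```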

The embedding of the string abc provides our input to the model:

[1, 0, 0]
[0, 1, 0]
[0, 0, 1]

Note this has a convenient property -- we can count the letters by summing across the rows! This gives the vector [1, 1, 1].
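The letter-counting trick is a one-liner in code. This is a minimal sketch under the same toy setup as before (the `vocab` and `W_embed` names are illustrative):

```python
import numpy as np

# Embed the string "abc" with the one-hot (identity) embedding,
# then count letters by summing across the rows (the token axis).
vocab = {"a": 0, "b": 1, "c": 2}
W_embed = np.eye(len(vocab))

tokens = [vocab[ch] for ch in "abc"]
x = W_embed[tokens]        # (tokens, channels); here the identity matrix

counts = x.sum(axis=0)     # one count per letter
print(counts)              # -> [1. 1. 1.]: one a, one b, one c
```

Because each channel corresponds to exactly one letter, the column sums are the letter frequencies, which is exactly the quantity the cryptogram task needs.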

Typically, embeddings are not one-hot encoded, since this requires as many channels as there are tokens in the vocabulary. One-hot encoding also neglects an opportunity to store extra information per token. For example, we could dedicate a channel to indicate even vs. odd, or we could ensure that "similar" tokens are nearby in embedding space.
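To illustrate, here is a hypothetical denser embedding (the matrix values and the "feature channel" interpretation are assumptions for the sketch, not a learned embedding): three tokens fit in only two channels, so channels no longer map one-to-one onto tokens.

```python
import numpy as np

# A hypothetical 2-channel embedding of 3 tokens. Channel 0 might
# carry a shared feature while channel 1 distinguishes b from c --
# the channels no longer map one-to-one onto tokens.
W_embed = np.array([
    [1.0,  0.0],   # a
    [0.0,  1.0],   # b
    [0.0, -1.0],   # c: shares b's channel, but with opposite sign
])

# Each token still gets a distinct vector, despite having fewer
# channels (2) than tokens (3).
print(np.linalg.matrix_rank(W_embed))  # -> 2
```

Embeddings like this are more compact, but reading a single letter count off a single channel is no longer possible, which foreshadows why learned embeddings are harder to interpret.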

At the end of this chapter, we'll show that nearly any embedding matrix can solve our cryptograms! This is part of what makes interpretability so difficult. To us, a one-hot encoding makes the algorithm easy to understand. However, a trained model is likely to choose an arbitrary embedding which obscures the underlying algorithm.