Letter Frequency Solution

In this section, we'll solve cryptograms by comparing the ciphertext letter frequencies to those expected with each rotation. We'll prove that in this particular problem, comparing the distributions using dot products gives the same result as the sum of squared errors (SSE).

Letter Frequency

Humans solve cryptograms by analyzing letter frequency.

Since e is the most common English letter, we can assume the most common ciphertext letter is a rotated e.

More generally, we can match the letter frequencies of our ciphertext to the letter frequencies we'd expect to see for each rotation. The "best match" will indicate the most likely rotation.

Suppose the most common English letters are etainos. In our ciphertext, the most common letters are fubjopt -- one letter higher. Hence, we can predict a rotation of 1.

From Wikipedia, here are the letter frequencies for "Texts":

# English language frequencies from "Texts" for the letters a-z
base_freq = [0.082, 0.015, 0.028, 0.043, 0.127, 0.022, 0.020, 0.061, 0.070, 
             0.0015, 0.0077, 0.04, 0.024, 0.067, 0.075, 0.019, 0.00095, 0.060, 
             0.063, 0.091, 0.028, 0.0098, 0.024, 0.0015, 0.020, 0.00074]

For a rotation of 2 where A=>C, B=>D, etc. these are cyclically rotated:

# English language frequencies from "Texts" for the letters a-z
rot_freqs = np.array([base_freq[-rot:] + base_freq[rot:] for rot in range(N_LETTERS)])

Formalizing the letter frequency

Let's make this more formal. Suppose we have a vocabulary $V$ containing $∣ V ∣$ tokens. In this problem, $V = {a, b, \dots, z, "space"}$ and $∣ V ∣ = 27$ . We will only rotate a-z, so we'll define $N_{rot} = 26$ .

We define the rotation letter frequency vector $r_{0}$ as the base English letter frequencies (above called base_freqs). We define $r_{i}$ as the letter frequencies when rotated by $i$ (above called rot_freqs).

For the $k$ th letter, we define $r_{ik} := r_{0 (i + k mod N_{rot})}$ . This defines a one-to-one map between $r_{i}$ and $r_{0}$ . Hence, the set of values in each vector are identical, making $∣ r_{i} ∣ = ∣ r_{j} ∣$ for any $0 \leq i, j < N_{rot}$ .

As we'll see below, because these magnitudes are equal they will cancel out when we compare the distance between vectors.

Distribution distances

To solve the cryptogram, we must find which of the $N_{rot}$ rotation letter distributions is "closest" to that of the ciphertext. To do so, we will compute the "distance" between the ciphertext letter frequencies $c$ and each of the $N_{rot}$ expected frequency distributions.

Although there are several methods for comparing frequency distributions, in this chapter we'll start with the simplest -- sum of squared errors (SSE).

Sum of Squared Errors (SSE)

We will compare each letter frequency individually, penalizing larger errors more. We'll do this by squaring these errors and summing the squared errors.

Based on this, here is our definition:

$SSE (c, r) := i \sum (c_{i} - r_{i})^{2} = ∣ c - r ∣^{2} .$

Expanded:

$SSE (c, r) = = i \sum c_{i}^{2} - 2 i \sum c_{i} r_{i} + i \sum r_{i}^{2} ∣ c ∣^{2} - 2 c \cdot r + ∣ r ∣^{2} .$

Squaring the errors gives several benefits: First, it makes all differences positive. Second, it magnifies large differences in our total error. Third, compared to absolute value, it is differentiable (for potential minimization) and easier to prove things about!

Comparing SSEs

Although we could compute all $N_{rot}$ SSEs, there is a faster and easier way. It turns out we can simply take the dot product of $c$ with each $r$ :

Theorem 1. For a given ciphertext letter distribution $c$ and rotation distributions $r_{1}$ and $r_{2}$ , then $SSE (c, r_{1}) < SSE (c, r_{2})$ if and only if $c \cdot r_{1} > c \cdot r_{2}$ .

Proof. We showed earlier that $∣ r_{i} ∣ = ∣ r_{j} ∣$ for any $0 \leq i, j < N_{rot}$ . Then: $⟺ ⟺ SSE (c, r_{1}) < ∣ c ∣^{2} - 2 c \cdot r_{1} + ∣ r_{1} ∣^{2} < c \cdot r_{1} > SSE (c, r_{2}) ∣ c ∣^{2} - 2 c \cdot r_{2} + ∣ r_{2} ∣^{2} c \cdot r_{2} .$

Let's define a score function $S (c, r) = c \cdot r$ . By definition, the rotation with smallest SSE has a smaller SSE than any other rotation. From Theorem 1, the rotation with the largest score has the smallest SSE (and vice versa). For more discussion of score vs. distance vs. similarity, see Appendix B.

Note this does not hold in general -- only when all class vectors have the same magnitude.

Note that we can simultaneously compute the dot products using matrix algebra. Given a matrix $R$ where column $i$ is $r_{i}$ and a row vector of ciphertext frequencies $c$ , then $c R$ gives a vector of all dot products.

Visualizing the dot product

Interestingly, the SSE can now be interpreted as related to the "angle" between vectors. This is because the dot product of unit vectors is the cosine of the angle between them -- see Appendix B for more details!

Imagine the x-axis depicting the letter $e$ frequency and the y-axis the frequency of $b$ . Then, the rotation 0 class vector is $(r_{04}, r_{02}) = (0.127, 0.015)$ and the rotation 5 class vector $(r_{54}, r_{52}) = (r_{09}, r_{07}) = (0.0015, 0.061)$ .

Our ciphertext token frequencies $(e, b)$ will then also be a vector. We compare its angle to that of each of the two rotation classes, and the class vector with largest dot product (typically smallest angle) determines the predicted class.

LLM Foundations

Letter Frequency Solution