Letter Frequency Solution
In this section, we'll solve cryptograms by comparing the ciphertext letter frequencies to those expected with each rotation. We'll prove that in this particular problem, comparing the distributions using dot products gives the same result as the sum of squared errors (SSE).
Letter Frequency
Humans solve cryptograms by analyzing letter frequency.
Since
e
is the most common English letter, we can assume the most common ciphertext letter is a rotatede
.
More generally, we can match the letter frequencies of our ciphertext to the letter frequencies we'd expect to see for each rotation. The "best match" will indicate the most likely rotation.
Suppose the most common English letters are
etainos
. In our ciphertext, the most common letters arefubjopt
-- one letter higher. Hence, we can predict a rotation of 1.
From Wikipedia, here are the letter frequencies for "Texts":
# English language frequencies from "Texts" for the letters a-z
base_freq = [0.082, 0.015, 0.028, 0.043, 0.127, 0.022, 0.020, 0.061, 0.070,
0.0015, 0.0077, 0.04, 0.024, 0.067, 0.075, 0.019, 0.00095, 0.060,
0.063, 0.091, 0.028, 0.0098, 0.024, 0.0015, 0.020, 0.00074]
For a rotation of 2 where A=>C, B=>D, etc. these are cyclically rotated:
# English language frequencies from "Texts" for the letters a-z
rot_freqs = np.array([base_freq[-rot:] + base_freq[rot:] for rot in range(N_LETTERS)])
Formalizing the letter frequency
Let's make this more formal. Suppose we have a vocabulary containing tokens. In this problem, and . We will only rotate a-z, so we'll define .
We define the rotation letter frequency vector as the base English letter frequencies (above called base_freqs). We define as the letter frequencies when rotated by (above called rot_freqs).
For the th letter, we define . This defines a one-to-one map between and . Hence, the set of values in each vector are identical, making for any .
As we'll see below, because these magnitudes are equal they will cancel out when we compare the distance between vectors.
Distribution distances
To solve the cryptogram, we must find which of the rotation letter distributions is "closest" to that of the ciphertext. To do so, we will compute the "distance" between the ciphertext letter frequencies and each of the expected frequency distributions.
Although there are several methods for comparing frequency distributions, in this chapter we'll start with the simplest -- sum of squared errors (SSE).
Sum of Squared Errors (SSE)
We will compare each letter frequency individually, penalizing larger errors more. We'll do this by squaring these errors and summing the squared errors.
Based on this, here is our definition:
Expanded:
Squaring the errors gives several benefits: First, it makes all differences positive. Second, it magnifies large differences in our total error. Third, compared to absolute value, it is differentiable (for potential minimization) and easier to prove things about!
Comparing SSEs
Although we could compute all SSEs, there is a faster and easier way. It turns out we can simply take the dot product of with each :
Theorem 1. For a given ciphertext letter distribution and rotation distributions and , then if and only if .
Proof. We showed earlier that for any . Then:
Let's define a score function . By definition, the rotation with smallest SSE has a smaller SSE than any other rotation. From Theorem 1, the rotation with the largest score has the smallest SSE (and vice versa). For more discussion of score vs. distance vs. similarity, see Appendix B.
Note that we can simultaneously compute the dot products using matrix algebra. Given a matrix where column is and a row vector of ciphertext frequencies , then gives a vector of all dot products.
Visualizing the dot product
Interestingly, the SSE can now be interpreted as related to the "angle" between vectors. This is because the dot product of unit vectors is the cosine of the angle between them -- see Appendix B for more details!
Imagine the x-axis depicting the letter frequency and the y-axis the frequency of . Then, the rotation 0 class vector is and the rotation 5 class vector .
Our ciphertext token frequencies will then also be a vector. We compare its angle to that of each of the two rotation classes, and the class vector with largest dot product (typically smallest angle) determines the predicted class.