Comparing Vectors

A distance measure such as Euclidean distance $d(x,y)$ requires smaller values to indicate "more similar". We assume for any distance measure, $d(x,y)=0$ if and only if $x=y$, $d(x,y)=d(y,x)$, and $d(x,y)≥0$.

A similarity measure such as cosine similarity $c(x,y)$ requires larger values to indicate "more similar", but no upper bound is required. Often, similarity measures can be converted to distance measures, and vice versa. For example, since cosine similarity is bounded between $-1$ and $1$, we can define cosine distance as $1-c(x,y)$.

A score $s(x,y)$ is an arbitrary function where a larger score indicates a "better" match. All similarity measures are scores, and all negated distance metrics are scores. This terminology was invented to avoid misunderstandings that may arise when comparing, say, Euclidean distance with cosine similarity.
The Euclidean distance
In machine learning, we frequently must compare how far one vector is from another. There are many ways of doing this. Among the simplest (and most common) is the Euclidean distance. This is the "straight line" distance we're familiar with in the real world.
Its definition is based on the Pythagorean Theorem. Given a right-angled triangle with side lengths $a,b$ and hypotenuse length $c$, then $c^{2}=a^{2}+b^{2}$. We can use this to compute the distance ("hypotenuse") between two two-dimensional points.
Suppose we want to know the "straight-line" distance between two points $x=(x_{1},x_{2})$ and $y=(y_{1},y_{2})$. We can connect the points with a right triangle with legs $|x_{1}-y_{1}|$ and $|x_{2}-y_{2}|$. Then, by the Pythagorean Theorem, the "straight-line" distance between them (the hypotenuse) is $d(x,y)=\sqrt{(x_{1}-y_{1})^{2}+(x_{2}-y_{2})^{2}}$.
Applying the Pythagorean Theorem a second time, we can derive the formula for three-dimensional vectors: $d(x,y)=\sqrt{(x_{1}-y_{1})^{2}+(x_{2}-y_{2})^{2}+(x_{3}-y_{3})^{2}}$.
Noticing a pattern, we can define the general $k$-dimensional Euclidean distance as follows:
$d(x,y):=\sqrt{\sum_{i=1}^{k}(x_{i}-y_{i})^{2}}.$
Definition. Given a $k$-dimensional vector $x$, the magnitude of $x$ is $|x|:=\sqrt{\sum_{i=1}^{k}x_{i}^{2}}.$
Definition. Given $k$-dimensional vectors $x$ and $y$, the Euclidean distance is $d(x,y):=|x-y|.$
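As a sketch (in plain Python, assuming vectors are lists of floats), the two definitions above translate directly into code:

```python
import math

def magnitude(x):
    # |x| = sqrt(x_1^2 + ... + x_k^2)
    return math.sqrt(sum(x_i * x_i for x_i in x))

def euclidean_distance(x, y):
    # d(x, y) = |x - y|: the magnitude of the difference vector
    difference = [x_i - y_i for x_i, y_i in zip(x, y)]
    return magnitude(difference)

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0 (the classic 3-4-5 triangle)
```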
We will see that some distance measures (including Euclidean distance) are referred to as metrics:
Definition. A distance metric $d$ satisfies the following properties. For any $k$-dimensional vectors $x$, $y$, and $z$:
 Identity. $d(x,x)=0$;
 Positivity. $d(x,y)>0$ whenever $x\neq y$;
 Symmetry. $d(x,y)=d(y,x)$;
 Triangle inequality. $d(x,z)\leq d(x,y)+d(y,z)$.
The triangle inequality ensures that the shortest distance between two points is a straight line.
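We can spot-check these four properties numerically. The sketch below tests them on random three-dimensional vectors; this is an illustration, not a proof:

```python
import math
import random

def d(x, y):
    # Euclidean distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

random.seed(0)
for _ in range(1000):
    x, y, z = ([random.uniform(-1, 1) for _ in range(3)] for _ in range(3))
    assert d(x, x) == 0.0                        # identity
    assert d(x, y) > 0.0                         # positivity (x != y with prob. 1)
    assert math.isclose(d(x, y), d(y, x))        # symmetry
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12  # triangle inequality (float slack)
print("all four properties held on 1000 random triples")
```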
Dot product
Definition. Given two $k$-dimensional vectors $x$ and $y$, the dot product is the sum of their elementwise products: $x\cdot y:=\sum_{i=1}^{k}x_{i}y_{i}$.
In machine learning, the dot product is often used as a similarity measure. However, as we'll see below, it is somewhat imperfect: it is affected both by the angle between the vectors and by their magnitudes.
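A minimal sketch of the dot product and its magnitude sensitivity, using toy vectors chosen for illustration:

```python
def dot(x, y):
    # Sum of elementwise products
    return sum(x_i * y_i for x_i, y_i in zip(x, y))

a = [1.0, 0.0]
b = [1.0, 0.0]   # identical direction and magnitude to a
c = [10.0, 1.0]  # different direction, much larger magnitude
print(dot(a, b))  # 1.0
print(dot(a, c))  # 10.0 -- c "wins" on raw dot product despite the angle
```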
Definition. Given a $k$-dimensional vector $x$, the unit vector of $x$ is $\hat{x}:=x/|x|$. This vector points in the direction of $x$ but has magnitude $1$, placing it on the unit circle (or, in higher dimensions, the unit sphere).
In many textbooks, the dot product is defined directly in terms of the angle between two vectors. Where $u,v$ are vectors, $|u|,|v|$ are their magnitudes, and $\theta$ is the angle between the vectors:
$u\cdot v:=|u||v|\cos\theta \iff \hat{u}\cdot\hat{v}=\cos\theta.$
Hence, if the vectors are unit vectors, the dot product by itself is the cosine of the angle between them.
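To illustrate, a short sketch: normalizing both vectors first makes the dot product equal $\cos\theta$ (here $\theta=\pi/4$, a 45-degree angle):

```python
import math

def dot(x, y):
    return sum(x_i * y_i for x_i, y_i in zip(x, y))

def unit(x):
    m = math.sqrt(dot(x, x))  # magnitude |x|
    return [x_i / m for x_i in x]

u = [3.0, 0.0]
v = [5.0, 5.0]  # 45 degrees from u
cos_theta = dot(unit(u), unit(v))
print(math.isclose(cos_theta, math.cos(math.pi / 4)))  # True
```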
Cosine similarity
The cosine similarity is the cosine of the angle between two vectors. It is useful when direction matters more than magnitude. For example, someone who always rates movies $4/5$ is similar to someone who always rates them $5/5$ (perhaps the origin represents all movies rated $2.5/5$): both raters score every movie identically, so their ratings convey no information about the movies!
The cosine similarity $s$ is bounded to $[-1, 1]$, with a larger score indicating more similar vectors.
 s = 1. Same direction, since $\cos 0=1$, e.g. $[1,0]\cdot[1,0]=1$.
 s = −1. Opposite direction, since $\cos\pi=-1$, e.g. $[1,0]\cdot[-1,0]=-1$.
 s = 0. Orthogonal (i.e. a 90-degree angle), since $\cos(\pi/2)=0$, e.g. $[1,0]\cdot[0,1]=0$.
This measure can be efficiently computed, since it is based on the dot product.
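A sketch checking the three cases above on two-dimensional vectors (the division by both magnitudes is the only step beyond the dot product):

```python
import math

def cosine_similarity(x, y):
    # Dot product divided by the product of the magnitudes
    dot = sum(x_i * y_i for x_i, y_i in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
```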
Hence, the dot product by itself is effectively an "unnormalized" cosine similarity. This means it can be affected by magnitude!
This is especially apparent in modern LLMs if token embeddings are compared with one another directly. Using cosine similarity, each token matches best with itself. Using the raw dot product, however, many tokens instead match best with other tokens of larger magnitude.
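A toy sketch of that effect. The "embeddings" below are made up for illustration (not from any real model), but they show how a raw dot product can prefer a higher-magnitude neighbor while cosine similarity matches the token with itself:

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

# Hypothetical toy "token embeddings" for illustration only.
emb = {
    "cat": [1.0, 1.0],
    "dog": [0.9, 1.1],
    "the": [5.0, 4.0],  # similar direction to "cat", much larger magnitude
}

query = emb["cat"]
print(max(emb, key=lambda t: dot(query, emb[t])))                # "the"
print(max(emb, key=lambda t: cosine_similarity(query, emb[t])))  # "cat"
```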
Exercises
 By applying the Pythagorean Theorem twice, derive the three-dimensional Euclidean distance formula. That is, prove that for any three-dimensional vectors $x$ and $y$,
$d(x,y)=\sqrt{(x_{1}-y_{1})^{2}+(x_{2}-y_{2})^{2}+(x_{3}-y_{3})^{2}}.$