# Evaluation metrics for ASR

If you're familiar with the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) from NLP, the
metrics for assessing speech recognition systems will be familiar! Don't worry if you're not: we'll go through the
explanations from start to finish to make sure you know the different metrics and understand what they mean.

When assessing speech recognition systems, we compare the system's predictions to the target text transcriptions,
annotating any errors that are present. We categorise these errors into one of three categories:
1. Substitutions (S): where we transcribe the **wrong word** in our prediction ("sit" instead of "sat")
2. Insertions (I): where we add an **extra word** in our prediction
3. Deletions (D): where we **remove a word** in our prediction

These error categories are the same for all speech recognition metrics. What differs is the level at which we compute
these errors: we can either compute them on the _word level_ or on the _character level_.

We'll use a running example for each of the metric definitions. Here, we have a _ground truth_ or _reference_ text sequence:

```python
reference = "the cat sat on the mat"
```

And a predicted sequence from the speech recognition system that we're trying to assess:

```python
prediction = "the cat sit on the"
```

We can see that the prediction is pretty close, but some words are not quite right. We'll evaluate this prediction
against the reference for the three most popular speech recognition metrics and see what sort of numbers we get for each.

## Word Error Rate
The *word error rate (WER)* metric is the 'de facto' metric for speech recognition. It calculates substitutions,
insertions and deletions on the *word level*. This means errors are annotated on a word-by-word basis. Take our example:

| Reference:  | the | cat | sat     | on  | the | mat |
|-------------|-----|-----|---------|-----|-----|-----|
| Prediction: | the | cat | **sit** | on  | the |     |
| Label:      | ✅   | ✅   | S       | ✅   | ✅   | D   |

Here, we have:
* 1 substitution ("sit" instead of "sat")
* 0 insertions
* 1 deletion ("mat" is missing)

This gives 2 errors in total. To get our error rate, we divide the number of errors by the total number of words in our
reference (N), which for this example is 6:

$$
\begin{aligned}
WER &= \frac{S + I + D}{N} \\
&= \frac{1 + 0 + 1}{6} \\
&= 0.333
\end{aligned}
$$

Alright! So we have a WER of 0.333, or 33.3%. Notice how the word "sit" only has one character that is wrong, but the
entire word is marked incorrect. This is a defining feature of the WER: spelling errors are penalised heavily, no matter
how minor they are.
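
To see just how heavily this single-character mistake is penalised at the word level, we can compare the WER against a
character-level computation on the same example. This is only a quick sketch using 🤗 Evaluate's `cer` metric; the
mechanics are the same as for the WER, just counted character-by-character:

```python
from evaluate import load

cer_metric = load("cer")

cer = cer_metric.compute(
    references=["the cat sat on the mat"], predictions=["the cat sit on the"]
)

print(cer)
```

Since only one character of "sit" is wrong (plus the missing " mat"), the character-level error rate comes out lower
than the 33.3% we get at the word level.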

The WER is defined such that *lower is better*: a lower WER means there are fewer errors in our prediction, so a perfect
speech recognition system would have a WER of zero (no errors).

Let's see how we can compute the WER using 🤗 Evaluate. We'll need two packages to compute our WER metric: 🤗 Evaluate
for the API interface, and JIWER to do the heavy lifting of running the calculation:
```bash
pip install --upgrade evaluate jiwer
```

Great! We can now load up the WER metric and compute the figure for our example:

```python
from evaluate import load

wer_metric = load("wer")

wer = wer_metric.compute(references=[reference], predictions=[prediction])

print(wer)
```
**Print Output:**
```
0.3333333333333333
```

0.33, or 33.3%, as expected! We now know what's going on under the hood with this WER calculation.
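
If you also want the individual error counts rather than just the final ratio, the JIWER library (which is doing the
heavy lifting for us here) can report them directly. Here's a minimal sketch, assuming a recent jiwer release (3.x or
later) that provides `process_words`:

```python
import jiwer

reference = "the cat sat on the mat"
prediction = "the cat sit on the"

# align the prediction against the reference at the word level
word_output = jiwer.process_words(reference, prediction)

# 1 substitution, 0 insertions, 1 deletion for our example
print(word_output.substitutions, word_output.insertions, word_output.deletions)
# and the same WER of 0.333 as before
print(word_output.wer)
```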

Now, here's something that's quite confusing... What do you think the upper limit of the WER is? You would expect it to be
1 or 100%, right? Nuh uh! Since the WER is the ratio of errors to the number of words (N), there is no upper limit on the WER!
Let's take an example where we predict 10 words and the target only has 2 words. If all of our predictions were wrong (10 errors),
we'd have a WER of 10 / 2 = 5, or 500%! This is something to bear in mind if you train an ASR system and see a WER of over
100%. Although if you're seeing this, something has likely gone wrong... 😅
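
We can reproduce this 500% scenario with the same `wer_metric` we loaded above: a 2-word reference against a 10-word
prediction where nothing matches gives 10 errors over 2 reference words.

```python
short_reference = "the cat"  # 2 words
long_wrong_prediction = "a b c d e f g h i j"  # 10 words, none of them correct

wer = wer_metric.compute(
    references=[short_reference], predictions=[long_wrong_prediction]
)

print(wer)  # 5.0, i.e. a WER of 500%
```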

## Inverse Real-Time Factor (RTFx)

While WER measures the accuracy of transcriptions, the *inverse real-time factor (RTFx)* measures the speed of an ASR system.
RTFx is the ratio of audio duration to processing time (the inverse of the real-time factor):

$$
\text{RTFx} = \frac{\text{Audio Duration}}{\text{Processing Time}}
$$

For example, if it takes 10 seconds to transcribe 100 seconds of audio, the RTFx is 100/10 = 10. An RTFx greater than 1.0
means the system can transcribe audio faster than real-time, which is essential for live transcription applications like
video conferencing or live captioning. An RTFx of 1.0 means the system processes at exactly real-time speed, while values
below 1.0 indicate slower-than-real-time processing.
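
In practice, you can measure RTFx by simply timing your transcription call. Below is a minimal sketch; `transcribe` is a
hypothetical stand-in for whatever ASR system you're benchmarking, and the audio duration is assumed to be known:

```python
import time

audio_duration_seconds = 100.0  # length of the clip being transcribed

start = time.perf_counter()
text = transcribe("sample.wav")  # hypothetical ASR call - replace with your system
processing_time_seconds = time.perf_counter() - start

rtfx = audio_duration_seconds / processing_time_seconds
print(f"RTFx: {rtfx:.1f}")  # values above 1.0 mean faster than real-time
```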

Key points about RTFx:
* **Higher is better**: Higher RTFx means faster processing
* **RTFx > 1.0**: Faster than real-time (good for streaming applications)
* **RTFx = 1.0**: Processes at exactly real-time speed
* **RTFx < 1.0**: Slower than real-time (the system falls behind the audio)

## Putting it all together

When we evaluate the baseline Whisper model on the Dhivehi test set, we normalise both the references and the
predictions. Since normalisation can leave some references empty, we filter those samples out of both lists before
computing the normalised WER:

```python
# keep only the samples whose reference is non-empty after normalisation
all_predictions_norm = [
    all_predictions_norm[i]
    for i in range(len(all_predictions_norm))
    if len(all_references_norm[i]) > 0
]
all_references_norm = [
    all_references_norm[i]
    for i in range(len(all_references_norm))
    if len(all_references_norm[i]) > 0
]

wer = 100 * wer_metric.compute(
    references=all_references_norm, predictions=all_predictions_norm
)

wer
```
**Output:**
```
125.69809089960707
```

Again we see the drastic reduction in WER we achieve by normalising our references and predictions: the baseline model
achieves an orthographic test WER of 168%, while the normalised WER is 126%.

Right then! These are the numbers that we want to try and beat when we fine-tune the model, in order to improve the Whisper
model for Dhivehi speech recognition. Continue reading to get hands-on with a fine-tuning example 🚀

