| Kaldi-based forced alignment |
| ============================ |
|
|
This section describes in detail how to use `kaldi-decoder`_
for **FST-based** ``forced alignment`` with models trained with `CTC`_ loss.
|
|
| .. hint:: |
|
|
   We have a Colab notebook walking you through this section step by step.
|
|
| |kaldi-based forced alignment colab notebook| |
|
|
| .. |kaldi-based forced alignment colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg |
| :target: https://github.com/k2-fsa/colab/blob/master/icefall/ctc_forced_alignment_fst_based_kaldi.ipynb |
|
|
| Prepare the environment |
| ----------------------- |
|
|
Before you continue, make sure you have set up `icefall`_ by following :ref:`install icefall`.
|
|
| .. hint:: |
|
|
| You don't need to install `Kaldi`_. We will ``NOT`` use `Kaldi`_ below. |
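
Besides `icefall`_ itself, the code in this section imports ``torchaudio``,
``kaldifst``, and `kaldi-decoder`_ (imported as ``kaldi_decoder``). A quick
sanity check that everything is importable (this snippet is our addition,
not part of the original tutorial):

.. code-block:: python3

   import torchaudio
   import kaldifst
   import kaldi_decoder

   print(torchaudio.__version__)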
|
|
| Get the test data |
| ----------------- |
|
|
We use the test wave from the
`CTC FORCED ALIGNMENT API TUTORIAL <https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html>`_.
|
|
| .. code-block:: python3 |
|
|
| import torchaudio |
|
|
| # Download test wave |
| speech_file = torchaudio.utils.download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav") |
| print(speech_file) |
| waveform, sr = torchaudio.load(speech_file) |
| transcript = "i had that curiosity beside me at this moment".split() |
| print(waveform.shape, sr) |
|
|
| assert waveform.ndim == 2 |
| assert waveform.shape[0] == 1 |
| assert sr == 16000 |
|
|
| The test wave is downloaded to:: |
| |
| $HOME/.cache/torch/hub/torchaudio/tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav |
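
The assertions above require 16 kHz, single-channel audio, which is what
the model used below expects. If you later want to align your own
recording, you can adapt it first. A minimal sketch, using a hypothetical
input file ``my.wav`` that is not part of this tutorial:

.. code-block:: python3

   import torchaudio
   import torchaudio.functional as F

   wav, orig_sr = torchaudio.load("my.wav")  # hypothetical input file
   if wav.shape[0] > 1:
       # Down-mix multi-channel audio to mono
       wav = wav.mean(dim=0, keepdim=True)
   if orig_sr != 16000:
       # Resample to the 16 kHz expected by the model
       wav = F.resample(wav, orig_freq=orig_sr, new_freq=16000)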
|
|
| .. raw:: html |
|
|
| <table> |
| <tr> |
| <th>Wave filename</th> |
| <th>Content</th> |
| <th>Text</th> |
| </tr> |
| <tr> |
| <td>Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav</td> |
| <td> |
| <audio title="Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav" controls="controls"> |
| <source src="/icefall/_static/kaldi-align/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav" type="audio/wav"> |
| Your browser does not support the <code>audio</code> element. |
| </audio> |
| </td> |
| <td> |
| i had that curiosity beside me at this moment |
| </td> |
| </tr> |
| </table> |
|
|
We use the test model from the
`CTC FORCED ALIGNMENT API TUTORIAL <https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html>`_.
|
|
| .. code-block:: python3 |
|
|
| import torch |
|
|
| bundle = torchaudio.pipelines.MMS_FA |
|
|
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
| model = bundle.get_model(with_star=False).to(device) |
|
|
| The model is downloaded to:: |
| |
| $HOME/.cache/torch/hub/checkpoints/model.pt |
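
As a small extra check (our addition), verify that the sample rate the
model expects matches that of the test wave:

.. code-block:: python3

   assert bundle.sample_rate == sr  # MMS_FA operates on 16 kHz audio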
|
|
| Compute log_probs |
| ----------------- |
|
|
.. code-block:: python3
|
|
| with torch.inference_mode(): |
| emission, _ = model(waveform.to(device)) |
| print(emission.shape) |
|
|
| It should print:: |
| |
| torch.Size([1, 169, 28]) |
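
The three dimensions are ``(batch, frames, tokens)``: one utterance, 169
acoustic frames, and 28 output units, i.e., 27 tokens plus the CTC blank.
To see how much audio one frame covers, divide the wave duration by the
number of frames (this snippet is our addition):

.. code-block:: python3

   num_frames = emission.size(1)
   frame_dur = waveform.size(1) / sr / num_frames
   print(f"{frame_dur * 1000:.1f} ms per frame")  # about 20 ms per frame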
|
|
| Create token2id and id2token |
| ---------------------------- |
|
|
| .. code-block:: python3 |
|
|
   token2id = bundle.get_dict(star=None)
   id2token = {i: t for t, i in token2id.items()}

   # Rename the blank token "-" to "<eps>", the name the FST tools below
   # expect for id 0. id2token keeps "-" so that blank frames are printed
   # as "-" in the alignments below.
   token2id["<eps>"] = 0
   del token2id["-"]
|
|
| Create word2id and id2word |
| -------------------------- |
|
|
| .. code-block:: python3 |
|
|
   # Note: the word order produced by set() is arbitrary; any order works,
   # as long as id 0 is reserved for eps.
   words = list(set(transcript))
   word2id = dict()
   word2id["eps"] = 0
   for i, w in enumerate(words):
       word2id[w] = i + 1

   id2word = {i: w for w, i in word2id.items()}
|
|
| Note that we only use words from the transcript of the test wave. |
|
|
| Generate lexicon-related files |
| ------------------------------ |
|
|
| We use the code below to generate the following 4 files: |
|
|
| - ``lexicon.txt`` |
| - ``tokens.txt`` |
| - ``words.txt`` |
| - ``lexicon_disambig.txt`` |
|
|
| .. caution:: |
|
|
| ``words.txt`` contains only words from the transcript of the test wave. |
|
|
| .. code-block:: python3 |
|
|
   # add_disambig_symbols is defined in icefall's
   # egs/librispeech/ASR/local/prepare_lang.py
   from prepare_lang import add_disambig_symbols

   lexicon = [(w, list(w)) for w in word2id if w != "eps"]
   lexicon_disambig, max_disambig_id = add_disambig_symbols(lexicon)
|
|
| with open('lexicon.txt', 'w', encoding='utf-8') as f: |
| for w, tokens in lexicon: |
| f.write(f"{w} {' '.join(tokens)}\n") |
|
|
| with open('lexicon_disambig.txt', 'w', encoding='utf-8') as f: |
| for w, tokens in lexicon_disambig: |
| f.write(f"{w} {' '.join(tokens)}\n") |
|
|
   with open('tokens.txt', 'w', encoding='utf-8') as f:
       for t, i in token2id.items():
           f.write(f"{t} {i}\n")

       # Append the disambiguation symbols #0, #1, ..., assigning them
       # ids right after the regular tokens.
       for k in range(max_disambig_id + 2):
           f.write(f"#{k} {len(token2id) + k}\n")
|
|
| with open('words.txt', 'w', encoding='utf-8') as f: |
| for w, i in word2id.items(): |
| f.write(f"{w} {i}\n") |
| f.write(f'#0 {len(word2id)}\n') |
|
|
|
|
To give you an idea of what the generated files look like, the command::
| |
| head -n 50 lexicon.txt lexicon_disambig.txt tokens.txt words.txt |
|
|
| prints:: |
| |
| ==> lexicon.txt <== |
| moment m o m e n t |
| beside b e s i d e |
| i i |
| this t h i s |
| curiosity c u r i o s i t y |
| had h a d |
| that t h a t |
| at a t |
| me m e |
|
|
| ==> lexicon_disambig.txt <== |
| moment m o m e n t |
| beside b e s i d e |
| i i |
| this t h i s |
| curiosity c u r i o s i t y |
| had h a d |
| that t h a t |
| at a t |
| me m e |
|
|
| ==> tokens.txt <== |
| a 1 |
| i 2 |
| e 3 |
| n 4 |
| o 5 |
| u 6 |
| t 7 |
| s 8 |
| r 9 |
| m 10 |
| k 11 |
| l 12 |
| d 13 |
| g 14 |
| h 15 |
| y 16 |
| b 17 |
| p 18 |
| w 19 |
| c 20 |
| v 21 |
| j 22 |
| z 23 |
| f 24 |
| ' 25 |
| q 26 |
| x 27 |
| <eps> 0 |
| #0 28 |
| #1 29 |
|
|
| ==> words.txt <== |
| eps 0 |
| moment 1 |
| beside 2 |
| i 3 |
| this 4 |
| curiosity 5 |
| had 6 |
| that 7 |
| at 8 |
| me 9 |
| #0 10 |
|
|
| .. note:: |
|
|
   This test model uses characters as its modeling unit. If your model
   uses a different type of modeling unit, the same code can be used
   without any changes.
|
|
| Convert transcript to an FST graph |
| ---------------------------------- |
|
|
| .. code-block:: bash |
|
|
| egs/librispeech/ASR/local/prepare_lang_fst.py --lang-dir ./ |
|
|
The above command should generate two files, ``H.fst`` and ``HL.fst``,
from the files in ``--lang-dir``. ``H`` models the CTC topology, i.e.,
blanks and token repeats, and ``HL`` is the composition of ``H`` with the
lexicon ``L``, mapping frame-level token sequences directly to words. We
will use ``HL.fst`` below::
| |
| -rw-r--r-- 1 root root 13K Jun 12 08:28 H.fst |
| -rw-r--r-- 1 root root 3.7K Jun 12 08:28 HL.fst |
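
As a quick well-formedness check (our addition), the generated graphs can
be loaded with ``kaldifst``, using the same call the aligner below uses;
loading fails if a file is missing or malformed:

.. code-block:: python3

   import kaldifst

   H = kaldifst.StdVectorFst.read("H.fst")
   HL = kaldifst.StdVectorFst.read("HL.fst")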
|
|
Forced aligner
--------------
|
|
| Now, everything is ready. We can use the following code to get forced alignments. |
|
|
| .. code-block:: python3 |
|
|
| from kaldi_decoder import DecodableCtc, FasterDecoder, FasterDecoderOptions |
| import kaldifst |
|
|
   def force_align():
       HL = kaldifst.StdVectorFst.read("./HL.fst")
       # emission was computed above; take the only utterance in the batch.
       decodable = DecodableCtc(emission[0].contiguous().cpu().numpy())
       decoder_opts = FasterDecoderOptions(max_active=3000)
       decoder = FasterDecoder(HL, decoder_opts)
       decoder.decode(decodable)
       if not decoder.reached_final():
           print("failed to decode")
           return None
       ok, best_path = decoder.get_best_path()
       if not ok:
           print("failed to get the best path")
           return None

       (
           ok,
           isymbols_out,
           osymbols_out,
           total_weight,
       ) = kaldifst.get_linear_symbol_sequence(best_path)
       if not ok:
           print("failed to get the linear symbol sequence")
           return None

       # We need to use i - 1 here since token ids were incremented during
       # HL construction (id 0 is reserved for epsilon in the FST).
       alignment = [i - 1 for i in isymbols_out]
       return alignment
|
|
| alignment = force_align() |
|
|
| for i, a in enumerate(alignment): |
| print(i, id2token[a]) |
|
|
| The output should be identical to |
| `<https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html#frame-level-alignments>`_. |
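
If you want to cross-check the result against torchaudio's own aligner,
the linked tutorial uses ``torchaudio.functional.forced_align``. A sketch
of the comparison (our addition; note that the letter ids in ``token2id``
were not changed when we renamed the blank token above):

.. code-block:: python3

   import torchaudio.functional as F

   tokens = [token2id[c] for word in transcript for c in word]
   targets = torch.tensor([tokens], dtype=torch.int32, device=device)
   frames, scores = F.forced_align(emission, targets, blank=0)
   # frames[0] should match the `alignment` computed above.
   print(frames[0].tolist() == alignment)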
|
|
| For ease of reference, we list the output below:: |
| |
| 0 - |
| 1 - |
| 2 - |
| 3 - |
| 4 - |
| 5 - |
| 6 - |
| 7 - |
| 8 - |
| 9 - |
| 10 - |
| 11 - |
| 12 - |
| 13 - |
| 14 - |
| 15 - |
| 16 - |
| 17 - |
| 18 - |
| 19 - |
| 20 - |
| 21 - |
| 22 - |
| 23 - |
| 24 - |
| 25 - |
| 26 - |
| 27 - |
| 28 - |
| 29 - |
| 30 - |
| 31 - |
| 32 i |
| 33 - |
| 34 - |
| 35 h |
| 36 h |
| 37 a |
| 38 - |
| 39 - |
| 40 - |
| 41 d |
| 42 - |
| 43 - |
| 44 t |
| 45 h |
| 46 - |
| 47 a |
| 48 - |
| 49 - |
| 50 t |
| 51 - |
| 52 - |
| 53 - |
| 54 c |
| 55 - |
| 56 - |
| 57 - |
| 58 u |
| 59 u |
| 60 - |
| 61 - |
| 62 - |
| 63 r |
| 64 - |
| 65 i |
| 66 - |
| 67 - |
| 68 - |
| 69 - |
| 70 - |
| 71 - |
| 72 o |
| 73 - |
| 74 - |
| 75 - |
| 76 - |
| 77 - |
| 78 - |
| 79 s |
| 80 - |
| 81 - |
| 82 - |
| 83 i |
| 84 - |
| 85 t |
| 86 - |
| 87 - |
| 88 y |
| 89 - |
| 90 - |
| 91 - |
| 92 - |
| 93 b |
| 94 - |
| 95 e |
| 96 - |
| 97 - |
| 98 - |
| 99 - |
| 100 - |
| 101 s |
| 102 - |
| 103 - |
| 104 - |
| 105 - |
| 106 - |
| 107 - |
| 108 - |
| 109 - |
| 110 i |
| 111 - |
| 112 - |
| 113 d |
| 114 e |
| 115 - |
| 116 m |
| 117 - |
| 118 - |
| 119 e |
| 120 - |
| 121 - |
| 122 - |
| 123 - |
| 124 a |
| 125 - |
| 126 - |
| 127 t |
| 128 - |
| 129 t |
| 130 h |
| 131 - |
| 132 i |
| 133 - |
| 134 - |
| 135 - |
| 136 s |
| 137 - |
| 138 - |
| 139 - |
| 140 - |
| 141 m |
| 142 - |
| 143 - |
| 144 o |
| 145 - |
| 146 - |
| 147 - |
| 148 m |
| 149 - |
| 150 - |
| 151 e |
| 152 - |
| 153 n |
| 154 - |
| 155 t |
| 156 - |
| 157 - |
| 158 - |
| 159 - |
| 160 - |
| 161 - |
| 162 - |
| 163 - |
| 164 - |
| 165 - |
| 166 - |
| 167 - |
| 168 - |
|
|
To merge tokens, we use:

.. code-block:: python3

   from icefall.ctc import merge_tokens

   token_spans = merge_tokens(alignment)
   for span in token_spans:
       print(id2token[span.token], span.start, span.end)
|
|
| The output is given below:: |
| |
| i 32 33 |
| h 35 37 |
| a 37 38 |
| d 41 42 |
| t 44 45 |
| h 45 46 |
| a 47 48 |
| t 50 51 |
| c 54 55 |
| u 58 60 |
| r 63 64 |
| i 65 66 |
| o 72 73 |
| s 79 80 |
| i 83 84 |
| t 85 86 |
| y 88 89 |
| b 93 94 |
| e 95 96 |
| s 101 102 |
| i 110 111 |
| d 113 114 |
| e 114 115 |
| m 116 117 |
| e 119 120 |
| a 124 125 |
| t 127 128 |
| t 129 130 |
| h 130 131 |
| i 132 133 |
| s 136 137 |
| m 141 142 |
| o 144 145 |
| m 148 149 |
| e 151 152 |
| n 153 154 |
| t 155 156 |
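
The ``start``/``end`` values above are frame indices. To convert them to
seconds, scale by the number of waveform samples per frame, as the
``preview_word`` helper below also does (this snippet is our addition):

.. code-block:: python3

   ratio = waveform.size(1) / emission.size(1)  # samples per frame
   for span in token_spans:
       start = span.start * ratio / sr
       end = span.end * ratio / sr
       print(f"{id2token[span.token]}: {start:.3f} - {end:.3f} sec")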
|
|
All of the code below is adapted from
`<https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html>`_.
|
|
| Segment each word using the computed alignments |
| ----------------------------------------------- |
|
|
| .. code-block:: python3 |
|
|
   def unflatten(list_, lengths):
       """Group the flat list of token spans into one sublist per word."""
       assert len(list_) == sum(lengths)
       i = 0
       ret = []
       for l in lengths:
           ret.append(list_[i : i + l])
           i += l
       return ret
|
|
|
|
| word_spans = unflatten(token_spans, [len(word) for word in transcript]) |
| print(word_spans) |
|
|
| The output is:: |
| |
| [[TokenSpan(token=2, start=32, end=33)], |
| [TokenSpan(token=15, start=35, end=37), TokenSpan(token=1, start=37, end=38), TokenSpan(token=13, start=41, end=42)], |
| [TokenSpan(token=7, start=44, end=45), TokenSpan(token=15, start=45, end=46), TokenSpan(token=1, start=47, end=48), TokenSpan(token=7, start=50, end=51)], |
| [TokenSpan(token=20, start=54, end=55), TokenSpan(token=6, start=58, end=60), TokenSpan(token=9, start=63, end=64), TokenSpan(token=2, start=65, end=66), TokenSpan(token=5, start=72, end=73), TokenSpan(token=8, start=79, end=80), TokenSpan(token=2, start=83, end=84), TokenSpan(token=7, start=85, end=86), TokenSpan(token=16, start=88, end=89)], |
| [TokenSpan(token=17, start=93, end=94), TokenSpan(token=3, start=95, end=96), TokenSpan(token=8, start=101, end=102), TokenSpan(token=2, start=110, end=111), TokenSpan(token=13, start=113, end=114), TokenSpan(token=3, start=114, end=115)], |
| [TokenSpan(token=10, start=116, end=117), TokenSpan(token=3, start=119, end=120)], |
| [TokenSpan(token=1, start=124, end=125), TokenSpan(token=7, start=127, end=128)], |
| [TokenSpan(token=7, start=129, end=130), TokenSpan(token=15, start=130, end=131), TokenSpan(token=2, start=132, end=133), TokenSpan(token=8, start=136, end=137)], |
| [TokenSpan(token=10, start=141, end=142), TokenSpan(token=5, start=144, end=145), TokenSpan(token=10, start=148, end=149), TokenSpan(token=3, start=151, end=152), TokenSpan(token=4, start=153, end=154), TokenSpan(token=7, start=155, end=156)] |
| ] |
|
|
|
|
| .. code-block:: python3 |
|
|
   import IPython.display

   def preview_word(waveform, spans, num_frames, transcript, sample_rate=bundle.sample_rate):
       ratio = waveform.size(1) / num_frames
       x0 = int(ratio * spans[0].start)
       x1 = int(ratio * spans[-1].end)
       print(f"{transcript} {x0 / sample_rate:.3f} - {x1 / sample_rate:.3f} sec")
       segment = waveform[:, x0:x1]
       return IPython.display.Audio(segment.numpy(), rate=sample_rate)

   num_frames = emission.size(1)
|
|
| .. code-block:: python3 |
|
|
| preview_word(waveform, word_spans[0], num_frames, transcript[0]) |
| preview_word(waveform, word_spans[1], num_frames, transcript[1]) |
| preview_word(waveform, word_spans[2], num_frames, transcript[2]) |
| preview_word(waveform, word_spans[3], num_frames, transcript[3]) |
| preview_word(waveform, word_spans[4], num_frames, transcript[4]) |
| preview_word(waveform, word_spans[5], num_frames, transcript[5]) |
| preview_word(waveform, word_spans[6], num_frames, transcript[6]) |
| preview_word(waveform, word_spans[7], num_frames, transcript[7]) |
| preview_word(waveform, word_spans[8], num_frames, transcript[8]) |
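
The nine calls above can equivalently be written as a loop. In a notebook,
only the audio widget of the last call is displayed, but the time of every
word is printed:

.. code-block:: python3

   for word, spans in zip(transcript, word_spans):
       preview_word(waveform, spans, num_frames, word)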
|
|
The segmented wave of each word, along with its timestamp, is given below:
|
|
| .. raw:: html |
|
|
| <table> |
| <tr> |
| <th>Word</th> |
| <th>Time</th> |
| <th>Wave</th> |
| </tr> |
| <tr> |
| <td>i</td> |
| <td>0.644 - 0.664 sec</td> |
| <td> |
| <audio title="i.wav" controls="controls"> |
| <source src="/icefall/_static/kaldi-align/i.wav" type="audio/wav"> |
| Your browser does not support the <code>audio</code> element. |
| </audio> |
| </td> |
| </tr> |
| <tr> |
| <td>had</td> |
| <td>0.704 - 0.845 sec</td> |
| <td> |
| <audio title="had.wav" controls="controls"> |
| <source src="/icefall/_static/kaldi-align/had.wav" type="audio/wav"> |
| Your browser does not support the <code>audio</code> element. |
| </audio> |
| </td> |
| </tr> |
| <tr> |
| <td>that</td> |
| <td>0.885 - 1.026 sec</td> |
| <td> |
| <audio title="that.wav" controls="controls"> |
| <source src="/icefall/_static/kaldi-align/that.wav" type="audio/wav"> |
| Your browser does not support the <code>audio</code> element. |
| </audio> |
| </td> |
| </tr> |
| <tr> |
| <td>curiosity</td> |
| <td>1.086 - 1.790 sec</td> |
| <td> |
| <audio title="curiosity.wav" controls="controls"> |
| <source src="/icefall/_static/kaldi-align/curiosity.wav" type="audio/wav"> |
| Your browser does not support the <code>audio</code> element. |
| </audio> |
| </td> |
| </tr> |
| <tr> |
| <td>beside</td> |
| <td>1.871 - 2.314 sec</td> |
| <td> |
| <audio title="beside.wav" controls="controls"> |
| <source src="/icefall/_static/kaldi-align/beside.wav" type="audio/wav"> |
| Your browser does not support the <code>audio</code> element. |
| </audio> |
| </td> |
| </tr> |
| <tr> |
| <td>me</td> |
| <td>2.334 - 2.414 sec</td> |
| <td> |
| <audio title="me.wav" controls="controls"> |
| <source src="/icefall/_static/kaldi-align/me.wav" type="audio/wav"> |
| Your browser does not support the <code>audio</code> element. |
| </audio> |
| </td> |
| </tr> |
| <tr> |
| <td>at</td> |
| <td>2.495 - 2.575 sec</td> |
| <td> |
| <audio title="at.wav" controls="controls"> |
| <source src="/icefall/_static/kaldi-align/at.wav" type="audio/wav"> |
| Your browser does not support the <code>audio</code> element. |
| </audio> |
| </td> |
| </tr> |
| <tr> |
| <td>this</td> |
| <td>2.595 - 2.756 sec</td> |
| <td> |
| <audio title="this.wav" controls="controls"> |
| <source src="/icefall/_static/kaldi-align/this.wav" type="audio/wav"> |
| Your browser does not support the <code>audio</code> element. |
| </audio> |
| </td> |
| </tr> |
| <tr> |
| <td>moment</td> |
| <td>2.837 - 3.138 sec</td> |
| <td> |
| <audio title="moment.wav" controls="controls"> |
| <source src="/icefall/_static/kaldi-align/moment.wav" type="audio/wav"> |
| Your browser does not support the <code>audio</code> element. |
| </audio> |
| </td> |
| </tr> |
| </table> |
|
|
We repeat the whole wave below for ease of reference:
|
|
| .. raw:: html |
|
|
| <table> |
| <tr> |
| <th>Wave filename</th> |
| <th>Content</th> |
| <th>Text</th> |
| </tr> |
| <tr> |
| <td>Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav</td> |
| <td> |
| <audio title="Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav" controls="controls"> |
| <source src="/icefall/_static/kaldi-align/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav" type="audio/wav"> |
| Your browser does not support the <code>audio</code> element. |
| </audio> |
| </td> |
| <td> |
| i had that curiosity beside me at this moment |
| </td> |
| </tr> |
| </table> |
|
|
| Summary |
| ------- |
|
|
Congratulations! You have succeeded in using the FST-based approach to
compute the forced alignment of a test wave.
|
|