# Supplemental reading and resources

This Unit pieced together many components from previous units, introducing the tasks of speech-to-speech translation, 
voice assistants and speaker diarization. The supplemental reading material is thus split into these three new tasks 
for your convenience:

Speech-to-speech translation:
* [STST with discrete units](https://ai.facebook.com/blog/advancing-direct-speech-to-speech-modeling-with-discrete-units/) by Meta AI: a direct approach to STST through encoder-decoder models
* [Hokkien direct speech-to-speech translation](https://ai.facebook.com/blog/ai-translation-hokkien/) by Meta AI: a direct approach to STST using encoder-decoder models with a two-stage decoder
* [Leveraging unsupervised and weakly-supervised data to improve direct STST](https://arxiv.org/abs/2203.13339) by Google: proposes new approaches for leveraging unsupervised and weakly supervised data for training direct STST models and a small change to the Transformer architecture
* [Translatotron-2](https://google-research.github.io/lingvo-lab/translatotron2/) by Google: a system that is able to retain speaker characteristics in translated speech

Voice Assistant:
* [Accurate wakeword detection](https://www.amazon.science/publications/accurate-detection-of-wake-word-start-and-end-using-a-cnn) by Amazon: a low latency approach for wakeword detection for on-device applications
* [RNN-Transducer Architecture](https://arxiv.org/pdf/1811.06621.pdf) by Google: a modification to the CTC architecture for streaming on-device ASR

Meeting Transcriptions:
* [pyannote.audio Technical Report](https://huggingface.co/pyannote/speaker-diarization/blob/main/technical_report_2.1.pdf) by Hervé Bredin: this report describes the main principles behind the `pyannote.audio` speaker diarization pipeline
* [Whisper X](https://arxiv.org/pdf/2303.00747.pdf) by Max Bain et al.: a superior approach to computing word-level timestamps using the Whisper model

