Seq2Seq Model
The Seq2Seq approach to speech translation chains the following three systems in sequence:
- ASR (Automatic Speech Recognition): This system converts the recorded speech to text in the same language as the audio. For our example, it takes the audio file as input and tries to produce the sentence ‘I like watching cricket.’
- MT (Machine Translation): This system takes the transcribed sentence from the ASR step and translates it into the target language. In our case, the output in Hindi is ‘मुझे क्रिकेट देखना पसंद है’.
- TTS (Text-to-Speech synthesis): This system takes the translated text from the MT step and converts it back to audio.
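The three stages above can be sketched as a simple function chain. This is only an illustration of the data flow: the functions `asr`, `mt`, `tts`, and `cascade_s2st`, and the placeholder input `b"<english-audio>"`, are hypothetical names, and each stage is a lookup-table stub rather than a real model.

```python
# Hypothetical stand-ins for real ASR, MT, and TTS models. Each one is a stub
# so the sketch runs without any model or audio library.

def asr(audio: bytes) -> str:
    """Speech -> text in the source language (stubbed with a lookup)."""
    return {b"<english-audio>": "I like watching cricket."}[audio]

def mt(text: str) -> str:
    """Source-language text -> target-language text (stubbed with a lookup)."""
    return {"I like watching cricket.": "मुझे क्रिकेट देखना पसंद है"}[text]

def tts(text: str) -> bytes:
    """Target-language text -> audio (here just UTF-8 bytes as a placeholder)."""
    return text.encode("utf-8")

def cascade_s2st(audio: bytes) -> bytes:
    """Run the three subsystems one after another."""
    source_text = asr(audio)       # stage 1: transcribe
    target_text = mt(source_text)  # stage 2: translate
    return tts(target_text)        # stage 3: synthesize

translated_audio = cascade_s2st(b"<english-audio>")
print(translated_audio.decode("utf-8"))
```

Each stage only sees the output of the stage before it, so nothing downstream can recover information that an earlier stage lost.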
The main drawbacks of such a system were:
- High latency: The data has to pass through three subsystems one after another.
- Cascading errors: An error introduced in ASR compounds in MT and TTS.
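A rough back-of-the-envelope model shows why both drawbacks grow with the number of stages. The numbers below are made up for illustration and are not measurements of any real system. The model also assumes that the stages run strictly one after another and that each stage fails independently of the others, which is a simplification.

```python
from math import prod

# Illustrative, made-up numbers; not measurements of any real system.
stage_latency_ms = {"ASR": 300, "MT": 100, "TTS": 200}

# Assumed probability that each stage handles an utterance without error.
stage_accuracy = {"ASR": 0.90, "MT": 0.90, "TTS": 0.95}

# With strictly sequential stages, the latencies add up.
total_latency_ms = sum(stage_latency_ms.values())

# Under the independence assumption, the per-stage success rates multiply,
# so the whole pipeline is less reliable than its weakest stage.
end_to_end_accuracy = prod(stage_accuracy.values())

print(f"total latency: {total_latency_ms} ms")
print(f"end-to-end accuracy: {end_to_end_accuracy:.3f}")
```

Under these assumptions the pipeline is slower than any single stage and less accurate than its least accurate stage.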
Translatotron 2 Speech-to-Speech Translation Architecture
A speech-to-speech translation system translates input audio from one language into another. Such systems are abbreviated as S2ST (Speech-to-Speech Translation) systems, or S2S (Speech-to-Speech) systems in general. Their primary objective is to enable communication among people who speak different languages.