Seq2Seq Model

A Seq2Seq model consists of the following three types of systems:

  • ASR (Automatic Speech Recognition): This system converts the recorded voice to text in the same language as the audio. For our example, it takes the audio file as input and tries to produce the sentence ‘I like watching cricket.’
  • MT (Machine Translation): This takes the transcript from step 1 and translates it into the target language. In our case, it produces ‘मुझे क्रिकेट देखना पसंद है’ (‘I like watching cricket’ in Hindi).
  • TTS (Text-to-Speech Synthesis): This takes the translated text from step 2 and converts it back to audio in the target language.
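The three-stage cascade above can be sketched as a simple chain of functions. This is a minimal illustration, not a real API: the function names (run_asr, run_mt, run_tts) and the toy lookup table are assumptions standing in for actual models.

```python
def run_asr(audio: bytes) -> str:
    """Step 1: transcribe source-language audio to text (stubbed model)."""
    return "I like watching cricket."

def run_mt(text: str) -> str:
    """Step 2: translate the transcript into the target language (stubbed model)."""
    translations = {"I like watching cricket.": "मुझे क्रिकेट देखना पसंद है"}
    return translations.get(text, text)

def run_tts(text: str) -> bytes:
    """Step 3: synthesize target-language audio from text (stubbed model)."""
    return text.encode("utf-8")  # placeholder for a real waveform

def cascade_s2st(audio: bytes) -> bytes:
    """Chain the three subsystems: audio -> transcript -> translation -> audio."""
    transcript = run_asr(audio)
    translation = run_mt(transcript)
    return run_tts(translation)

print(cascade_s2st(b"raw-audio-bytes").decode("utf-8"))
# मुझे क्रिकेट देखना पसंद है
```

Note that each stage consumes the *text* output of the previous one, which is exactly why an upstream transcription mistake propagates unchecked into translation and synthesis.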

The main drawbacks of such a system are:

  • High latency: Data must pass sequentially through three subsystems before any output is produced.
  • Cascading of errors: Errors introduced by the ASR stage compound through the MT and TTS stages.
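A back-of-the-envelope calculation makes both drawbacks concrete. The latency and accuracy numbers below are illustrative assumptions, not measurements: per-stage latencies add up, and per-stage accuracies multiply, so the end-to-end accuracy is lower than that of any single stage.

```python
# Assumed (illustrative) per-stage figures for a cascade S2ST system.
stage_latency_ms = {"ASR": 300, "MT": 150, "TTS": 250}
stage_accuracy = {"ASR": 0.90, "MT": 0.90, "TTS": 0.95}

# Latencies add: the pipeline is strictly sequential.
total_latency = sum(stage_latency_ms.values())

# Errors compound: a stage can only be right if its input was right.
end_to_end_accuracy = 1.0
for acc in stage_accuracy.values():
    end_to_end_accuracy *= acc

print(f"total latency: {total_latency} ms")                # 700 ms
print(f"end-to-end accuracy: {end_to_end_accuracy:.4f}")   # 0.7695
```

Even with each stage at 90–95% accuracy, the cascade as a whole lands below 77% in this toy calculation, which is the motivation for end-to-end models like Translatotron.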

Translatotron 2 Speech-to-Speech Translation Architecture

A speech-to-speech translation system translates input audio in one language into audio in another language. Such systems are abbreviated as S2ST (Speech-to-Speech Translation) or, more generally, S2S (Speech-to-Speech) systems. Their primary objective is to enable communication between people who speak different languages.


Conclusion

When Google introduced Translatotron 1 for end-to-end S2ST, it performed well but could not match the performance of cascade S2ST systems. Translatotron 2 closed this gap and matched cascade performance. As per Google, the primary improvement comes from the high-level architecture, i.e. the way the attention module connects the Encoder, Decoder, and Speech Synthesizer. The specific choice of components also helped, but one can always experiment with those.