ABSTRACT
In a continent-sized country like Brazil, collecting feedback on
governmental services such as education, healthcare, and security
is challenging and impractical to perform manually, except
through sampling techniques. With advancements in machine learning,
particularly models based on transformers, it is now possible
to automate this process on a large scale, enabling, for instance,
the dissemination of health campaign information or the collection
of citizen opinions on recently used services. This paper focuses
on speech-to-text transcription, a crucial step for enabling large-scale
voice-based responses. We explored scalability challenges and
evaluated combinations of transcription models and audio formats
(WAV, FLAC, and MP3), aiming to balance the computational cost
and transcription quality. Our results showed that MP3 files sampled
at 14 kHz provide transcription quality comparable to WAV
files sampled at 16 kHz while requiring only 11% of the storage
size. Furthermore, we demonstrated that smaller models, such as
Wav2Vec2-XLSR-53 with 3.17 × 10^8 parameters, can achieve results
similar to larger models, such as Seamless M4T, which has
approximately an order of magnitude more parameters.
Computer on the Beach is a technical-scientific event that brings together professionals, researchers, and academics in the field of Computing to discuss research and market trends in computing across its many areas.