ABSTRACT
Punctuation plays a fundamental role in conveying the correct meaning
in written texts. Punctuation errors can therefore significantly
impair the way a message is interpreted, whether in formal or
informal contexts. Machine learning, combined with recent techniques
in natural language processing, has been widely applied to the task of
punctuation restoration in languages such as English. However, although
the task has been widely studied in other languages, its application
to Portuguese is still quite limited. In this
work, we propose to adapt a punctuation restoration model to
formal texts in the Portuguese language and to evaluate
the model’s behavior in informal texts. The Portuguese
Legal Sentences v3 dataset was used both to train the model and
for the in-domain evaluation. For the cross-domain
evaluation, the IWSLT (International Workshop on Spoken Language
Translation) dataset was used, consisting of transcripts of lectures
known as TED Talks. The results indicate that the model trained on the
largest amount of data, with all question marks mapped
to full stops, performed satisfactorily in the formal context, suggesting
that the adopted methodology was adequate for the proposed
task. Furthermore, the scarcity of question marks was found to
negatively affect the model’s performance. In the informal
context, the results on the evaluation metrics were unsatisfactory, suggesting
that formal and informal sentences have distinct structures and
that the model was unable to generalize adequately to the informal
context.
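
As an illustration of the label mapping mentioned above, the following is a minimal sketch, not the implementation used in this work. It assumes punctuation restoration is framed as labeling each word with the mark that follows it; the label names and the function `to_labeled_tokens` are hypothetical and do not come from the paper.

```python
import re

# Hypothetical label set; the abstract does not name the labels actually used.
PUNCT_TO_LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def to_labeled_tokens(text, merge_question=False):
    """Turn punctuated text into (lowercased word, label) pairs.

    Each word is labeled with the punctuation mark that follows it, or "O"
    when no mark follows. With merge_question=True, question marks are
    mapped to the full-stop label.
    """
    pairs = []
    for token in re.findall(r"\w+|[,.?]", text):
        if token in PUNCT_TO_LABEL:
            if pairs:  # attach the mark to the preceding word
                label = PUNCT_TO_LABEL[token]
                if merge_question and label == "QUESTION":
                    label = "PERIOD"
                pairs[-1] = (pairs[-1][0], label)
        else:
            pairs.append((token.lower(), "O"))
    return pairs


if __name__ == "__main__":
    print(to_labeled_tokens("Foi citado? Sim, foi.", merge_question=True))
```

With `merge_question=True` the rare question-mark class disappears from the label set, which is one way to sidestep the scarcity of question marks at the cost of no longer distinguishing questions from statements.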
Computer on the Beach is a technical-scientific event that aims to bring together professionals, researchers, and academics in the field of Computing to discuss research and market trends in computing across its many areas.