In-Domain and Cross-Domain Evaluation in Punctuation Restoration Using Natural Language Processing

Brenda C. D. Moura; Angel G. de S. Sales; José E. B. de S. Linhares; Fabiann M. D. Barbosa; Amadeu A. Neto

doi:10.14210/cotb.v16.p045-052

Publication date: 27/05/2025

ABSTRACT
Punctuation plays a fundamental role in conveying the correct meaning
of written texts. Punctuation errors can significantly impair the way
a message is interpreted, whether in formal or informal contexts.
Machine learning, combined with recent natural language processing
techniques, has been widely applied to punctuation restoration in
languages such as English. However, despite the wide application of
this task in other languages, its use in Portuguese is still quite
limited. In this work, we adapt a punctuation restoration model for
formal texts in Portuguese and also evaluate the model’s behavior on
informal texts. The Portuguese Legal Sentences v3 dataset was used to
train the model and for the in-domain evaluation. For the cross-domain
evaluation, the IWSLT (International Workshop on Spoken Language
Translation) dataset was used, which consists of transcripts of TED
Talks. The results indicate that the model trained with the largest
amount of data and with all question marks mapped to full stops
performed satisfactorily in the formal context, suggesting that the
adopted methodology was adequate for the proposed task. Furthermore,
the scarcity of question marks was found to negatively impact the
model’s performance. In the informal context, the evaluation metrics
were unsatisfactory, suggesting that formal and informal sentences
have distinct structures and that the model was unable to generalize
adequately to informal text.
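
As an illustration of the label mapping and evaluation described in the abstract, the following is a minimal sketch and not the authors' implementation. The label names (`COMMA`, `PERIOD`, `QUESTION`, `O`), the helper names `text_to_labels` and `per_class_f1`, and the `merge_question` flag are all assumptions introduced here; the abstract does not specify the tag set or the metric code. The sketch shows how punctuated text might be turned into per-word labels, with the option of mapping question marks to full stops, and how per-class precision, recall, and F1 could be computed on aligned label sequences.

```python
import re
from collections import Counter

# Hypothetical label set: the abstract does not give the paper's tag names.
LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def text_to_labels(text, merge_question=False):
    """Turn punctuated text into (word, label) pairs.

    The label describes the mark that follows the word; "O" means none.
    With merge_question=True, question marks are mapped to PERIOD.
    """
    pairs = []
    for token in re.findall(r"\w+|[,.?]", text.lower()):
        if token in LABELS:
            if pairs:  # attach the mark to the preceding word
                label = LABELS[token]
                if merge_question and label == "QUESTION":
                    label = "PERIOD"
                pairs[-1] = (pairs[-1][0], label)
        else:
            pairs.append((token, "O"))
    return pairs


def per_class_f1(gold, pred):
    """Per-label (precision, recall, F1) for aligned label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        if label == "O":  # score only the punctuation classes
            continue
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        prec = tp[label] / n_pred if n_pred else 0.0
        rec = tp[label] / n_gold if n_gold else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = (prec, rec, f1)
    return scores


if __name__ == "__main__":
    sentence = "O réu recorreu, certo? Sim."
    print(text_to_labels(sentence))
    print(text_to_labels(sentence, merge_question=True))
```

Under these assumptions, the same scoring function could be applied to an in-domain test split and to a cross-domain one, so that the per-class scores are directly comparable across the two settings.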

Anais do Computer on the Beach

Computer on the Beach is a technical and scientific event that brings together professionals, researchers, and academics in the field of Computing to discuss research and market trends across the many areas of computing.

Author(s)

Brenda C. D. Moura
Instituto Federal de Educação, Ciência e Tecnologia do Amazonas - Campus Manaus Zona Leste, Manaus, Amazonas, Brazil

Angel G. de S. Sales
Instituto Federal de Educação, Ciência e Tecnologia do Amazonas - Campus Manaus Zona Leste, Manaus, Amazonas, Brazil

José E. B. de S. Linhares
Instituto Federal de Educação, Ciência e Tecnologia do Amazonas - Campus Manaus Zona Leste, Manaus, Amazonas, Brazil

Fabiann M. D. Barbosa
Instituto Federal de Educação, Ciência e Tecnologia do Amazonas - Campus Manaus Zona Leste, Manaus, Amazonas, Brazil

Amadeu A. Neto
Instituto Federal de Educação, Ciência e Tecnologia do Amazonas - Campus Manaus Zona Leste, Manaus, Amazonas, Brazil

Issue
Vol. 16 (2025)

Section
Full Papers
