• Abstract

    Phishing URL Detection with PU Learning and Distribution Divergence Metrics

    Publication date: 09/06/2026

    In this article, we propose a machine learning-based solution for detecting phishing URLs in a Positive-Unlabeled (PU) Learning setting. Our solution combines lexical features with statistical attributes extracted from character and bigram distributions. To that end, we applied distance and divergence metrics against the average distribution of 25 million unlabeled URLs in order to identify suspicious patterns. In the experiments, we evaluated classical (e.g., Random Forest) and deep learning (e.g., CNN, MLP, CNN-LSTM, GRU) models, validating the robustness of the statistical features. The tests achieved accuracy of up to 94.32% and highlighted the influence of language on the distributions: models adapted to Portuguese performed 20% better than those trained with English data. Our approach goes beyond traditional phishing detection methods based on blocklists, reducing the vulnerability window and fostering effective protection of computer system users in general.
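    To make the idea of comparing a URL's character distribution against an average distribution more concrete, the following is a minimal sketch and not the authors' implementation. The alphabet, the add-one smoothing, the choice of metrics (Kullback-Leibler, Jensen-Shannon, Euclidean), and all function names are assumptions made for illustration; the abstract does not say which distance and divergence metrics were used. Bigram distributions could be handled the same way with a vocabulary of character pairs.

```python
import math
import string
from collections import Counter

# Assumed alphabet: lowercase letters, digits, and a few URL punctuation marks.
ALPHABET = string.ascii_lowercase + string.digits + "-._/:"


def char_distribution(text, alphabet=ALPHABET):
    """Relative frequency of each alphabet symbol, with add-one smoothing."""
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values()) + len(alphabet)
    return [(counts[ch] + 1) / total for ch in alphabet]


def average_distribution(urls, alphabet=ALPHABET):
    """Element-wise mean of the per-URL distributions (the reference profile)."""
    dists = [char_distribution(u, alphabet) for u in urls]
    return [sum(d[i] for d in dists) / len(dists) for i in range(len(alphabet))]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; q must be strictly positive."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (symmetric, bounded by 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def euclidean_distance(p, q):
    """Plain Euclidean distance between two distributions."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def divergence_features(url, reference, alphabet=ALPHABET):
    """Statistical features of one URL relative to the reference profile."""
    p = char_distribution(url, alphabet)
    return {
        "kl": kl_divergence(p, reference),
        "js": js_divergence(p, reference),
        "euclidean": euclidean_distance(p, reference),
    }


# Made-up unlabeled sample standing in for the large unlabeled corpus.
unlabeled_sample = [
    "https://example.com/home",
    "https://example.org/produtos/lista",
    "http://example.net/contato",
]
reference_profile = average_distribution(unlabeled_sample)
features = divergence_features("http://examp1e-login.example/x9z7q", reference_profile)
```

    In such a setup, the resulting feature values would be appended to the lexical features of each URL before training a classifier.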

Anais do Computer on the Beach

Computer on the Beach is a technical-scientific event that aims to bring together professionals, researchers, and academics in the field of Computing to discuss research and market trends in computing across its many areas.
