Abstract
In problems with a large volume of unlabeled data, semi-supervised
learning techniques such as self-training are attractive because
they exploit all of the available data without requiring extensive
labeling, which is an expensive process. However, indiscriminately
training a model on pseudolabels can cause undue changes in
the model’s decision boundary. These changes can arise unintentionally
or intentionally, as in malware classification, where attackers
want malicious software to be classified as benign. In this paper, we
propose a dataset that simulates a data stream and is designed to
poison self-training models, in order to evaluate the robustness of
these models against intentional or unintentional poisoning by
unlabeled instances. Our experiments use models from the MOA-SS
framework and show that models that train incrementally and use
prediction confidence as the criterion for including an unlabeled
instance in training are more susceptible to poisoning.
Computer on the Beach is a technical-scientific event that aims to bring together professionals, researchers, and academics in the field of Computing to discuss research and market trends in computing across its many areas.