A unified multi-platform pipeline for collecting and standardizing social media data

Anderson Frasão; Tiago Heinrich; Vinicius Fulber-Garcia

doi:10.14210/cotb.v17.p488-495

Social networks have become a central source of data for large-scale empirical research. However, access to this data is constrained by official API limitations, opaque commercial solutions, and fragmented scraping tools. This work presents a tool for collecting and storing multi-platform data on social networks, designed as an integrated pipeline that includes Instagram, TikTok, Twitter/X, and YouTube. The tool combines automated collection with structured extraction of text, images, videos, audio, metadata, URLs, and comments, organizing these elements into a standardized relational schema with tables for users, media, and comments on each platform. This standardization reduces the engineering effort required to consolidate heterogeneous data, facilitating comparative analyses between networks and enabling the application of multimodal analysis and Natural Language Processing (NLP) methods in different research scenarios, including studies of fraud, engagement, and abusive behavior. In addition, the pipeline incorporates time-based controls, logging, and multi-account and cookie management, making the collection process more robust in the face of blocks, access limits, and platform changes. The tool thus aims to serve as reusable infrastructure for sociotechnical research on social media, pro
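The standardized relational schema described in the abstract (tables for users, media, and comments on each platform) could look roughly like the sketch below. This is only an illustration under stated assumptions: the abstract does not name a database engine or any column names, so the use of SQLite, the `PLATFORMS` tuple, the `{platform}_users` / `{platform}_media` / `{platform}_comments` naming, and every column shown here are hypothetical choices, not the authors' actual schema.

```python
import sqlite3

# Hypothetical platform keys; the abstract lists Instagram, TikTok, Twitter/X, and YouTube.
PLATFORMS = ("instagram", "tiktok", "twitter", "youtube")


def create_schema(conn: sqlite3.Connection) -> None:
    """Create one users/media/comments table triple per platform (assumed layout)."""
    for p in PLATFORMS:
        conn.executescript(f"""
        CREATE TABLE IF NOT EXISTS {p}_users (
            user_id   TEXT PRIMARY KEY,
            username  TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS {p}_media (
            media_id     TEXT PRIMARY KEY,
            user_id      TEXT NOT NULL REFERENCES {p}_users(user_id),
            media_type   TEXT,   -- e.g. image, video, audio
            caption      TEXT,
            url          TEXT,
            collected_at TEXT
        );
        CREATE TABLE IF NOT EXISTS {p}_comments (
            comment_id TEXT PRIMARY KEY,
            media_id   TEXT NOT NULL REFERENCES {p}_media(media_id),
            author_id  TEXT,
            body       TEXT
        );
        """)


def table_names(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [r[0] for r in rows]


def comment_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Count comments per platform with a single UNION ALL query."""
    query = " UNION ALL ".join(
        f"SELECT '{p}' AS platform, COUNT(*) FROM {p}_comments" for p in PLATFORMS
    )
    return dict(conn.execute(query).fetchall())
```

Because every platform shares the same table shape in this sketch, a cross-network comparison such as `comment_counts` reduces to one query over identically structured tables, which is the kind of consolidation effort the abstract says the standardization is meant to reduce.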

Anais do Computer on the Beach

Computer on the Beach is a technical-scientific event that brings together professionals, researchers, and academics in Computing to discuss research and market trends across the field's many areas.
