LEGAL AND ETHICAL BASES FOR CREATING REPRESENTATIVE DATASETS FOR DETECTING MANIFESTATIONS OF CYBERBULLYING IN TEXT CONTENT

Olena SOBKO; Archil CHOCHIA

doi:10.31891/csit-2025-3-14

Authors

Olena SOBKO Khmelnytskyi National University https://orcid.org/0000-0001-5371-5788
Archil CHOCHIA Tallinn University of Technology https://orcid.org/0000-0003-4821-297X

DOI:

https://doi.org/10.31891/csit-2025-3-14

Keywords:

cyberbullying, ethical aspects, legal basis, data representativeness, text content, dataset, discrimination, artificial intelligence, multi-criteria optimization, machine learning

Abstract

The article develops a method for creating representative text datasets for detecting manifestations of cyberbullying in text content, taking ethical and legal principles into account. The primary focus is ensuring fair and equal representation of different demographic groups in text samples, which is critical for creating non-discriminatory and socially responsible artificial intelligence models. Emphasis is placed on compliance with key ethical principles (preventing harm, avoiding bias, and ensuring representativeness) and with provisions of international law, particularly the General Data Protection Regulation. The proposed method includes the following stages: preliminary processing of text data, analysis of distributions according to ethical aspects (age, gender, religion, etc.), and representative adjustment through multi-criteria optimization. To classify text samples according to ethical criteria, machine learning models are trained on prepared balanced samples using appropriate reference datasets. The resulting distributions are compared against official demographic data for Ukraine, which ensures the reliability of the deviation assessment.
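The distribution-analysis stage can be illustrated with a minimal sketch that compares the share of each group in a sample with a target share and reports the deviation in percentage points. The function names, the example labels, and the target shares are hypothetical placeholders chosen for illustration; they are not the article's data or the official demographic figures.

```python
from collections import Counter


def group_proportions(labels):
    """Return the share of each group label in the sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def deviations_pct(labels, target):
    """Absolute deviation of each target group's share, in percentage points."""
    observed = group_proportions(labels)
    return {
        group: abs(observed.get(group, 0.0) - share) * 100
        for group, share in target.items()
    }


# Hypothetical group labels assigned to text samples (assumption).
sample_labels = ["female"] * 30 + ["male"] * 70
# Hypothetical target shares, NOT official statistics (assumption).
target_shares = {"female": 0.54, "male": 0.46}

result = deviations_pct(sample_labels, target_shares)
print(result)
```

A large deviation for any group signals an imbalance that the adjustment stage would then need to correct.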

As a result of applying the developed method, a representative sample was created in which the proportions of ethical groups deviate from the target values by 0.00-0.04%. The statistical metrics obtained confirmed the effectiveness of the selected models and showed that the results comply to a high degree with ethical responsibility requirements. The analysis showed that the initial datasets contained imbalances, which were successfully eliminated through multi-criteria optimization and data augmentation. The developed approach can be integrated into the preparation of training samples for ethically oriented artificial intelligence systems that automatically detect manifestations of cyberbullying in text content, reducing the risk of reproducing social biases and increasing trust in algorithmic decisions.
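To give a sense of how a sample can be adjusted toward target proportions on several attributes at once, the sketch below uses a simple greedy heuristic as a stand-in; it is not the article's multi-criteria optimization procedure. The attribute names, the pool, the target shares, and the helper functions are all hypothetical.

```python
from collections import Counter


def total_deviation(rows, targets):
    """Sum over attributes and groups of |observed share - target share|."""
    n = len(rows)
    score = 0.0
    for attr, shares in targets.items():
        counts = Counter(row[attr] for row in rows)
        for group, share in shares.items():
            score += abs(counts.get(group, 0) / n - share)
    return score


def greedy_select(pool, targets, size):
    """Pick `size` rows, each time adding the row that minimizes total deviation."""
    remaining = list(pool)
    chosen = []
    while len(chosen) < size and remaining:
        best = min(remaining, key=lambda row: total_deviation(chosen + [row], targets))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical imbalanced pool: each row carries predicted group labels (assumption).
pool = (
    [{"gender": "male", "age": "adult"}] * 60
    + [{"gender": "female", "age": "adult"}] * 20
    + [{"gender": "female", "age": "minor"}] * 10
    + [{"gender": "male", "age": "minor"}] * 10
)
# Hypothetical target shares, NOT official statistics (assumption).
targets = {
    "gender": {"female": 0.5, "male": 0.5},
    "age": {"adult": 0.8, "minor": 0.2},
}

selected = greedy_select(pool, targets, size=20)
print(total_deviation(pool, targets), total_deviation(selected, targets))
```

The greedy rule is quadratic in the pool size and offers no optimality guarantee, so it is suitable only for illustrating the idea of balancing several criteria together.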
