HANDLING THE BREAST CANCER RECURRENCE DATA FOR A MORE RELIABLE FORECAST

Authors

DOI:

https://doi.org/10.31891/csit-2023-4-2

Keywords:

machine learning, breast cancer dataset, recurrence events, noise cleaning, performance improving

Abstract

Breast cancer in women is a global problem that affects the gene pool. It has become a prevalent cancer threat for Ukrainian women, while early detection and prevention notably raise survival chances and lower the cost of treatment. Controlling and forecasting recurrence events is a vital part of this problem.

This article deals with data that permit machine-learning forecasts of breast cancer recurrence in patients undergoing therapy. The renewed data set presented in this paper contains 252 cases, of which 206 had no recurrence events and 46 did. It is an improved version of the well-known Ljubljana breast cancer data set (LBCD) from 1988.
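The class counts quoted above imply a noticeably imbalanced dataset. The short sketch below only restates that arithmetic; the variable names are illustrative, and the "majority baseline" is a reference point assumed here, not a classifier taken from the paper.

```python
# Class counts as stated in the abstract.
no_recurrence = 206
recurrence = 46
total = no_recurrence + recurrence

# Share of instances with a recurrence event.
recurrence_share = recurrence / total

# Accuracy of a trivial (assumed) baseline that always predicts "no recurrence".
baseline_accuracy = no_recurrence / total

print(f"instances: {total}")
print(f"recurrence share: {recurrence_share:.1%}")
print(f"majority-baseline accuracy: {baseline_accuracy:.1%}")
```

Because such a baseline reaches about 82% accuracy without detecting a single recurrence, overall accuracy alone says little about how well recurrence cases are handled.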

The aim is to raise the reliability of clinical prognoses of breast cancer recurrence using the updated and improved LBCD. The tasks accompanying this goal are as follows: estimating relevance ranks for the LBCD attributes; evaluating noise levels for the attributes, mainly for the class attribute; reducing the dataset by removing irrelevant and noisy data; imputing (restoring) the missing values of the class attribute; and comparing the performance of the initial and upgraded datasets.
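The first task, ranking attributes by relevance, can be illustrated with a minimal sketch. The abstract does not say which ranking measure the authors used; information gain is assumed here purely as one common choice for categorical attributes. The function names, the attribute names, and the four toy rows are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting rows on `attribute`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def rank_attributes(rows, target):
    """Attribute names sorted from most to least informative about `target`."""
    attrs = [a for a in rows[0] if a != target]
    return sorted(attrs, key=lambda a: information_gain(rows, a, target), reverse=True)


# Synthetic toy rows (NOT real patient data), built so that one attribute
# determines the class completely and the other carries no information.
toy_rows = [
    {"informative": "a", "uninformative": "x", "recurrence": "yes"},
    {"informative": "a", "uninformative": "y", "recurrence": "yes"},
    {"informative": "b", "uninformative": "x", "recurrence": "no"},
    {"informative": "b", "uninformative": "y", "recurrence": "no"},
]

ranking = rank_attributes(toy_rows, "recurrence")
print(ranking)
```

Under this assumed measure, attributes whose gain is close to zero would be candidates for removal in the reduction step.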

Our updated dataset has fewer instances (252 instead of 286) and fewer attributes (six instead of ten); in addition, the class attribute has been cleaned of noise and its missing values have been restored. As a result, classifiers perform much better on the upgraded dataset than on the original one, especially for cases of cancer recurrence. It gives clinicians a more reliable machine-learning diagnosis of breast cancer recurrence using the best-known classifiers.
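Performance "especially for cases of recurrence" is usually read per class rather than as overall accuracy. The sketch below shows one way to compute per-class recall; the abstract does not name the metrics the authors report, so this choice is an assumption, and the labels and predictions are synthetic placeholders rather than results from the paper.

```python
def recall_per_class(y_true, y_pred):
    """Fraction of instances of each true class that were predicted correctly."""
    result = {}
    for c in sorted(set(y_true)):
        predictions_for_c = [p for t, p in zip(y_true, y_pred) if t == c]
        result[c] = sum(p == c for p in predictions_for_c) / len(predictions_for_c)
    return result


# Synthetic labels and predictions (hypothetical, for illustration only).
y_true = ["no", "no", "no", "no", "yes", "yes"]
always_no = ["no"] * len(y_true)                  # ignores recurrence entirely
better = ["no", "no", "no", "yes", "yes", "no"]   # finds one of two recurrences

print(recall_per_class(y_true, always_no))
print(recall_per_class(y_true, better))
```

Comparing the recall of the recurrence class on the original and upgraded datasets, with the same classifier, is one way the final task in the list above could be checked.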

The dataset is useful for devising machine learning models that classify, detect, and forecast the probability of breast cancer recurrence events in clinics. The elaborated dataset ensures much higher performance of machine learning algorithms than the initial prototype. Compared to the prototype, the dataset is more compact, comprising 252 instances instead of 286 and 6 attributes instead of 10. The class (category) attribute of this dataset is entirely free of noise.

Published

2023-12-28

How to Cite

CHUIKO, G., & YAREMCHUK, O. (2023). HANDLING THE BREAST CANCER RECURRENCE DATA FOR A MORE RELIABLE FORECAST. Computer Systems and Information Technologies, (4), 10–15. https://doi.org/10.31891/csit-2023-4-2