DATA UPDATE ALGORITHMS IN THE MACHINE LEARNING SYSTEM

Authors

DOI:

https://doi.org/10.31891/csit-2023-1-1

Keywords:

Data Drift, Data QC, Anomaly Detection, MLOps, Data Validation, Machine Learning, Time-Series

Abstract

This paper analyzes methods for operationalizing anomaly detection and data drift detection as a data validation step in a machine learning system. A pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next. MLOps is a set of practices aimed at the reliable and efficient deployment and maintenance of machine learning models in real-world settings. We propose a solution for operationalizing the Data QC pipeline that uses the technologies described in the theoretical paper [1]. We build the Data QC pipeline on MLFlow, a machine learning lifecycle manager, which serves as the skeleton for our pipelines. This choice follows from the specifics of the task, the problems involved, and the need for ready-made solutions that meet our requirements; detailed explanations for both the Data Drift and Data QC pipelines are given in [1]. To construct either pipeline, we divide the defined solution into steps and wrap them in MLFlow, which registers all artifacts, metrics, and parameters. An artifact in a machine learning system is the result of a process in a pipeline, for example a trained model, an Excel file, or a feature importance image. The paper considers the following stages of the Data QC pipeline: filtering, anomaly detection, reporting, validation, and comparison of new data with historical data. It also considers the Data Drift detection pipeline. Both pipelines are necessary for data validation and processing in the current machine learning life cycle. The task of the Data QC pipeline is to automate the evaluation and validation of new data, which is especially important for real-time Time-Series systems. We study the generation of interactive quality reports and approaches to anomaly and data drift detection for Time-Series systems, and we analyze approaches to implementing an MLOps architecture whose data validation step is provided by the Data QC and Data Drift pipelines.
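
The stages listed in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, a z-score rule stands in for the anomaly detector, a two-sample Kolmogorov-Smirnov statistic stands in for the drift measure, and a plain dictionary stands in for the tracking server. In an MLFlow-based pipeline, the recorded values would typically be passed to `mlflow.log_param`, `mlflow.log_metric`, and `mlflow.log_artifact` instead.

```python
from statistics import mean, stdev


def filter_rows(values):
    """Filtering stage: drop missing readings (None)."""
    return [v for v in values if v is not None]


def detect_anomalies(values, z_threshold=3.0):
    """Anomaly detection stage: flag points whose z-score exceeds the threshold.

    The z-score rule is an illustrative choice, not the paper's method.
    """
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


def ks_statistic(reference, current):
    """Drift measure: largest gap between the two empirical CDFs.

    Both samples are assumed to be non-empty.
    """
    def ecdf(sample, x):
        return sum(1 for s in sample if s <= x) / len(sample)

    points = sorted(set(reference) | set(current))
    return max(abs(ecdf(reference, x) - ecdf(current, x)) for x in points)


def run_data_qc(historical, new_batch, z_threshold=3.0, drift_threshold=0.5):
    """Run the stages in series; each stage consumes the previous output."""
    # The dictionary is a stand-in for a tracking backend such as MLFlow.
    run = {
        "params": {"z_threshold": z_threshold, "drift_threshold": drift_threshold},
        "metrics": {},
    }
    clean = filter_rows(new_batch)
    anomalies = detect_anomalies(clean, z_threshold)
    drift = ks_statistic(historical, clean)
    run["metrics"] = {
        "n_rows": len(clean),
        "n_anomalies": len(anomalies),
        "ks_statistic": drift,
    }
    # Validation stage: the batch passes only if it is free of flagged
    # anomalies and close enough to the historical distribution.
    run["passed"] = not anomalies and drift <= drift_threshold
    return run


if __name__ == "__main__":
    historical = [1.0, 2.0, 3.0, 4.0]
    new_batch = [11.0, None, 12.0, 13.0, 14.0]
    print(run_data_qc(historical, new_batch))
```

The thresholds here are arbitrary placeholders; in practice they would be tuned to the data and, for time series, the comparison would usually be made per window rather than over a whole batch.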

Published

2023-03-30

How to Cite

Boyko, N., & Kovalchuk, R. (2023). DATA UPDATE ALGORITHMS IN THE MACHINE LEARNING SYSTEM. Computer Systems and Information Technologies, (1), 6–13. https://doi.org/10.31891/csit-2023-1-1