METHOD OF CREATING CUSTOM DATASET TO TRAIN CONVOLUTIONAL NEURAL NETWORK

Authors

DOI:

https://doi.org/10.31891/csit-2024-4-5

Keywords:

CNN, dataset, neural network, Roboflow, data preprocessing, data augmentation, labeling

Abstract

The task of creating and developing custom datasets for training convolutional neural networks (CNNs) is essential due to the increasing adoption of deep learning across industries. CNNs have become fundamental tools for various applications, including computer vision, natural language processing, medical imaging, and autonomous systems. However, the success of a CNN depends heavily on the quality and relevance of the data it is trained on. The datasets used to train these models must be diverse, representative of the task at hand, and of sufficient quality to capture the underlying patterns that the CNN needs to learn. Thus, building custom datasets that align with the specific objectives of a neural network plays a critical role in enhancing the performance and generalization capability of the trained model.

This paper focuses on developing a method and subsystem for generating high-quality custom datasets tailored to CNNs. The aim is to provide a framework that automates and streamlines the processes involved in data collection, preprocessing, augmentation, annotation, and validation. Moreover, the method integrates tools that allow the dataset to evolve over time, incorporating new data to adapt to changing requirements or environments, making the system flexible and scalable.

The process of creating a dataset begins with the acquisition of raw data. The data can come from various sources such as images from cameras, videos, sensor feeds, open data repositories, or proprietary datasets. A key consideration during data collection is ensuring that the samples cover the full range of conditions or classes the CNN will encounter in production. For example, in an object recognition task, it is essential to collect images from diverse environments, lighting conditions, and angles to train the model effectively. Ensuring variability in the dataset increases the model's ability to generalize, reducing the risk of poor performance on unseen data.

Data augmentation is a critical step in building a robust dataset, particularly when the size of the dataset is limited. Augmentation techniques introduce variability into the dataset by artificially modifying the existing samples, for example by flipping, rotating, or changing the brightness of images, thereby simulating a wider range of conditions. This helps the CNN generalize better and reduces the risk of overfitting. In essence, it exposes the model to different perspectives and distortions of the same data, strengthening its adaptability to real-world scenarios.
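
As an illustration of the augmentation step, the following minimal Python sketch applies two common transformations, a horizontal flip and a brightness change, to a small grayscale image stored as nested lists. The function names, the flip probability of 0.5, and the brightness range of 0.8 to 1.2 are assumptions chosen for the example; they are not taken from the proposed subsystem.

```python
import random


def hflip(image):
    """Mirror an image (a list of rows of pixel values) left to right."""
    return [list(reversed(row)) for row in image]


def adjust_brightness(image, factor):
    """Scale every pixel by `factor`, clamping the result to 0..255."""
    return [[max(0, min(255, round(p * factor))) for p in row] for row in image]


def augment(image, rng):
    """Return one randomly perturbed copy of `image` (hypothetical policy)."""
    # Flip with probability 0.5; otherwise copy the rows unchanged.
    out = hflip(image) if rng.random() < 0.5 else [list(row) for row in image]
    # Apply a random brightness factor in an assumed range of 0.8 to 1.2.
    return adjust_brightness(out, rng.uniform(0.8, 1.2))


# A toy 2x3 grayscale image; a real pipeline would load image files instead.
image = [[0, 50, 100], [150, 200, 250]]
rng = random.Random(0)  # fixed seed so the augmented set is reproducible
variants = [augment(image, rng) for _ in range(4)]
```

Each call to `augment` leaves the original image untouched and returns a new sample of the same shape, so the label of the source image can be reused for every variant in a classification task.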

Annotation involves labeling the data samples with the correct class or category information. Depending on the task, annotations may include bounding boxes for object detection, segmentation masks for semantic segmentation, or class labels for classification tasks. Well-annotated data is essential, because a CNN learns the relationship between input data and the desired output predictions directly from these labels.
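
To make the bounding-box case concrete, the sketch below converts a box given in pixel corner coordinates into the normalized center, width, and height layout used by the YOLO text annotation format, which is one widely used convention for object detection labels. The helper names and the sample values are hypothetical and serve only as an example; the paper does not prescribe a particular format here.

```python
def to_yolo(box, img_w, img_h):
    """Convert (x_min, y_min, x_max, y_max) in pixels to normalized
    (x_center, y_center, width, height), each in the range 0..1."""
    x_min, y_min, x_max, y_max = box
    if not (0 <= x_min < x_max <= img_w and 0 <= y_min < y_max <= img_h):
        raise ValueError("box must lie inside the image and have positive area")
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return x_center, y_center, width, height


def yolo_line(class_id, box, img_w, img_h):
    """Format one annotation as a line of a YOLO-style label file."""
    values = to_yolo(box, img_w, img_h)
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)


# Hypothetical example: class 0, a 200x200 pixel box in a 400x500 image.
line = yolo_line(0, (100, 50, 300, 250), 400, 500)
```

The validity check rejects boxes that fall outside the image or have zero area, which is a simple way to catch some labeling mistakes before they reach training.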

A balanced dataset is crucial for achieving good performance in CNN models. If one class or condition is overrepresented, the model may become biased toward that class, resulting in poor performance when encountering other classes.
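
A simple way to check balance is to count the samples per class and derive an imbalance ratio together with inverse-frequency class weights. The following sketch assumes that the labels are available as a flat list of class names; the function names, the sample labels, and the weighting formula (total samples divided by the number of classes times the class count) are illustrative choices and not part of the described method.

```python
from collections import Counter


def class_counts(labels):
    """Count how many samples belong to each class."""
    return Counter(labels)


def imbalance_ratio(counts):
    """Ratio of the largest class to the smallest one (1.0 means balanced)."""
    return max(counts.values()) / min(counts.values())


def class_weights(counts):
    """Inverse-frequency weights: rare classes receive larger weights."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


# Hypothetical label list for a three-class dataset.
labels = ["cat"] * 6 + ["dog"] * 3 + ["bird"] * 1
counts = class_counts(labels)
ratio = imbalance_ratio(counts)
weights = class_weights(counts)
```

A high ratio signals that more samples of the rare classes should be collected or generated by augmentation; alternatively, the weights can be passed to a weighted loss function during training.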

Published

2024-12-26

How to Cite

ISAIEV, T., & KYSIL, T. (2024). METHOD OF CREATING CUSTOM DATASET TO TRAIN CONVOLUTIONAL NEURAL NETWORK. Computer Systems and Information Technologies, (4), 37–44. https://doi.org/10.31891/csit-2024-4-5