CONVOLUTIONAL NEURAL NETWORK-BASED SOUND SOURCE SEPARATION IN THE TIME-FREQUENCY DOMAIN

Authors

O. TOMASHEVSKYY, O. TKACHUK

DOI:

https://doi.org/10.31891/csit-2026-1-15

Keywords:

computer science, artificial intelligence (AI), convolutional neural networks (CNN), audio data analysis, audio signal processing, sound source separation

Abstract

This paper addresses the problem of sound source separation in mixed audio signals in the time-frequency domain. The study considers the application of convolutional neural networks for isolating individual acoustic components from complex audio mixtures where multiple sources overlap in both time and frequency. Such overlap significantly complicates the separation process and places greater demands on the stability and structural consistency of the models applied. The proposed approach is based on transforming audio signals using the Short-Time Fourier Transform and representing audio mixtures as spectrograms that preserve both the temporal and the spectral characteristics of sound components. A binary masking strategy is applied to the resulting representations to structurally simplify the separation task. A convolutional neural network is employed to predict masks corresponding to individual sound sources such as vocals, bass, drums, and other components. This masking formulation enables selective extraction of the spectral regions associated with specific sources and supports a hybrid processing scheme that combines elements of classification and regression within a unified neural architecture.

The research methodology includes the design of the network architecture, preparation of spectrogram-based input data, model training on multi-source audio mixtures, and validation of separation quality using reconstruction consistency criteria. Particular attention is paid to ensuring stable convergence of the model and preserving meaningful acoustic patterns within the predicted masks. The findings demonstrate stable isolation of sound components and consistent performance across the training and validation datasets. Quantitative evaluation shows separation accuracy of 0.772 for vocals, 0.766 for drums, 0.944 for bass, and 0.764 for other sources, with corresponding mean squared error values ranging from 0.044 to 0.203 across the evaluated categories. The highest performance was achieved for bass isolation due to the distinct low-frequency spectral structure of this source. Signal-level evaluation using the SI-SDR, SDR, and SNR metrics produced values ranging from -1.24 to 4.10 dB (SI-SDR), -0.26 to 4.59 dB (SDR), and 1.16 to 5.09 dB (SNR), with the highest values observed for the bass and vocal sources, consistent with the accuracy-based results.

The results confirm the effectiveness of integrating binary masking with convolutional processing of spectrograms for computationally efficient sound source separation. The proposed approach, implemented using a compact neural architecture with 323,233 trainable parameters, can be applied in music production systems, speech enhancement solutions, intelligent audio analysis platforms, and other audio processing environments that require reliable and lightweight separation mechanisms.
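To make the masking formulation concrete, the sketch below (illustrative only, not the authors' code) applies an ideal binary mask to an STFT spectrogram, assuming the clean stems are available; in the paper, the CNN predicts such masks from the mixture alone. The sample rate and window length are assumptions for this sketch.

import numpy as np
from scipy.signal import stft, istft

FS = 44100     # sample rate (an assumption for this sketch)
N_FFT = 1024   # STFT window length (an assumption for this sketch)

def separate_with_ideal_binary_mask(mixture, target, interference):
    # Time-frequency representations of the mixture and the known stems
    _, _, S_mix = stft(mixture, fs=FS, nperseg=N_FFT)
    _, _, S_tgt = stft(target, fs=FS, nperseg=N_FFT)
    _, _, S_int = stft(interference, fs=FS, nperseg=N_FFT)

    # Binary mask: 1 in every time-frequency bin where the target dominates
    mask = (np.abs(S_tgt) >= np.abs(S_int)).astype(np.float32)

    # Select the masked spectral regions and invert back to the time domain
    _, estimate = istft(S_mix * mask, fs=FS, nperseg=N_FFT)
    return estimate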
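The abstract reports a compact CNN with 323,233 trainable parameters but does not give its layer configuration, so the module below is a hypothetical stand-in rather than the published architecture. It maps a single-channel magnitude spectrogram to four sigmoid masks (vocals, drums, bass, other); the continuous outputs can be thresholded to binary masks, which mirrors the hybrid classification-regression framing.

import torch
import torch.nn as nn

class MaskCNN(nn.Module):
    """Hypothetical mask predictor: spectrogram in, one mask per source out."""
    def __init__(self, n_sources: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, n_sources, kernel_size=3, padding=1),
            nn.Sigmoid(),  # masks in [0, 1]; threshold at 0.5 for binary masks
        )

    def forward(self, mag_spec: torch.Tensor) -> torch.Tensor:
        # mag_spec: (batch, 1, freq_bins, time_frames)
        return self.net(mag_spec)

masks = MaskCNN()(torch.randn(1, 1, 513, 128))  # -> (1, 4, 513, 128)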
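The signal-level metrics quoted above can be computed as in the generic sketch below; SI-SDR follows the standard scale-invariant definition, and SNR is the plain power ratio. The paper's SDR figures presumably come from a BSS-Eval-style toolkit, which is not reproduced here.

import numpy as np

def snr(ref: np.ndarray, est: np.ndarray) -> float:
    # Plain signal-to-noise ratio in dB between reference and estimate
    noise = ref - est
    return 10 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))

def si_sdr(ref: np.ndarray, est: np.ndarray) -> float:
    # Scale-invariant SDR: project the estimate onto the reference first,
    # so the metric ignores any overall gain mismatch
    alpha = np.dot(est, ref) / np.dot(ref, ref)
    target = alpha * ref
    noise = est - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))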

Published

2026-03-26

How to Cite

TOMASHEVSKYY, O., & TKACHUK, O. (2026). CONVOLUTIONAL NEURAL NETWORK-BASED SOUND SOURCE SEPARATION IN THE TIME-FREQUENCY DOMAIN. Computer Systems and Information Technologies, (1), 156–171. https://doi.org/10.31891/csit-2026-1-15