INFORMATION TECHNOLOGY FOR STATISTICAL CLUSTER ANALYSIS OF INFORMATION IN COMPLEX NETWORKS

Authors

DOI:

https://doi.org/10.31891/csit-2022-4-7

Keywords:

optimal number of clusters, cluster centers, k-core decomposition algorithm, eigenvalues, stochastic matrix, clustering process, statistical characteristics, Markov process

Abstract

An information technology has been developed for collecting, processing, and storing large volumes of data from the web space. It is used to study the statistical characteristics of various segments of the web space and their cluster structure. Two methods are used to find the optimal number of clusters and the cluster centers: the well-known k-core decomposition algorithm and a new method developed by the author. The new algorithm is based on the distribution of the eigenvalues of the stochastic matrix that describes the Markov transition process in the system. Clustering is carried out using the power iteration clustering algorithm.
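
The abstract does not state the exact eigenvalue criterion, so the following Python sketch is only an illustration under stated assumptions. It builds the row-stochastic matrix P = D^-1 A of a small undirected toy graph, takes the largest gap in the sorted eigenvalues as the cluster count (a common "eigengap" heuristic, not necessarily the author's rule), and then runs a simplified power iteration clustering. All function names, the toy graph, the fixed iteration count, and the gap-based split (used here in place of k-means) are assumptions made for the example.

```python
import math

def two_cliques(size=4):
    """Hypothetical toy graph: two cliques joined by a single edge."""
    n = 2 * size
    adj = [[0] * n for _ in range(n)]
    for block in (range(size), range(size, n)):
        for i in block:
            for j in block:
                if i != j:
                    adj[i][j] = 1
    adj[size - 1][size] = adj[size][size - 1] = 1
    return adj

def symmetric_eigenvalues(adj, sweeps=50):
    """Eigenvalues of P = D^-1 A, obtained from the similar symmetric
    matrix D^-1/2 A D^-1/2 with cyclic Jacobi rotations."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    m = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    for _ in range(sweeps):
        off = sum(m[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-20:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(m[p][q]) < 1e-15:
                    continue
                diff = m[q][q] - m[p][p]
                theta = (math.pi / 4 if diff == 0
                         else 0.5 * math.atan(2 * m[p][q] / diff))
                cs, sn = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    kp, kq = m[k][p], m[k][q]
                    m[k][p], m[k][q] = cs * kp - sn * kq, sn * kp + cs * kq
                for k in range(n):  # rotate rows p and q
                    pk, qk = m[p][k], m[q][k]
                    m[p][k], m[q][k] = cs * pk - sn * qk, sn * pk + cs * qk
    return sorted((m[i][i] for i in range(n)), reverse=True)

def estimate_cluster_count(eigs):
    """Eigengap heuristic: position of the largest drop in the spectrum."""
    gaps = [eigs[i] - eigs[i + 1] for i in range(len(eigs) - 1)]
    return gaps.index(max(gaps)) + 1

def power_iteration_embedding(adj, steps=10):
    """A few normalised multiplications by P, stopped well before convergence."""
    n = len(adj)
    p = [[a / sum(row) for a in row] for row in adj]
    v = [(i + 1) / (n * (n + 1) / 2) for i in range(n)]  # asymmetric start
    for _ in range(steps):
        v = [sum(p[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]
    return v

def split_by_gaps(values, k):
    """Label 1-D values by cutting the sorted order at the k-1 widest gaps."""
    order = sorted(range(len(values)), key=values.__getitem__)
    cuts = set(sorted(range(1, len(order)),
                      key=lambda i: values[order[i]] - values[order[i - 1]],
                      reverse=True)[:k - 1])
    labels, current = [0] * len(values), 0
    for rank, node in enumerate(order):
        if rank in cuts:
            current += 1
        labels[node] = current
    return labels

adj = two_cliques(4)
k = estimate_cluster_count(symmetric_eigenvalues(adj))
labels = split_by_gaps(power_iteration_embedding(adj), k)
print(k, labels)
```

On this toy graph the heuristic yields two clusters, and the embedding separates the two cliques. A real web graph is directed and far larger, so a sparse representation and a proper k-means step would be needed in practice.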

Purpose-built software (a crawler) collects information on a given segment of the web space. For the studied segment, the following statistical characteristics are computed: node degree, clustering coefficient, and the probability distributions of nodes by incoming and outgoing links. Directed and undirected graphs of the web pages in the studied zones are constructed. Combining the dependencies calculated for the input and output subnetworks yields the statistical characteristics of the undirected graphs of web pages in the zones under investigation.
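
As an illustration of the characteristics listed above, the following Python sketch computes in-degrees, out-degrees, their empirical distributions, and the local clustering coefficient from a list of directed links. The edge list, page labels, and function names are hypothetical. The clustering coefficient is computed on the undirected version of the graph, which is one common convention and an assumption here.

```python
from collections import Counter, defaultdict

def link_statistics(edges):
    """Return (in_degree, out_degree, clustering) for directed (src, dst) links."""
    nodes = sorted({u for edge in edges for u in edge})
    in_deg = {u: 0 for u in nodes}
    out_deg = {u: 0 for u in nodes}
    neighbours = defaultdict(set)  # undirected neighbourhood of each page
    for src, dst in set(edges):    # ignore duplicate links
        if src == dst:             # ignore self-links
            continue
        out_deg[src] += 1
        in_deg[dst] += 1
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    clustering = {}
    for u in nodes:
        nb = neighbours[u]
        k = len(nb)
        if k < 2:
            clustering[u] = 0.0
            continue
        # Count links among the neighbours of u (each pair once).
        links = sum(1 for a in nb for b in nb if a < b and b in neighbours[a])
        clustering[u] = 2 * links / (k * (k - 1))
    return in_deg, out_deg, clustering

def degree_distribution(degrees):
    """Empirical probability P(k) that a node has degree k."""
    counts = Counter(degrees.values())
    return {k: c / len(degrees) for k, c in sorted(counts.items())}

# Hypothetical four-page segment: a -> b -> c -> a, plus c -> d.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
in_deg, out_deg, clustering = link_statistics(edges)
print(degree_distribution(in_deg))
print(degree_distribution(out_deg))
print(clustering)
```

For crawled data, the edge list would come from the crawler's stored links, and the distributions are typically plotted on log-log axes to inspect their tails.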

For cluster analysis, the optimal number of clusters and the cluster centers can be found in two ways: with the well-known k-core decomposition algorithm or with a new method developed by the author. The new algorithm is based on the distribution of the eigenvalues of the stochastic matrix that describes the Markov transition process in the system. The cluster structure of various segments of the web space is then studied using the power iteration clustering algorithm.
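
The k-core decomposition mentioned here can be sketched with the standard peeling procedure: repeatedly remove a node of minimum remaining degree and record the largest degree seen so far as its core number. The Python code below is a minimal, unoptimised sketch. The function name and the toy graph are assumptions, and the abstract does not specify how core numbers are mapped to the number of clusters or to cluster centers, so that step is deliberately left out.

```python
def core_numbers(adjacency):
    """Core number of every node of an undirected graph.

    `adjacency` maps each node to the set of its neighbours.
    Quadratic-time peeling, adequate only for small examples.
    """
    degree = {u: len(nb) for u, nb in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Peel a node with the smallest remaining degree.
        u = min(remaining, key=degree.__getitem__)
        k = max(k, degree[u])
        core[u] = k
        remaining.remove(u)
        for w in adjacency[u]:
            if w in remaining:
                degree[w] -= 1
    return core

# Hypothetical graph: a 4-clique {0, 1, 2, 3} with a tail 3 - 4 - 5.
graph = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3, 5},
    5: {4},
}
print(core_numbers(graph))
```

In this example the clique nodes receive core number 3 and the tail nodes receive core number 1, so the innermost core singles out the densely linked group.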

The advantage of the developed information technology is that it can work with large data sets collected on the Internet, study their structure and statistical characteristics, and cluster them. A new algorithm is proposed for carrying out the clustering and finding the optimal number of clusters and centroids. The results indicate that the algorithm determines the optimal number of clusters with high accuracy.

Published

2022-12-29

How to Cite

KYRYCHENKO, O. (2022). INFORMATION TECHNOLOGY FOR STATISTICAL CLUSTER ANALYSIS OF INFORMATION IN COMPLEX NETWORKS. Computer Systems and Information Technologies, (4), 47–51. https://doi.org/10.31891/csit-2022-4-7