Data clustering in distributed monitoring systems
Abstract
Introduction: Traditional methods of analyzing distributed data sources typically rely on centralized data warehouses, which have
several disadvantages: confidentiality risks, the high cost of centralized data storage, limited bandwidth and a high load
on telecommunications networks. Existing techniques for decentralized analysis do not take into account the type of data distribution or
the features of the selected algorithm. This reduces performance and accuracy, and can make the analysis impracticable under certain conditions.
Purpose: To study and analyze the features of distributed monitoring systems and data mining algorithms. Results: For clustering
based on distributed data sources, the following requirements were set for an algorithm in distributed monitoring systems: single-pass operation, support
for different types of input data, online operation with adaptation to the data as the environment changes, scalability to large data volumes, analysis
without assumptions about the input data distribution, and analysis of data at the information sources without involving a third party. Two
main ways of distributing data across sources in heterogeneous systems are defined: vertical and horizontal. Clustering methods are classified
according to their basic principle of cluster delimitation. The classification includes the main clustering algorithms, their operation
principles, advantages and disadvantages. A review of existing clustering methods has shown that Kohonen’s neural networks are the most efficient
in distributed monitoring systems. The Kohonen self-organizing map algorithm was decomposed into two data processing
blocks: the calculation of the winner neuron and the adjustment of neuron weights. Two strategies have been
proposed for clustering distributed data. Practical relevance: The proposed strategies make it possible to perform clustering in systems with
distributed sources, taking into account the characteristics of the environment, without transferring all the data.
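
As an illustration of the two processing blocks named above (winner neuron calculation and neuron weight adjustment), a minimal Python sketch of one training step of a Kohonen self-organizing map is given below. It is not the implementation from the paper: the one-dimensional map layout, the Euclidean distance, the Gaussian neighborhood, and the names find_winner, adjust_weights, train_step, learning_rate and radius are all assumptions chosen for the example.

```python
import math


def find_winner(weights, x):
    """Block 1: return the index of the neuron whose weights are closest to x.

    Assumption: closeness is measured by squared Euclidean distance.
    """
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))

    return min(range(len(weights)), key=lambda i: sq_dist(weights[i]))


def adjust_weights(weights, x, winner, learning_rate=0.5, radius=1.0):
    """Block 2: move every neuron toward x, in place.

    Assumption: neurons lie on a 1-D map, and the update is scaled by a
    Gaussian neighborhood centered on the winner.
    """
    for i, w in enumerate(weights):
        influence = math.exp(-((i - winner) ** 2) / (2.0 * radius ** 2))
        for d in range(len(w)):
            w[d] += learning_rate * influence * (x[d] - w[d])


def train_step(weights, x, learning_rate=0.5, radius=1.0):
    """One single-pass, online update: find the winner, then adjust weights."""
    winner = find_winner(weights, x)
    adjust_weights(weights, x, winner, learning_rate, radius)
    return winner


# Hypothetical data: three neurons on a 1-D map, one two-feature sample.
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
sample = [0.9, 0.8]
winner = train_step(weights, sample)
```

Keeping the two blocks as separate functions mirrors the decomposition described in the abstract: each block can in principle run at a different place (for example, at a data source or at a coordinating node), which is what a strategy for clustering without transferring all the data would need to decide.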