Clustering method: description, basic concepts, application features

The clustering method is the task of grouping a set of objects in such a way that objects in the same group (a cluster) are more similar to each other than to objects in other groups. It is a primary task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, data compression, and computer graphics.

Optimization problem

The clustering method itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their understanding of what constitutes a cluster and how to find it efficiently. Popular notions of clusters include groups with small distances between members, dense regions of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem.

The appropriate clustering algorithm and parameter settings (including the distance function to use, the density threshold, or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

In addition to the term "clustering", there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology, and typological analysis. The subtle differences often lie in the use of the results: while in data mining the resulting groups themselves are of interest, in automatic classification it is the resulting discriminative power that matters.

Cluster analysis originated in anthropology with the work of Driver and Kroeber in 1932. It was introduced into psychology by Joseph Zubin in 1938 and by Robert Tryon in 1939, and it was famously used by Cattell beginning in 1943 for trait classification in personality psychology.

Definition

The concept of a "cluster" cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms. There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and each of these models calls for a different algorithm. The notion of a cluster, as found by different algorithms, varies significantly in its properties.

Understanding these cluster models is the key to understanding the differences between the various algorithms. Typical cluster models include:

  • Centroid models. For example, k-means clustering represents each cluster by a single mean vector (see the sketch after this list).
  • Connectivity models. For example, hierarchical clustering builds models based on distance connectivity.
  • Distribution models. Here clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm.
  • Density models. For example, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure) define clusters as connected dense regions in the data space.
  • Subspace models. In biclustering (also known as co-clustering or two-mode clustering), clusters are modeled with both cluster members and the relevant attributes.
  • Group models. Some algorithms do not provide a refined model for their results and simply provide the grouping information.
  • Graph-based models. A clique, that is, a subset of nodes such that every two nodes are connected by an edge, can be considered a prototypical form of cluster. Relaxations of this complete connectivity requirement are known as quasi-cliques, as used in the HCS clustering algorithm.
  • Neural models. The best-known unsupervised neural network is the self-organizing map, and these models can usually be characterized as similar to one or more of the above models; they include subspace models when neural networks implement a form of principal or independent component analysis.

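As an illustration of the centroid model mentioned above, here is a minimal sketch that fits k-means with scikit-learn. The library, the synthetic blob data, and the choice of three clusters are assumptions made only for demonstration.

```python
# Minimal k-means sketch: each cluster is represented by one mean vector (centroid).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated blobs (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.cluster_centers_)  # one mean vector per cluster
print(kmeans.labels_[:10])      # hard assignment of each object to a cluster
```
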
A "clustering" is essentially the set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished as follows:

  • Hard clustering. Each object either belongs to a cluster or does not.
  • Soft (or fuzzy) clustering. Each object belongs to each cluster to a certain degree, as in fuzzy c-means clustering (a minimal sketch follows this list).

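scikit-learn does not ship fuzzy c-means, so the following is a minimal NumPy sketch of the membership and centroid updates, assuming a fuzzifier m > 1. It illustrates the soft-assignment idea rather than a production implementation.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: each point gets a degree of membership in every cluster."""
    rng = np.random.default_rng(seed)
    # Random initial membership matrix; each row sums to 1.
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        um = u ** m
        # Centroids: membership-weighted means of the data.
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        # Distances from every point to every centroid.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Standard update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
        u = 1.0 / (d ** (2 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)
    return centers, u
```
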
Finer distinctions are also possible, for example:

  • Strict partitioning clustering. Each object belongs to exactly one cluster.
  • Strict partitioning clustering with outliers. Objects may also belong to no cluster and are then considered outliers.
  • Overlapping clustering (also alternative clustering or multi-view clustering). Objects may belong to more than one cluster, usually involving hard clusters.
  • Hierarchical clustering. Objects that belong to a child cluster also belong to the parent cluster.
  • Subspace clustering. While similar to overlapping clustering, within a uniquely defined subspace the clusters are not expected to overlap.

Algorithms

As stated above, clustering algorithms can be categorized based on their cluster model. The following overview lists only the most prominent examples. Since there may be over 100 published clustering algorithms, and not all of them provide models for their clusters, not all can be easily categorized.

There is no objectively correct clustering algorithm; as noted above, clustering is in the eye of the beholder. The most suitable algorithm for a particular problem often has to be chosen experimentally, unless there is a mathematical reason for preferring one cluster model over another. An algorithm designed for one kind of model will generally fail on a data set that contains a radically different kind of structure. For example, k-means cannot find non-convex clusters, as the sketch below illustrates.

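For instance, the following sketch (using scikit-learn's two-moons generator, an illustrative choice, with parameter values picked only for demonstration) shows how k-means splits a non-convex shape incorrectly, while a density-based method recovers it.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: two clusters, but neither is convex.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# k-means cuts each moon roughly in half; DBSCAN follows the dense, curved shapes.
print("k-means clusters:", np.unique(km_labels))
print("DBSCAN clusters: ", np.unique(db_labels))
```
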
Connectivity-based clustering

Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea that objects are more related to nearby objects than to objects farther away. These algorithms connect objects to form clusters based on their distance. A cluster can be described largely by the maximum distance needed to connect its parts. At different distances, different clusters will form, which can be represented with a dendrogram; this is where the common name "hierarchical clustering" comes from. These algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which clusters merge, while the objects are placed along the x-axis so that the clusters do not mix.

Connectivity-based clustering is a whole family of methods that differ in the way distances are computed. Apart from the usual choice of distance functions, the user also needs to decide on the linkage criterion: since a cluster consists of multiple objects, there are multiple candidates for computing the distance between clusters. Popular choices are known as single-linkage clustering (the minimum of object distances), complete-linkage clustering (the maximum of object distances), and UPGMA or WPGMA ("Unweighted or Weighted Pair Group Method with Arithmetic Mean", also known as average linkage clustering). Furthermore, hierarchical clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete data set and dividing it into partitions).

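A minimal sketch of agglomerative clustering and a dendrogram with SciPy (SciPy and Matplotlib are assumed dependencies; the data, the cut distance, and the choice of average linkage are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# Agglomerative clustering: "average" is UPGMA; "single" and "complete"
# are the single-linkage and complete-linkage criteria mentioned above.
Z = linkage(X, method="average")

# Cutting the dendrogram at a chosen distance yields a flat partition.
labels = fcluster(Z, t=1.5, criterion="distance")
dendrogram(Z)  # y-axis: distance at which clusters merge
```
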
Distribution-based clustering

These models are most closely related to statistics, being based on distribution models. Clusters can then easily be defined as objects that most likely belong to the same distribution. A convenient property of this approach is that it closely resembles the way artificial data sets are generated: by sampling random objects from a distribution.

While the theoretical foundation of these methods is excellent, they suffer from one key problem known as overfitting, unless constraints are put on the model complexity. A more complex model will usually explain the data better, which makes choosing the appropriate model complexity inherently difficult.

Gaussian mixture model

This method uses the expectation-maximization algorithm. Here, the data set is usually modeled with a fixed (to avoid overfitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to better fit the data set. The procedure converges to a local optimum, so multiple runs may produce different results. To obtain a hard clustering, objects are often assigned to the Gaussian distribution they most likely belong to; for soft clusterings, this is not necessary.

Distribution-based clustering produces complex models that can capture correlation and dependence between attributes. However, these algorithms put an extra burden on the user: for many real-world data sets, there may be no concisely defined mathematical model (for example, assuming Gaussian distributions is a rather strong assumption about the data).

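A minimal sketch with scikit-learn's GaussianMixture, which runs expectation-maximization under the hood; the fixed component count, the number of restarts, and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)

# A fixed number of Gaussians fitted iteratively with EM; n_init keeps the best
# of several random initializations, since each run may reach a different local optimum.
gmm = GaussianMixture(n_components=3, n_init=5, random_state=1).fit(X)

hard_labels = gmm.predict(X)        # hard clustering: most likely component
soft_labels = gmm.predict_proba(X)  # soft clustering: per-component probabilities
```
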
Density-based clustering

In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that are required to separate clusters are usually considered to be noise and border points.

The most popular density-based clustering method is DBSCAN (Density-Based Spatial Clustering of Applications with Noise). In contrast to many newer methods, it features a well-defined cluster model called "density-reachability". Similar to linkage-based clustering, it is based on connecting points within certain distance thresholds; however, it only connects points that satisfy a density criterion, defined in the original variant as a minimum number of other objects within a given radius. A cluster consists of all density-connected objects (which can form a cluster of arbitrary shape, in contrast to many other methods) plus all objects that are within these objects' range.

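A minimal DBSCAN sketch with scikit-learn; the radius (eps) and the minimum neighbor count correspond to the radius and minimum-object criterion described above, and the specific values and data are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=2)

# eps: the radius; min_samples: the minimum number of objects within that radius.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_                     # -1 marks noise points outside any dense region
core_points = db.core_sample_indices_   # indices of the density-satisfying core objects
```
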
Another interesting property of DBSCAN is that its complexity is fairly low (it requires a linear number of range queries on the database) and that it discovers essentially the same results in each run (it is deterministic for core and noise points, but not for border points); therefore, there is no need to run it multiple times.

The main disadvantage of DBSCAN and OPTICS is that they expect some kind of density drop to detect cluster borders. On data sets with overlapping Gaussian distributions, for example, a common use case for artificial data, the cluster borders produced by these algorithms often look arbitrary, because the cluster density decreases continuously. On a data set consisting of mixtures of Gaussians, these algorithms are nearly always outperformed by methods such as EM clustering that are able to precisely model this kind of data.

Mean shift is a clustering approach in which each object is moved to the densest area in its vicinity, based on kernel density estimation. Eventually, objects converge to local maxima of density. Similar to k-means clustering, these "density attractors" can serve as representatives for the data set, but mean shift can detect arbitrarily shaped clusters similar to DBSCAN. Due to the expensive iterative procedure and density estimation, mean shift is usually slower than DBSCAN or k-means. Besides that, the applicability of the mean shift algorithm to high-dimensional data is hindered by the unsmooth behaviour of the kernel density estimate, which results in over-fragmentation of cluster tails.

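A minimal mean shift sketch with scikit-learn; the bandwidth controls the kernel density estimate and is estimated here with estimate_bandwidth, while the data and quantile value are illustrative assumptions.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=3)

# Each point is shifted towards the nearest maximum of a kernel density estimate.
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(X)

print(ms.cluster_centers_)  # the "density attractors" the points converged to
print(ms.labels_[:10])
```
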
Evaluation

Evaluating clustering results is as difficult as the clustering itself. Popular approaches include "internal" evaluation, where the clustering is summarized to a single quality score; "external" evaluation, where the clustering is compared to an existing "ground truth" classification; "manual" evaluation by a human expert; and "indirect" evaluation by judging the usefulness of the clustering in its intended application.

Internal evaluation measures suffer from the problem that they represent functions that can themselves be viewed as clustering objectives. For example, one could cluster the data by optimizing the Silhouette coefficient, except that there is no known efficient algorithm for this. By using such an internal measure for evaluation, one rather compares the similarity of the optimization problems.

External evaluation has similar problems: if we have such "ground truth" labels, then we would not need to cluster, and in practical applications we usually do not have such labels. On the other hand, the labels reflect only one possible partitioning of the data set, which does not imply that no different (perhaps even better) clustering exists.

Neither of these approaches can therefore ultimately judge the actual quality of a clustering; that requires human evaluation, which is highly subjective. Nevertheless, such statistics can be quite informative in identifying bad clusterings, and subjective human assessment should not be discounted either.

Internal evaluation

When a clustering result is evaluated based on the data that was clustered itself, this is called internal evaluation. These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters. One drawback of using internal criteria in cluster evaluation is that high scores do not necessarily translate into effective information retrieval applications. Additionally, this evaluation is biased towards algorithms that use the same cluster model: for example, k-means clustering naturally optimizes object distances, and a distance-based internal criterion will likely overrate the resulting clustering.

Therefore, internal evaluation measures are best suited to get some insight into situations where one algorithm performs better than another, but this does not imply that one algorithm produces more valid results than another. Validity as measured by such an index depends on the claim that this kind of structure exists in the data set. An algorithm designed for some kind of model has no chance if the data set contains a radically different kind of structure, or if the evaluation measures a radically different criterion. For example, k-means clustering can only find convex clusters, and many evaluation indices assume convex clusters as well; on a data set with non-convex clusters, neither k-means nor an evaluation criterion that assumes convexity is appropriate.

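For instance, the Silhouette coefficient mentioned above can be computed with scikit-learn as an internal measure; the synthetic data and the comparison of a few cluster counts are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=3, random_state=4)

# Internal evaluation: higher silhouette means tighter, better-separated clusters,
# but only with respect to the assumptions built into the measure itself.
for k in (2, 3, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=4).fit_predict(X)
    print(k, silhouette_score(X, labels))
```
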
External evaluation

In external evaluation, clustering results are evaluated based on data that was not used for clustering, such as known class labels and external benchmarks. Such benchmarks consist of a set of pre-classified items, often created by (expert) humans; the benchmark sets can thus be thought of as a gold standard for evaluation. These evaluation methods measure how close the clustering is to the predetermined benchmark classes. However, it has recently been discussed whether this is adequate for real data, or only for synthetic data sets with a factual ground truth, since classes can contain internal structure and the available attributes may not allow separation of clusters. Additionally, from a knowledge discovery point of view, the reproduction of known knowledge may not necessarily be the intended result. In the special scenario of constrained clustering, where meta-information (such as class labels) is already used in the clustering process, holding out information for evaluation purposes is non-trivial.

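When ground-truth labels are available (for example, in a benchmark set), external measures such as the adjusted Rand index or normalized mutual information can be computed. A minimal sketch with scikit-learn follows, using a synthetic generator's labels as the assumed "ground truth".

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# The generator's labels stand in for a ground-truth classification.
X, y_true = make_blobs(n_samples=400, centers=3, random_state=5)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=5).fit_predict(X)

print("ARI:", adjusted_rand_score(y_true, y_pred))    # 1.0 means perfect agreement
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
```
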
This should make clear what the clustering method involves, and which cluster models and algorithms are used for these purposes.
