Multidimensional scaling: definition, goals, objectives and example

Anonymous

Multidimensional scaling (MDS) is a tool for visualizing the level of similarity between individual cases in a data set. It refers to a set of related ordination methods used in information visualization, in particular to display the information contained in a distance matrix. It is a form of non-linear dimensionality reduction. An MDS algorithm aims to place each object in an N-dimensional space so that the distances between objects are preserved as well as possible. Each object is then assigned coordinates in each of the N dimensions.

The number of dimensions of an MDS plot can exceed 2 and is specified a priori. Choosing N = 2 optimizes the object placement for a 2D scatterplot. Examples of multidimensional scaling are shown in the pictures in this article; the examples labeled in Russian are especially illustrative.

Multidimensional scaling

Essence

Metric multidimensional scaling (metric MDS) is a superset of classical MDS that generalizes the optimization procedure to a variety of loss functions and to input matrices of known distances with weights. In this context, a useful loss function is called stress, which is often minimized by a procedure called stress majorization.
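
The idea of stress and its minimization by majorization can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular package: it assumes unit weights, Euclidean distances in the target space, and raw (unnormalized) stress, and the function names `raw_stress`, `smacof_step` and `mds` are chosen here for illustration only.

```python
import math
import random

def pairwise_distances(points):
    """Euclidean distance matrix for a list of coordinate tuples."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def raw_stress(target, points):
    """Sum of squared differences between target and configuration distances."""
    d = pairwise_distances(points)
    n = len(points)
    return sum((target[i][j] - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def smacof_step(target, points):
    """One majorization update with unit weights (the Guttman transform)."""
    n, dim = len(points), len(points[0])
    d = pairwise_distances(points)
    new_points = []
    for i in range(n):
        row = [0.0] * dim
        for j in range(n):
            if i == j or d[i][j] == 0.0:
                continue  # coincident points contribute nothing to the update
            ratio = target[i][j] / d[i][j]
            for k in range(dim):
                row[k] += ratio * (points[i][k] - points[j][k])
        new_points.append(tuple(v / n for v in row))
    return new_points

def mds(target, dim=2, iterations=200, seed=0):
    """Start from a random configuration and apply the update repeatedly."""
    rng = random.Random(seed)
    points = [tuple(rng.uniform(-1, 1) for _ in range(dim))
              for _ in range(len(target))]
    for _ in range(iterations):
        points = smacof_step(target, points)
    return points

# Illustrative input: distances between the corners of a unit square.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
target = pairwise_distances(square)
layout = mds(target)
print(raw_stress(target, layout))
```

Each update cannot increase the stress, which is why the procedure is usually run until the stress stops changing. The result is determined only up to rotation, reflection and translation, and it may be a local rather than a global minimum.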

Manual

There are several variants of multidimensional scaling. MDS programs automatically minimize stress to obtain a solution. The core of the non-metric MDS algorithm is a twofold optimization process. First, the optimal monotonic transformation of the proximities must be found. Second, the points of the configuration must be positioned so that their distances match the scaled proximity values as closely as possible.
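
The first of these two steps, finding a monotonic transformation, is commonly carried out with isotonic (monotone) regression. The sketch below uses the pool-adjacent-violators algorithm as one plausible way to do this; the function names are illustrative, and real programs differ in details such as how tied proximities are handled.

```python
def monotone_fit(values):
    """Least-squares non-decreasing fit to `values` (pool adjacent violators)."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def disparities(dissimilarities, distances):
    """Transformed proximities: monotone in the dissimilarities and as close
    as possible to the current configuration distances."""
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    fitted = monotone_fit([distances[i] for i in order])
    result = [0.0] * len(distances)
    for rank, i in enumerate(order):
        result[i] = fitted[rank]
    return result

# Illustrative values: four object pairs, with the second and third pairs
# placed in the "wrong" order by the current configuration.
print(disparities([10, 30, 20, 40], [1.0, 2.0, 3.0, 4.0]))
```

In the second step the configuration is then moved so that its distances approach these transformed values, and the two steps alternate until the stress stops improving.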

Multidimensional scaling example

Expansion

Generalized multidimensional scaling is an extension of metric multidimensional scaling in which the target space is an arbitrary smooth non-Euclidean space. When the dissimilarities are distances on a surface and the target space is another surface, specialized programs can find an embedding of one surface into the other with minimal distortion.

Steps

There are several steps in conducting a study using multidimensional scaling:

  1. Formulating the problem. Which variables do you want to compare? How many variables do you want to compare? For what purpose will the study be used?
  2. Obtaining the input data. Respondents are asked a series of questions. For each pair of products, they are asked to rate the similarity (usually on a 7-point Likert scale from very similar to very dissimilar). The first question might concern Coca-Cola and Pepsi, the next a pair involving beer, the next a pair involving Dr. Pepper, and so on. The number of questions depends on the number of brands.
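
How the number of questions grows with the number of brands, and how the answers become an input matrix, can be sketched as follows. The sketch assumes that every pair of brands is rated and that the ratings of several respondents are averaged; the scores shown are invented purely for illustration.

```python
from itertools import combinations

def pair_count(n):
    """Number of distinct pairs among n brands: n(n-1)/2."""
    return n * (n - 1) // 2

def dissimilarity_matrix(brands, ratings):
    """Average the 1..7 scores (1 = very similar, 7 = very dissimilar)
    for each pair into a symmetric matrix with a zero diagonal."""
    index = {brand: i for i, brand in enumerate(brands)}
    n = len(brands)
    matrix = [[0.0] * n for _ in range(n)]
    for (a, b), scores in ratings.items():
        mean = sum(scores) / len(scores)
        i, j = index[a], index[b]
        matrix[i][j] = matrix[j][i] = mean
    return matrix

brands = ["Coca-Cola", "Pepsi", "Dr. Pepper"]
pairs = list(combinations(brands, 2))  # one question per pair

# Invented scores from three hypothetical respondents.
ratings = {
    ("Coca-Cola", "Pepsi"): [1, 2, 2],
    ("Coca-Cola", "Dr. Pepper"): [4, 5, 3],
    ("Pepsi", "Dr. Pepper"): [4, 4, 5],
}
matrix = dissimilarity_matrix(brands, ratings)
print(len(pairs), pair_count(10))
```

Under these assumptions the questionnaire grows quadratically: three brands need three questions, while ten brands already need 45.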

Distance scaling

Alternative approaches

There are two other approaches. One is the "perceptual data: derived approach", in which products are decomposed into attributes and rated on a semantic differential scale. The other is the "preference data approach", in which respondents are asked about preferences rather than similarities.

Once the input data have been collected, the study continues with the following steps:

  1. Running the MDS statistical program. MDS procedures are available in many statistical software packages. There is often a choice between metric MDS (which deals with interval or ratio level data) and non-metric MDS (which deals with ordinal data).
  2. Determining the number of dimensions. The researcher must decide how many dimensions the computer should create. The more dimensions, the better the statistical fit, but the harder the results are to interpret.
  3. Displaying the results and defining the dimensions. The statistical program (or a related module) maps the results. The map plots each product (usually in two-dimensional space). The proximity of products to each other indicates either how similar they are or how preferred they are, depending on which approach was used. How the dimensions of the map correspond to dimensions of system behavior, however, is not always clear. A subjective judgment about the correspondence can be made here.
  4. Checking the results for reliability and validity. Compute R-squared to determine what proportion of the variance of the scaled data can be accounted for by the MDS procedure. An R-squared of 0.6 is considered the minimum acceptable level. An R-squared of 0.8 is considered good for metric scaling, and 0.9 is considered good for non-metric scaling.
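
The R-squared check in the last step can be sketched as the squared correlation between the input dissimilarities and the distances on the resulting map. This is one common reading of the measure, assumed here for illustration; individual packages may compute it on transformed values instead, so their documentation should be consulted. The numbers are invented.

```python
def r_squared(dissimilarities, distances):
    """Squared Pearson correlation between the input dissimilarities and
    the corresponding distances in the MDS configuration."""
    n = len(dissimilarities)
    mean_x = sum(dissimilarities) / n
    mean_y = sum(distances) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(dissimilarities, distances))
    sxx = sum((x - mean_x) ** 2 for x in dissimilarities)
    syy = sum((y - mean_y) ** 2 for y in distances)
    return sxy * sxy / (sxx * syy)

# Invented example: one value per object pair.
observed = [1, 2, 3, 4]
on_map = [1, 3, 2, 4]
fit = r_squared(observed, on_map)
print(fit, "acceptable" if fit >= 0.6 else "below the minimum level")
```

Both lists hold one value per pair of objects, in the same order. The function assumes that neither list is constant, since the correlation is undefined in that case.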

Multidimensional scaling results

Various tests

Other possible tests are Kruskal-type stress tests, split-data tests, data stability tests, and test-retest reliability tests. Report the test results in detail. Along with the map, at least the distance measure (e.g. Sørensen index, Jaccard index) and a reliability measure (e.g. the stress value) should be reported.
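
The quantities named above are simple to compute. The following sketch shows the Jaccard and Sørensen indices expressed as dissimilarities between two sets, and a Kruskal-type stress value (stress-1). The set-based form of the indices and the function names are assumptions made for illustration; abundance-based variants of these indices also exist.

```python
import math

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two sets."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)

def sorensen_distance(a, b):
    """1 - 2|A & B| / (|A| + |B|) for two sets."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * len(a & b) / total

def stress_1(distances, disparities):
    """Kruskal's stress-1: residual sum of squares between configuration
    distances and disparities, normalized by the sum of squared distances."""
    residual = sum((d - h) ** 2 for d, h in zip(distances, disparities))
    scale = sum(d ** 2 for d in distances)
    return math.sqrt(residual / scale)

# Invented example: species lists of two sites.
site_a = {"oak", "birch", "pine"}
site_b = {"birch", "pine", "spruce"}
print(jaccard_distance(site_a, site_b), sorensen_distance(site_a, site_b))
```

A stress value of zero means that the map reproduces the (transformed) input perfectly; larger values indicate a poorer fit.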

It is also highly advisable to state the algorithm (e.g. Kruskal, Mather), which is often determined by the program used (naming the program sometimes replaces the algorithm report); whether a starting configuration was supplied or chosen at random; the number of runs and of dimensions; the Monte Carlo results; the number of iterations; the stability assessment; and the proportion of variance explained by each axis (R-squared).

Information visualization and the multidimensional scaling method of data analysis

Information visualization is the study of interactive (visual) representations of abstract data to enhance human cognition. Abstract data includes both numeric and non-numeric data such as textual and geographic information. Information visualization differs from scientific visualization, however: "it is infovis (information visualization) when the spatial representation is chosen, and scivis (scientific visualization) when the spatial representation is given."

The field of information visualization emerged from research in human-computer interaction, computer science applications, graphics, visual design, psychology, and business methods. It is increasingly being used as an essential component in scientific research, digital libraries, data mining, financial data, market research, production control, and so on.

Methods and principles

Information visualization presumes that visualization and interaction methods take advantage of the richness of human perception, allowing users to see, explore and understand large amounts of information at once. It aims to create approaches for communicating abstract data and information in an intuitive way.

Color multidimensional scaling

Data analysis is an integral part of all applied research and problem solving in industry. The most fundamental approaches to data analysis are visualization (histograms, scatter plots, surface plots, tree maps, parallel coordinate plots, etc.), statistics (hypothesis testing, regression, PCA, etc.), data mining (association mining, etc.) and machine learning methods (clustering, classification, decision trees, etc.).

Among these approaches, information visualization, or visual data analysis, relies most heavily on the cognitive skills of the analysts and allows the discovery of unstructured, actionable insights that are limited only by human imagination and creativity. An analyst does not need to learn any complex techniques to be able to interpret data visualizations. Information visualization is also a hypothesis generation scheme that can be, and usually is, followed by more analytical or formal analysis such as statistical hypothesis testing.

Study

The modern study of visualization began with computer graphics, which "from the very beginning was used to study scientific problems. However, in the early years, the lack of graphics power often limited its usefulness. The emphasis on visualization began to develop in 1987, with the publication of Visualization in Scientific Computing, a special issue of Computer Graphics. Since then, there have been several conferences and workshops jointly organized by the IEEE Computer Society and ACM SIGGRAPH".

These events covered the general topics of data visualization, information visualization and scientific visualization, as well as more specific areas such as volume rendering.

Multidimensional brand scaling

Summary

Generalized multidimensional scaling (GMDS) is an extension of metric multidimensional scaling in which the target space is non-Euclidean. When the dissimilarities are distances on a surface and the target space is another surface, GMDS makes it possible to find the embedding of one surface into the other with minimal distortion.

GMDS is a relatively new line of research. Currently, its main applications are deformable object recognition (for example, 3D face recognition) and texture mapping.

The purpose of multidimensional scaling is to represent multidimensional data. Multidimensional data, that is, data that requires more than two or three dimensions to represent, can be difficult to interpret. One approach to simplification is to assume that the data of interest lies on a non-linear manifold embedded in a high-dimensional space. If the manifold has a low enough dimension, the data can be visualized in a low-dimensional space.

Many non-linear dimensionality reduction methods are related to linear methods. Non-linear methods can be broadly classified into two groups: those that provide a mapping (either from the high-dimensional space to the low-dimensional embedding, or vice versa), and those that simply provide a visualization. In the context of machine learning, mapping methods can be viewed as a preliminary feature extraction stage, after which pattern recognition algorithms are applied. Those that only provide a visualization are usually based on proximity data, that is, on distance measurements. Multidimensional scaling is also quite common in psychology and other social sciences.

Diagonal multidimensional scaling

If the number of attributes is large, then the space of unique possible rows is exponentially large. Thus, the higher the dimension, the more difficult it becomes to depict the space. This causes many problems. Algorithms that operate on high-dimensional data tend to have very high time complexity. Reducing the data to fewer dimensions often makes analysis algorithms more efficient and can help machine learning algorithms make more accurate predictions. This is why multidimensional scaling is so popular.
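
The exponential growth mentioned above is easy to check with a little arithmetic. The sketch below assumes, for simplicity, that every attribute can take the same number of distinct values.

```python
def possible_rows(values_per_attribute, attributes):
    """Number of distinct rows when each attribute takes the same number of values."""
    return values_per_attribute ** attributes

# Even with only two values per attribute, the space grows very quickly.
for attributes in (2, 10, 50):
    print(attributes, possible_rows(2, attributes))
```

With ten binary attributes there are already 1024 possible rows, and each further attribute doubles that number.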
