Data Mining: concept, algorithm analysis, purpose and application

Author: Angel Austin. Last modified: 2025-01-23 12:19

The development of information technology has brought practical results, but tasks such as finding, analyzing and using information still lack an effective, high-quality tool. Analytical and quantitative tools exist and they really work, but a qualitative revolution in the use of information has not happened yet.

Long before the advent of computer technology, people needed to process large amounts of information and coped as well as their experience and the available technical means allowed.

The development of knowledge and skills has always met real needs and corresponded to current tasks. Data mining is a collective name used to refer to a set of methods for discovering previously unknown, non-trivial, practically useful and accessible knowledge in data, necessary for making decisions in various areas of human activity.

Human, intelligence, programming

A person always knows how to act in any situation. Ignorance or an unfamiliar situation does not prevent him from making a decision. The objectivity and reasonableness of any human decision can be questioned, but it will be accepted.

Intelligence rests on a hereditary "mechanism" and on acquired, active knowledge. That knowledge is applied to solve the problems a person faces.

Intelligence is a unique set of knowledge and skills: opportunities and foundation for human life and work.
Intelligence is constantly evolving, and human actions have an impact on other people.

Programming is the first attempt to formalize the representation of data and the process of creating algorithms.

Artificial intelligence (AI) is a waste of time and resources, but the results of the unsuccessful AI attempts of the last century were not lost. They were used in various expert (intelligent) systems and were transformed, in particular, into algorithms (rules), mathematical (logical) data analysis and Data Mining.

Information and the usual search for a solution

An ordinary library is a repository of knowledge, and the printed word and graphics have not yet yielded the palm to computer technology. Books on physics, chemistry, theoretical mechanics, design, natural history, philosophy, natural science, botany, textbooks, monographs, works of scientists, conference materials, reports on development work, etc. are always relevant and reliable.

A library is a collection of many different sources that differ in form of presentation, origin, structure, content, presentation style, etc.

Library: books, magazines and other printed matter

Everything is in plain view (readable, accessible) for understanding and use. You can solve any problem, set the task correctly, justify the solution, write an essay or term paper, select material for a diploma, or analyze sources on the topic of a dissertation or a scientific and analytical report.

Any information problem can be solved. With due perseverance and skill, an accurate and reliable result will be obtained. In this context, Data Mining is a completely different approach.

In addition to the result, a person receives "active links" to everything that was viewed in the process of achieving the goal. The sources used in solving the problem can be cited, and no one will dispute that a source exists. This is not a guarantee of authenticity, but it clearly shows to whom responsibility for authenticity has been "signed over". From this point of view, Data Mining means serious doubts about reliability and no "active links".

By solving several problems, a person gets results and expands his intellectual potential by many "active links". If a new task “activates” an already existing link, the person will know how to solve it: there is no need to search for anything again.

An "active link" is a fixed association: how and what to do in a particular case. The human brain automatically remembers everything that seems potentially interesting, useful or likely to be needed in the future. In many ways this happens on a subconscious level, but as soon as a task arises that can be associated with an “active link”, it instantly comes to mind and a solution is obtained without any additional search for information. Data Mining, by contrast, always repeats the search algorithm, and this algorithm does not change.

Regular search: "artistic" problems

Finding information in a mathematics library is a relatively simple task. Finding a way to solve an integral, build a matrix, or add two imaginary numbers is laborious but straightforward. You need to go through a number of books, many of which are written in specialized language, find the right text, study it and obtain the required solution.

Over time, this sifting becomes familiar, and the accumulated experience lets you find your way around the library and around other mathematical problems. This is a limited information space of questions and answers. A characteristic feature: such a search for information accumulates knowledge for solving similar problems. A person's search for information leaves traces ("active links") in his memory that point to possible solutions to other problems.

In fiction, finding the answer to the question "How did people live in January 1248?" is very hard. It is even more difficult to answer the question of what was on store shelves and how the food trade was organized. Even if some writer wrote about this clearly and directly in a novel, and even if the name of that writer could be found, doubts about the reliability of the data would remain. Reliability is a critical characteristic of any amount of information. The source, the author and evidence that rules out a false result are all important.

Objective circumstances of a particular situation

A person sees, hears and feels. Some specialists have mastered a unique sense: intuition. Stating a problem requires information, and the process of solving it is most often accompanied by a refinement of the original statement. This is the lesser of the troubles that come with moving information into the bowels of a computer system.

The library and work colleagues are indirect participants in the decision process. The design of the book (source), the graphics in the text, the way information is split into headings, the footnotes to phrases, the subject index, the list of primary sources: everything evokes associations in a person that indirectly affect the process of solving the problem.

The time and place of solving the problem are essential. A person is so arranged that he involuntarily pays attention to everything that surrounds him in the process of solving a problem. It can be distracting, or it can be stimulating. Data Mining will never "understand" this.

Information in virtual space

A person has always been interested only in reliable information about an event, phenomenon, object, algorithm for solving a problem. Man has always imagined exactly how he can achieve the desired goal.

The appearance of computers and information systems should have made life easier for a person, but everything has only become more complicated. Information migrated to the bowels of computer systems and disappeared from sight. To select the required data, you need to create a correct algorithm or formulate a query to the database.

The question must be correct. Only then can you get an answer. But doubts about authenticity remain. In this sense, Data Mining really is "excavation", or "information extraction", which is how it is fashionable to translate the phrase. In Russian the term is rendered as "data extraction" or "intelligent data analysis".

In the works of authoritative specialists, the tasks of Data Mining are indicated as follows:

classification;
clustering;
association;
sequence;
forecasting.

From the point of view of the practice that guides a person in the manual processing of information, all these positions are debatable. In any case, a person processes information automatically and does not think about classifying data, compiling thematic groups of objects (clustering), searching for temporal patterns (sequence) or predicting the result.

All these items are represented in the human mind by active knowledge, which covers more ground and dynamically applies the logic of processing the initial data. A person's subconscious plays an important role, especially when he is a specialist in a particular field of knowledge.
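To make one of the task names listed above concrete, here is a minimal sketch of the "association" task: counting how often two items appear together. The baskets and item names are hypothetical and purely illustrative; real association mining uses more elaborate algorithms than this pair count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets; the item names are illustrative only.
baskets = [
    {"laptop", "mouse", "bag"},
    {"laptop", "mouse"},
    {"printer", "paper"},
    {"laptop", "bag"},
]


def pair_support(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("bag", "laptop") and ("laptop", "bag") the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


support = pair_support(baskets)
for pair, count in sorted(support.items()):
    print(pair, count)
```

Pairs with a high count are candidates for an association rule; whether such a rule is meaningful still has to be judged by a person.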

Example: Wholesale of computer equipment

The task is simple. There are several dozen suppliers of computer equipment and peripherals. Each has a price list in xls format (an Excel file), which can be downloaded from the supplier's official website. It is required to create a web resource that reads the Excel files, converts them into database tables and allows customers to select the desired products at the lowest prices.

Problems arise immediately. Each supplier offers its own version of the structure and content of the xls file. You can get the file by downloading it from the supplier's website, ordering it by e-mail, or getting a download link through your personal account, that is, by officially registering with the supplier.

The solution (at the very beginning) is technologically simple. The files (initial data) are loaded, a file recognition algorithm is written for each supplier, and the data is placed in one large table of initial data. Once all the data has been received, a mechanism for continuously refreshing it (daily, weekly or upon change) is set up to track:

assortment changes;
price changes;
clarification of the quantity in stock;
adjustment of warranty terms, specifications, etc.
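The per-supplier recognition step described above might be sketched as follows. This is a minimal illustration, not the author's implementation: the supplier names, column headers and sample rows are all hypothetical, and the rows are assumed to have already been read from the Excel files into dictionaries.

```python
# Hypothetical column layouts: each supplier names the same fields differently.
SUPPLIER_COLUMNS = {
    "supplier_a": {"name": "Item", "price": "Price, USD", "stock": "Qty"},
    "supplier_b": {"name": "Product name", "price": "Cost", "stock": "In stock"},
}


def normalize_row(supplier, raw_row):
    """Map one raw price-list row onto the layout of the common table."""
    columns = SUPPLIER_COLUMNS[supplier]
    return {
        "supplier": supplier,
        "name": str(raw_row[columns["name"]]).strip(),
        "price": float(raw_row[columns["price"]]),
        "stock": int(raw_row[columns["stock"]]),
    }


def load_price_list(supplier, raw_rows):
    """Normalize every row of one supplier's price list."""
    return [normalize_row(supplier, row) for row in raw_rows]


# Hypothetical rows, as they might look after reading two different files.
raw_a = [{"Item": " Acer notebook ", "Price, USD": "499.90", "Qty": "12"}]
raw_b = [{"Product name": "Dell laptop", "Cost": 650, "In stock": 3}]

common_table = load_price_list("supplier_a", raw_a) + load_price_list("supplier_b", raw_b)
for row in common_table:
    print(row)
```

Keeping the layout differences in one mapping means a new supplier needs only a new entry, as long as its file can be reduced to rows of named fields.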

This is where the real problems begin. The thing is that the supplier can write:

notebook Acer;
notebook Asus;
Dell laptop.

We are talking about the same kind of product, but from different manufacturers. How do you match notebook = laptop, or strip Acer, Asus and Dell out of a product entry?

For a human this is not a problem, but how will the algorithm "understand" that Acer, Asus, Dell, Samsung, LG, HP and Sony are trademarks or suppliers? How should it match "printer" written in two different languages, "scanner" and "MFP", "copier" and "MFP", "headphones" and "headset", or two different words for "accessories"?

Building a category tree from the source data (source files) is already a problem when everything has to be done automatically.
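One common way to approach the matching problem above is a hand-maintained synonym table plus a list of known brands. The sketch below assumes both tables exist and are curated by a person; the brand names come from the text, while the synonym table and the function name `parse_title` are illustrative assumptions.

```python
# Assumed, hand-maintained tables; a real catalogue would need far larger ones.
SYNONYMS = {"notebook": "laptop"}
BRANDS = {"acer", "asus", "dell", "samsung", "lg", "hp", "sony"}


def parse_title(title):
    """Split a raw product title into (category, brand); either may be None."""
    category = None
    brand = None
    for word in title.lower().split():
        if word in BRANDS:
            brand = word
        elif category is None:
            # The first non-brand word is treated as the category name.
            category = SYNONYMS.get(word, word)
    return category, brand


for title in ["notebook Acer", "notebook Asus", "Dell laptop"]:
    print(title, "->", parse_title(title))
```

The sketch only works for titles as simple as the three in the text. It does not solve the general problem the article points to: someone still has to decide that two words mean the same thing.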

Data sampling: excavations of the "freshly poured"

The task of creating a database of computer equipment suppliers has been solved. A tree of categories has been built, a common table with offers from all suppliers is functioning.

Typical Data Mining tasks in the context of this example:

find a product at the lowest price;
select the item with the lowest shipping cost and price;
product analysis: characteristics and prices by criteria.
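The first two tasks in the list above reduce to a minimum search over the common table. A minimal sketch, assuming each offer is a dictionary with hypothetical `price` and `shipping` fields and made-up values:

```python
# Hypothetical offers from the common table; all values are made up.
offers = [
    {"supplier": "A", "name": "laptop", "price": 700.0, "shipping": 30.0},
    {"supplier": "B", "name": "laptop", "price": 690.0, "shipping": 55.0},
    {"supplier": "C", "name": "printer", "price": 120.0, "shipping": 10.0},
]


def cheapest(offers, name, include_shipping=False):
    """Return the cheapest offer for a product, or None if there is none."""
    candidates = [offer for offer in offers if offer["name"] == name]
    if not candidates:
        return None

    def cost(offer):
        extra = offer["shipping"] if include_shipping else 0.0
        return offer["price"] + extra

    return min(candidates, key=cost)


by_price = cheapest(offers, "laptop")
by_total = cheapest(offers, "laptop", include_shipping=True)
print(by_price["supplier"], by_total["supplier"])
```

With these sample values the two criteria pick different suppliers, which is exactly why the choice of criterion matters more than the search itself.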

In the real work of a manager using data from several dozen suppliers, there will be many variations of these tasks, and even more real situations.

For example, there is a supplier "A" who sells ASUS VivoBook S15: prepayment, delivery 5 days after the actual receipt of money. There is a supplier "B" of the same product of the same model: payment upon receipt, delivery after the conclusion of the contract within a day, the price is one and a half times higher.

Data Mining begins - "excavations". Figurative expressions: "excavations" or "data mining" are synonyms. It's about how to get a reason to make a decision.

Suppliers "A" and "B" have a history of deliveries. Weigh prepayment in the first case against payment on receipt in the second, taking into account that the delivery failure rate in the second case is 65% higher. The risk of penalties from the client rises or falls accordingly. What should be determined, how, and what decision should be made?

On the other hand, the database was created by a programmer and a manager. If the programmer and the manager have been replaced, how do you determine the current state of the database and learn to use it correctly? Here, too, you will have to do data mining. Data Mining offers a variety of mathematical and logical methods that do not care what kind of data is being researched. This gives the correct solution in some cases, but not in all.
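One possible way to turn delivery history into a basis for the decision is an expected-cost comparison. The sketch below is only an illustration under stated assumptions. The base price, A's failure rate and the penalty are hypothetical numbers. Only the two ratios come from the text: B's price is 1.5 times higher, and B's failure rate is read here as 1.65 times A's. Prepayment risk and delivery time are deliberately left out.

```python
def expected_cost(price, failure_rate, penalty):
    """Price plus the client penalty weighted by the chance of a failed delivery."""
    return price + failure_rate * penalty


base_price = 1000.0  # hypothetical price from supplier "A"
failure_a = 0.10     # hypothetical failure rate taken from A's delivery history
penalty = 400.0      # hypothetical penalty owed to the client on failure

cost_a = expected_cost(base_price, failure_a, penalty)
# Ratios from the text: price 1.5 times higher, failure rate 65% higher.
cost_b = expected_cost(base_price * 1.5, failure_a * 1.65, penalty)

better = "A" if cost_a < cost_b else "B"
print(round(cost_a, 2), round(cost_b, 2), better)
```

A manager would still have to weigh what the formula ignores, such as money tied up in prepayment or a client who needs the goods tomorrow.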

Moving into virtuality and finding meaning

Data Mining methods become meaningful as soon as information has been written into a database and has disappeared from the “field of view”. Trading in computer equipment is an interesting task, but it is just a business. A company's success depends on how well that business is organized.

Climate change on the planet and the weather in a particular city are of interest to everyone, not just professional climate experts. Thousands of sensors take readings of wind, humidity and pressure, artificial Earth satellites supply their own data, and there is a history of data going back years and centuries.

Weather data is not only about deciding whether to bring an umbrella to work. Data mining technologies support the safe flight of an airliner, the stable operation of a highway and the reliable supply of petroleum products by sea.

"Raw" data is sent to the information system. The tasks of Data Mining are to turn it into an organized system of tables, establish links, highlight groups of homogeneous data, and detect patterns.

Mathematical and logical methods since the days of quantitative analytics OLAP (On-line Analytical Processing) have shown their practicality. Here, technology allows you to find meaning, and not lose it, as in the example of selling computer equipment.
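The steps named above (organizing raw readings, grouping homogeneous data, detecting a pattern) can be sketched in a few lines. The station names, readings and the very simple "steadily falling" rule are hypothetical; real weather systems rely on far richer models.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw readings: (station, kind of measurement, value).
raw = [
    ("station_1", "pressure", 1012.0),
    ("station_1", "pressure", 1009.0),
    ("station_1", "pressure", 1004.0),
    ("station_2", "humidity", 61.0),
    ("station_2", "humidity", 64.0),
]


def group_readings(rows):
    """Group homogeneous readings by (station, kind), keeping arrival order."""
    groups = defaultdict(list)
    for station, kind, value in rows:
        groups[(station, kind)].append(value)
    return dict(groups)


def is_falling(values):
    """True when every reading is lower than the one before it."""
    return len(values) > 1 and all(b < a for a, b in zip(values, values[1:]))


groups = group_readings(raw)
summary = {key: (mean(values), is_falling(values)) for key, values in groups.items()}
for key, (average, falling) in summary.items():
    print(key, round(average, 1), falling)
```

Here the pattern rule is fixed in advance by a person. Discovering which rules are worth checking is the harder part that the article attributes to Data Mining.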

Moreover, in global tasks:

transnational business;
air transportation management;
study of the bowels of the earth or social problems (at the state level);
study of the effect of drugs on a living organism;
predicting the consequences of the construction of an industrial enterprise, etc.

In such tasks, Data Mining technologies, which turn “meaningless” data into real data that supports objective decisions, are the only option.

Human possibilities end where there is a large amount of raw information. Data mining systems lose their usefulness where it is required to see, understand and feel information.

Reasonable distribution of functions and objectivity

Man and computer should complement each other: this is an axiom. When writing a dissertation, the person has priority and the information system is an aid. Here, what Data Mining technology contributes is heuristics, rules and algorithms.

Preparing a weekly weather forecast is the priority of the information system. A person manages the data but bases his decisions on the results of the system's calculations. Such forecasting combines Data Mining methods, data classification by specialists, manual control over the application of algorithms, automatic comparison with past data, mathematical forecasting and a great deal of knowledge and skill on the part of the real people who operate the information system.

Probability theory and mathematical statistics are not the most "favorite" and understandable areas of knowledge. Many specialists are very far from them, but the methods developed in these areas give almost 100% correct results. By applying systems based on the ideas, methods and algorithms of Data Mining, solutions can be obtained objectively and reliably. Otherwise, it is simply impossible to get a solution.

Pharaohs and mysteries of past centuries

History was periodically rewritten:

states - for the sake of their strategic interests;
authoritative scientists - for the sake of their subjective beliefs.

It's hard to tell what's true and what's false. The use of Data Mining allows us to approach this problem. For example, the technology of building pyramids was described by chroniclers and studied by scientists in different centuries. Not all materials have made it onto the Internet, not everything there is unique, and many records may lack:

described point in time;
time of writing the description;
dates on which the description is based;
author(s), opinions (links) taken into account;
confirmation of objectivity.
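The list of possibly missing attributes above maps naturally onto a record type with optional fields. A minimal sketch, assuming hypothetical field names chosen to mirror the list:

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SourceRecord:
    """One historical source; every provenance field may be unknown (None)."""
    text: str
    described_period: Optional[str] = None      # point in time being described
    written_on: Optional[str] = None            # when the description was written
    based_on_dates: Optional[str] = None        # dates the description relies on
    authors: Optional[str] = None               # author(s) and opinions taken into account
    objectivity_evidence: Optional[str] = None  # confirmation of objectivity


def missing_provenance(record):
    """List the names of the provenance fields that are not filled in."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Hypothetical record with only the author known.
record = SourceRecord(text="Description of pyramid construction", authors="unknown chronicler")
gaps = missing_provenance(record)
print(gaps)
```

Recording which attributes are absent, rather than silently dropping such sources, keeps the doubts about reliability visible during later analysis.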

In libraries, temples and "unexpected places" you can find manuscripts from different centuries and material evidence of the past.

An interesting goal: to put everything together and unearth the "truth". A feature of the problem: the information spans from the first description by a chronicler, written during the lifetime of the pharaohs, to the current century, in which many scientists study the problem by modern methods.

Rationale for using Data Mining: manual labor is not possible. There is too much of everything:

sources of information;
representation languages;
researchers describing the same thing in different ways;
dates, events and terms;
term correlation problems;
statistics by data group, which may differ over time, etc.

At the end of the last century, when another fiasco of the idea of artificial intelligence became obvious not only to the layman, but also to a sophisticated specialist, the idea appeared: “to recreate the personality.”

For example, from the works of Pushkin, Gogol or Chekhov, a certain system of rules and behavioral logic is formed, and an information system is created that can answer certain questions as the person would: Pushkin, Gogol or Chekhov. Theoretically, such a task is interesting, but in practice it is extremely difficult to implement.

However, such a task suggests a very practical idea: "how to create an intelligent information search." The Internet is a multitude of developing resources and a huge database, and this is a great opportunity to apply Data Mining in combination with human logic in the format of joint development.

A machine and a person working as a pair is an excellent task and an undoubted success in the field of "information archeology": high-quality excavations in data, and results that may cast doubt on some things but will undoubtedly yield new knowledge and be in demand in society.