Frequency text analysis: features and examples

Anonim

If you have ever worked with texts, you have probably met this concept more than once. You may, for example, have used online calculators that perform frequency analysis of a text. These handy tools show how many times a particular character or letter occurs in a passage, often together with its percentage share. Why is this needed? How does frequency analysis help to "crack" simple ciphers? What is its essence, and who invented it? This article answers these and other questions on the topic.

Definition

Frequency analysis is a branch of cryptanalysis. It rests on the assumption that individual characters and their regular sequences follow a non-trivial statistical distribution in both plaintext and ciphertext.

This distribution is assumed to be preserved, up to the substitution of individual characters, through encryption and decryption.


Process characteristic

Put simply, in sufficiently long texts written in the same language, a given letter of the alphabet occurs with approximately the same relative frequency from one text to another.

What does this mean for a monoalphabetic cipher? If a character in the ciphertext occurs with a probability similar to that of a certain letter of the language, it is reasonable to assume that the character stands for that letter.

The same reasoning is applied to bigrams (sequences of two letters), and to trigrams in the case of polyalphabetic ciphers.
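
The matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `rank_letters` and `guess_substitution` are hypothetical, and pairing letters purely by frequency rank gives a first guess that usually needs manual correction, especially on short ciphertexts.

```python
from collections import Counter

def rank_letters(text):
    """Return the letters of `text` ordered from most to least frequent."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    # Sort by descending count, then alphabetically so that ties are deterministic.
    return [ch for ch, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def guess_substitution(ciphertext, reference_text):
    """Pair the k-th most frequent cipher letter with the k-th most frequent
    letter of a reference text written in the same language (assumed helper)."""
    cipher_rank = rank_letters(ciphertext)
    reference_rank = rank_letters(reference_text)
    return dict(zip(cipher_rank, reference_rank))
```

The reference text should be long and written in the same language as the presumed plaintext; the resulting dictionary maps each cipher character to its candidate plaintext letter.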

History of the method

Frequency analysis is not a modern invention. It has been known since the 9th century, and its creation is associated with the name of Al-Kindi.

The best-known applications of the method, however, belong to a much later period. The most striking example is the decipherment of Egyptian hieroglyphs by J.-F. Champollion in 1822.

Fiction also contains many interesting references to this decryption method:

  • Arthur Conan Doyle - "The Adventure of the Dancing Men".
  • Jules Verne - "The Children of Captain Grant".
  • Edgar Allan Poe - "The Gold-Bug".

However, since the middle of the last century, most encryption algorithms have been designed to resist frequency cryptanalysis. Today the method is therefore used mostly for training future cryptographers.


Basic method

Let us now look at frequency analysis in more detail. It relies directly on the fact that a text consists of words, and words consist of letters. The number of letters in a national alphabet is limited, so the letters can simply be enumerated.

The most important characteristics of a text are the repetition of letters, bigrams, trigrams and other n-grams, the compatibility of letters with one another, and the alternation of vowels and consonants.

The main idea of the method is to count the occurrences of each of the n^m possible m-grams in sufficiently long plaintexts T = t_1 t_2 … t_l composed of letters of the alphabet {a_1, a_2, …, a_n}. To do this, the consecutive m-grams of the text are examined:

t_1 t_2 … t_m, t_2 t_3 … t_{m+1}, …, t_{l-m+1} t_{l-m+2} … t_l.

Let ν(a_{i_1} a_{i_2} … a_{i_m}) denote the number of occurrences of the m-gram a_{i_1} a_{i_2} … a_{i_m} in a text T, and let L be the total number of m-grams counted. Experience shows that, for sufficiently large L, the frequencies ν(a_{i_1} a_{i_2} … a_{i_m}) / L of a given m-gram differ little from one text to another.
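
As a rough illustration of this counting procedure, the sketch below slides a window of m characters over a text and divides each count by L, the number of m-grams examined. The function names `count_mgrams` and `mgram_frequencies` are hypothetical, and the text is taken as-is (no filtering of spaces or punctuation), which is a simplifying assumption.

```python
from collections import Counter

def count_mgrams(text, m):
    """Count every run of m consecutive characters in `text` (sliding window)."""
    windows = (text[i:i + m] for i in range(len(text) - m + 1))
    return Counter(windows)

def mgram_frequencies(text, m):
    """Return the relative frequency (count / L) of each m-gram in `text`."""
    counts = count_mgrams(text, m)
    total = sum(counts.values())  # L: the total number of m-grams examined
    return {gram: n / total for gram, n in counts.items()}
```

For a text of l characters the window produces l - m + 1 m-grams, so a text shorter than m yields an empty result rather than an error.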


Frequently occurring letters of the Russian alphabet

Time-frequency analysis, despite the similar name, has nothing to do with our topic. That kind of analysis is applied to signals from low-observable radar stations and uses a special wavelet transform.

Now back to the main topic. A frequency analysis of sufficiently long texts shows which letters of the Russian alphabet occur most often (relative frequencies from 0.062 down to 0.018):

  • A.
  • V.
  • D.
  • F.
  • I.
  • K.
  • M.
  • O.
  • R.
  • T.
  • F.
  • T.
  • Sh.
  • b.
  • E.
  • I.

There is even a mnemonic for learning the most common letters of the Russian alphabet: it is enough to remember a single word, "сеновал" ("hayloft").

In the general case, the percentage frequency of a letter is easy to determine: count how many times the letter occurs in the text, divide that number by the total number of characters in the text, and multiply the result by 100.

Keep in mind that the frequency depends not only on the length of the text but also on its nature. For example, the letter "F" appears much more often in technical sources than in fiction. For objective results, therefore, a specialist must collect texts of various kinds and styles for the study.
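
The calculation just described might look like this in Python. It is a minimal sketch: the names `char_percentage` and `percentage_table` are hypothetical, and the divisor is the total number of characters (including spaces and punctuation), as in the description above.

```python
from collections import Counter

def char_percentage(text, char):
    """Percentage of `text` made up of `char`: occurrences / total characters * 100."""
    if not text:
        return 0.0
    return text.count(char) / len(text) * 100

def percentage_table(text):
    """Percentage share of every character in `text`, most frequent first."""
    total = len(text)
    return {ch: n / total * 100 for ch, n in Counter(text).most_common()}
```

To count letters only, one could filter the text first, for example with `"".join(ch for ch in text if ch.isalpha())`, before passing it to either function.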


Bigrams, trigrams and four-grams

Meaningful texts also contain combinations of two or more letters that are repeated most often. Specialists have compiled tables of bigram frequencies for various alphabets.

For Russian, frequency analysis of large meaningful texts has established the most common bigrams and trigrams:

  • EN.
  • ST.
  • NO.
  • NE.
  • NA.
  • RA.
  • OV.
  • KO.
  • VO.
  • STO.
  • NOV.
  • ENO.
  • TOV.
  • OVA.
  • OVO.

Preferred relationships of letters to each other

These are not all the possibilities that frequency analysis offers to text researchers. By systematizing the information from such bigram and trigram tables, one can extract data on the most common combinations of letters, in other words, their preferred relationships with each other.

Experts have already carried out such a study. Its result is a table in which each letter of the alphabet is listed together with its neighbors, that is, the characters most often found immediately before and immediately after it. The order of the letters in the table is not accidental: the most frequent neighbors are placed closest to the letter, and rarer ones farther away.

Consider examples:

  • Letter "A". The preferred connections are: "l-d-k-t-v-r-n-A-l-n-s-t-r-v-k-m". The letter that most often precedes "A" is therefore "N" ("NA"), and the letter that most often follows "A" in Russian texts is "L" ("AL").
  • Letter "M". Experts have identified such preferred connections: "I-s-a-i-e-o-M-i-e-o-u-a-n-p-s".
  • Letter "b". Preferred connections are as follows: "n-s-t-l-b-n-k-v-p-s-e-o-i".
  • Letter "Sch". Preferred connections: "e-b-a-i-u-Sch-e-i-a".
  • Letter "P". Preferred connections with this symbol of the Russian alphabet: "v-s-u-a-i-e-o-P-o-r-e-a-u-i-l".
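
A neighbor table of this kind could be derived from any sample text along the following lines. This is a sketch only: `neighbor_table` and `preferred_neighbors` are hypothetical names, the `top` parameter is an assumption, and the output format merely imitates the dash-separated strings shown in the examples (most frequent neighbors closest to the capitalized letter).

```python
from collections import Counter, defaultdict

def neighbor_table(text):
    """For each letter, count which letters directly precede and follow it."""
    before = defaultdict(Counter)
    after = defaultdict(Counter)
    for left, right in zip(text, text[1:]):
        # Only adjacent letter pairs count; spaces and punctuation break the link.
        if left.isalpha() and right.isalpha():
            after[left][right] += 1
            before[right][left] += 1
    return before, after

def preferred_neighbors(text, letter, top=3):
    """Format the `top` most frequent neighbors around `letter`."""
    before, after = neighbor_table(text.lower())
    # Reverse the predecessors so the most frequent one sits next to the letter.
    preceding = [ch for ch, _ in before[letter].most_common(top)][::-1]
    following = [ch for ch, _ in after[letter].most_common(top)]
    return "-".join(preceding + [letter.upper()] + following)
```

The result is only as representative as the sample: a short or stylistically narrow text will give neighbor rankings that differ from those in published tables.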

What does the analysis determine?

Modern frequency text analysis programs help to study large volumes of a wide variety of articles, essays, passages, and so on. The following information is provided to the researcher as standard:

  • Total number of characters in the text.
  • Number of spaces used by the author.
  • Number of digits.
  • Information about used punctuation marks - periods, commas, etc.
  • The number of letters in each of the available alphabets - Cyrillic, Latin, etc.
  • Frequency of use of each letter and symbol in the text - the number of occurrences and the percentage of the entire text.
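
A report with most of the fields listed above can be assembled with the Python standard library, as in the sketch below. The function name `text_statistics` and the dictionary keys are hypothetical. Two simplifying assumptions are made: punctuation is limited to the ASCII set in `string.punctuation`, and the alphabet of a letter is taken from the first word of its Unicode character name (for example "CYRILLIC" or "LATIN").

```python
import string
import unicodedata
from collections import Counter

def text_statistics(text):
    """Collect basic counts of the kind a frequency-analysis tool reports."""
    stats = {
        "characters": len(text),
        "spaces": text.count(" "),
        "digits": sum(ch.isdigit() for ch in text),
        # Assumption: only ASCII punctuation marks are counted here.
        "punctuation": sum(ch in string.punctuation for ch in text),
    }
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names begin with the script, e.g. "CYRILLIC SMALL LETTER A".
            scripts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    stats["letters_by_alphabet"] = dict(scripts)
    return stats
```

Per-character frequencies are left out of this report on purpose; they can be added with a `Counter` over the text in the same way as for single letters.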

Struggle against overoptimization and oversaturation

Why is frequency analysis of text performed? Is it merely out of curiosity, to find out which characters occur most often in a text? No, the main application of the analysis is practical, and it lies elsewhere.

N-grams are not limited to stable bigrams and trigrams of letters. The same category includes keywords (tags) and collocations, that is, stable combinations of two or more words. Such combinations occur together in the text and carry a certain semantic load.

Unscrupulous SEO specialists take advantage of this. They sometimes abuse the repetition of tags and keywords in a text in order to artificially increase the relevance of a web page. One such "trick" is to turn a natural combination with the word order usual for Russian ("buy a mink coat") into an unnatural one obtained by rearranging the words of that N-gram ("mink coat buy").

Today, however, search algorithms detect over-optimization as effectively as overspam, that is, the oversaturation of a text with keywords and tags that affect the ranking of search results. Over-optimized pages are now, on the contrary, ranked lower for the user's query. Readers, too, tend not to read meaningless text oversaturated with tags and prefer useful information on another resource.


How frequency analysis helps SEO specialists

Modern search engine text filters thus give preference to pages whose content is not only easy to read but also useful to visitors. To adapt their work to the new standards, SEO specialists turn to frequency analysis of text, which many popular services now offer.

Frequency analysis helps to check a text being prepared for publication for informativeness and to eliminate redundant tags and key phrases. It also draws the author's attention to unnatural word combinations that arouse the suspicion of search engine text filters.
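
A simple redundancy check of this kind can be sketched as follows. The names `phrase_density` and `overused_phrases` are hypothetical, and the default `threshold` of 0.05 is an arbitrary illustrative value, not a figure used by any search engine or stated in this article.

```python
import re
from collections import Counter

def phrase_density(text, n=1):
    """Share of each n-word phrase among all n-word phrases in `text`."""
    words = re.findall(r"\w+", text.lower())
    phrases = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    total = len(phrases)
    return {phrase: count / total for phrase, count in Counter(phrases).items()}

def overused_phrases(text, n=1, threshold=0.05):
    """Phrases whose share exceeds `threshold` (an assumed, adjustable cut-off)."""
    density = phrase_density(text, n)
    return sorted(p for p, share in density.items() if share > threshold)
```

In practice an author would run the check with n set to 1, 2 and 3 and review the flagged phrases by hand, since common function words will naturally exceed any low threshold.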


Frequency analysis of text thus determines how often a particular character occurs in a source. Today the method is also used to assess whether a text is overloaded with tags and unnatural permutations of words.
