What is corpus linguistics?

Anonim

Several decades ago, researchers could only dream of automating linguistic research. The work was done by hand and involved large numbers of students, errors of inattention were likely, and, most importantly, it all took an enormous amount of time.

As computer technology developed, it became possible to conduct research much faster, and today corpus linguistics is one of the promising areas in the study of language. Its main feature is the use of large amounts of text consolidated into a single database, annotated in a special way, and called a corpus.

Today there are many corpora created for different purposes and based on different language material, ranging from millions to tens of billions of lexical units. The field is recognized as promising and has made significant progress toward both applied and research goals. Professionals who deal with natural language in one way or another are advised to become familiar with text corpora at least at a basic level.

History of corpus linguistics

The field took shape with the creation of the Brown Corpus in the USA in the early 1960s. The collection consisted of only 1 million word forms, and today a corpus of that size would be completely uncompetitive. This is largely due to the pace of development of computer technology and to the growing demand for new research resources.

In the 1990s, corpus linguistics became a full-fledged, independent discipline, and collections of texts were compiled and annotated for several dozen languages. During this period, for example, the British National Corpus was created, containing 100 million words.

As the field develops, the volume of texts grows (reaching billions of lexical units) and the markup becomes more and more diverse. Today you can find online corpora of written and spoken language, multilingual and educational corpora, corpora focused on fiction or academic literature, and many other varieties.

What types of corpora are there

Corpora can be classified in several ways. Intuitively, the basis for classification can be the language of the texts (Russian, German), the access mode (open, closed, commercial), or the genre of the source material (fiction, documentary, academic, journalism).

Material representing spoken language is collected in an interesting way. Since deliberately recording such speech would create artificial conditions for the respondents, and the resulting material could not be called "spontaneous", modern corpus linguistics took a different route. A volunteer is equipped with a microphone, and all conversations in which they take part during the day are recorded. The people around them, of course, do not know that in the course of an everyday conversation they are contributing to the development of science.

Later, the audio recordings are stored in a data bank and accompanied by a written transcript. This makes possible the markup needed to create a corpus of everyday spoken language.

Application

Wherever language is used, text corpora can be used too. Corpus methods in linguistics can serve the following purposes:

  • Creating sentiment analysis programs, which are widely used in politics and business to track positive and negative feedback from voters and customers, respectively.
  • Connecting an information system to dictionaries and translators to improve their performance.
  • Various research tasks that contribute to understanding the structure of a language, the history of its development, and predictions of how it will change in the near future.
  • Developing information extraction systems based on morphological, syntactic, semantic and other features.
  • Optimizing the work of various linguistic systems, etc.

Using corpus interfaces

The interface of such a resource resembles a typical search engine: the user is prompted to enter a word or combination of words to search the database. In addition to exact-form queries, an advanced search is available that lets you find text by almost any linguistic criterion.

The basis for the search can be:

  • part of speech;
  • grammatical features;
  • semantics;
  • stylistic and emotional coloring.

Search criteria can also be combined for a sequence of words: for example, you can find all occurrences of a verb in the present tense, first person singular, followed by the preposition "in" and a noun in the accusative case. Such a simple task takes the user a few seconds and only a few mouse clicks in the relevant fields.
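A minimal sketch of how such a combined query might be evaluated over morphologically tagged tokens. The `Token` structure, the tag names and the `find_sequences` function are assumptions made for illustration; they do not reproduce the query language of any real corpus manager.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str                                   # the word as it appears in the text
    pos: str                                    # part-of-speech tag
    feats: dict = field(default_factory=dict)   # grammatical features


def matches(token, pattern):
    """True if the token satisfies every constraint in the pattern."""
    for key, wanted in pattern.items():
        if key == "form":
            actual = token.form.lower()
        elif key == "pos":
            actual = token.pos
        else:
            actual = token.feats.get(key)
        if actual != wanted:
            return False
    return True


def find_sequences(tokens, patterns):
    """Return start indexes where consecutive tokens match the patterns in order."""
    hits = []
    for start in range(len(tokens) - len(patterns) + 1):
        window = tokens[start:start + len(patterns)]
        if all(matches(t, p) for t, p in zip(window, patterns)):
            hits.append(start)
    return hits


# Hand-tagged toy sentence (tags are illustrative, not produced by a real tagger).
sentence = [
    Token("I", "PRON", {"person": "1", "number": "sing"}),
    Token("believe", "VERB", {"tense": "pres", "person": "1", "number": "sing"}),
    Token("in", "ADP"),
    Token("miracles", "NOUN", {"case": "acc", "number": "plur"}),
]

# Verb (present, 1st person singular) + preposition "in" + noun in the accusative.
query = [
    {"pos": "VERB", "tense": "pres", "person": "1", "number": "sing"},
    {"form": "in"},
    {"pos": "NOUN", "case": "acc"},
]

hits = find_sequences(sentence, query)
```

Here `hits` holds the position of the word "believe", where the three-word pattern begins. A real corpus would use an index rather than scanning every position, but the matching logic is the same in spirit.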

The search itself can be carried out across all subcorpora or within a single, specifically selected one, depending on the goal.

Creation process

A corpus is typically built in the following steps:

  1. First of all, it is determined which texts will form the basis of the corpus. For practical purposes, journalistic and newspaper materials and Internet comments are often used. Research projects draw on a wide variety of text types, but the texts must be selected on some common basis.
  2. The resulting set of texts is preprocessed: errors, if any, are corrected, and a bibliographic and extralinguistic description of each text is prepared.
  3. All non-textual information is filtered out: graphics, pictures and tables are deleted.
  4. Tokens, usually words, are extracted for further processing.
  5. Finally, morphological, syntactic and other markup of the resulting set of elements is carried out.

The result of all these operations is a syntactic structure with a set of elements distributed over it; for each element a part of speech, grammatical features and, in some cases, semantic features are defined.
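The later steps of this process can be sketched in a few lines. This is only an illustration under simplifying assumptions: the regular expressions are crude, the `LEXICON` dictionary stands in for a real morphological analyzer, and all function names are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # crude removal of markup remnants
TOKEN_RE = re.compile(r"\w+|[^\w\s]")    # words, or single punctuation marks
PUNCT_RE = re.compile(r"[^\w\s]")

# Toy lexicon standing in for a real morphological analyzer (an assumption).
LEXICON = {
    "the": "DET",
    "corpus": "NOUN",
    "grows": "VERB",
    "quickly": "ADV",
}


def clean(raw):
    """Step 3: drop non-textual residue and normalize whitespace."""
    text = TAG_RE.sub(" ", raw)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Step 4: split the text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def annotate(tokens):
    """Step 5: attach a part-of-speech tag to each token."""
    tagged = []
    for tok in tokens:
        if PUNCT_RE.fullmatch(tok):
            tagged.append((tok, "PUNCT"))
        else:
            tagged.append((tok, LEXICON.get(tok.lower(), "UNK")))
    return tagged


def build_corpus_entry(raw, source):
    """Bundle the description from step 2 with the annotated tokens."""
    tokens = tokenize(clean(raw))
    return {"meta": {"source": source}, "tokens": annotate(tokens)}


entry = build_corpus_entry("<p>The corpus   grows quickly.</p>", source="example.txt")
```

Words missing from the toy lexicon receive the tag `UNK`, which is exactly the kind of case that needs manual refinement in a real project.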

Difficulties in creating corpora

It is important to understand that putting together a lot of words or sentences is not enough to get a corpus. On the one hand, a collection of texts must be balanced, that is, it must represent different types of texts in certain proportions. On the other hand, the contents of the corpus must be annotated in a special way.

The first issue is resolved by agreement: for example, the collection may include 60% fiction and 20% documentary texts, with certain shares given to written renderings of spoken language, legislative acts, scientific papers, etc. There is no ideal recipe for a balanced corpus today.

The second issue, content markup, is harder to solve. There are special programs and algorithms for automatic markup of texts, but they do not give a 100% correct result, can fail, and require manual refinement. The opportunities and difficulties involved are described in detail in the work of V. P. Zakharov on corpus linguistics.
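A small sketch of how the balance of a collection could be checked against agreed proportions. The 60% and 20% targets come from the example above; lumping everything else into a single "other" category, as well as the function names and the sample collection, are assumptions made purely for illustration.

```python
from collections import Counter

# Target shares: 60% and 20% follow the example in the text;
# the single "other" bucket for the remainder is an assumption.
TARGET = {"fiction": 0.60, "documentary": 0.20, "other": 0.20}


def genre_shares(texts):
    """Compute each genre's share of the total token count."""
    totals = Counter()
    for genre, token_count in texts:
        totals[genre] += token_count
    grand_total = sum(totals.values())
    return {genre: count / grand_total for genre, count in totals.items()}


def imbalance(texts, target=TARGET):
    """Return actual minus target share per genre (positive = overrepresented)."""
    shares = genre_shares(texts)
    return {genre: round(shares.get(genre, 0.0) - wanted, 3)
            for genre, wanted in target.items()}


# Hypothetical collection: (genre, number of tokens) pairs.
collection = [("fiction", 700), ("documentary", 100), ("other", 200)]
report = imbalance(collection)
```

In this made-up collection fiction is overrepresented by ten percentage points and documentary texts are underrepresented by the same amount, so more documentary material would have to be added to reach the agreed proportions.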

Text markup is carried out at several levels, which we will list below.

Morphological markup

From our school days, we remember that the Russian language has different parts of speech, each with its own characteristics. For example, a verb has categories of mood and tense that a noun does not have. A native speaker declines nouns and conjugates verbs without hesitation, but manual labor is not suitable for annotating a corpus of 100 million words. A computer can perform all the necessary operations, but it has to be taught to do so.

Morphological markup is necessary for the computer to "understand" each word as a part of speech with certain grammatical features. Since Russian, like any other language, follows a number of regular rules, an automatic procedure for morphological analysis can be built by implementing a set of algorithms. However, there are exceptions to the rules, as well as various complicating factors. As a result, purely automatic analysis is still far from ideal today: even a 4% error rate means 4 million words requiring manual refinement in a corpus of 100 million.

This problem is described in detail in V. P. Zakharov's book "Corpus Linguistics".
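The interplay of regular rules and exceptions can be illustrated with a deliberately tiny analyzer. English suffixes are used here only for readability; the rule set, the exception list and the function names are assumptions, not a description of any real morphological analyzer.

```python
# Irregular forms are listed explicitly and checked before the rules.
EXCEPTIONS = {
    "went": {"lemma": "go", "pos": "VERB", "tense": "past"},
    "mice": {"lemma": "mouse", "pos": "NOUN", "number": "plur"},
}

# (suffix, features) pairs tried in order; a deliberately tiny rule set.
SUFFIX_RULES = [
    ("ing", {"pos": "VERB", "form": "participle"}),
    ("ed", {"pos": "VERB", "tense": "past"}),
    ("s", {"pos": "NOUN", "number": "plur"}),
]


def analyze(word):
    """Guess the lemma and grammatical features of a single word form."""
    word = word.lower()
    if word in EXCEPTIONS:
        return dict(EXCEPTIONS[word])
    for suffix, feats in SUFFIX_RULES:
        # Require at least two characters to remain after stripping the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return {"lemma": word[: -len(suffix)], **feats}
    return {"lemma": word, "pos": "UNK"}


def expected_errors(corpus_size, error_rate):
    """Number of word forms expected to need manual correction."""
    return round(corpus_size * error_rate)


regular = analyze("walked")    # handled by the "ed" rule
irregular = analyze("went")    # handled by the exception list
misfire = analyze("bus")       # the plural rule fires wrongly here
to_fix = expected_errors(100_000_000, 0.04)
```

The word "bus" is analyzed as the plural of a non-existent "bu", which shows why rule-based output has to be checked by hand, and `to_fix` reproduces the 4 million figure mentioned above.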

Syntactic markup

Syntactic analysis, or parsing, is a procedure that determines the relationships between words in a sentence. With the help of a set of algorithms, it becomes possible to identify the subject, predicate, objects and various constructions in the text. By figuring out which words in a sequence are heads and which are dependents, we can efficiently extract information from the text and train the machine to return only the information we are interested in when answering a search query.
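As an illustration, head-dependent relations can be stored as a simple table and queried for the subject, predicate and object. The parse below is annotated by hand, and the relation labels and function names are assumptions rather than the output or API of a real parser.

```python
from dataclasses import dataclass


@dataclass
class Word:
    index: int      # 1-based position in the sentence
    form: str
    head: int       # index of the governing word, 0 for the root
    relation: str   # label of the link to the head


# Hand-annotated parse of "The student reads a book" (not parser output).
parse = [
    Word(1, "The", 2, "det"),
    Word(2, "student", 3, "subject"),
    Word(3, "reads", 0, "root"),
    Word(4, "a", 5, "det"),
    Word(5, "book", 3, "object"),
]


def dependents(words, head_index, relation=None):
    """Words governed by the given head, optionally filtered by relation."""
    return [w for w in words
            if w.head == head_index and (relation is None or w.relation == relation)]


def extract_fact(words):
    """Pull out the predicate with its subject and object, if present."""
    root = next(w for w in words if w.head == 0)
    subjects = dependents(words, root.index, "subject")
    objects = dependents(words, root.index, "object")
    return {
        "predicate": root.form,
        "subject": subjects[0].form if subjects else None,
        "object": objects[0].form if objects else None,
    }


fact = extract_fact(parse)
```

Once the main and dependent words are known, the "who does what to what" triple can be read off directly, which is the basis for returning short answers instead of whole texts.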

By the way, modern search engines use this to return specific figures instead of lengthy texts in response to queries such as “how many calories are in an apple” or “distance from Moscow to St. Petersburg”. However, to understand even the very basics of the process described, you will need to read the "Introduction to Corpus Linguistics" or another introductory textbook.

Semantic markup

The semantics of a word is, in simple terms, its meaning. A widely used approach in semantic analysis is to assign tags to a word that reflect its membership in a set of semantic categories and subcategories. Such information is valuable for optimizing sentiment analysis algorithms, automatic summarization, and other tasks performed with corpus linguistics methods.

The categories form a tree with a number of "roots", which are abstract words that have very broad semantics. As this tree branches, nodes are formed containing more and more specific lexical elements. For example, the word "creature" can be associated with such concepts as "human" and "animal". The first word will continue to branch into various professions, terms of kinship, nationality, and the second - into classes and types of animals.
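Such a tree can be sketched as a mapping from each category to its parent. The top levels follow the example in the text; the specific leaves ("teacher", "sister", "mammal") and the function names are illustrative assumptions.

```python
# Child -> parent links. The top levels follow the example in the text;
# the leaves are made up for illustration.
PARENT = {
    "human": "creature",
    "animal": "creature",
    "profession": "human",
    "kinship term": "human",
    "teacher": "profession",
    "sister": "kinship term",
    "mammal": "animal",
}


def semantic_path(word):
    """Walk from a word up to the root, collecting every category on the way."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def belongs_to(word, category):
    """True if the category appears anywhere above (or at) the word."""
    return category in semantic_path(word)


tags = semantic_path("teacher")
```

The resulting list of categories is what a semantic tag set would attach to the word, so a search for all words denoting humans would find "teacher" and "sister" but not "mammal".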

Use of information retrieval systems

Corpus linguistics is applied in a wide variety of areas. Corpora are used for compiling and correcting dictionaries, building machine translation systems, summarization, fact extraction, sentiment detection and other kinds of text processing.

In addition, such resources are actively used to study the languages of the world and the mechanisms by which language functions in general. Access to large volumes of prepared data makes it possible to study quickly and comprehensively the trends in the development of languages, the formation of neologisms and set phrases, changes in the meanings of lexical units, etc.

Because working with such large volumes of data requires automation, computational and corpus linguistics today interact closely.

National Corpus of the Russian Language

This corpus (abbreviated RNC in English, НКРЯ in Russian) includes a number of subcorpora that allow the resource to be used for a wide variety of tasks.

Materials in the RNC database are divided into:

  • publications in the media of the 1990s and 2000s, both domestic and foreign;
  • recordings of oral speech;
  • accentologically marked texts (i.e. with accent marks);
  • dialect speech;
  • poetic works;
  • materials with syntactic markup, etc.

The information system also includes subcorpora with parallel translations of works from Russian into English, German, French and many other languages (and vice versa).

The database also has a section of historical texts representing written Russian at various periods of its development, as well as a teaching corpus that can be useful to foreigners learning Russian.

The National Corpus of the Russian Language includes 400 million lexical units and in many respects is ahead of a significant share of the corpora of European languages.

Prospects

One fact in favor of recognizing this area as promising is the presence of corpus linguistics laboratories in Russian universities as well as foreign ones. Research on and with these information retrieval resources also drives the development of certain high-technology areas, such as the question-answering systems discussed above.

Further development of corpus linguistics is predicted at all levels: from the technical one, with the introduction of new algorithms that optimize searching and processing information, more capable computers and larger RAM, to the everyday one, as users find more and more ways to use this type of resource in daily life and at work.

In closing

In the middle of the last century, 2017 seemed like a distant future in which spacecraft roam the expanses of the Universe and robots do all the work for people. In reality, however, science is still full of "blank spots" and is making strenuous attempts to answer questions that have troubled humankind for centuries. Questions about how language functions take pride of place here, and corpus and computational linguistics can help us answer them.

Processing large amounts of data makes it possible to detect patterns that were previously inaccessible, predict the development of certain language features, and track the formation of new words almost in real time.

At a practical, global level, corpora can be considered, for example, a potential tool for assessing public sentiment: the Internet is a continuously updated database of texts created by real users, including comments, reviews, articles, and many other forms of speech.

In addition, working with corpora contributes to the development of the same technologies that underlie information retrieval, familiar to us from Google and Yandex services, machine translation and electronic dictionaries.

It is safe to say that corpus linguistics is only taking its first steps and will develop rapidly in the near future.
