Logistic regression: model and methods

Table of contents:

Logistic regression: model and methods
Logistic regression: model and methods
Anonim

Methods of logistic regression and discriminant analysis are used when it is necessary to clearly differentiate respondents by target categories. In this case, the groups themselves are represented by levels of one single-variant parameter. Let's take a closer look at the logistic regression model and find out why it is needed.

logistic regression
logistic regression

General information

An example of a problem in which logistic regression is used is the classification of respondents into groups who buy and do not buy mustard. Differentiation is carried out in accordance with socio-demographic characteristics. These include, in particular, age, gender, number of relatives, income, etc. In operations, there are differentiation criteria and a variable. The latter encodes the target categories into which, in fact, the respondents should be divided.

Nuances

It should be said that the range of cases in which logistic regression is applied is much narrower than for discriminant analysis. In this regard, the use of the latter as a universal method of differentiation is consideredmore preferred. Moreover, experts recommend starting classification studies with discriminant analysis. And only in case of uncertainty about the results, you can use logistic regression. This need is due to several factors. Logistic regression is used when there is a clear understanding of the type of independent and dependent variables. Accordingly, one of the 3 possible procedures is selected. In discriminant analysis, the researcher always deals with one static operation. It involves one dependent and several independent categorical variables with any type of scale.

Views

The task of a statistical study that uses logistic regression is to determine the probability that a certain respondent will be assigned to a particular group. Differentiation is carried out according to certain parameters. In practice, in accordance with the values of one or more independent factors, it is possible to classify respondents into two groups. In this case, binary logistic regression takes place. Also, the specified parameters can be used when dividing into groups of more than two. In such a situation, multinomial logistic regression takes place. The resulting groups are expressed in levels of a single variable.

logistic regression
logistic regression

Example

Let's say there are respondents' answers to the question of whether they are interested in the offer to purchase a land plot in the suburbs of Moscow. The options are "no"and yes. It is necessary to find out which factors have a predominant influence on the decision of potential buyers. To do this, the respondents are asked questions about the infrastructure of the territory, the distance to the capital, the area of the site, the presence / absence of a residential building, etc. Using binary regression, it is possible to distribute the respondents into two groups. The first will include those who are interested in the acquisition - potential buyers, and the second, respectively, those who are not interested in such an offer. For each respondent, in addition, the probability of being assigned to one or another category will be calculated.

Comparative characteristics

The difference from the two options above is the different number of groups and the type of dependent and independent variables. In binary regression, for example, the dependence of a dichotomous factor on one or more independent conditions is studied. Moreover, the latter can have any type of scale. Multinomial regression is considered a variation of this classification option. In it, more than 2 groups belong to the dependent variable. The independent factors must have either an ordinal or a nominal scale.

Logistic regression in spss

In the statistical package 11-12, a new version of analysis was introduced - ordinal. This method is used when the dependent factor belongs to the same name (ordinal) scale. In this case, independent variables are selected of one specific type. They must be either ordinal or nominal. The classification into several categories is considered the mostuniversal. This method can be used in all studies that use logistic regression. However, the only way to improve the quality of a model is to use all three techniques.

adequacy quality check and logistic regression
adequacy quality check and logistic regression

Ordinal classification

It is worth saying that earlier in the statistical package there was no typical possibility of performing specialized analysis for dependent factors with an ordinal scale. For all variables with more than 2 groups, the multinominal variant was used. The relatively recently introduced ordinal analysis has a number of features. They take into account the specifics of the scale. Meanwhile, in teaching aids, ordinal logistic regression is often not considered as a separate technique. This is due to the following: ordinal analysis does not have any significant advantages over multinomial. The researcher may well use the latter in the presence of both an ordinal and a nominal dependent variable. At the same time, the classification processes themselves almost do not differ from each other. This means that performing ordinal analysis will not cause any difficulties.

Analysis option

Let's consider a simple case - binary regression. Suppose, in the process of marketing research, the demand for graduates of a certain metropolitan university is assessed. In the questionnaire, respondents were asked questions, including:

  1. Are you employed? (ql).
  2. Enter year of graduation (q 21).
  3. What is the averagegraduation score (aver).
  4. Gender (q22).

Logistic regression will evaluate the impact of independent factors aver, q 21 and q 22 on the variable ql. Simply put, the purpose of the analysis will be to determine the likely employment of graduates based on information about the field, year of graduation and GPA.

logistic sigmoid regression indicator
logistic sigmoid regression indicator

Logistic Regression

To set the parameters using binary regression, use the Analyze►Regression►Binary Logistic menu. In the Logistic Regression window, select the dependent factor from the list of available variables on the left. It is ql. This variable must be placed in the Dependent field. After that, it is necessary to introduce independent factors into the Covariates plot - q 21, q 22, aver. Then you need to choose how to include them in your analysis. If the number of independent factors is more than 2, then the method of simultaneous introduction of all variables, which is set by default, is used, but step by step. The most popular way is Backward:LR. Using the Select button, you can include in the study not all respondents, but only a specific target category.

Define Categorical Variables

The Categorical button should be used when one of the independent variables is nominal with more than 2 categories. In this situation, in the Define Categorical Variables window, just such a parameter is placed on the Categorical Covariates section. In this example, there is no such variable. After that, in the drop-down list Contrast followsselect the Deviation item and press the Change button. As a result, several dependent variables will be formed from each nominal factor. Their number corresponds to the number of categories of the initial condition.

Save New Variables

Using the Save button in the main dialog box of the study, the creation of new parameters is set. They will contain the indicators calculated in the regression process. In particular, you can create variables that define:

  1. Belonging to a specific classification category (Groupmembership).
  2. Probability of assigning a respondent to each study group (Probabilities).

When using the Options button, the researcher does not get any significant options. Accordingly, it can be ignored. After clicking the "OK" button, the results of the analysis will be displayed in the main window.

logistic regression coefficient
logistic regression coefficient

Quality check for adequacy and logistic regression

Consider the Omnibus Testsof Model Coefficients table. It displays the results of the analysis of the quality of the approximation of the model. Due to the fact that a step-by-step option was set, you need to look at the results of the last stage (Step2). A positive result will be considered if an increase in the Chi-square indicator is found when moving to the next stage at a high degree of significance (Sig. < 0.05). The quality of the model is evaluated in the Model line. If a negative value is obtained, but it is not considered significant with the overall high materiality of the model, the lastcan be considered practically suitable.

Tables

Model Summary makes it possible to estimate the total variance index, which is described by the constructed model (R Square indicator). It is recommended to use the Nagelker value. The Nagelkerke R Square parameter can be considered a positive indicator if it is above 0.50. After that, the results of the classification are evaluated, in which the actual indicators of belonging to one or another category under study are compared with those predicted based on the regression model. For this, the Classification Table is used. It also allows us to draw conclusions about the correctness of differentiation for each group under consideration.

logistic regression model
logistic regression model

The following table provides an opportunity to find out the statistical significance of the independent factors entered into the analysis, as well as each non-standardized logistic regression coefficient. Based on these indicators, it is possible to predict the belonging of each respondent in the sample to a particular group. Using the Save button, you can enter new variables. They will contain information about belonging to a particular classification category (Predictedcategory) and the probability of being included in these groups (Predicted probabilities membership). After clicking "OK", the calculation results will appear in the main window of Multinomial Logistic Regression.

The first table, which contains indicators important for the researcher, is Model Fitting Information. A high level of statistical significance would indicate high quality andsuitability of using the model in solving practical problems. Another significant table is Pseudo R-Square. It allows you to estimate the proportion of total variance in the dependent factor, which is determined by the independent variables selected for analysis. According to the Likelihood Ratio Tests table, we can draw conclusions about the statistical significance of the latter. Parameter Estimates reflect non-standardized coefficients. They are used in the construction of the equation. In addition, for each combination of variables, the statistical significance of their impact on the dependent factor was determined. Meanwhile, in marketing research, it often becomes necessary to differentiate respondents by category not individually, but as part of the target group. For this, the Observedand Predicted Frequencies table is used.

Practical application

The considered method of analysis is widely used in the work of traders. In 1991, the logistic sigmoid regression indicator was developed. It is an easy-to-use and effective tool for predicting likely prices before they "overheat". The indicator is shown on the chart as a channel formed by two parallel lines. They are equally spaced from the trend. The width of the corridor will depend solely on the timeframe. The indicator is used when working with almost all assets - from currency pairs to precious metals.

logistic regression in spss
logistic regression in spss

In practice, 2 key strategies for using the instrument have been developed: for breakout andfor a turn. In the latter case, the trader will focus on the dynamics of price changes within the channel. As the value approaches the support or resistance line, a bet is placed on the probability that the movement will start in the opposite direction. If the price comes close to the upper border, then you can get rid of the asset. If it is at the lower limit, then you should think about purchasing. The breakout strategy involves the use of orders. They are installed outside the limits at a relatively small distance. Taking into account that the price in some cases violates them for a short time, you should play it safe and set stop losses. At the same time, of course, regardless of the chosen strategy, the trader needs to perceive and evaluate the situation that has arisen on the market as calmly as possible.

Conclusion

Thus, the use of logistic regression allows you to quickly and easily classify respondents into categories according to the given parameters. When analyzing, you can use any particular method. In particular, multinomial regression is universal. However, experts recommend using all the methods described above in combination. This is due to the fact that in this case the quality of the model will be significantly higher. This, in turn, will expand the range of its application.

Recommended: