The assumptions embodied in a statistical model describe a set of probability distributions, some of which are assumed to adequately approximate the distribution from which a particular data set is sampled. The probability distributions inherent in statistical models are what distinguish them from other, non-statistical, mathematical models.
Connection with mathematics
Statistical modeling is rooted primarily in mathematics. A statistical model is usually specified by mathematical equations that relate one or more random variables and possibly other non-random variables. Thus, a statistical model is a "formal representation of a theory" (Herman Adèr, quoting Kenneth Bollen).
All statistical hypothesis tests and all statistical estimates are derived from statistical models. More generally, statistical models are part of the basis of statistical inference.
Methods of statistical modeling
Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions) with a certain property: the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.
The first statistical assumption constitutes a statistical model, because with that assumption alone we can calculate the probability of any event. The alternative statistical assumption does not constitute a statistical model, because with that assumption alone we cannot calculate the probability of every event.
In the above example with the first assumption, it is easy to calculate the probability of an event. However, in some other examples, the calculation may be complex or even impractical (for example, it may require millions of years of computation). For the assumption that constitutes a statistical model, this difficulty is acceptable: performing the calculation does not have to be practically feasible, just theoretically possible.
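The two assumptions are not spelled out in the passage above, so the following sketch uses illustrative stand-ins that are assumptions of this example only: a complete assumption (each face of each die has probability 1/6, and the dice are independent) and an incomplete one (only the probability of face 5 is fixed, here at 1/8). The helper name `event_probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

# Assumed complete assumption: each face has probability 1/6 on each die,
# and the two dice are independent. This fixes the probability of every outcome.
fair_die = {face: Fraction(1, 6) for face in FACES}

def event_probability(event, die=fair_die):
    """Sum the probabilities of all outcomes (a, b) of two dice where event(a, b) holds."""
    return sum(die[a] * die[b] for a, b in product(FACES, repeat=2) if event(a, b))

p_both_five = event_probability(lambda a, b: a == 5 and b == 5)
p_sum_seven = event_probability(lambda a, b: a + b == 7)
print(p_both_five, p_sum_seven)  # prints: 1/36 1/6

# Assumed incomplete assumption: only P(face 5) = 1/8 is known for each die.
# With independent dice, P(both dice show 5) is still determined ...
p_both_five_weighted = Fraction(1, 8) ** 2
# ... but P(sum is 7) is not, because the other five face probabilities
# are left unspecified, so no table like fair_die can be written down.
```

Under the complete assumption any event can be passed to `event_probability`; under the incomplete one only a few events have a computable probability, which is why it would not count as a statistical model.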
Examples of models
Suppose we have a population of schoolchildren whose ages are uniformly distributed. A child's height is stochastically related to age: for example, knowing that a child is 7 years old affects the probability that the child is 5 feet tall (about 152 cm). We could formalize this relationship in a linear regression model such as $\text{height}_i = b_0 + b_1\,\text{age}_i + \varepsilon_i$, where $b_0$ is the intercept, $b_1$ is the coefficient by which age is multiplied to obtain a prediction of height, $\varepsilon_i$ is the error term, and $i$ identifies the child. This implies that height is predicted by age, with some error.
An admissible model must be consistent with all the data points. So a straight line ($\text{height}_i = b_0 + b_1\,\text{age}_i$) cannot be the equation for a model of the data unless it fits all data points exactly, i.e. all data points lie perfectly on the line. The error term $\varepsilon_i$ must be included in the equation so that the model is consistent with all data points.
To make a statistical inference, we first need to assume some probability distribution for the $\varepsilon_i$. For example, we can assume that the $\varepsilon_i$ are independent and identically distributed Gaussian variables with zero mean. In this case, the model has 3 parameters: $b_0$, $b_1$ and the variance of the Gaussian distribution.
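As a rough illustration of the three-parameter model, the sketch below simulates heights from a line plus independent zero-mean Gaussian errors and then recovers the intercept, slope and error variance by ordinary least squares. The "true" parameter values, the age range and the function name `fit_simple_regression` are hypothetical choices made for this example, not values from the text.

```python
import random

random.seed(0)

# Hypothetical "true" parameters, chosen only for illustration.
B0, B1, SIGMA = 80.0, 6.0, 5.0  # intercept (cm), slope (cm per year), error std. dev. (cm)

# Simulate height_i = b0 + b1 * age_i + eps_i with i.i.d. zero-mean Gaussian errors.
ages = [random.uniform(5, 15) for _ in range(500)]
heights = [B0 + B1 * age + random.gauss(0.0, SIGMA) for age in ages]

def fit_simple_regression(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1, error variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss / (n - 2)  # unbiased estimate of the error variance

b0_hat, b1_hat, var_hat = fit_simple_regression(ages, heights)
print(f"b0 = {b0_hat:.2f}, b1 = {b1_hat:.2f}, variance = {var_hat:.2f}")
```

The estimates should land near 80, 6 and 25 (the square of the assumed standard deviation), though the exact figures depend on the random draw.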
General Description
A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that it is non-deterministic. It is used to model statistical data. Thus, in a statistical model defined with mathematical equations, some variables do not have specific values, but instead have probability distributions; that is, some variables are stochastic. In the example above, ε is a stochastic variable; without this variable, the model would be deterministic.
Statistical models are often used in statistical analysis and modeling, even if the physical process being modeled is deterministic. For example, tossing coins is in principle a deterministic process; yet it is usually modeled as stochastic (via a Bernoulli process).
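A Bernoulli process can be sketched in a few lines: each toss is an independent draw that comes up heads with a fixed probability. The function name `bernoulli_process`, the seed and the choice of a fair coin (p = 0.5) are assumptions of this example.

```python
import random

def bernoulli_process(p, n, rng):
    """Return n independent tosses: 1 (heads) with probability p, else 0 (tails)."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(42)  # fixed seed so the run is repeatable
tosses = bernoulli_process(p=0.5, n=10_000, rng=rng)
heads_fraction = sum(tosses) / len(tosses)
print(f"fraction of heads: {heads_fraction:.3f}")
```

The model says nothing about the physics of the toss; it only fixes the probability of each outcome, which is enough to reason about long-run frequencies.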
Parametric models
Parametric models are the most commonly used statistical models. Regarding semi-parametric and non-parametric models, Sir David Cox said: "They generally include fewer assumptions about the structure and shape of the distribution, but usually contain strong independence assumptions." Like the other models mentioned here, they are widely used in statistical modeling.
Multilevel models
Multilevel models (also known as hierarchical linear models, nested data models, mixed models, random coefficient models, random effects models, random parameter models, or split-plot designs) are statistical models of parameters that vary at more than one level. An example is a student achievement model that contains measures for individual students as well as measures for the classrooms in which students are grouped. These models can be thought of as generalizations of linear models (in particular, linear regression), although they can also be extended to non-linear models. They became much more popular once sufficient computing power and software became available.
Multilevel models are particularly suited to research designs where data for participants are organized at more than one level (i.e., nested data). Units of analysis are usually individuals (at a lower level) that are nested within contextual or aggregate units (at a higher level). While the lowest level of data in multilevel models is typically the individual, repeated measurements of individuals can also be examined. Thus, multilevel models provide an alternative to univariate or multivariate repeated measures analysis, and individual differences in growth curves can be examined. In addition, multilevel models can be used as an alternative to ANCOVA, where dependent variable scores are adjusted for covariates (e.g., individual differences) before testing for treatment differences. Multilevel models are able to analyze these experiments without the assumption of homogeneity of regression slopes required by ANCOVA.
Multilevel models can be used for data with many levels, although two-level models are the most common and the rest of this article focuses on these. The dependent variable should be examined at the lowest level of analysis.
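To make the two-level structure concrete, the sketch below simulates student scores nested in classrooms using a random-intercept model, then splits the variance into a classroom part and a student part with simple one-way ANOVA (method-of-moments) estimators. This is not a full multilevel fit of the kind dedicated software performs; the numeric values and the function name `variance_components` are hypothetical.

```python
import random

random.seed(1)

# Hypothetical values chosen for illustration only.
GRAND_MEAN = 70.0   # overall mean score
SD_CLASS = 5.0      # std. dev. of classroom-level effects (level 2)
SD_STUDENT = 10.0   # std. dev. of student-level residuals (level 1)
N_CLASSES, N_STUDENTS = 200, 30

# Level 2: each classroom j draws its own intercept shift u_j.
# Level 1: each student i in classroom j scores GRAND_MEAN + u_j + e_ij.
classes = []
for _ in range(N_CLASSES):
    u_j = random.gauss(0.0, SD_CLASS)
    classes.append([GRAND_MEAN + u_j + random.gauss(0.0, SD_STUDENT)
                    for _ in range(N_STUDENTS)])

def variance_components(groups):
    """Method-of-moments (one-way ANOVA) estimates for equally sized groups."""
    k, n = len(groups), len(groups[0])
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_within = (sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
                 / (k * (n - 1)))
    var_between = max((ms_between - ms_within) / n, 0.0)
    return var_between, ms_within

var_class, var_student = variance_components(classes)
icc = var_class / (var_class + var_student)  # share of variance between classrooms
print(f"class variance = {var_class:.1f}, student variance = {var_student:.1f}, ICC = {icc:.2f}")
```

With the assumed standard deviations, the true share of variance between classrooms is 25 / (25 + 100) = 0.2, and the estimate should fall close to that.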
Model selection
Model selection is the task of selecting a statistical model from a set of candidate models, given data, carried out within the framework of statistical modeling. In the simplest cases, an already existing data set is considered. However, the task may also involve designing experiments so that the data collected are well suited to the model selection task. Given candidate models with similar predictive or explanatory power, the simplest model is likely to be the best choice (Occam's razor).
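One common way to trade goodness of fit against simplicity is the Akaike information criterion (AIC); the text above does not name a criterion, so its use here is an assumption of this example. The sketch compares a constant model with a straight-line model on simulated data and picks the one with the lower AIC. The data-generating values and all function names are hypothetical.

```python
import math
import random

random.seed(2)

# Hypothetical data: y depends linearly on x, plus Gaussian noise.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 0.5 * x + random.gauss(0.0, 1.0) for x in xs]

def rss_constant(y):
    """Residual sum of squares for the model y = b0 + error."""
    mean_y = sum(y) / len(y)
    return sum((yi - mean_y) ** 2 for yi in y)

def rss_linear(x, y):
    """Residual sum of squares for the least-squares line y = b0 + b1 * x + error."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

def gaussian_aic(rss, n, k):
    """AIC for a Gaussian-error model with k parameters, up to an additive
    constant that is identical for all models fitted to the same n points."""
    return n * math.log(rss / n) + 2 * k

n = len(xs)
candidates = {
    "constant": gaussian_aic(rss_constant(ys), n, k=2),  # b0, variance
    "linear": gaussian_aic(rss_linear(xs, ys), n, k=3),  # b0, b1, variance
}
best = min(candidates, key=candidates.get)
print(candidates, "->", best)
```

The `2 * k` term is what penalizes extra parameters: a more complex candidate wins only if it reduces the residual sum of squares enough to pay for them.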
Konishi & Kitagawa state, "Most statistical inference problems can be considered problems related to statistical modeling." Similarly, Cox said, "How the translation of the subject matter into the statistical model is done is often the most important part of the analysis."
Model selection can also refer to the problem of selecting a few representative models from a large set of computational models for decision or optimization purposes under uncertainty.
Graphical models
A graphical model, also called a probabilistic graphical model (PGM) or structured probabilistic model, is a probabilistic model in which a graph expresses the conditional dependence structure between random variables. Graphical models are commonly used in probability theory, statistics (especially Bayesian statistics), and machine learning.
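A small directed graphical model shows how the graph fixes the form of the joint distribution: each variable contributes one conditional probability given its parents. The three variables, the edges and every probability in the sketch below are hypothetical, chosen only to illustrate the factorization and a simple inference by enumeration.

```python
from itertools import product

# Hypothetical conditional probability tables for a three-node directed graph:
#   Rain -> Sprinkler, Rain -> WetGrass, Sprinkler -> WetGrass
P_RAIN = {True: 0.2, False: 0.8}
# P(sprinkler | rain), keyed first by rain, then by sprinkler
P_SPRINKLER_GIVEN_RAIN = {True: {True: 0.01, False: 0.99},
                          False: {True: 0.4, False: 0.6}}
# P(wet grass = True | sprinkler, rain), keyed by (sprinkler, rain)
P_WET_GIVEN = {(True, True): 0.99, (True, False): 0.9,
               (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    """Joint probability as the product of each node's conditional given its parents."""
    p_w = P_WET_GIVEN[(sprinkler, rain)]
    return (P_RAIN[rain]
            * P_SPRINKLER_GIVEN_RAIN[rain][sprinkler]
            * (p_w if wet else 1.0 - p_w))

states = [True, False]
total = sum(joint(r, s, w) for r, s, w in product(states, repeat=3))

# Inference by enumeration: P(rain | grass is wet).
p_wet = sum(joint(r, s, True) for r, s in product(states, repeat=2))
p_rain_and_wet = sum(joint(True, s, True) for s in states)
p_rain_given_wet = p_rain_and_wet / p_wet
print(f"P(rain | wet grass) = {p_rain_given_wet:.3f}")
```

Enumeration is only practical for tiny graphs like this one; the point is that the edges alone determine which conditional tables are needed.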
Econometric models
Econometric models are statistical models used in econometrics. An econometric model specifies the statistical relationships that are believed to hold between the various economic quantities pertaining to a particular economic phenomenon. An econometric model can be derived from a deterministic economic model by allowing for uncertainty, or from an economic model that is itself stochastic. However, it is also possible to use econometric models that are not tied to any particular economic theory.