
Tuesday, July 5, 2011

What is Modeling?

UPDATE OF THIS ARTICLE POSTED HERE  MAY 11 2013




What is modeling, anyhow? Good question. Generally, in the physical sciences and elsewhere, "Modeling" is a term having a specific meaning. Here is a simple definition.

Modeling is a procedure for numerical fitting and interpolation of existing sets of observational data by means of continuous functions having a collection of adjustable parameters. 

1.0 Models cannot predict anything in a causal sense.

The central aim of modeling is to provide a simplified analytic function or set of functions that match discrete data points and interpolate between them.  Models, therefore, do not predict anything in a causal sense.  Models simply generate sets of numbers that may be compared to sets of observations.

In this discussion, we view data as a collection of discrete points embedded in an abstract continuum parameter space. Independent variables might include time, physical location, incident solar radiation flux, etc. Dependent variables are the quantities that can be identified with the data themselves. Examples of dependent variables are local temperatures, or the non-thermodynamic quantity "global average temperature" we hear about.


A model is simply a function that maps independent variables to sets of numbers that may be compared to sets of observations, i.e. data sets.  


The model generates output. Model output consists of sets of values of the dependent variables. We say the observational data is "modeled" by the sets of numbers the model generates.


1.1 Models of physical systems need not contain any physics.
Instead they contain hidden variables and adjustable parameters.

Besides independent variables, models contain a set of hidden variables. Hidden variables are usually of two kinds: fixed parameters and adjustable parameters. They are used to formulate the functions that generate the output variables of the model.


Fixed parameters  come from underlying laws of physics or other solidly trusted sources. Their values are taken as given. 


Adjustable parameters are hidden variables whose values can be specified  arbitrarily.   For a given set of specified values of the adjustable parameters, a specific model is obtained. Different models are easily obtained by changing the values of the adjustable parameter set.   
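
To make this concrete, here is a minimal sketch in Python. Everything in it is an invention for illustration: the function form, the parameter names, the fixed "period" parameter, and the numbers are not taken from any real model.

```python
import numpy as np

# Fixed parameter: taken as given, not tuned by the modeler
# (here, the length of an assumed seasonal cycle, in years).
PERIOD = 1.0

def model(t, a, b, c):
    """Toy model: maps the independent variable t (time, in years)
    to a dependent variable (a temperature-like number).
    a, b, c are adjustable parameters; PERIOD is a fixed parameter.
    Nothing here is derived from physical law; the function simply
    generates numbers that can be compared to observations."""
    return a + b * t + c * np.sin(2.0 * np.pi * t / PERIOD)

# Different choices of the adjustable parameters give different models.
t = np.linspace(0.0, 10.0, 101)
output_1 = model(t, a=14.0, b=0.01, c=0.2)
output_2 = model(t, a=14.0, b=0.05, c=0.0)
```

Change a, b, or c and you get a different model; the fixed parameter stays put.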


Notice there is no requirement that models obey the laws of physics. Rather, models are sets of functions that generate numbers that may be compared to observational data sets.


Modelers try to optimize their models by judicious choice of the set of adjustable parameters, by removing unnecessary adjustable parameters, etc. 
How do we know when the model is optimized?  One way is to validate it by comparison to a data set.


1.2 What is a validated model?
To validate a model, the modeler first needs a data set of observations to model. This is necessarily a pre-existing set of observational data, sometimes called the base data set.


Here's how the validation process goes....


To validate the model, the modeler goes through a tweaking process where various values of the adjustable parameters are tested, and model outputs are compared to the base data set. The comparison is usually made quantitative by some "goodness of fit" measure. Goodness of fit is a number or set of numbers that measures how well the model output emulates the actual observed data. For example, the sum of mean square differences between model variables and the base data set could serve as a goodness of fit parameter. The smaller the better. This fitting procedure is usually done numerically, but can be done by eye in simple cases.
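
Here is a bare-bones sketch of that fitting step in Python. The data points are synthetic and the model is deliberately trivial; the goodness of fit is just the mean square difference between model output and the base data set.

```python
import numpy as np
from scipy.optimize import curve_fit

# Base data set: a pre-existing set of observations (synthetic numbers,
# purely for illustration).
t_obs = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_obs = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

def model(t, a, b):
    """A deliberately simple model with two adjustable parameters."""
    return a + b * t

def goodness_of_fit(a, b):
    """Mean square difference between model output and the base data.
    Smaller is better."""
    return np.mean((model(t_obs, a, b) - y_obs) ** 2)

# The numerical "tweaking": least-squares fitting picks the adjustable
# parameters that make the squared differences as small as possible.
best_params, _ = curve_fit(model, t_obs, y_obs)
print("fitted parameters:", best_params)
print("goodness of fit (mean square error):", goodness_of_fit(*best_params))
```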

So far, we have model output that is restricted to be "close to" existing data, because that data is what we are trying to fit. Such models are very useful for data analysis. It is nice to have continuous curves that fit discrete data points. If nothing else, it helps us visually examine data sets, spot trends, and gain intuition about the data. All great stuff.


Notice that the fit covers only a range of independent variables comparable to the range spanned by the base data set. The goodness of fit is evaluated in this restricted range. That's where the existing data is.

If you are given values of the population of California for each census year, you will have a time-dependent data set. However, you will have no data for the year 2040, so it is not possible to fit the model to the year 2040, and hence not possible to validate the model for that year. To make progress, we would fit the population model, say a straight line, to existing data. The range of the time variable would be restricted to the existing data.

Once a satisfactory set of values for the adjustable parameters has been found, the model may be considered validated within the range of the data set. Models are not considered valid outside their range of validation. 
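
A sketch of the census example in Python, using rounded, approximate census figures purely for illustration:

```python
import numpy as np

# Approximate California census populations, in millions (rounded values,
# used here only to illustrate the idea of a restricted fitting range).
years = np.array([1970.0, 1980.0, 1990.0, 2000.0, 2010.0])
pop   = np.array([20.0, 23.7, 29.8, 33.9, 37.3])

# Fit a straight line (two adjustable parameters) to the existing data.
slope, intercept = np.polyfit(years, pop, deg=1)

# The model is validated only on the range covered by the data.
validated_range = (years.min(), years.max())   # (1970, 2010)
print("validated range of the time variable:", validated_range)

# Evaluating the model at 2040 is an extrapolation outside the validated
# range; there is no data there to compare against.
print("extrapolated 2040 population (millions):", slope * 2040.0 + intercept)
```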


When models are used for extrapolation, the extrapolation must be re-validated as new data becomes available. In this way, past extrapolations can be invalidated and identified as such.


1.3 Model differential equations and pseudo-causality.
Modelers often spice up the mix by invoking sets of model differential equations that may be solved numerically to propagate the model into the future. Thus models may contain time-dependent differential equations whose derivatives emulate causal behavior.

Such model equations may have some physics in them, but inevitably they leave out important physical processes. Hence, they are not truly causal because they do not obey the causality of the underlying laws of physics. Such time-dependent models may be termed pseudo-causal to distinguish them from the fully causal laws of physics. More on causality later.
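
Here is a toy sketch of what pseudo-causal means in practice. It is entirely my own construction, not any actual climate model: a single differential equation with adjustable parameters, stepped forward in time with a crude Euler scheme. It has a time derivative and looks causal, but any physics left out of the equation is simply absent from the answer.

```python
import numpy as np

def toy_model(T_start, n_years, tau=5.0, forcing=0.02):
    """Toy pseudo-causal model:  dT/dt = -(T - T_start)/tau + forcing.
    tau and forcing are adjustable parameters, not physical constants.
    Forward-Euler time stepping propagates the model into the future,
    but any process not written into the equation is simply missing."""
    dt = 1.0                      # one year per step
    T = np.empty(n_years)
    T[0] = T_start
    for n in range(1, n_years):
        dTdt = -(T[n - 1] - T_start) / tau + forcing
        T[n] = T[n - 1] + dt * dTdt
    return T

trajectory = toy_model(T_start=14.0, n_years=100)
```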


Numerical models that solve truncated sets of fluid equations, such as General Circulation Models (GCMs), are examples of pseudo-causal models. Extrapolations of GCMs are not guaranteed to agree with future observations. Rather the opposite: all extrapolations must eventually diverge from future observations. These models are only approximately causal.


GCMs and other models require the same disclaimer as stock brokers:
   "Past performance is not a guarantee of future accuracy."

1.4 Can models provide "too good" a fit to the base data?


If a model has enough adjustable parameters it can fit any data set with great accuracy; recall John von Neumann's elephant. Excessively large sets of adjustable parameters produce deceptively pretty-looking data plots. In fact, it is considered bad practice to fit the data with too many parameters. Over-parameterized models have many problems: they tend to have more unreliable extrapolations, derivatives that fluctuate between data points, and rapidly growing instabilities.


Paradoxically, models that produce impressive agreement with base data sets tend to fail badly in extrapolation.


If the fit to the base data set is too good, it probably means the modeler has used too many adjustable parameters. A good modeler will find a minimal set of basis functions and a minimal set of adjustable parameters that skillfully fit the base data set to a reasonable accuracy and so minimize the amount of arbitrariness in the model. This will also tend to slow the rate of divergence upon extrapolation.
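
A quick illustration with synthetic data: fit the same noisy points with a 2-parameter straight line and a 10-parameter polynomial, then step outside the range of the base data. The over-parameterized fit hugs the data points and then typically misbehaves in extrapolation. (The numbers below are made up for the purpose.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic base data: a slow linear trend plus noise (illustration only).
x = np.linspace(0.0, 10.0, 12)
y = 1.0 + 0.3 * x + rng.normal(scale=0.2, size=x.size)

line   = np.polyfit(x, y, deg=1)   # minimal model: 2 adjustable parameters
wiggly = np.polyfit(x, y, deg=9)   # over-parameterized: 10 adjustable parameters

# Inside the data range, the degree-9 fit looks better (smaller residuals).
# Outside the data range, it typically runs away.
x_future = 15.0
print("straight line at x = 15:", np.polyval(line, x_future))
print("degree-9 fit  at x = 15:", np.polyval(wiggly, x_future))
```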


1.5 What are the basis functions of models?


Models make use of a set of basis functions. For example, the functions X, X^2, X^3, X^4, ... are divergent functions used in polynomial fits (fits that are nonlinear in X, though linear in the fitting coefficients). The problem is that such functions tend to +/- infinity in the limit of large values of the independent variable X, and do so more rapidly for higher powers of X. The basis functions are unbounded, and extrapolations always diverge.


One approach is to choose bounded functions for the basis set. The periodic functions {C, sin(X), cos(X), sin(2X), cos(2X), ...}, where C is the constant function, are an example of a set of bounded basis functions. At least extrapolations of bounded functions will not diverge to infinity. Comforting.
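
The contrast is easy to see by evaluating the two kinds of basis functions well outside a typical fitting range. This is only a sketch; the particular range is an arbitrary choice.

```python
import numpy as np

# Evaluate both basis sets far outside a typical fitting range of [0, 2*pi].
X = np.linspace(0.0, 100.0, 5)

# Unbounded polynomial basis: X, X^2, X^3, X^4 grow without limit.
poly_basis = np.column_stack([X, X**2, X**3, X**4])

# Bounded periodic basis: C (constant), sin(X), cos(X), sin(2X), cos(2X).
fourier_basis = np.column_stack(
    [np.ones_like(X), np.sin(X), np.cos(X), np.sin(2 * X), np.cos(2 * X)]
)

print("largest |polynomial basis value|:", np.abs(poly_basis).max())    # huge
print("largest |periodic basis value|:  ", np.abs(fourier_basis).max()) # <= 1
```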


1.6 Periodic phenomena make modelers look good.


 Many natural phenomena are periodic or approximately periodic. If a time series data set repeats itself on a regular basis then it can be modeled accurately with a small collection of periodic functions, sines and cosines. We do not have to solve the orbital dynamics equations in real time to predict with great accuracy that the sun will come up tomorrow.  


Complex systems may also display quasi-periodic behavior. So-called non-linear phenomena may repeat with a slowly changing frequency and amplitude.  Simple periodic models tend to do very well in extrapolation over multiple periods into the future. Moreover, periodic models do not diverge upon extrapolation. They simply assert that the future is going to be a repeat of the past. 
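
Here is a sketch of that point with a synthetic, roughly periodic signal: fit a constant plus one sine/cosine pair to two periods of data, then extrapolate several periods ahead. The prediction simply repeats the fitted cycle and never runs off to infinity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two periods of a roughly periodic signal (synthetic, for illustration).
t_fit = np.linspace(0.0, 2.0, 50)       # time measured in periods
y_fit = 10.0 + 2.0 * np.sin(2 * np.pi * t_fit) \
        + rng.normal(scale=0.1, size=t_fit.size)

def periodic_basis(t):
    """Bounded basis: constant, sin, cos with a one-period cycle."""
    return np.column_stack(
        [np.ones_like(t), np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)]
    )

# Linear least squares for the three adjustable coefficients.
coeffs, *_ = np.linalg.lstsq(periodic_basis(t_fit), y_fit, rcond=None)

# Extrapolate five periods into the "future": the model simply repeats
# the fitted cycle and never diverges.
t_future = np.linspace(2.0, 7.0, 100)
y_future = periodic_basis(t_future) @ coeffs
```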


When models extrapolate non-periodically, it's a red flag. Extrapolations of aperiodic (i.e. non-periodic) models are much more likely to be invalid, as discussed here.

1.7 The Climate can be Cooling and Warming at the Same Time. 
Climate, Weather, and Multiple Timescales.

When discussing climate and weather, it is very important to be specific as to the timescale of change. Earthly phenomena described as "Climate" and "Weather" take place over an astonishingly wide range of possible timescales. In general, we can be talking about minutes, hours, days, months, years, decades, centuries, millennia, tens of thousands of years, hundreds of thousands of years, millions of years, and longer.

For example, the Vostok ice core data discussed in a previous post provides evidence for periodic climate cycles on timescales of thousands of years up to hundreds of thousands of years, but little information on the hundred-year and shorter timescales, and little information about millions of years and longer. From the Vostok data it is clear that the earth is undergoing a warming cycle many thousands of years long, and in roughly 5000 years will begin a cooling cycle leading to another ice age.

Such cyclic phenomena on these long timescales are likely to repeat because they have done so in the past over many cycles for hundreds of thousands and millions of years.  One can reliably predict that the earth will begin a cooling cycle and a repeat of the ice age cycle in a few thousand years.

What about the timescales ranging from one year to one thousand years?  On these  timescales hourly variations of the weather and seasonal changes are averaged out, and one can look for trends and cycles having periods of a few years to a thousand years.   These timescales are the shortest timescales that can be treated as climate change timescales.  On these shorter timescales, the distinction between climate and weather becomes less obvious and more arbitrary.  

Because of this multiple timescale property of climate and weather, it is possible for the climate and weather to be warming on a shorter timescale and be cooling on a longer timescale. 

Paradoxically, it is entirely reasonable for the climate to be warming and cooling at the same time. More precisely, it is entirely possible for the climate to be cooling on the decade timescale and simultaneously warming on the thousand-year timescale, because decade-long cooling trends may average out over the thousand-year timescale.
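
One way to make this concrete, using a synthetic temperature series invented purely for illustration: compute the linear trend over windows of very different lengths. The same series can show a cooling trend over the last 30 years and a warming trend over the full thousand years.

```python
import numpy as np

# Synthetic "temperature" series: a slow thousand-year warming trend with
# decade-scale wiggles superimposed (illustrative numbers only).
years = np.arange(0, 1000)
temps = 0.001 * years + 0.3 * np.sin(2 * np.pi * years / 60.0)

def trend_per_century(t, y):
    """Least-squares linear trend over the window, per 100 years."""
    slope = np.polyfit(t, y, deg=1)[0]
    return 100.0 * slope

# With these numbers, the last 30 years show a cooling trend even though
# the full series is warming.
print("trend over the full 1000 years:", trend_per_century(years, temps))
print("trend over the last 30 years:  ", trend_per_century(years[-30:], temps[-30:]))
```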

There is much more that may be said about multiple timescale analysis of  weather-climate phenomena.  

For now, remember this: Climate-Weather changes on a hierarchy of timescales. 

It is meaningless to claim the climate is warming without clearly understanding the timescale of the phenomenon and where it fits into the larger hierarchy of climate timescales.


2.0 Extrapolation of models is inherently unreliable.
What about extrapolation? Often, modelers are asked to extrapolate their models beyond the validated range of independent variables, into the unknown future or elsewhere. These extrapolations are notoriously unreliable for several reasons, among them: (1) models do not obey causality; (2) they may not properly conserve invariants of the underlying physical system; (3) they are often mathematically unstable and exhibit divergent behavior in the limit of large values of the independent variable; and (4) the non-linear regression fits used in climate modeling are especially prone to instability. Such models would inevitably "predict" catastrophic values of the dependent variables as an artifact of their instability.


Of course, no actual predicting is going on in such models, merely extrapolation of the model beyond its validated domain.


2.1 What's the difference between models and simulations?
Often the distinctions between models and simulations may not be very important. Both might give us cool looking numerical output, including 3D movies. Cool, but is it real? That is, are we seeing just pretty pictures or does the display rigorously reproduce the full physics of a real system? 

Sometimes the distinctions between models and simulations are important. In the scientific community, two broad types of numerical computation are distinguished: models and simulations. So what's the difference? Both use computers, right? Yes, but....

The main difference is that simulations solve the fundamental equations of the physical system in a (more or less) rigorous fashion. Models, by contrast, do not have to meet this standard of rigor; they can be greatly simplified versions of the problem, or might not contain any real physics at all.

For example, one of the most widely used types of model involves fitting experimental data to sets of continuous functions. Curve fitting, linear regression, and non-linear regression are all techniques that generate models of the data by simply fitting existing data with adjustable functions. No physics needed at all. Just fitting. But often very useful.

So, models are open ended and can be more or less anything that accomplishes the purpose.  

Models can be seductive. "They look so real" but models cannot be as real as real reality(!)  

This brings us to the issue of causality. It can be said that models, as a class, do not obey the causality implicit in the complete fundamental physics equations of the system. This limitation is important to recognize.

2.2 Simulations obey causality, models do not.
If a model were to include the real physics of the complete system, it would be a simulation, not a model. Simulations obey causality. Simulations usually consist of sets of time-dependent coupled partial differential equations (PDEs), subject to realistic boundary and initial conditions. Simulations are numerically solvable, rigorous formulations of underlying physical laws.


Here's an example of a simulation.
Simulations are often used to examine the evolution of temperature in fluid systems. If the temperature is non-uniform, then the system as a whole is not in true thermodynamic equilibrium. However, fluids very often satisfy the requirements for local thermodynamic equilibrium. This simply means that a local temperature can be defined in the medium. This temperature is represented by a scalar field that varies continuously with location and time.

Such systems will exhibit thermal transport, a characteristic of atmospheres and oceans. Often problems of thermal transport can be well described by relatively simple sets of fully causal partial differential equations. 

If robust numerical solvers exist then the complete equations can be solved very accurately by a simulation code. The output of the simulation code would then reliably predict the time evolution of a real system. That is, a good simulation will predict the future of the system. 

Of course, care must be taken that the numerical tools give us the right answer. As long as the solver is accurate, the simulation is guaranteed to follow the same causal physics as the real system. The output of a good simulation code is like a numerical experiment. It mirrors reality, including the future (if done right).
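
For the curious, here is a deliberately tiny example in the spirit of the above, far simpler than any atmospheric or ocean code: an explicit finite-difference solver for the one-dimensional heat equation, stepping a fully causal PDE forward in time from an initial condition. Even here, the time step must satisfy a stability condition or the numerics give the wrong answer.

```python
import numpy as np

def simulate_heat_1d(T_initial, alpha, dx, dt, steps):
    """Explicit finite-difference solver for the 1D heat equation
        dT/dt = alpha * d2T/dx2
    with fixed-temperature (Dirichlet) boundary conditions.
    The scheme is stable only if alpha*dt/dx**2 <= 0.5, an example of
    the care needed to make the numerics give the right answer."""
    r = alpha * dt / dx**2
    assert r <= 0.5, "time step too large for the explicit scheme"
    T = T_initial.astype(float)
    for _ in range(steps):
        T[1:-1] = T[1:-1] + r * (T[2:] - 2.0 * T[1:-1] + T[:-2])
    return T

# Initial condition: a hot region in the middle of a cold bar.
T0 = np.zeros(101)
T0[45:56] = 100.0
T_later = simulate_heat_1d(T0, alpha=1.0e-4, dx=0.01, dt=0.4, steps=1000)
```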


2.3 Subtle aspects of causality in physics lie beyond the scope of this discussion. But it's very interesting so... a few highlights.


In practice, most simulation codes solve formulations of the fluid equations and related field equations of classical physics.  In these cases the simple classical definition of causality is obeyed. 


Quantum mechanics experts know that quantum mechanical systems have a probabilistic nature. When quantum effects are important, some aspects of causality are lost.  However, even in quantum systems, the fundamental probability amplitudes, or wave functions of quantum theory, themselves obey differential equations that "propagate" these functions forward in time in a causal manner.  Roughly speaking, the wave functions evolve continuously and causally in time such that the statistical properties of quantum systems, expectation values of observable single and multi-particle operators, revert to classical causality in the limit of "large quantum numbers."  


Even classical systems can exhibit stochastic or chaotic behavior in some situations, the so-called butterfly effect being a familiar example. The task of simulating many-particle systems subject to stochastic or chaotic behavior is challenging. However, for the important case of many-particle systems having sufficiently many degrees of freedom, chaotic effects often tend to be washed out by other effects. Perhaps this is an oversimplification.


A related and absolutely fascinating phenomenon of continuous fluid systems is the possibility of self-organization. The microscopic behavior of self-organizing systems can conspire to generate large-scale organized flows. The jet stream in the earth's atmosphere is an example of such an organized flow, sometimes called a zonal flow. The jet stream is a vast high-speed wind current in the upper atmosphere that can persist and move around as an organized entity. The color bands in Jupiter's atmosphere and the Great Red Spot appear to be such zonal flows. Simulating the formation and evolution of such large-scale organized flows is a challenging problem addressed in various atmospheric and oceanic simulation codes. Amazing stuff.


Now we are getting into specialized stuff that is way beyond the scope of this brief discussion. For more on this, consult the extensive popular literature.  


Now let's summarize our conclusions about models,  modeling, and the inherent unreliability of extrapolation. 


2.4 Summary and Conclusions about Models.


In most fields of physics, models are considered useful tools for data analysis, but their known limitations and range of validity are widely appreciated. There are just too many ways for extrapolations of models to go wrong. 


Models do not obey causality nor can they properly "predict" anything in the causal sense. Models provide sets of numbers that can be compared to sets of observational data. 


Models are not simulations. A model may contain none of the physics or some of the physics, but never all of the physics of the system.


Extrapolation of a model inevitably takes the model outside its validated domain. When extrapolation is necessary, it must be done conservatively and cautiously. Further, extrapolations must be validated against new data as it becomes available. Conservative extrapolations are more likely to be validated by future observations.


3.0 Is the methodology of climate modeling inherently unreliable?

Now that we are familiar with the inherent limitations of models in general, an important question can be asked about the methodology of climate modeling. Are climate models being extrapolated beyond their domain of validity? It certainly seems to be the case: climate model extrapolations are often found to be in disagreement with new data that does not fit the extrapolated model. There is extensive literature available on this subject.

We are concerned with a more fundamental issue. It seems non-causality is a property of the methodology of climate modeling. Climate models don't contain all of the relevant physics. In a fundamental sense, such models cannot reliably predict the future of the real climate.  


We can also observe that it is a mistake to give weight to inherently unreliable extrapolations of climate models. Especially troubling are extrapolations of such models beyond the known range of their mathematical validity.

Of course, most everyone in the hard sciences knows all of this. So my question might be reformulated as: 


Why are extrapolations of climate models given weight, when the methodology is known to be inherently unreliable in extrapolation? 


Models are not infallible, and climate models are no exception. Models are known to be unreliable when extrapolated beyond their validated range.

Maybe that's enough for the moment. Responses welcome. A little dialog is a good thing, but let's keep it on the top two levels of the Graham hierarchy.