Validation of Computer Simulations from a Kuhnian Perspective
|2 Kuhn's philosophy of science|
|3 A revolution, but not a Kuhnian revolution: Computer simulations in science|
|4 Validation of Simulations from a Kuhnian perspective|
|4.1 Do computer simulations require a new paradigm of validation?|
|4.2 Validation of simulations and the Duhem-Quine-thesis|
|4.3 Validation of social simulations|
|4.3.1 Where social simulations differ|
|4.3.2 Are social simulations still in a pre-scientific stage?|
|5 Summary and Conclusions|
To the outside observer, one of the most surprising features of the field of social simulations is the widespread absence of empirical validation, sometimes combined with a certain unwillingness to see this as a problem.
In a meta-study on agent-based-modeling (ABM), a very important sub-discipline of social simulations, Heath et al. (2009) find that the models in 65% of the surveyed articles have not been properly validated, which they consider “a practice that is not acceptable in other sciences and should no longer be acceptable in ABM practice and in publications associated with ABM” (4.11). While some of these unvalidated simulations can serve a purpose as thought experiments that capture some relevant connection in an idealized and simplified form (Reutlinger et al. 2017), many of them are merely follow-ups to existing simulations and bear little relevance of their own. The practice of publishing simulations without empirical validation and with seemingly little (additional) theoretical relevance is so widespread that it has been termed the YAAWN-Syndrome, where YAAWN stands for “Yet Another Agent-Based Model ... Whatever ... Nevermind” (O'Sullivan et al. 2016). The fact that such a term has been coined is an indication that the ABM-community is growing weary of unvalidated or otherwise uninteresting simulations. Thus, the situation may change in the future. For the time being, lack of validation is still a problem.
To be sure, agent-based-modeling is a broad field. On the one hand, there are very theoretical simulations that set out from abstract concepts without any particular application case in mind. On the other hand, there are simulations that are related to a particular empirical setting right from the start. The latter kind of simulation is typically found in corporate or political consulting. I am going to look at the theoretical simulations first and then consider the more applied kinds of simulations later.
Naturally, unvalidated simulations are much more prevalent among the theoretical simulations, where the lack of empirical validation is sometimes not even perceived as a problem. This may be illustrated by a quotation from an interview with a philosopher who has produced models of opinion dynamics (Hegselmann/Krause 2002) that have frequently been cited in other modelling-studies but that have not been empirically validated:
None of the models has so far been confirmed in psychological experiments. Should one really be completely indifferent about that? Rainer Hegselmann becomes almost a bit embarrassed by the question. “You know: In the back of my head is the idea that a certain sort of laboratory experiments does not help us along at all.” (Groetker 2005, p. 2)
But if laboratory experiments do not help us along, how can models that have never been confirmed empirically, either by laboratory experiments or by field research, help us along? This lack of interest in empirical research is all the more surprising as opinion dynamics concern a field with an abundance of empirical research. Naively, one should assume that scientists have a natural interest in finding out whether the hypotheses, models and theories they produce reflect empirical reality. That this is obviously not always the case confirms Kuhn's view that the criteria by which scientific research is judged are also set by the paradigm that guides the thinking of the researchers and that there is no such thing as a “natural” scientific method independent of paradigms. However, even Kuhn's mild relativism would rule out science without any form of empirical validation as unrewarding.
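For readers unfamiliar with the kind of model at issue, the following is a minimal sketch of a bounded-confidence opinion dynamics model in the style of Hegselmann/Krause (2002), as it is commonly presented: in each round every agent adopts the average opinion of all agents whose opinions lie within a confidence bound of its own. The function names, the convergence tolerance and the parameter values are my own assumptions for illustration and are not taken from the original publication.

```python
import numpy as np

def hk_step(opinions: np.ndarray, eps: float) -> np.ndarray:
    """One synchronous update: each agent moves to the mean opinion of
    all agents (itself included) whose opinion is within eps of its own."""
    # Pairwise absolute differences between all opinions.
    diff = np.abs(opinions[:, None] - opinions[None, :])
    close = diff <= eps  # close[i, j] is True if agent j influences agent i
    return (close @ opinions) / close.sum(axis=1)

def run_hk(opinions, eps, max_steps=1000, tol=1e-9):
    """Iterate until opinions stop changing (hypothetical stopping rule).
    Returns the final opinions and the number of steps performed."""
    x = np.asarray(opinions, dtype=float)
    for step in range(max_steps):
        new_x = hk_step(x, eps)
        if np.max(np.abs(new_x - x)) < tol:
            return new_x, step
        x = new_x
    return x, max_steps

def count_clusters(opinions, gap=1e-6):
    """Count groups of (numerically) identical final opinions."""
    s = np.sort(np.asarray(opinions, dtype=float))
    return 1 + int(np.sum(np.diff(s) > gap))

if __name__ == "__main__":
    rng = np.random.default_rng(0)       # arbitrary seed, for repeatability only
    initial = rng.uniform(0.0, 1.0, 50)  # 50 agents, opinions in [0, 1]
    for eps in (0.05, 0.15, 0.30):       # illustrative confidence bounds
        final, steps = run_hk(initial, eps)
        print(f"eps={eps:.2f}: {count_clusters(final)} cluster(s) after {steps} steps")
```

In models of this type, a small confidence bound tends to leave the population fragmented into several opinion clusters, while a large one tends to produce consensus. The sketch also makes the point of the preceding paragraph tangible: nothing in the model refers to measured data, so whether real opinion formation works this way remains an open empirical question.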
The lack of empirical concern within the field of social simulations can furthermore be attributed to another working mechanism of paradigms that Kuhn identified, namely, the role of exemplars. As mentioned earlier, according to Kuhn scientific practice is not guided by the abstract rules of a logic of scientific discovery. Instead, scientists follow role models or exemplars of good scientific practice.
Some very influential role models in the field of social simulations are simulations that have never successfully been validated. The just mentioned opinion-dynamics simulation by Hegselmann and Krause is one example of this kind of role model. But arguably the most famous unvalidated model that serves as an exemplar in Kuhn's sense is Robert Axelrod's “Evolution of Cooperation” (Axelrod 1984). Despite the fact that the reiterated Prisoner's Dilemma simulations that Axelrod used as a model for the evolution of cooperation had turned out to be a complete empirical failure by the mid-1990s (Dugatkin 1997), and despite the devastating criticism Axelrod's approach had received from theoretical game theory (Binmore 1994, Binmore 1998), it continues to be passed down as a role model of social simulations to this day. In a 2010 article in the prestigious journal Science, which employed a research design similar to Axelrod's, it is mentioned as a role model that has been “widely credited with invigorating the field” (Rendell et al. 2010a, 2008f.). And one can easily find recent studies (Phelps 2016) that naively pick up Axelrod's study as if no discussions concerning its robustness, its empirical validity or its theoretical scope had ever taken place in the meantime. If simulation research designs without proper validation such as Axelrod's continue to be treated as exemplars, it is no surprise that many social simulations lack proper validation.
Now, there are two caveats: Firstly, in some cases unvalidated simulations can serve a useful scientific function, among other things as thought experiments. One usually does not require empirical validation of a thought experiment. Thus, if Axelrod's evolution of cooperation or Hegselmann and Krause's opinion dynamics could be considered thought experiments, their status as role models in connection with their lack of empirical validation could not be taken as an indication that social simulations still remain in a pre-scientific stage. However, both these simulations functioned as role models not by their (potential) use as thought experiments, but as a research programme. Indeed, it would be hard to justify the literally dozens if not hundreds of follow-up simulations to Hegselmann-Krause or Axelrod as thought experiments without invalidating the category of a thought experiment as a useful scientific procedure. But it has to be kept in mind that not every kind of unvalidated simulation is an indication of pre-scientific fiddling about.
Secondly, and more importantly, not all simulation traditions have, of course, remained as disconnected from empirical research as Axelrod's Evolution of Cooperation and Hegselmann and Krause's opinion dynamics simulations. One example is the Garbage-Can-Model (GCM) by Cohen et al. (1972), which describes decision making inside organizations with a four-component model, taking “problems”, “solutions”, “participants” and “opportunities” into account. This model is highly stylized and, because of this, would be difficult to validate directly. Nevertheless, it is frequently referred to in studies on organizational decision making, including empirical studies.
But why, one may ask, could the connection to empirical research, or more generally, other kinds of research on organizational decision making be established in this case while it failed in the aforementioned cases? There are several possible reasons:
can help to establish the basic plausibility of the model, if the simulation itself and its results are plausible in view of the prior knowledge about the simulated process. In the case of the GCM the model establishes the connection between a certain structure of the decision making process and certain characteristics of the outcome, like how efficiently problems will be solved. In a verbal description this connection can be maintained, but not be demonstrated. A simulation can show that such a connection exists, even if only within the model.
In view of the possible functions of communication and hypotheses-generation, one can argue that models like the Garbage Can Model can be useful in the context of empirical research even without being empirically validated themselves. Still, the question remains what characteristics a model of this kind must have to be considered useful or suitable, or how one can tell a good model from a bad model. There seems to exist an intuitive understanding within the scientific communities habitually using these models, but it is hard to find any explicit criteria. This strengthens the impression that a paradigm of validation is not yet in place, at least not for the more theoretical simulations.
What about applied simulations, though? Agent-based-models are, among other things, used to give advice about particular policy measures, like introducing a new pension plan (Harding et al. 2010) or determining the best procedures for research funding (Ahrweiler/Gilbert 2015). Obviously, validation is of considerable importance if simulations are used for political consulting. So, how do scientists who apply social simulations get around the restriction that the simulation results often cannot directly be compared with measurable empirical data? In particular, how can simulations be validated that are meant to evaluate the possible consequences of policy measures that might never be implemented?
In their discussion of the validation of the SKIN-model, which simulates knowledge dynamics in innovation networks, Ahrweiler/Gilbert (2015, section 1.1.2) do not even assume that there exist objective observations independent of a concrete research goal or question. At least for the sake of the argument they even accept the view that the observation of a social process is a construct of this process or “what you observe as the real world” (Ahrweiler/Gilbert 2015, section 1.2), just like the simulation of the same process is another construct of this process. However, since the authority over what is observed as the real world lies with the “user community” (Ahrweiler/Gilbert 2015, section 1.3), the output of a simulation can meaningfully be compared with the observations.
Since the construction of the simulation as described by Ahrweiler/Gilbert (2015, section 2.4) is a process in which the user community is deeply involved, it is tempting to raise the question of how unbiased this kind of validation really is. After all, an administration assigning the task of examining the potential for enhancement of its administrative procedures to a team of simulation scientists might be more interested in the vindication of certain administrative procedures than in their unbiased assessment. However, the “user community view” as described by Ahrweiler/Gilbert (2015) depicts only the outline of the construction and validation process of applied agent-based-models. A more detailed analysis of the validation of applied agent-based-models as provided by Harding et al. (2010) reveals that there exists a whole array of validation procedures which, if executed properly, limits the risk of producing biased or arbitrary results. For the Australian Population and Policy Simulation Model Harding et al. (2010) report, among other measures: i) the calibration and benchmarking of the simulation with available cross-sectional and longitudinal data, ii) the comparison of the simulation model's projection with that of other models, iii) the modular structure and separate evaluation of each module, iv) the examination of whether both the individual agents' simulated life histories and the summary statistics yield reasonable results. The impact of proposed policy measures as revealed by the simulation cannot, by its very nature, be compared with empirical data beforehand. However, one can contend that in the context of policy advice a simulation is sufficiently validated if it leads to policy decisions that are better grounded than they would be without running a simulation model.
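To make the first of these measures more concrete, the following is a minimal sketch of what a benchmarking check might look like: summary statistics produced by a simulation are compared with externally observed values, and deviations beyond a chosen tolerance are flagged. This is not the procedure used by Harding et al. (2010), whose implementation is not described here. The statistic names, the numbers and the 5% tolerance are invented placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class BenchmarkResult:
    name: str               # name of the summary statistic
    simulated: float        # value produced by the simulation
    observed: float         # value taken from external data
    rel_error: float        # |simulated - observed| / |observed|
    within_tolerance: bool  # True if rel_error <= tolerance

def benchmark(simulated: Dict[str, float],
              observed: Dict[str, float],
              tolerance: float = 0.05) -> List[BenchmarkResult]:
    """Compare each observed benchmark with the simulated value of the
    same name and flag relative deviations above the tolerance."""
    results = []
    for name, obs in observed.items():
        if obs == 0:
            raise ValueError(f"relative error undefined for zero benchmark: {name}")
        sim = simulated[name]
        rel_error = abs(sim - obs) / abs(obs)
        results.append(BenchmarkResult(name, sim, obs, rel_error,
                                       rel_error <= tolerance))
    return results

if __name__ == "__main__":
    # Invented placeholder values, NOT real data from any model or survey.
    simulated_stats = {"employment_rate": 0.62, "mean_household_size": 2.9}
    observed_stats = {"employment_rate": 0.60, "mean_household_size": 2.5}
    for r in benchmark(simulated_stats, observed_stats):
        status = "ok" if r.within_tolerance else "CHECK"
        print(f"{r.name}: simulated={r.simulated}, observed={r.observed}, "
              f"relative error={r.rel_error:.3f} [{status}]")
```

A check of this kind only establishes that the simulation reproduces known aggregates. It does not by itself show that the simulated mechanisms are the right ones, which is why it is combined with the other measures listed above.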
Where does this leave us? Are social simulations still in a pre-scientific stage with respect to their validation? On the one hand, there is a widespread lack of proper validation and the impression that the increasing number of published agent-based models does not necessarily pay off in terms of further deepening our understanding of the simulated processes. While other quality issues of agent-based models, such as their reproducibility and mutual comparability, have been addressed in recent years, there is still no common understanding concerning how agent-based models should be validated. So far, the textbooks on agent-based simulations have little to say about validation. With the central issue of validation still being unresolved, the field of social simulations does not yet seem to have matured into a normal science in the sense of Kuhn. Put positively, the situation can be described as a phase of humble beginnings in the sense of the interpretation of Feyerabend's anarchic epistemology that was given earlier.
On the other hand, scientists who apply agent-based-models to particular empirical processes typically invest considerable time and effort into the validation of their simulations and employ a diverse set of validation procedures to ensure the credibility of their simulations. So, we might indeed be witnessing a paradigm of validation of applied agent-based-models in the making. It is, so far, only in the making, because the various validation procedures and criteria used by the practitioners do not yet seem to have been consolidated to a degree where they become textbook knowledge.
 I am indebted to Julian Newman for pointing out to me the excellent paper by Northcott/Alexandrova (2015) on the Prisoner's Dilemma. It contains the best analysis so far of why Axelrod's reinterpretation of truces in WWI in terms of the Prisoner's Dilemma ultimately fails. And because the authors were obviously not aware of my own research on the topic, I consider it an independent confirmation of my own critical conclusions regarding Axelrod's chapter on WWI (Arnold 2008, ch. 5.2.2).
 This seems to be the standard case for applying the GCM in organizational science. See Fardal/Sornes (2008) and Delgoshaei (2013) for example. It will be interesting to see whether the more refined simulation models of the GCM that have been published more recently (Fioretti 2008) will bring about an increased use of simulation models in applied studies referring to the GCM or not.
 This is precisely where Axelrod's simulations were lacking, because a) his tournament of reiterated Prisoner's Dilemmas is too far removed from the phenomenology of either animal or human interaction to be prima facie plausible, and b) his results were, unbeknownst to him, highly volatile with respect to the simulation setup and thus also lack plausibility.
 They discuss this under the heading of “theory-ladenness of observations”, though their examples suggest that the issue at stake is different interpretations of observations, or a focus on different observations depending on the research question, rather than different observations due to a different theoretical background.
 A most notable initiative in this respect has been the introduction of the ODD Protocol for the standardized description of agent-based-models (Railsback/Grimm 2012).