census data

This forum contains all archives from the SD Mailing list (go to http://www.systemdynamics.org/forum/ for more information). This is here as a read-only resource, please post any SD related questions to the SD Discussion forum.
Locked
Bob Eberlein
Member
Posts: 49
Joined: Fri Mar 29, 2002 3:39 am

census data

Post by Bob Eberlein »

The project Geoff has suggested is actually something that is done
frequently, as a matter of course, by demographers. It is, in fact, the
technique by which both current population estimates and population
projections are arrived at. It is also something done frequently by
people, such as myself, building dynamic models of other things.

Clearly, when you are building such a model you want to compare the state
of the model with data at an appropriately matched time. If you are
looking at 5 sets of census data over 50 years, you need to be using the
model results at the times the censuses were taken.
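
As a small illustrative sketch of that matching (the census years, counts,
and model trajectory below are all invented, not taken from any real
study), in Python:

import numpy as np

census_years  = np.array([1951, 1961, 1971, 1981, 1991])
census_counts = np.array([7.1, 8.0, 9.2, 10.1, 11.5])      # millions, invented

sim_years = np.arange(1951, 1992, 0.25)                     # quarterly model output
sim_pop   = 7.0 * np.exp(0.012 * (sim_years - 1951))        # stand-in model run

# sample the model at the census dates, not the census at model dates
model_at_census = np.interp(census_years, sim_years, sim_pop)
errors = 100 * (model_at_census - census_counts) / census_counts
for year, err in zip(census_years, errors):
    print(f"{year}: model vs census {err:+.1f}%")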

Geoff is absolutely right that taking a census is a highly error-prone
activity. Model building is also highly error prone. Fortunately,
those errors are not simply additive. The model and the data can each be
used to improve the quality of the other. There is, in fact, a
wonderful story about census counts and a model as applied, I think, to
Kenya. There had been two censuses taken some number of years apart
(more than 10 if I recall correctly), and a model built to bridge
between them. The model simply could not do that, and a little digging
by the demographers demonstrated how poor the early census results had
been. Actually, it demonstrated it to my satisfaction, but many people
simply refuse to believe anything but data measured by generally
accepted sampling practices, and so there was little consensus on the
census.

As to the degree of accuracy required in models, demographics is one of the
few areas in which it is easy to see which representation is more accurate.
People do indeed age year by year, and it is very easy to see echoes in
population structures. All of this is lost with an aggregate representation,
and even with a yearly aging chain (as opposed to discrete cohort shifting)
you get significant spreading (that is, a birth surge in 1900 shows up as
extra 22-year-olds as early as 1920).
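
A minimal sketch of that spreading, with invented numbers rather than
anything from an actual model: push a one-year birth surge through discrete
cohort shifting and through a 100-stage first-order aging chain, and compare
the "22 year olds" 20 years later.

import numpy as np

AGES = 100          # assume no one lives past 100
YEARS = 20
births = np.zeros(YEARS)
births[0] = 1000.0  # hypothetical birth surge in year 0 (think 1900)

cohort = np.zeros(AGES)   # discrete cohort shifting: exact ages
chain = np.zeros(AGES)    # aging chain: 100 first-order stages, 1 yr each
dt = 0.25                 # Euler step for the aging chain

for year in range(YEARS):
    # cohort shifting: everyone moves up exactly one age
    cohort[1:] = cohort[:-1]
    cohort[0] = births[year]
    # aging chain: outflow of each stage = stock / 1 year
    for _ in range(int(1 / dt)):
        inflow = np.empty(AGES)
        inflow[0] = births[year]
        inflow[1:] = chain[:-1]
        chain += dt * (inflow - chain)

print("People at age 22, 20 years after the surge:")
print("  cohort shifting:", round(float(cohort[22]), 1))   # exactly zero
print("  aging chain:    ", round(float(chain[22]), 1))    # already some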

Given the speed of modern computers and the ability to put things into a
relatively compact notation, having 100 population cohorts by sex is
often sensible. I would say this with the caveats that, first, you do
not want to become so mired in detail that you are unable to
actually spend time on the problem you are trying to work on and, second,
that advanced statistical and analytical techniques such as Kalman
filtering will not work with these beasts.

Bob Eberlein
bob@vensim.com
"Jaideep Mukherjee"
Junior Member
Posts: 15
Joined: Fri Mar 29, 2002 3:39 am

census data

Post by "Jaideep Mukherjee" »

Prof. Coyle points out that there are a number of problems with using census
data. Agreed, but is there a better alternative - surely it cannot be
inventing numbers out of thin air. My belief is that, even if census data is
lousy, knowing where it is lousy can make it much more valuable than any
other numbers in its place. If there were a better alternative, I'd use it.
The only alternatives available right now are at least as bad as census
data. Prof. Coyle lists the many limitations - what we can and should do
is get around those limitations with clever ideas (human ingenuity, better
math, trends, expert opinions applied to "modify" the census data, rather
than throwing all census data out).

There is another interesting point raised regarding how "loading dynamic
models with numerical data is often pointless,
gives false impressions of accuracy and is technically difficult in models
involving delays". I have a story to relate here: while working on a
model of the Houston wastewater system, we needed some idea of
Houston daily rainfall. The following is a typical exchange
(it has been repeated in other consultancy situations too, so it is not an
isolated event):

As usual, my first-cut approach was to ask "What is the rainfall in
Houston?"

City engineers (CE): "It changes all the time! Can't give you that"

Me: "Give me a rough figure - give me an average"

CE: "There is no average"

Me: "Cmon, there has to be an average; any set of real numbers has an
average - give me something to start with"

CE: "There is no average; it is all too spread out".

Me: "OK - then give me the min-max values and standard deviation...we may
try some simulations with these different numbers"

etc. etc.

Finally I ended up including 10 years of daily rainfall data (an Excel file,
built from NOAA data, attached to a Powersim model - also available on my
website) that is an exact replica of the past rainfall
stressors/inputs to the system, and the engineers had no problem with this
approach. It was closer to reality, it reflected the causes that are partly
responsible for the sanitary sewer overflow problems, and it is not
hand-waving. I don't have any problem with this approach either - and I
think it is a good example of how a dynamic model may be linked with
numerical data. (Delays are not relevant here.) There would be a problem
with this approach if we somehow did not expect future rainfall
to be similar to the past data. For example, this approach would be
problematic if global warming were so severe and so rapidly approaching that
past data could not be used as a proxy for future rainfall inputs. Again, in
that case too, instead of inventing numbers out of thin air or using some
half-baked global warming theory, it may be better to use past data and
modify it with some clever ideas (I will leave it to the clever PhD
students to figure out those clever ideas :-))
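
A rough sketch of the general idea, with an invented file name and made-up
capacity figures rather than the actual Houston model: read a historical
daily-rainfall record and use it directly as the exogenous driver of a
simple wastewater stock.

import csv

# hypothetical file: one row per day, columns date, rainfall_mm
rainfall = []
with open("houston_daily_rainfall.csv") as f:
    for row in csv.DictReader(f):
        rainfall.append(float(row["rainfall_mm"]))

capacity_mm   = 60.0    # assumed daily treatment/conveyance capacity
storage_mm    = 0.0     # water held in the system (rainfall-equivalent mm)
overflow_days = 0

for rain in rainfall:
    storage_mm += rain                  # inflow driven by the historical record
    treated = min(storage_mm, capacity_mm)
    storage_mm -= treated
    if storage_mm > 0:                  # anything left untreated spills
        overflow_days += 1
        storage_mm = 0.0                # assume the overflow clears the backlog

print(f"Overflow days over the record: {overflow_days} of {len(rainfall)}")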

Bottom line: in some cases numerical data may be used with dynamic models,
in some cases it may not be. The point is to be aware of where things may go
wrong with either approach (Prof. Coyle lists some of these things). In very
fast-moving or uncertain environments it is not a good approach - using past
prices to predict future prices in a technologically very volatile area, for
example, is a BAD IDEA (can anyone tell me what the average consumer prices
of web hosting, long-distance telephone, or cable TV will be one year from
now?!). The LTG modelers misjudged the huge impact of technology, and thus
indirectly of prices, in the World3 model, and even the economists are
daily surprised at the impact of high technology today (on the directions
and impacts of prices). If the system is not changing too fast or too
unpredictably (I'd include a population model in this, barring natural
disasters or massacres), then numerical data, or statistics derived
from such data, are the only way to go, unless there is a better
alternative. And if there is a better alternative, then I wonder why anyone
bothers to collect census data at all?

Enough rambling - and thanks to Prof. Coyle for his thought-provoking
comments,

Best regards

Jaideep
jaideep@optimlator.com
http://www.optimlator.com/
Jim Thompson
Junior Member
Posts: 2
Joined: Fri Mar 29, 2002 3:39 am

census data

Post by Jim Thompson »

Bob Eberlein wrote:
"There is, in fact, a wonderful story about census counts and model as
applied, I think, to Kenya. There had been two censuses taken some number
of years apart (more than 10 if I recall correctly), and a model built to
bridge between them. The model simply could not do that and a little
digging by the demographers demonstrated how poor the early census results
had been. Actually it demonstrated it to my satisfaction, but many people
simply refuse to believe anything but data measured by generally accepted
sampling practices and so there was little consensus on the census."

Bob probably was referring to events in Nigeria.

According to the Center for International Research at the U.S. Census
Bureau, the 1991 census figure for Nigeria was 88.5 million people, versus
projections for that year of about 120 million prepared by the UN, the
World Bank and the U.S. Census Bureau. These organizations had each relied
on data from the 1963 Nigerian census as the starting point for those
projections.

According to Timothy Fowler of the U.S. Census Bureau Population Studies
Branch, Nigeria had taken counts in 1953, 1963, and 1973 before the count
made in 1991. Frank Hobbs at the U.S. Census Bureau Center for
International Research wrote in 1992 that the 1973 census was ignored for
the most part by the census-taking community because there was "considerable
documentation that the [1973 census] levels were inflated due to political
factors." This documentation included official recognition by the Nigerian
census-takers that 1973 data were inflated to match the output of population
models used by international aid organizations.

The U.S. Census Bureau explored the question of "missing Nigerians" by
running models based on 1953 census data and on 1963 data and comparing the
output from those runs with the 1973 and 1991 counts and a provisional count
made in 1987. Their work explored birth, death and emigration rates as
possible explanations for the differences between model expectations and the
actual 1991 count. They concluded that the 1963 census overstated the
population. Separately, the World Bank and the U.N. Population Branch
concluded the same. In essence, all three used large discrepancies in
expected age and gender cohorts by year of birth to support their
conclusions. Further, each organization indicated that the 1953 census was
significantly more reliable than the 1963 census, principally because model
outputs for 1991 were only about 2.5% different from the actual 1991 count
when 1953 cohort data were used as the starting point.
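
A schematic cohort-component projection of the kind being compared here,
with invented survival and fertility assumptions rather than the Census
Bureau's actual inputs: carry a base-year count forward and compare it with
a later head count.

import numpy as np

AGES = 100
base_year, target_year = 1963, 1991

# hypothetical base-year population by single year of age (thousands)
pop = np.full(AGES, 550.0)

# assumed age-specific survival and fertility (flat, purely illustrative)
survival  = np.full(AGES, 0.985)        # probability of surviving one year
fertility = np.zeros(AGES)
fertility[15:50] = 0.1                  # births per person per year, assumed

for _ in range(target_year - base_year):
    births = float(np.sum(pop * fertility))
    pop[1:] = pop[:-1] * survival[:-1]   # age everyone by one year
    pop[0]  = births                     # new cohort enters at age 0

projected_total = pop.sum()
actual_count    = 88_500.0               # 1991 count from the text, thousands
gap = 100 * (projected_total - actual_count) / actual_count
print(f"Projected {target_year} total: {projected_total:,.0f}k "
      f"vs counted {actual_count:,.0f}k ({gap:+.1f}%)")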

For a first reference on how the U.S. Census Bureau prepares estimates and
projections, look at Appendix B of "World Population Profile: 1996", which
can be downloaded from <http://www.census.gov/ipc/www/wp96.html>.

Last, Prof. Coyle's comments remind me of the old joke: "Who're you gonna
believe: me or your lying eyes?"

Jim Thompson

--
Global Prospectus LLC
http://www.GlobalProspectus.com
jthompson@globalprospectus.com
Phone: +1 860 676 8152
"Jim Hines"
Senior Member
Posts: 88
Joined: Fri Mar 29, 2002 3:39 am

census data

Post by "Jim Hines" »

Dana Meadows commented on the consultant's plea that extreme detail
pleases the clients. She raises the good question of whether it's really in
the client's interest to put in (and pay for) the detail. I'd like to raise
a slightly different point:

The consultant's time might be better spent explaining to the client why
added detail is unnecessary and undesirable. The time required to add
detail is often underestimated. The required time is not just for
adding the structure, but for debugging it, explaining it, and dealing with
it every time it's necessary to figure out why the model is doing whatever
it's doing. Further, the required time is not just on **this** model. By
putting in unnecessary detail, consultants educate their clients to expect
and request detail.

Regards,
Jim Hines
jhines@mit.edu
"geoff coyle"
Senior Member
Posts: 94
Joined: Fri Mar 29, 2002 3:39 am

census data

Post by "geoff coyle" »

Keith Linard's note about the use of census data shows the value of this
discussion forum, as it has made me think a bit more deeply about the use of
numerical data in SD models. One aspect is the cleverness of being able to
make Excel talk to Powersim, or whatever, but the other is the underlying
assumptions and possibilities of error. This might well make a very nice PhD
project, so I'll illustrate some of the issues by examining a few aspects of
the census data problem.

I'm working from memory and some of the following details may be a bit out,
but that doesn't affect the principles. To illustrate the points I'll assume
that no one lives for more than 100 years, which immediately introduces a
small error because the Queen regularly sends a telegram of congratulation
to anyone reaching their hundredth birthday (she probably uses e-mail now).

The 1990 UK census took 3 years to publish and was admitted to be out by
about 5-10% in its count of the total population. Even used immediately
after publication, there is the error that anyone over 97 at census time has
since died. Moving the data forward by three years doesn't solve the problem,
because one would have to assume a mortality profile for all the rest of the
population as they age by three years and, of course, one would have to
assume birth rates for anyone under 3 years old, which involves three more
guesses: increased lateness of child bearing as more women work, changes in
the proportions of couples choosing to remain childless, and changes in the
numbers of children born out of wedlock. Apart from that, there have been
immigrants, emigrants and refugees.
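
An illustrative sketch of what "moving the data up" three years involves,
with every number invented: age the published counts forward under an
assumed mortality profile and assumed births, ignoring migration.

import numpy as np

AGES = 100                                  # assume no one lives past 100
census = np.full(AGES, 600.0)               # hypothetical counts by age, thousands

# assumed annual survival probabilities: high at young ages, falling with age
ages = np.arange(AGES)
survival = np.clip(0.999 - 0.00001 * ages**2, 0.0, 1.0)

assumed_births = [700.0, 690.0, 680.0]      # guessed births in each of the 3 years

rolled = census.copy()
for births in assumed_births:
    rolled[1:] = rolled[:-1] * survival[:-1]   # survivors move up one age
    rolled[0] = births                          # guessed new cohort
    # (immigrants, emigrants and refugees ignored here - yet another guess)

print(f"Published total: {census.sum():,.0f}k, "
      f"rolled-forward total: {rolled.sum():,.0f}k")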

Using the 1990 data now could well mean that the data are out by as much as
25% in total population and considerably out (+ or -) in the age-cohort
populations. In short, linking an SD model to any data set, not just census
data, is fraught with all sorts of possibilities of error, and even the most
glittering GUI will only conceal a completely illusory pretence at accuracy.

The errors arise from fitting data to a dynamic model in which time is
passing and delays are taking effect. Another good example is shipbuilding.
There are data on the tonnage of oil tankers laid down (which is not the
same as built) from, say, 1970 to 1990. A model of world tanker capacity,
designed to address issues such as the dynamics of the shipping freight
markets, would need to have shipping capacity, and those data seem to offer
it. The problem is that ships laid down in 1950 or later would still be in
service in 1970. OK, so we go back 10 years further, but that is still no
good, as ships laid down in 1940 were still around for years after the war.
The war-built ships were completely different from the supertankers of the
1960s, so there is a problem of not comparing like with like when calculating
capacity. You would also need a record of sinkings, breaking up of ships,
transfers to different types of trade and so forth.
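
A toy illustration of that point, with invented tonnages and delays:
capacity in service in a given year depends on ships laid down before the
data series starts, on a build delay, and on scrappage, none of which the
laid-down series alone provides.

laid_down = {year: 5.0 for year in range(1970, 1991)}   # Mdwt laid down per year, invented

BUILD_DELAY_YRS = 2          # assumed lag from laying down to entering service
SERVICE_LIFE_YRS = 25        # assumed average life before scrapping

def capacity_in_service(year, laid_down):
    """Tonnage afloat in `year`, counting only ships in the data window."""
    total = 0.0
    for built, tonnage in laid_down.items():
        enters = built + BUILD_DELAY_YRS
        leaves = enters + SERVICE_LIFE_YRS
        if enters <= year < leaves:
            total += tonnage
    return total

# The 1970 figure is badly understated: ships laid down in the 1950s and
# 1960s (outside the data window) would still be in service but are missed.
for y in (1970, 1980, 1990):
    print(y, capacity_in_service(y, laid_down), "Mdwt (from 1970-90 data only)")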

In short, loading dynamic models with numerical data is often pointless,
gives false impressions of accuracy and is technically difficult in models
involving delays. How can it be done? When should it be attempted? There are
a load of challenges and I just hope that there is a very bright PhD
candidate out there who might want to tackle this.

Regards,

Geoff

geoff.coyle@btinternet.com
Professor Geoff Coyle
Consultant in System Dynamics and Strategic Analysis
Tel: (44) 01793 782817 Fax: 01793 783188
Donella.H.Meadows@Dartmouth.EDU
Junior Member
Posts: 7
Joined: Fri Mar 29, 2002 3:39 am

census data

Post by Donella.H.Meadows@Dartmouth.EDU »

I have enjoyed the discussion about modeling aging chains, since that was my
problem in the very first model I made (the population sector of World3). What
I learned from that exercise reinforces several of the points made in this
recent discussion.


First, for my dynamic question it was important to capture the delay effect of
the aging chain, but not the details. So, after experimenting with alternatives
ranging from one population stock to 100 of them (single-year cohorts), I
settled for 4. That was the right solution for MY modeling problem, maybe not
for yours.


A few years later I did a review of many population models (summarized in The
Electronic Oracle) and discovered that virtually all of them included full
demographic models (200 stocks, differentiated by age and sex) and that in every
case it was a waste of modeling time and an overcomplication of the model. Data
from other parts of the system interacting with the population just did not
justify such precision. Time was put into modeling aging-chain details and
diverted from important feedback structure. Maybe worst of all, the life tables
-- the age-specific fertilities and mortalities -- had been frozen and made
non-dynamic, because no one could figure out how system changes would affect
200+ parameters. Therefore the population models, while correctly stepping
everyone forward in aging, were increasingly inaccurate in removing the dead and
calculating the births.


Running the life tables "backward," making each age-specific mortality a table
function of other factors in the system, was what I did to make dynamic
mortality and fertility in World3, but I only had to deal with 4 table
functions, not 200!
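
A stripped-down illustration of the same idea (not World3 itself; the lookup
values and cohorts below are invented): age-specific mortality becomes a
table function of another system variable, applied to a handful of aggregate
cohorts instead of 200 frozen parameters.

import numpy as np

def table(x, xs, ys):
    """Piecewise-linear table function, the usual SD lookup."""
    return float(np.interp(x, xs, ys))

# assumed lookup: mortality multiplier vs. food per capita (index, 1 = today)
FOOD_XS = [0.0, 0.5, 1.0, 1.5, 2.0]
MULT_YS = [3.0, 1.6, 1.0, 0.8, 0.7]

base_mortality = {"0-14": 0.01, "15-44": 0.004, "45-64": 0.012, "65+": 0.07}

def mortality(cohort, food_per_capita):
    return base_mortality[cohort] * table(food_per_capita, FOOD_XS, MULT_YS)

for f in (0.6, 1.0, 1.4):
    print(f"food index {f}:",
          {c: round(mortality(c, f), 4) for c in base_mortality})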


With regard to the consultant's plea that extreme detail pleases the
clients, I would just ask whether you serve your client by spending, and
charging for, enormous amounts of modeling time that are in fact unnecessary
to the problem at hand. (And I do realize that there are cases where it IS
necessary.)


Dana Meadows
From: Donella.H.Meadows@Dartmouth.EDU (Donella H. Meadows)
Keith Linard
Junior Member
Posts: 11
Joined: Fri Mar 29, 2002 3:39 am

census data

Post by Keith Linard »

Let me summarise my interpretation of Geoff Coyle's concerns about census data.

1. Any census data, whether from a sample or an attempted 100% coverage,
involves inaccuracy.
2. A census of any entity which changes over time gets progressively
more inaccurate.
3. Therefore using numerical census data is often pointless, gives a
false impression of accuracy, and is technically difficult where there are
delays.

The first two points are valid but I challenge the implications Geoff draws
from them.

Let me summarise my contentions:

The attributes of a successful SD modelling process include:
1. That the client thinks there is a problem
2. That the problem has dynamic characteristics
3. That the modelling process leads to a shared understanding among the
key stakeholders of the characteristics (dynamic, relational or otherwise)
of the problem
4. That the modelling process points towards actions which these
stakeholders may take to resolve or mitigate the (perceived or actual)
problem
5. That the client has sufficient confidence in the model, as a result of
the modelling process, to be motivated and empowered to act in a way
which leads to system improvement (defining improvement of course being a
PhD project in its own right)

I contend that, even were we successful on the first 4 attributes, if we do
not achieve the 5th then the modelling process is a failure. My opening
gambit in every undergraduate and graduate class since moving from the
real world of management into academia has been: "Coming up with a
right (or even a good) technical answer is 5% of the task. Ensuring
that this right answer is accepted and acted upon by the decision makers
is the other 95%. If your answer is ignored, you might as well have a totally
invalid solution." (Obviously my mental model is still imbued with an
altruistic public-service ethic: the academic modeller can always
publish papers on why a brilliant modelling exercise was not implemented,
and the consultant modeller can pocket their fee ... both successes under
different mental models.)

Sometimes, particularly in circumstances where solutions are
non-controversial or where there are not multiple stakeholders pushing
conflicting agendas, the successful model may simply consist of causal
loop or influence diagrams, or primitive SD models. Much of the public
sector reform program in Australia during the 1980s was driven by such
high-level modelling.

However, when there are contentious and conflicting agendas, every aspect
of the supporting models (be they SD, econometric or whatever) will be
challenged by those who see their positions threatened. And the validity
of the supporting data is the simplest and easiest place to start that
attack, certainly before challenges to model structure. Given this, my
pragmatic approach has been to use "official Bureau of Statistics" census
data (updated by the Bureau's estimates). Also on pragmatic grounds, I
incorporate estimates of statistical uncertainty and use these in my SD
models by combining the models with Monte Carlo simulation (now very simple
with Powersim Solver). In this way I have no difficulty in focussing
attention where it really belongs ... on the model structure and on the
critical assumptions regarding change from the present state as a result of
policy options.
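
A bare-bones sketch of that combination, with invented uncertainty ranges
rather than any Bureau figures: treat the census-based initial value and the
growth rate as uncertain and run a Monte Carlo ensemble, so attention shifts
from "is the data exact?" to the spread of outcomes.

import numpy as np

rng = np.random.default_rng(1)
RUNS, YEARS = 1000, 20

results = np.empty(RUNS)
for i in range(RUNS):
    # initial population drawn around the official estimate (+/- a few %)
    pop = 5.0e6 * rng.normal(1.0, 0.03)
    growth = rng.normal(0.011, 0.002)          # uncertain net growth rate
    for _ in range(YEARS):
        pop *= 1 + growth
    results[i] = pop

lo, mid, hi = np.percentile(results, [5, 50, 95])
print(f"Population after {YEARS} years: median {mid/1e6:.2f}M, "
      f"90% interval {lo/1e6:.2f}M-{hi/1e6:.2f}M")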

Returning to Geoff's contentions, let's consider them in the context of
concerns about motor car pollution emissions in Australia (simplifying the
fleet by ignoring trucks, buses etc). Problem characteristics include:
* A very mixed vehicle fleet, with car ages up to 95 years (although we
can probably ignore as statistically insignificant all cars over 40 years)
* Emission characteristics being very age related (leaded cf.
unleaded petrol users; catalytic converters of varying technologies;
fuel efficiency etc)
* Age of vehicle being related to age of user and to city vs country location
* Vehicle kilometres of travel being strongly related to age of user
* Possession of a driver's licence being very strongly related to age of
user (so that the 70-80 year old cohort in 2010 will include many more
drivers than the current 70-80 year old cohort ... and it's pretty easy to
model that transition!)

Government has a variety of possible policy levers to influence change:
imposing pollution taxes on leaded fuel; differential annual car
registration fees depending on technology; banning leaded fuel; banning
cars without catalytic converters; and so on. Each policy has different
winners and losers, with dramatically different environmental and equity
implications. The lobby groups are powerful and have access to expert
technical advice. A critical issue for all of them will be how the different
policies impact their particular clientele compared with the other winners
and losers.

Data are available on the age of the vehicle fleet and on age-specific
scrappage ("death") rates (based on continuation of current policy etc), so
it is easy to generate estimates, acceptable to all stakeholders, of future
vehicle fleets (albeit subject to uncertainty ... which can be estimated).
Similarly for population characteristics. Frankly, if we cannot build an
acceptable module for this we have no right to start on the really
difficult stuff, namely the causal relationships between possible
policy levers and changes in people's behaviour and in environmental outcomes
... how the different policy levers are likely to affect the propensity to
write off old cars, the propensity to switch from private to public transport,
how these propensities are likely to vary over time, and so on.
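
A sketch of the fleet-aging module described above as the easy part, with
all figures invented: age the car fleet forward with assumed age-specific
survival and track the share of cars with catalytic converters.

import numpy as np

MAX_AGE = 40                         # ignore the statistically few older cars
fleet = np.full(MAX_AGE, 250.0)      # thousands of cars per year of age, assumed

ages = np.arange(MAX_AGE)
survival = np.clip(1.0 - 0.002 * ages - 0.0001 * ages**2, 0.0, 1.0)

NEW_SALES = 280.0                    # thousands of new cars per year, assumed constant
CONVERTER_FROM_MODEL_YEAR = 1986     # assumed: converters fitted from this model year

for year in range(2000, 2011):
    fleet[1:] = fleet[:-1] * survival[:-1]   # survivors move up one year of age
    fleet[0] = NEW_SALES                     # this year's new cars
    cutoff = min(year - CONVERTER_FROM_MODEL_YEAR + 1, MAX_AGE)
    share = fleet[:cutoff].sum() / fleet.sum()
    if year in (2000, 2005, 2010):
        print(f"{year}: fleet {fleet.sum():,.0f}k cars, converter share {share:.0%}")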

Yes, there is a place for high-level conceptual modelling based on
hypothesised causal relationships, minimal data and minimal validation. In
contentious policy areas, however, such modelling has little impact on
decision making ... whilst the purveyors of econometric magic (subject to
errors of +/- 100% or more) rule the day. I believe that there is an
important role for SD models which are integrated with, and draw on, the
(numerical data in) corporate databases. This is why I am excited by the
potential for the integration of Powersim etc. into the Strategic Enterprise
Modelling modules of SAP, PeopleSoft etc. But this integration will
require more than simplistic rookie-consultant-principal models with
relationships plucked out of the air.

Keith Linard
From: Keith Linard <k-linard@adfa.edu.au>
Senior Lecturer
University of New South Wales
Locked