_________________________________________________________________
Measurement
Copyright © 1996 by H. Goldstein, J.L. Gross, R.E. Pollack and R.B.
Blumberg
(This is the third chapter of the first volume of The Scientific
Experience, by Herbert Goldstein, Jonathan L. Gross, Robert E. Pollack
and Roger B. Blumberg. The Scientific Experience is a textbook
originally written for the Columbia course "Theory and Practice of
Science". The primary author of "Measurement" is Jonathan L. Gross,
and it has been edited and prepared for the Web by Blumberg. It
appears at MendelWeb, for non-commercial educational use only.
Although you are welcome to download this text, please do not
reproduce it without the permission of the authors.)
3.0: Introduction
3.1: Nominal, ordinal, and interval scales
3.2: What observations can you trust?
3.3: From observation to prediction: the role of models
3.4: Obstructions to measurement
3.5: Exercises
3.6: Notes
_________________________________________________________________
MEASUREMENT
Deciding by observation the amount of a given property a subject
possesses is called measurement. In a more liberal sense, the term
"measurement" often refers to any instance of systematic observation
in a scientific context. Measurement is always an empirical procedure,
such as reckoning the mass of an object by weighing it, or evaluating
the amount a student has learned by giving an examination.
By way of contrast, quantification is a kind of theorizing, such as
refining the concept of mass, to explain observed resistance to force.
Therefore, it might seem that quantification precedes measurement, or
perhaps that after some preliminary observations to develop a method
of quantification, one thereafter makes measurements according to that
quantification. Such a picture omits much of the hard work typically
involved in the creative process.
In original scientific investigations, the relationship between
quantification and measurement is a "feedback loop". That is, the
first set of measurements might suggest that the initial
quantification was overly simplistic or even partially wrong, in which
case the quantification is appropriately modified. Then some more
measurements are made. Perhaps they indicate that not all the problems
have been resolved, and that the quantification should be further
refined. Then there are more measurements. And so on.
The pot of gold at the end of the quantification rainbow is called a
"mathematical model". In general, a model is a construction that
represents an observed subject or imitates a system. Museums of
science often display physical models of atoms and molecules, for
example, or of the solar system. Here is the sense in which a
mathematical abstraction can be a model: the relationships among the
values of the mathematical variables in the abstraction can imitate
the relationships among their respective counterparts in the system
being modeled.
For instance, there are models of the solar system that enable us to
predict solar eclipses, long before their occurrence. Such models might
be considered valid only to the extent that they predict accurately,
independent of the attention given to detail. Analogously, a valid
model of some aspect of the economy would be one that accurately
predicts future economic behavior.
This chapter describes the steps along the way from empirical
observations to the formulation of models. Developing a model is what
wins a Nobel Prize. Perfecting the methods of observation is a step
along the way, but "scientific truth" is an empirical demonstration
that a model is valid.
3.1 Nominal, ordinal, and interval scales
We have defined measurement in a sufficiently broad sense that it
applies to any procedure that assigns a classification or a value to
observed phenomena. Informally, the data that result from a
measurement procedure are also called measurements.
Collecting data is not simply a matter of writing down whatever you
see; that would be an infinite task, whose outcome would be a few
critical observations completely hidden in an undifferentiated mess of
irrelevant details. The collection of data requires structure,
including an experimental design and a method of observation.
Part of an experimental design is the creation of a "scale" for the
measurements, based on the quantification of the phenomena to be
observed. There are three major classes of scales, called "nominal"
scales, "ordinal" scales, and "interval" scales.
A nominal scale is a qualitative categorization according to unordered
distinctions. Consider, for example, an attempt to assign the value
"male" or "female" while measuring gender. We might say that female
is "category 0" and male "category 1", but we might equally well
reverse those numeric labels, because they have nothing to do with
femaleness or maleness.
Another example of a nominal scale is the department an undergraduate
chooses for "major" emphasis. Some academic institutions, such as the
Massachusetts Institute of Technology, have assigned a number to every
department. When asked to identify his or her major, an M.I.T. student
is likely to respond something like "Course 8" (which means physics)
or "Course 21" (which means humanities).[1] However, the numeric
designations are entirely arbitrary. Thus, even at M.I.T.,
classification of students according to major department is nominal!
An ordinal scale makes ranked distinctions. For instance,
lexicographic ("dictionary") order is an example of an ordinal scale.
It depends on only one property, the sequence of letters in the word.
The lexicographic order of words, such as
"huge" < "infinite" < "little" < "medium-sized" < "tiny"
need not be consistent with any notion of order that is derived from
the meanings of the words. Military rank is another example of an
ordinal scale.
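Lexicographic order is easy to see in practice. Here is a minimal
illustrative sketch (in Python, a choice of this edition rather than
the text): the language's default string comparison is lexicographic,
so sorting the five size words above reproduces exactly the dictionary
order shown, regardless of the words' meanings.

```python
# Lexicographic (dictionary) order depends only on the letter
# sequence, not on the meanings of the words.
words = ["tiny", "huge", "medium-sized", "little", "infinite"]

# Python compares strings letter by letter, so sorted() yields the
# dictionary order: "huge" < "infinite" < "little" < "medium-sized"
# < "tiny", the opposite of any size-based ranking.
print(sorted(words))
```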
An interval scale is based on the real numbers, so that each unit on
the scale expresses the same degree of difference, no matter where on
the scale it is located. For instance, weight, length, and duration of
time are interval scales.
The atomic number of the elements of chemistry is regarded as an
interval scale, even though the realizable values are all integers.
The difference it expresses between elements one apart is always one
proton in the nucleus. The point is that the type of a scale --
nominal, ordinal, or interval -- depends on the underlying nature of
the quantification, rather than on the observed existence of
particular values, or even on the physical possibility of existence.
If the value of zero on an interval scale represents a total absence
of the property being measured, then the scale is sometimes called a
ratio scale. For instance, whereas Celsius temperature and Kelvin
temperature are both interval scales, Kelvin temperature is a ratio
scale but Celsius temperature is not. The difference, in classical
physics, is that zero on the Kelvin scale means absolute zero, the
case in which all motion stops. Here are some multiple-choice
questions designed to illustrate what is missing when an interval
scale fails to be a ratio scale and that the concepts involved are
somewhat subtle.
Question 1: Suppose it is 10° Celsius on Sunday and 20° on Monday.
Does that make it twice as warm on Monday? Choose only one of the
following answers.
( ) yes ( ) no
The sensible answer to Question 1 is no, of course. It is a mistake in
physics to identify the concept of "warmth" with measurements on the
Celsius scale. Suppose the temperature drops to 1° C on Tuesday. Was
it really twenty times as warm on Monday? What if it drops to 0.1° C
on Wednesday? Was it ten times as warm on Tuesday, and 200 times as
warm on Monday? Perhaps by Thursday the temperature drops to -5°. Was
it -4 times as warm on Monday as on Thursday?
The Fahrenheit scale is another interval scale for temperature that is
not a ratio scale. The comparable Fahrenheit readings for Sunday,
Monday, Wednesday, and Thursday would be 50° F, 68° F, 32° F, and 23°
F. Their ratios make no more sense than ratios on the Celsius scale.
The point is that, if a scale is not a ratio scale, then inferences
based on the calculation of ratios of its data points may be utterly
misleading.
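The point can be checked arithmetically. The following sketch (in
Python, an illustrative choice, using the Celsius readings from
Question 1) shows that the "times as warm" ratio depends on which
interval scale you happen to use, whereas ratios of temperature
differences come out the same on both scales.

```python
# Celsius readings from the text: Sunday 10, Monday 20, Wednesday 0.1.
def c_to_f(c):
    # Standard Celsius-to-Fahrenheit conversion.
    return 9.0 / 5.0 * c + 32.0

sunday_c, monday_c, wednesday_c = 10.0, 20.0, 0.1
sunday_f = c_to_f(sunday_c)        # 50° F
monday_f = c_to_f(monday_c)        # 68° F
wednesday_f = c_to_f(wednesday_c)  # about 32.2° F

# "Twice as warm"? Only on the Celsius scale, not on Fahrenheit:
print(monday_c / sunday_c)   # 2.0
print(monday_f / sunday_f)   # 1.36
# Ratios of *differences* agree, because temperature differences
# form a ratio scale:
print((monday_c - wednesday_c) / (monday_c - sunday_c))  # about 1.99
print((monday_f - wednesday_f) / (monday_f - sunday_f))  # about 1.99
```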
Question 2: Using the Celsius temperature measurements given in
Question 1 and its answer, is it correct to say that the temperature
difference between Monday (20° C) and Wednesday (0.1° C) was about
twice as great as the temperature difference between Monday and Sunday
(10° C)?
( ) yes ( ) no
This time the answer is yes. Even though the Celsius scale is not a
ratio scale, when one uses it to measure temperature change, it
becomes a ratio scale. When you are measuring relative difference, the
value of zero represents no difference; zero is then no longer an
arbitrary reference point in the scale, but one with physical and
psychological meaning. In particular, the amount of energy required to
raise the temperature, say, of one cubic centimeter of water, ten
degrees is twice as great as the amount needed to raise it five
degrees. Moreover, although finger-dipping estimates are not as
precise or reliable as thermometer readings, people can scale gross
differences of temperature within a suitable range of human
perception.
The concepts of nominal, ordinal, and interval scale apply to what is
possibly only a single dimension of a multivariate observation.
Multidimensional measurements and classifications are common. For
instance, we might classify individuals according to age and sex,
thereby combining an interval scale on one dimension with a nominal
scale on the other. In Chapter 7 we shall discuss some instances in
which two or more ordinal scales are combined to form a
multidimensional scale; the ordinality might then partially break
down, because subject A might rate higher than subject B on scale 1,
while subject B rates higher than subject A on scale 2.
3.2 What observations can you trust?
Certain attributes are thought desirable in just about any method of
observation. First of all, a method is said to be reliable if its
repetition under comparable circumstances yields the same result. In
general, unreliable observation methods are not to be trusted in the
practice of science, and theories are not accepted on the basis of
irreproducible experiments. Reliability is consistency, and it is
perhaps the measurement quality most responsible for yielding
scientific knowledge that is "public".
In the natural sciences, the criterion of reproducibility is
frequently easy to meet. For instance, two samples of the same
chemical substance are likely to be essentially identical, if they are
reasonably free of impurities. By way of contrast, this standard often
introduces some problems of experimental design into the medical
sciences. You cannot expect to apply the same treatment many times to
the same patient, for a host of reasons. Among them, the patient's
welfare is at stake, and each treatment might change the patient's
health. Nonetheless, in medicine, even when you cannot hope to reuse
the exact same subject, there is generally some hope of finding
comparable subjects.[2]
In some of the social sciences, the problem of achieving
reproducibility often seems extreme. The first visit of an
anthropologist to a community may well change the community.
Similarly, widespread publication of the results of a political poll
might induce a change of voter sentiment. Nonetheless, for the
findings to be regarded as scientific, whoever publishes them must
accept the burden of presenting them in a manner that permits others
to test them.
The intent to be accurate or impartial is not what makes a discipline
a science. Conscientious reporting or disinterested sampling is not
the basis for scientific acceptance. Reproducibility is what
ultimately counts. Outstanding results are often obtained by persons
with a stake in a particular outcome. For instance, a manufacturer of
pharmaceuticals might have a large financial stake in the outcome of
research into the usefulness and safety of a particular drug. Academic
scientists usually have a greater professional benefit to be gained
from "positive" results than from "negative" results; it would usually
be considered more noteworthy to obtain evidence for the existence of
some particle with specified properties than to gather evidence that
it does not exist.
An accurate measurement method is one that gives the measurer the
correct value or very close to it. An unreliable method cannot
possibly be accurate, but consistency does not guarantee accuracy. An
incorrectly calibrated thermometer might yield highly consistent
readings on an interval scale, all inaccurate. By way of analogy, a
piece of artillery that consistently overshoots by the same amount is
reliable, but inaccurate. Accuracy means being on target.
As a practical matter, reliably inaccurate measurements are often
preferable to unreliably inaccurate ones, because you might be able to
accurately recalibrate the measuring tools. An extreme analogy is that
a person who always says "No" for "Yes", and vice versa, gives you far
more information, once you understand the pathology, than a person who
lies to deceive, or than one who gives random answers.
Measurements on an ordinal scale are regarded as reliable if they are
consistent with regard to rank. For instance, if we decide that when
two soldiers meet, the one who salutes first is the one of higher
rank, then our method is highly reliable, because there is a carefully
observed rule about who salutes first. Our decision method happens to
be completely inaccurate, because it is the person of lower rank who
must salute first.
If the scale is nominal, we would appraise the reliability of a
classificatory method by the likelihood that the subjects of the
experiment would be reclassified into the same categories if the
method were reapplied.
Consider now, the reliability of a classification of living species
into either the plant kingdom or the animal kingdom according to
whether they were observed to be stationary or mobile. This is not an
accurate method, since coral animals are apparently stationary, and
sagebrush is seemingly mobile. Moreover, not all living species belong
to either the plant kingdom or the animal kingdom, so the problem of
discernment is not even well-posed. A further complication is that
some species, such as butterflies, have a stationary phase and a
mobile phase. However, the issue is only whether whatever method of
observation we apply to distinguish between stationariness and
mobility produces the same answer repeatedly for members of the same
species.
We can rate the reliability of a measurement method according to a
worst case or to an average case criterion. For interval scale
measurements, the reliability is most commonly appraised according to
relative absence of discrepancy, and it is given the special name
precision. Discrepancy of a millimeter is highly precise if the
target is a light year away. It would be overwhelmingly imprecise if
we were measuring the size of a molecule. A common measure of
precision is the number of significant digits in a decimal
representation of the measurement.
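The dependence of precision on the scale of the target can be made
concrete. Here is a small sketch (in Python, an illustrative choice;
the magnitudes are rough order-of-magnitude assumptions, not values
from the text) comparing the same millimeter discrepancy against a
light year and against a molecule.

```python
# Rough orders of magnitude, assumed for illustration only.
light_year_m = 9.46e15   # one light year, in meters
molecule_m = 1e-9        # a typical small-molecule size, in meters
discrepancy_m = 1e-3     # a discrepancy of one millimeter

# Relative discrepancy: vanishingly small against a light year,
# but about a million times the quantity measured for a molecule.
print(discrepancy_m / light_year_m)   # about 1e-19
print(discrepancy_m / molecule_m)     # about 1e6
```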
Associated with accuracy is the concept of validity. We say that a
measurement is valid if it measures what it purports to measure. A
method that is direct and accurate, such as measuring distance by a
correctly calibrated ruler, is always valid. However, when the
measurement is indirect, it might be invalid, no matter how consistent
or precise it is.
For example, suppose we attempted to measure how much of a long
passage of text an individual had memorized by the number of complete
sentences of that passage that the person could write in ten minutes.
Although this measure might be consistent, it is invalid in design,
since it gives too much weight to handwriting speed. It is also an
invalid measure of handwriting speed, because it gives too much weight
to memorization.
The phlogiston theory is another example of precise invalidity. Before
it was understood that combustion is rapid oxidation, it was thought
that when a material ignites, it loses its "phlogiston". The amount of
phlogiston lost was reckoned to be the difference in weight between
the initial substance and its ash residue. Do you know what accounts
for the lost weight?
The question of validity is often enshrouded in semantics. For
instance, there is a purported method of measuring "intelligence"
according to the frequency with which a person uses esoteric words.
The burden of proof that what is measured by such a test, or by any
other test, is correlated to other kinds of performance commonly
associated with the word "intelligence" lies on the designer of that
measure.
In the absence of proof, using the word "intelligence" for that
property is purely semantic inflation of a single trait of relatively
minor importance into a widely admired general kind of superiority.
Surely the importance of the concept of intelligence depends on its
definition as something more than a property measured by such a
simplistic test. Indeed it might be preferable not to try to reduce a
complex property such as intelligence to a single number. Suppose that
someone was far above average in verbal skills and far below average
in mathematical skills: would it be reasonable to blur the distinction
between that person and someone else who was average in both kinds of
skills?
Criticism of proposed models -- sometimes based on empirical results,
sometimes based on intuition or "thought experiments" -- is an
important part of scientific activity. Of course, every practicing
scientist knows that it is far easier to find flaws or unsupported
parts in someone else's theory or experiment, than to design a useful
theory or experiment of one's own.
3.3 From observation to prediction: the role of models
A contribution to scientific knowledge is commonly judged by the same
standard as a contribution in many other practical endeavors: how well
does it allow us to predict future events or performance? By way of
analogy, an investment counselor's financial advice is worthwhile to
the extent that it yields future income, not by how well it would have
worked in past markets. Other criteria might be applied in astronomy
or geology, for instance, but prediction is the standard
criterion.[3]
Investment counselors have an incentive to conceal part of the basis
for their advice. After all, if you knew everything about how the
advice was generated, you could avoid paying the counselor's fee,
because you could get answers to your specific questions yourself. In
contrast, one of the standard requirements for a contribution to
scientific knowledge is that it must include details that permit
others to predict future behavior for themselves.[4]
The form of a scientific prediction is a mathematical model. Here are
some examples.
1. A stone is dropped from a window of every eighth floor of a tall
building, that is from the 8th floor, from the 16th, from the 24th,
and so on. As a stone is dropped from each floor, a stopwatch is used
to measure the amount of time that elapses until it hits the ground.
The following table tells the height of the window sill at each floor
and the recorded times in the experiment:
Floor:                     8     16    24    32    40
Distance to Ground (ft.):  96    192   288   384   480
Drop time (sec.):          2.45  3.46  4.24  4.90  5.48
A good model for the relationship between dropping time t and the
distance d is the mathematical function:
d = 16t²
This rule certainly explains the recorded observations. You can easily
verify that 16 x 2.45² = 96.04, and so on.
We attribute the discrepancy of .04 between the distance of 96.04 that
our model associates with the dropping time of 2.45 seconds and the
observed distance of 96 as measurement error. All of the other
discrepancies are similarly small. For the time being, we will not be
concerned with the process by which such incredibly precise
measurements were made.
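The fit of the model to the table can be checked mechanically. The
following sketch (in Python, a choice of this edition, using the
distances and times from the table above) computes 16t² for each
recorded time and the discrepancy from the observed distance; every
discrepancy is a fraction of a foot.

```python
# Check the model d = 16 * t**2 against the recorded drops.
# Small discrepancies (e.g. 96.04 vs. 96) are attributed to
# measurement error.
distances_ft = [96, 192, 288, 384, 480]
times_sec = [2.45, 3.46, 4.24, 4.90, 5.48]

for d, t in zip(distances_ft, times_sec):
    predicted = 16 * t ** 2
    print(d, predicted, abs(predicted - d))
```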
2. By supposing that the dropping time is related to the distance, we
are avoiding what is often the hardest part of modeling: separating
what is relevant from what is irrelevant. Suppose we had carefully
recorded the length of the names of the person who dropped the stone
from each floor. Here is another table, showing this previously
overlooked relationship.
Name of Dropper:    YY    Sal   John  David  Kathy
Name Length:        2     3     4     5      5
Drop time (sec.):   2.45  3.46  4.24  4.90   5.48
If you round the dropping time to the nearest integer, the result is
the same as the name length of the person who dropped the stone. In
this case the function:
namelength = Round-to-Nearest-Integer (dropping time)
is a mathematical model for the relationship.
How do you decide which model is better? All other things being equal,
you would probably prefer the first model for its advantage in
precision. However, there is a better way to make the choice, which is
to extend the experiment in such a way that the predictions of the two
models disagree.
For instance, we might send a volunteer named Mike to the 60th floor,
whose window is 720 feet from the ground. The first model predicts
that the stone will drop in
the square-root of (720/16) which is about 6.7 seconds
The second model predicts a dropping time of 3.5 to 4.5 seconds.
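The disagreement between the two models is easy to compute. This
sketch (in Python, an illustrative choice, using the 720-foot height
and the volunteer "Mike" from the text) derives each model's
prediction for the 60th floor.

```python
import math

# Model 1: d = 16 * t**2, so t = sqrt(d / 16).
height_ft = 720
model1_time = math.sqrt(height_ft / 16)
print(round(model1_time, 1))   # about 6.7 seconds

# Model 2 (the frivolous namelength model): the dropping time rounds
# to the dropper's name length. For "Mike" (4 letters), that means
# any time from 3.5 up to (but not including) 4.5 seconds.
name_length = len("Mike")
print(name_length - 0.5, name_length + 0.5)
```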
Sometimes, two different models make the same prediction for all of
the possible cases of immediate interest. In that instance, the
philosophical principle called "Ockham's razor" is applied, and the
simpler model is regarded as preferable.[5]
The namelength model might seem a trifle too frivolous to have been
seriously proposed; more frivolous, say, than the discredited
phlogiston theory. However, we can turn it around to make a point.
Often the best explanation of a phenomenon lies in something the
experimenter initially ignores because it is not the right kind of answer.
A chemist who is pouring solutions from test tube to test tube might
not think that the effect being observed depends on whether the
experiment is performed in natural light or artificial light. A
physician whose clinical research is primarily on the same day every
week might be completely unaware that the patients at the clinic on
that day of the week differ from those who are there on other days.
When research findings are published, there is an opportunity for
other persons to repeat the experiment. If different results are
obtained on different occasions, an explanation is required. It is
likely that other researchers will think of various ways to extend the
experiment. Often they wish not only to test the results of another
scientist, but also to have an opportunity to augment the model.
Not every mathematical model is given by a closed formula. Here is
another example.
3. A pair of cuddly animals is kept together from birth. After one
month, there is still only the original pair. However, after two
months there is a second pair, after three a third. After four months
there are five pairs. The following table shows the number of pairs
for the first eight months:
Time (months): 0 1 2 3 4 5 6 7 8
Number of Pairs: 1 1 2 3 5 8 13 21 34
Notice that after two months, the number of pairs in any given month
is the sum of the numbers of pairs one month ago and two months ago.
For instance, the number of pairs at six months is 13, which is the
sum of 8 and 5, the numbers at five months and at four months.
One possible model for this process is recursive. Let p(n) represent
the number of pairs at n months. Then the value of p(n) for any
positive integer n is given by this recursively defined function:
p(0) = 1
p(1) = 1
p(n) = p(n-1) + p(n-2) if n > 1
The following program calculates p(60), or rather it calculates an
approximation to p(60) (the number p(60) is so large that it is
represented in the computer in floating point form, with lower order
digits truncated).
Program Fibonacci[6]
100 DIM P[60]
110 LET P[0] = 1
120 LET P[1] = 1
130 FOR N = 2 TO 60
140 LET P[N] = P[N-1] + P[N-2]
150 NEXT N
200 PRINT P[60]
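The same iterative computation can be sketched in a modern language
(Python here, a choice of this edition rather than the text's).
Python's integers have arbitrary precision, so p(60) comes out exact,
without the floating-point truncation mentioned above.

```python
# Python rendering of the iterative BASIC program above.
p = [0] * 61
p[0] = 1
p[1] = 1
for n in range(2, 61):
    p[n] = p[n - 1] + p[n - 2]

# Exact integer result; no low-order digits are truncated.
print(p[60])
```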
Despite the recursive definition of the Fibonacci function, using a
recursive program to calculate Fibonacci numbers is a bad mistake.
Here is such a recursive program.
Function FIBONACCI (N)
  IF N = 0 THEN RETURN 1
  IF N = 1 THEN RETURN 1
  RETURN FIBONACCI (N-1) + FIBONACCI (N-2)
Imagine that you want to calculate the 60th Fibonacci number. The
first recursive step:
FIBONACCI (60) = FIBONACCI (59) + FIBONACCI (58)
"calls itself" twice. That is, its right side requires that the
function Fibonacci be calculated for the arguments 59 and 58.
Calculating FIBONACCI (59) and FIBONACCI (58) each requires two more
self-calls, thus, an additional four calls. For each of those four
calls, there would be two more self-calls. For each of those eight
self-calls, there would be two more -- that is, 16 additional calls --
and so on, until the bottom level was reached. Thus, there would be
exponentially many calls. Such a phenomenon is known as "exponential
recursive descent". You would have to wait a very long time to see the
answer.
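The exponential growth of self-calls can be observed directly. Here is
a small sketch (in Python, an illustrative choice) that runs the naive
recursion while counting every call; the count roughly doubles every
step or two, so it quickly becomes astronomical.

```python
# Naive recursive Fibonacci, instrumented to count its own calls.
def fib_counting(n, counter):
    counter[0] += 1              # tally this invocation
    if n < 2:
        return 1
    return fib_counting(n - 1, counter) + fib_counting(n - 2, counter)

for n in (10, 15, 20, 25):
    counter = [0]
    fib_counting(n, counter)
    print(n, counter[0])         # the call count grows exponentially in n
```

In fact, the call count for argument n equals 2·FIBONACCI(n) − 1, so
it grows exactly as fast as the Fibonacci numbers themselves.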
3.4 Obstructions to measurement
Different kinds of obstruction to measurement are encountered in
different categories of scientific research. In the physical sciences,
perhaps the toughest problem is indirectness. For instance, how do you
measure the duration of existence of a particle theoretically supposed
to survive for a millionth of a second, assuming it really exists at
all?
In the biological sciences, limitations on precision are also a
formidable obstacle. If you have ever observed a human birth, you know
how difficult it would be to assign a precise moment of birth. Of
course, most human births are not the subject of scientific research;
still, it is interesting to know how the time of birth is decided upon
in many cases. Shortly after the infant receives the necessary
immediate care, someone says, "Did anyone get the time of birth?", and
someone else says, "Yes, it was xx:xx.", based on a rough estimate.
While it might be possible in a research study to get somewhat more
accurate times than those typically recorded, it would make no sense
to try to record them to, say, the tenth of a second.
In this section, we will concern ourselves mainly with the frontiers
of measurement in the sociological sciences. In addition to severe
problems of semantics, competing special interests, and deliberate
political obfuscation, there are questions of intrinsic measurability
-- that is, whether a putative property can be measured at all.
Tax reform is a good example of a truly tangled issue. How do you
measure the economic outcome on various classes of individuals that
has accrued from an adopted reform? An obvious, oversimplified
approach is to compute the tax burden for prototype individuals under
both the new system and the old. The omitted complication is that when
tax laws change, many individuals and businesses adapt their economic
behavior. These primary adaptations have consequences for other
taxpayers, who also adapt their behavior, in turn affecting other
persons.
Moreover, the act of changing tax laws can cause alterations in social
values, such as risk-taking behavior, charitable giving, and
perception of personal needs. These changes in values can, in turn,
have a major effect upon taxation under the reformed system. Even the
possibility that major changes in tax laws could occur is likely to
make some persons and businesses seek near-term results, instead of
looking to the long run.
Even after tax laws are changed, it is difficult to distinguish
whether subsequent changes in the economy are the result of the tax
changes or of something else. There are times when we just don't have
a trustworthy way to make a measurement, often because so many factors
affect the observable outcome that it is hard to sort out what is due
to a particular change.
For another preliminary example, consider two persons who are largely
in control of what they do at their jobs. Perhaps they are both
primarily responsible for administrative work. They are arguing one
day about who is "busier". One says her calendar is
packed from morning to night with appointments to keep and things that
must be done, and that since there are no breaks in it whatsoever, no
one else could possibly be busier without working an even longer day.
The other counters that her day is so much busier, she is constantly
having extemporaneous meetings and taking phone calls, and she doesn't
even have the time for "non-productive" activities like making
schedules to protect herself from interruptions. Underlying this
slightly frivolous dispute are two competing theories of measurement
of busyness. The first administrator is suggesting that if someone's
time is completely allocated in advance, then that person is 100%
busy. The second is suggesting that busyness is related to the number
(and also, perhaps, to the intensity) of interruptions. As in the case
of tax reform, we have no resolution to propose.
The Alabama paradox. A less complicated measurement quandary, in the
sense that there are fewer factors involved, is caused by the problem
of apportionment in the House of Representatives of the United States.
To give a simplified illustration of a phenomenon known as the
"Alabama paradox", we will reduce it to a model with four states,
known as A, B, C, and D, a population of 100 million persons, and 30
representatives to be apportioned. The following table gives the
population of each state, and its exact proportion of the population,
expressed as a decimal fraction to three places:
State     Population     Proportion
A           4,500,000       .045
B          10,700,000       .107
C          31,100,000       .311
D          53,700,000       .537
Totals:   100,000,000      1.000
Since there are 30 representatives, the first step in calculating the
fair share for each state is to multiply its proportion of the total
population by 30. The next table shows the result of this calculation:
State     Proportion    Exact Fair Share
A            .045             1.35
B            .107             3.21
C            .311             9.33
D            .537            16.11
Totals:     1.000            30.00
It is obvious that State A should get at least 1 representative, State
B at least 3, State C at least 9, and State D at least 16, but that
makes only 29 representatives, so there is one left over. What once
seemed obvious to all was that State A should get the remaining
representative, because its remaining fraction of 0.35 is the largest
fraction among the four states. If there were two remaining
representatives after each state got its whole-number share, then the
state with the second largest fractional part -- in this case, State
C, with 0.33 -- would get the second remaining representative, and so
on.
Now imagine that the number of representatives is increased from 30 to
31. You might guess that the additional representative will go to
State C, but what happens is far more surprising. Here is a new table
of exact fair shares, based on 31 representatives:
State     Proportion    Exact Fair Share
A            .045             1.39
B            .107             3.32
C            .311             9.64
D            .537            16.65
Totals:     1.000            31.00
The integer parts of the fair share calculations remain the same as
they were in the 30 representatives case, and they add to a total of
29. However, the two remaining representatives go to State D (with the
highest fractional part, at 0.65) and to State C (with the second
highest fractional part, at 0.64). Thus, increasing the number of
representatives has had the "paradoxical" effect of costing State A
one of its representatives.
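The largest-fraction rule described above is simple enough to sketch
in code (Python here, an illustrative choice; the populations are the
four-state model from the text). Running it with 30 seats and then 31
reproduces the paradox: State A holds 2 seats in the smaller house but
only 1 in the larger one.

```python
import math

# Largest-remainder apportionment, as described in the text: each
# state first gets the whole-number part of its exact fair share,
# and leftover seats go to the largest fractional parts.
def apportion(populations, house_size):
    total = sum(populations.values())
    shares = {s: p / total * house_size for s, p in populations.items()}
    seats = {s: math.floor(q) for s, q in shares.items()}
    leftover = house_size - sum(seats.values())
    by_fraction = sorted(shares, key=lambda s: shares[s] - seats[s],
                         reverse=True)
    for s in by_fraction[:leftover]:
        seats[s] += 1
    return seats

states = {"A": 4_500_000, "B": 10_700_000,
          "C": 31_100_000, "D": 53_700_000}
print(apportion(states, 30))   # State A gets 2 seats
print(apportion(states, 31))   # State A drops to 1 seat: the paradox
```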
When this first occurred in United States political history, the
"victim state" was Alabama, hence the name "Alabama paradox". Later,
Maine was a losing state.
In the 1970's, two mathematicians, M. Balinski and H. P. Young,
devised a different way to achieve "proportional representation", in
which each state would still get at least its whole-number share, and
such that no state would ever lose a representative to the Alabama
paradox. The principal disadvantage of this innovative plan is that
scarcely any non-mathematicians are able to understand how to use it.
Certain countries, notably France, have resolved the problem by
assigning fractional votes to certain representatives. It has also
been argued that the Alabama paradox is not unfair, since it merely
removes a temporary, partially unearned seat from one state and
transfers it to another. In practice, however, it is clear that some
states have ganged up on others and contrived to increase the number
of representatives in such a manner as to deprive particular persons
of their seats.
The Condorcet paradox. Another issue that has arisen in the political
realm presents a deeper problem. Suppose that there are several
candidates for the same office, and that each voter ranks them in the
order of his or her preference. It seems to stand to reason that if
one candidate is a winner, then no other candidate would be preferred
by a majority of the voters. Now suppose there were three candidates,
known as A, B, and C, a population of 100 voters, and that the
following distribution of rank orders was obtained.
Ranking Order    Number of Voters
A > B > C              32
B > C > A              33
C > A > B              35
Total:                100
From this table, it seems to follow that:
1. Candidate C cannot be the winner, since 65% of the voters prefer
Candidate B to C.
2. Candidate B cannot be the winner, since 67% of the voters prefer
Candidate A to B.
3. Candidate A cannot be the winner, since 68% of the voters prefer
Candidate C to A.
This paradox is not an indictment of the concept of rank-order voting.
Under most common vote-for-one-only systems, a winner would emerge,
but not one with a "clear mandate". For instance, if the winner were
determined by plurality on the first ballot, then Candidate C would
win, despite the fact that 65% really would have preferred Candidate
B. The fact that the vote-for-one-only system conceals this preference
does not make it any less a preference of the voters. If the rules
instead called for elimination of the least-favored candidate on each
round until one candidate obtained a majority, then Candidate A would
be eliminated on the first round, and Candidate B would get the 32
votes no longer assignable to Candidate A. Thus, Candidate B would
win, even though 67% of the voters would have preferred Candidate A.
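All three tallies -- the pairwise majorities, the first-ballot plurality, and the elimination round -- can be checked with a short script. This is a minimal sketch; the ballot counts are those of the table above:

```python
# Ballot data from the table above: each key is a preference order,
# each value the number of voters who submitted it.
ballots = {("A", "B", "C"): 32, ("B", "C", "A"): 33, ("C", "A", "B"): 35}

def prefers(x, y):
    """Number of voters who rank candidate x above candidate y."""
    return sum(n for order, n in ballots.items()
               if order.index(x) < order.index(y))

print(prefers("B", "C"))  # 65: a majority prefers B to C
print(prefers("A", "B"))  # 67: a majority prefers A to B
print(prefers("C", "A"))  # 68: a majority prefers C to A -- a cycle

# Plurality (first-choice) tallies: C leads with 35.
firsts = {}
for order, n in ballots.items():
    firsts[order[0]] = firsts.get(order[0], 0) + n
print(firsts)  # {'A': 32, 'B': 33, 'C': 35}

# Eliminating the plurality loser, A, transfers A's 32 ballots to B
# (A's voters rank B second), so B beats C in the runoff, 65 to 35.
```

Every pairwise comparison is lost by a majority, which is exactly the cycle the three numbered statements describe.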
The underlying measurement problem here is as follows: given a rank
order preference list supplied by each voter, how do you determine the
"collective choice"? The area of applied mathematics in which this
problem is studied is called "social choice theory", and one of its
practitioners, Kenneth Arrow, has won a Nobel Prize in Economics for
showing that no method of choosing a winner could simultaneously
satisfy all the reasonable constraints one would wish to impose and
still work in every instance.
Perhaps the strongest aspect of vote-for-one-only systems is that they
are easily understood and difficult to manipulate by voter-bloc
discipline. Under political conditions in which voters commonly feel
that the problem is choosing the "least undesirable candidate", since
no desirable candidate exists, it is sometimes suggested that voters
should have the option of voting "no" against their least-liked
candidate, where a yes counts +1 and a no counts -1. There is both a
"yes-or-no" system, in which one may vote either "yes" for a
most-favored candidate or "no" against a least-liked, but not both,
and a "yes-and-no" system in which one may make both a "yes" vote and
a "no" vote. A prominent advantage of such a system is that
"protest-voters" would not end up "wasting" their votes on obscure
candidates with no chance of winning. Nonetheless, "no"-vote systems
have problems of their own, some of which are explored in the
exercises below.
3.5 Exercises
1. There are many different reasons why a prize might be awarded to a
baseball team for performance in league competition. Identify the type
of scale to which each of the following possible criteria belongs, and
explain your answer.
a) number of games won against other teams in the league
b) zodiac sign of the shortstop
c) distance between the catcher's eyes
d) hair color of the centerfielder
2. Jones has an expensive foreign-made analog watch with a jeweled
movement that gains at most 12 seconds a month. Every month, Jones
resets his watch according to the time announced by the telephone
company time service. Smith has an inexpensive digital watch that
gains at most 2 seconds a year. However, Smith sets her watch five
minutes fast, in order to avoid missing commuter trains. Whose watch
is more precise? Whose watch is more accurate?
3. Consider a two-dimensional square of definite size. Among all
possible transformations that can be made on this square, there are
some that will leave it in a position indistinguishable from its
original position (e.g. rotation in the plane by 90 degrees; a
rotation of 180 degrees around the diagonal of the square, or a
reflection in a two-sided mirror positioned perpendicular
to the square and passing through its middle). We call such
transformations the symmetry transformations of the square, and say
that the more symmetry transformations a system has, the higher its
degree of symmetry. What sort of scale is being used in this sort of
measurement? Under what conditions would it be sensible to compare the
degrees of symmetry of two systems?
4. A conversation is taking place between two members of your college
class, and they are talking about their academic standing when they
were in high school. One student tells the other that she was 2nd in
her graduating class, and the other student remarks "Oh, well, you're
about three times as smart as I am; I was only 6th in my class."
Explain what's wrong with the second student's reasoning. Would the
second student be warranted in asserting instead that the first
student had done better than he had in high school?
5. Write a recursive program for Fibonacci numbers. Use your program
to calculate the 5th, 10th, 15th, 20th, 25th, and 30th Fibonacci
numbers. Use a wrist watch to record the time it takes for each
calculation.
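One possible shape for such a program, in Python, is sketched below. It assumes the convention fib(1) = fib(2) = 1; the naive recursion deliberately recomputes subproblems, which is what makes the timing comparison interesting as n grows:

```python
import time

def fib(n):
    """Naive recursive Fibonacci: fib(1) = fib(2) = 1."""
    if n <= 2:
        return 1
    return fib(n - 1) + fib(n - 2)

for n in (5, 10, 15, 20, 25, 30):
    start = time.perf_counter()
    value = fib(n)
    elapsed = time.perf_counter() - start
    print(f"fib({n}) = {value}, computed in {elapsed:.4f} s")
```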
6. The text of this chapter would seem to imply that in deciding
whether or not a particular discipline is or is not scientific,
reproducibility of measurements is what ultimately counts. Do you
think this is a reasonable way to demarcate science from
"non-science", and/or can you think of other measurement criteria
which you think are more important than "reliability" in establishing
knowledge claims? Do you think that the criterion of reliability rules
out the possibility of scientific status for some disciplines in
principle (e.g. economics, psychoanalysis, literary theory)? Please
explain your answers carefully.
7. For the given four-state example, show that an increase from 30
representatives to 32 would not have caused an instance of the Alabama
paradox, but that an increase from 30 to 34 would.
8. Either construct a two-state example of the Alabama paradox, or
prove that it is impossible to do so.
9. Construct a three-state example of the Alabama paradox.
10. A "yes-and-no" voting system can be viewed as a reduction of a
preference ranking system, in which all that matters is who each voter
likes best and least. If voters favoring a second-most popular
candidate cast their "no"-votes against the most popular candidate,
instead of a candidate they actually like least, this can sometimes
tilt the net tally so that the second-most popular candidate wins. In
other words, a "yes-and-no" system can deteriorate into deliberate
manipulation. Construct a 4-candidate example in which this phenomenon
could occur.
3.6 Notes
[1] At M.I.T., literature, art, music, foreign languages, and
history are collectively a single department, while there is finer
differentiation among the sciences and the engineering disciplines.
[2] Deciding which subjects or groups of subjects are comparable
raises a number of difficult questions, and this issue is often left
to experts. This obviously presents problems if these same experts
have a stake in the outcome of a particular research program.
[3] One of these other criteria, of course, would be explanation.
As you read more about models, you might ask yourself how or whether
successful models actually explain, or merely describe the phenomena
being modeled. Do you think that explanation, as opposed to
description, is very important in science, or is description +
prediction the only thing we should be concerned about? Similarly, you
might consider whether you think the predictive capability of certain
physical theories is the defining characteristic of natural science;
that is, consider whether or not you would classify a discipline as
"pseudo-science" if the theories of that discipline were not
predictive.
[4] This is not to say that scientists always comply with this
requirement. Sometimes scientists temporarily conceal a few crucial
details for a number of months, in the interest of keeping a lead over
other researchers, and thereby advancing their own careers. Like
investment counselors, some scientists have been known to protect what
they perceive as their own self-interest.
[5] Named for the 14th-century English philosopher William of
Ockham, this is the maxim that the number of assumptions introduced to
explain something, or the number of entities postulated by a theory,
must not be multiplied beyond necessity. Of course, what constitutes
"necessity" is often far from clear, and the criterion of "simplicity"
in the evaluation of models is notoriously difficult to specify.
[6] You may recognize the name as that of the sequence in which
each number is the sum of the two preceding ones. Named for the
13th-century Italian mathematician Leonardo
Fibonacci ("Leonardo, son of Bonaccio"), these numbers have fascinated
mathematicians and scientists alike. In this case, the program
Fibonacci instructs the computer to simulate the growth of the
variable p(n).
_________________________________________________________________
This document can be found at MendelWeb (http://www.mendelweb.org)