Levels of Measurement
Statisticians commonly distinguish four types or levels of measurement; the same terms may also be used to refer to data measured at each level. The levels of measurement differ both in terms of the meaning of the numbers and in the types of statistics that are appropriate for their analysis.
With nominal data, as the name implies, the numbers function as a name or label and do not have numeric meaning. For instance, you might create a variable for gender, which takes the value 1 if the person is male and 0 if the person is female. The 0 and 1 have no numeric meaning but function simply as labels in the same way that you might record the values as “M” or “F.” There are two main reasons to choose numeric rather than text values to code nominal data: data is more easily processed by some computer systems as numbers, and using numbers bypasses some issues in data entry such as the conflict between upper- and lowercase letters (to a computer, “M” is a different value than “m,” but a person doing data entry may treat the two characters as equivalent). Nominal data is not limited to two categories: for instance, if you were studying the relationship between years of experience and salary in baseball players, you might classify the players according to their primary position by using the traditional system whereby 1 is assigned to pitchers, 2 to catchers, 3 to first basemen, and so on.
If you can’t decide whether data is nominal or some other level of measurement, ask yourself this question: do the numbers assigned to this data represent some quality such that a higher value indicates that the object has more of that quality than a lower value? For instance, is there some quality “gender” which men have more of than women? Clearly not, and the coding scheme would work as well if women were coded as 1 and men as 0. The same principle applies in the baseball example: there is no quality of “baseballness” of which outfielders have more than pitchers. The numbers are merely a convenient way to label subjects in the study, and the most important point is that every position is assigned a distinct value. Another name for nominal data is categorical data, referring to the fact that the measurements place objects into categories (male or female; catcher or first baseman) rather than measuring some intrinsic quality in them. When data can take on only two values, as in the male/female example, it may also be called binary data. This type of data is so common that special techniques have been developed to study it, including logistic regression, which has applications in many fields. Many medical statistics such as the odds ratio and the risk ratio were developed to describe the relationship between two binary variables, because binary variables occur so frequently in medical research.
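The two points made about nominal coding (that text labels invite data-entry inconsistencies, and that the numeric codes are arbitrary) can be illustrated with a minimal sketch. The entries and both coding schemes below are invented for illustration.

```python
from collections import Counter

# Hypothetical raw entries, including inconsistent capitalization and spacing.
raw_entries = ["M", "f", "F ", "m", "F"]

# Normalize the labels so that "m" and "M" are treated as the same category.
cleaned = [entry.strip().upper() for entry in raw_entries]

# Two equally valid nominal codings: the numbers are only labels.
coding_a = {"M": 1, "F": 0}
coding_b = {"M": 0, "F": 1}

counts_a = Counter(coding_a[label] for label in cleaned)
counts_b = Counter(coding_b[label] for label in cleaned)

# The number of women is the same whichever code stands for "F".
women_a = counts_a[coding_a["F"]]
women_b = counts_b[coding_b["F"]]
print(women_a, women_b)
```

Swapping the codes changes nothing about the category counts, which is what we would expect if the numbers carry no numeric meaning.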
Ordinal data refers to data that has some meaningful order, so that higher values represent more of some characteristic than lower values. For instance, in medical practice burns are commonly described by their degree, which describes the amount of tissue damage caused by the burn. A first-degree burn is characterized by redness of the skin, minor pain, and damage to the epidermis only, while a second-degree burn includes blistering and involves the dermis, and a third-degree burn is characterized by charring of the skin and possibly destroyed nerve endings. These categories may be ranked in a logical order: first-degree burns are the least serious in terms of tissue damage, third-degree burns the most serious. However, there is no metric analogous to a ruler or scale to quantify how great the distance between categories is, nor is it possible to determine if the difference between first- and second-degree burns is the same as the difference between second- and third-degree burns.
Many ordinal scales involve ranks: for instance, candidates applying for a job may be ranked by the personnel department in order of desirability as a new hire. We could also rank the U.S. states in order of their population, geographic area, or federal tax revenue. The numbers used for measurement with ordinal data carry more meaning than those used in nominal data, and many statistical techniques have been developed to make full use of the information carried in the ordering, while not assuming any further properties of the scales. For instance, it is appropriate to calculate the median (central value) of ordinal data, but not the mean (which assumes interval data).
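As a small illustration of a statistic suited to ordinal data, the sketch below finds the median of a set of burn degrees. The patient data are hypothetical, and the choice of `median_low` (which always returns a value that actually occurs in the data) is one reasonable option rather than a required one.

```python
from statistics import median_low

# Hypothetical burn degrees for seven patients
# (1 = first degree, 2 = second degree, 3 = third degree).
burn_degrees = [1, 2, 2, 3, 1, 2, 3]

# median_low returns an observed category, so it never reports
# an in-between value such as a "2.5-degree burn".
typical_degree = median_low(burn_degrees)
print(typical_degree)
```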
Interval data has a meaningful order and also has the quality that equal intervals between measurements represent equal changes in the quantity of whatever is being measured. The most common example of interval data is the Fahrenheit temperature scale. If we describe temperature using the Fahrenheit scale, the difference between 10 degrees and 25 degrees (a difference of 15 degrees) represents the same amount of temperature change as the difference between 60 and 75 degrees. Addition and subtraction are appropriate with interval scales: a difference of 10 degrees represents the same amount over the entire scale of temperature. However, the Fahrenheit scale, like all interval scales, has no natural zero point, because 0 on the Fahrenheit scale does not represent an absence of temperature but simply a location relative to other temperatures. Multiplication and division are not appropriate with interval data: there is no mathematical sense in the statement that 80 degrees is twice as hot as 40 degrees. Interval scales are a rarity: in fact it’s difficult to think of another common example. For this reason, the term “interval data” is sometimes used to describe both interval and ratio data (discussed in the next section).
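One way to see why differences are meaningful on an interval scale while ratios are not is to re-express the same temperatures on a second scale. The sketch below uses the standard Fahrenheit-to-Celsius conversion; the function name is chosen here for illustration.

```python
def fahrenheit_to_celsius(degrees_f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (degrees_f - 32) * 5 / 9

# Equal Fahrenheit intervals remain equal after conversion.
gap_low = fahrenheit_to_celsius(25) - fahrenheit_to_celsius(10)
gap_high = fahrenheit_to_celsius(75) - fahrenheit_to_celsius(60)
print(round(gap_low, 2), round(gap_high, 2))

# Ratios, by contrast, depend on where the arbitrary zero point sits.
ratio_f = 80 / 40
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)
print(ratio_f, round(ratio_c, 2))
```

The two 15-degree gaps convert to the same Celsius difference, but the apparent "twice as hot" ratio of 80 to 40 degrees Fahrenheit becomes a ratio of about 6 in Celsius, which shows that the ratio reflects the choice of zero point rather than the temperatures themselves.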
Ratio data has all the qualities of interval data (natural order, equal intervals) plus a natural zero point. Many physical measurements are ratio data: for instance, height, weight, and age all qualify. So does income: you can certainly earn 0 dollars in a year, or have 0 dollars in your bank account. With ratio-level data, it is appropriate to multiply and divide as well as add and subtract: it makes sense to say that someone with $100 has twice as much money as someone with $50, or that a person who is 30 years old is 3 times as old as someone who is 10 years old.
It should be noted that very few psychological measurements (IQ, aptitude, etc.) are truly interval, and many are in fact ordinal (e.g., value placed on education, as indicated by a Likert scale). Nonetheless, you will sometimes see interval or ratio techniques applied to such data (for instance, the calculation of means, which involves division). Although this practice is incorrect from a strict statistical point of view, sometimes you have to go with the conventions of your field, or at least be aware of them. To put it another way, part of learning statistics is learning what is commonly accepted in your chosen field of endeavor, which may be a separate issue from what is acceptable from a purely mathematical standpoint.
Continuous and Discrete Data
Another distinction often made is that between continuous and discrete data. Continuous data can take any value, or any value within a range. Most data measured by interval and ratio scales, other than that based on counting, is continuous: for instance, weight, height, distance, and income are all continuous.
In the course of data analysis and model building, researchers sometimes recode continuous data in categories or larger units. For instance, weight may be recorded in pounds but analyzed in 10-pound increments, or age recorded in years but analyzed in terms of the categories 0–17, 18–65, and over 65. From a statistical point of view, there is no absolute point when data become continuous or discrete for the purposes of using particular analytic techniques: if we record age in years, we are still imposing discrete categories on a continuous variable. Various rules of thumb have been proposed: for instance, some researchers say that when a variable has 10 or more categories (or alternatively, 16 or more categories), it can safely be analyzed as continuous. This is another decision to be made on a case-by-case basis, informed by the usual standards and practices of your particular discipline and the type of analysis proposed.
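The recoding described above can be sketched in a few lines. The function names and sample values are hypothetical; the age categories and the 10-pound increment follow the examples in the text.

```python
def age_category(age_years):
    """Map an age in whole years to one of three analysis categories."""
    if age_years < 0:
        raise ValueError("age cannot be negative")
    if age_years <= 17:
        return "0-17"
    if age_years <= 65:
        return "18-65"
    return "over 65"


def weight_bin(weight_lb, width=10):
    """Return the lower edge of the bin (of the given width) containing weight_lb."""
    return (weight_lb // width) * width


# Hypothetical recorded values.
ages = [4, 17, 18, 42, 65, 66, 90]
weights = [118, 120, 127.5, 131]

print([age_category(a) for a in ages])
print([weight_bin(w) for w in weights])
```

Note that the boundaries themselves (does 65 belong with 18–65 or with the older group?) are a coding decision that should be documented, since different choices can change the results of an analysis.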
Discrete data can only take on particular values, and has clear boundaries. As the old joke goes, you can have 2 children or 3 children, but not 2.37 children, so “number of children” is a discrete variable. In fact, any variable based on counting is discrete, whether you are counting the number of books purchased in a year or the number of prenatal care visits made during a pregnancy. Nominal data is also discrete, as are binary and rank-ordered data.
Proxy Measurement
The term proxy measurement refers to the process of substituting one measurement for another. Although deciding on proxy measurements can be considered as a subclass of operationalization, we will consider it as a separate topic. The most common use of proxy measurement is that of substituting a measurement that is inexpensive and easily obtainable for a different measurement that would be more difficult or costly, if not impossible, to collect.
For a simple example of proxy measurement, consider some of the methods used by police officers to evaluate the sobriety of individuals while in the field. Lacking a portable medical lab, an officer can’t directly measure blood alcohol content to determine if a subject is legally drunk or not. So the officer relies on observation of signs associated with drunkenness, as well as some simple field tests that are believed to correlate well with blood alcohol content. Signs of alcohol intoxication include breath smelling of alcohol, slurred speech, and flushed skin. Field tests used to quickly evaluate alcohol intoxication generally require the subjects to perform tasks such as standing on one leg or tracking a moving object with their eyes. Neither the observed signs nor the performance measures are direct measures of inebriation, but they are quick and easy to administer in the field. Individuals suspected of drunkenness as evaluated by these proxy measures may then be subjected to more accurate testing of their blood alcohol content.
Another common (and sometimes controversial) use of proxy measurement is found in the various methods used to evaluate the quality of health care provided by hospitals or physicians. Theoretically, it would be possible to get a direct measure of quality of care, for instance by directly observing the care provided and evaluating it in relationship to accepted standards (although that process would still be an operationalization of the abstract concept “quality of care”). However, implementing such a process would be prohibitively expensive as well as an invasion of the patients’ privacy. A solution commonly adopted is to measure processes that are assumed to reflect higher quality of care: for instance, whether anti-tobacco counseling was offered in an office visit or whether appropriate medications were administered promptly after a patient was admitted to the hospital.
Proxy measurements are most useful if, in addition to being relatively easy to obtain, they are good indicators of the true focus of interest. For instance, if correct execution of prescribed processes of medical care for a particular treatment is closely related to good patient outcomes for that condition, and if poor or nonexistent execution of those processes is closely related to poor patient outcomes, then execution of these processes is a useful proxy for quality. If that close relationship does not exist, then the usefulness of measurements of those processes as a proxy for quality of care is less certain. There is no mathematical test that will tell you whether one measure is a good proxy for another, although computing statistics like correlations or chi-squares between the measures may help evaluate this issue. Like many measurement issues, choosing good proxy measurements is a matter of judgment informed by knowledge of the subject area, usual practices in the field, and common sense.
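As a rough illustration of the kind of statistic mentioned above, the sketch below computes a Pearson chi-square statistic (without continuity correction) for a 2×2 table relating a process measure to patient outcomes. The counts are invented purely for illustration and do not come from any study.

```python
# Hypothetical 2x2 table of counts (invented for illustration only).
# Rows: recommended process followed (yes, no)
# Columns: patient outcome (good, poor)
observed = [
    [40, 10],
    [20, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (observed - expected)^2 / expected over all four cells, where the
# expected count assumes the process and the outcome are unrelated.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

print(round(chi_square, 2))
```

For these made-up counts the statistic is about 16.67, well above 3.84, the conventional 0.05 critical value for one degree of freedom. Even so, a strong association of this kind would only support, not prove, the claim that the process measure is a good proxy for quality.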
True and Error Scores
We can safely assume that no measurement is completely accurate. Because the process of measurement involves assigning discrete numbers to a continuous world, even measurements conducted by the best-trained staff using the finest available scientific instruments are not completely without error. One concern of measurement theory is conceptualizing and quantifying the degree of error present in a particular set of measurements, and evaluating the sources and consequences of that error.
Classical measurement theory conceives of any measurement or observed score as consisting of two parts: true score, and error. This is expressed in the following formula:
$$X = T + E$$
where X is the observed measurement, T is the true score, and E is the error. For instance, the bathroom scale might measure someone’s weight as 120 pounds, when that person’s true weight was 118 pounds and the error of 2 pounds was due to the inaccuracy of the scale. This would be expressed mathematically as:
$$120 = 118 + 2$$
which is simply a mathematical equality expressing the relationship between the three components. However, both T and E are hypothetical constructs: in the real world, we never know the precise value of the true score and therefore cannot know the value of the error score, either. Much of the process of measurement involves estimating both quantities and maximizing the true component while minimizing error. For instance, if we took a number of measurements of body weight in a short period of time (so that true weight could be assumed to have remained constant), using the most accurate scales available, we might accept the average of all the measurements as a good estimate of true weight. We would then consider the difference between this average and each individual measurement as the error due to the measurement process, such as slight inaccuracies in each scale.
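The estimation procedure just described can be sketched as follows. The five readings are hypothetical, and treating their mean as the true score is the assumption made in the text, not a guarantee.

```python
from statistics import mean

# Hypothetical repeated weighings (in pounds) taken over a short period.
observed = [119.0, 122.0, 118.5, 120.5, 120.0]

# Estimate the true score T as the average of the observed scores X.
estimated_true = mean(observed)

# Estimate each error E as X minus the estimated T.
estimated_errors = [x - estimated_true for x in observed]

print(estimated_true)
print(estimated_errors)
```

By construction, errors estimated this way sum to zero, so this approach can describe the spread of the measurements but cannot reveal an error that affects every reading in the same direction.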
Random and Systematic Error
Because we live in the real world rather than a Platonic universe, we assume that all measurements contain some error. But not all error is created equal. Random error is due to chance: it takes no particular pattern and is assumed to cancel itself out over repeated measurements. For instance, the error scores over a number of measurements of the same object are assumed to have a mean of zero. So if someone is weighed 10 times in succession on the same scale, we may observe slight differences in the number returned to us: some will be higher than the true value, and some will be lower. Assuming the true weight is 120 pounds, perhaps the first measurement will return an observed weight of 119 pounds (including an error of −1 pound), the second an observed weight of 122 pounds (for an error of +2 pounds), the third an observed weight of 118.5 pounds (an error of −1.5 pounds) and so on. If the scale is accurate and the only error is random, the average error over many trials will approach zero, and the average observed weight will approach 120 pounds. We can strive to reduce the amount of random error by using more accurate instruments, training our technicians to use them correctly, and so on, but we cannot expect to eliminate random error entirely.
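A small simulation can make the averaging-out of random error concrete. This is a sketch under stated assumptions: the error is modeled as normally distributed with a standard deviation of 1.5 pounds, a figure chosen here only for illustration, and the random seed is fixed so the run is repeatable.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the simulation is repeatable

TRUE_WEIGHT = 120.0  # true score, as in the example above
ERROR_SD = 1.5       # assumed spread of the random error, in pounds


def weigh(times):
    """Simulate repeated weighings: true weight plus zero-mean random error."""
    return [TRUE_WEIGHT + random.gauss(0, ERROR_SD) for _ in range(times)]


few = weigh(10)
many = weigh(10_000)

# Average error in each set of weighings.
print(mean(few) - TRUE_WEIGHT)
print(mean(many) - TRUE_WEIGHT)
```

With only 10 weighings the average error can still be noticeably different from zero, while with many weighings it shrinks toward zero, which is the sense in which random error "cancels itself out."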
Two other conditions are assumed to apply to random error: it must be unrelated to the true score, and the correlation between errors is assumed to be zero. The first condition means that the value of the error component is not related to the value of the true score. If we measured the weights of a number of different individuals whose true weights differed, we would not expect the error component to have any relationship to their true weights. For instance, the error component should not systematically be larger when the true weight is larger. The second condition means that the error for each score is independent and unrelated to the error for any other score: for instance, there should not be a pattern of the size of error increasing over time (which might indicate that the scale was drifting out of calibration).
In contrast, systematic error has an observable pattern, is not due to chance, and often has a cause or causes that can be identified and remedied. For instance, the scale might be incorrectly calibrated to show a result that is five pounds over the true weight, so the average of the above measurements would be 125 pounds, not 120. Systematic error can also be due to human factors: perhaps we are reading the scale’s display at an angle so that we see the needle as registering five pounds higher than it is truly indicating. A scale drifting higher (so the error components are random at the beginning of the experiment, but later on are consistently high) is another example of systematic error. A great deal of effort has been expended to identify sources of systematic error and devise methods to identify and eliminate them: this is discussed further in the upcoming section on measurement bias.
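The contrast with random error can be sketched by adding a constant calibration offset, and separately a steady drift, to a set of readings. The random-error values and the drift rate are hypothetical; the five-pound offset follows the example in the text.

```python
from statistics import mean

TRUE_WEIGHT = 120.0
CALIBRATION_OFFSET = 5.0  # scale reads five pounds over the true weight
DRIFT_PER_READING = 0.5   # hypothetical drift added with each successive reading

# Hypothetical random errors that happen to average to zero.
random_errors = [-1.0, 2.0, -1.5, 0.5, 0.0]

accurate = [TRUE_WEIGHT + e for e in random_errors]
miscalibrated = [x + CALIBRATION_OFFSET for x in accurate]
drifting = [x + DRIFT_PER_READING * i for i, x in enumerate(accurate)]

print(mean(accurate), mean(miscalibrated), mean(drifting))
```

Averaging removes the random component but leaves the offset untouched: the miscalibrated readings average 125 pounds rather than 120, and the drifting readings average higher than the true weight because the later errors are consistently positive.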
Reliability and Validity
There are many ways to assign numbers or categories to data, and not all are equally useful. Two standards we use to evaluate measurements are reliability and validity. Ideally, every measure we use should be both reliable and valid. In reality, these qualities are not absolutes but are matters of degree and often specific to circumstance: a measure that is highly reliable when used with one group of people may be unreliable when used with a different group, for instance. For this reason it is more useful to evaluate how valid and reliable a measure is for a particular purpose and whether the levels of reliability and validity are acceptable in the context at hand.
Reliability refers to how consistent or repeatable measurements are. For instance, if we give the same person the same test on two different occasions, will the scores be similar on both occasions? If we train three people to use a rating scale designed to measure the quality of social interaction among individuals, then show each of them the same film of a group of people interacting and ask them to evaluate the social interaction exhibited in the film, will their ratings be similar? If we have a technician measure the same part 10 times, using the same instrument, will the measurements be similar each time? In each case, if the answer is yes, we can say the test, scale, or instrument is reliable.
Much of the theory and practice of reliability was developed in the field of educational psychology, and for this reason, measures of reliability are often described in terms of evaluating the reliability of tests. But considerations of reliability are not limited to educational testing: the same concepts apply to many other types of measurements including opinion polling, satisfaction surveys, and behavioral ratings.
There are three primary approaches to measuring reliability, each useful in particular contexts and each having particular advantages and disadvantages:
- Multiple-occasions reliability
- Multiple-forms reliability
- Internal consistency reliability
Multiple-occasions reliability, sometimes called test-retest reliability, refers to how similarly a test or scale performs over repeated testings. For this reason it is sometimes referred to as an index of temporal stability, meaning stability over time. For instance, we might have the same person do a psychological assessment of a patient based on a videotaped interview, with the assessments performed two weeks apart based on the same taped interview. For this type of reliability to make sense, you must assume that the quantity being measured has not changed: hence the use of the same videotaped interview, rather than separate live interviews with a patient whose state may have changed over the two-week period. Multiple-occasions reliability is not a suitable measure for volatile qualities, such as mood state. It is also unsuitable if the focus of measurement may have changed over the time period between tests (for instance, if the student learned more about a subject between the testing periods) or may be changed as a result of the first testing (for instance, if a student remembers what questions were asked on the first test administration). A common technique for assessing multiple-occasions reliability is to compute the correlation coefficient between the scores from each occasion of testing: this is called the coefficient of stability.
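The coefficient of stability can be sketched as an ordinary Pearson correlation between the scores from two occasions. The scores below are hypothetical, and the `pearson` helper is written out here rather than taken from a particular library.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x, mean_y = mean(xs), mean(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)


# Hypothetical scores for six people tested on two occasions.
first_occasion = [12, 15, 9, 20, 17, 11]
second_occasion = [13, 14, 10, 19, 18, 10]

stability = pearson(first_occasion, second_occasion)
print(round(stability, 3))
```

A value near 1 indicates that people kept roughly the same relative standing across the two occasions. What counts as an acceptable value depends on the field and the purpose of the measurement.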
Multiple-forms reliability (also called parallel-forms reliability) refers to how similarly different versions of a test or questionnaire perform in measuring the same entity. A common type of multiple-forms reliability is split-half reliability, in which a pool of items believed to be homogeneous is created and half the items are allocated to form A and half to form B. If the two (or more) forms of the test are administered to the same people on the same occasion, the correlation between the scores received on each form is an estimate of multiple-forms reliability. This correlation is sometimes called the coefficient of equivalence. Multiple-forms reliability is important for standardized tests that exist in multiple versions: for instance, different forms of the SAT (Scholastic Aptitude Test, used to measure academic ability among students applying to American colleges and universities) are calibrated so the scores achieved are equivalent no matter which form is used.
Internal consistency reliability refers to how well the items that make up a test reflect the same construct. To put it another way, internal consistency reliability measures how much the items on a test are measuring the same thing. This type of reliability may be assessed by administering a single test on a single occasion. Internal consistency reliability is a more complex quantity to measure than multiple-occasions or parallel-forms reliability, and several different methods have been developed to evaluate it. However, all depend primarily on the inter-item correlation, i.e., the correlation of each item on the scale with each other item. If such correlations are high, that is interpreted as evidence that the items are measuring the same thing and the various statistics used to measure internal consistency reliability will all be high. If the inter-item correlations are low or inconsistent, the internal consistency reliability statistics will be low and this is interpreted as evidence that the items are not measuring the same thing.
Two simple measures of internal consistency that are most useful for tests made up of multiple items covering the same topic, of similar difficulty, and that will be scored as a composite, are the average inter-item correlation and average item-total correlation. To calculate the average inter-item correlation, we find the correlation between each pair of items and take the average of all the correlations. To calculate the average item-total correlation, we create a total score by adding up scores on each individual item on the scale, then compute the correlation of each item with the total. The average item-total correlation is the average of those individual item-total correlations.
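Both calculations can be sketched for a small, hypothetical set of responses (five respondents answering four items). The data and the `pearson` helper are illustrative assumptions. As in the description above, each item is correlated with a total that includes the item itself.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)


# Hypothetical responses: one row per respondent, one column per item.
responses = [
    [4, 5, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]

items = list(zip(*responses))            # one tuple of scores per item
totals = [sum(row) for row in responses]  # total score per respondent

# Average inter-item correlation: mean correlation over every pair of items.
inter_item = [pearson(a, b) for a, b in combinations(items, 2)]
avg_inter_item = mean(inter_item)

# Average item-total correlation: mean correlation of each item with the total.
item_total = [pearson(item, totals) for item in items]
avg_item_total = mean(item_total)

print(round(avg_inter_item, 2), round(avg_item_total, 2))
```

With four items there are six item pairs, so the first average is taken over six correlations and the second over four.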