Reliability refers to how dependably or consistently a test measures a characteristic. If a person takes the test again, will he or she get a similar test score, or a much different score? A test that yields similar scores for a person who repeats the test is said to measure a characteristic reliably.
How do we account for an individual who does not get exactly the same test score every time he or she takes the test? Some possible reasons are the following:
- Test taker’s temporary psychological or physical state. Test performance can be influenced by a person’s psychological or physical state at the time of testing. For example, differing levels of anxiety, fatigue, or motivation may affect the test taker’s results.
- Environmental factors. Differences in the testing environment, such as room temperature, lighting, noise, or even the test administrator, can influence an individual’s test performance.
- Test form. Many tests have more than one version or form. Items differ on each form, but each form is supposed to measure the same thing. Different forms of a test are known as parallel forms or alternate forms. These forms are designed to have similar measurement characteristics, but they contain different items. Because the forms are not exactly the same, a test taker might do better on one form than on another.
- Multiple raters. In certain tests, scoring is determined by a rater’s judgments of the test taker’s performance or responses. Differences in training, experience, and frame of reference among raters can produce different test scores for the test taker.
Types of reliability estimates
There are several types of reliability estimates, each influenced by different sources of measurement error. Test developers are responsible for reporting the reliability estimates that are relevant for a particular test. Before deciding to use a test, read the test manual and any independent reviews to determine whether its reliability is acceptable. The acceptable level of reliability will differ depending on the type of test and the reliability estimate used.
The discussion in Table 2 should help you develop some familiarity with the different kinds of reliability estimates reported in test manuals and reviews.
- Test-retest reliability indicates the repeatability of test scores with the passage of time. This estimate also reflects the stability of the characteristic or construct being measured by the test.
Some constructs are more stable than others. For example, an individual’s reading ability is more stable over a particular period of time than that individual’s anxiety level. Therefore, you would expect a higher test-retest reliability coefficient on a reading test than you would on a test that measures anxiety.
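To make the idea concrete, a test-retest coefficient is commonly computed as the correlation between scores from two administrations of the same test. The following is a minimal sketch, assuming the Pearson correlation is the estimate used; the function name and the scores are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for five people tested twice, several weeks apart.
first_administration = [82, 75, 91, 68, 88]
second_administration = [80, 78, 93, 65, 85]

test_retest_r = pearson_r(first_administration, second_administration)
print(round(test_retest_r, 2))
```

A coefficient near 1 means people kept roughly the same rank order across the two sessions. The same calculation can be applied to scores from two forms of a test, where it is read as a parallel form estimate instead.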
- Alternate or parallel form reliability indicates how consistent test scores are likely to be if a person takes two or more forms of a test.
A high parallel form reliability coefficient indicates that the different forms of the test are very similar, which means that it makes virtually no difference which version of the test a person takes. On the other hand, a low parallel form reliability coefficient suggests that the different forms are probably not comparable; they may be measuring different things and therefore cannot be used interchangeably.
- Inter-rater reliability indicates how consistent test scores are likely to be if the test is scored by two or more raters.
On some tests, raters evaluate responses to questions and determine the score. Differences in judgments among raters are likely to produce variations in test scores. A high inter-rater reliability coefficient indicates that the judgment process is stable and the resulting scores are reliable.
Inter-rater reliability coefficients are typically lower than other types of reliability estimates. However, it is possible to obtain higher levels of inter-rater reliability if raters are appropriately trained.
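As an illustration of how agreement between raters can be quantified, the sketch below computes Cohen's kappa for two raters who each assign a category to the same set of responses. Kappa is only one of several statistics a test manual might report for inter-rater reliability, and the ratings here are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length lists of ratings")
    n = len(rater_a)
    # Proportion of responses on which the raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail judgments by two raters on eight responses.
rater_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))
```

In this example the raters agree on six of eight responses, but kappa is noticeably lower than 0.75 because some of that agreement would be expected by chance alone.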
- Internal consistency reliability indicates the extent to which items on a test measure the same thing.
A high internal consistency reliability coefficient for a test indicates that the items on the test are very similar to each other in content (homogeneous). It is important to note that the length of a test can affect internal consistency reliability. For example, a very long test can spuriously inflate the reliability coefficient, because adding items tends to raise the coefficient even when the items are no more closely related to one another.
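The sketch below shows one widely used internal consistency estimate, Cronbach's alpha, along with the Spearman-Brown formula, which projects how reliability changes when a test is lengthened. Both are offered as examples rather than as the estimates any particular manual reports. The item scores are hypothetical, and the Spearman-Brown projection assumes the added items behave like the existing ones.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores holds one list of item scores per person."""
    if len(item_scores) < 2 or len(item_scores[0]) < 2:
        raise ValueError("need at least 2 test takers and 2 items")
    k = len(item_scores[0])
    # Transpose so each tuple holds every person's score on one item.
    items = list(zip(*item_scores))
    sum_item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical responses: five people, four items scored 1 to 5.
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 3],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 2, 1],
]

alpha = cronbach_alpha(responses)
doubled = spearman_brown(0.60, 2)  # a test with reliability 0.60, doubled in length
print(round(alpha, 2), round(doubled, 2))
```

The second function illustrates the point about test length: under the stated assumption, doubling a test with a reliability of 0.60 raises the projected coefficient to 0.75 without the items becoming any more similar in content.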
Tests that measure multiple characteristics are usually divided into distinct components. Manuals for such tests typically report a separate internal consistency reliability coefficient for each component in addition to one for the whole test.
Test manuals and reviews report several kinds of internal consistency reliability estimates. Each type of estimate is appropriate under certain circumstances. The test manual should explain why a particular estimate is reported.