Reliability and Validity

We often think of reliability and validity as separate ideas but, in fact, they are related to each other. The relationship between reliability and validity can be illustrated with the target shown here. Think of the center of the target as the concept we are trying to measure. Imagine that for each person we are measuring, we are taking a shot at the target. If we measure the concept perfectly for a person, we are hitting the center of the target. If we do not, we are missing the center. The more we are off for that person, the further we are from the center.

The above figure shows four possible situations. In the first one, we are hitting the target consistently, but we are missing the center of the target. That is, we are consistently and systematically measuring the wrong value for all respondents. This measure is reliable, but not valid. (It is consistent but wrong.) The second shows hits that are randomly spread across the target. We seldom hit the center of the target but, on average, we are getting the right answer for the group (but not very well for individuals). In this case, we get a valid group estimate, but we are inconsistent.

Here, we can clearly see that reliability is directly related to the variability of our measure. The third scenario shows a case where our hits are spread across the target and we are consistently missing the center. Our measure in this case is neither reliable nor valid. Finally, the figure shows the Robin Hood (or Mahabharat's Arjun) scenario: we consistently hit the center of the target. Our measure is both reliable and valid.
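The four scenarios can be imitated with a small simulation. The Python sketch below is illustrative only: the true value of 50, the bias of 10, and the spreads of 1 and 10 are assumed numbers, not figures from any study, and the function names are hypothetical. It treats a lack of validity as a systematic offset of the average shot from the center, and a lack of reliability as a large spread of the shots.

```python
import random
import statistics

TRUE_VALUE = 50.0  # assumed "center of the target" (illustrative number)

def simulate(bias, spread, n=1000, seed=0):
    """Draw n measurements of TRUE_VALUE with a systematic bias and a random spread."""
    rng = random.Random(seed)
    return [TRUE_VALUE + bias + rng.gauss(0, spread) for _ in range(n)]

def describe(shots):
    """Return (offset of the mean from the true value, standard deviation of the shots)."""
    return statistics.mean(shots) - TRUE_VALUE, statistics.stdev(shots)

# Assumed settings: bias 10 = systematically off center, spread 10 = inconsistent.
scenarios = {
    "reliable, not valid": simulate(bias=10, spread=1),
    "valid on average, not reliable": simulate(bias=0, spread=10),
    "neither reliable nor valid": simulate(bias=10, spread=10),
    "reliable and valid": simulate(bias=0, spread=1),
}

for name, shots in scenarios.items():
    offset, sd = describe(shots)
    print(f"{name:32s} mean offset = {offset:6.2f}  spread = {sd:5.2f}")
```

A small spread corresponds to tightly grouped hits, and a mean offset near zero corresponds to hits centered on the target. In the second scenario the mean offset is small even though individual shots are far from the center, which matches the idea of a valid group estimate that is poor for individuals.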

Reliability

Reliability can be defined as the degree of consistency between two measures of the same thing (Black, 2002: 81). Reliability means many things to many people, but in most contexts, the notion of consistency emerges. A measure is reliable to the degree that it supplies consistent results. Reliability is a necessary contributor to validity but is not a sufficient condition for validity.

If we were to take another sort of measuring device, a ruler, how sure can we be that it is always a reliable measure? If it is made of metal, does it expand in extreme heat and therefore give different readings on hot and cold days? Alternatively, we might use it on two different days with similar temperatures, but do we mark off the measurement of a line on a piece of paper with the same degree of care and accuracy? For a research tool to be reliable, we would expect it to give us the same results when something was measured yesterday and today.

Reliability is concerned with estimates of the degree to which a measurement is free of random or unstable error. Reliable instruments can be used with confidence that transient and situational factors are not interfering. Reliable instruments are robust; they work well at different times under different conditions. This distinction of time and condition is the basis for three frequently used perspectives on reliability: stability, equivalence, and internal consistency.

1. Stability
A measure is said to possess stability if you can secure consistent results with repeated measurements of the same person with the same instrument. An observational procedure is stable if it gives the same reading on a particular person when repeated one or more times. It is often possible to repeat observations on a subject and to compare them for consistency. When there is much time between measurements, there is a chance for situational factors to change, thereby affecting the observations. The change would appear incorrectly as a drop in the reliability of the measurement process.

Stability measures compare the scores achieved on the same test on two different occasions. Any difference is called subject error. For example, a survey of employee attitudes towards their workplace may yield different results if taken on a Monday than on a Friday. To avoid this, the survey should be taken at a more neutral time of the week.
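One common way to quantify this comparison is to correlate the two sets of scores. The sketch below is a minimal illustration, assuming made-up attitude scores for six employees surveyed on a Monday and again on a Friday; the data and the helper name pearson are hypothetical, and the Pearson correlation is one possible choice of index rather than the only one.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical attitude scores for the same six employees on two occasions.
monday = [3.1, 4.0, 2.5, 3.8, 4.4, 2.9]
friday = [3.6, 4.3, 3.1, 4.1, 4.6, 3.5]

r = pearson(monday, friday)
mean_shift = sum(f - m for m, f in zip(monday, friday)) / len(monday)

print(f"correlation between occasions = {r:.3f}")
print(f"average Friday minus Monday   = {mean_shift:.3f}")
```

The correlation shows whether people keep their relative positions across occasions, but it does not detect a shift that affects everyone equally. Reporting the average difference alongside it makes a day-of-week effect of the kind described above visible.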

2. Equivalence
A second perspective on reliability considers how much error may be introduced by different investigators (in observation) or different samples of items being studied (in questioning or scales). Thus, while stability is concerned with personal and situational fluctuations from one time to another, equivalence is concerned with variations at one point in time among observers and samples of items. A good way to test for the equivalence of measurement by different observers is to compare their scoring for the same event. An example of this is the scoring of Olympic figure skaters by a panel of judges.

In studies where a consensus among experts or observers is required, the similarity of the judges' perceptions is sometimes questioned. How does a panel of supervisors render a judgment on merit raises, a new product's packaging, or future economic trends? Inter-rater reliability may be used in these cases to correlate the observations or scores of the judges and render an index of how consistent their ratings are. In Olympic figure skating, a judge's relative positioning of skaters (by establishing a rank order for each judge and comparing each judge's ordering for all skaters) is a means of measuring equivalence.
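Comparing the judges' rank orders can be sketched with a rank correlation. The example below is a minimal illustration, assuming three hypothetical judges who each rank the same five skaters without ties; the rankings are invented, and Spearman's rank correlation is used here as one reasonable index of agreement, not as the method any particular panel applies.

```python
from itertools import combinations

def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two complete rankings of the same items (no ties)."""
    n = len(ranks_a)
    expected = list(range(1, n + 1))
    if n < 2 or sorted(ranks_a) != expected or sorted(ranks_b) != expected:
        raise ValueError("each ranking must be a permutation of 1..n with n >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical rank orders: position i holds the rank a judge gave to skater i.
judges = {
    "Judge A": [1, 2, 3, 4, 5],
    "Judge B": [2, 1, 3, 4, 5],
    "Judge C": [1, 3, 2, 5, 4],
}

agreement = {}
for (name_1, ranks_1), (name_2, ranks_2) in combinations(judges.items(), 2):
    agreement[(name_1, name_2)] = spearman_rho(ranks_1, ranks_2)
    print(f"{name_1} vs {name_2}: rho = {agreement[(name_1, name_2)]:.2f}")
```

Values near 1 indicate that two judges order the skaters almost identically, while values near 0 indicate little agreement. With tied ranks this shortcut formula no longer applies and a correlation computed on averaged ranks would be needed instead.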

3. Internal consistency
The third approach to reliability uses only one administration of an instrument or test to assess the internal consistency or homogeneity among the items. The split-half technique can be used when the measuring tool has many similar questions or statements to which the subject can respond (Sekaran, 1992: 174). The instrument is administered and the results are separated by item into even and odd numbers or into randomly selected halves. When the two halves are correlated, if the results of the correlation are high, the instrument is said to have high reliability in an internal consistency sense. The high correlation tells us there is similarity (or homogeneity) among the items. The potential for incorrect inferences about high internal consistency exists when the test contains many items, which inflates the correlation index.
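The odd-even version of the split-half technique can be sketched as follows. This is a minimal illustration, assuming invented responses from six people to a six-item scale scored 1 to 5; the data and the helper names are hypothetical, and the Pearson correlation between the half scores is used as the index of internal consistency.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half(responses):
    """Correlate each respondent's total on odd-numbered items with the total on even-numbered items."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson(odd_totals, even_totals)

# Hypothetical data: one row per respondent, one column per item (scale 1 to 5).
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 5, 5, 4, 5, 5],
    [1, 2, 1, 2, 1, 1],
    [3, 4, 3, 3, 4, 4],
]

half_correlation = split_half(responses)
print(f"odd-even split-half correlation = {half_correlation:.3f}")
```

A high value here suggests that the two halves rank respondents similarly, which is what homogeneity among the items would produce. Splitting into randomly selected halves, as the text also mentions, would only change how the two column groups are chosen.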

Validity

Basically, to ensure validity, any instrument must measure what was intended. This means that the instrument, as the operational definition, must be logically consistent and cover comprehensively all aspects of the abstract concept to be studied. Ideally, it should be possible to confirm this through alternative, independent observation. The measurement literature traditionally has defined several different types of validity, some of which overlap. The discussion of validity in the literature is littered with controversy.

Validity in this context is the extent to which differences found with a measuring tool reflect true differences among respondents being tested.

We want the measurement tool to be sensitive to all the nuances of meaning in the variable and to changes in nuances of meaning over time. The difficulty in meeting the test of validity is that usually, one does not know what the true differences are. Without direct knowledge of the dimension being studied, we must face the question, "How can one discover validity without directly confirming knowledge?" A quick answer is to seek other relevant evidence that confirms the answers found with the measurement device, but this leads to a second question. "What constitutes relevant evidence?" There is no quick answer this time. What is relevant depends on the nature of the research problem and the researcher's judgment.

The issue of validity, however, is much more complex than this. At a basic level, it can be divided into seven types (Gray, 2004: 91).

1. Internal validity
Internal validity refers to correlation questions (cause and effect) and to the extent to which causal conclusions can be drawn. If we take, for example, an evaluation of the impact of a health education campaign, one group receives the educational material (the experimental group) while one does not (the control group). Possible confounding variables are controlled for, by trying to make sure that participants in each group are of similar ages and educational attainment. Internal validity (the impact of the campaign) may be helped by testing only those who are willing to participate in the experiment.

2. External validity
This is the extent to which it is possible to generalize from the data to a larger population or setting. Clearly, this is important in experimental and quasi-experimental studies where sampling is required and where the potential for generalizing findings is often an issue. As Robson (1993) points out, the argument for generalization can be made either by direct demonstration or by making a case. The problem of generalizing from a study is that cynics can argue that its results are of relevance only to its particular setting.

Direct demonstration, then, involves carrying out further studies involving different participants and in different settings. If the findings can be replicated (often through a series of demonstrations), then the argument for generalizing becomes stronger. Making a case simply involves the construction of a reasoned argument that the findings can be generalized. So, this would set out to show that the group(s) being studied, or the setting or period, share certain essential characteristics with other groups, settings, or periods.

3. Content validity
This applies to validating the content of an achievement test or qualifying examination. This might be carried out by comparing the topic coverage and cognitive emphasis of an examination with the original specifications in a syllabus. Examination boards and organizations that produce standardized tests tend to be very meticulous about such processes, while classroom teachers lack the resources and usually collect questions for tests less systematically.

4. Criterion validity
Criterion-related validity reflects the success of measures used for prediction or estimation. We may want to predict an outcome or estimate the existence of a current behavior or condition. These are predictive and concurrent validity, respectively. They differ only in a time perspective. An opinion questionnaire that correctly forecasts the outcome of a union election has predictive validity. An observational method that correctly categorizes families by current income class has concurrent validity.

While these examples appear to have simple and unambiguous validity criteria, there are difficulties in estimating validity. Consider the problem of estimating family income. There clearly is a knowable true income for every family. However, we may find it difficult to secure this figure. Thus, while the criterion is conceptually clear, it may be unavailable.

5. Construct validity
This is considered to be the most important for research design, since it is concerned with the measurement of abstract concepts and traits, such as intelligence, anxiety, logical reasoning, attitude towards dogs, social class, or perceived efficiency. To a certain degree, the validity of each of these is dependent upon a definition or description of the terminology. How is "anxiety" defined? What constitutes different levels of "perceived efficiency?" In the latter case, it may be that the operational definition is a score on a questionnaire.

6. Predictive validity
This shows how well a test can forecast a future trait such as job performance or attainment. A test that has both construct and content validity is of little use if it fails to identify, say, those who are likely to be "high performers" in a key worker role.

7. Statistical validity
This is the extent to which a study has made use of the appropriate design and statistical methods that will allow it to detect the effects that are present.