 Title    [Research Methods] Reliability Analysis [ 2012-11-28 18:31:14 ]
 I D    robot

Reliability Analysis




        Overview

          Researchers must demonstrate that instruments are reliable since, without reliability, research results using the instrument are not replicable, and replicability is fundamental to the scientific method. Reliability is the correlation of an item, scale, or instrument with a hypothetical one that truly measures what it is supposed to. Since the true instrument is not available, reliability is estimated in one of four ways:

          1. Internal consistency: Estimation based on the correlation among the variables comprising the set (typically, Cronbach's alpha)
          2. Split-half reliability: Estimation based on the correlation of two equivalent forms of the scale (typically, the Spearman-Brown coefficient)
          3. Test-retest reliability: Estimation based on the correlation between two (or more) administrations of the same item, scale, or instrument at different times, locations, or populations, when the two administrations do not differ on other relevant variables (typically, the Spearman-Brown coefficient)
          4. Inter-rater reliability: Estimation based on the correlation of scores between/among two or more raters who rate the same item, scale, or instrument (typically, intraclass correlation, of which there are six types discussed below).
          These four reliability estimation methods are not necessarily mutually exclusive, nor need they lead to the same results. All reliability coefficients are forms of correlation coefficients, but there are multiple types, discussed below, representing different meanings of reliability, and more than one might be used in a single research setting.




      Key Concepts and Terms

      • General

        • Scores are the subject's responses to items on an instrument (ex., a mail questionnaire). Observed scores may be broken down into two components: the true score (commonly labeled tau) plus the error score. The error score, in turn, can be broken down into systematic error (non-random error reflecting some systematic bias, as due, for instance, to the methodology used -- hence also called method error) and random error (due to random traits of the subjects -- hence also called trait error). The smaller the error component in relation to the true score component, the higher the reliability of an item, which is the ratio of the true score to the total (true + error) score.

          • Number of scale items. Note that the larger the number of items added together in a scale, the less random error matters as it will be self-cancelling (think of weighing a subject on 100 different scales and averaging rather than on using just one scale), and therefore some reliability coefficients (such as Cronbach's alpha) also compute higher reliability when the number of scale items is higher.
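This self-cancelling of random error can be illustrated with a small simulation (illustrative Python with made-up parameters; function name and numbers are hypothetical, not from any reliability package): the mean of many noisy items correlates more strongly with the subject's true score than a single item does.

```python
import random

def observed_true_correlation(n_items, n_subjects=500, noise_sd=1.0, seed=0):
    """Toy simulation: each subject has a true score, and each item
    observes it plus independent random error.  Returns the Pearson
    correlation between the mean of the items and the true score."""
    rng = random.Random(seed)
    true = [rng.gauss(0, 1) for _ in range(n_subjects)]
    obs = [t + sum(rng.gauss(0, noise_sd) for _ in range(n_items)) / n_items
           for t in true]
    n = n_subjects
    mt, mo = sum(true) / n, sum(obs) / n
    cov = sum((t - mt) * (o - mo) for t, o in zip(true, obs))
    vt = sum((t - mt) ** 2 for t in true)
    vo = sum((o - mo) ** 2 for o in obs)
    return cov / (vt * vo) ** 0.5
```

With these made-up settings, averaging 100 items yields a far higher correlation with the true score than a single item, which is the "100 different scales" intuition above.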
        • Models supported by SPSS under the Models button of the Reliability dialog are:

          1. Alpha (Cronbach). This models internal consistency based on average correlation among items.
          2. Split-half. This model is based on the correlation between the parts of a scale which is split into two forms.
          3. Guttman. This is an alternative split-half model which computes Guttman's lower bounds for true reliability, discussed below.
          4. Parallel. This method uses maximum likelihood to test if all items have equal variances and error variances. Cronbach's alpha is the maximum likelihood estimate of the reliability coefficient when the parallel model is assumed to be true (SPSS, 1988: 873). If the chi-square goodness of fit significance for the parallel model is <=.05, the researcher rejects the null hypothesis that the items have equal variances and error variances in the population.
          5. Strict parallel. This method also uses maximum likelihood to test for equal variances, equal error variances, and equal population means across items.
        • Triangulation is the attempt to increase reliability by reducing systematic (method) error, through a strategy in which the researcher employs multiple methods of measurement (ex., survey, observation, archival data). If the alternative methods do not share the same source of systematic error, examination of data from the alternative methods gives insight into how individual scores may be adjusted to come closer to reflecting true scores, thereby increasing reliability.
        • Calibration is the attempt to increase reliability by increasing homogeneity of ratings through feedback to the raters, when multiple raters are used. Raters meet in calibration meetings to discuss items on which they have disagreed, typically during pretesting of the instrument. The raters seek to reach consensus on rules for rating items (ex., defining the meaning of a "3" for an item dealing with job satisfaction). Calibration meetings should not involve discussion of expected outcomes of the study, as this would introduce bias and undermine validity.

    • Internal consistency reliability

    • http://www.youtube.com/watch?v=DS8Hw0Ort4w

 

  • Cronbach's alpha is the most common form of internal consistency reliability coefficient. Alpha equals zero when the true score is not measured at all and there is only an error component. Alpha equals 1.0 when all items measure only the true score and there is no error component.

 

    • Interpretation: Cronbach's alpha can be interpreted as the percent of variance the observed scale would explain in the hypothetical true scale composed of all possible items in the universe. Alternatively, it can be interpreted as the correlation of the observed scale with all possible other scales measuring the same thing and using the same number of items.

 

  • Cut-off criteria. By convention, a lenient cut-off of .60 is common in exploratory research; alpha should be at least .70 or higher to retain an item in an "adequate" scale; and many researchers require a cut-off of .80 for a "good scale." Cronbach's alpha is discussed further in the section on standard measures and scales, along with other coefficients such as Cohen's kappa.

Cronbach's alpha is arguably the most commonly used metric for evaluating the internal consistency reliability associated with scores derived from a scale. Ask most any researcher and he or she will likely tell you that Cronbach's alpha must be at least .70. Unfortunately, as pointed out by Lance, Butts, and Michels (2006), this often-cited criterion, claimed to have been articulated by Nunnally, is actually misleading. I encourage you to read the Lance et al. (2006) paper. Essentially, Nunnally and Bernstein (1994) state that .70 may be an acceptable minimum for a newly developed scale. By contrast, basic research should rely upon scales that yield scores with a minimum reliability of .80. In cases where important decisions are being made based on scores from a scale, a reliability in excess of .90 should be expected.

 

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
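For concreteness, the internal-consistency computation can be sketched directly from raw item scores. The following is a minimal Python illustration of the textbook formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score); it is a sketch of the formula, not SPSS output.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha from a list of item-score lists (one list per item):
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = len(items)
    n_subjects = len(items[0])
    totals = [sum(item[s] for item in items) for s in range(n_subjects)]
    return (k / (k - 1)) * (1 - sum(variance(i) for i in items) / variance(totals))
```

For perfectly correlated items alpha is 1.0; as inter-item correlation drops, alpha falls toward zero.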


Other characteristics:

  1. Number of items. Note that Cronbach's alpha increases as the number of items in the scale increases, even controlling for the same level of average intercorrelation of items. This assumes, of course, that the added items are not bad items compared to the existing set. Increasing the number of items can be a way to push alpha to an acceptable level. This reflects the assumption that scales and instruments with a greater number of items are more reliable. It also means that comparison of alpha levels between scales with differing numbers of items is not appropriate.

 

  2. Alpha if deleted. SPSS will compute "Cronbach's Alpha if Item Deleted," which is the estimated value of alpha if the given item were removed from the model. The researcher may wish to drop items where the alpha if deleted is higher than the overall alpha as another way to improve the alpha level. Note, however, that when an item has high random error it is possible that it would be removed on this basis when, in fact, it does measure the same construct.

 

  3. The item-total correlation, also part of SPSS output in the Total Correlation column when Item is checked under the Statistics button. This is the Pearsonian correlation of the item with the total of scores on all other items. A low item-total correlation means the item is little correlated with the overall scale (ex., < .3 for large samples or not significant for small samples) and the researcher should consider dropping it. A negative correlation indicates the need to recode the item in the opposite direction. The reliability analysis should be re-run if an item is dropped or recoded. Note a scale with an acceptable Cronbach's alpha may still have one or more items with low item-total correlations.
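The item-total correlation described above can be sketched as follows (illustrative Python; the helper names are arbitrary). Each item is correlated with the sum of all the other items, which is why it is sometimes called the corrected item-total correlation.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def corrected_item_total(items):
    """Correlation of each item with the sum of all *other* items."""
    n = len(items[0])
    result = []
    for i, item in enumerate(items):
        rest = [sum(other[s] for j, other in enumerate(items) if j != i)
                for s in range(n)]
        result.append(pearson(item, rest))
    return result
```

A negative value signals an item that needs reverse coding, as noted above.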

 

  4. The squared multiple correlation, R², is the R² for an item when it is predicted from all other items in the scale. The larger the R², the more the item contributes to internal consistency. The lower the R², the more the researcher should consider dropping the item. Note the R² of some items may be low even on a scale which has an acceptable Cronbach's alpha overall.

 

  5. Negative alphas. Note also that a negative Cronbach's alpha indicates inconsistent coding (see assumptions) or a mixture of items measuring different dimensions, leading to negative inter-item correlations.

 

  6. The Kuder-Richardson (KR20) coefficient is the same as Cronbach's alpha when items are dichotomous. In SPSS, it is available under Scale, Reliability Analysis.
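As a sketch of the KR20 computation (illustrative Python; population variances are used throughout so the ratio matches Cronbach's alpha for 0/1 items):

```python
def kr20(items):
    """Kuder-Richardson 20 for dichotomous (0/1) items, equivalent to
    Cronbach's alpha in this case:
    KR20 = k/(k-1) * (1 - sum(p*q) / var(total))."""
    k = len(items)
    n = len(items[0])
    totals = [sum(it[s] for it in items) for s in range(n)]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n   # population variance
    pq = sum((sum(it) / n) * (1 - sum(it) / n) for it in items)
    return (k / (k - 1)) * (1 - pq / var_t)
```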

 

  • In SPSS, Cronbach's alpha is found under Analyze, Scale, Reliability Analysis. Then in the Statistics button, check Scale to get alpha. You can also check Scale if item deleted, in which case alpha will be computed both for all variables entered, and also for all remaining variables if any one is dropped (the alpha if deleted is listed in a table, one for each variable). That is, the 'scale if item deleted' option lets the researcher assess the reliability of each item.

 

  • Standardized item alpha is the average inter-item correlation when item variances are equal. It is also called the Spearman-Brown stepped-up reliability coefficient or simply the "Spearman-Brown Coefficient," but these terms should not be confused with the Spearman-Brown split-half reliability coefficient discussed below. The difference between Cronbach's alpha and standardized item alpha is a measure of the dissimilarity of variances among items in the set. In a second use, standardized item alpha can be used to estimate the change in reliability as the number of items in an instrument or scale varies. In SPSS, the Spearman-Brown stepped-up reliability coefficient is labeled "Cronbach's alpha based on standardized items" and is part of the default output in the "Reliability Statistics" table, next to Cronbach's alpha.

      rSB2 = (N * rave) / [1 + (N - 1) * rave]
      where
        rSB2 = the Spearman-Brown stepped-up reliability = standardized item alpha
        rave = the average of inter-item correlations
        N = total number of items
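The formula above translates directly into code (a minimal sketch), which also serves the second use mentioned above: estimating how reliability changes as the number of items varies while the average inter-item correlation is held constant.

```python
def standardized_item_alpha(r_ave, n_items):
    """Spearman-Brown stepped-up reliability (standardized item alpha):
    r = (N * r_ave) / (1 + (N - 1) * r_ave)."""
    return (n_items * r_ave) / (1 + (n_items - 1) * r_ave)
```

For example, with an average inter-item correlation of .3, ten items give about .81, illustrating why longer scales tend to show higher reliability.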
  • Ordinal reliability alpha. Zumbo, Gadermann, & Zeisser (2007) use a polychoric correlation matrix input to calculate alpha parallel to Cronbach. Their simulation studies lead them to conclude that ordinal reliability alpha provides "consistently suitable estimates of the theoretical reliability, regardless of the magnitude of the theoretical reliability, the number of scale points, and the skewness of the scale point distributions. In contrast, coefficient alpha is in general a negatively biased estimate of reliability" for ordinal data (p. 21). Ordinal reliability alpha will normally be higher than the corresponding Cronbach's alpha.
  • Raykov's reliability rho (ρ), also called reliability rho or composite reliability, tests if it may be assumed that a single common factor underlies a set of variables. Raykov (1998) has demonstrated that Cronbach's alpha may over- or under-estimate scale reliability. Underestimation is common. For this reason, rho is now preferred and may lead to higher estimates of true reliability. Raykov's reliability rho is not to be confused with Spearman's median rho, an ordinal alternative to Cronbach's alpha, discussed below. The acceptable cutoff for rho would be the same as the researcher sets for Cronbach's alpha, since both attempt to measure true reliability. Raykov's reliability rho is output by EQS. See Raykov (1997), which lists EQS and LISREL code for computing composite reliability. Graham (2006) discusses Amos computation of reliability rho.
  • Armor's reliability theta is a similar measure developed by Armor (1974). Theta = θ = [p/(p-1)]*[1-(1/λ1)], where p = the number of items in the scale and λ1 denotes the first and therefore largest eigenvalue from the principal components analysis of the correlation matrix of the items comprising the scale. See Zumbo, Gadermann, & Zeisser, 2007: 22. Reliability theta is interpreted similarly to other reliability coefficients. While not directly computed by SPSS or SAS, it is easily calculated from principal components factor results using the formula above.

    • Ordinal reliability theta. Zumbo, Gadermann, & Zeisser (2007) use a polychoric correlation matrix as input to principal components analysis to calculate an ordinal version of reliability theta, using simulation studies to demonstrate that ordinal reliability theta provides "consistently suitable estimates of the theoretical reliability, regardless of the magnitude of the theoretical reliability, the number of scale points, and the skewness of the scale point distributions" (p. 21). Ordinal reliability theta will normally be higher than the corresponding Cronbach's alpha.
  • Spearman's reliability rho. Spearman's rho is a form of rank-order correlation. It is calculated with the same formula as Pearson's r, but using rank rather than interval data. The median rho between all pairs of items in a scale is a classic measure of reliability, in the sense of internal consistency, and as such is an ordinal alternative to Cronbach's alpha. Rho > .60 is considered the minimum for adequate scale reliability. This is not to be confused with Raykov's reliability rho.


  • Split-half reliability

    • Split-half reliability, which measures equivalence, is also called parallel forms reliability or internal consistency reliability. It is administering two equivalent batteries of items measuring the same thing in the same instrument to the same people. If split halves is requested in SPSS, four coefficients will be generated: Cronbach's alpha for each form, the Spearman-Brown coefficient, the Guttman split-half coefficient, and the Pearsonian correlation between the two forms (aka, "half-test reliability"). (Note: Some authors label split-half reliability as a subtype of internal consistency reliability.)

      In SPSS, select Analyze, Scale, Reliability Analysis; list your variables; click Statistics; select Item, Scale, and Scale if item deleted; select Split-Half from the Model drop-down list. OK. SPSS will take the first half of the items as the first split form, and the second half as listed in the dialog box as the second split form. If there is an odd number of items, the first form will be one item longer than the second. You can also use the Paste button to call up the Syntax window and alter the /MODEL=SPLIT parameter to /MODEL=SPLIT n, where n is the number of items in the second form.

      • Spearman-Brown split-half reliability coefficient, also called the Spearman-Brown prophecy coefficient and not to be confused with the Spearman-Brown stepped-up reliability coefficient (standardized item alpha) above, is a form of split-halves reliability measure. The Spearman-Brown prophecy coefficient is used to estimate full-test reliability based on split-half reliability measures. A common rule of thumb is .80 or higher for adequate reliability and .90 or higher for good reliability. However, for exploratory research, a cutoff as low as .60 is not uncommon.

        The Pearson correlation of split forms estimates the half-test reliability of an instrument or scale. The Spearman-Brown "prophecy formula" predicts what the full-test reliability would be, based on half-test correlations. This coefficient will be higher than the half-test reliability coefficient. It is easily hand-calculated as twice the half-test correlation divided by the quantity 1 plus the half-test correlation. In SPSS, two Spearman-Brown split-half reliability coefficients will appear in the "Reliability Statistics" portion of the output when split-half is selected under the Model button: (1) "Equal length" gives the estimate of the reliability if both halves had equal numbers of items, and (2) "Unequal length" gives the reliability estimate assuming unequal numbers.

          rSB1 = (k * rij) / [1 + (k - 1) * rij]
          where
            rSB1 = the Spearman-Brown split-half reliability
            rij = the Pearson correlation between forms i and j
            k = total number of items divided by the number of items per form (k is usually 2)
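A minimal sketch of the split-half computation, stepping up the half-test correlation with the prophecy formula (illustrative Python; with k = 2 the formula reduces to 2r / (1 + r), which corresponds to SPSS's equal-length estimate):

```python
def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def split_half(items):
    """Correlate first-half vs. second-half total scores, then apply
    the Spearman-Brown prophecy formula: r_full = 2r / (1 + r)."""
    mid = (len(items) + 1) // 2          # first form gets the extra odd item
    n = len(items[0])
    s1 = [sum(it[s] for it in items[:mid]) for s in range(n)]
    s2 = [sum(it[s] for it in items[mid:]) for s in range(n)]
    r = pearson(s1, s2)
    return r, 2 * r / (1 + r)
```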

        As with other split-halves measures, the Spearman-Brown reliability coefficient is highly influenced by alternative methods of sorting items into the two forms, which is preferably done randomly. Random assignment of items to the two forms should assure equality of variances between the forms, but this is not guaranteed and should be checked by the researcher.
      • Guttman split-half reliability coefficient is an adaptation of the Spearman-Brown coefficient, but one which does not require equal variances between the two split forms.

        • Guttman's lower bounds (lambda 1-6) are a set of six coefficients, L1 to L6, generated when in SPSS one selects "Guttman" under the Model button:

          1. L1: An intermediate coefficient used in computing the other lambdas.
          2. L2: More complex than Cronbach's alpha and preferred by some researchers, though less common.
          3. L3: Equivalent to Cronbach's alpha.
          4. L4: Guttman split-half reliability.
          5. L5: Recommended when a single item highly covaries with other items, which themselves lack high covariances with each other.
          6. L6: Recommended when inter-item correlations are low in relation to squared multiple correlations.
        Guttman recommends experimenting to find the split of items which maximizes Guttman split-half reliability (L4), then using the highest of the lower bound lambdas as the reliability estimate for the set of items. The best split will be that in which each half contains highly inter-correlated items.
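Guttman's L4 for a given split can be sketched as follows (illustrative Python, taking the two half-scale total-score lists as input; note that, unlike the Spearman-Brown coefficient, no equal-variance assumption is needed):

```python
from statistics import variance

def guttman_l4(half_a, half_b):
    """Guttman split-half reliability (L4) from two half-scale score lists:
    L4 = 2 * (1 - (var(A) + var(B)) / var(A + B))."""
    total = [a + b for a, b in zip(half_a, half_b)]
    return 2 * (1 - (variance(half_a) + variance(half_b)) / variance(total))
```

Following Guttman's recommendation above, one would compute L4 for alternative splits and keep the split that maximizes it.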

  • Test-retest reliability

    • Test-retest reliability, which measures stability over time, is administering the same test to the same subjects at two points in time. The appropriate length of the interval depends on the stability of the variables which causally determine that which is measured. A year might be too long for an opinion item but appropriate for a physiological measure. A typical interval is several weeks. Statistically, test-retest reliability is treated as a variant of split-half reliability and also uses the Spearman-Brown coefficient.

      Test-retest methods are disparaged by many researchers as a way of gauging reliability. Among the problems are that short intervals between administrations of the instrument will tend to yield estimates of reliability which are too high. There may be invalidity due to a learning/practice effect (subjects learn from the first administration and adjust their answers on the second). There may be invalidity due to a maturation effect when the interval between administrations is long (the subjects change over time). The bother of having to take a second administration may cause some subjects to drop out of the pool, leading to nonresponse biases. Note, however, that test-retest designs are still widely used and published and there is support for this. McKelvie (1992), for instance, reports that reliability estimates under test-retest designs are not inflated due to memory effects. Researchers using test-retest reliability must address the special validity concerns, but may decide to go ahead if warranted.


  • Inter-rater reliability

    • Inter-rater reliability, which measures homogeneity, is administering the same form to the same people by two or more raters/interviewers so as to establish the extent of consensus on use of the instrument by those who administer it. In the data setup, judges are the columns and judgees are the rows. For categorical data, consensus is measured as the number of agreements divided by the total number of observations. For continuous data, consensus is measured by intraclass correlation, discussed below. Note that raters should be as blind as possible to expected outcomes of the study and should be randomly assigned.
    • Cohen's Kappa for inter-rater reliability can be used to assess inter-rater reliability if there are just two raters. Cohen developed a multi-rater version of Kappa, but it is not implemented in SPSS.

        Let there be two raters who each independently rate the same n objects, each of which might fall into one of k categories. For instance, two raters could rate 100 letters to the editor as "liberal," "conservative," or "bipartisan." The raters' choices would be organized into a square table, with the k available choices for Rater #1 being the columns and the k choices for Rater #2 being the rows. A cell count of 10 in the Rater #1-liberal column and the Rater #2-bipartisan row would mean that 10 of the 100 cases were rated liberal by Rater #1 and bipartisan by Rater #2, and so on.

        Counts in diagonal cells will reflect inter-rater agreement and cells off the diagonal will represent disagreements. Kappa is a function of the ratio of agreements to disagreements in relation to expected frequencies. In SPSS it is not available in the Reliability module. Rather one must obtain it from the Crosstabs procedure (Kappa is a choice under the Statistics button in Crosstabs; it is not a default option). In SAS, weighted and unweighted kappa is computed by the FREQ procedure.

        Interpretation. By convention, a Kappa > .70 is considered acceptable inter-rater reliability, but this depends highly on the researcher's purpose. Another rule of thumb is that K = 0.40 to 0.59 is moderate inter-rater reliability, 0.60 to 0.79 substantial, and 0.80 outstanding (Landis & Koch, 1977). For inter-rater reliability of a set of items, such as a scale, one would report mean Kappa.

        Manual computation: let a = the sum of counts on the diagonal, reflecting agreements. Let e = the sum of expected counts on the diagonal, where expected is calculated as [(row total * column total)/n], summed for each cell on the diagonal. Let n = the total number of ratings (observations). Kappa then equals the ratio of the surplus of agreements over expected agreements, divided by the number of expected disagreements. This is equivalent to K = (a - e)/(n - e). Fleiss and Cohen (1973) have shown ICC, discussed below, is mathematically equivalent to weighted Kappa.
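The manual computation above can be sketched as follows (illustrative Python, taking the square agreement table as a list of rows):

```python
def cohen_kappa(table):
    """Cohen's kappa from a k x k agreement table (rows = Rater 2,
    columns = Rater 1): K = (a - e) / (n - e), where a = observed
    agreements on the diagonal and e = expected agreements."""
    n = sum(map(sum, table))
    a = sum(table[i][i] for i in range(len(table)))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    e = sum(r * c / n for r, c in zip(row_tot, col_tot))
    return (a - e) / (n - e)
```

Perfect agreement (all counts on the diagonal) gives K = 1; agreement at chance level gives K = 0.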

        Weighted Kappa: For ordinal rankings or better, one can weight each cell in the agreement/disagreement table by a weight between 0 and 1, where 1 corresponds to the row and column categories being the same and 0 corresponds to the categories being maximally dissimilar.

    • Intraclass correlation (ICC) is used to measure inter-rater reliability for two or more raters when data may be considered interval level. It may also be used to assess test-retest reliability. ICC may be conceptualized as the ratio of between-groups variance to total variance, as elaborated below. A classic citation for intraclass correlation is Shrout and Fleiss (1979), though ICC is based on work going back before WWI.

        Sample size: ICC vs. Pearson r: When there are just two ratings, ICC is preferred over Pearson's r only when sample size is small (<15). Since Pearson's r makes no assumptions about rater means, a t-test of the significance of r reveals if inter-rater means differ. For small samples (<15), Pearson's r overestimates test-retest correlation, and in this situation intraclass correlation is used instead of Pearson's r.

        Walter, Eliasziw, & Donner (1998) set optimal sample size for ICC based on desired power level, magnitude of the predicted ICC, and the lower confidence limit, concluding that if the researcher used the customary .95 confidence level and the .20 power level, and had two ratings per subject, then the needed sample size (needed to prove the estimated ICC was different from 0) would range from 5 when the estimated ICC was .9 to 616 when it was only .1; for three ratings, the corresponding range was 3 to 225; for four ratings, 3 to 123; for five ratings, 3 to 81; for 10 ratings, 2 to 26; for 20 ratings, 2 to 11 (pp. 106-107).

        Bonnett (2002: 1334) investigated the sample size issue for ICC, concluding that optimum sample size is a function of the size of the intraclass correlation coefficient and the number of ratings per subject, as well as the desired significance level (alpha) and desired width (w) of the confidence interval. For alpha = .95 and w = .2, Bonnett concluded that the optimal sample size for two ratings varied from 15 for ICC = .9 to 378 for ICC = .1; for three ratings, it varied from 13 to 159; five ratings, 10 to 64; and 10 ratings, 8 to 29. That is, the fewer the ratings and the smaller the ICC, the larger the needed sample size.

        Data setup: In using intraclass correlation for inter-rater reliability, one constructs a table in which column 1 is the target id (1, 2, ..., n) and subsequent columns are the raters (A, B, C, ...). The row variable is some grouping variable which is the target of the ratings, such as persons (Subject1, Subject2, etc.) or neighborhood (E, W, N, S). The cell entries after the first id column are the raters' ratings of the target on some interval variable or interval-like variable, such as some Likert scale. The purpose of ICC is to assess the inter-rater (column) effect in relation to the grouping (row) effect, using two-way ANOVA.

        Interpretation: ICC is interpreted similar to Kappa, discussed above. ICC will approach 1.0 when there is no variance within targets (ex., subjects, neighborhoods -- for any target, all raters give the same ratings), indicating total variation in measurements on the Likert scale is due solely to the target (ex., subject, neighborhood) variable. That is, ICC will be high when any given row tends to have the same score across the columns (which are the raters). For instance, one may find all raters rate an item the same way for a given target, indicating total variation in the measure of a variable depends solely on the values of the variable being measured -- that is, there is perfect inter-rater reliability. Put another way, ICC may be thought of as the ratio of variance explained by the independent variable divided by total variance, where total variance is the explained variance plus variance due to the raters plus residual variance. ICC is 1.0 only when there is no variance due to the raters and no residual variance to explain.

        In SPSS, select Analyze, Scale, Reliability Analysis; select your variables; click Statistics; in the Descriptives group, select Item and select Intraclass correlation coefficient; select a model from the Model drop-down list (ex., two-way mixed); select a type from the Type drop-down list (ex., consistency). Continue. OK. Models and Types are discussed below.

        Models: ICC varies depending on whether the judges are all judges of interest or are conceived as a random sample of possible judges, and whether all targets are rated or only a random sample, and whether reliability is to be measured based on individual ratings or mean ratings of all judges. These considerations give rise to six forms of intraclass correlation, described in the classic article by Shrout and Fleiss (1979). In SPSS, these types are selected under the Model button of the Reliability dialog and under the Type drop-down list (3 models times 2 types = the six forms of ICC).

        1. One-way random effects model. Judges/raters are conceived as being a random selection of possible raters/judges, who rate all targets of interest. That is, in this model judges are treated as a random sample and the focus of interest is a one-way anova testing if there is a subject/target effect. This model applies even when the researcher cannot associate a particular subject with a particular rater because information is lacking about which judge assigned which score to a subject. This would happen if the columns were first rating of a subject, second rating, third rating, etc., but a given rating (ex., the first rating) for one subject might be by a different judge than the first rating for another subject, etc. This in turn means there is no way to separate out a judge/rater effect. There would also be no way to separate out a judge/rater effect if each judge rates only one subject, even if it is known which judge assigned which score. In either of these situations the researcher uses a one-way random effects model. This model conceptualizes that there is a target/subject factor, with each observed actual subject representing a level of that target/subject factor. The rater/judge factor cannot be measured and is absorbed into error variance. The ICC is interpreted as the proportion of target/subject variance associated with differences among the scores of the subjects.
        2. Two-way random effects model. Judges are conceived as being a random selection from among all possible judges, and targets/subjects are conceived as being a random factor too. Raters rate all n subjects/targets chosen at random from a pool of targets/subjects and it is known how each judge rated each subject. The ICC is interpreted as the proportion of Subject plus Rater variance that is associated with differences among the scores of the subjects. The ICC is interpreted as being generalizable to all possible judges.
        3. Two-way mixed model. All judges of interest rate all targets, which are a random sample. This is a mixed model because the judges are seen as a fixed effect (not as a random sample of all possible raters/judges) and the targets are a random effect. The ICC coefficients will be identical to the two-way random effects model, but the ICC is interpreted as not being generalizable beyond the given judges.

        Types: Under the Model button of the SPSS Reliability dialog, the Type drop-down list allows the researcher to specify one of two types of ICC computation:

        1. Absolute agreement: Measures if raters assign the same absolute score. Absolute agreement is often used when systematic variability due to raters is relevant.
        2. Consistency: Measures if raters' scores are highly correlated even if they are not identical in absolute terms. That is, raters are consistent as long as their relative ratings are similar. Consistency agreement is often used when systematic variability due to raters is irrelevant.

        Single versus average measures: Each model has two versions of the intraclass correlation coefficient:

        1. Single measure reliability: individual ratings constitute the unit of analysis. That is, single measure reliability gives the reliability of a single judge's rating. Use this if further research will use the ratings of a single rater.
        2. Average measure reliability: the mean of all ratings is the unit of analysis. That is, average measure reliability gives the reliability of the mean of the ratings of all raters. Use this if the research design involves averaging multiple ratings for each item, perhaps because the researcher judges that using an individual rating would involve too much uncertainty. Note average measure reliability for either two-way random effects or two-way mixed models will be the same as Cronbach's alpha.

          Average measure reliability requires a reasonable number of judges to form a stable average. The number of judges required can be estimated beforehand as nj = ICC*(1 - rL) / [rL(1 - ICC*)], where nj is the number of judges needed; rL is the lower bound of the (1 - a)*100% confidence interval around the ICC, obtained in a pilot study; and ICC* is the minimum level of ICC acceptable to the researcher (ex., .80).
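A minimal sketch of the judges-needed estimate above, with hypothetical pilot values (rL = .60 from a pilot confidence interval, ICC* = .80 as the acceptable minimum):

```python
import math

rl = 0.60        # lower confidence bound on ICC from a pilot study (hypothetical)
icc_star = 0.80  # minimum acceptable ICC (hypothetical)

# nj = ICC*(1 - rL) / [rL(1 - ICC*)]
nj = icc_star * (1 - rl) / (rl * (1 - icc_star))
print(math.ceil(nj))  # round up to a whole number of judges -> 3
```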

        Use in other contexts. ICC is sometimes used outside the context of inter-rater reliability. In general, ICC is a coefficient which approaches 1.0 as the between-groups effect (the row effect) becomes very large relative to the within-groups effect (the column effect), whatever the rows and columns represent. In this way ICC is a measure of homogeneity: it approaches 1.0 when any given row tends to have the same values for all columns. For instance, let columns be survey respondents, let rows be Census block numbers, and let the attribute measured be white=0/nonwhite=1. If blocks are homogeneous by race, any given row will tend to have mostly 0's or mostly 1's, and ICC will be high and positive. As a rule of thumb, when the row variable is some grouping or clustering variable, such as Census areas, ICC will approach 1.0 more closely as the clusters become smaller and more compact (ex., as one goes from metropolitan statistical areas to Census tracts to Census blocks). ICC is 0 when within-groups variance equals between-groups variance, indicating that the grouping variable has no effect. Though less common, ICC can become negative when the within-groups variance exceeds the between-groups variance.


Assumptions

  • Additivity. Each item should be linearly related to the total score. Tukey's test of non-additivity, a choice under the Statistics button of the Reliability dialog in SPSS, tests the null hypothesis that there is no multiplicative interaction between the cases and the items. If this test is significant (<= .05) then there is multiplicative interaction. The Tukey significance is found in the "Nonadditivity" row of the "ANOVA with Tukey's Test for Nonadditivity" table in SPSS output.

    If Tukey's test shows multiplicative interaction, any model computing scores for cases based on the scale must include the case main effect, the item main effect, and the case-by-item interaction effect. In a footnote to the Tukey test output, SPSS prints an estimate of the power to which items in a set would need to be raised in order to be additive. (Warning: while transforms may eliminate non-additivity, raising item scores to too high a power will generate large values for all subjects, obscuring differences among subjects.)

    In SPSS, select Analyze, Scale, Reliability Analysis; click Statistics; check Tukey's test of additivity

  • Independence. Observations for one subject/case should be independent of observations for any other subject/case in any administration of the instrument. However, the fact that test-retest designs involve correlated data between administrations does not pose a statistical problem in assessing reliability and does not in itself violate assumptions of reliability analysis.
  • Uncorrelated error. Errors should be uncorrelated.
  • Consistent coding. High values must have the same meaning across items.
  • Random assignment of items. In split-half tests, random assignment of items to forms is assumed. Typically, odd-numbered items become one form and even-numbered items become the second form. Sequential assignment may introduce a subject fatigue factor with regard to the second form.
  • Equivalency of forms. In split-half tests, the two forms should be equivalent. A test of this is to see if the mean response is the same in the two groups. In split-half models, Hotelling's T2 is a multivariate test for equality of means between groups. A significant T2 means that the null hypothesis that means were equal can be rejected by the researcher. This test assumes multivariate normality of items. Hotelling's T2 is a choice under the Statistics button of the Reliability dialog in SPSS.
  • Equal variances. In split-half tests, the Spearman-Brown split-half reliability coefficient assumes the split halves have equal variances. The chi-square test of parallel models tests the null hypothesis that the variances are equal. If the chi-square significance is <= .05, then the researcher concludes the models are not parallel and that the variances differ significantly. In SPSS, select "Parallel" under the model button.
  • Similar difficulty of items. Internal consistency analysis using Cronbach's alpha assumes that the scale items all measure the same dimension equally (ex., an assortment of math problems of equal difficulty). However, if the scale is of the Guttman scale type, where higher items (ex., solving division problems) imply responses to lower items (ex., solving addition problems) but not vice-versa, internal consistency in the Cronbach's alpha sense is not expected, and Cronbach's alpha gives an inappropriate estimate of reliability.
  • Same assumptions as for correlation.
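The split-half quantities discussed above can be sketched numerically. The example below (hypothetical item matrix) forms odd- and even-item halves, correlates the half scores, and applies the Spearman-Brown split-half formula rSB = 2r/(1 + r):

```python
import numpy as np

# Hypothetical responses: 6 subjects (rows) x 4 items (columns).
X = np.array([
    [9., 2., 5., 8.],
    [6., 1., 3., 2.],
    [8., 4., 6., 8.],
    [7., 1., 2., 6.],
    [10., 5., 6., 9.],
    [6., 2., 4., 7.],
])

odd = X[:, 0::2].sum(axis=1)   # total score on odd-numbered items
even = X[:, 1::2].sum(axis=1)  # total score on even-numbered items

r = np.corrcoef(odd, even)[0, 1]  # correlation of the two half scores
sb = 2 * r / (1 + r)              # Spearman-Brown split-half coefficient
print(round(r, 3), round(sb, 3))  # ≈ 0.886 and 0.940
```

Note the Spearman-Brown step adjusts the half-test correlation upward to estimate the reliability of the full-length scale.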


Frequently Asked Questions

  • How is reliability related to validity?
      A measure may be reliable but not valid, but it cannot be valid without being reliable. That is, reliability is a necessary but not sufficient condition for validity.
  • How is reliability related to attenuation in correlation?
      Reliability is a form of correlation. Correlation coefficients can be attenuated (misleadingly low) for a variety of reasons, including truncation of the range of variables (as by dichotomization of continuous data; reducing a 7-point scale to a 3-point scale). Measurement error also attenuates correlation. Reliability may be thought of as the correlation of a variable with itself. Attenuation-corrected correlation ("disattenuated correlation") is higher than the raw correlation on the assumption that the lower the reliability, the greater the measurement error, and the higher the "true" correlation is in relation to the measured correlation.

      The Spearman correction for attenuation of a correlation: let rxy* be the corrected correlation of x and y; let rxy be the uncorrected (observed) correlation; then rxy* is a function of rxy and the reliabilities of the two variables, rxx and ryy:

      rxy* = rxy / SQRT(rxx * ryy)

      This formula will result in an estimated true correlation (rxy*) which is higher than the observed correlation (rxy), and all the more so the lower the reliabilities. Corrected r may be greater than 1.0, in which case it is customarily rounded down to 1.0.
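A minimal sketch of the Spearman correction, with hypothetical values for the observed correlation and the two reliabilities:

```python
import math

def disattenuate(rxy, rxx, ryy):
    """Corrected correlation rxy* = rxy / SQRT(rxx * ryy), rounded down to 1.0 if it exceeds 1.0."""
    return min(1.0, rxy / math.sqrt(rxx * ryy))

# Hypothetical: observed r = .30, reliabilities .70 and .80.
r_corrected = disattenuate(rxy=0.30, rxx=0.70, ryy=0.80)
print(round(r_corrected, 3))  # 0.401 -- higher than the observed .30
```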

      Note that use of attenuation-corrected correlation is the subject of controversy (see, for ex., Winne & Belfry, 1982). Moreover, because corrected r will no longer have the same sampling distribution as r, a conservative approach is to take the upper and lower confidence limits of r and compute corrected r for both, giving a range of attenuation-corrected values for r. However, Muchinsky (1996) has noted that attenuation-corrected reliabilities, being not directly comparable with uncorrected correlation, are therefore not appropriate for use with inferential statistics in hypothesis testing and this would include taking confidence limits. Still, Muchinsky and others acknowledge that the difference between a correlation and attenuation-corrected correlation may be useful, at least for exploratory purposes, in assessing whether a low correlation is low because of unreliability of the measures or because the measures are actually uncorrelated.


  • What is Cochran's Q test of equality of proportions for dichotomous items?
      Cochran's Q is used to test whether a set of dichotomous items split similarly, which is the same as testing whether the items have the same mean. If they test the same, items within the set might be substituted for one another. In the ANOVA output table for a set of dichotomous items, look at the "Between Items" row of the "Sig" column for Cochran's Q: if Sig(Q) <= .05, the researcher rejects the null hypothesis that all items display an equal split (have the same mean).

      In SPSS, select Analyze, Scale/Reliability; select your items; click Statistics; in the Descriptives area, select Item, Scale, Scale if Deleted; in Summarize, select summary statistics (Means, Variances, Covariances, Correlations); and in the ANOVA table group, select Cochran chi-square. Continue. OK.

      Cochran's Q is discussed further in the section on significance tests for more than two dependent samples.
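For readers computing outside SPSS, Cochran's Q can be formed directly from the row and column totals of the 0/1 response matrix, using the standard formula Q = (k - 1)[k*SUM(Cj^2) - (SUM(Cj))^2] / [k*SUM(Ri) - SUM(Ri^2)], referred to a chi-square distribution with k - 1 df. The response matrix below is hypothetical (rows = subjects, columns = dichotomous items).

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 0/1 responses: 8 subjects x 3 items.
X = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
])
k = X.shape[1]     # number of items
C = X.sum(axis=0)  # column (item) totals
R = X.sum(axis=1)  # row (subject) totals

Q = (k - 1) * (k * (C ** 2).sum() - C.sum() ** 2) / (k * R.sum() - (R ** 2).sum())
p = chi2.sf(Q, df=k - 1)
print(Q, p)  # Q = 7.6 on 2 df; p < .05, so the items do not split equally
```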

  • What is the derivation of intraclass correlation coefficients?

      Derivation of the ICC formula, following Ebel (1951: 409-411): Let A be the true variance in subjects' ratings due to the normal expectation that different subjects will have truly different scores on the rating variable. Let B be the error variance in subjects' ratings attributable to inter-rater unreliability. The intent of ICC is to form the ratio ICC = A/(A + B). That is, intraclass correlation is true inter-subject variance as a percentage of total variance, where total variance is true variance plus variance attributable to inter-rater error in classification. B is simply the mean-square estimate of within-subjects variance (variance in the ratings for a given subject by a group of raters), computed in ANOVA. The mean-square estimate of between-subjects variance equals k times A (the true component) plus B (the inter-rater error component), since each mean contains a true component and an error component.

      Given B = mswithin, and given msbetween = kA + B, substituting these equalities into the intended equation (ICC = A/[A+B]), the equation for ICC reduces to the formula for the most-used version of intraclass correlation (Haggard, 1958: 60):

      ICC = rI = (msbetween - mswithin)/(msbetween + [k - 1]mswithin)

      where

      • msbetween is the mean-square estimate of between-subjects variance, reflecting the normal expectation that different subjects will have truly different scores on the rating variable
      • mswithin is the mean-square estimate of within-subjects variance, or error attributed to inter-rater unreliability in rating the same person or target (row).
      • k is the number of raters/ratings per target (person, neighborhood, etc.) = number of columns. If the number of raters differs per target, an average k is used based on the harmonic mean: k' = (1/(n - 1)) * (SUM(k) - SUM(k^2)/SUM(k)).
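The formula above can be sketched directly from a ratings matrix. The example below (Python, hypothetical data: rows = targets/subjects, columns = raters) computes msbetween and mswithin by one-way ANOVA with subjects as groups, then forms the ICC:

```python
import numpy as np

# Hypothetical ratings: 5 targets (rows) each rated by 3 raters (columns).
X = np.array([
    [9., 8., 9.],
    [5., 6., 4.],
    [7., 7., 8.],
    [3., 2., 3.],
    [6., 5., 6.],
])
n, k = X.shape
grand = X.mean()

# One-way ANOVA: subjects are the groups, raters supply the replicates.
ms_between = k * ((X.mean(axis=1) - grand) ** 2).sum() / (n - 1)
ms_within = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))

# ICC = (msbetween - mswithin) / (msbetween + (k - 1) * mswithin)
icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(round(icc, 3))  # 0.916 -- high agreement among raters for these data
```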
  • What are Method 1 and Method 2 in the SPSS RELIABILITY module?
      These are two methods of computing the reliability coefficient (Cronbach's alpha) for a set of items thought to comprise a scale. Method 1 allows constant terms to remain in the scale, while Method 2 deletes constant terms. Method 2 also produces a standardized item alpha, as if data had been input in standardized form. Method 2 can be forced when using the syntax window by adding the clause METHOD=COV. Method 2 is the default in SPSS.


Bibliography

  • Armor, D. J. (1974). Theta reliability and factor scaling. Pp. 17-50 in H. Costner, ed., Sociological methodology. San Francisco: Jossey-Bass.
  • Bonett, Douglas G. (2002). Sample size requirements for estimating intraclass correlations with desired precision. Statistics in Medicine 21: 1331-1335.
  • Ebel, Robert L. (1951). Estimation of the reliability of ratings. Psychometrika 16: 407-424.
  • Fleiss, J. L., Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33: 613-619.
  • Graham, James M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. Educational and Psychological Measurement 66: 930-944.
  • Haggard, E. A. (1958). Intraclass correlation and the analysis of variance. NY: Dryden.
  • Landis, J. R., Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics 33:159-174. This article sets cut-offs for Cohen's Kappa.
  • Litwin, Mark S. (2002). How to assess and interpret survey psychometrics. The Survey Kit series, Vol. 8. Thousand Oaks, CA: Sage Publications. Covers test-retest, alternate-form, internal consistency, interobserver, and intraobserver reliability.
  • McGraw, K.O. and S.P. Wong (1996). "Forming inferences about some intraclass correlation coefficients," Psychological Methods 1(1): 30-46.
  • McKelvie, S. J. (1992). Does memory contaminate test-retest reliability? Journal of Gen Psychology 119(1):59-72. This article reports that reliability estimates under test-retest designs are not inflated due to memory effects.
  • McNemar, Q. (1969). Psychological Statistics. Fourth edition. New York: Wiley. Covers F tests for intraclass correlation (p. 322).
  • Muchinsky P.M. (1996) The correction for attenuation. Educational & Psychological Measurement 56(1), 63-75.
  • Nunnally, J. C. (1970). Psychometric Theory. Second ed., 1978. New York: McGraw Hill. Cited as a reference in support of the .70 cut-off for Cronbach's alpha. Classic on reliability in psychological and educational testing.
  • Raykov, Tenko (1997). Estimation of composite reliability for congeneric measures. Applied Psychological Measurement, 21, 173-184
  • Raykov, Tenko (1998). Coefficient alpha and composite reliability with interrelated nonhomogeneous items Applied Psychological Measurement, 22(4), 375-385.
  • Shrout, P.E., and J. L. Fleiss (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin (86): 420-428. Classic article on intraclass correlation.
  • SPSS (1988). SPSS-X User's Guide, Third ed.. Chicago, IL: SPSS Inc.
  • Walter, S. D.; Eliasziw, M. ; and Donner, A. (1998). Sample size and optimal designs for reliability studies. Statistics in Medicine 17: 101-110.
  • Winne, Philip H. & Belfry, M. Joan (1982). Interpretive problems when correcting for attenuation. Journal of Educational Measurement, 19(2), 125-134.
  • Zumbo, B. D.; Gadermann, A. M.; & Zeisser, C.. (2007). Ordinal versions of coefficients alpha and theta for likert rating scales. Journal of Modern Applied Statistical Methods, 6, 21-29.