 Title    [Research Methods] Reliability Analysis [ 2012-11-28 18:31:14 ]
 I D    robot

Reliability Analysis




        Overview

          Researchers must demonstrate that instruments are reliable since, without reliability, research results using the instrument are not replicable, and replicability is fundamental to the scientific method. Reliability is the correlation of an item, scale, or instrument with a hypothetical one that truly measures what it is supposed to. Since the true instrument is not available, reliability is estimated in one of four ways:

          1. Internal consistency: Estimation based on the correlation among the variables comprising the set (typically, Cronbach's alpha)
          2. Split-half reliability: Estimation based on the correlation of two equivalent forms of the scale (typically, the Spearman-Brown coefficient)
          3. Test-retest reliability: Estimation based on the correlation between two (or more) administrations of the same item, scale, or instrument at different times, locations, or populations, when the two administrations do not differ on other relevant variables (typically, the Spearman-Brown coefficient)
          4. Inter-rater reliability: Estimation based on the correlation of scores between/among two or more raters who rate the same item, scale, or instrument (typically, intraclass correlation, of which there are six types discussed below).
          These four reliability estimation methods are not necessarily mutually exclusive, nor need they lead to the same results. All reliability coefficients are forms of correlation coefficients, but there are multiple types, discussed below, representing different meanings of reliability, and more than one might be used in a single research setting.




      Key Concepts and Terms

      • General

        • Scores are the subject's responses to items on an instrument (ex., a mail questionnaire). Observed scores may be broken down into two components: the true score (commonly labeled tau) plus the error score. The error score, in turn, can be broken down into systematic error (non-random error reflecting some systematic bias, as due, for instance, to the methodology used -- hence also called method error) and random error (due to random traits of the subjects -- hence also called trait error). The smaller the error component in relation to the true score component, the higher the reliability of an item, which is the ratio of the true score to the total (true + error) score.

          • Number of scale items. Note that the larger the number of items added together in a scale, the less random error matters as it will be self-cancelling (think of weighing a subject on 100 different scales and averaging rather than on using just one scale), and therefore some reliability coefficients (such as Cronbach's alpha) also compute higher reliability when the number of scale items is higher.
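This self-cancelling of random error can be illustrated with a small simulation (illustrative Python with made-up parameters; function name and numbers are hypothetical, not from any reliability package): the mean of many noisy items correlates more strongly with the subject's true score than a single item does.

```python
import random

def observed_true_correlation(n_items, n_subjects=500, noise_sd=1.0, seed=0):
    """Toy simulation: each subject has a true score, and each item
    observes it plus independent random error.  Returns the Pearson
    correlation between the mean of the items and the true score."""
    rng = random.Random(seed)
    true = [rng.gauss(0, 1) for _ in range(n_subjects)]
    obs = [t + sum(rng.gauss(0, noise_sd) for _ in range(n_items)) / n_items
           for t in true]
    n = n_subjects
    mt, mo = sum(true) / n, sum(obs) / n
    cov = sum((t - mt) * (o - mo) for t, o in zip(true, obs))
    vt = sum((t - mt) ** 2 for t in true)
    vo = sum((o - mo) ** 2 for o in obs)
    return cov / (vt * vo) ** 0.5
```

With these made-up settings, averaging 100 items yields a far higher correlation with the true score than a single item, which is the "100 different scales" intuition above.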
        • Models supported by SPSS under the Models button of the Reliability dialog are:

          1. Alpha (Cronbach). This models internal consistency based on average correlation among items.
          2. Split-half. This model is based on the correlation between the parts of a scale which is split into two forms.
          3. Guttman. This is an alternative split-half model which computes Guttman's lower bounds for true reliability, discussed below.
          4. Parallel. This method uses maximum likelihood to test if all items have equal variances and error variances. Cronbach's alpha is the maximum likelihood estimate of the reliability coefficient when the parallel model is assumed to be true (SPSS, 1988: 873). If the chi-square goodness of fit significance for the parallel model is <=.05, the researcher rejects the null hypothesis that the items have equal variances and error variances in the population.
          5. Strict parallel. This method also uses maximum likelihood to test for equal variances, equal error variances, and equal population means across items.
        • Triangulation is the attempt to increase reliability by reducing systematic (method) error, through a strategy in which the researcher employs multiple methods of measurement (ex., survey, observation, archival data). If the alternative methods do not share the same source of systematic error, examination of data from the alternative methods gives insight into how individual scores may be adjusted to come closer to reflecting true scores, thereby increasing reliability.
        • Calibration is the attempt to increase reliability by increasing homogeneity of ratings through feedback to the raters, when multiple raters are used. Raters meet in calibration meetings to discuss items on which they have disagreed, typically during pretesting of the instrument. The raters seek to reach consensus on rules for rating items (ex., defining the meaning of a "3" for an item dealing with job satisfaction). Calibration meetings should not involve discussion of expected outcomes of the study, as this would introduce bias and undermine validity.

    • Internal consistency reliability

    • http://www.youtube.com/watch?v=DS8Hw0Ort4w

 

  • Cronbach's alpha is the most common form of internal consistency reliability coefficient. Alpha equals zero when the true score is not measured at all and there is only an error component. Alpha equals 1.0 when all items measure only the true score and there is no error component.

 

    • Interpretation: Cronbach's alpha can be interpreted as the percent of variance the observed scale would explain in the hypothetical true scale composed of all possible items in the universe. Alternatively, it can be interpreted as the correlation of the observed scale with all possible other scales measuring the same thing and using the same number of items.

 

  • Cut-off criteria. By convention, a lenient cut-off of .60 is common in exploratory research; alpha should be at least .70 or higher to retain an item in an "adequate" scale; and many researchers require a cut-off of .80 for a "good scale." Cronbach's alpha is discussed further in the section on standard measures and scales, along with other coefficients such as Cohen's kappa.

Cronbach's alpha is arguably the most commonly used metric for evaluating the internal consistency reliability associated with scores derived from a scale. Ask most any researcher and he or she will likely tell you that Cronbach's alpha must be at least .70. Unfortunately, as pointed out by Lance, Butts, and Michels (2006), this often-cited criterion, claimed to have been articulated by Nunnally, is actually misleading. I encourage you to read the Lance et al. (2006) paper. Essentially, Nunnally and Bernstein (1994) state that .70 may be an acceptable minimum for a newly developed scale. By contrast, basic research should rely upon scales that yield scores with a minimum reliability of .80. In cases where important decisions are being made based on scores from a scale, a reliability in excess of .90 should be expected.

 

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
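For concreteness, the internal-consistency computation can be sketched directly from raw item scores. The following is a minimal Python illustration of the textbook formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score); it is a sketch of the formula, not SPSS output.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha from a list of item-score lists (one list per item):
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = len(items)
    n_subjects = len(items[0])
    totals = [sum(item[s] for item in items) for s in range(n_subjects)]
    return (k / (k - 1)) * (1 - sum(variance(i) for i in items) / variance(totals))
```

For perfectly correlated items alpha is 1.0; as inter-item correlation drops, alpha falls toward zero.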


Other characteristics:

  1. Number of items. Note that Cronbach's alpha increases as the number of items in the scale increases, even controlling for the same level of average intercorrelation of items. This assumes, of course, that the added items are not bad items compared to the existing set. Increasing the number of items can be a way to push alpha to an acceptable level. This reflects the assumption that scales and instruments with a greater number of items are more reliable. It also means that comparison of alpha levels between scales with differing numbers of items is not appropriate.

 

  2. Alpha if deleted. SPSS will compute "Cronbach's Alpha if Item Deleted," which is the estimated value of alpha if the given item were removed from the model. The researcher may wish to drop items where the alpha if deleted is higher than the overall alpha as another way to improve the alpha level. Note, however, that when an item has high random error it is possible that it would be removed on this basis when, in fact, it does measure the same construct.

 

  3. The item-total correlation, also part of SPSS output in the Total Correlation column when Item is checked under the Statistics button. This is the Pearsonian correlation of the item with the total of scores on all other items. A low item-total correlation means the item is little correlated with the overall scale (ex., < .3 for large samples or not significant for small samples) and the researcher should consider dropping it. A negative correlation indicates the need to recode the item in the opposite direction. The reliability analysis should be re-run if an item is dropped or recoded. Note a scale with an acceptable Cronbach's alpha may still have one or more items with low item-total correlations.
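The item-total correlation described above can be sketched as follows (illustrative Python; the helper names are arbitrary). Each item is correlated with the sum of all the other items, which is why it is sometimes called the corrected item-total correlation.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def corrected_item_total(items):
    """Correlation of each item with the sum of all *other* items."""
    n = len(items[0])
    result = []
    for i, item in enumerate(items):
        rest = [sum(other[s] for j, other in enumerate(items) if j != i)
                for s in range(n)]
        result.append(pearson(item, rest))
    return result
```

A negative value signals an item that needs reverse coding, as noted above.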

 

  4. The squared multiple correlation, R², is the R² for an item when it is predicted from all other items in the scale. The larger the R², the more the item contributes to internal consistency. The lower the R², the more the researcher should consider dropping the item. Note the R² of some items may be low even on a scale which has an acceptable Cronbach's alpha overall.

 

  5. Negative alphas. Note also that a negative Cronbach's alpha indicates inconsistent coding (see assumptions) or a mixture of items measuring different dimensions, leading to negative inter-item correlations.

 

  6. The Kuder-Richardson (KR20) coefficient is the same as Cronbach's alpha when items are dichotomous. In SPSS, it is available under Scale, Reliability Analysis.
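As a sketch of the KR20 computation (illustrative Python; population variances are used throughout so the ratio matches Cronbach's alpha for 0/1 items):

```python
def kr20(items):
    """Kuder-Richardson 20 for dichotomous (0/1) items, equivalent to
    Cronbach's alpha in this case:
    KR20 = k/(k-1) * (1 - sum(p*q) / var(total))."""
    k = len(items)
    n = len(items[0])
    totals = [sum(it[s] for it in items) for s in range(n)]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n   # population variance
    pq = sum((sum(it) / n) * (1 - sum(it) / n) for it in items)
    return (k / (k - 1)) * (1 - pq / var_t)
```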

 

  • In SPSS, Cronbach's alpha is found under Analyze, Scale, Reliability Analysis. Then in the Statistics button, check Scale to get alpha. You can also check Scale if item deleted, in which case alpha will be computed both for all variables entered, and also for all remaining variables if any one is dropped (the alpha if deleted is listed in a table, one for each variable). That is, the 'scale if item deleted' option lets the researcher assess the reliability of each item.

 

  • Standardized item alpha is the average inter-item correlation when item variances are equal. It is also called the Spearman-Brown stepped-up reliability coefficient or simply the "Spearman-Brown Coefficient," but these terms should not be confused with the Spearman-Brown split-half reliability coefficient discussed below. The difference between Cronbach's alpha and standardized item alpha is a measure of the dissimilarity of variances among items in the set. In a second use, standardized item alpha can be used to estimate the change in reliability as the number of items in an instrument or scale varies. In SPSS, the Spearman-Brown stepped-up reliability coefficient is labeled "Cronbach's alpha based on standardized items" and is part of the default output in the "Reliability Statistics" table, next to Cronbach's alpha.

      rSB2 = (N * rave) / [1 + (N - 1) * rave]
      where
        rSB2 = the Spearman-Brown stepped-up reliability = standardized item alpha
        rave = the average of inter-item correlations
        N = total number of items
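The formula above translates directly into code (a minimal sketch), which also serves the second use mentioned above: estimating how reliability changes as the number of items varies while the average inter-item correlation is held constant.

```python
def standardized_item_alpha(r_ave, n_items):
    """Spearman-Brown stepped-up reliability (standardized item alpha):
    r = (N * r_ave) / (1 + (N - 1) * r_ave)."""
    return (n_items * r_ave) / (1 + (n_items - 1) * r_ave)
```

For example, with an average inter-item correlation of .3, ten items give about .81, illustrating why longer scales tend to show higher reliability.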
  • Ordinal reliability alpha. Zumbo, Gadermann, & Zeisser (2007) use a polychoric correlation matrix input to calculate alpha parallel to Cronbach. Their simulation studies lead them to conclude that ordinal reliability alpha provides "consistently suitable estimates of the theoretical reliability, regardless of the magnitude of the theoretical reliability, the number of scale points, and the skewness of the scale point distributions. In contrast, coefficient alpha is in general a negatively biased estimate of reliability" for ordinal data (p. 21). Ordinal reliability alpha will normally be higher than the corresponding Cronbach's alpha.
  • Raykov's reliability rho (ρ), also called reliability rho or composite reliability, tests if it may be assumed that a single common factor underlies a set of variables. Raykov (1998) has demonstrated that Cronbach's alpha may over- or under-estimate scale reliability. Underestimation is common. For this reason, rho is now preferred and may lead to higher estimates of true reliability. Raykov's reliability rho is not to be confused with Spearman's median rho, an ordinal alternative to Cronbach's alpha, discussed below. The acceptable cutoff for rho would be the same as the researcher sets for Cronbach's alpha, since both attempt to measure true reliability. Raykov's reliability rho is output by EQS. See Raykov (1997), which lists EQS and LISREL code for computing composite reliability. Graham (2006) discusses Amos computation of reliability rho.
  • Armor's reliability theta is a similar measure developed by Armor (1974). Theta = θ = [p/(p-1)]*[1-(1/λ1)], where p = the number of items in the scale and λ1 denotes the first and therefore largest eigenvalue from the principal components analysis of the correlation matrix of the items comprising the scale. See Zumbo, Gadermann, & Zeisser, 2007: 22. Reliability theta is interpreted similarly to other reliability coefficients. While not directly computed by SPSS or SAS, it is easily calculated from principal components factor results using the formula above.

    • Ordinal reliability theta. Zumbo, Gadermann, & Zeisser (2007) use a polychoric correlation matrix as input to principal components analysis to calculate an ordinal version of reliability theta, using simulation studies to demonstrate that ordinal reliability theta provides "consistently suitable estimates of the theoretical reliability, regardless of the magnitude of the theoretical reliability, the number of scale points, and the skewness of the scale point distributions" (p. 21). Ordinal reliability theta will normally be higher than the corresponding Cronbach's alpha.
  • Spearman's reliability rho. Spearman's rho is a form of rank-order correlation. It is calculated with the same formula as Pearson's r, but using rank rather than interval data. The median rho between all pairs of items in a scale is a classic measure of reliability, in the sense of internal consistency, and as such is an ordinal alternative to Cronbach's alpha. Rho > .60 is considered the minimum for adequate scale reliability. This is not to be confused with Raykov's reliability rho.


  • Split-half reliability

    • Split-half reliability, which measures equivalence, is also called parallel forms reliability or internal consistency reliability. It is administering two equivalent batteries of items measuring the same thing in the same instrument to the same people. If split halves is requested in SPSS, four coefficients will be generated: Cronbach's alpha for each form, the Spearman-Brown coefficient, the Guttman split-half coefficient, and the Pearsonian correlation between the two forms (aka, "half-test reliability"). (Note: Some authors label split-half reliability as a subtype of internal consistency reliability.)

      In SPSS, select Analyze, Scale, Reliability Analysis; list your variables; click Statistics; select Item, Scale, and Scale if item deleted; select Split-Half from the Model drop-down list. OK. SPSS will take the first half of the items as the first split form, and the second half as listed in the dialog box as the second split form. If there is an odd number of items, the first form will be one item longer than the second. You can also use the Paste button to call up the Syntax window and alter the /MODEL=SPLIT parameter to /MODEL=SPLIT n, where n is the number of items in the second form.

      • Spearman-Brown split-half reliability coefficient, also called the Spearman-Brown prophecy coefficient and not to be confused with the Spearman-Brown stepped-up reliability coefficient (standardized item alpha) above, is a form of split-halves reliability measure. The Spearman-Brown prophecy coefficient is used to estimate full-test reliability based on split-half reliability measures. A common rule of thumb is .80 or higher for adequate reliability and .90 or higher for good reliability. However, for exploratory research, a cutoff as low as .60 is not uncommon.

        The Pearson correlation of split forms estimates the half-test reliability of an instrument or scale. The Spearman-Brown "prophecy formula" predicts what the full-test reliability would be, based on half-test correlations. This coefficient will be higher than the half-test reliability coefficient. It is easily hand-calculated as twice the half-test correlation divided by the quantity 1 plus the half-test correlation. In SPSS, two Spearman-Brown split-half reliability coefficients will appear in the "Reliability Statistics" portion of the output when split-half is selected under the Model button: (1) "Equal length" gives the estimate of the reliability if both halves had equal numbers of items, and (2) "Unequal length" gives the reliability estimate assuming unequal numbers.

          rSB1 = (k * rij) / [1 + (k - 1) * rij]
          where
            rSB1 = the Spearman-Brown split-half reliability
            rij = the Pearson correlation between forms i and j
            k = total number of items divided by the number of items per form (k is usually 2)
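A minimal sketch of the split-half computation, stepping up the half-test correlation with the prophecy formula (illustrative Python; with k = 2 the formula reduces to 2r / (1 + r), which corresponds to SPSS's equal-length estimate):

```python
def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def split_half(items):
    """Correlate first-half vs. second-half total scores, then apply
    the Spearman-Brown prophecy formula: r_full = 2r / (1 + r)."""
    mid = (len(items) + 1) // 2          # first form gets the extra odd item
    n = len(items[0])
    s1 = [sum(it[s] for it in items[:mid]) for s in range(n)]
    s2 = [sum(it[s] for it in items[mid:]) for s in range(n)]
    r = pearson(s1, s2)
    return r, 2 * r / (1 + r)
```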

        As with other split-halves measures, the Spearman-Brown reliability coefficient is highly influenced by alternative methods of sorting items into the two forms, which is preferably done randomly. Random assignment of items to the two forms should assure equality of variances between the forms, but this is not guaranteed and should be checked by the researcher.
      • Guttman split-half reliability coefficient is an adaptation of the Spearman-Brown coefficient, but one which does not require equal variances between the two split forms.

        • Guttman's lower bounds (lambda 1-6) are a set of six coefficients, L1 to L6, generated when in SPSS one selects "Guttman" under the Model button:

          1. L1: An intermediate coefficient used in computing the other lambdas.
          2. L2: More complex than Cronbach's alpha and preferred by some researchers, though less common.
          3. L3: Equivalent to Cronbach's alpha.
          4. L4: Guttman split-half reliability.
          5. L5: Recommended when a single item highly covaries with other items, which themselves lack high covariances with each other.
          6. L6: Recommended when inter-item correlations are low in relation to squared multiple correlations.
        Guttman recommends experimenting to find the split of items which maximizes Guttman split-half reliability (L4), then using the highest of the lower bound lambdas as the reliability estimate for the set of items. The best split will be that in which each half contains highly inter-correlated items.
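Guttman's L4 for a given split can be sketched as follows (illustrative Python, taking the two half-scale total-score lists as input; note that, unlike the Spearman-Brown coefficient, no equal-variance assumption is needed):

```python
from statistics import variance

def guttman_l4(half_a, half_b):
    """Guttman split-half reliability (L4) from two half-scale score lists:
    L4 = 2 * (1 - (var(A) + var(B)) / var(A + B))."""
    total = [a + b for a, b in zip(half_a, half_b)]
    return 2 * (1 - (variance(half_a) + variance(half_b)) / variance(total))
```

Following Guttman's recommendation above, one would compute L4 for alternative splits and keep the split that maximizes it.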

  • Test-retest reliability

    • Test-retest reliability, which measures stability over time, is administering the same test to the same subjects at two points in time. The appropriate length of the interval depends on the stability of the variables which causally determine that which is measured. A year might be too long for an opinion item but appropriate for a physiological measure. A typical interval is several weeks. Statistically, test-retest reliability is treated as a variant of split-half reliability and also uses the Spearman-Brown coefficient.

      Test-retest methods are disparaged by many researchers as a way of gauging reliability. Among the problems are that short intervals between administrations of the instrument will tend to yield estimates of reliability which are too high. There may be invalidity due to a learning/practice effect (subjects learn from the first administration and adjust their answers on the second). There may be invalidity due to a maturation effect when the interval between administrations is long (the subjects change over time). The bother of having to take a second administration may cause some subjects to drop out of the pool, leading to nonresponse biases. Note, however, that test-retest designs are still widely used and published and there is support for this. McKelvie (1992), for instance, reports that reliability estimates under test-retest designs are not inflated due to memory effects. Researchers using test-retest reliability must address the special validity concerns, but may decide to go ahead if warranted.


  • Inter-rater reliability

    • Inter-rater reliability, which measures homogeneity, is administering the same form to the same people by two or more raters/interviewers so as to establish the extent of consensus on use of the instrument by those who administer it. In the data setup, judges are the columns and judgees are the rows. For categorical data, consensus is measured as the number of agreements divided by the total number of observations. For continuous data, consensus is measured by intraclass correlation, discussed below. Note that raters should be as blind as possible to expected outcomes of the study and should be randomly assigned.
    • Cohen's Kappa for inter-rater reliability can be used to assess inter-rater reliability if there are just two raters. Cohen developed a multi-rater version of Kappa, but it is not implemented in SPSS.

        Let there be two raters who each independently rate the same n objects, each of which might fall into one of k categories. For instance, two raters could rate 100 letters to the editor as "liberal," "conservative," or "bipartisan." The raters' choices would be organized into a square table, with the k available choices for Rater #1 being the columns and the k choices for Rater #2 being the rows. A cell count of 10 in the Rater #1-liberal column and the Rater #2-bipartisan row would mean that 10 of the 100 cases were rated liberal by Rater #1 and bipartisan by Rater #2, and so on.

        Counts in diagonal cells will reflect inter-rater agreement and cells off the diagonal will represent disagreements. Kappa is a function of the ratio of agreements to disagreements in relation to expected frequencies. In SPSS it is not available in the Reliability module. Rather one must obtain it from the Crosstabs procedure (Kappa is a choice under the Statistics button in Crosstabs; it is not a default option). In SAS, weighted and unweighted kappa is computed by the FREQ procedure.

        Interpretation. By convention, a Kappa > .70 is considered acceptable inter-rater reliability, but this depends highly on the researcher's purpose. Another rule of thumb is that K = 0.40 to 0.59 is moderate inter-rater reliability, 0.60 to 0.79 substantial, and 0.80 outstanding (Landis & Koch, 1977). For inter-rater reliability of a set of items, such as a scale, one would report mean Kappa.

        Manual computation: let a = the sum of counts on the diagonal, reflecting agreements. Let e = the sum of expected counts on the diagonal, where expected is calculated as [(row total * column total)/n], summed for each cell on the diagonal. Let n = the total number of ratings (observations). Kappa then equals the ratio of the surplus of agreements over expected agreements, divided by the number of expected disagreements. This is equivalent to K = (a - e)/(n - e). Fleiss and Cohen (1973) have shown ICC, discussed below, is mathematically equivalent to weighted Kappa.
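The manual computation above can be sketched as follows (illustrative Python, taking the square agreement table as a list of rows):

```python
def cohen_kappa(table):
    """Cohen's kappa from a k x k agreement table (rows = Rater 2,
    columns = Rater 1): K = (a - e) / (n - e), where a = observed
    agreements on the diagonal and e = expected agreements."""
    n = sum(map(sum, table))
    a = sum(table[i][i] for i in range(len(table)))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    e = sum(r * c / n for r, c in zip(row_tot, col_tot))
    return (a - e) / (n - e)
```

Perfect agreement (all counts on the diagonal) gives K = 1; agreement at chance level gives K = 0.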

        Weighted Kappa: For ordinal rankings or better, one can weight each cell in the agreement/disagreement table by a weight between 0 and 1, where 1 corresponds to the row and column categories being the same and 0 corresponds to the categories being maximally dissimilar.

    • Intraclass correlation (ICC) is used to measure inter-rater reliability for two or more raters when data may be considered interval level. It may also be used to assess test-retest reliability. ICC may be conceptualized as the ratio of between-groups variance to total variance, as elaborated below. A classic citation for intraclass correlation is Shrout and Fleiss (1979), though ICC is based on work going back before WWI.

        Sample size: ICC vs. Pearson r: When there are just two ratings, ICC is preferred over Pearson's r only when sample size is small (<15). Since Pearson's r makes no assumptions about rater means, a t-test of the significance of r reveals if inter-rater means differ. For small samples (<15), Pearson's r overestimates test-retest correlation, and in this situation intraclass correlation is used instead of Pearson's r.

        Walter, Eliasziw, & Donner (1998) set optimal sample size for ICC based on desired power level, magnitude of the predicted ICC, and the lower confidence limit, concluding that if the researcher used the customary .95 confidence level and the .20 power level, and had two ratings per subject, then the needed sample size (needed to prove the estimated ICC was different from 0) would range from 5 when the estimated ICC was .9 to 616 when it was only .1; for three ratings, the corresponding range was 3 to 225; for four ratings, 3 to 123; for five ratings, 3 to 81; for 10 ratings, 2 to 26; for 20 ratings, 2 to 11 (pp. 106-107).

        Bonnett (2002: 1334) investigated the sample size issue for ICC, concluding that optimum sample size is a function of the size of the intraclass correlation coefficient and the number of ratings per subject, as well as the desired significance level (alpha) and desired width (w) of the confidence interval. For alpha = .95 and w = .2, Bonnett concluded that the optimal sample size for two ratings varied from 15 for ICC = .9 to 378 for ICC = .1; for three ratings, it varied from 13 to 159; five ratings, 10 to 64; and 10 ratings, 8 to 29. That is, the fewer the ratings and the smaller the ICC, the larger the needed sample size.

        Data setup: In using intraclass correlation for inter-rater reliability, one constructs a table in which column 1 is the target id (1, 2, ..., n) and subsequent columns are the raters (A, B, C, ...). The row variable is some grouping variable which is the target of the ratings, such as persons (Subject1, Subject2, etc.) or neighborhood (E, W, N, S). The cell entries after the first id column are the raters' ratings of the target on some interval variable or interval-like variable, such as some Likert scale. The purpose of ICC is to assess the inter-rater (column) effect in relation to the grouping (row) effect, using two-way ANOVA.

        Interpretation: ICC is interpreted similar to Kappa, discussed above. ICC will approach 1.0 when there is no variance within targets (ex., subjects, neighborhoods -- for any target, all raters give the same ratings), indicating total variation in measurements on the Likert scale is due solely to the target (ex., subject, neighborhood) variable. That is, ICC will be high when any given row tends to have the same score across the columns (which are the raters). For instance, one may find all raters rate an item the same way for a given target, indicating total variation in the measure of a variable depends solely on the values of the variable being measured -- that is, there is perfect inter-rater reliability. Put another way, ICC may be thought of as the ratio of variance explained by the independent variable divided by total variance, where total variance is the explained variance plus variance due to the raters plus residual variance. ICC is 1.0 only when there is no variance due to the raters and no residual variance to explain.

        In SPSS, select Analyze, Scale, Reliability Analysis; select your variables; click Statistics; in the Descriptives group, select Item and select Intraclass correlation coefficient; select a model from the Model drop-down list (ex., two-way mixed); select a type from the Type drop-down list (ex., consistency). Continue. OK. Models and Types are discussed below.

        Models: ICC varies depending on whether the judges are all judges of interest or are conceived as a random sample of possible judges, and whether all targets are rated or only a random sample, and whether reliability is to be measured based on individual ratings or mean ratings of all judges. These considerations give rise to six forms of intraclass correlation, described in the classic article by Shrout and Fleiss (1979). In SPSS, these types are selected under the Model button of the Reliability dialog and under the Type drop-down list (3 models times 2 types = the six forms of ICC).

        1. One-way random effects model. Judges/raters are conceived as being a random selection of possible raters/judges, who rate all targets of interest. That is, in this model judges are treated as a random sample and the focus of interest is a one-way anova testing if there is a subject/target effect. This model applies even when the researcher cannot associate a particular subject with a particular rater because information is lacking about which judge assigned which score to a subject. This would happen if the columns were first rating of a subject, second rating, third rating, etc., but a given rating (ex., the first rating) for one subject might be by a different judge than the first rating for another subject, etc. This in turn means there is no way to separate out a judge/rater effect. There would also be no way to separate out a judge/rater effect if each judge rates only one subject, even if it is known which judge assigned which score. In either of these situations the researcher uses a one-way random effects model. This model conceptualizes that there is a target/subject factor, with each observed actual subject representing a level of that target/subject factor. The rater/judge factor cannot be measured and is absorbed into error variance. The ICC is interpreted as the proportion of target/subject variance associated with differences among the scores of the subjects.
        2. Two-way random effects model. Judges are conceived as being a random selection from among all possible judges, and targets/subjects are conceived as being a random factor too. Raters rate all n subjects/targets chosen at random from a pool of targets/subjects and it is known how each judge rated each subject. The ICC is interpreted as the proportion of Subject plus Rater variance that is associated with differences among the scores of the subjects. The ICC is interpreted as being generalizable to all possible judges.
        3. Two-way mixed model. All judges of interest rate all targets, which are a random sample. This is a mixed model because the judges are seen as a fixed effect (not as a random sample of all possible raters/judges) and the targets are a random effect. The ICC coefficients will be identical to the two-way random effects model, but the ICC is interpreted as not being generalizable beyond the given judges.

        Types: Under the Model button of the SPSS Reliability dialog, the Type drop-down list allows the researcher to specify one of two types of ICC computation:

        1. Absolute agreement: Measures if raters assign the same absolute score. Absolute agreement is often used when systematic variability due to raters is relevant.
        2. Consistency: Measures if raters' scores are highly correlated even if they are not identical in absolute terms. That is, raters are consistent as long as their relative ratings are similar. Consistency agreement is often used when systematic variability due to raters is irrelevant.

        Single versus average measures: Each model has two versions of the intraclass correlation coefficient:

        1. Single measure reliability: individual ratings constitute the unit of analysis. That is, single measure reliability gives the reliability of a single judge's rating. Use this if further research will use the ratings of a single rater.
        2. Average measure reliability: the mean of all ratings is the unit of analysis. That is, average measure reliability gives the reliability of the mean of the ratings of all raters. Use this if the research design involves averaging multiple ratings for each item, perhaps because the researcher judges that using an individual rating would involve too much uncertainty. Note average measure reliability for either two-way random effects or two-way mixed models will be the same as Cronbach's alpha.

          Average measure reliability requires a reasonable number of judges to form a stable average. The number of judges required can be estimated beforehand as nj = ICC*(1 - rL) / [rL(1 - ICC*)], where nj is the number of judges needed; rL is the lower bound of the (1 - a)*100% confidence interval around the ICC, obtained in a pilot study; and ICC* is the minimum level of ICC acceptable to the researcher (ex., .80).
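A minimal sketch of the judges-needed estimate above, with hypothetical pilot values (rL = .60 from a pilot confidence interval, ICC* = .80 as the acceptable minimum):

```python
import math

rl = 0.60        # lower confidence bound on ICC from a pilot study (hypothetical)
icc_star = 0.80  # minimum acceptable ICC (hypothetical)

# nj = ICC*(1 - rL) / [rL(1 - ICC*)]
nj = icc_star * (1 - rl) / (rl * (1 - icc_star))
print(math.ceil(nj))  # round up to a whole number of judges -> 3
```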

        Use in other contexts. ICC is sometimes used outside the context of inter-rater reliability. In general, ICC is a coefficient which approaches 1.0 as the between-groups effect (the row effect) becomes very large relative to the within-groups effect (the column effect), whatever the rows and columns represent. In this way ICC is a measure of homogeneity: it approaches 1.0 when any given row tends to have the same values for all columns. For instance, let columns be survey respondents, let rows be Census block numbers, and let the attribute measured be white=0/nonwhite=1. If blocks are homogeneous by race, any given row will tend to have mostly 0's or mostly 1's, and ICC will be high and positive. As a rule of thumb, when the row variable is some grouping or clustering variable, such as Census areas, ICC will approach 1.0 more closely as the clusters become smaller and more compact (ex., as one goes from metropolitan statistical areas to Census tracts to Census blocks). ICC is 0 when within-groups variance equals between-groups variance, indicating that the grouping variable has no effect. Though less common, ICC can become negative when the within-groups variance exceeds the between-groups variance.


Assumptions

  • Additivity. Each item should be linearly related to the total score. Tukey's test of non-additivity, a choice under the Statistics button of the Reliability dialog in SPSS, tests the null hypothesis that there is no multiplicative interaction between the cases and the items. If this test is significant (<= .05) then there is multiplicative interaction. The Tukey significance is found in the "Nonadditivity" row of the "ANOVA with Tukey's Test for Nonadditivity" table in SPSS output.

    If Tukey's test shows multiplicative interaction, any model computing scores for cases based on the scale must include the case main effect, the item main effect, and the case-by-item interaction effect. In a footnote to the Tukey test output, SPSS prints an estimate of the power to which items in a set would need to be raised in order to be additive. (Warning: while transforms may eliminate non-additivity, raising item scores to too high a power will generate large values for all subjects, obscuring differences among subjects.)

    In SPSS, select Analyze, Scale, Reliability Analysis; click Statistics; check Tukey's test of additivity

  • Independence. Observations for one subject/case should be independent of observations for any other subject/case in any administration of the instrument. However, the fact that test-retest designs involve correlated data between administrations does not pose a statistical problem in assessing reliability and does not in itself violate assumptions of reliability analysis.
  • Uncorrelated error. Errors should be uncorrelated.
  • Consistent coding. High values must have the same meaning across items.
  • Random assignment of items. In split-half tests, random assignment of items to forms is assumed. Typically, odd-numbered items become one form and even-numbered items become the second form. Sequential assignment may introduce a subject fatigue factor with regard to the second form.
  • Equivalency of forms. In split-half tests, the two forms should be equivalent. A test of this is to see if the mean response is the same in the two groups. In split-half models, Hotelling's T2 is a multivariate test for equality of means between groups. A significant T2 means that the null hypothesis that means were equal can be rejected by the researcher. This test assumes multivariate normality of items. Hotelling's T2 is a choice under the Statistics button of the Reliability dialog in SPSS.
  • Equal variances. In split-half tests, the Spearman-Brown split-half reliability coefficient assumes the split halves have equal variances. The chi-square test of parallel models tests the null hypothesis that the variances are equal. If the chi-square significance is <= .05, then the researcher concludes the models are not parallel and that the variances differ significantly. In SPSS, select "Parallel" under the model button.
  • Similar difficulty of items. Internal consistency analysis using Cronbach's alpha assumes that the scale items all measure the same dimension equally (ex., an assortment of math problems of equal difficulty). However, if the scale is of the Guttman scale type, where higher items (ex., solving division problems) imply responses to lower items (ex., solving addition problems) but not vice-versa, internal consistency in the Cronbach's alpha sense is not expected, and Cronbach's alpha gives an inappropriate estimate of reliability.
  • Same assumptions as for correlation.
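The split-half quantities discussed above can be sketched numerically. The example below (hypothetical item matrix) forms odd- and even-item halves, correlates the half scores, and applies the Spearman-Brown split-half formula rSB = 2r/(1 + r):

```python
import numpy as np

# Hypothetical responses: 6 subjects (rows) x 4 items (columns).
X = np.array([
    [9., 2., 5., 8.],
    [6., 1., 3., 2.],
    [8., 4., 6., 8.],
    [7., 1., 2., 6.],
    [10., 5., 6., 9.],
    [6., 2., 4., 7.],
])

odd = X[:, 0::2].sum(axis=1)   # total score on odd-numbered items
even = X[:, 1::2].sum(axis=1)  # total score on even-numbered items

r = np.corrcoef(odd, even)[0, 1]  # correlation of the two half scores
sb = 2 * r / (1 + r)              # Spearman-Brown split-half coefficient
print(round(r, 3), round(sb, 3))  # ≈ 0.886 and 0.940
```

Note the Spearman-Brown step adjusts the half-test correlation upward to estimate the reliability of the full-length scale.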


Frequently Asked Questions

  • How is reliability related to validity?
      A measure may be reliable but not valid, but it cannot be valid without being reliable. That is, reliability is a necessary but not sufficient condition for validity.
  • How is reliability related to attenuation in correlation?
      Reliability is a form of correlation. Correlation coefficients can be attenuated (misleadingly low) for a variety of reasons, including truncation of the range of variables (as by dichotomization of continuous data; reducing a 7-point scale to a 3-point scale). Measurement error also attenuates correlation. Reliability may be thought of as the correlation of a variable with itself. Attenuation-corrected correlation ("disattenuated correlation") is higher than the raw correlation on the assumption that the lower the reliability, the greater the measurement error, and the higher the "true" correlation is in relation to the measured correlation.

      The Spearman correction for attenuation of a correlation: let rxy* be the corrected correlation of x and y; let rxy be the uncorrected (observed) correlation; then rxy* is a function of rxy and the reliabilities of the two variables, rxx and ryy:

      rxy* = rxy / SQRT(rxx * ryy)

      This formula will result in an estimated true correlation (rxy*) which is higher than the observed correlation (rxy), and all the more so the lower the reliabilities. Corrected r may be greater than 1.0, in which case it is customarily rounded down to 1.0.
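A minimal sketch of the Spearman correction, with hypothetical values for the observed correlation and the two reliabilities:

```python
import math

def disattenuate(rxy, rxx, ryy):
    """Corrected correlation rxy* = rxy / SQRT(rxx * ryy), rounded down to 1.0 if it exceeds 1.0."""
    return min(1.0, rxy / math.sqrt(rxx * ryy))

# Hypothetical: observed r = .30, reliabilities .70 and .80.
r_corrected = disattenuate(rxy=0.30, rxx=0.70, ryy=0.80)
print(round(r_corrected, 3))  # 0.401 -- higher than the observed .30
```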

      Note that use of attenuation-corrected correlation is the subject of controversy (see, for ex., Winne & Belfry, 1982). Moreover, because corrected r will no longer have the same sampling distribution as r, a conservative approach is to take the upper and lower confidence limits of r and compute corrected r for both, giving a range of attenuation-corrected values for r. However, Muchinsky (1996) has noted that attenuation-corrected reliabilities, being not directly comparable with uncorrected correlation, are therefore not appropriate for use with inferential statistics in hypothesis testing and this would include taking confidence limits. Still, Muchinsky and others acknowledge that the difference between a correlation and attenuation-corrected correlation may be useful, at least for exploratory purposes, in assessing whether a low correlation is low because of unreliability of the measures or because the measures are actually uncorrelated.


  • What is Cochran's Q test of equality of proportions for dichotomous items?
      Cochran's Q is used to test whether a set of dichotomous items split similarly, which is the same as testing whether the items have the same mean. If they test the same, items within the set might be substituted for one another. In the ANOVA output table for a set of dichotomous items, look at the "Between Items" row of the "Sig" column for Cochran's Q: if Sig(Q) <= .05, the researcher rejects the null hypothesis that all items display an equal split (have the same mean).

      In SPSS, select Analyze, Scale/Reliability; select your items; click Statistics; in the Descriptives area, select Item, Scale, Scale if Deleted; in Summarize, select summary statistics (Means, Variances, Covariances, Correlations); and in the ANOVA table group, select Cochran chi-square. Continue. OK.

      Cochran's Q is discussed further in the section on significance tests for more than two dependent samples.
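For readers computing outside SPSS, Cochran's Q can be formed directly from the row and column totals of the 0/1 response matrix, using the standard formula Q = (k - 1)[k*SUM(Cj^2) - (SUM(Cj))^2] / [k*SUM(Ri) - SUM(Ri^2)], referred to a chi-square distribution with k - 1 df. The response matrix below is hypothetical (rows = subjects, columns = dichotomous items).

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 0/1 responses: 8 subjects x 3 items.
X = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
])
k = X.shape[1]     # number of items
C = X.sum(axis=0)  # column (item) totals
R = X.sum(axis=1)  # row (subject) totals

Q = (k - 1) * (k * (C ** 2).sum() - C.sum() ** 2) / (k * R.sum() - (R ** 2).sum())
p = chi2.sf(Q, df=k - 1)
print(Q, p)  # Q = 7.6 on 2 df; p < .05, so the items do not split equally
```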

  • What is the derivation of intraclass correlation coefficients?

      Derivation of the ICC formula, following Ebel (1951: 409-411): Let A be the true variance in subjects' ratings due to the normal expectation that different subjects will have truly different scores on the rating variable. Let B be the error variance in subjects' ratings attributable to inter-rater unreliability. The intent of ICC is to form the ratio ICC = A/(A + B). That is, intraclass correlation is true inter-subject variance as a percentage of total variance, where total variance is true variance plus variance attributable to inter-rater error in classification. B is simply the mean-square estimate of within-subjects variance (variance in the ratings for a given subject by a group of raters), computed in ANOVA. The mean-square estimate of between-subjects variance equals k times A (the true component) plus B (the inter-rater error component), since each mean contains a true component and an error component.

      Given B = mswithin, and given msbetween = kA + B, substituting these equalities into the intended equation (ICC = A/[A+B]), the equation for ICC reduces to the formula for the most-used version of intraclass correlation (Haggard, 1958: 60):

      ICC = rI = (msbetween - mswithin)/(msbetween + [k - 1]mswithin)

      where

      • msbetween is the mean-square estimate of between-subjects variance, reflecting the normal expectation that different subjects will have truly different scores on the rating variable
      • mswithin is the mean-square estimate of within-subjects variance, or error attributed to inter-rater unreliability in rating the same person or target (row).
      • k is the number of raters/ratings per target (person, neighborhood, etc.) = number of columns. If the number of raters differs per target, an average k is used based on the harmonic mean: k' = (1/(n - 1)) * (SUM(k) - SUM(k^2)/SUM(k)).
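The formula above can be sketched directly from a ratings matrix. The example below (Python, hypothetical data: rows = targets/subjects, columns = raters) computes msbetween and mswithin by one-way ANOVA with subjects as groups, then forms the ICC:

```python
import numpy as np

# Hypothetical ratings: 5 targets (rows) each rated by 3 raters (columns).
X = np.array([
    [9., 8., 9.],
    [5., 6., 4.],
    [7., 7., 8.],
    [3., 2., 3.],
    [6., 5., 6.],
])
n, k = X.shape
grand = X.mean()

# One-way ANOVA: subjects are the groups, raters supply the replicates.
ms_between = k * ((X.mean(axis=1) - grand) ** 2).sum() / (n - 1)
ms_within = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))

# ICC = (msbetween - mswithin) / (msbetween + (k - 1) * mswithin)
icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(round(icc, 3))  # 0.916 -- high agreement among raters for these data
```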
  • What are Method 1 and Method 2 in the SPSS RELIABILITY module?
      These are two methods of computing the reliability coefficient (Cronbach's alpha) for a set of items thought to comprise a scale. Method 1 allows constant terms to remain in the scale, while Method 2 deletes constant terms. Method 2 also produces a standardized item alpha, as if data had been input in standardized form. Method 2 can be forced when using the syntax window by adding the clause METHOD=COV. Method 2 is the default in SPSS.


Bibliography

  • Armor, D. J. (1974). Theta reliability and factor scaling. Pp. 17-50 in H. Costner, ed., Sociological methodology. San Francisco: Jossey-Bass.
  • Bonett, Douglas G. (2002). Sample size requirements for estimating intraclass correlations with desired precision. Statistics in Medicine 21: 1331-1335.
  • Ebel, Robert L. (1951). Estimation of the reliability of ratings. Psychometrika 16: 407-424.
  • Fleiss, J. L., Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33: 613-619.
  • Graham, James M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. Educational and Psychological Measurement 66: 930-944.
  • Haggard, E. A. (1958). Intraclass correlation and the analysis of variance. NY: Dryden.
  • Landis, J. R., Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics 33:159-174. This article sets cut-offs for Cohen's Kappa.
  • Litwin, Mark S. (2002). How to assess and interpret survey psychometrics. The Survey Kit series, Vol. 8. Thousand Oaks, CA: Sage Publications. Covers test-retest, alternate-form, internal consistency, interobserver, and intraobserver reliability.
  • McGraw, K.O. and S.P. Wong (1996). "Forming inferences about some intraclass correlation coefficients," Psychological Methods 1(1): 30-46.
  • McKelvie, S. J. (1992). Does memory contaminate test-retest reliability? Journal of Gen Psychology 119(1):59-72. This article reports that reliability estimates under test-retest designs are not inflated due to memory effects.
  • McNemar, Q. (1969). Psychological Statistics. Fourth edition. New York: Wiley. Covers F tests for intraclass correlation (p. 322).
  • Muchinsky P.M. (1996) The correction for attenuation. Educational & Psychological Measurement 56(1), 63-75.
  • Nunnally, J. C. (1970). Psychometric Theory. Second ed., 1978. New York: McGraw Hill. Cited as a reference in support of the .70 cut-off for Cronbach's alpha. Classic on reliability in psychological and educational testing.
  • Raykov, Tenko (1997). Estimation of composite reliability for congeneric measures. Applied Psychological Measurement, 21, 173-184
  • Raykov, Tenko (1998). Coefficient alpha and composite reliability with interrelated nonhomogeneous items Applied Psychological Measurement, 22(4), 375-385.
  • Shrout, P.E., and J. L. Fleiss (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin (86): 420-428. Classic article on intraclass correlation.
  • SPSS (1988). SPSS-X User's Guide, Third ed.. Chicago, IL: SPSS Inc.
  • Walter, S. D.; Eliasziw, M. ; and Donner, A. (1998). Sample size and optimal designs for reliability studies. Statistics in Medicine 17: 101-110.
  • Winne, Philip H. & Belfry, M. Joan (1982). Interpretive problems when correcting for attenuation. Journal of Educational Measurement, 19(2), 125-134.
  • Zumbo, B. D.; Gadermann, A. M.; & Zeisser, C.. (2007). Ordinal versions of coefficients alpha and theta for likert rating scales. Journal of Modern Applied Statistical Methods, 6, 21-29.