## Friday, March 30, 2012

### Concordance, Correlation, Agreement -- Statistics

| Name | Description/Function | Stata Command |
|---|---|---|
| Lin's Concordance Correlation | According to Lin (1989), this index "evaluates the agreement between two readings (from the same sample) by measuring variation from the 45 degree line through the origin (the concordance line)."  Neither Lin nor the Stata Technical Bulletin (STB-43) insert suggests that this index can be used for categorical data, although neither explicitly forbids it. | `-concord-` |
| Cohen's Kappa Coefficient | Jacob Cohen's (1960) measure of inter-rater agreement.  A value of zero denotes the amount of agreement expected by chance alone and one denotes perfect agreement (values below zero indicate agreement worse than chance).  Although the statistic is grounded in assessing agreement between "raters," could it be adapted to establish agreement between items on a questionnaire/survey? | `-kap-`, `-kappa-` |
| Kendall's Coefficient of Concordance / Kendall's W / Friedman's Test | Calculates Friedman's non-parametric two-way analysis of variance and Kendall's coefficient of concordance.  One p-value is provided since the tests are equivalent, although Kendall's statistic may be easier to interpret since it is bounded by [0,1] and measures agreement between rankings.  It is unclear whether this test is suitable for ordinal variables. | `-friedman-` |
| Kendall's Rank Correlation / Kendall's Tau | Kendall's Tau-a and Tau-b are calculated; the only difference between the two is in their denominators.  Tau-a uses the total number of pairs whereas Tau-b incorporates the number of tied values (Tau-b will be larger if ties exist).  These statistics are closely related to Spearman's rho and don't assess agreement so much as independence.  According to Conover (1999, p. 323), Spearman's and Kendall's will produce nearly identical results in most cases, although Spearman's will tend to be larger in absolute value.  Since I'm more concerned with assessing agreement than independence -- rejection of an independence null is expected -- I question this test's applicability. | `-ktau-` |
| McNemar's Test (2x2); Bowker's Test (KxK) | For a 2x2 table the test reduces to McNemar's test; for a KxK table, Bowker's test for table symmetry and the Stuart-Maxwell test for marginal homogeneity are calculated.  The test assumes a 1-to-1 matching of cases and controls and is used to analyze matched-pair case-control data with multiple discrete levels of the outcome/exposure variable.  I'm not 100% sure whether this test is suitable for what I need, although if I can frame it such that the instrument items are case and control, respectively, and the symmetry and marginal homogeneity tests are non-significant, then it would suggest that a subject's responses to two items aren't different.  Need to investigate this possibility. | `-symmetry-` |
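Of the candidates above, Cohen's kappa is the easiest to compute by hand, which makes it a useful sanity check against Stata's `-kap-` output.  Here's a minimal pure-Python sketch; the two five-subject response vectors are hypothetical, not data from my study.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's (1960) kappa for two raters/items over the same n subjects.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion
    of exact agreement and p_e is the agreement expected by chance,
    computed from each rater's marginal category frequencies.
    """
    n = len(ratings_a)
    assert n == len(ratings_b) and n > 0
    # Observed proportion of exact agreement.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from the marginal frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical survey items scored by the same five subjects.
item1 = ["agree", "agree", "neutral", "disagree", "agree"]
item2 = ["agree", "neutral", "neutral", "disagree", "agree"]
print(round(cohens_kappa(item1, item2), 4))  # 0.6875
```

Here the observed agreement is 4/5 = 0.8 while chance agreement from the marginals is 0.36, giving kappa = (0.8 - 0.36) / (1 - 0.36) = 0.6875.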

The research into a suitable method for assessing agreement between two items on a survey/questionnaire hasn't been as straightforward and unambiguous as I'd hoped.  (Although if it were then perhaps the Ph.D. wouldn't be nearly as masochistic?)  Per a search of the literature, the Stata help files, and the Stata listserve, I've identified some test statistics that are good candidates for what I need.  I figure placing them in a table along with brief descriptions will aid in identifying which, if any, is most appropriate (this is, of course, assuming that the agreement/equivalence aspect of my research remains in place).  There are also graphical methods of assessing categorical, ordinal agreement, which I'll present in a forthcoming post.

## Thursday, March 29, 2012

### Non-Clinical Equivalence?

Determining whether two measures are equivalent is a tricky thing in statistics.  With a standard hypothesis test, the null hypothesis (Ho) is usually one of no effect or no association.  The alternative hypothesis (Ha) is the converse:  existence of an effect or the presence of an association.  In a two-sample case involving continuous data, for example, the null hypothesis is generally framed as testing whether the difference between the two samples is zero.  The alternative hypothesis -- if it is two-sided -- is that the difference is not zero.  Rejection of the null indicates that the difference is not zero and is large enough to not be attributable to chance, whereas failure to reject suggests that the parameters being compared may be equal (or aren't different).  What failure to reject doesn't provide, however, is proof-positive that the parameters are equal.  What happens, then, if we want to establish equality, rather than difference, between two measures or parameters?  Well, technically you can't.  Friedman, Furberg, and DeMets put it best in their very readable *Fundamentals of Clinical Trials* (3rd ed., p. 118):
> The problem in designing positive control studies is that there can be no statistical method to demonstrate complete equivalence.  That is, it is not possible to show [delta]=0.  Failure to reject the null hypothesis is not sufficient reason to claim two interventions to be equal but merely that the evidence is inadequate to say they are different.
They go on to state that even though you can't demonstrate complete equivalence, one approach is to designate a value for delta such that intervention(s) with differences less than the specified value might be indicative of equivalence.  I've never been involved with a clinical trials equivalence study so I doubt that I'm qualified to write much more in that regard but in my dissertation research, I'm facing a similar problem.  At least I think it's a similar problem.  Or maybe it really isn't a problem but I'm creating one.  Either way, I'm stumped.
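The delta-margin idea Friedman et al. describe is commonly formalized as "two one-sided tests" (TOST): flip the usual hypotheses so the null is |difference| >= delta, and reject it only if the difference is demonstrably inside (-delta, +delta).  Below is a minimal sketch using a normal approximation; the observed difference, standard error, and margin are hypothetical numbers chosen purely for illustration.

```python
import math

def _norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tost_equivalence(diff, se, delta, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence, normal approximation.

    Null: |true difference| >= delta.  Both one-sided nulls must be
    rejected to conclude equivalence; the reported p is the larger of
    the two one-sided p-values.
    """
    z_lower = (diff + delta) / se          # tests difference > -delta
    z_upper = (diff - delta) / se          # tests difference < +delta
    p_lower = 1.0 - _norm_cdf(z_lower)
    p_upper = _norm_cdf(z_upper)
    p = max(p_lower, p_upper)
    return p, p < alpha

# Hypothetical: observed mean difference 0.1 (SE 0.2), margin delta = 0.5.
p, equivalent = tost_equivalence(diff=0.1, se=0.2, delta=0.5)
print(round(p, 4), equivalent)
```

With these made-up numbers both one-sided nulls are rejected at alpha = 0.05, so the difference is demonstrably smaller than the pre-specified margin -- exactly the "indicative of equivalence" standard described above.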

But how to establish "equivalence" in a non-clinical setting between an ordinal variable (confidence question) and either a series of ordinal questions (the nine reasons for missing medications) or the summary score derived from the reasons questions?  One approach -- and this is perhaps the most frequently used approach -- is to correlate the two measures via either Pearson's or Spearman's correlation coefficients.  The problem with assessing equivalence by way of a correlation coefficient is that what it really reveals is degree of linear association ("how well are the measures related?") rather than agreement ("how well do the methods/measures agree?").  A few academics (e.g. Lin, Bland, Altman, etc.) have published and implemented methods for assessing agreement/concordance but I have yet to find anything that is perfectly suited for my task.  All of the methods I've looked into seem appropriate in one way yet inappropriate in another, including the Bland-Altman plot, Lin's concordance correlation for agreement, Cohen's kappa coefficient, Kendall's coefficient of concordance, Kendall's tau, McNemar's test, and Bowker's test.  I've mulled over each of these and I'm still unsure which, if any, is best suited for establishing "equivalence" between two nominal variables.  In order to flesh out my thinking and, hopefully, arrive at a decision on which is best for my analysis, I'm going to present and briefly discuss each of the above in a future blog post, since this post is already longer than any random reader should be subjected to.
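The association-versus-agreement distinction is easy to demonstrate with a toy example.  Below is a pure-Python sketch (the data are hypothetical): when one item always scores exactly two points higher than the other, Pearson's correlation is a perfect 1.0 even though the two items never once agree.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical responses: item2 is always exactly 2 points above item1.
item1 = [1, 2, 3, 4, 5]
item2 = [3, 4, 5, 6, 7]

r = pearson_r(item1, item2)  # perfect linear association
exact_agreement = sum(a == b for a, b in zip(item1, item2)) / len(item1)
print(r, exact_agreement)    # prints 1.0 0.0
```

This is precisely the scenario the Bland-Altman plot was designed to expose: the correlation says the measures are perfectly related, while an agreement statistic would reveal the systematic two-point offset.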

## Wednesday, March 7, 2012

### Mechanics of Reading a Scientific Paper

When trying to determine what a paper is about, Greenhalgh emphasizes that a paper should be 'trashed' because of its methods, not its results.  Given the emphasis on the methods, then, three preliminary questions should initiate the appraisal:
1. What was the research question -- and why was the study needed?  This should be clearly stated somewhere in the first few paragraphs of the paper.
2. What was the research design?  The type of design has implications for the statistical analyses used (if any), conclusions, and rigor of the paper.
3. Was the research design appropriate to the question?  Not all research questions require a randomized controlled trial (RCT).
In this chapter, Greenhalgh also briefly discusses each of the research designs common to scientific papers then assigns them a place in the "hierarchy of evidence" with those at the top commanding the most weight and influence re: clinical interventions.  Aside from placing systematic reviews/meta-analyses at the top (particularly helpful in EBM), I think most other disciplines would report a similar hierarchy:
1. Systematic reviews and meta-analyses.
2. RCTs with definitive (i.e. statistically significant) results.
3. RCTs with non-definitive (i.e. suggestive but not statistically significant) results.
4. Cohort studies.
5. Case-control studies.
6. Cross-sectional surveys.
7. Case reports.
In the methodological quality chapter, assessment relies on five key questions:
1. Was the study original?  Does it duplicate previous research or add something new to the literature?
2. Whom is the study about?  How were subjects recruited and what were the inclusion/exclusion criteria?
3. Was the design of the study sensible?  What and how were the outcomes measured?
4. Was systematic bias avoided or minimized?  Study adequately controlled?  Was assessment 'blind'?
5. Were preliminary statistical questions addressed -- how many subjects enrolled, duration of follow-up, and completeness of follow-up?
The chapter on statistics is intended for non-statisticians but I still found it helpful and amusing (especially the 'advice' on how to cheat on statistical tests when writing up results).  I've written about this in a previous post and won't repeat it here since what Ben Goldacre wrote relied largely on what Greenhalgh wrote.  Suffice to say, Greenhalgh breaks down the most common statistical analyses used and their interpretation such that even the most timid statistically-averse researcher can make sense of the results.

Even after I've long since finished slogging through all (most?) of the articles collected for my lit review, I suspect this book will still sit prominently on my bookshelf.