Comparing two datasets
Like a home inspector in the world of real estate, a data scientist employed in the area of care standardization has no friends among physicians. To avoid being hit from behind with a baseball bat on the way to the car after a long day at the office, he or she must ensure that any potentially damning conclusions reached during that long day are statistically sound. Here is how:
- identify potential outliers;
- attempt to explain and clean out spurious outliers, e.g., convert dates formed using two-digit years to proper YYYY dates;
- remove the remaining outliers if you must, or incorporate them into your dataset;
- backfill missing data;
- identify and prune datasets that are not statistically significantly different from each other, e.g., physicians’ pharmacy charges where the hypothesis that two providers charge substantially identical amounts cannot be rejected at the 5% level.
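The cleaning steps in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Tukey-fence rule with a multiplier of 1.5, the pivot year of 30 for two-digit dates, the median backfill, and the sample `charges` values are all hypothetical choices made for the example, not methods prescribed by the text.

```python
from statistics import median, quantiles

def flag_outliers(values, k=1.5):
    """Return indices of values outside the Tukey fences Q1 - k*IQR .. Q3 + k*IQR."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]

def expand_year(two_digit, pivot=30):
    """Map a two-digit year to four digits (assumed rule: below pivot means 20xx)."""
    return 2000 + two_digit if two_digit < pivot else 1900 + two_digit

def backfill_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

# Hypothetical pharmacy charges: one missing value and one suspicious entry.
charges = [120.0, 135.0, None, 128.0, 131.0, 9999.0, 126.0, 133.0, 124.0, 129.0]

suspects = flag_outliers(charges)   # candidates to explain before removing
# Here the unexplained outlier is dropped; keeping it is the other option.
cleaned = [v for i, v in enumerate(charges) if i not in suspects]
filled = backfill_median(cleaned)

print(suspects)
print(expand_year(7), expand_year(85))
print(filled)
```

Flagging returns indices rather than deleting values, so each suspect can be inspected and explained first, in keeping with the order of the steps above.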
There are several tests for determining whether the difference between two or more datasets can be viewed as statistically significant. A summary (parts reproduced by permission) is presented in Table A.1.
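To make the 5% decision rule concrete, here is a minimal sketch of one such test. A two-sided permutation test on the difference of means is used only because it needs nothing beyond the Python standard library; it is a stand-in chosen for this example and is not necessarily among the tests the table lists. The function name, the resample count, and the provider charge figures are hypothetical.

```python
import random

def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Estimate a two-sided p-value for the difference of means by reshuffling labels."""
    rng = random.Random(seed)            # fixed seed for reproducible results
    n_a = len(a)
    pooled = list(a) + list(b)

    def mean_gap(xs):
        left, right = xs[:n_a], xs[n_a:]
        return abs(sum(left) / len(left) - sum(right) / len(right))

    observed = mean_gap(pooled)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)              # random reassignment of provider labels
        if mean_gap(pooled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical pharmacy charges for three providers.
provider_a = [120, 135, 128, 131, 126, 133, 124, 129]
provider_b = [122, 134, 127, 130, 125, 132, 126, 128]
provider_c = [180, 195, 188, 191, 186, 193, 184, 189]

p_same = permutation_p_value(provider_a, provider_b)
p_diff = permutation_p_value(provider_a, provider_c)

# At the 5% level, a pair whose p-value is not below 0.05 is a candidate for pruning.
print("A vs B distinguishable:", p_same < 0.05)
print("A vs C distinguishable:", p_diff < 0.05)
```

With these made-up figures, providers A and B cannot be told apart at the 5% level, while A and C can; only the second comparison would support a conclusion about differing charges.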
Table A.1: Statistical tests used to confirm the statistical significance of differences between datasets