“. . . the solutions to our problems lie outside the box.”

Aviation Week & Space Technology, July 1975


This section consists of a number of classroom, lab, and workplace applications developed between the mid-1990s and 2004.  Some of these were applied to a series of lectures and lab sessions developed and delivered in a college setting.  In some cases the lab experience and the work experience were merged, with lab processes used to test research methodologies that were later used to study regional and state public health information.

During a series of GIS Lab presentations for population health classes developed in Portland, Oregon and continued in Denver, Colorado, methods for researching the health of a theoretical population were developed for a population with N = 100,000,000.  All population metrics were standardized, small n’s were slightly modified and changed to random whole-number distributions that keep correct neighbor-to-neighbor relationships when graphed, and all results were presented using this standardized N series.  The lessons learned from using old NIH and state datasets helped define the processes developed and repeatedly tested over the years.  By 2005, this methodology for studying “surfaces” or “topography” and topology had been improved significantly and given new applications in researching and displaying human population features and health.  This method has universal applications and can be used to display events, cost, numbers of activities (e.g. purchases of specific items), etc. over age, gender, ethnicity, space and time, involving any or all of the typical independent metrics (e.g. individual census statistics) made available to data users by the data provider.
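The standardization step described above can be sketched roughly as follows. This is a minimal illustration, not the procedure used in the lab sessions: it assumes the raw data are integer counts per group, and it uses largest-remainder rounding (a technique chosen for this sketch) so that the rescaled counts stay whole numbers and sum exactly to N = 100,000,000. The function name and the sample counts are hypothetical.

```python
STANDARD_N = 100_000_000  # the standardized population size named in the text


def scale_to_standard_n(counts, standard_n=STANDARD_N):
    """Rescale non-negative integer counts so they sum exactly to standard_n.

    Largest-remainder rounding is an assumption of this sketch; it keeps
    every result a whole number and preserves the group proportions as
    closely as integers allow.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    # Exact integer quotient and remainder of count * standard_n / total.
    parts = [divmod(c * standard_n, total) for c in counts]
    scaled = [quotient for quotient, _ in parts]
    shortfall = standard_n - sum(scaled)
    # Hand the leftover units to the groups with the largest remainders.
    order = sorted(range(len(counts)), key=lambda i: parts[i][1], reverse=True)
    for i in order[:shortfall]:
        scaled[i] += 1
    return scaled


# Hypothetical counts for four groups (not taken from any real dataset).
print(scale_to_standard_n([1200, 3400, 2900, 2500]))
# [12000000, 34000000, 29000000, 25000000]
```

Working in integer arithmetic avoids floating-point drift, which matters when the rescaled values must be whole numbers that still add up to the standardized N.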

The development of this research tool began in winter 1999/2000 as part of a Pacific Northwest demographic study, but it was not tested until population data could be acquired for the GIS cancer research projects and the related grant proposals.  That work commenced later in 2000, was renewed in 2001, and was extended through other corporate, regional and institutional grants in 2001 and 2002.

Populations with N between 100,000 and 250,000 can be evaluated using this technique, but if the average difference in moving-window age group values is <0.65, the two populations are too different for reliable evaluations to be performed.  Generally speaking, however, the data for such groups can be normalized, or the age groups adjusted using the standard age-gender adjustment techniques applied in epidemiological studies.  Groups can also be reconfigured to eliminate the random distribution problem that can appear (highs and lows for specific age bands).  For example, a 2-year age band used for HEDIS evaluations, with a total N for the program(s) >45,000, can be rewritten to evaluate 5-year age bands.  The formulas are the same, but the moving-window ranges demonstrated in the lab setting have to be modified (reassign the theoretical N used in calculations).
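The regrouping of age bands described above might look something like the sketch below. It assumes counts are available by single year of age, since 2-year bands cannot be merged cleanly into 5-year bands without going back to finer data. The function names and the sample counts are hypothetical, and the simple moving-window mean shown here is only a stand-in for the moving-window calculations used in the lab setting.

```python
def rebin_age_counts(single_year_counts, band_width):
    """Sum single-year-of-age counts into consecutive bands of band_width years.

    The last band is shorter if the number of ages is not a multiple of
    band_width.
    """
    if band_width < 1:
        raise ValueError("band_width must be at least 1")
    return [
        sum(single_year_counts[start:start + band_width])
        for start in range(0, len(single_year_counts), band_width)
    ]


def moving_window_mean(values, window):
    """Mean of each run of `window` consecutive values."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical event counts for ages 0 through 9.
by_age = [12, 9, 15, 8, 14, 11, 13, 7, 16, 10]

print(rebin_age_counts(by_age, 2))  # [21, 23, 25, 20, 26]
print(rebin_age_counts(by_age, 5))  # [58, 57]
```

In this made-up example the 2-year bands swing between 20 and 26 while the 5-year bands are nearly equal, which is the smoothing of highs and lows that wider bands are meant to provide.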

For any study involving large masses of people, it is important to note that “regression to the means” rules apply.  The larger the population, the more likely your population represents the national standards.  Given a national population of 350 million, a sampling of 17% of that population or more is likely to represent the entire population’s features.  Values for small, narrow age bands are of course going to differ; after all, the national absolute or truest average is only one value, and it is unlikely that any sampling of an entire group is ever going to produce identical values for all parts of that group.  However, it is highly unlikely that the actual average reported is going to be very different once the final results are produced, and more importantly different to such an extent that the differences are statistically significant.  It is very hard, if not impossible, to mistakenly select a large chunk of outliers (millions of outliers) when tens of millions of lives are being assessed.

Another way to view this relationship is to imagine the shape of a bell curve representing the entire population, and to place a second bell curve 1/6th the size or larger within the total population curve.  That curve cannot be placed in as many positions as a curve generated by a sampling of just 5% or less.  Moreover, it becomes increasingly difficult to select a sample that does not result in a true bell curve as your sample size gets larger.  This means that whenever 16.7% (1/6) of the total population is assessed, the bell curve of results leaves essentially no room for another bell curve to be generated that is significantly off.  Such a curve would have figures outside the perimeter of the original curve.  In other words, the 17% bell curve has only a few fits that are appropriate within the 100% population bell curve in terms of shape, size and form.  This is the application of the regression to the means standard used to evaluate the statistics of exceptionally large datasets.
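One way to put rough numbers on the bell-curve picture above is the textbook standard error of a sample mean for simple random sampling without replacement, which includes a finite population correction. This is a sketch under assumptions the text does not state: the sample is treated as a simple random sample, and the spread `sigma` is a hypothetical value chosen only for illustration. Only the 350 million population and the 5% and 17% sampling fractions come from the text.

```python
import math


def standard_error_of_mean(sigma, n, population_n):
    """Standard error of a sample mean, simple random sampling without replacement.

    SE = sigma / sqrt(n) * sqrt((N - n) / (N - 1)), where the second factor
    is the finite population correction.
    """
    if not 0 < n <= population_n:
        raise ValueError("n must be between 1 and population_n")
    if population_n < 2:
        raise ValueError("population_n must be at least 2")
    fpc = math.sqrt((population_n - n) / (population_n - 1))
    return sigma / math.sqrt(n) * fpc


NATIONAL_N = 350_000_000  # national population used in the text
SIGMA = 1.0               # hypothetical population standard deviation

for fraction in (0.05, 0.17):
    n = int(NATIONAL_N * fraction)
    se = standard_error_of_mean(SIGMA, n, NATIONAL_N)
    print(f"sample of {fraction:.0%}: n = {n:,}, standard error = {se:.6f}")
```

Under these assumptions the standard error falls as the sampled fraction grows and reaches zero when the whole population is included, which matches the idea that a large sample's curve has fewer places to sit inside the full population's curve.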

The regression to the means standard implies that population data from large research samples become very stable as N is increased.  It also means that for statistically significant differences to be generated by a new study, the general rule to follow is that a doubling of the original population size is normally required before noticeable statistical changes in the numbers appear.  This is the reason HEDIS/NCQA studies are performed using numbers that seem “unrealistic”, with 540 usually the maximum data pull (up to 40 of these are selected in case unanticipated ineligibility arises weeks later when the final analysis is performed).  Based on reviews of how HEDIS-like data behave for public health statistics, it was found that for randomly generated datasets based on people’s behaviors (the human error component) and IT systems programming behaviors (internal or systems error inducers), an average score of, for example, 9.735 for a study involving 30M people might change by only a tenth if that 30M is increased to 60M people tested.  Such an effort is typically not worth the time for the rerun.
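The diminishing return from doubling a sample can be illustrated with the usual normal-approximation margin of error for a proportion. This is only a sketch: the base sample of 500 is an illustrative figure loosely suggested by the 540 maximum pull with up to 40 spares, the rate of 0.5 is hypothetical, and the 95% normal approximation is an assumption of this example rather than a HEDIS/NCQA specification.

```python
import math

Z_95 = 1.96  # normal-approximation multiplier for a 95% interval


def margin_of_error(p, n, z=Z_95):
    """Approximate margin of error for an observed proportion p with sample size n."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


BASE_N = 500  # illustrative sample size, not a HEDIS requirement
RATE = 0.5    # hypothetical measure rate (the worst case for this formula)

for n in (BASE_N, 2 * BASE_N, 4 * BASE_N):
    print(f"n = {n}: margin of error = {margin_of_error(RATE, n):.4f}")
```

Because the margin shrinks with the square root of n, doubling the sample narrows it by a factor of about 0.71 under this approximation, and halving it takes a fourfold increase.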

This study was supported and ‘passively’ supervised (for security purposes more than methodology QA purposes) by the former Perot Systems (now Dell Perot) and by internal IPA and IT monitoring groups from 2004 to 2005.  All reports generated since then follow previously agreed upon institutional and federal program IP and PHI rights and regulations.  Datasets have been slightly modified (the N = 100,000,000 standard) without modifying statistical outcomes, but all corrections of values must produce new numbers that are integers in order to assure realistic outcomes.  Data sources have been renamed and/or given a unique theoretical identifier of data content.  Only age and gender identifiers are presented in the original format.

