Note: this page and neighboring pages are from older teaching materials used for a lab on GIS and the corresponding lecture/discussion series I developed on ‘GIS, population health surveillance, epidemiology and public health’.


Making Sense of Numbers

The number of patients (N), the age-gender adjusted frequencies (prevalence), the number of times patients file claims for particular services or care-related needs (claims), and the unit costs for receiving this care (cost) are all values that can be used to determine the demands and needs of care for a particular place or system.

For a single area on a grid cell map, when N or the number of patients is high, cost is high.  For that same map, if we review an area with a typical or even average population distribution, when prevalence is high, cost is high.  Again, reviewing that same map, for areas with similar N’s or prevalence rates, when claims are high, cost is also high.  Cost is the major concern driving the current big data work out there: cost and its equivalent, the potential sources of revenue or income.  In medicine, when cost is high, we tend to frown upon the causes, especially if these causes are due to too much passivity in designing, implementing and paying for effective prevention plans.  Costs for care are also frowned upon when active business-related decisions are the reason for the high dollar amount, that is, when a company allows such costs as a means of making more profit with each of its products.

This means that N, prevalence and claims, each of which has specific costs attached, are all effective ways to monitor a program’s activities, failures and successes in health.  If we review these factors for all age ranges and all relevant demographic variables such as ethnicity and race, we end up with a way to interpret health care quality based on its cost to consumers, its cost to the industry and its cost to the insurance agent.  There are different age-time relationships for each of these factors, which in turn allow us to develop a fairly simple tool for using these four factors to quantify the value, meaning and effectiveness of a health care program.  Whereas evaluating cost alone from one region to the next can sometimes appear to produce important results, it is more often the case that cost alone cannot be used to judge the health care for a given area, the quality of care being provided for a given region, or the effects that particular demographic features have on the cost of health care.  Likewise, we cannot use N or prevalence alone to quantify the entire system of health maintenance, prevalence and health care for a given region.  Yet we often depend upon incidence and prevalence when making our next health care practice decisions, such as where to cut spending for fertility care, where to put more money into immunization programs, and where to direct the bulk of our money to prevent certain sexually transmitted diseases from impacting particular communities and forms of living.

It is possible to map each of these metrics separately and analyze them to define overall meaning for a given setting.  In the case of disease mapping, we tend to map each of these independently, but rely mostly upon prevalence or total cost when making decisions.  When making these decisions, administrators juggle high cost on one set of results against high prevalence on another, using this logic to define where and how to intervene in the best possible way.  This behavior adds opportunities for human error that can prevent the success of such programs.  Subjectivity, especially when tied to just one decision-making algorithm, has little value at the systems level.  It fails to take other reasons for cost-related problems into account.

Claims is a very useful measure when analyzing cost because it is the variable with the most direct relationship to cost.  For a given claim type, a given amount is charged to the patient and to the insurance agency, and the percentage of that overall billed cost that will be covered by a health insurance program is known.  Whereas prevalence for certain health issues, conditions or diseases can be very low at certain points in life, claims can in turn be very high, due simply to the exclusion and diagnostic processes involved in preventive health care.  Exceptionally high numbers of claims for certain diagnostic rule-out processes, for example, can be very costly, and therefore indicate that age and risk related to lifestyle and presentation are the reasons for high cost in such measurements, not high incidence or prevalence.  This makes assessing prevalence much less important to cost analysis for many problems, such as high-cost age-related peaks in fractures, or age-related needs for the diagnostic steps required to rule out certain pulmonary diseases, such as infection or asthma.

An exceptionally high cost related to claims can sometimes be a necessary cost because of its link to long-term prevention behaviors, such as opting for a 45 dollar immunization instead of waiting for the several thousand dollars’ worth of high-cost care needed for that disease should the patient have avoided getting that immunization.  Claims help us better understand cost because claims are inherently related to cost, and the more of them there are in the patient’s history, the higher the cost for care will be, even if the patient doesn’t ever get any sicker.  But increasing claims usually show an exponential behavior that kicks in later in life, meaning that as survival decreases for a certain disease, those who remain alive begin to cost the system more and more due to long-term effects of the disease.  Claims have a direct relationship to the overall cost of care for a patient as he/she gets older, because he/she has to default to more medication use, more visits, more diagnostic services and more palliative services.  Claims may be very large in number at a very young age, at which time they have one relationship to overall cost for care, whereas once the peak age for a given diagnosis is reached, cost tends to increase as numbers of cases decrease, due directly to the increased need for care and the concern for complications and comorbidities.  Claims go up as one gets older, and so does cost, which rises exponentially over the fairly linear or smoothly curved change in numbers of claims filed as people get older.

By independently mapping each of these, we can develop an equation that relates the data to cost, first as a total formula and then for each individual measure independently.

For this reason, the following formula can be used to map the relationships N, prevalence and claims have for cost.  Prevalence corrects for age at every level of age independently, and automatically puts age into this equation as well.  Factors related to race or ethnicity can also be added to this equation, as their own unique multipliers.  This means that the following formula is useful in analyzing overall cost for care related to the above features, for a given mapped object depicting these results:

N x Prev. x Claims

And the following formula can be used to score areas in terms of cost per areal unit:

N x Prev. x Claims x LogCost

To merge these into an effective tool, we can use either of the following two formulas (using both is preferred):

Log(N x Prev. x Claims) x LogCost

Log (N x Prev. x Claims x LogCost)
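
The formulas above can be sketched per grid cell in Python.  This is a minimal illustration, not the original implementation: the function names, the default base of 10, and the sample values are assumptions, and every input is assumed to be positive (with cost above 1, so that LogCost is itself positive).

```python
import math

def log_b(x, base=10.0):
    """Logarithm of x in the given base (base 10 is an assumed default)."""
    return math.log(x) / math.log(base)

def raw_score(n, prev, claims):
    """N x Prev. x Claims"""
    return n * prev * claims

def cost_score(n, prev, claims, cost, base=10.0):
    """N x Prev. x Claims x LogCost"""
    return raw_score(n, prev, claims) * log_b(cost, base)

def log_times_logcost(n, prev, claims, cost, base=10.0):
    """Log(N x Prev. x Claims) x LogCost"""
    return log_b(raw_score(n, prev, claims), base) * log_b(cost, base)

def log_of_product(n, prev, claims, cost, base=10.0):
    """Log(N x Prev. x Claims x LogCost)"""
    return log_b(cost_score(n, prev, claims, cost, base), base)

# Hypothetical cell: N = 1,000, prevalence = 0.1, claims = 10, unit cost = $100
cell = dict(n=1000, prev=0.1, claims=10, cost=100)
print(round(log_times_logcost(**cell), 3))  # log10(1000) * log10(100)
print(round(log_of_product(**cell), 3))     # log10(1000 * log10(100))
```

Computing both log forms side by side, as the text prefers, makes it easy to see where the two scores rank cells differently.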

By correlating these grid maps of a large region to the demographic data for that region, a more concise way of comparing costs to demographics is produced.

To engage in this using GIS or a Grid mapping approach, one simply overlays the maps to produce the end result.
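
In raster terms, an overlay of this kind is a cell-by-cell calculation across aligned grids.  A rough sketch follows, assuming four grids of identical shape stored as nested lists, with None marking cells that have no data (both conventions are assumptions made for illustration, as are the function name and the sample values).

```python
import math

def overlay(n_grid, prev_grid, claims_grid, cost_grid, base=10.0):
    """Cell-by-cell Log(N x Prev. x Claims) x LogCost over aligned grids."""
    result = []
    for n_row, p_row, c_row, k_row in zip(n_grid, prev_grid, claims_grid, cost_grid):
        out_row = []
        for n, p, c, k in zip(n_row, p_row, c_row, k_row):
            # Cells with missing or non-positive inputs get no score
            # (the logarithm is undefined there).
            if any(v is None or v <= 0 for v in (n, p, c, k)):
                out_row.append(None)
                continue
            out_row.append(math.log(n * p * c, base) * math.log(k, base))
        result.append(out_row)
    return result

# Hypothetical 2 x 2 study area
n_grid      = [[1000, 200], [None, 50]]
prev_grid   = [[0.10, 0.05], [0.20, 0.02]]
claims_grid = [[10, 4], [3, 2]]
cost_grid   = [[100, 1000], [500, 10]]
scores = overlay(n_grid, prev_grid, claims_grid, cost_grid)
for row in scores:
    print(row)
```

A GIS package would do the same thing with its raster calculator; the nested loop only makes the per-cell logic explicit.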

Other multipliers that can be added to this methodology are:

[Log(N x Prev. x Claims) x LogCost]^2

[Log (N x Prev. x Claims x LogCost)]^2


[Log(N x Prev. x Claims) x LogCost]^3

[Log (N x Prev. x Claims x LogCost)]^3
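
A quick sketch of what the squared and cubed variants do to a score, taking Log(N x Prev. x Claims) x LogCost as the starting point.  One property worth checking on real data: when N x Prev. x Claims falls below 1, its log is negative, so squaring discards the sign while cubing keeps it.  The function name and values are hypothetical.

```python
import math

def powered_score(n, prev, claims, cost, power=1, base=10.0):
    """[Log(N x Prev. x Claims) x LogCost] raised to the given power."""
    score = math.log(n * prev * claims, base) * math.log(cost, base)
    return score ** power

# Sparse hypothetical cell: N x Prev. x Claims = 0.1, so its log is negative
sparse = dict(n=5, prev=0.01, claims=2, cost=100)
for p in (1, 2, 3):
    print(p, round(powered_score(**sparse, power=p), 3))
```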

For certain analyses, the log base can be changed.  Log10, for example, tends to be less precise with spatial relationships between exponentially changing cost values.  Log5 or Log6 has been tried and seems to produce smoother results in terms of spatial representation and “truth.”
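
To experiment with different bases, a change-of-base helper is enough.  Since log_b(x) = ln(x) / ln(b), switching the base rescales every LogCost value by the same constant factor, which widens or narrows the numeric range of the scores.  The helper name and the cost values below are hypothetical.

```python
import math

def log_cost(cost, base=10.0):
    """LogCost in an arbitrary base via the change-of-base identity."""
    return math.log(cost) / math.log(base)

costs = [45, 3_000, 250_000, 6_500_000]  # hypothetical dollar amounts
for base in (10, 6, 5):
    print(base, [round(log_cost(c, base), 2) for c in costs])
```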

The reason LogCost is always used, rather than Cost or adjusted raw cost data, is that cost variance is too broad in the health care system, due to the nature of the paying methods.  To reduce the great degree of heteroscedasticity that cost values always present, LogCost modeling methods are used instead.  This is of greater value than the often-seen approach of treating the 99th percentile as high and the 1st percentile as low.

Applying the above formulas to a grid map, we can define risk areas as those that are either:

  1. High risk due to high number or density of people in need of care or service within the given research grid cell area
  2. High risk due to high rates of sickness or prevalence
  3. High risk due to high numbers of events triggered (visits resulting in claims) for the metric being evaluated
  4. High “risk” (as a company, loss of profits may be considered a risk) due just to high cost, regardless of the other metrics.
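
One way to sketch the four categories is to flag each grid cell on every metric that exceeds a chosen cutoff.  The cutoffs, field names, and labels below are hypothetical; in practice the cutoffs would come from the distribution of each metric across the study area.

```python
def risk_types(cell, cutoffs):
    """Return the risk categories (1-4 above) that a cell falls into."""
    labels = {
        "n": "high N or density",
        "prev": "high prevalence",
        "claims": "high number of events (claims)",
        "cost": "high cost",
    }
    return [labels[k] for k in ("n", "prev", "claims", "cost") if cell[k] > cutoffs[k]]

cutoffs = {"n": 500, "prev": 0.08, "claims": 6, "cost": 50_000}  # assumed values
cell = {"n": 1200, "prev": 0.03, "claims": 9, "cost": 2_000}     # hypothetical cell
print(risk_types(cell, cutoffs))
```

A cell can land in more than one category, which is why the function returns a list rather than a single label.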

Examples of how each of these four apply are as follows.

High risk due to high numbers or density of people pertains to dense population areas, where regardless of low cost, the system is stressed in terms of manpower (staff/clinicians), available product (open visits, available medications), and product demand (patients).  In places where childhood upper respiratory disease is prevailing, for example, there is a rise in demand for visits, possibly resulting in overprescribing of medications (URIs in children <3 yo normally should not have antibacterials prescribed).  Similar problems might exist in urban settings with old households and unhealthy living environments for individuals with asthma or bronchitis.

High risk due to high rates of sickness or prevalence pertains to cases where density isn’t the only factor contributing to rises in cases.  In this setting an increase is due to a small and localized infectious disease event being passed around just the right kind of social setting, where aggregation or grouping of people occurs: for example, a school setting spreading mumps to unvaccinated children, a case of food poisoning spread by poor storage at the local store, or an infectious hepatitis case putting others at risk within a local restaurant setting.

High risk due to high numbers of events is much like the first type, but has the additional factor of human behaviors becoming involved.  Two recent examples of this in the news pertain to cases of Tourette’s Syndrome or Tics reported by mid-age people; these conditions prevail in children under the age of 17, yet the predominant numbers of cases during this new period were people in their 20s to 40s.  Similarly, people experiencing certain culturally-bound conditions will present at times in this manner.  Parents overly concerned about their child having cancer due to a local industry polluting the air or water are yet another example of this behavior.  As people get older, more forms of illness interfere with a healthy body, making it increasingly necessary for older patients to visit their doctor’s office.  The claims that are documented increase as people get older, and as populations develop or obtain higher percentages of these people, claims naturally increase over time.  Claims might even continue to rise in numbers or average frequencies per patient even though the population with the disease and its comorbidities is dying off.

High “risk” due to high cost pertains to rare diseases with orphan drugs or incredibly expensive medications like monoclonal antibodies and specific blood-derived medications.  This also pertains to high cost procedures, such as liver failure followed by a costly replacement, more common in men in their late 40s than in women.  Standard post-event hospitalizations, such as those due to stroke or heart attack, are other examples of risks that need to be evaluated due to cost alone, as are catastrophic events for which costs skyrocket to new maximums.  These very high cost cases are in fact the reason a LogCost value is used in this analysis.  Often companies, for statistical purposes, assign a maximum cost value such as $1M, reassigning that value to a $3M to $6.5M case.  LogCost allows you to keep the true value, adjusts it for your calculations, and diminishes the problems such high ranges and variances cause for the statistical evaluation process.  LogCost forces the data to behave in a way that enables more common methods and equations to be used, so rather than reducing the truth of your results by eliminating these outliers, you have allowed them to influence the results of the study.  The only problem remaining is determining whether these high cost outliers are standard or exceptions.  Does a population always have exceptionally high cost patients once it reaches a certain size, regardless of what the conditions responsible for these high costs may be, such as hemophilia, rare forms of anemia, solid organ or heart failure, rare intractable neurological conditions, or severe and hard to treat infectious disease states?
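
The difference between capping and log-transforming can be seen with a few numbers.  This sketch uses the $1M cap and the $3M and $6.5M cases mentioned above; the other costs, the function names, and the base-10 choice are assumptions.

```python
import math

CAP = 1_000_000  # the assigned maximum cost described above

def capped(cost, cap=CAP):
    """Replace any cost above the cap with the cap itself."""
    return min(cost, cap)

def log_cost(cost, base=10.0):
    """Keep the true cost but compress its scale."""
    return math.log(cost, base)

costs = [45, 12_000, 3_000_000, 6_500_000]
for c in costs:
    print(c, capped(c), round(log_cost(c), 2))
```

After capping, the $3M and $6.5M cases become indistinguishable; with LogCost they stay distinct and correctly ordered while spanning a far narrower range than the raw dollar values.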

One can use this logic to assign weight to each of the four types of risk noted above.  In some cases, most of the risk may be only of one or two types.  So using a multiplier for each of the metric types enables you to assign values to the positive identifications of some types of risks versus others.  We could find that population density is the major risk factor for diseases in certain ICD ranges (infectious diseases) and cost absolutely no concern, whereas in other population settings, due to the closed nature of the community being analyzed, certain high cost genetic traits may tend to prevail more than elsewhere, costing the system much more in terms of manpower, time and money.  For this reason, we can consider further differentiating this formula by adding multipliers.  The above formulas, rewritten with these multipliers (a,b,c,d) are as follows:

Log(aN x bPrev. x cClaims) x dLogCost

Log (aN x bPrev. x cClaims x dLogCost)
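
A sketch of the two weighted forms, with every multiplier defaulting to 1 so that the unweighted formulas fall out as a special case.  The function names, default base, and sample cell are assumptions.

```python
import math

def weighted_log_times_logcost(n, prev, claims, cost,
                               a=1.0, b=1.0, c=1.0, d=1.0, base=10.0):
    """Log(aN x bPrev. x cClaims) x dLogCost"""
    inner = math.log(a * n * b * prev * c * claims, base)
    return inner * d * math.log(cost, base)

def weighted_log_of_product(n, prev, claims, cost,
                            a=1.0, b=1.0, c=1.0, d=1.0, base=10.0):
    """Log(aN x bPrev. x cClaims x dLogCost)"""
    product = a * n * b * prev * c * claims * d * math.log(cost, base)
    return math.log(product, base)

cell = dict(n=1000, prev=0.1, claims=10, cost=100)  # hypothetical cell
print(round(weighted_log_times_logcost(**cell), 3))         # all weights 1
print(round(weighted_log_times_logcost(**cell, a=2.0), 3))  # N weighted double
print(round(weighted_log_of_product(**cell, d=2.0), 3))     # LogCost weighted double
```

One thing to keep in mind when choosing weights: because log(aN x bPrev. x cClaims) = log(a x b x c) + log(N x Prev. x Claims), the multipliers a, b and c inside the logarithm act as a shared additive offset, so doubling a has the same effect as doubling b or c.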

This allows you to define a weight of importance for each of these values in relation to final cost.  Another way of thinking about this is to consider exactly what activity or object is being measured or evaluated, and what certain changes in that object really represent.  For example, in some cases, N prevails over cost, claims, and prevalence, due to the stabilization of the event or activity being measured.  For immunizations, cost variance is relatively low, the required immunizations per person are pretty much stable, and the numbers of events related to these activities, such as prevalence or the number of billed visits per person, become unchanging.  In the case of immunization, one might assign a value of 1 to each (a=b=c=d=1), or decide that the differences between two areas are due to the nature of filing claims (for example, claims are skipped due to other more important events related to the visit, and so cost for billing is not assigned), thus causing you to rely more upon claims, by particular types of claims, to determine where the problem exists.

When this formula is used for two ethnically or culturally different regions, one can use this to define differences based on ethnicity.   In non-GIS approaches, matched pairing can be attempted.  In GIS approaches, nearest neighbor analyses can be employed.

In group related metrics, such as infectious diseases, since the grid cell approach is being used, population density as defined by N per cell can be weighted more for specific disease types, with the multiplier a related to degree of severity and ease of disease transfer from one individual to the next.  When using this to compare STDs, age becomes a factor.  When using it for looking at child abuse, fractures, and infectious disease, we expect only the latter to be related to high density, and wonder if the first two metrics have an inverse relationship to population density, favoring rural settings over semiurban settings.

Poverty and the Above

We can also add poverty or SES to the above evaluation method by assigning a value to a grid cell in relation to its state of income or poverty level.  This in turn is used as an additional multiplier in any of several ways, and should be evaluated in each of these ways:

Log(aN x bPrev. x cClaims) x dLogCost x Poverty Value

(Log(aN x bPrev. x cClaims) x dLogCost)  x Poverty Value

Log (aN x bPrev. x cClaims x dLogCost x Poverty Value)
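
A sketch of the poverty multiplier applied outside and inside the logarithm.  As written, the first two forms differ only in grouping and evaluate to the same number, so one function covers both.  The function names and the poverty value of 1.5 are hypothetical; how the poverty value is scaled is left open here.

```python
import math

def poverty_outside(n, prev, claims, cost, poverty,
                    a=1.0, b=1.0, c=1.0, d=1.0, base=10.0):
    """Log(aN x bPrev. x cClaims) x dLogCost x Poverty Value"""
    inner = math.log(a * n * b * prev * c * claims, base)
    return inner * d * math.log(cost, base) * poverty

def poverty_inside(n, prev, claims, cost, poverty,
                   a=1.0, b=1.0, c=1.0, d=1.0, base=10.0):
    """Log(aN x bPrev. x cClaims x dLogCost x Poverty Value)"""
    product = a * n * b * prev * c * claims * d * math.log(cost, base) * poverty
    return math.log(product, base)

cell = dict(n=1000, prev=0.1, claims=10, cost=100, poverty=1.5)  # hypothetical
print(round(poverty_outside(**cell), 3))
print(round(poverty_inside(**cell), 3))
```

Outside the log, the poverty value scales the score proportionally; inside the log, its effect is compressed along with everything else.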

How you define poverty value affects how this factor in the equation is best used.
