Note: this page and neighboring pages are from older teaching materials used for a lab on GIS and the corresponding lecture/discussion series on ‘GIS, population health surveillance, epidemiology and public health’.
Question for the day:
Does a database representing 100 million people represent the US population as a whole?
This was a question I asked myself several years ago, when the applications of public health statistics methodology were about to undergo some new tests.
At the time, there were about 310 to 315 million people in this country, depending upon whose stats you trusted, and since 100 million represented just under a third of this population, I wondered if it was enough to draw conclusions about the US population as a whole. Back in 2004 I learned that, as a rule of thumb, to produce a significant change in a population outcome you need to double the size of that population. This is because three laws, the random distribution law, the distribution-in-favor-of-a-bell-curve law, and the behave-as-people law (as I like to call them), define the real outcome we get whenever we try increasing the size of a population to get better results. It turns out that to be sure your results will improve, or become more real, you have to add enough people to negate the first law, favor the second, and then eliminate the white noise effect each of these has on the third. For the third law to hold true, and the results to be reputable for an outcome, you must have a true representation of that ‘behaving-like-people’ law; and to behave like people, your behaviors have to distribute as they would for the whole population, not according to some bell curve approximating the makeup of that whole population.
As of January 27th, 2013, according to the U.S. & World Population Clocks website (http://www.census.gov/main/www/popclock.html), there are 315,224,266 people residing in the United States.
In 2011, according to a US Census Bureau report made available on the web (http://www.census.gov/prod/2012pubs/p60-243.pdf), 84.3% of people were covered by some form of health insurance. Of these, 32.2%, or approximately 95.5 million, had some sort of governmental coverage. The remaining 193.3 million were covered by a private health insurance company in 2011.
The 100-million-patient mark in the US population is a magical figure. It represents a select percentage of the total US population, or in this case, a select percentage of that 84.3% of 315,224,266 total people, or 265,734,056 people with coverage. It also represents the data storage of some high-end Big Data industries.
By approaching the 100 million mark, we can in theory say that there is, for the first time, a large enough sample of the US population to study overall national statistics. Yet this is not being done, and the question to ask is ‘why?’
If we can for the first time analyze such a large number of people health-wise, and develop models for future health care plans, then it only makes sense that such a process should be underway. With so much concern about costs, saving money, and, in the case of business, making money, then why haven’t these big businesses jumped onto the bandwagon and developed a way to save money, worked towards helping and improving the lives of 100 million people, and even laid out plans for change based upon the predictions they can make about health care costs with the 100 million people model?
The reason big business has not yet engaged in this is simple. They don’t understand the probabilities and statistics underlying Big Data. Big Data is still being treated like regular data mathematically, and there is little to no evaluation of its statistical character. Small numbers techniques and routines are the focus of Big Data analysts, not Big Data analysis and logic.
Fortunately, this criticism is not really my own, but one already penned by a writer whose electronic article was published, symbolically, on the first of this year. His statement was simply that big business has to get out of its 20th century mindset, and welcome and make better use of the 21st century way of thinking and 21st century technology.
My take on this is that we can no longer simply report our results using tables presenting the top 10 or 20 examples of high cost. It is a tremendous waste of time to spend more than 15 months trying to develop some semi-automated way of making such reports when the original data sources are flawed, misapplied and even mismatched on the reporting teams’ spreadsheets. New methods and new logic are needed. That new method and new logic produce a way to report on all 100 million patients completely and fully for the first time (my 3D maps, of course), and to argue that these 100 million do, for the most part, represent the nation as a whole.
The questions I first asked myself when I was looking at data on 100 million people were simply as follows:
- This magic number, 100,000,000 of these 265,734,056 people covered by some form of health insurance, represents what percent of the total US population?
- Does 100,000,000 people mostly in the working class, including workers and their family members, constitute a decent sample of the US?
- How do these 100 million people stack up relative to the 266 million people out there with coverage?
- Can we use the results of 100 million people to make conclusions of the US working class population and some of the federally insured classes as a whole?
I add ‘working class’ here because we know that employed people and their families will have distinct health differences from those who are unemployed due to medical problems. For the most part, any large population making up a database will be primarily working class people, along with those non-working people who manage to be included in the programs that appear on such a large database.
I first asked this question after I analyzed 7 million people, and then 9, then 16, then 25, 37, 45 and finally 60 million people. By comparing the outcomes of each of these studies, I could see a behavior in the curves that told me we were reaching that important state defined as the total mean. There was a regression to the mean happening with my results, and after a certain point in these analyses, I found that the population pyramid being used ceased changing its form. All it did was get wider and slightly smoother once the 25 million people mark had been reached. When I began my analyses of 45 million people, and then 60 million, then 70 and finally 100 million people, I developed models defining the behaviors of the age-gender distributions I was noticing.
I then applied this same approach to researching specific disease types, and for the first time had population pyramids telling me exactly how 100 million people experience each and every disease being evaluated.
‘Now we had something’ I was thinking.
I could use this to tell me when the 100 million population health members or patients reached their peak age in life. This is the age, in one-year intervals, at which the greatest number of patients with a specific diagnosis exists. After that age, the number of people with a specific condition begins to drop, whether from the direct cause, from some comorbidity, or from any number of acute problems, accidents, mental health problems leading to suicide, unfortunate happenstance, and so on.
But I was still going through the process of trying to determine if 100 million patients could be assumed representative of the total US population. In the workplace, such a question apparently has never really been asked before.
We are trained, from the infancy of our population health statistics training, that sampling is the standard way to look at things, unless the numbers to be evaluated are small enough for a total population analysis to be done. Usually when we do some form of population health analysis, we recognize that our population represents just a sample of the total population at hand, and so we rarely think about the possibility that the set of people we are studying is itself a true example of the population at large.
Normally, to correct for some of these uncertainties and discrepancies, we adjust our figures. We normalize the population under review so it can be compared with another. To normalize, we choose a base or foundation population to compare our results with, make the percentages of each of the predefined broad age groups (such as 5 or 10 year age increments) in our test population match the percentages in the base population, and then adjust our initial scores according to that percent change.
For example, if our population has twice the percentage of people 70-74 years of age relative to the base population, we cut all the figures attached to this age group in our sample in half before comparing them with the base population. And we do this for every age group under review. Never mind that the bulk of the 70-74 year olds in our population may be only 70 and 71 years of age, whereas the base population is more equally distributed; such details usually aren’t that important in this case. A distribution of costs attached to a population of just 70 and 71 year olds is naturally going to be lower than the costs attached to a population of members of all ages between 70 and 74. For now, this logical objection has to be set aside when we use the traditional sample-and-age-grouping types of logic applied to studies of people as groups.
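In code, this adjustment, essentially a rough form of direct age standardization, can be sketched as follows. All rates and age-group shares below are invented for illustration; none come from a real dataset:

```python
# Direct age standardization: weight each age group's rate by that
# group's share of a chosen base (standard) population, so two
# populations with different age mixes can be compared fairly.
# Every number below is invented for illustration.

def age_adjust(group_rates, base_shares):
    """Return the age-adjusted rate: a weighted average of age-group
    rates, weighted by the base population's share of each group."""
    assert set(group_rates) == set(base_shares)
    return sum(group_rates[age] * base_shares[age] for age in group_rates)

# Crude rates per 1,000 members in the study population, by age group
study_rates = {"65-69": 12.0, "70-74": 20.0, "75-79": 35.0}

# Each age group's share of the base population (shares sum to 1.0)
base_shares = {"65-69": 0.5, "70-74": 0.3, "75-79": 0.2}

adjusted = age_adjust(study_rates, base_shares)
print(adjusted)  # 12*0.5 + 20*0.3 + 35*0.2 = 19.0 per 1,000
```

If the study population is twice as heavy in the 70-74 group as the base population, this weighting does exactly what the paragraph above describes: it scales that group’s contribution down before the comparison is made.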
The advantage of large population analysis is that age grouping is no longer necessary, and adjusting grouped age values is no longer a part of the routine. You are working with real numbers with this method, not theoretical results based on theoretical sampling philosophy. When you work with the entire population as a whole, you no longer have to ask the question ‘what is the real value for the population at large?’ You know the answer, because you are looking at the whole population in its real form.
However, this still does not exactly answer the question ‘does 100 million people represent the population at large?’ We are now closer to that answer in terms of the logic so far presented about using Big Data population numbers. But we still need to know if our 100 million population set can be used to make summaries about the nation’s health as a whole.
About one third of the people in this country are on some federal program like Medicaid, Medicare, etc. One third of 266 million is approximately 89 million. This doesn’t match up exactly with the 95.5 million noted above; it is off by several percent, but is in the ballpark for now. This means that my 266 million insured people include slightly more employed people, and slightly fewer federally covered people, than expected. This has to be kept in mind when thinking through the analytic steps of Big Data mining.
But back to the main question, re-worded for further clarification: 100 million out of 266 million represents 37.63% of the total insured. One has to ask: does an evaluation of 100 million represent the 266 million as a whole? Or is there significant room for error, since only a little more than a third is being considered a representative sample?
This is truly the most important question for this review. I want to know whether my 100 million people could produce outcomes that end up wrong, were I to go by the assumption that my 100 million people do represent a certain selection of the nation’s population as a whole.
There are several rules with Big Data statistics that I have seen come to fruition during my time studying large population datasets, ranging from tens of millions to more than 100 million people per review, as the statistical methodologies were being put in place.
The first rule is the regression to the mean rule. The truest result is the result of all people investigated. There is no sampling done, and no guesswork as to how to sample correctly. You simply evaluate everyone and get a result that is absolute. With regression to the mean, there are two ideologies that apply.
Ideology 1 states that each time you sample, and then redo the same sample, again and again, you will over time show the results that are most representative of the total class. This is currently a method in use by one industry trying to make sense of its large population health database. Of course, the logic here has a problem: if you are going to sample and resample a population continuously, driving your results down some long path towards the true end result for all people, perhaps you might as well just evaluate the whole population the first time through. 10,000 analyses of samplings of 100,000,000 people provide interesting results, and an interesting example of how the results trend towards the real end result or mean. But the time and space used to produce this outcome could surpass what a single review of the total population would have cost, while defining the true mean, so to speak, directly.
Ideology 2 with regression-to-the-mean logic relies on multiple samplings, each one bigger than the previous, to see where your results are taking you. In theory, as you re-do the sample each time, approaching the total population size, you will see results that ultimately head towards the true total population value. For example, you can take averages for 1%, then 2%, then 4%, then 8%, and so on up to 100%, and you will see a series of points that imply a curve of results plateauing around the true score for the total population. This method and logic are used successfully, for example, for analyzing small numbers.
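A quick way to see this plateau effect is to simulate it. The sketch below draws ever-larger random samples (1%, 2%, 4%, and so on up to 100%) from a synthetic population and watches the sample mean settle toward the true mean; the population is random toy data, not health data:

```python
import random
import statistics

random.seed(42)  # reproducible toy data
population = [random.gauss(50, 15) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# Doubling sample fractions, ending with the full population
for frac in (0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.0):
    n = int(len(population) * frac)
    sample_mean = statistics.fmean(random.sample(population, n))
    print(f"{frac:>5.0%}  n={n:>6}  mean={sample_mean:.3f}  "
          f"error={abs(sample_mean - true_mean):.3f}")
```

At 100% the ‘sample’ is the whole population, so the error is zero by construction, which is exactly the point of the regression-to-the-mean argument: the curve of results flattens onto the true value as the sample approaches everyone.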
If the numbers under review are too small to produce a reliable p result, I learned that all you had to do was double your numbers. The group averages stay the same, but the inherent problem of small numbers in equations constructed for bigger number sets goes away. This approach to small group studies is important in business settings, since businesses don’t have the luxury of saying they cannot report on something because the overall amounts are too small. This modified technique of analysis always points to the true results for your initial population.
Whereas traditionally, in academic settings, you simply accept that n is too low and so you don’t evaluate the results (the same assumption drives HEDIS and NCQA work as well), in business settings you add to your methodology the assumption that your sample perfectly represents a true cross section of the much larger population. So you double, then triple, then quadruple your values, on up to octupling your results, recalculate the p value for each of these newer, larger populations, and see where the p value goes.
In this case, using this method, there are only four possible outcomes. The p value either rises and then falls towards its true outcome, or falls continuously, meaning your p approaches significance; and for either of these two routes, the final p will end up either above or below your critical p line, which most of the time is p = 0.05, giving four outcomes in all. If the increasingly large numbers applied to the equation push your p value below that critical line, you have good results. If the results aren’t good enough, if the ratios of change simply aren’t enough, your results never go below that 0.05 mark.
I typically use this method to evaluate numbers from 2x to 8x, but found that almost always, 5x and 6x are as far as you need to go with this approach. The 7x and 8x only satisfy my desire to see the line flatten out above or below the critical value.
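As a sketch of this doubling routine, take a 2x2 outcome table, where Pearson’s chi-squared statistic has one degree of freedom and its upper-tail p-value reduces to erfc(sqrt(x/2)). The table counts below are invented; what matters is the scaling behavior, the statistic growing linearly with the multiplier while the group proportions stay fixed:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_df1(x):
    """Upper-tail p-value for chi-squared with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Invented counts: improved vs. not improved, in two small groups
table = (9, 14, 6, 24)

for k in range(1, 9):  # 1x through 8x the original counts
    scaled = tuple(k * v for v in table)
    p = p_value_df1(chi2_2x2(*scaled))
    print(f"{k}x  p = {p:.4f}")
```

Because every count scales by k, the averages never change, but the statistic grows linearly with k, so p falls steadily; with these invented counts it is above 0.05 at 1x, below it by 2x, and clearly flattened out well before 8x.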
This methodology makes up for the excuse that ‘we don’t have enough people to tell.’ If your population has only 23 asthmatics, and you need to know how those 23 people performed, this method can be employed for chi-squared results testing. If you have only 30 responses to a survey, and are relying upon Mann-Whitney testing, the result will be reliable for that particular population.
With Big Data and the 100 million people data, the opposite logic is at play. You have the whole population, and your values represent true results for that large N. If your n’s are still very small, you can double, triple and quadruple them to see what happens. But this approach eliminates the need for concerns about whether your final results are truly representative of the total N=100M or not. We still have to make the argument, however, for whether 100M produces results regressed to the mean enough to represent the total 266M people.
The second rule is that equally distributed probability is not the rule for analyzing people. People have a pseudo-bell curve that defines the likelihood of being alive at any given age; there is not an equal chance of being alive across all ages in one-year increments. That means there is a higher tendency for people to be in that highest-value set on the population curve, for any given population, and this in turn transfers over somewhat to the likelihood that your sampling of people (the 100 million) is very much going to look like that curve for the whole population.
The third rule is likelihood of fit: what is the likelihood that the large population you have is going to represent the shape and form of the massive bell curve depicting everyone?
The large bell curve limits what you can and cannot have as sampling results. If the whole population has 3 million people at age 30, then your sample’s limit is 3 million for that age. Now apply this to every age on the curve, and what you see is that the likelihood of fluctuating your curve so much that a new peak or bell rises somewhere else within the total curve (meaning underneath it and totally encompassed by it) becomes nil once your sample’s curve becomes too big, too cumbersome, to fit anywhere other than beneath the total bell curve.
The fourth rule is what I call the 16.666666 percent rule, or the 1-in-6 rule. It states that once your sample is at least 1/6th of the total population, it becomes hard to produce a curve with your sample that doesn’t mimic the original curve, or a set of values that differ from the total outcome in a statistically significant way. In other words, even though your averages may be different with a 16.666666% sampling, the likelihood that these differences will be statistically significant is very low.
The fifth rule, used by all survey programs, is that to better your population size, when n is the number surveyed or evaluated, you have to double n to see whether there is going to be a swing in the results large enough to demonstrate statistically significant differences.
This means that if you have 130,000 people, and you want to make your survey look more credible by adding another 20,000 to the list, you are only going to increase your sample by 2/13, or about 15%. This 15% addition is more than likely going to have the same distribution as the first sample, unless you deliberately pick only outliers from the final population surveyed. Meaning it will barely, if at all, change the outcomes of your survey, other than by changing an average by 1 or 2% of the total score. For example, a 1-5 survey result with a total average of 3.5 may change to 3.51, or even 3.60 if you did pull from outliers, but this will not change the statistical significance of the total survey results you have, so that increase in score is not even of any use.
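This dilution effect is easy to demonstrate with synthetic scores. In the sketch below, both the original 130,000 responses and the added 20,000 are drawn from the same invented 1-to-5 distribution, mimicking the usual case where the newcomers resemble the existing base:

```python
import random
import statistics

random.seed(7)  # reproducible toy data
scores = [1, 2, 3, 4, 5]
weights = [5, 10, 30, 35, 20]  # invented response distribution

original = random.choices(scores, weights, k=130_000)
added = random.choices(scores, weights, k=20_000)

before = statistics.fmean(original)
after = statistics.fmean(original + added)
print(f"before: {before:.3f}  after: {after:.3f}  "
      f"shift: {abs(after - before):.4f}")
```

Unless the added 15% is deliberately drawn from outliers, the mean barely moves, typically at the second or third decimal place, just as described above.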
So, what I am saying here is that in order to improve your likelihood of being correct at the 16% mark, you have to double your sample size, or make the 100 million people evaluated twice as big, in order to be sure your results are accurate and more like the total population’s. In the survey world, I learned as a data entry person and analyst that when you have a significant baseline number, the first 90% of the final number of surveys entered allows for little to no change in your results for much of the rest of the data entry period. There is this base or foundation of numbers that doesn’t budge; only if you double their number will maybe a few outliers show up and the final averages change, and even then it is the final averages that change, not the statistical significance or p values.
Now, all of the logic I have presented is for my 1-in-6 rule, or a 16.6666666% sample size.
But in the case above, that 100 million people is not 1/6th of the population. We are working at 37.63% of the insured population sampled, more than twice the 16.66666% mark.
This means that if we go from more than one third of the total population to two thirds of it, in terms of the curves produced of the total results, we are likely to see some changes, although very minuscule ones, since the differences in the shape and form of the two curves are more than likely going to be insignificant in themselves. Even kurtosis and skewness don’t have far enough to go to throw off your results when sample populations are this big. There is no room for outliers (those parts of a bell at the high and low ends), at least not enough to modify your final results except at the thousandth decimal level. So there is no reason to try to double your population size once it reaches 100 million, or perhaps even 50 million.
A population of 100 million, for now, represents the US as a whole, at least for that part of the working class versus non-working class that you are dealing with. You cannot, of course, compare 100 million employees to 100 million people on federally funded health insurance programs; in such cases, there are significant cultural differences.
But a study of 100 million people does say something about the US population at large. The only problem is, very few places store this much information about this many people. So for now, this theoretical concept remains simply that, a theory, until 100 million of each can for the first time be tested and compared, and then used to define with certainty the relationship between the employed-and-insured and the federally insured populations in this country.