Everything from CalcStar and CP/M systems, to UNIX and DOS SAS, to Win32 SPSS, to the current SAS, S+, Stata, Excel, Access, GIS, Crystal Reports, NUDIST, Atlas/ti, Ethnograph, and so on.   Why learn these?

Considering the way corporations and businesses manage GIS, and the technological potential that exists for its implementation, suffice it to say that more than 90% of industries sit at the peak of the adoption curve or below it, a little behind this rather idealistic bell curve.  That is a shame in a business environment where being ahead is purportedly what every company wants the bragging rights to.  In terms of spatial knowledge, skills, and techniques, a typical way of classifying companies is as follows:

  • Status Quo, behind the curve: no idea of the full range of applications of GIS, except perhaps as an additional way to look at data.
  • Status Quo, with the curve: some idea of what GIS is, but either fails to make any use of it or contracts the work out, handing its data to another company in order to get back a report on its services, sales, etc.
  • Early Followers: make use of GIS in some way, but treat it as a low priority requiring only cheap hires.
  • Supporters: make good use of GIS at an intermediate level, and even know a little about spatial statistics and their applications.
  • Innovators: produce new uses for GIS and continue to improve productivity as a result.  The most innovative keep growing and improving; those lagging slightly behind usually need to raise their basic technological and manpower knowledge level.

This unfortunately means that if you are a GIS student coming out of college, even out of an undergraduate program, chances are you are better trained in this technology and know more about its potential uses than the typical CEO, director, VP, or manager.  I say trained because any mid-level leader in a business, on up, likes to claim he or she knows something about this technology.  But knowing and doing are two different things.

It was the 1980s when some of these demands for innovation and new points of view first arose, and even today we occasionally hear about mid- to upper-level people suddenly discovering approaches to data that are 20 to 25 years old.  We then have to listen to those leading our companies claim that this is part of their next plan, if only they knew how to engage themselves and execute it, much less their coworkers or the outside experts who actually implement this kind of change.

Success in making such a change is measurable if the change is made in 12 months or less; the closer the implementation comes to four months, the better.  The only way to manage a change in four months is to have workers who already know the field in which the change is being implemented.  A 2010s-style change is not possible when its leaders' knowledge base is a decade or two older.

In the SAS environment, people like to feel they are ahead whenever they take on some simple new technology.  We have been able to use SAS/GIS for years, for example, and companies are now emerging that try to make it part of their main theme, using it to display counts and sums rather than probabilities and predictions with any credibility or likelihood of success; they have little chance of producing change through this kind of information mapping.  The idea is there, but the skilled researchers who can analyze whatever is produced are not.

It was for this reason, in fact, that I developed an entire survey process to evaluate the level of GIS knowledge that exists in job settings.  In general, the organizations I have personally seen in the health world outside of federal and state/county offices are so far behind in knowing these applications of GIS that it appears GIS will remain experimental until the current ownership and leadership, two or three generations of it, retire or die off.  To put it bluntly, we need more knowledgeable leaders, and even more knowledgeable recruitment officers.

The bell curve above exists for industries because this statistical relationship exists.  If every business were as far ahead as it likes to think it is, then being ahead would be the status quo, because every business that claims to be ahead "knows" it is ahead.  But the truth is that only a very few businesses are ahead, and when it comes to GIS implementation in the workplace, these are the companies that right now make regular use of GIS for

  • standard reporting, not just special reports for special clients,
  • designing next year's programs and making changes, and
  • developing more and better ways to use the large amounts of data available to them, rather than still trying to figure out how to use that data for the first time.

Having such experience takes time, and unfortunately I have yet to see an industry participant in my survey show much knowledge of what GIS can do and how it does it, much less a survey taker who works in a setting that knows what GIS is and what it can do.  (Take my survey and prove me wrong, if this is not the case.)  This places most industries with access to Big Data, as service providers rather than information or data content providers, to the left of the peak in the bell curve above.  It means only a few companies out there are able to understand the skills and the meaning of the outcomes produced by a true spatial analyst employing GIS, and are able to improve on that reporting process either by making the best use of manpower from within or by knowing and accepting the requirements for such experts (being humble enough to recognize they don't know everything).

When I returned to graduate school in the mid-1990s, my philosophy was to learn one new piece of software, tool, or analytic method from the business world for each quarter or semester I was enrolled in classes.  Exploring the different analytic opportunities out there puts you ahead of the competition and at times works in your favor when it comes to making discoveries.  It is for this reason, in fact, that I have the following four innovations worth mentioning:

1.  The several pages I have devoted to my hexagonal grid mapping method together receive about 1,000 hits per month (including indirect visits), representing one-fifth to one-sixth of my visitors (assuming 5,000-6,000 total per month as of 11-1-2012, a deliberate underestimate I might add).  Approximately 7.5% to 10% of these visitors download the Excel sheet I originally developed for producing these maps.  (I hope to put the remaining SQL and SAS versions out soon.)  These numbers have gone up over time.

In the spring months, in preparation for the end of the academic year, as many as 50% of my methodology-page visitors download the sheet in a given day.  On a good GIS conference day, with world or regional medical GIS as the topic, my visits go up into the hundreds over a one- to two-hour lecture period, with a peak building just beforehand and tapering off through the next day.  (I assume it is a conference because the updated stats program lets me see hourly activity, so I can look around the internet and determine who or what is out there discussing my site.)  The largest count at a single moment to date, in September 2012, was about 780 viewers reviewing one of my pages (not this GIS page) for about 20 minutes.

2.  My population pyramid method uses 1-year age increments to analyze large populations of more than 1 million (possibly more than 250,000 to 500,000, depending on the age-gender distribution).  It provides more information than the standard 5- or 10-year age increment methods already in use, at least in a statistical sense.  With this method, preventive care intervention programs can be made more economical by improving the definition of the target population.  It tells you at exactly what age a critical threshold of risk begins to emerge in a population: the age at which a child has the fewest doctors' visits, the peak age for epilepsy prevalence (ca. 14 years old) versus the peak in raw numbers of patients (around 30), the peak-age difference between male smokers (17) and female smokers (45), the age at which screening for aging-related ICDs requires much more aggressive evaluation, the significant gender differences that exist in certain physical and psychiatric disease types, and the parallels between some American ICDs and culture-bound syndromes whose ICDs bear similar age-gender prevalence rates (examples of all of these are covered on other pages).  There is also a statistical significance formula I wrote that can be applied here, which I use to measure differences between two outcomes at 1-year intervals using moving windows, a "portrait-like" or "facial-analysis-like" routine I developed after talking it over with a cousin who works for the Feds, around 2002 I think.
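
A minimal sketch of the 1-year-increment, moving-window idea in generic Python follows; the toy counts, the 5-year window, and the function names are illustrative assumptions, not the original formula.

    # Sketch: compare two age-specific rate curves at 1-year increments
    # using a centered moving window. Toy data only.
    import numpy as np

    def one_year_rates(counts, population):
        """Crude rates per 1,000 at 1-year age increments (ages 0..len-1)."""
        counts = np.asarray(counts, dtype=float)
        population = np.asarray(population, dtype=float)
        return np.where(population > 0, counts / population * 1000.0, np.nan)

    def moving_window_difference(rates_a, rates_b, window=5):
        """Smooth both curves with a moving window, then difference them."""
        kernel = np.ones(window) / window
        smooth_a = np.convolve(rates_a, kernel, mode="same")
        smooth_b = np.convolve(rates_b, kernel, mode="same")
        return smooth_a - smooth_b

    # Toy example: male vs. female rates for ages 0-84.
    rng = np.random.default_rng(0)
    ages = np.arange(85)
    pop = np.full(85, 10_000)
    male_cases = rng.poisson(5 + 0.2 * ages)
    female_cases = rng.poisson(5 + 0.1 * ages)

    diff = moving_window_difference(one_year_rates(male_cases, pop),
                                    one_year_rates(female_cases, pop))
    print("Age with largest male-female gap:", ages[np.nanargmax(np.abs(diff))])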

3.  My national grid mapping technique for monitoring and displaying health and disease (25 mi x 25 mi cells) eliminates some of the errors I encountered in the past with block, block group, and zip code analyses.  It corrects for the small differences and errors often caused by common math problems, such as the low counts and zeroes typical of census block and block group data and of very small area analyses in general.  This methodology provides a more accurate way to map human health behaviors.  The standard grid approach also enables more accurate 3D images and contour maps of population health to be produced, which was the primary goal of this project.
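
A rough sketch of the gridding step appears below; the simple latitude/longitude-to-miles conversion and the sample points are assumptions for illustration, not the production method.

    # Sketch: bin point records into roughly 25 mi x 25 mi cells and
    # compute a rate per cell. Toy records only.
    import math
    from collections import defaultdict

    CELL_MILES = 25.0
    MILES_PER_DEG_LAT = 69.0  # approximate

    def cell_id(lat, lon):
        """Assign a (row, col) cell index on an approximate 25-mile grid."""
        miles_per_deg_lon = 69.0 * math.cos(math.radians(lat))
        row = int(lat * MILES_PER_DEG_LAT // CELL_MILES)
        col = int(lon * miles_per_deg_lon // CELL_MILES)
        return row, col

    def grid_rates(records):
        """records: iterable of (lat, lon, is_case). Returns cases per 1,000 by cell."""
        cases = defaultdict(int)
        totals = defaultdict(int)
        for lat, lon, is_case in records:
            key = cell_id(lat, lon)
            totals[key] += 1
            cases[key] += int(is_case)
        # Aggregating to 25-mile cells sidesteps the zero/low-count problem
        # of block-level rates, at the cost of spatial detail.
        return {k: 1000.0 * cases[k] / totals[k] for k in totals}

    print(grid_rates([(45.50, -122.60, True),
                      (45.51, -122.61, False),
                      (44.00, -121.30, True)]))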

4.  The GridEcon Matrix [GEM] is the natural follow-up to my national grid mapping project [NGM].  Like the NGM, this way of visualizing how to make better use of data came to me quite a while back, but it really materialized once I began working with large-area population health data pulled from the Perot and national prescription data systems, and from some federally sponsored ftp and UNIX sites made available to early GIS users (prior to 2008/9).   The purpose of GEM is to use the NGM to develop an automated (or mostly automated) reporting mechanism.  The report was meant to include the standard HEDIS/NCQA metrics, add topics in need of evaluation based on continuing medical education program content and indicators of the hot topics out there, and add the special topics and areas of focus that are repeatedly excluded from most program evaluations.  These special topics focus on groups routinely excluded from health care systems analyses: low-SES members/patients and those with special cultural-ethnic-heritage backgrounds, groups with high or special medical or pharmacal needs not being met, specific age ranges reviewed as a whole for all of their risks (e.g., children under 3 for non-compliance, abuse, malnutrition, etc.; teens for drugs/smoking, pregnancy, abuse, behavioral dx; elderly populations for hip/femur fxs, nutrition deficits, Alzheimer's, clinical performance, age-related risk factors or ICDs, etc.), and groups that define the high-cost/high-risk categories.

So what does this mean in terms of learning about software before finding a position in the workplace?

First, it means you have to find out about the technology yourself; don't expect much new technology to be shared with you or taught to you in the workplace.  If a company is ahead, it will probably keep the best to itself.  If it isn't, it either doesn't know or is unwilling to admit to this stumbling block.  So you have to learn the technology on your own, and then keep up with it.

Next, SAS has to be a goal of undergraduate and early graduate schooling for anyone headed into the statistical world.  But there are other technologies coming along to be aware of, like the new national databases being developed to represent the US population as a whole, and you need to know how to relate the local or specialized data of whatever company you work for to these national standards.  In other words, know how to compare the small data with the big data.
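
One minimal way to make that small-versus-big comparison is indirect standardization against national age-specific rates; the rates and counts in this sketch are invented for illustration.

    # Illustration: compare a local population against national age-specific
    # rates via a standardized ratio (observed / expected). Numbers invented.
    national_rate_per_1000 = {"0-17": 2.0, "18-44": 5.0, "45-64": 12.0, "65+": 30.0}
    local_population       = {"0-17": 4000, "18-44": 9000, "45-64": 6000, "65+": 2500}
    local_observed_cases = 260  # made-up count

    expected = sum(local_population[a] * national_rate_per_1000[a] / 1000.0
                   for a in national_rate_per_1000)
    ratio = local_observed_cases / expected
    print(f"Expected cases from national rates: {expected:.1f}")
    print(f"Standardized ratio (local vs. national): {ratio:.2f}")  # >1 means above benchmark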

Third, you must be able to go into and understand the newer statistical methods out there, assume for a moment that the older methods are too archaic and outdated (since they are), and accept that you should not be using them to produce a good report.

Fourth, accept the fact that corporate workers have reached their limit in relying on left-brained ideologies to manage something that only right-brained research methodologies can make full use of.  People like to categorize and tabulate results, analyzing everything only to present it as a top 10 or 20 in a short table, because there is too much to report.  They are thinking non-creatively, relying on the standard methods they grew up with, without realizing there are more wholesome ways to present outcomes and more effective ways to determine everything that can be done to make the best use of the data.  Students who are familiar with this kind of creative thinking perhaps need to take their place.

Fifth, GIS statistical methods have to be developed and implemented.  This is a developing world economy, not one in which a company can assume its data is the best, make decisions locally about what changes to make, and safely ignore the data produced by competitors working the exact same people.  A quarter of the population you are serving this year could switch next year, because neither of you is really top notch at what you do.

Fortunately, students are more open-minded and able to commit to whatever changes are needed to become creative.  Unfortunately, corporate workers are for the most part unwilling to accept that they are behind.  This means that the theoretical methods one learns as a statistician, such as the use of NUDIST, Atlas/ti, word recognition software, Ethnograph, and so on, are not tools the field is going to know about.  Yet these tools, like GIS, have potential applications for the current businesses that like to rely on survey results to make their decisions.

A Note about Sources of Good Information to Play with Using GIS

Like many of my projects, this work has old data-new data problems due to the age of the statistics.  In GIS, retro-analyses are typical for health care.  In Oregon and many other states, the data I have used comes from a variety of sources, of which there were plenty in 2006.   For my midwest contract work, daily and weekly databases of claims, services, labs/dx tests, and office visit data could be evaluated to develop a semi-automated download-and-processing tool that linked the data and produced outcomes for a review of slightly more than 300 ICDs determined to be age-range-specific public health indicators for this population (teens were the hardest to define metrics for; see the QA methodology pages for more on this experience).

For now, the following is part of the access information handed out to my students, pulled from one of my older zip disks.  I have updated some of the major links, since the older SIERRA and SAGE sites are apparently no longer available.  The partially secured regional ftp sites are still operational, but with stricter guidelines.  Each person has his or her "special places," and the best of these are typically found on the pages with the longest lists.  You'd be surprised what you can find, but don't be surprised if you also learn at some later date that your activities are being monitored.  (The warnings are clearly posted.)

********************************************************

EXAMPLES of ftp sites (updated 11-12)

A.  Google search: "USGS ftp sites"

B.  The same search in text form (full URL): http://www.google.com/search?q=USGS+ftp+sites&sourceid=ie7&rls=com.microsoft:en-us:IE-SearchBox&ie=&oe=

C.  Current, popular http sites leading you to ftps:

This last site provides details about getting to the ftp sites.  It is important to note that not all of these sites are accessible; some may have open-door settings, with policies posted there for you to read before clicking in that direction (even if you can access a site, you may still not be supposed to be there), and some or many of these sites monitor everything you do while you are on them.  Some of these ftps are specifically related to medical economics, some belong to business-medical data providers (i.e., zip codes and health/claims data), and some are privately secured, per-request sites targeting just one person, the researcher; these are for grant-sponsored research, professional academic inquiries made personally with special insiders, and the like.

There are dozens, if not 100 to 200, of these sites and links.  Examples of the more directly related GIS-medical-business ftps follow.

D. Anonymous FTP Sites In Domain GOV

E.  The following is an example of subfolders for downloads

Final Note:  With iCloud coming into use, all of the above should be more readily available and accessible, assuming you are allowed to access the data.

********************************************************

Aside from the standard governmental ftps, CDC/NIH datasets, US Census data, and purchased zip code health care utilization information, there are some national or regional Medicaid/Medicare sets occasionally out there.

I have seen evidence of black-market data at the military academy level as well, and was taken in by one such source five or so years ago, run out of a training facility in the midwest.  These pages are run by students, are short-lived, and are apparently an attempt to make money; they may or may not provide you with legal, scrubbed data.  The most common path to them that I found was through eBay, by requesting GIS programs and data.  There is real data here, but it may not be legal.  I am not sure whether any agency has effectively prevented this from continuing.

Most of the good information exists at ftp sites and the best is offered by private sources we link up with in these settings.  With iCloud developing, a number of sites are trying to provide data for public access and download, usually aggregated considerably, but still valuable.

A national pharmacy network dataset (the last time I accessed it was about 10 years ago) and the Perot systems (now no longer around), both of which were available through work and grant-funded research links or https/ftp sites, offered access to the national data in bulk, aggregated, scrubbed form.  These are, or were, accessible from teaching institutions with security clearance for these data sources.  There are, or were, also paid-access sites (one-time access or membership) with scrubbed medical data, and paid provider sites with data scrubbed on a per-need basis are very much the way to go now for obtaining useful datasets; for the latter you often had to have a direct connection with a staff worker and provide an institutional grant number.

This field of IT business is very much in its infancy, and to date it is not well utilized by health care clinical and epidemiological researchers.  Researchers and institutional (university) workers with a long history of operating in these systems know best how to find you information.  These routes to information are not well known by business leaders, and so are not yet fully put to use.  Once the ICD-9 to ICD-10 transition is complete, this data will be more reliable, probably about two to three years from now (the ICD-10 conversion and its utilization are due this coming year!).

A Note about Methodology

I use the ICD [ICD-9] mapping technique to predict and model disease and human behavior because of its reliability.  ICD-9s are usually easier to work with, or to relearn each time you come back to a project.  With older ICD-9 data I have measured several dozen forms of human health behavior, such as drug misuse/abuse, non-compliance and smoking, disease prevalence rates, behavioral and psychiatric socioeconomic indicators, suicide rates and differences, environmental exposure risk, ED visit types, regional cost analyses, institutional and CAHPS-like survey results, regional health risk or well-being scores, foreign disease penetration patterns, and even terrorist activities, all at a small-area regional level.   This methodology has no limit on the grid cell size that can be applied, but several point-area statistical tests, such as Getis-Ord G*, Moran's I, and Geary's C, should be performed to identify the best small cell size (see the ArcView/ArcGIS extension notes on this at ESRI, to be added here later; for now, see slide 47 in http://bber.unm.edu/presentations/GISApplicationDemoEst.pdf).
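
As a minimal, generic sketch (not the ESRI implementation), global Moran's I for a gridded rate surface can be computed directly; the rook-adjacency weights and the toy grid below are assumptions for illustration.

    # Sketch: global Moran's I on a 2D grid of cell values, using rook
    # (edge-sharing) neighbors as the spatial weights. Toy data only.
    import numpy as np

    def morans_i(grid):
        """Global Moran's I for a 2D array with rook contiguity weights."""
        x = np.asarray(grid, dtype=float)
        dev = x - x.mean()
        n = x.size

        num = 0.0    # sum over neighbor pairs of (x_i - mean)(x_j - mean)
        w_sum = 0.0  # total weight (each link counted once per direction)
        rows, cols = x.shape
        for i in range(rows):
            for j in range(cols):
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        num += dev[i, j] * dev[ni, nj]
                        w_sum += 1.0

        return (n / w_sum) * (num / (dev ** 2).sum())

    # Clustered toy surface: high rates in one corner -> positive autocorrelation.
    grid = np.array([[9, 8, 1, 1],
                     [8, 7, 1, 0],
                     [1, 1, 0, 0],
                     [1, 0, 0, 0]], dtype=float)
    print("Moran's I:", round(morans_i(grid), 3))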

When I engage in my medical GIS research, I like to ask:

  • Which method(s) is/are best?
  • Which of these is/are best to learn?

During my 17 years of teaching at Portland State, at least one student per class per year asked me these questions.  They came up because I was developing a process for evaluating plant chemical distributions across nearly 30,000 species of plants, looking for specific clusters of chemical groups that could be defined in terms of the primary and secondary metabolic processes plants use to form these chemicals, and asking whether such a model could be used to predict other sources of the same end structures.  I tried to explain to my students that the answer was yes, and used tables to demonstrate where clusters of economically important chemicals could be found in the plant kingdom.

Unfortunately, this still left me with no good answer to their questions.  At the time, SPSS was the mainstay for job-seekers at the university, and even at the university hospital level.  SAS now seems to be the standard for most complex science-related methods, including several kinds of engineering positions and medical, public health, and population research positions.  Yet I never used SAS to evaluate my plant chemistry work, and some of my students wanted to know why.

I told these students that it is very important to be able to understand your formulas, not just write a basic program that implements them and shows the results.  A person who lives and works with his or her data, step by step, at times breaking it down into the most numerous and infinitesimal subgroups, puts this information back together with a more complete understanding of the picture than the individual focused on just the tail of the elephant, that large, detailed dataset amassed to develop a dissertation or thesis.

Since Excel was the common program most students already had to learn, we usually used Excel to perform most activities in the classroom setting.  This let us write the formulas and understand them and their limitations.  It also taught us how to avoid certain common problems Excel projects often produce, mostly due to missing data and empty cells.  In general, I told them, Excel add-ins are not employed by professional statisticians unless the statistician really knows the methods being employed, their strengths, their weaknesses, how they can be cross-checked, and so on.  Yet the hands-on approach certain methods require does more for the statistician than the ability to simply rerun old programs, with limited ability to modify or update them.
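
A small illustration of the empty-cell problem, as a generic sketch not tied to any particular spreadsheet: a blank treated as zero silently shifts the statistics, while a blank treated as missing does not.

    # Illustration: a blank cell treated as zero biases the mean, while
    # treating it as missing (NaN) and skipping it does not.
    import numpy as np

    raw_column = [12.0, None, 15.0, 9.0, None, 14.0]  # None marks empty cells

    as_zero = np.array([v if v is not None else 0.0 for v in raw_column])
    as_missing = np.array([v if v is not None else np.nan for v in raw_column])

    print("Mean with blanks as zero:   ", as_zero.mean())          # pulled downward
    print("Mean with blanks as missing:", np.nanmean(as_missing))  # uses 4 real values
    print("Real observations used:     ", np.count_nonzero(~np.isnan(as_missing)))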

In general, I have found that the students in my classes (assuming they can type in numbers and formulas and do the statistics, at least at a beginner's level) are of two types: those who can do the brunt of the work as a statistician, come up with some results, replay the numbers, and so on, and those who can really run the more sophisticated programs fairly well, if not at some expert level.

It is difficult to quantify what this means to headhunters, or to tell them how much work a statistician actually does or is capable of doing.  A good statistician can perhaps be measured by the number of projects he or she has completed, by the mixture of project types and analytic methods used over time, or by the various ways statisticians like to engage in these projects, that is, the types of programs they use to accomplish the task.  But the best statisticians make discoveries with their work.  This means the statistical software put to use serves as an identifier and quantifier, but not necessarily as the only tool by which discoveries can be made.

In general, I have demonstrated to all students curious about the various things I play with numerically that there are three stages one has to go through in order to learn, do, and finally excel at biostatistical work.  First, you learn the basics (Excel, add-ins, StatPlus, Access database management, and the like).  Then you learn some sophisticated programs (SPSS, S+, Stata, and even SAS).  Finally, once you know the programs and no longer want to live with the limitations software forces you to comply with, you develop your own methods and program or equation structures to produce the needed outcomes in less time and with significantly less "handpower" on your behalf.

SAS pretty much enables us to perform detailed analyses in a cut-and-paste fashion, using standardized formula-writing procedures that reduce the need to pay close attention to each step as the analysis progresses.  My first PC back in 1982 (a Sanyo MBC500) had CalcStar installed, a program that required much the same understanding of program writing, expression wording, and manipulation in order to get a task done.  Since then, programs like dBase and other programmable database development tools have enabled much the same types of activities, although with different programming backgrounds and requirements.

I currently favor being able to take the same basic material (data sets) and produce a way to automatically calculate all of the needed measurements in a single run, in just a few seconds.  The reason for needing a speedy program and methodology is the number of calculations required by some of the more recent analyses I have engaged in.    Relatively speaking, a typical report contains just a few data points in its final form.     The reliability and validity of such outcomes is questionable in medical statistics due to the human nature of these types of studies: people don't follow rules, and so the results they produce in studies provide ample opportunity for Type 1 and Type 2 errors to creep in.

For Excel, this means one has to produce working spreadsheets with automatic calculators.  Macros and the like help, but with a good set of spreadsheets they are not necessary.  For SPSS the same applies, but with the emphasis on macros instead.  Either of these tools produces a functional work environment for a good statistician.  For a non-profit with a low annual income, the low cost of Excel defines a great statistician as someone who can automate all of the required work.  For the university setting, software knowledge and background across all brands and products defines a good researcher (not necessarily the best one).  SAS makes sense for work with exceptionally large datasets, where time and the number of rows become the defining factors for productivity.

Adding GIS to this research environment increases the complexity of the work, but not necessarily of the process.  The GIS requirement for true research is the development of new databases and shapefiles particular to one's work.  If a GIS is being employed only for display, or makes absolutely no use of new databases where they are required, then the cheapest GIS is perhaps the best way to go.  GIS should be employed for research and exploration, not simply as a replacement for a drawing or drafting tool.  In cases where GIS is lacking, both Excel and SAS provide research environments where map substitutes can be produced.  For Excel, the background map is simply a figure added to a spreadsheet, over which specific graphs and the like may be placed.  (Excel does have mapping tools and add-ons, but they lack topological information.)  In SAS and several other standard software packages, a simple 2D or 3D mapping method can be employed, assuming one has the required areal information down to the 6th or 7th decimal digit or better.
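
A minimal sketch of the map-substitute idea in a generic scripting environment (not Excel or SAS themselves): a background raster georeferenced by its corner coordinates, with point data placed over it by longitude and latitude.  The blank background, its corner coordinates, and the point values are stand-ins for illustration.

    # Sketch of a "map substitute": background image behind the axes,
    # point data scattered over it by lon/lat. Values are stand-ins.
    import numpy as np
    import matplotlib.pyplot as plt

    # In practice the background would be a map image (e.g. loaded with
    # matplotlib.image.imread from a hypothetical "state_outline.png");
    # a blank raster stands in here so the sketch runs as-is.
    background = np.ones((100, 100))
    extent = (-124.6, -116.5, 41.9, 46.3)  # lon_min, lon_max, lat_min, lat_max

    lons = [-122.676481, -123.035096, -121.315308]  # 6-7 decimal places, as noted above
    lats = [45.523062, 44.942898, 44.058173]
    rates = [8.2, 12.5, 5.1]                        # made-up rates per 1,000

    fig, ax = plt.subplots()
    ax.imshow(background, extent=extent, aspect="auto", cmap="gray", vmin=0, vmax=1)
    points = ax.scatter(lons, lats, c=rates, s=80)
    fig.colorbar(points, ax=ax, label="rate per 1,000")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    plt.show()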

For this reason, I like my projects to engage in their numerous calculations automatically as the data is entered, so that the analysis is essentially complete as soon as the last number in the data list is typed in.  I don't have to click Run and watch the program try to make its way through the dataset, only to come up with either an error or a final summary number (correct, we hope).  I only have to enter the data and then look at a summary report to see all of the outcomes I wish to see; the hundreds of other calculations for a typical 20+ question review are located elsewhere in my reports for my own review.
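
A rough sketch of that single-run idea (the function and field names here are hypothetical): one routine accepts the raw records and returns every summary measure at once, so nothing has to be rerun step by step.

    # Hypothetical single-pass summary: feed in the records, get back every
    # measure at once instead of running each calculation by hand.
    import statistics

    def summary_report(records):
        """records: list of dicts with 'age', 'score', and 'screened' fields (assumed)."""
        ages = [r["age"] for r in records]
        scores = [r["score"] for r in records]
        screened = sum(1 for r in records if r["screened"])
        return {
            "n": len(records),
            "mean_age": statistics.mean(ages),
            "mean_score": statistics.mean(scores),
            "score_sd": statistics.stdev(scores) if len(scores) > 1 else 0.0,
            "pct_screened": 100.0 * screened / len(records),
        }

    records = [
        {"age": 54, "score": 7.2, "screened": True},
        {"age": 61, "score": 8.1, "screened": False},
        {"age": 47, "score": 6.4, "screened": True},
    ]
    print(summary_report(records))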

For PIPs and QIAs, researchers like to employ the standard 2×2 Chi-Squared, Pre-Post or Year 1-Year 2, format to perform their calculations.  At times the results even turn out the way Chi-Squared techniques say they should.  Other researchers working on the same topics might plan their study around some form of ANOVA or a t-test routine; these too are helpful and, as experienced statisticians know, sometimes counter the Chi-Squared findings.  A good rule of thumb is to stick with institutional standards but develop your own secondary methods of analysis to determine if and when significant outcomes are in fact there, even when the primary statistical technique suggests otherwise.  Chi-Squared methods are limited by the size of the values; the test works best with 40 or more total observations and continues to work quite well as the numbers increase, even into the tens of thousands.  The problem is that some statisticians don't like to use this method for large numbers, since we are taught it is best used for smaller groups.
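
A minimal sketch of the standard 2×2 pre-post layout, with invented counts: rows are Year 1 and Year 2, columns are met and not met.

    # Toy 2x2 pre-post (Year 1 vs. Year 2) chi-squared test; counts invented.
    from scipy.stats import chi2_contingency

    #            met   not met
    table = [[220, 780],   # Year 1
             [265, 735]]   # Year 2

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")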

As a statistician, I can tell you that rerunning a chi-squared test on the same averages and the exact same groups of outcomes, but for a larger population, will provide a more reliable answer.  A large number of insignificant chi-squared outcomes for small groups are guilty of Type 2 error [missing the fact that an effect or change has taken place when it actually has].  To overcome this problem, use multiple methods of measurement and make it a rule to produce two or more comparable, measurable results whenever performing a measure, even if the rule is to report only one of these outcomes.
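
A quick way to see this sample-size effect (proportions invented): the same 52% versus 60% split is non-significant at 50 per group but clearly significant at 1,000 per group.

    # Same proportions, different sample sizes: the small table risks a Type 2
    # error (no detected effect), while the scaled-up table detects the change.
    from scipy.stats import chi2_contingency

    def test(table, label):
        chi2, p, _, _ = chi2_contingency(table)
        print(f"{label}: chi2 = {chi2:.2f}, p = {p:.4f}")

    small = [[26, 24],    # Year 1: 52% met, n = 50
             [30, 20]]    # Year 2: 60% met, n = 50
    large = [[520, 480],  # Year 1: 52% met, n = 1000
             [600, 400]]  # Year 2: 60% met, n = 1000

    test(small, "n =   50 per group")
    test(large, "n = 1000 per group")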

In a recent project on diabetes PCP visit and care activities, HbA1c screening percentages (the percentage of the population screened according to guidelines) were compared with the internal institutional average.    Yearly repeats of this measurement showed that HbA1c screening was always below the average (or pre-set goal) in terms of Chi-Square analytic techniques, and yet as a Year 1 to Year 2 process it always demonstrated a certain amount of change, which was often significant.   In other words, the outcome can be statistically significant at the patient level, in terms of the number of people screened in a timely fashion, even though the average value of their outcomes never reaches the goal or population average.  Average values may in fact have no say in the final outcome of this type of project.  It can be a success statistically, in terms of the amount of change made at some non-parametric population level, even though the qualitatively measured averages never reach the population mean or institutional goal they are compared against.

In such a case, heteroscedasticity may be the best way to add another indicator of the "tightness" of the outcomes.  We want outcomes not only to show improvement, but also to show movement toward a given value, with more individuals coming in quite close to that value than in the highly dispersed state seen before the program began.  Excluding outliers, this tends to be a valuable indicator of overall program success for clinical activity studies (not necessarily for qualitative parametric studies).
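
One way to quantify that "tightness," offered as a hedged sketch rather than a fixed rule: compare the spread of values before and after the program, for example with Levene's test for unequal variances.  The values below are invented.

    # Invented example: after the program, HbA1c values cluster more tightly
    # around the target even though the mean goal is not reached. Levene's
    # test checks whether the spread (variance) has genuinely changed.
    import numpy as np
    from scipy.stats import levene

    rng = np.random.default_rng(42)
    target = 7.0
    before = rng.normal(loc=8.4, scale=1.6, size=200)  # dispersed, far from target
    after = rng.normal(loc=8.0, scale=0.9, size=200)   # still above target, but tighter

    stat, p = levene(before, after)
    print(f"Mean before/after: {before.mean():.2f} / {after.mean():.2f} (target {target})")
    print(f"SD before/after:   {before.std(ddof=1):.2f} / {after.std(ddof=1):.2f}")
    print(f"Levene's test for equal spread: W = {stat:.2f}, p = {p:.4f}")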