NOTICE!!

 

A survey has been developed to document and compare GIS utilization in the workplace.  It assesses GIS availability and use in both academic and non-academic work settings, with the purpose of documenting the need for GIS experience as an occupational skill.  GIS is currently underutilized by most companies.  Spatial technician and analyst activities, along with a few managerial activities requiring GIS, are reviewed.

 

This survey, which takes about 20-25 minutes to complete (’tis a bit long), can be accessed at

 Survey Link

…………………………………………………………………………

Note:  As of May 2013, I have a sister site for the national population health grid mapping project.  Though not as detailed as these pages, it is a standalone site that reads more easily and is easier to navigate.  LINK

………………………………………………………………………………

Applying new Methodologies to Spatial Data for Disease Ecology and Epidemiology Research

A number of my research processes that use GIS are, I have to admit, unique.  These methods are not developed entirely from scratch.  They are usually methods I saw applied at one or more of the various remote sensing or advanced engineering symposia held annually, usually with an abstract or full report published about the methodology used.  These writings are what gave me the idea of applying the same tools to another application.  This is what I like to do with GIS.

I reconstruct the old formulas and methods for new uses in the analysis of health and disease.  I make no friends at work engaging in this kind of research.  But years ago I learned it was more satisfying being ahead of the pack than being just another biostatistician.

One typically does not make many friends working this way, but one does feel a better sense of control over the industry, perhaps even an ability to impact what might happen in the future.  Even when that ability to control what happens next is lacking, you can usually still foresee what is about to occur without much input of your own, and decide whether to take that popular route or find yet another new pathway to discovery.

Regarding my methods and discoveries, the way I see it is that if they don’t become part of the status quo, then someone else’s idea on how to accomplish the same results will.  My disease and population prediction formulas will be rediscovered by someone engaged in some other big industry like oil or some sort of consumer marketing predictive modeling.  Currently, my 3D modeling techniques are 15 years old to me, new to the industry, but new only because the industry has been so slow at incorporating innovations into its processes.

Unfortunately, I have learned over the years, since focusing on industry first with regard to GIS and academia second, that industries usually like to say they are ahead, or at least believe they are ahead.  One of the first and perhaps most frequently cited people viewed by industry as “ahead of his time” is Edward Tufte.  It is now 15 or 20 years since he published his two books on visual display.

I find it very amusing to hear from people at work who tell me they just discovered him and want to incorporate his concepts into their work.   This brings me back to the bell curve as it relates to the status quo–67% of those who say this do not understand enough to make such a change come true, another 27% have some idea of what to do but don’t know enough to implement it, and of the remaining 6%, the 5% majority consists of individuals who can bring some parts of these new ideas to life, often requiring teamwork and time to accomplish the task, and usually with little support from higher authorities.  Then there are those 0.9% who are able to make things happen, the “Bill Gates” and “Steve Jobs” of their field.  Unfortunately there are not many leaders like them, for if there were, this 0.9% wouldn’t exist.  Such is the end result of being too much in the norm, even when you like to claim innovativeness.

So what about that last 0.1%?  They are the true discoverers and inventors, a few of whom made it into the system somehow.  They are not people you see in many companies, especially American companies, which seem to play down the meaning of teamwork when bringing innovative ideas together within an industry.  Edward Tufte’s model of how to produce and display information using methods born of creativity and genius is a primary example of this.  As a statistician, I see 20th century math techniques being applied to 21st century medicine and industry.   An effective CEO or company will hire the newer people capable of bringing the new knowledge and experience needed for GIS into the workplace.    Any who don’t will watch as their company falls behind and is replaced by the innovators.

VDQI Book Cover

From http://www.edwardtufte.com/tufte/books_vdqi

There is a reason for this that impacts us GIS people.  That reason is the focus on managing skills, not the knowledge base and technical skills needed for the company to get ahead.  Companies like to keep old-time managers, and keep on using those older pre-GIS (essentially pre-1987, give or take) statistical skills.  For this reason GIS is rarely employed in any innovative way by most of the companies out there.  GIS is more a tool to calculate than to explore, like an Excel graphing program or some modified Access program capable of making different-looking reports, displaying the same old information in a slightly newer way.   Even new managers appear to lack much of this new technology knowledge and these skills regarding GIS.  Even more importantly, I suspect fewer than 5% could produce a map from scratch on a GIS system, add a point to that map depicting where their office sits, and ask the GIS to identify the closest office neighbor of similar gender and age.  Persons highly skilled in AutoCAD, in spite of their 20 years of experience, do not understand the differences between AutoCAD and GIS unless they have used both.


Something I produced following Tufte’s recommendations

One of the things statisticians are most sensitive about is their formulas.  Their math and Boolean products are the result of a personal intellect that has evolved along with that of their colleagues and co-inventors.   Companies like to lay claim to these workers’ IP, but if the company lacks the manpower needed to duplicate or replicate such knowledge, they are probably more like that 67% + 27% I just spoke about, unable to make use of this new knowledge, much less replicate it.  It is the employee who has and uses the IP, not the company, making IP rights a major issue when it comes to information and knowledge sharing.

This is why I make my formulas difficult to decode.  At times they may even be impossible to decode and redevelop, even with the SQL and other languages sitting right there in front of you.  Those of you who know me already know that if I wanted to release a formula, it is probably already out there.  I am more into sharing and teaching than keeping my discovery locked up for some company to decide whether or not to make use of it.  Take, for example, the formulas I used for hexagonal grid mapping (I still need to produce SQL and SAS versions for release).  These formulas are out there because I learned over the years that they are mostly used by students.  Students are the best learners.  Professionals and other people away from college too long are probably no longer learners and innovators, just doers.

I have yet to see someone perform transect analyses of West Nile mosquito ecology areas as a way to better understand the spatial data.  These GIS epidemiologists still abide by the point-based analytical methods requested from above.  Whatever data they gather has limited application to anything other than another point area within a certain distance from their test site.  They cannot really use these results to make any predictions.  With transects and gridded areas, you learn about a large area that is exponentially greater than what you learn from a single series of point observations.

So be it for this new technology.  Due to those diehards who favor the old-fashioned methodologies, GIS is not being fully applied in the workplace and spatial statisticians are engaged in a profession that is very misunderstood.


The following are important GIS applications to medicine and epidemiology that I feel require a more complete and consistent means for engaging in disease studies in a more “modern” way (for lack of a better term). I call these methods more modern because most traditional epidemiology still relies on standard methods designed several decades ago, if not one or more centuries ago. We typically see a fairly standard point-based analysis of people, diseases, hosts, vectors, pathogens or causative agents, all in relation to statistics based upon imprecise local areal data such as population densities and counts. Since this method, which relies upon coarse areal data even at the block level, is considered the most accurate way to explain incidence and prevalence, it is the methodology chosen to publish estimates of risk and the like. Another way of looking at this issue is at the personal, patient, consumer level–trying to find an answer to the question ‘what is the likelihood that I will have the problem under review based upon my contacts in the surrounding environment with its various causes or methods of development?’

book cover

ibid. 

Note: Some of these new techniques are used by innovative, modern and future technology companies, like some of the NOAA and aerospace groups.  The above 3D mapping of clouds is much like the mapping of mines, ocean bottoms and topography.  Population health mapping techniques are an extension of this type of application of large amounts of data.  The National Population Health Grid-mapping technique I developed is an example of this.


Comparing two diseases


Improvement #1. Spatial analysis using blocks, grids, polygons and contours.

Traditional Approaches

The following is a typical method of reviewing case data in relation to environmental data.  This method employs fairly large areas to perform the review, and lacks the spatial detail needed to assign cause and effect in a fairly reliable fashion.  For the most part, this is how information is presented to the public and others outside the immediate research setting.  This method of reporting is used in order to assure a certain amount of privacy on behalf of the patient populations often referred to and discussed in such epidemiological reviews.  A small problem with this argument for why cases are not provided in detail–the concern that some individuals might be identified, in violation of personal privacy rights–is that it does not really hold, as we can tell from the publication of county data pertaining to rare or unusual medical problems, where only one or a few such individuals reside in that particular area.

The disadvantage with the census-block approach to this type of analysis is that it ignores ecological differences found within each census block. There is another problem with the use of small census blocks as well–the fact that many blocks either lack people completely, have small numbers of people and therefore usually have 0 cases to review, or have portions that cannot be related to your work is what makes this methodology problematic at times and the results often questionable and unreliable.

Take for example the  following presentation of data on cancer, leukemia and lymphoma.  These presentations are easy to understand and perhaps even impressive to include in a local town hall discussion, but they do little to provide us with the statistical information needed to actually infer a link between chemical release sites and cancer.

The more helpful methods for applying GIS to spatial incidence data require more than just a block review in relation to roughly mapped-out cases.   There are several possible avenues (no pun intended) that may be taken with GIS for chemical release site spatial analysis.  Although these methods go about this review in a manner opposite to traditional epidemiological methods, in which the focus is on cases and people and the likelihood for cases to develop within specific population types, they assign more importance to sites and their chemistry first, to which risk is then assigned based on site features and behaviors, followed by attempts to relate this information to the local populations as a whole, based upon standard methods of population analysis such as age- and gender-related normalization of data sets (usually referred to as age-gender adjustments of incidence rates in non-GIS-focused reviews).

One method employed for this type of analysis in the past by this researcher and colleagues was the use of a Monte Carlo model to review spatial locations using a standard non-GIS statistical tool.  For this method, locations for each case had to be produced (with GIS), and this database was then used to perform a Monte Carlo analysis of the points based on standard census data, applying a moving circle to determine whether anywhere within that circle an area could be found that appeared to have an exceptionally high rate of cases, based on block group data linked to the moving circle area.  This produced outcomes that appeared to be more a result of errors than anything else.   Theoretically, thousands of analyses were performed for each moving circle placement, for each of the 18 sites focused on for this study, in a fairly large urban and suburban setting (possibly 1800 sq mi), using a 9-mile diameter moving window.  The likelihood of errors in the outcomes for such a widely dispersed area of review was increased by the multistep, multi-tiered (18-tiered) analyses required for this activity.   Of all the outcomes to achieve, this one resulted in the identification of 2 sites where statistically “different” or significant outcomes in terms of population density and incidence/prevalence rates prevailed.  Upon further inspection it was found that this outcome was generated for an area with an exceptionally low population density, making it more likely that the outcome was due to a low-numbers issue and not the result of this methodology or the actual data.
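The sketch below illustrates the moving-circle Monte Carlo idea just described, written in Python with hypothetical case and population coordinates.  The permutation-style significance test, the coarse lattice of circle placements, and every name in it are illustrative stand-ins, not the original SAS/GIS implementation used for the 18-site study.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical point data: (x, y) in miles; replace with geocoded cases/population.
population_xy = rng.uniform(0, 40, size=(5000, 2))   # all persons at risk
is_case = rng.random(5000) < 0.01                    # ~1% flagged as cases

def cases_in_circle(center, radius, xy, case_mask):
    """Count cases and persons falling inside one circle placement."""
    d2 = ((xy - center) ** 2).sum(axis=1)
    inside = d2 <= radius ** 2
    return case_mask[inside].sum(), inside.sum()

def monte_carlo_circle(center, radius, xy, case_mask, n_sims=999):
    """Empirical p-value: is the observed case count inside the circle unusually high?"""
    observed, n_inside = cases_in_circle(center, radius, xy, case_mask)
    n_cases = case_mask.sum()
    sims = np.empty(n_sims)
    for i in range(n_sims):
        shuffled = np.zeros(len(xy), dtype=bool)
        shuffled[rng.choice(len(xy), size=n_cases, replace=False)] = True
        sims[i], _ = cases_in_circle(center, radius, xy, shuffled)
    p = (1 + (sims >= observed).sum()) / (n_sims + 1)
    return observed, n_inside, p

# Step a 4.5-mile-radius (9-mile-diameter) window across the area on a coarse lattice.
for cx in np.arange(5, 40, 5):
    for cy in np.arange(5, 40, 5):
        obs, n, p = monte_carlo_circle(np.array([cx, cy]), 4.5, population_xy, is_case)
        if p < 0.05 and n > 0:
            print(f"circle at ({cx},{cy}): {obs} cases among {n} persons, p={p:.3f}")
```

Even in this toy form, the sketch shows where the low-numbers problem comes from: placements over sparsely populated ground can return “significant” counts driven by only a handful of points.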

Although interesting to some extent at the public health officer’s level, the reality of this methodology and its results indicated that it was fairly unreliable.  This means it would be a mistake to apply this GIS methodology to developing some sort of intervention or public health awareness program related to human exposure to toxic release chemicals.  It also means that any related arguments fostered by this method of analysis, with the goal of initiating a site clean-up, would be considered unreliable by any reasonable statistician reviewing this work.  This results in an overall project outcome that no GIS epidemiologist wants to have to deal with in terms of publicity and credibility.   Therefore, in terms of monitoring and documentation of risk, I hate to say that this Monte Carlo method, relying on many of the standard methods currently employed in poorly developed and managed public health programs, can be very poorly suited to local environmental health surveillance programs.

At the least, a more robust use of the same methodology has to be employed, with the analysis carried out at the house-by-house, business-by-business level.  Part of the problem with the above method was the size of the circle used for the analysis.  Although large circles give large numbers overall, their “precision” was questionable.  Such a method might be better applied using a smaller circle.  The problem then, however, is that exceptionally small numbers reduce the likelihood that the outcome will have any credibility and reliability.  So we might as well just skip this step in such an analysis in the long run, since it works best with conditions of high incidence rather than low incidence such as certain forms of cancer and leukemia.

Traditional Methods for Release Site Mapping

There are a number of ways we can produce traditional chemical release site maps.  I call these methods “traditional” because they do not necessarily go much beyond the standard methods that appear in the GIS literature.  Some of the steps used to produce “traditional” site maps are not necessarily in common use.  These methods must have been applied at some point in site studies, even though few articles exist describing them.  In one major reference book on release site chemistry, one of the methods I developed was used by another leader in the field, but for some reason this method was never fully implemented or employed, and has not since been modified, improved and/or published.

The following three maps depict one common way data might be mapped and demonstrated.  The leftmost map depicts all local chemical release site history, the central map identifies the primary culprits attached to this environmental history (superfund and superfund applicant sites), and the third shows the distribution of a particular type of cancer that can be correlated with site distribution.

The next series depicts case distribution as it relates to census block data values, with the number of cases per standard number of people evaluated.

The following three images depict census block case density in relation to sites and a general overview of the chemicals they were noted to release in the chemical datasets provided for each  and every release site, except for those excluded or removed from the report list.

Release Sites.

In terms of site analysis, the ways to review sites have focused on just the site, with limited review of local epidemiology (due to restrictions on such information under HIPAA).  The following two maps depict the general statewide condition of Oregon in relation to release history.  Most important to note is the high density of sites in just a few small areas of the state.

Each county, site, town, etc. has a chemical history that can be evaluated.  A number of methods were attempted to determine the best way to analyze site chemistry in relation to toxicity and carcinogenicity.  The methodologies employed for this part of the research utilized standard methods of chemical evaluation based on widely recognized CERCLIS- and NIOSH-derived chemical and toxicological information.  Information produced by the NIH, the AMA, the EPA, etc. was also applied to this work, and the formulas established by each were used to define general toxicity and carcinogenicity features for the chemicals found at each site, which were then also evaluated mathematically to define a way to quantify site chemistry and carcinogenicity features.

The most important step in release site chemistry research was a close review of each chemical tested for and reported in the EPA database.  This process involved the review of more than 65,000 chemical reports over time, for approximately 2500 sites, involving about 175 chemicals tested for in different amounts at each of the sites.  All reports were scrutinized as part of this Oregon State University-produced EPA site review process, with reports stored on the internet.  In many cases records were not documented when they appeared as redundancies in the sampling process, either temporally or spatially (i.e. many reports generated for the same site on the same chemical, same test, same date).  Even with this filtering, 65,000 reports were added to the database developed for this research and evaluation.  These represented 450 confirmed release sites, all superfund and superfund applicant sites (82 total, 12 of which were superfund), with the remaining sites treated as general sites without confirmation of release according to each site’s status in the EPA database.  Since sites seemed most important to review during the earliest decades of this work, these are considered potential sources of exposure related to the local cancer cases included in this study and so are the focus of the database review.  In general, recently recorded sites in the database are much smaller in size and less severe regarding spill type and toxicity.  Very few major reports of spills (i.e. > 100 reports, or greater than 35 chemicals tested at the site) were found in the years after 2001 during the course of this study.

The following illustrates how the chemical data was reviewed at a typical site level.  For each of these, chemicals were reclassified into larger groups, with groups defined based on a combination of chemical bonding and toxicity features.  Since carcinogenesis is often the result of bond-related features, chemicals with similar bond types (aromatic double bonds) or elemental constituents (those with highly electrophilic features like the halogens, or those which “shed” electrons easily like metals) received a similar classification, even though individual toxicities differed.  In this way, the least carcinogenic substances could be separated from the most toxic and/or most carcinogenic of these, with many groups formed in between.  (For more on this, see the groups identified by bar charts produced with the GIS.)  What these bars demonstrate is that for each chemical site type that exists, there is a fairly predictable release history, and a related predictable carcinogenicity for the releases that happen.  In general, it can be stated that sites can be carcinogenic if their releases include petroleum products (due to high benzene and polycyclic aromatic content), if they have well-defined halogenic chemical content in their history (esp. halogenic aromatic compounds), or if they have well-documented carcinogens with a well-documented association with the release of these substances based on their industrial history (i.e. SIC number).
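To make the reclassification idea concrete, here is a minimal Python sketch of rolling a site’s reported chemicals up into broader bond/toxicity classes and a crude weighted score.  The keyword-based classifier, the class names, and the weights are placeholders for illustration only; they are not the actual classification scheme or scoring formulas used in this study.

```python
from collections import Counter

# Illustrative class weights only; the study's real weighting scheme is not reproduced here.
CLASS_WEIGHTS = {
    "halogenated_aromatic": 5,
    "polycyclic_aromatic": 4,
    "simple_aromatic": 3,
    "halogenated_aliphatic": 3,
    "metal": 2,
    "other": 1,
}

def classify(chemical_name):
    """Very rough keyword-based reclassification into broader bond/toxicity groups."""
    name = chemical_name.lower()
    if "chloro" in name or "bromo" in name:
        return "halogenated_aromatic" if "benzene" in name or "phenol" in name else "halogenated_aliphatic"
    if "pyrene" in name or "anthracene" in name:
        return "polycyclic_aromatic"
    if "benzene" in name or "toluene" in name or "xylene" in name:
        return "simple_aromatic"
    if name in {"lead", "nickel", "chromium", "arsenic", "cadmium"}:
        return "metal"
    return "other"

def site_profile(chemical_reports):
    """Return class counts and a weighted score for one site's list of reported chemicals."""
    counts = Counter(classify(c) for c in chemical_reports)
    score = sum(CLASS_WEIGHTS[cls] * n for cls, n in counts.items())
    return counts, score

# Example with a made-up report list for one hypothetical site.
counts, score = site_profile(["benzene", "chlorobenzene", "benzo(a)pyrene", "chromium"])
print(counts, score)
```

The class counts are what the bar charts mentioned above summarize per site; the weighted score is one possible way to collapse them into a single comparable value per site.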


Not only could individual sites be mapped for their chemical history, but as a collection of data the various sites evaluated for this work could also be collectively reviewed and standard chemical profiles defined for given industry types, based on the chemical profiles of sites already documented in the EPA database.  For example, chemical profiles were established for several major reclassifications of sites based on their history and SIC.  This data could then be assigned to sites with matching SIC values, classes and history–sites which, due to their EPA Superfund status, were not fully reviewed for their chemical history and site profile.


 

As an example of a potential application for this type of review, let’s say you were investigating the local tanning and creosoting industries and wanted to know more about what they were most likely to be producing and releasing.  A review of the statewide Confirmed Release Site data for Oregon shows they have the following profiles or chemical “fingerprints”:

Tanning industries are more likely to be carcinogenic due to their release of halogenated aliphatics (usually non-aromatic C2-C4 or C5 linear compounds and isoforms bearing halogens) or metals that may be considered carcinogenic (Ni? or Cr?), whereas creosoting businesses utilize large amounts of polycyclic aromatic hydrocarbons (oil-based fuels) and use halogenated aromatics as their extraction, stabilization and evaporative coating agents (a benzene or phenolic with one or more Br or Cl attached to the 6-membered ring).

The typical presentation of this data is at the following county-based level.  Sometimes unique areal coverage is provided as well (perhaps as an example of the problems within a state congressman’s district).  But usually individual site-by-site detailed chemical reviews are not managed except by the companies themselves or the environmental engineering firms overseeing their waste disposal and clean-up processes.

Defining the Research Area

A number of areas in the state are what we consider “island communities,” or regions where external influences are at a minimum.  This means that, due to low traffic flow into these regions from other regions nearby, local cancer cases may be reviewed as cases more likely due to local environmental influences.  The best example of this is Bend, but perhaps also Coos Bay, located on the western shore (not shown on this map due to lack of significant superfund history).  We can also perhaps consider the two southern cities fairly isolated as well.  These assumptions allow us to review one region in relation to another, at times determining whether or not cancer incidence and potential for exposure are regional intrastate features in need of further investigation.  The following map was produced early on in order to define regional boundaries for the state of Oregon, and then to determine whether cancer could be regionally assessed spatially and statistically.  (Cases are displayed as points, varying in color for each region.)

A more applicable way to credibly define risk is to reverse the argument being used for your research. Instead of looking at an area and pointing the finger at a single cause such as an electric line, a mass communications antenna, an industrial stack or known chemical storage site, or an old undocumented factory building, one can look at the total features in an area and relate this to the numbers of people on an individual basis, not at an areal level as is typically done with census data. The major argument against this approach, to date, pertains to human privacy rights.  By displaying a map of this data with spatial detail regarding cases, you might reveal who it is in a given area that possibly has a cancer type, pointing the finger at local exposure risk even though this person actually developed the problem due to a worksite or to residing elsewhere earlier in life.  This causes problems at the personal level, in terms of victims’ rights and the impact of such public knowledge on his/her personal and professional life, and it affects the rights of the producer of the potential carcinogen as well–the industries or land owners involved with the various local carcinogen release sites. Besides, stating that there is a case of cancer locally does not absolutely link that case to a specific local feature.

Grid Analysis

In the following examples of studies performed for the Portland, Oregon area, chemical release sites were reviewed areally by producing hexagonal grids of various sizes to test their application.  These hexagons mimic the moving window (circle) ideology, with the exception that there is a periodic rather than continuous assignment of new centroids for each adjacent area under review.  Whereas the moving window in circle or square form does take a look at every possible spatial association, due to its focus on continuity and contiguity of adjacent regions being tested, the use of grid cells as a substitute becomes a better method when the size of each cell is significantly smaller than the moving window normally employed, and the recurring windows that exist, one next to the other, are of such a size that possible confounding features are eliminated or reduced to a bare minimum.  A small-area hexagonal analysis, if applied using the same mathematical formulas as a moving window, becomes more detailed, productive and applicable in its outcome than the same study using a fairly nondescript feature that fails to assign risk and outcomes to a small enough space worth following up on in real life and time.

The following are examples of the application of square-cell grid mapping.  The use of square grid cells is the most common method employed, even though it has some major area-based flaws that can creep into a spatial analysis based too heavily upon distance-related information.  The following two examples reviewed toxic release in two urban, small-city settings where a number of toxic release sites were identified, some with either a history of high toxicity or a history of evaluation for superfund status.

Astoria

Description: NW corner of Oregon.  One-mile-square grid cells were used.  An isoline was produced and overlain on the most toxic port area.  Cell darkness depicts the severity of the potential for local toxicity based on chemical reports.  Point data pertain to cases.

Coos Bay

Description: South-central coastline, one-mile grid with a 0.25-mile grid overlay used to define parts of the major one-mile plots under review.  More colorful (light salmon to yellow to red) 0.25-mile grid cells depict more toxic small areas.  Point data pertain to cases.

Hexagonal Grids

Christaller’s method of reviewing population data was originally considered an economic geography methodology. It was used to define areas of need based upon specific places where new or unique events were taking place. A typical Christaller review of clinics, for example, would state that a large area might have several areas where the same clinic could be placed in order to serve a given region, with multiple clinics serving multiple regions, each in its own specific way based on an identical goal or premise.  Christaller’s method also allowed for some hierarchical considerations to be made, such as how one area and its centroid might be linked to surrounding and distant areas, with stratification of sites allowed for this method of large-area review. Christaller’s method also had attached to it the assignment of similar regions established side by side, in some form of grid format, in which the grid consisted of identical hexagonal boxes rather than identical square boxes. The purpose of this method of grid development was that hexagons more accurately represent the spatial relationships that exist between one part of a surface and its nearest neighbor of equal size and proportion. All of the points within a hexagon demonstrate a closer relationship to the hexagon’s centroid than the corner sites in a square grid cell do to theirs. This means that spatially, all of the points forming that hexagon are more closely tied to their centroid, with less variability and a lower range of distance, than all of the points in a square.

To apply this thought process to GIS, grid maps must be produced using hexagonal cells rather than square cells. This way, all of the points in the cell (i.e. patients, cases, etc.) are more closely correlated to the centroid of that cell area than all of the points in a square cell of the same height and width. In the past, hexagonal grid modelling was probably not popular due to the difficulty of defining its layout. To produce hexagons, if you focus on drawing the lines you rapidly reach a limit due to the complexity of the line drawings, relative to the much simpler two-dimensional method used to produce traditional grid layouts. However, if grid cells are produced using Thiessen polygon methods based on a series of point distributions laid out according to formulas derived from hexagonal trigonometry, what we get is two sets of formulas, alternating by row, that can be used to map out the centroid points for each theoretical hexagonal grid cell. A Thiessen polygon methodology can then be used to depict the spatial relationship between neighboring points, producing sequences of neighboring polygons covering the areas closest to each independent point. Except at the edges, each of these polygons forms a near-perfect hexagon. (Variance in shape is due to lat-long dependencies in mapping: area and distance are slightly different between the center of a map and its edges, and GIS mapping tools may not detect these differences, sometimes even at the 8-decimal level.)
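A minimal Python sketch of this centroid-and-Thiessen approach follows: two alternating row formulas lay out the hexagon centroids, and a Voronoi (Thiessen) construction around those points yields near-perfect hexagonal cells away from the edges.  The spacing formulas follow standard hexagon geometry; they are a stand-in for, not a copy of, the SQL/SAS formulas referenced elsewhere on this site.

```python
import numpy as np
from scipy.spatial import Voronoi

def hex_centroids(xmin, xmax, ymin, ymax, spacing):
    """Centroid points for a hexagonal grid.

    Two alternating row formulas: even rows start at xmin, odd rows are shifted
    by half the spacing; row-to-row distance is spacing * sqrt(3)/2.  The Voronoi
    (Thiessen) polygons of these points are regular hexagons away from the edges.
    """
    dy = spacing * np.sqrt(3) / 2
    points = []
    for row, y in enumerate(np.arange(ymin, ymax + dy, dy)):
        offset = 0.0 if row % 2 == 0 else spacing / 2   # alternating-row formula
        for x in np.arange(xmin + offset, xmax + spacing, spacing):
            points.append((x, y))
    return np.array(points)

# One-mile centroid spacing over a hypothetical 20 x 20 mile study window.
centroids = hex_centroids(0, 20, 0, 20, spacing=1.0)
vor = Voronoi(centroids)          # Thiessen polygons around each centroid

# Interior cells have 6 vertices, i.e. they are hexagons; edge cells are open or irregular.
closed_regions = [r for r in vor.regions if r and -1 not in r]
hexagonal = sum(1 for r in closed_regions if len(r) == 6)
print(f"{len(centroids)} centroids, {hexagonal} closed 6-sided (hexagonal) Thiessen cells")
```

The same centroid layout can be exported to a point shapefile and handed to any Thiessen/Voronoi tool in a GIS; the code above merely demonstrates that the alternating-row point lattice is all that is needed to get hexagonal cells.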

Portland

The hexagonal grid map, when produced with exceptionally small cell area values, is to some extent nearly identical to point mapping. If you assign the smallest cell maximum-distance values based on error values, then each cell effectively becomes a representation of space with error taken into account.  Each small cell represents the most reliable value for a given study.  For example, a 0.1 by 0.1 mile area (0.01 sq mi), with an error of approximately 0.01 to 0.05 miles in length, represents the lowest areal distribution that allows for error and yet still results in placement within the cell, making this point’s values applicable to the cell’s centroid point.  If you have an error in calculation that is beyond the limits of the cell boundary, this suggests a neighboring cell could actually have possessed that value, meaning that the calculation for this site may also be considerably off, cell-wise.   In essence, if you double the size of the cell above whatever error you assign to a value, you reduce the likelihood of being wrong in terms of the true placement of that point within the cells.  For this reason, small-area hexagon analysis should be considered fairly accurate, so long as measurement or mapping error is taken into consideration.
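The cell-size-versus-error argument above can be checked with a small simulation.  The sketch below uses a square cell as a simplified stand-in (the hexagonal case behaves similarly) and hypothetical error values; it is only meant to show that doubling the cell size relative to a fixed positional error roughly halves the chance a point gets attached to the wrong cell.

```python
import numpy as np

rng = np.random.default_rng(0)

def misassignment_rate(cell_size, position_error, n=100_000):
    """Fraction of points whose positional error pushes them into a neighbouring cell.

    Simplified square-cell stand-in: points are placed uniformly in a cell, jittered
    by a uniform error, and we check whether the jittered point stays in the cell.
    """
    true_xy = rng.uniform(0, cell_size, size=(n, 2))
    jitter = rng.uniform(-position_error, position_error, size=(n, 2))
    noisy = true_xy + jitter
    inside = ((noisy >= 0) & (noisy <= cell_size)).all(axis=1)
    return 1 - inside.mean()

# Doubling the cell size against a fixed 0.05-mile error roughly halves misassignment.
for size in (0.1, 0.2, 0.4):
    print(f"cell {size:.1f} mi, error 0.05 mi -> misassigned {misassignment_rate(size, 0.05):.1%}")
```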

Contours or Isolines

In the following examples of isoline maps, the first map demonstrates disease density isolines for the distribution of cancer cases researched for this study.  This was produced using an ArcView Avenue extension, with a square grid placed over the entire map and the number of cases per cell counted and placed in the dataset for each cell (using an extension).  These values were then transferred to centroid points on a per-square-grid-cell basis (a several-hour step using yet another Avenue extension back in 2003), and a centroid point shapefile produced with the data attached to each point (using the final extension).
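For readers who want the same count-per-cell-to-centroid workflow without the old Avenue extensions, here is a minimal Python sketch of the equivalent steps with hypothetical case coordinates; the grid extent, cell size and array names are illustrative only, and the output table (x, y, count per centroid) is what a contouring routine would then consume.

```python
import numpy as np

# Hypothetical geocoded case coordinates (easting/northing in miles).
rng = np.random.default_rng(1)
cases_xy = rng.normal(loc=[50, 50], scale=15, size=(2000, 2))

cell = 5.0                                       # square-cell width in miles
x_edges = np.arange(0, 100 + cell, cell)
y_edges = np.arange(0, 100 + cell, cell)

# Count cases per square grid cell (what the Avenue extension did in several steps).
counts, _, _ = np.histogram2d(cases_xy[:, 0], cases_xy[:, 1], bins=[x_edges, y_edges])

# Attach each cell's count to its centroid point, the input for contour/isoline creation.
cx = (x_edges[:-1] + x_edges[1:]) / 2
cy = (y_edges[:-1] + y_edges[1:]) / 2
centroid_x, centroid_y = np.meshgrid(cx, cy, indexing="ij")
centroid_records = np.column_stack([centroid_x.ravel(), centroid_y.ravel(), counts.ravel()])

print(centroid_records[:5])   # x, y, case count per centroid
```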

This first map had to be cleaned up quite a bit.  The parts of the grid not over the state had values attached to them.  The edge of the state, if viewed close up, has lines that extend well beyond the state’s border, due to empty cells in that part of the map.  These were covered up by overlaying a white polygon drawn to fit from the edge of the state boundary to the edge of the grid.  Another problem with the use of the square grid (as mentioned earlier) was that a close-up view of the isolines shows a staircase-like change in values from one cell to the next–there was no smoothing of each isoline.

This second map worked much better at producing isolines that are smooth.  The above map demonstrates a fairly visually friendly depiction of contour lines (these at an extent less than the grid cell that is displayed).  Hot spots can easily be identified with this method, and as will later be demonstrated, the relationship between these toxin-rich areas (cells) and cases will be a very useful way to assess spatial features for the disease-causative agent location relationship.

Advantages to Hexagon Grid Mapping

The major advantage to using a hexagonal grid to map demographic and disease ecology or statistical features is that this methodology produces more credible contour or isoline maps for a given areal feature. The use of square-cell grids to produce contours results in fairly unpleasant maps, with stepwise-looking lines meant to move in some fairly smooth angular fashion. This side effect of square grid mapping is almost completely eliminated using small-cell hexagonal grid mapping techniques. The result is an isoline map that is fairly accurate in defining spatial distributions. If you increase cell size, you do increase error problems at the areal analysis level, but you can also intensify the contouring of features, making them more visible to the viewer and, at times, more accurate and applicable.

In addition, this method should follow a fairly standard methodology used for other GIS spatial analysis methods. It helps to produce similar maps using several cell sizes to determine the level at which sensitivity to change across space seems most noticeable. This will be visible to the naked eye: the neighboring cells will show more distinct differences between neighbors extending across great distances. The most sensitive and useful level of differentiating cells across space by this measure will produce fairly useful maps showing fairly distinct clusters and high-ranked cells versus low-ranked cells, and will avoid both maps where cells display essentially no incidence and maps where a few cells unrealistically cover exceptionally large areas, all with similar amounts of the measured incidence.
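The cell-size scan described above is mostly a visual judgment; the sketch below is one possible numeric companion to it, using hypothetical clustered cases and a coefficient of variation across occupied cells as a rough proxy for how strongly each cell size differentiates space.  The proxy metric and thresholds are my own illustrative choices, not a rule from this study.

```python
import numpy as np

rng = np.random.default_rng(2)
cases_xy = rng.normal(loc=[50, 50], scale=10, size=(3000, 2))   # hypothetical clustered cases

def cell_contrast(xy, cell_size, extent=100.0):
    """One crude proxy for how strongly a grid differentiates space at this cell size:
    the coefficient of variation of counts across occupied cells, plus the occupied fraction."""
    edges = np.arange(0, extent + cell_size, cell_size)
    counts, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=[edges, edges])
    occupied = counts[counts > 0]
    return occupied.std() / occupied.mean(), (counts > 0).mean()

# Repeat the mapping at several cell sizes and inspect how contrast and emptiness trade off.
for size in (1, 2, 5, 10, 20):
    cv, frac_occupied = cell_contrast(cases_xy, size)
    print(f"{size:>2} mi cells: contrast (CV) = {cv:.2f}, occupied fraction = {frac_occupied:.2f}")
```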

Extremely local or small-scale analysis of a particular area is perhaps the best way to engage in the research of a site location and the chances for chemical exposure.  The benefit of this approach is the manner in which multiple sites may be reviewed for each area.  Instead of assigning blame to a specific type of site (since that site may not be to blame), this method of research enables blame to be placed on general areal features.  This method is in fact similar to a recent method used to assign blame for unhealthy conditions in Oregon, just east of Portland, in relation to watershed contamination and public health history.   The local government and EPA could have pointed a finger at particular industries by linking the chosen section of the watershed to its major industry or polluter.     Instead, since it was possible that both multiple industries and numerous domestic settings could have been responsible for this contamination, the watershed itself was determined to be the site, and not a particular industrial site or a specific location definitive of this area such as the landowner of its centroid.

The use of small-area grid cells removes some of the blame for increased incidence rates from certain industrial causes.  Such a methodology enables us to develop certain social impressions and safety-minded conceptions about a given area, without assigning so much political blame (often a cause for local government unrest with such social stances).   In this way we can use this approach to achieve an outcome without necessarily assigning direct blame or pointing the finger for local problems.  This is a major benefit of taking this research approach, since there is always a lack of certainty in defining a single local cause for a number of local cases that erupt, or in assigning blame based on air or water pollution history, or in developing such a need due to the impact of such environmental “villains” on local property values.  Whereas social inequality is an important reason for designing and implementing research programs with good intent, it is hard to find or produce unbiased results that are also reliable for many of these studies.

Recommended Methodologies

So why perform this method of research at the point-point analysis level as I strongly suggest GIS technicians and epidemiologists do?

It turns out that we still get important insights into the situation at hand using GIS and taking this non-traditional research approach. Some could argue that this approach even pays more attention to the patient, consumer or local resident and is therefore more publicly focused and potentially more publicly supported and financially supportable. As long as the limited reliability of this method is made clear from the beginning, this provides a method by which monitoring can happen, enough to provide background data that will be helpful to future generations of researchers of this topic, data that will be applicable if and when an actual public concern begins to surface and/or actual cases of questionable relationship to this feature develop. The complete look at risk for a region, in relation to points, provides the data needed to search for and compare incidences in similar regions with similar point-related history and background (areas with similar population features, case features or release site features), allowing for further comparisons to be made between areas with supposedly identical histories.

Another reason for promoting the use of GIS for testing out non-traditional research methods for disease relates to the uses of other tools and methods that have been out there for quite some time, which for whatever reason are never fully applied by most traditional epidemiological researchers. Examples of this way of newly applying GIS to epidemiological work include the use of Thiessen polygons, buffer analysis, spider diagrams, hexagonal grids and isolines to demonstrate spatial relationships for disease. Each of these can be used to illustrate a spatial relationship in a way quite distinct and different from that often found in contemporary reviews. Since these methods have not been fully implemented and tested, their lack of use has to be due mainly to the inability of epidemiologists to evaluate their reliability and validity, or to determine whether they could be of further assistance in the research of disease events, in some way, shape or form.

Thiessen Polygons.  Thiessen polygons are used to select areas of interest and then compare them to one another.  We can identify the centroids of areas defined demographically and/or economically through statistics–through census documents, for example–and compare events that take place in each of these regions relative to their neighbors. We can use this method to evaluate and define risk based on such observations, and demonstrate such important social issues and problems as social inequality. For comparison of regions, the centroid for each Thiessen polygon region would be used to document and display the data. Traditional statistics could be used to contrast and compare each region. The reporting of this data would serve more of a sociological purpose carried out at an administrative level than an actual epidemiological purpose carried out at a statistical or probability-related level.

In the following series, the “urban boundaries” were reviewed and used to produce a smooth-edged boundary of the urban and suburban area to be researched in and around Portland, Oregon.  Based on knowledge of the surrounding urban setting, its topography, its census and areal income data, and information about the locations of industrial properties and other reported chemical release sites (especially superfund sites), an obvious breakdown of the area could be produced, in which social inequality appeared to be a major part of this mapping theme (the highly industrial and toxic northwest sector of the area also has families with the lowest income status).

To produce the Thiessen polygons, first the urban area is defined (urban boundary), then a centroid for the group of sites in the northwest area is produced (their weighted point), followed by a weighted point representing the location of the sites in the northeast sector, followed by two more estimated weighted points defined for the two remaining areas.  These weighted points were then used to define the Thiessen polygon sectors based on the estimated urban boundary.  Each sector was then independently evaluated and compared in terms of cases, incomes, and other census data.  Each of the release sites in these sectors was also evaluated.

Buffer Analysis.  Buffer analysis is a traditional application of GIS for analyzing spatial relationships.  With buffer analysis, areas or points of interest are identified, and locations selected to serve as major sites of interest.  One method to design these points on a map is to begin with some information regarding the major areas of interest, such as a specific number of sites in an area that pose potential risk to each of their communities.  These sites are identified, specific areas representing specific distances from these sites are established, and then the point data located within each area are analyzed to see if one particular area is different from its neighbor or any other research area.    For this analysis, the Northwest Sector of the Thiessen polygon research areas above became the focal point, along with sections of the neighboring sectors.  This enabled the locations of potential “hot spots” to be subdivided further and the population and epidemiological features for these subdivisions to be more carefully reviewed, to determine whether one area is of higher risk than another based on specific exposure or case-related features.  In the following four maps, four chemical types often linked to certain forms of cancer are analyzed to see if one part of the urban setting appears more likely than others to result in exposure to specific chemical types.   The design of this contour analysis enabled hot spots to appear in deep red; significant changes in contours over distance suggest a large area impact.
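Before turning to the maps, here is a minimal Python sketch of the distance-band counting that buffer analysis boils down to: hypothetical case points are assigned to successive rings around their nearest site.  The coordinates, band distances and site locations are illustrative assumptions, not data from this study.

```python
import numpy as np

rng = np.random.default_rng(3)
cases_xy = rng.uniform(0, 20, size=(800, 2))            # hypothetical case locations (miles)
site_xy = np.array([[6.0, 14.0], [12.0, 5.0]])          # hypothetical high-risk release sites
bands = [0.5, 1.0, 2.0]                                 # buffer distances in miles

# Distance from every case to its nearest site.
dists = np.sqrt(((cases_xy[:, None, :] - site_xy[None, :, :]) ** 2).sum(axis=2))
nearest = dists.min(axis=1)

# Count cases falling inside each successive buffer ring around the nearest site.
lower = 0.0
for upper in bands:
    in_ring = ((nearest > lower) & (nearest <= upper)).sum()
    print(f"{lower:.1f}-{upper:.1f} mi ring: {in_ring} cases")
    lower = upper
print(f"beyond {bands[-1]:.1f} mi: {(nearest > bands[-1]).sum()} cases")
```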

Comparing the top two figures, one can see that although aromatics in general have a high-risk focal point established close to a site identified as high risk based on its chemical release history and features [‘H’], reviewing halogenated compounds alone–compounds of considerably higher risk for carcinogenicity than non-halogenated aromatics–did not relate to the high risk defined for this particular point.  In the lower left maps, we see additional chemical features displaying a possible relationship to local potential for carcinogenicity, with four additional hot spots identified based on the contour maps.  The lower right map is the simple aromatics map already discussed.  Of these four maps, we find the greatest carcinogenicity potential risk linked to the concentration of petroleum products sites and PAHs, followed by halogenated compounds.   Typically the reverse would be expected, assuming amounts of exposure were equal for both; the reason for this is the greater density of the petrol-PAH sites relative to the simple and halogenated aromatic sites.  (In general, halogenated aromatics are one of the most carcinogenic organic chemical groups.)

This same data can then be compared with health-related data, in particular human cases.  One common method employed is to look at the distances of cases from a given spot, using some form of nearest neighbor analysis to see if significant differences exist between two groups of results.
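One way such a nearest-neighbor comparison might be run is sketched below in Python: the mean distance to the nearest hot spot is compared between hypothetical cases and a comparison group, with a label-shuffling permutation test supplying the p-value.  The permutation approach and every coordinate here are illustrative assumptions rather than the specific test used in this work.

```python
import numpy as np

rng = np.random.default_rng(4)
hot_spots = rng.uniform(0, 20, size=(9, 2))             # hypothetical hot-spot centroids
cases = rng.uniform(0, 20, size=(150, 2))               # hypothetical case locations
controls = rng.uniform(0, 20, size=(600, 2))            # hypothetical comparison points

def mean_nearest_distance(points, targets):
    d = np.sqrt(((points[:, None, :] - targets[None, :, :]) ** 2).sum(axis=2))
    return d.min(axis=1).mean()

# Positive gap = cases sit closer to hot spots, on average, than the comparison group.
observed_gap = mean_nearest_distance(controls, hot_spots) - mean_nearest_distance(cases, hot_spots)

# Permutation test: shuffle case/control labels and rebuild the gap many times.
pooled = np.vstack([cases, controls])
n_cases = len(cases)
gaps = []
for _ in range(999):
    idx = rng.permutation(len(pooled))
    perm_cases, perm_controls = pooled[idx[:n_cases]], pooled[idx[n_cases:]]
    gaps.append(mean_nearest_distance(perm_controls, hot_spots)
                - mean_nearest_distance(perm_cases, hot_spots))
p = (1 + np.sum(np.array(gaps) >= observed_gap)) / (len(gaps) + 1)
print(f"observed gap = {observed_gap:.3f} mi, permutation p = {p:.3f}")
```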

In the following figures, the “hot spot” density is indicated in blue.  This is a contour map that was produced by evaluating all of the chemical release sites in each cell’s area and then evaluating their carcinogenicity (unlike the prior example, which has contours drawn based on the chemistry of each site, this method reviews all chemicals and their carcinogenicity/toxicity on a per-site/per-point basis, so some small sites also stand out in the production of the contours).  A small-cell hexagonal grid (as above) is then laid over this map, and each site (point) value for sites within the cell is assigned to the cell centroid, in an additive manner.   These centroid values are then used to make the blue contour maps.

In the next phase of this analysis, cases are located and/or counted based on their distance from the centroids used to define the 9 major “hot spot” regions.  Based on centroid location, specific normalizing techniques can be used to assign values to the case numbers and frequency in a region/its centroid, and then comparisons can be made between data sets for each given release site focus (the primary buffer centroids), in relation to individual cell values (numbers of cases, sites, toxicity level, carcinogenicity index value, etc.).

Spider Diagrams.  The use of spider diagrams enables the GIS epidemiologist to display spatial relations in a fashion not normally linked to disease itself, but more at a place- or site-related level, such as linking patients to a local clinic or a local lab site. In this case, the spider is used to link people with a specific disease type to a specific area, and the denser the links displayed for a given place, the stronger the correlation that can be drawn. This does not state that incidence is higher, but rather that the case count is higher, and thus the demand for related care and any related corporate or government costs linked to this disease phenomenon. By assigning specific causes to specific centroid points for areas, such as a release site with Thiessen polygons dispersed around it, we can use this evaluation to better understand risk related to a specific site based upon site type and release concentrations or densities.

This method focuses on identifying specific hexagonal grid cells placed within a specified distance from the given potential exposure sites.  The spiders point from the closest chemical site to each grid cell within the given distance definition.  One could compare cases for the three specific distances from each site, but that might be rather redundant since it is similar to the buffer approach above.  Instead, cluster analysis can be performed on grid cells related to each given site, and this data evaluated in comparison with the underlying demographic data.  With ArcGIS, we can divide the census block shapefiles into smaller parts, using the grid cell boundaries to form these newer, much smaller sections.  If one wished to focus on census data, the population counts could be redefined for each census block-grid cell part based on census block-grid cell portion surface areas, as sketched below.  These numbers themselves could be reviewed spatially within the given grid cell per release site, and areas with higher values defined, thereby defining what are theoretically higher risk regions, based on exceptionally small-area methods of analysis.
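The sketch below illustrates the two mechanical pieces just described: area-weighted reallocation of a census block’s population into the grid cells that split it, and the spider step that links each populated cell centroid to its nearest release site within a set distance.  It uses square cells and made-up coordinates for brevity; the same area-weighting logic applies to hexagonal cells, and nothing here reproduces the actual ArcGIS workflow or data.

```python
from shapely.geometry import Polygon, box, Point

# Hypothetical census block with a known population count.
block = Polygon([(0, 0), (3, 0), (3, 2), (0, 2)])
block_population = 1200

# Split the block by 1 x 1 grid cells and reallocate population by area share.
cell_size = 1.0
cell_values = {}
minx, miny, maxx, maxy = block.bounds
y = miny
while y < maxy:
    x = minx
    while x < maxx:
        cell = box(x, y, x + cell_size, y + cell_size)
        part = cell.intersection(block)
        if not part.is_empty:
            cell_values[(x, y)] = block_population * part.area / block.area
        x += cell_size
    y += cell_size

# Spider step: link each populated cell centroid to its nearest release site within 2 miles.
sites = [Point(0.5, 3.0), Point(4.0, 1.0)]              # hypothetical site locations
for (x, y), pop in cell_values.items():
    centroid = Point(x + cell_size / 2, y + cell_size / 2)
    nearest = min(sites, key=centroid.distance)
    if centroid.distance(nearest) <= 2.0:
        print(f"cell ({x:.0f},{y:.0f}) pop {pop:.0f} -> site at ({nearest.x}, {nearest.y})")
```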

There are several advantages to using this methodology in spatial analysis when it is applied to disease ecology/epidemiology work using GIS. First, it serves as an additional tool useful in ways not found with the standard methods used to apply GIS to epidemiological or disease ecology work. The general rule for standard GIS to date has been to limit the methodologies we apply as GIS statisticians to standard formulas and methodologies (even though numerous other formulas and tools are already there). Some of these methods even hark back to methods used for some cases in the 1840s, which, although still reliable to some extent, do little to tell us more about actual spatial distribution patterns for the disease itself, separate from any human case-related information. This approach also avoids the one problem we have with using the twentieth-century process for disease analysis linked mostly to areal census data–the census data approach is likely to result in type 1 and type 2 errors due to the format, quality and applicability of the standard datasets used for these analyses. The standard US Census datasets used to represent population density at a coarse areal level, combined with studies based on hundreds if not thousands to tens of thousands of separately derived areal equations, mean that some of these “hits” are going to be coincidental and nothing more. Large-area surveys, due to the large number and size of areas and populations independently reviewed, will ultimately have errors that occur due solely to the probability of error for such a large number of calculations. The use of the alternative techniques provides a series of mapped results that can be related to these traditional studies, to determine if there is in fact a unique spatial correlation with the other causative factors. In this case, a hit in both series of maps is more credible than just a positive outcome for a standard monitoring tool, and the environmental/ecological tool that is employed can then be put to further use by applying it to small-area studies meant to define unique causative features related to positive “hits” or hot spots.

The alternative to relying solely upon areal census block data, for cancer incidence in this case, is to define new datasets that can be spatially applied to the same work. For this type of environmental research, the next step in improving this application of spatial analysis of medical data requires that two forms of review be carried out: one that is traditional and one that is much like any of the several methods noted above. Based on high-resolution grid mapping methods that are sensitive to cell size, shape and form, such a step can now be added to most spatial analysis routines, assigning more validity to their final outcomes.

Isolines or Contours.  For reasons illustrated above, the use of isolines or contours is highly recommended for spatial reviews.  This means that either a raster program has to be employed as part of your program or the right extensions must be included in your GIS package–for ArcGIS and ArcView, these are the Spatial Analyst and perhaps the 3D Analyst extensions.  The uses for isolines in environmental medicine and disease ecology are unlimited.  Yet these methods are rarely employed in many GIS epidemiological studies.  Part of the reason for the exclusion of this method is that it does involve a considerable amount of legwork and guessing, with the legwork performed by the data gatherers and groundtruthers, and the “guessing” taking place at the PC or desk level by both the GIS technician and the project manager.

Although this latter aspect is referred to as guessing, it is not exactly a complete result of guesswork.  In fact, the guessing process is really a trial-and-error period of spatial data evaluation in which numerous tools are used in order to develop a better understanding of the behavior of your spatial data.  Next, you have to engage in a fairly strict and orderly review of the data at various levels of spatial resolution, such as attempting a technique first at the 9-mile circle level, followed by a series of 18 more studies in which the same analysis is re-applied at smaller and smaller circle sizes, based on half-mile increments.  This method of utilizing GIS extensions was also strongly recommended by ESRI GIS statistical experts discussing the same topics and spatial data issues at the 2009 MedicalGIS conference held in Denver, Colorado.
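The multi-resolution loop just described is simple to express in code.  The Python sketch below re-applies a plain count-within-circle step at a 9-mile diameter and then 18 further half-mile reductions, using hypothetical case coordinates and an arbitrary focus point; any of the heavier analyses discussed above could be substituted for the count inside the loop.

```python
import numpy as np

rng = np.random.default_rng(5)
cases_xy = rng.uniform(0, 30, size=(1000, 2))           # hypothetical case coordinates (miles)
center = np.array([15.0, 15.0])                         # hypothetical focus of the review

# Re-apply the same circle count at a 9-mile diameter, then 18 further half-mile reductions.
d = np.sqrt(((cases_xy - center) ** 2).sum(axis=1))
for step in range(19):
    diameter = 9.0 - 0.5 * step
    inside = (d <= diameter / 2).sum()
    print(f"diameter {diameter:.1f} mi: {inside} cases inside")
```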

In fact, there is no set rule as to number or size.  Since each dataset pertaining to a project is unique, and two or more datasets are typically involved in this method of analysis, each analysis has to be carried out from scratch, on a trial-by-trial basis, until the research and statistical issues become better understood.  Only once these two factors of the research project are better understood can you then make a scientific “guess” as to how to best approach the analysis of the given datasets for final review, reporting and publication.  This scientific guess has to be based on some sort of qualitative or quantitative research reasoning and methodology.  At the least, your analyses of the preliminary results should lead you to use some sort of grounded theory approach to explaining your findings, subsequently backing up these claims even further as the project progresses.  Or you must have some sort of hardcore scientific and mathematical process at work, statistically as well as descriptively, in order to make the best use of the datasets developed in GIS environmental research projects.  There are numerous avenues one can take when performing such research.  The best one is to make the fullest use possible of all possible ways to approach the GIS-spatial epidemiology research process, even if that means data manipulation both in and out of the GIS setting to make the best use of your data.

Improvement #2: Remote Sensing and Raster Analysis

The second area of emphasis I like to apply to my work is the use of plant ecology data as a part of disease ecology work. In cases where other natural features may be important to disease ecology analysis, these too have to be included in the project, such as topography, local elevation features, local weather and climate patterns at a microecology level, soil chemistry and ecology, etc.

Lyme Disease. With respect to vectored disease ecology, real-time and real ecology data also have to be added to the standard studies of ecological disease patterns when they are done for predictive modeling purposes or to uncover the cause of a given case cluster or epidemic outbreak. Most of the data currently being accumulated is fairly basic in nature, not very helpful from one region to the next, and is usually viewed in some retrospective fashion. One can use the data to predict the future of a given disease or certain environmental conditions that help the disease progress to new forms on a seasonal level, such as an increased likelihood for infections of humans, but it is unusual for GIS epidemiologists to use this data at a real-time level and rely upon it for finding “hot spots” or for predicting future hot spots for a given area. Since limited highly detailed local data is typically collected for national databases on ecologic disease, and what is collected is gathered at the macro level in terms of applications, limited local use of the data exists for preventive health purposes.

For Lyme disease in southern Oregon, for example, cases dependent upon specific conditions in one part of the state may behave completely differently under the same conditions in another part of the state with identical features. We can apply this line of reasoning to the impact of topography and elevation above the closest water surface on mosquito species spatial distribution (high elevation versus low elevation, and/or at-canopy versus sub-canopy swarms), which may differ between the northern and southern parts of a state. We can use it to determine that chrysolithic soil limits host species, and therefore vector species distribution for Borrelia (Lyme disease), in one part of the state, whereas in another part of the state soil has nothing to do with limiting the host food source; topography and rapid elevation changes do. These are the types of insights that a fairly detailed ecological approach to studying disease using GIS can provide.

Applying this GIS approach to understanding the spatial behaviors and disease outbreaks of Lyme disease in the Connecticut-New York area, uniquely different spatial features are found to play a role in the distribution and spread of this problem. Tree canopy size and type, and the related shading and cooling of land surface features, in relation to the distribution of humans and potential domestic animals (dogs), became the limiting factors.

West Nile.

Introduction

My West Nile work consisted primarily of 1991 to 2003 data.  Preliminary work on this project began in fall 1999, when the news of West Nile near the zoo in Yonkers, NY first hit the newspapers.  The results of this work were later developed further between mosquito seasons, in the winters of 2002, 2003 and 2004.  This work included reviews of the limited literature out at the time on GIS and disease mapping for West Nile virus ecology, and an extensive review of the datasets provided and the data and disease history for the area I was researching in the mid-Hudson Valley portion of lower New York State.

The data analysis occurred in several major steps.  The first was a review of all raw data documenting mostly larval ecology and trap-use methodology and outcomes for various pre-selected sections of the county, each area defined by typical areal size and features, in order to produce a fairly equal analysis across all parts of the research area, which was Dutchess County, New York.

The second stage in the analysis was the evaluation of actual field experiences pertaining to standardized larva gathering, mosquito capture, and host pick-up and testing methodologies.  Due to earlier experiences, larva gathering became less essential but was still important for the overall surveillance process.  Trapping became more important, in terms of both the numbers and types of traps set, as well as the addition of new sites for evaluating West Nile history, defined by dead-host sites and possible human case sites.  The third stage in this work was based upon a year of field surveillance and note taking, an evaluation of all datasets available for review at the time (1999 to 2002), and a preliminary review of the numbers of vector species and trap counts developed over the seasons.

The final step in this review involved the targeted use of trapping practices in order to increase our likelihood of finding a positive test site.  This came primarily as a result of the first successful use of GIS to identify a positive vector ecologic site, based on a review of a cluster of positive-testing dead hosts in 2002.  Although the host-testing analysis was reviewed in retrospect, the actual field evaluation and testing process was performed live.  The resulting positive outcome of this ecological review of the vector species linked to the first positive-testing vector case in the county was a direct result of field assessments carried out prior to an actual identification of the positive capture.  The following year, this led to a complete review of all sites evaluated for West Nile purposes, using a standardized survey tool developed the winter before.

The year of West Nile ecology research included analyses intended to demonstrate the applicability of GIS to reviewing positive human and animal cases when they were reported, to reviewing the ecology of positive-testing vector cases, and to evaluating the spatial distribution of hosts collected for West Nile antibody testing.  All of these reviews were carried out with a focus on three things: topography and hydrogeography; phytoecology and canopy types and form; and raster-related evaluation of land use, DEM, NDVI and numerous other potential shapefile and raster mapping sources for the given research areas.
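As a rough illustration of the raster side of these reviews, the sketch below pulls DEM and NDVI values at a few trap-site coordinates so they can sit alongside the trap counts.  The file names, coordinates and the choice of the rasterio library are my own assumptions rather than the original toolset, and the rasters are assumed to share the coordinate system of the point data.

```python
# A hedged sketch (not the original workflow) of sampling DEM and NDVI
# rasters at trap or case sites.  File names and coordinates are
# placeholders; the rasters are assumed to be single-band and in the same
# coordinate system as the points.
import rasterio

trap_sites = [
    ("trap_01", -73.90, 41.70),   # hypothetical lon/lat pairs
    ("trap_02", -73.75, 41.55),
]

def sample_raster(path, xy):
    """Return the first-band raster value for each (x, y) coordinate pair."""
    with rasterio.open(path) as src:
        return [vals[0] for vals in src.sample(xy)]

xy = [(x, y) for _, x, y in trap_sites]
elevations = sample_raster("dem.tif", xy)    # placeholder DEM file
ndvi_vals = sample_raster("ndvi.tif", xy)    # placeholder NDVI file

for (name, _, _), elev, ndvi in zip(trap_sites, elevations, ndvi_vals):
    print(f"{name}: elevation={elev}, NDVI={ndvi}")
```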

Some research areas were fairly large in size, such as a mile-long transect studied in a lateral direction moving away from the water along a large floodplain.  Others were defined by property size or landform or land-use area size.   Many studies were small-area studies ranging in size from 50 x 75 feet down to 15 x 15 feet (a trap site area).  The largest study area, ecologically speaking, was the 60-mile shoreline of the entire county along the Hudson River estuary, the entire length of which was reviewed for the species types found and their locations, in order to rule out the possibility that another known positive carrier of West Nile, the estuarine-brackish water species found further south, closer to the Atlantic coastline and marshlands, was residing in the nearby ecosystems.

The Research Area

The entire county research area was in fact 45 by 60 miles in size.  Its topography varied from fields and flat wetlands terrain, to areas with significant cliff faces and escarpments well distanced from local water sources, to small mountain-like range settings typically measuring about 1,000 to 1,750 feet in height, usually well wooded and rich in boulder debris.  Some unusual features of the region relate to its glacial history.  A number of perfectly round glacial ponds can be found scattered throughout the wooded areas.  Along many streams are wetlands and fields that were formerly farmland and are typically flooded for several months of the year.  In terms of land-use features in relation to natural ecologic settings, lands may be significantly altered by factory facilities one to two centuries old, or they may remain pristine due to local conservation practices in place since the early 1900s.

In sum, this county region represented a fairly complex ecological setting in which to study West Nile.  It consists of numerous regions where ecosystems change significantly from north to south or east to west, from upland to lowland, or from urban center to overdeveloped periurban and semi-urban areas to hinterland rural settings.   Judging by the numbers of species captured within this county, the biodiversity of its mosquito population alone made it an important baseline study applicable to other GIS-based host-vector analyses as well, such as the study of mosquito-vectored equine encephalitis cases involving horses, or the occasional cases of malaria brought in by temporary visitors taking advantage of the highly popular and lucrative local horse training and recreational summer camp settings.

GIS Applications

For the West Nile studies, GIS had its most significant applications in mapping disease ecology history and using these maps to identify the source of a positive vector, determine whether a local human case could be ecologically related to the local setting, and use this data to predict possible high-risk or future outbreak sites.

Several other examples of this application of GIS were applied to other positive-testing host, vector and human disease locations. This methodology was also used to demonstrate when local human cases were of some non-local source, verified completely by potential vector data and supported by ecological data. In another case I applied NDVI and NLCD datasets to the analysis of several positive-testing regions, demonstrating the application of AVHRR data as a means to identify the most likely vegetation zones, small-area biomes or ecotones where host-vector relationships could exist (essentially a rehash of Voronov’s 1970s method for modeling disease ecology, one step above that of Pavlovski). Satellite imaging and a DEM were used to demonstrate the relationships between water edge and species (an IDRISI project), for lakes, streams and estuarine settings, with a demonstration of species relationships determined by tree canopy, terrain type, and elevation features in relation to the local water source. To date, this method has not been fully employed at any other sites, as far as I know, since my very brief presentation of this methodology at an ESRI MedicalGIS conference in 2007.
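The water-edge part of that review can be illustrated with a small sketch along the following lines, assuming a digitized shoreline, site elevations and a water-surface elevation are already in hand.  The geometries and numbers are placeholders, and the code is only a stand-in for the original IDRISI project.

```python
# A minimal sketch, under assumed inputs, of the water-edge review noted
# above: distance from each trap site to the shoreline and the site's
# elevation above the assumed water surface.  The shoreline geometry,
# site coordinates and elevations are all hypothetical, and a real
# analysis would work in a projected coordinate system (e.g. feet).
from shapely.geometry import LineString, Point

shoreline = LineString([(0, 0), (500, 120), (1000, 80)])  # placeholder water edge
water_elevation_ft = 2.0                                  # assumed water surface elevation

sites = [
    {"id": "site_a", "x": 300, "y": 400, "elev_ft": 55.0},  # hypothetical sites
    {"id": "site_b", "x": 900, "y": 150, "elev_ft": 12.0},
]

for s in sites:
    pt = Point(s["x"], s["y"])
    dist_ft = pt.distance(shoreline)          # planar distance to the water edge
    rise_ft = s["elev_ft"] - water_elevation_ft
    print(f"{s['id']}: {dist_ft:.0f} ft from water edge, {rise_ft:.0f} ft above it")
```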

For some reason, research teams have tended to avoid these two avenues of study. Thiessen Polygon-Hexagonal Grid and Plant Ecology-Remote Sensing methodologies are not typically reviewed in many disease ecology publications. Hexagonal areal analysis methods, which grid-map and analyze disease or health data more accurately, are fairly uncommon in this field of study. My belief is that these methods are not applied because: 1) they are typically not emphasized in medical geography classes, since they are not used by the teachers and therefore not taught; 2) square grids are the common methodology, in spite of their inaccuracies, due mostly to the ease of producing them mathematically, and current technology makes them even easier to produce and more likely to be used; and 3) given the ease and availability of other methods and software tools for performing the traditional approaches already in place, new methods are neither attempted nor made use of, and when they are used they may have a limited support base due to the lack of other published applications.

For these reasons, we find most disease ecology studies are performed using the easiest methods available, in spite of their occasional inaccuracies, relying mostly on data that is readily available and on methods that are already supported. As a result, a study of disease ecology ends up limited to the following: ecology = man + pathogen + host + vector + environmental factor(s) [EF], with EF = weather/climate datasets, general topographic features/DEM, etc. But disease ecology is not just a study of man + physical environment + biological host/vector/pathogen data. It turns out that the most ignored features of the environment, the plants and their ecosystems and vegetation zones, are typically directly related to all the other microclimatic, meteorologic, topographic or physical geography features reviewed. Disease ecology has to be engaged in with a more complete and “wholistic” approach to its analysis. Disease incidence, prevalence, and methods of prevention have to be reviewed using small-area analysis, based on realistic raster systems, more accurate grid approaches (remote sensing tools and formulas, with hexagonal grids), and more realistic small-area analysis techniques capable of eliminating the census block-related limitations already discussed.
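To make the hexagonal-grid alternative concrete, the sketch below bins a set of simulated case points into hexagonal cells and reports the busiest cells.  It is only a stand-in for a full hexagonal grid analysis; the coordinates are invented, and matplotlib's hexbin is used here purely as a convenient binning tool.

```python
# A hedged sketch of hexagonal areal aggregation of case points, as an
# alternative to square-grid counts.  The case coordinates are simulated;
# in practice they would be projected case or trap locations.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # no display needed; we only want the bin counts
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(0, 5, 500)   # hypothetical easting of cases (miles)
y = rng.normal(0, 5, 500)   # hypothetical northing of cases (miles)

fig, ax = plt.subplots()
hexes = ax.hexbin(x, y, gridsize=12)   # roughly 12 hexagons across the x range
counts = hexes.get_array()             # cases per hexagon
centers = hexes.get_offsets()          # hexagon center coordinates

# Report the five busiest hexagons, e.g. as candidate "hot" cells.
top = np.argsort(counts)[-5:][::-1]
for i in top:
    cx, cy = centers[i]
    print(f"hex at ({cx:.1f}, {cy:.1f}) miles: {int(counts[i])} cases")
```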

For this reason, my approach to studying disease ecology begins with the assumption that plant ecology plays a more important role in this ecology than it is typically assigned. My review of environmental exposure, and of people’s exposure to pathogens and the like, requires the use of exceptionally small-area analysis, performed using accurate point-ecology analysis techniques and, in the case of areal dispersal features, small-area hexagon analysis, in order to define exceptionally small-area risk grids or contour maps.
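One way such a small-area risk surface or contour map might be sketched is with a simple kernel density estimate over case points, as below.  This is an illustrative stand-in rather than my actual method, and the case coordinates, grid extent and smoothing bandwidth are all assumptions.

```python
# A hedged sketch of turning point cases into a small-area risk surface
# that could then be contoured or classed into risk bands.  Coordinates
# and extent are hypothetical; a real analysis would use projected case
# locations and a deliberately chosen smoothing bandwidth.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
cases = rng.normal(loc=[5.0, 5.0], scale=1.5, size=(150, 2))  # simulated cases (miles)

# Evaluate the density on a fine grid covering a 10 x 10 mile window.
kde = gaussian_kde(cases.T)
xs, ys = np.meshgrid(np.linspace(0, 10, 101), np.linspace(0, 10, 101))
risk = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)

# The resulting grid can be contoured; here we just report where it peaks.
peak = np.unravel_index(np.argmax(risk), risk.shape)
print(f"peak risk near ({xs[peak]:.1f}, {ys[peak]:.1f}) miles")
```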

Now of course these methods require, as stated earlier, some knowledge and background in this field, but I have found that one can often simplify this type of work in such a way that more people with only a basic ecology background can engage in it. It turns out that, for the most part, plant morphology and genus are all that are needed to understand the impact of a plant on the local ecosystem. There are of course exceptions to this rule, such as the unique ability of a specific species to do something its close relatives cannot: for example, one species may impact the environment by releasing essential oils from its leaves while its relatives cannot (the aromatic black birch, with wintergreen oil or methyl salicylate in its leaves and young branches, versus all other birches), or may produce non-aromatic steroids or a totally different class of compounds instead of essential oils (the 30-carbon, steroid-rich, non-aromatic Prunella versus most other mints, which use their 15-carbon compounds to produce essential oils). These differences between plants are the hardest information to assimilate and include in these studies, and may even be the reason such an approach to the ecological research of disease becomes highly important to the field in general.