Tracking pollution in your neighbourhood



In 3 sentences:

We downloaded facility emissions data from NPRI, added the facility coordinates from a separate file here, and summed the 2008 air pollution data for each facility. We then grouped all facilities by industry, percent-ranked each facility within its industry, and mapped the rankings to a gradient where the median percent rank is "orange", a low percent rank is "green" and a high percent rank is "red". Thus, the color of a facility corresponds to the relative ranking of that facility's total emissions against all other facilities in Canada in the same industry.


The nitty-gritty:

As our data source we used Environment Canada's National Pollutant Release Inventory (NPRI) database. The NPRI data set is open data in that the Government of Canada requires industry to report its annual emissions, and these data are available for citizens to use. There are several barriers to accessing this open data, though. First, the data is in an expert format that is not usable by most people. Second, it is a very large data set with over 30 000 (?) data points spread across several different data sets. As a result, the data that is of interest to people (the emissions close to where they work or live) is not easily found, and it is not easy for an average user to interpret. Details on how this database is populated and maintained can be found here.

Emitter.ca represents an initial attempt to make these data accessible to anyone who can access a web browser, creating an interface that is usable by Citizens, not just experts.

We were happy to see that Environment Canada provided a nice web interface to query the data, giving details on the types of substances polluting different regions in Canada and the facilities that were emitting those substances. Generally speaking, the NPRI data is sorted by emission type (i.e. the substance being emitted) and the medium where it is deposited (i.e. air, land, or water). These data can also be sorted by industry type and location (city, province, street address). The NPRI data reported by various facilities could also be downloaded and even visualized.

Sadly, the data is pretty technical and does not really give an "average Joe" a sense of whether a particular area is exposed to a "lot" or a "little" pollution, or how well some regions or facilities in a particular industry (e.g. mining or manufacturing) stack up against other areas or facilities. On one hand, the downloadable datasets did not carry coordinates for facilities; on the other hand, while the NPRI Google Earth KMZ layer had coordinates, it did not contain emissions information in a useful format.

Because we are interested in mapping the emission locations, it was important to "mash-up" facility emissions data (keyed by NPRI_ID) with geographic information (the latitude and longitude of each facility). We first obtained the coordinates along with the associated NPRI_ID by parsing the HTML embedded in an NPRI Google Earth KML file, which happened to contain a link back to NPRI's website with the NPRI_ID. Later, as we were digging through NPRI's site, we did find facility data with coordinates here. It was there all along... just buried.
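The KML parsing step can be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: the sample markup, the `npri_id=` link format, the KML namespace and the function name `extract_facility_coordinates` are all assumptions, and the real NPRI layer may be laid out differently.

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespace; older Google Earth files use a different one.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Hypothetical sample placemark. The link format is invented for illustration.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Example Facility</name>
      <description><![CDATA[<a href="http://example.invalid/facility?npri_id=1234">Details</a>]]></description>
      <Point><coordinates>-79.38,43.65,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""

# Assumed pattern for the NPRI_ID inside the description's HTML link.
NPRI_ID_PATTERN = re.compile(r"npri_id=(\d+)", re.IGNORECASE)


def extract_facility_coordinates(kml_text):
    """Return {npri_id: (latitude, longitude)} for every placemark with an ID."""
    root = ET.fromstring(kml_text)
    coordinates = {}
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        description = placemark.findtext("kml:description", default="", namespaces=KML_NS)
        match = NPRI_ID_PATTERN.search(description)
        point = placemark.findtext("kml:Point/kml:coordinates", default="", namespaces=KML_NS)
        if match is None or not point.strip():
            continue  # skip placemarks without an ID or without a point
        # KML stores coordinates as longitude,latitude[,altitude].
        lon, lat = (float(part) for part in point.strip().split(",")[:2])
        coordinates[int(match.group(1))] = (lat, lon)
    return coordinates


facility_coordinates = extract_facility_coordinates(SAMPLE_KML)
```

Note that KML lists longitude before latitude, which is easy to get backwards when building the lookup table.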

Each of the relevant data tables was loaded into ESRI ArcGIS software, where a number of 'joins' were performed to create one large data table that contained all of the relevant parameters. Specifically: Air Emissions, Water Emissions, Land Emissions, Latitude, Longitude, City (if applicable), Province, and Industry Code. The NPRI_ID was used as the common key for each table, and we used the NAICS2 code to sort facilities by industry.
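For readers without GIS software, the same kind of join can be sketched in plain Python. This is only an illustration under assumptions: the field names, the sample values and the helper `join_on_npri_id` are hypothetical, and we assume an inner join (facilities missing from either table are dropped), which may not match the join type used in the original workflow.

```python
def join_on_npri_id(emissions, facilities):
    """Inner-join two tables keyed by NPRI_ID into one list of rows."""
    joined = []
    # Only IDs present in both tables survive an inner join.
    for npri_id in sorted(emissions.keys() & facilities.keys()):
        row = {"npri_id": npri_id}
        row.update(facilities[npri_id])  # latitude, longitude, province, industry code
        row.update(emissions[npri_id])   # air, water, land totals
        joined.append(row)
    return joined


# Hypothetical sample tables, keyed by NPRI_ID.
emissions = {
    1001: {"air": 12.5, "water": 0.0, "land": 3.0},
    1002: {"air": 4.0, "water": 1.5, "land": 0.0},
}
facilities = {
    1001: {"lat": 43.65, "lon": -79.38, "province": "ON", "naics2": 31},
    1003: {"lat": 49.28, "lon": -123.12, "province": "BC", "naics2": 21},
}

table = join_on_npri_id(emissions, facilities)
```

Here only facility 1001 appears in both tables, so it is the only row in the joined result.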

Each location has a number of emission types and media associated with it. Because we are interested in total emissions for each location, MATLAB (a specialized software package) was used to create aggregated totals for each of air, land, and water. Locations that had no emissions for at least one of those media were removed.
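The aggregation itself is simple enough to sketch outside MATLAB. The record layout `(npri_id, medium, quantity)`, the sample numbers and the function name `total_by_medium` below are assumptions for illustration, and the sketch assumes every quantity has already been converted to a common unit. The final filter shows one possible reading of the removal step, applied to a single medium.

```python
from collections import defaultdict

MEDIA = ("air", "water", "land")


def total_by_medium(records):
    """Sum reported quantities per facility and per medium."""
    totals = defaultdict(lambda: dict.fromkeys(MEDIA, 0.0))
    for npri_id, medium, quantity in records:
        if medium not in MEDIA:
            raise ValueError(f"unknown medium: {medium!r}")
        totals[npri_id][medium] += quantity
    return dict(totals)


# Hypothetical records: one row per facility, substance and medium.
records = [
    (1001, "air", 10.0),
    (1001, "air", 2.5),
    (1001, "water", 1.0),
    (1002, "land", 4.0),
]

totals = total_by_medium(records)

# Keep only facilities that actually emit to the medium being ranked.
air_totals = {npri_id: t["air"] for npri_id, t in totals.items() if t["air"] > 0}
```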

These aggregate tables were sorted by industry type and province, and then analyzed. Those locations with 'zero' emissions for the medium (air, land, water) were removed. A Percent Rank (Excel PERCENTRANK) analysis was run on the remaining data, and the green-to-red graphic was created with the median scores in the middle (the orange color). A Percent Rank gives the relative standing of a value within a data set; here it ranks the emissions 'standing' of a facility against the other facilities within its industrial class or province. A Percent Rank does not need a normal distribution curve to 'work'. A Percent Rank is NOT a Percentile. A Percentile is the value of a variable below which a certain percent of observations fall, whereas a Percent Rank is the relative position of one particular value. Neither requires the data to follow a normal distribution.
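To make the ranking step concrete, here is a simplified sketch in the spirit of Excel's PERCENTRANK for values that are present in the data set: the number of values strictly below x, divided by n - 1. It does not reproduce Excel's interpolation for values outside the data or its default truncation of the result, and the handling of ties, the skipping of single-facility industries and the function names are our assumptions for illustration.

```python
from collections import defaultdict


def percent_rank(values, x):
    """Fraction of the other values that lie strictly below x (0.0 to 1.0)."""
    n = len(values)
    if n < 2:
        raise ValueError("percent rank needs at least two values")
    below = sum(1 for v in values if v < x)
    return below / (n - 1)


def percent_rank_by_industry(facilities):
    """Rank each facility against the others sharing its NAICS2 code.

    `facilities` is a list of (npri_id, naics2, total_emissions) tuples.
    Industries with a single facility get no rank (an assumption).
    """
    groups = defaultdict(list)
    for _, naics2, total in facilities:
        groups[naics2].append(total)
    ranks = {}
    for npri_id, naics2, total in facilities:
        peers = groups[naics2]
        if len(peers) >= 2:
            ranks[npri_id] = percent_rank(peers, total)
    return ranks


# Hypothetical totals: three mining facilities (21) and one manufacturer (31).
sample = [(1, 21, 5.0), (2, 21, 50.0), (3, 21, 500.0), (4, 31, 7.0)]
ranks = percent_rank_by_industry(sample)
```

Because only the ordering matters, the facility emitting 500 ranks 1.0 and the one emitting 50 ranks 0.5, regardless of how skewed the totals are.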

The end result is a relative ranking of the emissions (by industrial sector or province), with those emissions below the median shading towards 'green', those at the median being 'orange', and those above the median shading towards 'red'.
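The mapping from percent rank to color can be sketched as a two-segment linear gradient. The exact RGB values, the linear interpolation and the helper names `rank_to_rgb` and `rgb_to_hex` are assumptions for illustration; the colors on the actual map may differ.

```python
# Assumed anchor colors for the gradient (RGB, 0-255).
GREEN = (0, 170, 0)
ORANGE = (255, 165, 0)
RED = (255, 0, 0)


def _lerp(start, end, t):
    """Linearly blend two RGB colors; t=0 gives start, t=1 gives end."""
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))


def rank_to_rgb(rank):
    """Map a percent rank in [0, 1] to green (low), orange (median) or red (high)."""
    if not 0.0 <= rank <= 1.0:
        raise ValueError("percent rank must be between 0 and 1")
    if rank <= 0.5:
        return _lerp(GREEN, ORANGE, rank / 0.5)
    return _lerp(ORANGE, RED, (rank - 0.5) / 0.5)


def rgb_to_hex(rgb):
    """Format an RGB tuple as a CSS-style hex string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


median_color = rgb_to_hex(rank_to_rgb(0.5))
```

Splitting the gradient at 0.5 keeps the median facility exactly orange, with everything else shading smoothly towards green or red.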