Beginning with the recently released 2010 data, FARS (the USDOT’s Fatality Analysis Reporting System) has far more specialized detail on pedestrian and bicyclist crashes.
618 cyclists (person types 6, bicyclist, and 7, other pedalcyclist) were killed in traffic collisions in 2010; and as noted at the link above, only collisions with motor vehicles in transport are tracked by FARS. So, for example, a bicyclist who lost control and died crashing into a tree would not be tracked here, nor would one who struck a parked motor vehicle.
The added information becomes a new “table” (in the parlance of databases); if you download the raw data files, it will all be found in the file PBtype.dbf.
The information follows the PBCAT scheme more or less exactly, which is probably no coincidence. PBCAT is the Pedestrian and Bicyclist Crash Analysis Tool, and contains among other things the bicyclist’s direction and a detailed crash type (or, for the complete reference, see PBCAT’s manual, appendix C). [If these links go dead, it looks like it can all be gotten from the archive.org Wayback Machine; here is the reference manual; here are images of crash types.] [Local archives: PBCAT_Manual.pdf and the (now very old, ver 2.1.1, Windows XP-only) installation of PBCAT are available but not presently online. Along with the original/old PBCAT manuals, there is an updated coding manual, the 2014 FARS NASS GES Pedestrian Bicyclists Manual.]
So, say you were interested in bicyclists running stop signs: you would do a query, then a univariate split, and take a look at types 142, 144, and 147, all subtypes of “Bicyclist Failed to Yield — Sign-Controlled Intersection”. There were 17 + 41 + 1 = 59 such fatalities recorded.
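As a sketch of that tally in Python: sum the fatalities whose PBCAT crash type is one of the stop-sign failure-to-yield subtypes. The function name and input shape are mine; the codes and counts are from the text above.

```python
# Subtypes of "Bicyclist Failed to Yield -- Sign-Controlled Intersection"
STOP_SIGN_TYPES = {142, 144, 147}

def stop_sign_fatalities(type_counts):
    """Sum counts whose PBCAT crash type is a stop-sign subtype."""
    return sum(n for t, n in type_counts if t in STOP_SIGN_TYPES)

# the 2010 univariate split gave 17, 41, and 1 for the three subtypes
print(stop_sign_fatalities([(142, 17), (144, 41), (147, 1)]))  # 59
```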
Since this is the first and only year for which these national stats are available, they are of limited usefulness; that will change over time as the dataset grows and becomes a very useful, comprehensive source for understanding bicyclist (and pedestrian) traffic fatalities. (Update: see this comment below; all PBCAT data is now gone, and they hope to have it back for 2015. Sad.)
PBcat in the wild
It’s not clear to me why, but the state of North Carolina had PBCATted all their ped and bike crashes statewide for many years, leading to a very rich database of statistical information; e.g. 12,000 bike crashes covering the period 1997–2008!
In Arizona, ADOT, as part of the BSAP, has PBCATted 746 bike crashes in the “concentration” areas, but it only covers the state-highway system. See e.g. Table 1 in working paper 3 for the full breakout of crash types.
Here is a presentation from the Pro Walk/Pro Bike 2006 conference by Steiner, Henderson, and Libby Thomas that has some interesting details and background on PBCAT. I still don’t understand why PBCAT won’t run on Windows 7 (it requires XP!? I got it installed in an XP virtual machine). Here are a few places that used PBCAT: Orlando (no crash groups, but some interesting breakdowns), Wisconsin (they present crash types, but didn’t use groups), and Miami-Dade County (they seem to have rolled their own crash groups; see section 2.4 of the plan).
The MySQL Database
I got so frustrated trying to use NHTSA’s front-end to query the FARS database that I set out to build my own MySQL database;
anyone wanting to query can do so by requesting the password and using the following details (sorry, external/internet access is no longer available):
- hostname (internal only): 10.1.1.10
- databasename: fars
- username: asdmuser
- password: (internal use only)
- access via phpMyAdmin
The dbf-formatted files were downloaded and unzipped; there are over a dozen .dbf files, each representing a “table” of data. The main tables are accident.dbf, person.dbf, and vehicle.dbf (and, of special interest, PBType.dbf).
Each dbf of interest was opened in OpenOffice Calc (this can take several minutes for the large tables!). The first row was extracted and saved as text, for later processing into the CREATE TABLE command. Each field name has the form FIELD_NAME,[N | C],x,0: N becomes datatype int (or float in the case of latitude and longitude) and C becomes char(x). The header row is then deleted, and the data is saved as a text (csv) file [see the dbf2csv converter mentioned below for an alternative]. OpenOffice insists on surrounding text fields with either single or double quotes; these need to be subsequently eliminated with any text processor. (The updated version of OO I now use has an option to not put in the quotes, so this is no longer needed. Yay.) Anyway, if need be, here is a sed command that does the trick:
sed -i 's/"//g' filename.txt
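For reference, here’s a sketch in Python of the header-to-CREATE TABLE conversion described above; the example field specs are hypothetical, and the float special case for latitude/longitude is noted in a comment rather than handled.

```python
# Turn a dbf header spec ("FIELD_NAME,N,x,0" or "FIELD_NAME,C,x,0") into a
# MySQL column definition: N -> int, C -> char(x). (Latitude/longitude
# fields would need float instead of int; not handled here.)
def column_def(spec):
    name, kind, width, _ = spec.split(",")
    return f"{name} int" if kind == "N" else f"{name} char({width})"

specs = ["ST_CASE,N,9,0", "SP_JUR,N,2,0", "CITY,C,4,0"]  # hypothetical
print("CREATE TABLE 2010_incident (\n  "
      + ",\n  ".join(column_def(s) for s in specs)
      + "\n);")
```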
These text files are then LOADed into MySQL tables; see e.g. 2010_PBType.sql.
There are 16(!) tables in the 2010 data; several of them are new. Currently the following tables are in the MySQL database:
- 2010_incident (from accident.dbf)
- 2010_person (person.dbf)
- 2010_vehicle (vehicle.dbf)
- 2010_PBType (PBType.dbf)
The data is also available in SAS format. SAS is some sort of statistics package that I am unfamiliar with; however, there’s a very useful-looking file called format10.sas which has data definitions in it, and is easier than using 811530.pdf, the FARS Coding and Validation Manual. There’s a second manual, the FARS Analytical Reference Manual, 811529.pdf.
Anyway, the correlation between the fields in the .dbf files can be looked up in the Analytical Reference Manual. For example, on p. 79 of the Manual there is a field called Special Jurisdiction, FARS C16 (i.e. the 16th element of the crash-level data), in the accident dataset (which I’ve taken the initiative to properly name incident rather than accident), with a “SAS name” of SP_JUR, which corresponds directly to the column heading in accident.dbf. This field also appears to correspond to SPJUR10F in the format10.sas file, but I am unclear as to how someone is supposed to know that.
Another oddity: there is much strangely redundant stuff in person.dbf, e.g. the make and model of the vehicle, which is also (and obviously) in vehicle.dbf?? Duplicating data across tables is a database-design no-no!
The order of data in 2009 (and earlier, I would guess) is SIGNIFICANTLY different compared to 2010 and later, so I didn’t grind through it. Another reason is that 2009 didn’t have PBCAT data, and likely never will.
2010 was the oldest year for which I loaded up data. See above…
The 2011 tables were loaded sometime around fall 2012. The structures were identical, at least for the four tables of interest.
The big (bad) news: there is no PBCAT info this year, so there is no 2012_PBType table, and furthermore it’s been “removed” from the published 2010 and 2011 data; see this comment for details. Other than that, not much to report; several fields got slightly switched around, which is annoying. See the .sql files for specifics. For example, the field IMPACT2 was removed from both the Person and Vehicle tables.
Re-downloaded FARS2012.zip on Feb 22, 2015 (the dates on most of the .dbf’s were 1/28/2014 or newer) in order to get “final” files. Reprocessed everything with the dbf2csv script and rebuilt the databases (adding an empty PBType table so multi-year queries won’t fail).
An annoying note about 10000000000000000, 9999999999999999, and 2147483647
OK, so I looked at a hexdump of the original NHTSA .dbf files: the value 9999999999999999 (sixteen ASCII 9’s) often shows up, seemingly to mean “not coded” or “not available”. For whatever reason, when I open the .dbf in OpenOffice it interprets this as 10000000000000000 (1 followed by 16 zeros), and subsequently saves the text file with the ASCII representation of that. When I read it into MySQL, I had (and still have) spec’ed the fields as int, not bigint; the largest int value is 2147483647, so you will see that value a lot. This doesn’t seem to cause any trouble, but it does violate the documentation set forth in the FARS Analytic Reference Guide. For example, the field Curb Weight, VIN_WGT, is supposed to be either 0 (not available), 9999 (value not coded), or a number between 1 and 9998 (weight in pounds). Why would the value 9999999999999999 be in the data? Other fields are similar; that is to say, the sixteen nines are nowhere to be found in the documentation. In any event, in my MySQL tables anything above (in this instance) 9998 will turn out to be 2147483647, and that simply means not coded. [Update: the dbf2csv python script maintains the 9999999999999999 values.]
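A sketch of collapsing that sentinel soup back to “not coded” when reading values out, using Curb Weight (VIN_WGT) as the example; the helper is my own, and the sentinel set reflects the three values discussed above.

```python
SENTINELS = {9999999999999999,  # sixteen 9's, as in the raw .dbf
             10**16,            # OpenOffice's rounding of the above
             2147483647}        # int-column clamp in my MySQL tables

def decode_weight(vin_wgt):
    """Per the Analytic Reference Guide: 0 = not available, 9999 = not
    coded, 1..9998 = weight in pounds; any sentinel also means not coded.
    Returns None for anything that isn't a real weight."""
    if vin_wgt in SENTINELS or vin_wgt in (0, 9999):
        return None
    return vin_wgt
```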
Surprisingly, the 2013 dataset I grabbed in January 2015 has PBCAT data, i.e. has a PBType.dbf file. The analytic reference manual for this year, 812092.pdf, says it should not be there, and was not expected to be there(?); see “Appendix E: Pedestrian and Bicyclist Data Availability Change”. Weird. Anyway, I loaded it. This appears to have been a mistake/leak; the .zip file as of 2/22/2015 has that file removed. The data appears to be reasonable.
Inserted the 2013 data into my private server’s database in Jan 2015. Quite a few fields were removed from the person and vehicle tables as compared to 2012 (the VIN stuff got moved to a new table, VINDecode); see the .sql files for details. Also see 20xx_postproc.sql for definitions of some synthetic fields (so-called because they are derived from other fields), including person-type tags I elevated to the incident (“accident”) table so that you can select incidents involving, e.g., bicyclists or peds. For example, to check the urban/rural split for bicyclist incidents without having to join another table:
select sUrbanRural,count(*) from 2013_incident where sF_Bicycle GROUP BY 1;
Not much changed in the data definitions: a new harmful event was added, #73 ‘Object fell from Motor Vehicle In Transport’; a bunch of typo corrections; and, oddly, updated descriptions that only apply to PBCAT values, like BKPOS (bicyclist position), and as such aren’t even in the dataset this year; e.g. #01 was “Travel Lane” and is now “On a roadway, in a shared travel lane”.
Here’s a query that joins the incident, person (specifically, the bicyclist), and vehicle tables. Note that it returns more than one result in the few instances where there is more than one fatality in an incident; e.g. in 2013 there were 740 fatal bicyclist incidents and the query below returns 743 results. (Note: peds and cyclists are always VEH_NO=0. I have to go back and verify whether there is always a vehicle number 1; I seem to remember a handful with only a number 2 or something.)
SELECT p_bike.eLOCATION, sUrbanRural, count(1)
FROM 2013_incident AS i
JOIN 2013_person AS p_bike
ON (i.ST_CASE = p_bike.ST_CASE AND p_bike.eINJ_SEV LIKE 'Fatal%' AND p_bike.ePER_TYP IN ('Bicyclist', 'Other Cyclist'))
JOIN 2013_vehicle AS vehicle
ON (i.ST_CASE = vehicle.ST_CASE AND vehicle.VEH_NO = 1)
GROUP BY 2, 1;
So, there are 743 bicyclist fatalities; i.e. three of the 740 incidents involved two persons killed, and the other 737 involved one each. Slightly different results for the total number killed can be gotten from either the incident table or the person table:
select SUM(FATALS) from 2013_incident where sF_Bicycle; --> 744, i.e. includes one not-cyclist
SELECT count(*) FROM 2013_person WHERE eINJ_SEV LIKE 'Fatal%' AND ePER_TYP IN ('Bicyclist', 'Other Cyclist'); --> 743
Created a VIEW called person (i.e. like 2013_person without the year prefix) that was just most fields from each year UNIONed together; see the readme in folder FARSALL. I should/will then create incidents and vehicles views to get multi-year data. [As things turned out, this was horrendously slow for anything interesting, so I created tables without the year_ prefix; see the FARSALL folder.]
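Here’s a sketch of generating such a UNIONed view in Python rather than by hand; the column list shown is hypothetical, since only fields common to every year can go in the view.

```python
# Generate the multi-year "person" view by UNIONing per-year tables that
# follow the 20xx_person naming convention described above.
def person_view_sql(years, columns):
    cols = ", ".join(columns)
    selects = [f"SELECT {cols} FROM {y}_person" for y in years]
    return "CREATE VIEW person AS\n" + "\nUNION ALL\n".join(selects) + ";"

print(person_view_sql([2010, 2011], ["ST_CASE", "ePER_TYP", "eINJ_SEV"]))
```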
I did a bunch of work with the alcohol fields in the person table this year, and reflected it back to 2010: eALC_DET, eALC_STATUS, eATST_TYP, sALC_RES; see use-of-alcohol-as-a-risk-factor-for-bicycling-injury.
Used the python script dbf2csv to re-process the FARS data files; see more about how to use it at this private post. As of this writing, 2010–2012 have been re-done with finalized data, and 2013 is still preliminary; final is due out approximately December 2015.
This link www-fars.nhtsa.dot.gov/Common/FileVersions.html gives the current version and release dates on a year-by-year basis. As of this writing, December 2015, the most recent year’s final files are available for 2012 and earlier. So I have final files for years 2010-2012, and preliminary for 2013. Based on the most recent file release, 2012 did not become final until July 24, 2015.
2014 and 2015
With the sharp increase in traffic fatalities in 2015 (35,092), this is getting some extra attention; e.g. the Fastlane blog said on Aug 29, 2016: “2015 Traffic Fatalities Data Has Just Been Released: A Call to Action to Download and Analyze”. The big data news here is that PBCAT data has reappeared(!); the 2014 and 2015 datasets released today both have a PBType table.
2014: replaced the existing 2014 data with this new dataset; from the datestamps, it appears I had downloaded 2014 just a few days before the update was released. Oh well; the update went completely smoothly and followed the same file procedures, that is, one zip file with national dbf’s in it. I used dbf2csv to convert, and all columns matched what was previously released in the 2014 preliminaries. See “New in 2014” in the 2014 FARS Analytic Reference Manual for all the changes. There are several non-motorist tables that each contain tiny tidbits of info: nmprior, nmcrash, nmimpair. I never understood why this info isn’t in the person table; it would simply be undefined when the person is a motorist.
2015: was a different story. There were several changes in the tables I import (the accident/incident, person, vehicle, and PBType tables); see “New in 2015” in the 2015 FARS Analytic Reference Manual. Some columns got deleted (the person death certificate, CERT_NO), and the functional classification in the accident/incident table got completely changed, with ROAD_FNC deleted and similar info split into three new columns: RUR_URB (which is essentially the same as my synthetic sUrbanRural flag), FUNC_SYS, and RD_OWNER. And ALC_RES in the person table has been re-worked from 2 to 3 digits, which, if I’m lucky, can be made consistent with prior years by re-working the synthetic sALC_RES.
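A sketch of the sALC_RES rework I have in mind; the assumption (my reading of the “New in 2015” notes) is that the old 2-digit values are BAC in hundredths and the new 3-digit values are in thousandths, so old values simply scale by 10. Sentinel/unknown codes would need filtering out before calling this.

```python
# Normalize ALC_RES onto the 3-digit (thousandths) scale used from 2015 on.
# Assumption: pre-2015 values are hundredths (e.g. 8 means .08 BAC) and
# 2015+ values are thousandths (e.g. 80 means .080 BAC).
def normalize_alc_res(alc_res, year):
    return alc_res * 10 if year < 2015 else alc_res
```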
I discovered an error in this 2015 fatality (the bicyclist direction incorrectly said “facing traffic”) in January 2017. I have an email thread through some USDOT folks back to ADOT (Kerry Wilcoxson, formerly of CoP, now ADOT State Safety Engineer); the answer was that it was a typo, but unfortunately they said the cutoff for corrections was the end of the year (2016).
Current status: 2014 and 2015 are both rolled into my MySQL database, including PBCAT.
I downloaded the 2016 FARS dataset on 1/9/2018, though it has apparently been available since 10/6/2017; see e.g. this USDOT press release, the top-line stat being that deaths were worse, even when adjusted for an increase in VMT:
37,461 lives were lost on U.S. roads in 2016, an increase of 5.6 percent from calendar year 2015. The number of vehicle miles traveled on U.S. roads in 2016 increased by 2.2 percent, and resulted in a fatality rate of 1.18 deaths per 100 million VMT – a 2.6-percent increase from the previous year.
Review of the changed info indicates no new fields. Also, there is now a direct .csv version of the data, and the release notes indicate they will be phasing out the .dbf version of FARS data sometime in the near future. This is good news, as it means one less step in the import process (no need for the dbf2csv tool/converter).
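Once the direct .csv is the norm, the import prep can be a plain csv read; a minimal sketch using the stdlib csv module (the function, path, and encoding choice are mine):

```python
import csv

def count_rows(path):
    """Count data rows in a FARS csv file (header row excluded).
    latin1 sidesteps the stray-byte issues noted elsewhere in these notes."""
    with open(path, newline="", encoding="latin1") as f:
        return sum(1 for _ in csv.DictReader(f))
```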
2018 / 2019
Some years’ raw csv files have a tiny number of weird characters in them. For example, the 2018 Vehicle file has several occurrences of 0xA0 (hex A0, a non-breaking space, nbsp).
Fixed by simply specifying the character set as latin1 instead of the default; updated by putting that into the LOAD DATA commands in the .sql files, e.g. 2018_vehicle.sql.
Other ideas are to use the iconv command to eliminate errant characters; either of these would work:
iconv -c --from-code ascii --to-code ascii in.csv --output out.csv
iconv --from-code latin1 --to-code ascii//TRANSLIT in.csv --output out.csv
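The same cleanup can be sketched in Python (function name mine): decode as latin1 (which never fails), turn the nbsp into a plain space, and drop any other character outside printable ASCII.

```python
def clean_csv_text(raw: bytes) -> str:
    """Clean stray non-ASCII bytes from a raw csv, e.g. the 0xA0 nbsp
    occurrences in the 2018 Vehicle file."""
    text = raw.decode("latin1").replace("\xa0", " ")
    # keep newlines and printable ASCII; drop everything else
    return "".join(ch for ch in text if ch == "\n" or 32 <= ord(ch) < 127)
```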
Query to Validate Correct Counts
Visit the main page to correlate between the datafile and FARS official data; tallies of total crashes and total deaths should match exactly:
SELECT ePER_TYP, eINJ_SEV, YEAR, count(*) FROM 2017_person WHERE eINJ_SEV LIKE 'Fatal%' GROUP BY 1,2,3;
SELECT SUM(FATALS), count(*) FROM 2017_incident;
SELECT sUrbanRural, count(1), YEAR FROM 2017_incident GROUP BY 1;
The last query is a sanity check (see more queries here): look at the number of Unknowns for Urban/Rural. In preliminary (“annual”) datafiles there will be a relatively large number of unknowns, many hundreds; in the final file the number is typically less than 100.
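That sanity check can be sketched as a small function; the threshold is just the rule of thumb above, and the function name and input shape are mine.

```python
def looks_final(urban_rural_counts, max_unknown=100):
    """Given tallies like {'Urban': n, 'Rural': n, 'Unknown': n} from the
    last query, guess whether the datafile is final: preliminary files
    have many hundreds of Unknowns, final files typically fewer than 100."""
    return urban_rural_counts.get("Unknown", 0) < max_unknown
```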