Reading this article about vehicle fatalities and GM air bags got me thinking about "Big Data". Why don't safety nerds regularly run individual-level regressions, using a linear probability model, to examine how the probability of a vehicle fatality varies with the vehicle's make and model?
In a regression framework, the dependent variable would be a dummy variable that equals one if the vehicle was involved in an accident that killed someone in the vehicle and equals zero otherwise. I realize that people who drive more miles are more likely to be in a vehicle accident, but the researcher could collect information on the zip code where the vehicle is registered and merge in data on the average income and age of the adults who live there. Knowing the zip code's centroid location, the researcher could calculate the residential area's distance to public transit, to the city center and to employment centers. Knowing the calendar date, the researcher could merge in weather conditions (e.g., icy, snowing, rainy). Controlling for these demographic variables, urban form variables and weather conditions, the researcher could estimate coefficients on a series of make/model/vintage dummy variables to see which vehicles are most likely to be involved in fatalities. While large positive estimates could be spurious, they would provide a clue about which types of vehicles the regulators should focus on. A researcher would have to recognize that if year 2005 Toyota Avalons are 33% of the vehicle fleet and account for 33% of total U.S. miles driven, then it shouldn't be surprising that a large number of fatalities occur in those cars. Such a research design would need to study which makes, models and vintages are over-represented among fatalities relative to their respective shares of total miles driven.
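To make the regression idea concrete, here is a minimal sketch of such a linear probability model in Python. Everything in it is an assumption for illustration: the data are simulated, the vehicle types "A", "B" and "C" are made up, the fatality rates are far higher than anything realistic, and the function name fit_lpm is hypothetical. It regresses a 0/1 fatality dummy on vehicle-type dummies plus two stand-in controls (a standardized income measure and an icy-day dummy) using ordinary least squares.

```python
import numpy as np

def fit_lpm(fatal, vehicle_type, controls):
    """Linear probability model: OLS of a 0/1 outcome on vehicle-type
    dummies (no intercept) plus demeaned control variables."""
    types, idx = np.unique(vehicle_type, return_inverse=True)
    n = len(fatal)
    dummies = np.zeros((n, len(types)))
    dummies[np.arange(n), idx] = 1.0             # one column per vehicle type
    # Demean the controls so each dummy coefficient reads as the
    # fatality probability for that type at average control values.
    centered = controls - controls.mean(axis=0)
    X = np.hstack([dummies, centered])
    coef, *_ = np.linalg.lstsq(X, fatal.astype(float), rcond=None)
    k = len(types)
    return dict(zip(types.tolist(), coef[:k])), coef[k:]

# Simulated data only: type "C" is built to be riskier than "A" and "B".
rng = np.random.default_rng(0)
n = 200_000
vehicle_type = rng.choice(["A", "B", "C"], size=n)
income = np.clip(rng.standard_normal(n), -3, 3)   # stand-in for zip-code income
icy = (rng.random(n) < 0.2).astype(float)         # stand-in for weather
base = np.where(vehicle_type == "C", 0.03, 0.01)  # made-up baseline risks
p = base + 0.01 * icy - 0.002 * income
fatal = rng.random(n) < p

effects, control_coefs = fit_lpm(fatal, vehicle_type,
                                 np.column_stack([income, icy]))
for name, est in effects.items():
    print(name, round(est, 4))
```

With real data the dummies would be make/model/vintage cells rather than three letters, and the standard errors would matter as much as the point estimates. The sketch skips them to stay short.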
The fixed effects recovered from these regressions could be graphed against data from RL Polk on each make/model/vintage's share of all vehicle registrations, and those observations that lie above the 45-degree line should be targeted for a safety inspection. For example, if a vehicle type is involved in 3% of all accidents but makes up only 1% of the total vehicle fleet, then this vehicle should be investigated for safety problems. It is possible, though, that the vehicle is safe but that crazy drivers tend to buy it. How can this selection issue be handled? If crazy drivers tend to buy risky cars, how does a Big Data nerd tease out whether the extra accident risk is due to bad driving or bad initial vehicle construction?
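The 45-degree-line comparison boils down to a ratio of two shares. Here is a small sketch in Python, using hypothetical counts that mirror the 3% versus 1% example. The function name overrepresentation and the vehicle labels are assumptions for illustration, and the exposure measure could be registrations or miles driven, depending on what data the researcher has.

```python
def overrepresentation(events, exposure):
    """Ratio of each vehicle type's share of events (accidents or
    fatalities) to its share of exposure (registrations or miles).
    A ratio above 1 corresponds to a point above the 45-degree line."""
    total_events = sum(events.values())
    total_exposure = sum(exposure.values())
    return {
        k: (events[k] * total_exposure) / (exposure[k] * total_events)
        for k in events
    }

# Hypothetical counts: "type_x" has 3 of 100 accidents but 1 of 100 registrations.
events = {"type_x": 3, "everything_else": 97}
registrations = {"type_x": 1, "everything_else": 99}

ratios = overrepresentation(events, registrations)
flagged = [k for k, r in ratios.items() if r > 1]
print(ratios, flagged)
```

In this made-up example "type_x" has a ratio of 3 and gets flagged. The ratio says nothing about why the type is over-represented, which is exactly the selection problem.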
Years ago, Steve Levitt and Jack Porter wrote an important paper on teasing out the causal effects of drunk driving from driving data. Can their framework be used to recover make/model/model year fixed effects to identify the risky cars? A selection model would be needed to judge what types of drivers self-select into what types of cars. Note that the risk safety experts seek to estimate here comes not from the driver's type but from the manufacturer's choices. Can these two sources of risk be disentangled in order to identify deadly products before more people die?