- Original Paper
- Open Access
Severity analysis of powered two wheeler traffic accidents in Uttarakhand, India
© The Author(s) 2017
- Received: 19 August 2016
- Accepted: 21 April 2017
- Published: 1 May 2017
Powered Two Wheeler (PTW) vehicles are one of the preferred modes of transport in India. PTW accidents are also comparatively more frequent than other types of road accident, and the factors that influence them differ from those that affect other accident types. The objective of this study is to analyze newly available PTW road accident data from the state of Uttarakhand, India, and to reveal the factors that affect the severity of these accidents in its various districts.
To analyze the factors that affect the severity of road accidents in Uttarakhand, we first compared three popular classification algorithms, namely decision tree (CART), naïve Bayes and support vector machine, on the PTW accident data set. The classification accuracy of the decision tree (CART) algorithm was found to be better than that of the other two techniques. Hence, we used the CART algorithm to extract the factors that affect the severity of PTW accidents in the whole of Uttarakhand state and in each of its 13 districts separately.
The analysis of PTW accident data using CART for the 13 districts of Uttarakhand and for the whole state reveals that different districts have different factors associated with PTW accident severity. Some districts have similar PTW accident patterns, whereas a few have distinct patterns. These results are useful for understanding the pattern of PTW accidents in Uttarakhand state and can help in reducing the PTW accident rate there.
- Powered two wheeler (PTW) vehicle
- Road accident
- Data mining
- Decision rules
- Traffic safety
A traffic accident can be considered as an incident in which one or more vehicles collide with another vehicle, a person, an animal or a fixed object. Traffic accidents involve not only loss of human life but also property damage. The World Health Organization (WHO) reports that traffic accidents cause 1.2 million deaths and around 4 million injuries worldwide every year. Rising vehicle purchases are increasing the number of vehicles on the road day by day, and hence the chances of traffic accidents are also increasing.
A traffic accident affects not only the lives of the victims involved but also the lives of the people associated with them, such as family members and business associates. Every road accident leaves a record in a police or hospital database. This record contains important information about the accident, such as its time, date and location, the weather, the road characteristics and the traffic conditions at the time of the accident. Proper analysis of this information can reveal the factors behind road accidents so that preventive measures can be taken.
Traffic accident analysis is a well-known research area, and a rich literature describes the different techniques used in road accident analysis and their outcomes. Abdalla et al. analyzed road accident data from Scotland and established the relationship between the location of a traffic accident and its distance from residential areas. Their findings reveal that traffic accidents are more frequent near residential areas than in areas that are not in close proximity to them. Mussone et al. analyzed road accidents that occurred at intersections in the Milan region of Italy using a neural network model. Their results showed that accidents in which a pedestrian is hit at night at a non-signalized intersection have the highest frequency in that region. Several other studies focused on traffic accident severity analysis using traditional statistical techniques and reported good results [4–12]. However, [13, 14] showed that traditional statistical techniques have certain limitations in analyzing road accident data. Further, several studies using data mining techniques in road accident analysis have shown that data mining provides more productive results than traditional statistical techniques. Data mining techniques have also been used to categorize road accident locations and to identify the factors that affect accidents at those locations. Some authors raised the issue that road accident data is heterogeneous in nature and suggested that clustering the data prior to analysis can reduce this heterogeneity [17–19]. Some studies also used data mining techniques to analyze crash counts using time series analysis [20, 21].
Powered two wheelers (PTW) are among the vehicles most often involved in road accidents. This is directly related to the larger number of PTWs purchased in comparison with other vehicles. The reason behind the rapid growth in PTW purchases is that these vehicles are more affordable, smaller, lighter, more flexible and faster than other vehicles in heavy traffic conditions. In other words, a PTW is a vehicle driven by people of all economic conditions (rich, middle-class and poor) on both urban and rural roads. Various studies used traditional approaches [22–26] to analyze the crash severity of PTW accidents in developed countries. One study used classification trees to generate rules that predict the crash severity of powered two wheeler accidents.
An important point about PTW riders is that they are more prone to road and traffic accidents than the occupants of other vehicles such as cars, SUVs, vans and buses. The motivation behind this study is to identify the factors that affect the severity of PTW road accidents in Uttarakhand state. We have used a decision tree classifier, a support vector machine and a naïve Bayes classifier to predict the severity of PTW road accidents in the 13 districts of Uttarakhand state. The severity of accidents is categorized into KSI (Killed or severely injured) and SI (Slightly injured). In this study, we have identified several factors that affect the severity of PTW accidents in Uttarakhand, India, which can help in reducing the accident rate.
2.1 Data set used
PTW road accident data attributes (attribute: code)
- Number of injured persons/Accident: NOI
- Age of victim: AOV (values include < 18 years)
- Time of day: TOD
- Lighting condition: LIG
- Roadway feature: ROF
- Road type: ROT
- Accident severity: ASV (values include killed or severe injury)
- Surrounding area: SUA
- Day of week
2.2 Classification techniques
In the domain of data mining, classification is a supervised learning technique that can be defined as follows: given a set of observations, we are interested in extracting rules that can be used to predict the class of each new observation. The set of observations used to extract the rules is known as the training set. Another set of observations, known as the test set, is used to verify the quality and accuracy of the rules. Initially, the training data and the test data are both part of the data set available at the moment. Classification is a widely used technique that has shown its importance in various fields such as bioinformatics, pattern recognition and image classification. To achieve the best prediction, a suitable classification technique must be selected, and this selection depends on the type and nature of the data. As our data is largely categorical, we evaluate the prediction accuracy of three suitable classification techniques on it, namely the decision tree algorithm, the naïve Bayes algorithm and the support vector machine algorithm. The technique with the highest prediction accuracy is then used for the analysis.
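As an illustration of learning from a training set and predicting the class of a new observation, the following is a minimal sketch of a naïve Bayes classifier for categorical attributes. It is not the implementation used in this study: the class name `CategoricalNaiveBayes`, the add-one smoothing, the toy records and the roadway-feature values are all assumptions made for the example; only the attribute codes (LIG, ROF) and the class labels (KSI, SI) come from the text.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes for categorical attributes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        # value_counts[(attribute_index, class_label)][value] -> count
        self.value_counts = defaultdict(Counter)
        # Distinct values seen for each attribute (needed for smoothing).
        self.values = [set() for _ in rows[0]]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of log conditional probabilities.
            score = math.log(count / total)
            for i, value in enumerate(row):
                numerator = self.value_counts[(i, label)][value] + 1
                denominator = count + len(self.values[i])
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training records: (LIG, ROF) -> accident severity.
train_rows = [("NLT", "CURVE"), ("NLT", "INTERSECTION"), ("DLT", "STRAIGHT"),
              ("DLT", "INTERSECTION"), ("RLT", "STRAIGHT"), ("NLT", "CURVE")]
train_labels = ["KSI", "KSI", "SI", "SI", "SI", "KSI"]

model = CategoricalNaiveBayes().fit(train_rows, train_labels)
print(model.predict(("NLT", "CURVE")))
print(model.predict(("DLT", "STRAIGHT")))
```

The same fit/predict pattern applies to the decision tree and support vector machine classifiers; only the way the model is learned from the training set differs.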
2.3 K-fold cross-validation
A common problem with classification techniques is how to partition the data into training and test data. Often the split is decided by the user, with the training data usually kept larger than the test data. Some choose 70%–30%, 60%–40%, 80%–20% and so on for the training and test sets and check which gives the better accuracy. However, dividing the data according to the user's choice is a time-consuming and complex process. This approach also tends to fail on imbalanced data, where the class values to be predicted are unevenly represented or differ by a large ratio. K-fold cross validation is a statistical technique that divides the entire data set into k groups, where k is any number greater than 1. Out of the k subgroups, a single group is retained as the test data and the remaining k-1 subgroups are taken as training data. The process is then repeated k times, with each of the k subgroups used as the test set exactly once. The k outcomes can then be averaged to produce a single estimate. The value of k is not fixed in general, but k = 10 is a widely accepted standard value. This study used k-fold cross validation with k = 10 to partition the data into training and test sets.
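The partitioning scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used by the authors' tooling: the function names `k_fold_indices` and `cross_validated_score`, the shuffling seed and the `evaluate` callback are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validated_score(n_samples, evaluate, k=10, seed=0):
    """Average the k scores returned by evaluate(train_idx, test_idx)."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(n_samples=25, k=10))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

In practice `evaluate` would train a classifier on the training indices and return its accuracy on the test indices; the averaged value is the single estimate mentioned above.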
2.4 Classifier accuracy measures
One of the most important aspects of the classification process is how well the classifier predicts unobserved instances. This is known as the accuracy of a classifier. Accuracy alone, however, is not always a good measure of classifier quality. Here, we describe several accuracy measures that can help in assessing the quality of a classifier.
2.4.1 Confusion matrix
A confusion matrix (or error matrix) is a contingency table that allows visualization of the performance of a classifier. A column in the confusion matrix denotes the predicted class instances and a row represents the actual class instances. To understand the confusion matrix, consider a data sample of 10 animals with 4 lions and 6 tigers. If a classification algorithm is trained to distinguish between lions and tigers, a confusion matrix summarizes the results of the algorithm for the given sample of data. The confusion matrix for PTW accident data is given in Table 3.
Example confusion matrix
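The lions-and-tigers example can be made concrete with a short sketch. The actual class counts (4 lions, 6 tigers) come from the text, but the predicted labels below are invented purely for illustration, and the helper name `confusion_matrix` is an assumption rather than a reference to any particular library.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return matrix[actual_label][predicted_label] -> count (rows are actual)."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}


# 4 lions and 6 tigers, with made-up classifier predictions.
actual = ["lion"] * 4 + ["tiger"] * 6
predicted = ["lion", "lion", "lion", "tiger",
             "tiger", "tiger", "tiger", "tiger", "lion", "lion"]

matrix = confusion_matrix(actual, predicted, labels=["lion", "tiger"])
for actual_label in ["lion", "tiger"]:
    print(actual_label, matrix[actual_label])
```

With these hypothetical predictions, 3 of the 4 lions and 4 of the 6 tigers fall on the diagonal, that is, they are classified correctly.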
2.4.2 True positive rate (TPR) and false positive rate (FPR)
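TPR measures the fraction of actual positives that are correctly identified. It is also known as the sensitivity of a classifier and can be calculated from the entries of the contingency table using Eq. 1. FPR, also known as the false alarm ratio, refers to the probability of falsely rejecting the null hypothesis. It is the ratio of the number of negative events that are mistakenly categorized as positive to the total number of actual negative events, as given in Eq. 2. Here TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively.
$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{1}$$
$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{2}$$
2.4.3 Specificity
The specificity of a classifier is its ability to correctly predict the negative cases in the data set. It can be calculated from the numbers of true negatives (TN) and false positives (FP) as
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{3}$$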
2.4.4 Precision and recall
Precision and recall are among the most commonly used metrics for measuring the performance of a classification algorithm. Precision is a measure of exactness: of all the instances predicted to belong to a given class X, how many were correctly classified. Recall, which is the same as sensitivity or TPR, is a measure of completeness: of all the data instances with class value X, how many were correctly captured.
The formula for calculating precision is given in Eq. 4, where TP and FP are the numbers of true positives and false positives; the formula for recall is the same as that for TPR in Eq. 1.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$
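The measures from Sections 2.4.2 to 2.4.4 can all be computed from the four cells of a binary confusion matrix. The sketch below assumes the usual TP/FP/TN/FN notation; the function names and the example counts are hypothetical and are not taken from the PTW data.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classifier_rates(tp, fp, tn, fn):
    """Compute TPR, FPR, specificity, precision and recall from the four counts."""
    return {
        "tpr": safe_ratio(tp, tp + fn),          # sensitivity
        "fpr": safe_ratio(fp, fp + tn),          # false alarm ratio
        "specificity": safe_ratio(tn, tn + fp),  # equals 1 - FPR
        "precision": safe_ratio(tp, tp + fp),
        "recall": safe_ratio(tp, tp + fn),       # same quantity as TPR
    }


# Hypothetical counts, with KSI treated as the positive class.
rates = classifier_rates(tp=40, fp=10, tn=30, fn=20)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Note that specificity and FPR always sum to 1, and that recall and TPR are the same quantity under two names.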
2.4.5 F-measure and MCC
F-measure, also known as F-score, is a measure of a classifier's accuracy that takes both precision and recall into account. Specifically, the F-score is the harmonic mean of precision and recall. Its best value is 1 and its worst value is 0. The F-score can be calculated using Eq. 5.
$$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5}$$
MCC, or Matthews correlation coefficient, is a measure of the quality of a binary classification, in which the variable to be predicted has only two values. In our case, the target attribute has two class values, KSI (Killed or severely injured) and SI (Slightly injured). MCC is considered a balanced metric that can be used even if the classes are of very different sizes. Its value ranges between −1 and +1. A value of +1 represents a perfect prediction, 0 a prediction no better than random, and −1 total disagreement between prediction and observation. MCC can be calculated from the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) in the confusion matrix using Eq. 6.
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{6}$$
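A short sketch of both measures follows, using the standard definitions in terms of confusion-matrix counts. The function names and the example counts are hypothetical; returning 0.0 when a denominator is zero is a common convention assumed here, not something stated in the text.

```python
import math


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 if there are no true positives)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts, with KSI treated as the positive class.
print(round(f_measure(tp=40, fp=10, fn=20), 4))
print(round(mcc(tp=40, fp=10, tn=30, fn=20), 4))
```

Unlike the F-measure, MCC uses all four cells of the confusion matrix, which is why it remains informative when one class is much larger than the other.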
2.4.6 Receiver operating characteristic (ROC) curve
ROC is an important measure for checking the accuracy of a classifier. It was originally used in signal detection theory to depict the tradeoff between hit rates and false alarm rates over a noisy channel. It is now widely used in machine learning as a technique for visualizing the performance of a classifier. The ROC curve is a plot of TPR against FPR. To evaluate the performance of the classifier, the AUC (area under the ROC curve) is calculated. An AUC value close to 1 represents very good performance, whereas an AUC value below 0.5 is considered poor performance (0.5 corresponds to random guessing).
This section presents the results and experimental analysis of the PTW road accident data.
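To show how the ROC curve and its AUC arise from classifier scores, the following sketch sweeps the decision threshold over a small set of scored instances and integrates the resulting curve with the trapezoidal rule. The function names, labels and scores are invented for illustration, and the code assumes both classes are present in the data.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping the threshold from high to low.

    labels are 1 (positive) or 0 (negative); both classes must be present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores for six instances (1 = KSI, 0 = SI).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 4))
```

The resulting area equals the fraction of positive-negative pairs in which the positive instance receives the higher score, which is 8 out of 9 pairs for this toy data.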
3.1 Performance of classification techniques on PTW data
Initially, we applied the Classification and Regression Trees (CART) algorithm for decision tree classification, together with the naïve Bayes and support vector machine techniques, to evaluate prediction accuracy on the accident data. Table 3 shows the prediction accuracy of all three classifiers on the PTW accident data set. The prediction accuracy obtained for CART is higher than that of the naïve Bayes classifier and the support vector machine. Hence, we selected the CART decision tree algorithm to analyze our road accident data.
3.2 CART performance analysis
Table 3 Prediction accuracy of different classifiers. Columns: classification accuracy (%) for all Uttarakhand data and for district-wise data. Rows: Decision Tree (CART), Support vector machine.
Confusion matrix for 13 districts and EDS of Uttarakhand
The values of the different parameters shown in Table 5 indicate the performance of CART in predicting the severity of PTW accidents. For the Dehradun, Haridwar, Nainital and Udham Singh Nagar districts, which have high PTW accident rates in Uttarakhand state, the decision tree classifier's accuracy is found to be better than for the remaining districts. In the other districts, the classifier is less accurate. One reason may be the small number of accident records: if the data set is not sufficiently large, the decision tree algorithm may not be as accurate as desired. The other reason for low accuracy is that similar values of different attributes are associated with both KSI and SI accidents. The ROC plots in Fig. 1.1 to Fig. 1.14 show the performance of the decision tree classifier for all 13 districts and for the entire data set (EDS). The AUC (area under the ROC curve) is shown in each figure. The AUC indicates that the decision tree classifier performs worst for Bageshwar district and best for the Dehradun, Nainital, Haridwar and Udham Singh Nagar districts.
3.3 Decision rules extraction and description
Further, decision rules are extracted from the decision trees built for all districts and the EDS. The relevant and interesting rules have been chosen to describe the patterns of each district and the EDS. The decision rules are described as follows:
The decision rules for the Almora, Bageshwar and Chamoli districts indicate that NOI, TOD, SUA and LIG are the main contributing accident attributes involved in PTW accidents. The decision rules revealed that PTW accidents that occurred at night with no light were KSI accidents, whereas locations where road lighting was present at night had SI accidents only. In other conditions, it is difficult to distinguish between KSI and SI accidents, because similar attribute values were present for both. Other attributes that were not available in the data, such as speed and weather information, may be responsible factors for PTW accidents in these districts of Uttarakhand.
CART performance metrics for 13 districts and EDS
Dehradun district, which has the highest number of PTW road accidents in Uttarakhand state, was mainly affected by the NOI, TOD, ROF, SUA and LIG road accident attributes. The decision rules reveal some interesting information. According to the rules, most of the KSI accidents occurred in no-light conditions at intersections near markets, residential areas and agricultural land. Curves on roads near forest areas were also KSI-prone locations for PTW accidents involving 1 victim. Other values of the attributes were usually associated with SI accidents.
For Nainital district, the factors that affect the severity of PTW accidents include, in addition to those of the previously mentioned districts, two more accident attributes: age of victim and ROT. The rules reveal that curves on the road are the main factor contributing to KSI accidents at night and in the early morning. Also, KSI accidents on highways in the evening involved minor victims, that is, victims less than 18 years of age.
For Udham Singh Nagar district, the factors that affect the severity of road accidents were quite similar to those in Dehradun district. Colonies and market areas were the major locations where many of the accidents occurred, but most of these were SI accidents only. The PTW KSI accidents mainly occurred on highways that pass through agricultural land or forest areas. Victims in the YNG and ADU age groups were mainly involved in KSI accidents, and very few KSI accidents involved victims in the SNR and CHD groups.
The Rudraprayag, Tehri and Uttarkashi districts were not mainly affected by ROT, ROF and the other important factors found for the previous districts. One common factor revealed by the decision rules is the LIG condition. Most of the KSI accidents in these districts occurred in the dusk (DUS) lighting condition, whereas other lighting conditions were usually associated with SI accidents. As the number of PTW accident records for these districts was comparatively low, some other factors may remain hidden.
The decision rules for the Pauri and Pithoragarh districts revealed that these two districts have similar patterns of PTW accidents. In both districts, the KSI accidents mainly involved the AGE groups CHD and SNR and the LIG condition DUS. These accidents also mainly happened in the Q1 and Q4 months of the year. The SI accidents mainly involved the AGE group ADU, whereas the YNG age group was equally involved in both SI and KSI accidents.
Further, the rules for the EDS have been analyzed. It was found that, for the EDS, almost all attributes except the MON (month) attribute were involved in KSI and SI accidents for PTWs. Most of the KSI accidents involved an NOI value of 1, and very few involved NOI = +2. For the AGE attribute, the values YNG and ADU were mainly involved in KSI accidents, whereas the number of CHD victims was comparatively low. SNR victims were found to be involved in both KSI and SI accidents, but these accidents were fewer than those involving other victims. Most KSI accidents occurred at intersections, and most of those intersections were on highways. Curves on highways were also found to be dangerous, as they were involved in more KSI accidents than SI accidents. The SUA attribute values MAR and HIL are the locations where most of the accidents occurred, but SI accidents outnumbered KSI accidents at these locations. The SUA values FOR and AGL were found to be dangerous for PTW accidents on local roads. For the LIG attribute, around 10% of accidents occurred in the DUS condition, of which 46% were KSI; hence the DUS condition could be dangerous for PTWs. Although many accidents occurred in the DLT condition, most of them were SI accidents. In the RLT condition, most of the PTW accidents were found to be KSI accidents. Some PTW accidents also occurred in NLT conditions, but most of them were SI accidents.
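Decision rules of the kind described above are obtained by following each path from the root of a decision tree to a leaf. The sketch below illustrates this on a small, entirely hypothetical tree; it is not one of the trees built in this study. The nested-dictionary representation, the function name `extract_rules`, the roadway-feature values and the multiway branching (CART itself produces binary splits) are all simplifying assumptions.

```python
def extract_rules(node, conditions=()):
    """Walk a decision tree and return one IF-THEN rule per leaf."""
    if "label" in node:  # leaf node: emit the accumulated conditions
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN {node['label']}"]
    rules = []
    for value, child in node["branches"].items():
        test = f"{node['attribute']} = {value}"
        rules += extract_rules(child, conditions + (test,))
    return rules


# Hypothetical tree over the LIG and ROF attributes (values are made up).
tree = {
    "attribute": "LIG",
    "branches": {
        "NLT": {"label": "KSI"},
        "RLT": {"label": "SI"},
        "DLT": {
            "attribute": "ROF",
            "branches": {
                "CURVE": {"label": "KSI"},
                "STRAIGHT": {"label": "SI"},
            },
        },
    },
}

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

Each printed rule corresponds to one leaf, so a tree with four leaves yields four rules; in practice only the rules covering enough records would be retained as relevant.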
Therefore, it is found that a separate analysis of each district's data and a complete analysis of the entire data reveal different but important information that can be used to understand the factors involved in PTW road accidents. The accident attributes have a different impact on PTW accidents in every district. It can be concluded that the analysis of the entire data gives a broad overview of the factors involved in PTW road accidents, whereas a separate analysis of each district reveals the factors associated with PTW accidents in that district only. Therefore, both types of analysis should be performed, on the EDS and on each district, to obtain both broad and detailed information about accident factors.
We gratefully acknowledge GVK-EMRI, Dehradun, for providing the data for our research work.
Compliance with ethical standards
Conflicts of Interest
The authors do not have any conflict of interest in the publication of this manuscript.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- World Health Organization. Global Status Report on Road Safety 2015. Available online: http://www.who.int/violence_injury_prevention/road_safety_status/2015/GSRRS2015_Summary_EN_final2.pdf?ua=1 (accessed on 01.07.2016)
- Abdalla IM, Raeside R, Barker D, McGuigan DR (1997) An investigation into the relationships between area social characteristics and road accident casualties. Accid Anal Prev 29:583–593
- Mussone L, Ferrari A, Oneta M (1999) An analysis of urban collisions using an artificial intelligence model. Accid Anal Prev 31:705–718
- Poch M, Mannering F (1996) Negative binomial analysis of intersection-accident frequencies. J Transp Eng 122
- Karlaftis M, Tarko A (1998) Heterogeneity considerations in accident modeling. Accid Anal Prev 30:425–433
- Ma J, Kockelman K (2006) Crash frequency and severity modeling using clustered data from Washington state. In: IEEE Intelligent Transportation Systems Conference. Toronto, Canada
- Abdel-Aty MA, Radwan AE (2000) Modeling traffic accident occurrence and involvement. Accid Anal Prev 32
- Miaou SP (1994) The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions. Accid Anal Prev 26
- Chen W, Jovanis P (2002) Method of identifying factors contributing to driver-injury severity in traffic crashes. Transp Res Rec 1717
- Maher MJ, Summersgill IA (1996) Comprehensive methodology for the fitting of predictive accident models. Accid Anal Prev 28
- Joshua SC, Garber NJ (1990) Estimating truck accident rate and involvements using linear and Poisson regression models. Transp Plan Technol 15
- Jones B, Janssen L, Mannering F (1991) Analysis of the frequency and duration of freeway accidents in Seattle. Accid Anal Prev 23
- Miaou SP, Lum H (1993) Modeling vehicle accidents and highway geometric design relationships. Accid Anal Prev 25
- Chang LY, Chen WC (2005) Data mining of tree based models to analyze freeway accident frequency. J Saf Res 25
- Kumar S, Toshniwal D (2015) Analyzing road accident data using association rule mining. International Conference on Computing, Communication and Security (ICCCS-2015), Dec 2015, Mauritius
- Kumar S, Toshniwal D (2016) A data mining approach to characterize road accident locations. Journal of Modern Transportation 24:62–72
- Oña JD, López G, Mujalli R, Calvo FJ (2013) Analysis of traffic accidents on rural highways using latent class clustering and Bayesian networks. Accid Anal Prev 51
- Kumar S, Toshniwal D (2015) A data mining framework to analyze road accident data. Journal of Big Data 2:1–18
- Kumar S, Toshniwal D, Parida M (2016) A comparative analysis of heterogeneity in road accident data using data mining techniques. Springer, Evol Syst doi:10.1007/s12530-016-9165-5
- Kumar S, Toshniwal D (2016) A novel framework to analyze road accident time series data. Journal of Big Data 3:1–11
- Kumar S, Toshniwal D (2016) Analysis of road accident counts using hierarchical clustering and cophenetic correlation coefficient (CPCC). Journal of Big Data 3:1–11
- Quddus MA, Noland RB, Chin HC (2002) An analysis of motorcycle injury and vehicle damage severity using ordered probit models. J Saf Res 33:445–462
- Rifaat SM, Tay R, de Barros A (2012) Severity of motorcycle crashes in Calgary. Accid Anal Prev 49:44–49
- Savolainen P, Mannering F (2007) Probabilistic models of motorcyclists’ injury severities in single- and multi-vehicle crash. Accid Anal Prev 39:955–963
- Yannis G, Golias J, Papadimitriou E (2005) Driver age and vehicle engine size effects on fault and severity in young motorcyclists accidents. Accid Anal Prev 37:327–333
- de Lapparent M (2006) Empirical Bayesian analysis of accident severity for motorcyclists in large French urban areas. Accid Anal Prev 38:260–268
- Montella A, Aria M, Ambrosio AD, Mauriello F (2012) Analysis of powered two wheeler crashes in Italy by classification trees and rules discovery. Accid Anal Prev 49:58–72
- http://www.emri.in/ accessed on 14.07.2016.
- Han J, Kamber M (2001) Data mining: concepts and techniques. Morgan Kaufmann Publishers, USA
- Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106 doi:10.1023/A:1022643204877
- Russell S, Norvig P (1995) Artificial intelligence: a modern approach (2nd ed.). Prentice Hall
- Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
- Tan PN, Steinbach M, Kumar V (2006) Introduction to data mining. Pearson Addison-Wesley
- Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (San Mateo, CA: Morgan Kaufmann) 2(12):1137–1143
- Powers MW (2011) Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. J Mach Learn Technol 2:37–63
- Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405:442–451
- Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27:861–874