In this section, we describe our methodology for the automated analysis of traffic counts. We begin with some notation.
The notations \(x_{i}^{(j)}\) and \(y_{i}^{(j)}\) are used for the number of vehicles in the i:th time slot at location j, j=1,…,6. Here \(x_{i}^{(j)}\) always denotes the number of vehicles that enter the AoI and \(y_{i}^{(j)}\) always refers to exiting vehicles. Since all locations are handled in a similar way, the location index (j) is sometimes dropped from the notation unless different locations are considered simultaneously. The time of the i:th slot is denoted by ti, with the interpretation that the time stamp represents the end time of the slot. For example, ti=16:15 refers to the number of vehicles observed on the given day between 16:00 and 16:15.
4.1 Transformation of the data
At each location j we study the transformed data
$$ \left(x_{i}^{(j)},y_{i}^{(j)}\right)\mapsto\left(x_{i}^{(j)}-y_{i}^{(j)},x_{i}^{(j)}+y_{i}^{(j)}\right) $$
(1)
The transformation (1) is bijective, since
$$\left\{\begin{array}{l} x_{i}+y_{i}=v_{i}\\ x_{i}-y_{i}=z_{i}\\ \end{array}\right. \text{if and only if}\; \left\{\begin{array}{l} x_{i}=(v_{i}+z_{i})/2\\ y_{i}=(v_{i}-z_{i})/2\\ \end{array}\right. $$
so there is no loss of information in this step. The difference zi=xi−yi is called the (traffic) asymmetry and the sum vi=xi+yi is called the (traffic) volume. Since xi≥0 and yi≥0 the inequality −yi≤zi≤xi always holds. Due to limited space, in this paper we concentrate on applications of asymmetry but the same methodology framework can be applied to volumes as well. Figure 2 shows a scatter plot example of the transformation (1). The data in Fig. 2 consist of the first 10 000 available pairs \(\left (x_{i}^{(j)},y_{i}^{(j)}\right)\) from location j=Tampella.
If \(z_{i}^{(j)}=x_{i}^{(j)}-y_{i}^{(j)}>0\), the number of vehicles inside the AoI increased during the time slot ti at location j and, if \(z_{i}^{(j)}<0\), the number of vehicles inside decreased. Thus, asymmetry is a measure of excess/shortfall at location j during the time slot ti when the AoI is considered as a reservoir of vehicles. If \(z_{i}^{(j)}\approx 0\) then, no matter how large the volume \(x_{i}^{(j)}+y_{i}^{(j)}\) is, the total number of vehicles inside the AoI is not essentially affected by the traffic at j. Another justification for this transformation is explained in Section 4.4. In [18] the asymmetry and volume transforms were utilized with similar data but in a different problem set-up.
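As a minimal sketch (in Python; the function names are ours, not part of the paper's pipeline), the transformation (1) and its inverse can be written as:

```python
def to_asym_vol(x, y):
    """Map (x, y) to (asymmetry z, volume v) as in transformation (1)."""
    return x - y, x + y

def from_asym_vol(z, v):
    """Invert the transform: x = (v + z)/2, y = (v - z)/2."""
    return (v + z) // 2, (v - z) // 2

x, y = 37, 12                 # vehicles in / out during one 15-minute slot
z, v = to_asym_vol(x, y)
assert (x, y) == from_asym_vol(z, v)   # bijective: no information lost
assert -y <= z <= x                    # the asymmetry always lies in [-y, x]
```

Since counts are integers, z and v always have the same parity, so the integer division recovers (x, y) exactly.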
4.2 The Normal distribution as a baseline model
The empirical distributions of the asymmetries zi (and volumes) are typically unimodal, symmetric, light-tailed and well approximated by the normal distribution. This is not a coincidence, since the observed asymmetry values zi at any location j and at any time ti can be considered as sums of a large number of possibly slightly correlated random variables with bounded small variances. The Central Limit Theorem (CLT) dictates that the distributions of zi should be approximately normal [19]. Therefore, the normal distribution will be used as a baseline model for the asymmetries.
In an automated analysis, the parameters of the normal distribution model must be estimated in a robust manner. The reason is that the CLT-based argument for the normal distribution model covers only non-mixed cases. The observed data also include observations that are mixed in the sense that any incident that restricts traffic anywhere near a loop detector can increase or decrease the number of vehicles observed by the detector. Thus, the observed data are a mixture of at least two qualitatively different sources of randomness, and the use of ordinary sample means and sample variances is not justified for data from contaminated distributions; see [20] for further reasons. The next section discusses some technical details of the robust estimation used in the proposed framework.
4.3 Robust estimation of the parameters of the Normal distribution model
Robust statistics is a well-developed area of statistics. In [20] robust estimators are heuristically characterized as follows: they should be statistically efficient at the assumed model, stable in the sense that small deviations from the model assumptions impair the performance only slightly, and they should have a high breakdown point, meaning that somewhat larger deviations from the assumed model do not cause a catastrophe (see Section 1.2 in [20]). Any chosen robust estimator is always a compromise among these properties, together with conceptual clarity and computational cost.
The sample median is a well-known robust estimate of the mean μ of the normal distribution. If the data vary symmetrically around μ it is also reasonably efficient. Well-known robust estimators of the scale parameter σ (standard deviation) include the interquartile range (IQR) that has the following justification. If Φμ,σ is the cumulative distribution function (CDF) of the normal distribution N(μ,σ2) and Φ0,1 is the CDF of N(0,1), then the ratio of IQRs of these distributions is
$$\frac{\Phi_{\mu,\sigma}^{-1}(0.75)-\Phi_{\mu,\sigma}^{-1}(0.25)}{\Phi_{0,1}^{-1}(0.75)-\Phi_{0,1}^{-1}(0.25)}=\sigma, $$
and, since \(\Phi _{0,1}^{-1}(0.75)-\Phi _{0,1}^{-1}(0.25)\approx 1.3489795\), an estimator of σ is simply IQRn/1.3489795, where IQRn=Q3−Q1 is the sample IQR, the difference of the 3rd sample quartile Q3 and the 1st sample quartile Q1. Again, symmetric variation around the mean is beneficial for the efficiency of this simple estimator. The 2nd sample quartile Q2 is the sample median.
Conceptually, three robust values, the sample quartiles Q1, Q2 and Q3, provide robust estimates of the two parameters μ and σ of the normal distribution, provided that the assumption of the normal distribution is valid and the data vary symmetrically around the mean. Moreover, the quartile skewness, defined as
$$\frac{ \frac{Q_{3}+Q_{1}}{2}-Q_{2}}{\frac{Q_{3}-Q_{1}}{2}}, $$
can be used to indicate lack of symmetry on the quartile scale. In practice, this means that the fitted normal distribution typically fits the body of the empirical distribution well. There is deliberately no attempt to fit the normal model to the tails of the empirical distribution.
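A minimal sketch of these quartile-based estimators in Python (standard library only; the contaminated demo data are synthetic and only illustrate that the estimates are barely moved by gross outliers):

```python
import random
import statistics

def robust_normal_params(data):
    """Quartile-based robust estimates for a normal model:
    mu    <- sample median Q2
    sigma <- IQR_n / (Phi^{-1}(0.75) - Phi^{-1}(0.25)) = IQR_n / 1.3489795
    skew  <- quartile skewness ((Q3 + Q1)/2 - Q2) / ((Q3 - Q1)/2)
    """
    q1, q2, q3 = statistics.quantiles(data, n=4)
    mu = q2
    sigma = (q3 - q1) / 1.3489795
    skew = ((q3 + q1) / 2 - q2) / ((q3 - q1) / 2)
    return mu, sigma, skew

# demo: N(10, 2^2) data with ~1% gross contamination
random.seed(42)
data = [random.gauss(10.0, 2.0) for _ in range(20000)]
data += [1000.0] * 200          # outliers barely move the quartiles
mu, sigma, skew = robust_normal_params(data)
```

The sample mean and sample standard deviation of the same contaminated data would be pulled far away from 10 and 2, while the quartile-based estimates stay close.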
4.4 Robust estimation of the correlation
To understand the correlation (association, dependence) between the asymmetries of two locations, Spearman’s correlation coefficient is used as a complementary tool to the ordinary linear correlation coefficient. See [21] for the definition and Section 8.3 of [20] for the properties of Spearman’s correlation. Spearman’s correlation ρS is a measure of monotone correlation. If the data are possibly contaminated, it is better suited for the data analysis than the linear correlation coefficient ρ. The sample version of ρS is denoted by rS and it is more robust against outliers than the sample version r of ρ. Linear correlation is a special case of monotone correlation, and if linear correlation holds, then usually rS≈r. In that case there is a negligible bias towards 0 in rS, and rS has a slightly larger variance than r; in the binormal model case, this is straightforward to verify by simulations. If the assumption of linear correlation is not true, the sample value r can be misleading, while the sample value rS remains meaningful as long as monotone correlation, a concept with much wider extent, is plausible. As long as we do not know whether there are bivariate outliers or contamination in the data, we trust rS more than r.
The simplest assumption for the main cause of the correlation is that the same vehicles are observed in two places, first in and then out. The 15-minute slot is sufficiently long that a vehicle can enter and exit the AoI during the same slot at any two locations j and k. This specific kind of causality is the target to estimate. Note that an alternative explanation, in which the entering and exiting vehicles are entirely different, is always possible. However, as a significant and regular phenomenon, such coordinated common behaviour would require a more complicated explanation.
However, there are several other causes of correlation between any two traffic streams. These include effects that appear in the daily profile at every measurement point, for example: silent hours at night 0:00−06:00, rush hours around 8:00 and 16:00 on working days, and the relatively silent moment just before the early lunchtime 11:00 on working days. In these cases the numbers of observed vehicles per slot are either increasing or decreasing everywhere, and this shows up as consistently positive correlation in pairs like \(\left (x_{i}^{(j)},x_{i}^{(k)}\right)\), \(\left (y_{i}^{(j)},y_{i}^{(k)}\right)\) and \(\left (x_{i}^{(j)},y_{i}^{(k)}\right)\). This happens even in cases in which the number of common vehicles in the two locations and directions in question must be practically zero. With the 15-minute granularity these effects are practically simultaneous. Ideally, our causal assumption can be expected to produce linear correlations. However, this is not necessarily true for the other causes. The other common causes listed above typically either increase or decrease traffic amounts everywhere, so their combined effect should still be monotone. This further justifies the use of rS to complement r.
Another justification for the transformation (1) can now be expressed as follows: while the other common causes of correlation affect both xi and yi, their effect diminishes when the difference xi−yi is considered and increases when the sum xi+yi is considered. That is, the differences xi−yi are less affected by the other common causes and are easier to use when the assumed causal source of the correlation is studied. Therefore, we use the values
$$ r_{S}\left(x_{i}^{(j)}-y_{i}^{(j)},x_{i}^{(k)}-y_{i}^{(k)}\right) $$
(2)
with different locations j and k.
A negative correlation in (2) tells us something about the dynamics of the traffic. For example, if there is an asymmetric burst of traffic coming in at Tampella then, simultaneously, there is likely an asymmetric burst of traffic going out at Santalahti, and vice versa. This holds independently of the time of day or the day of the week. It is plausible to assume that, whenever significantly non-zero, this correlation is caused by detecting some amount of the same vehicles at these locations during the same time slot. In [18] the application of asymmetry and volume was in a different context and the directions were chosen so that a positive correlation was targeted.
The correlation matrices can be estimated between the asymmetries of different locations. Moreover, Spearman’s rank correlation test, see [21], can be added to the estimation process so that the null hypothesis of independence of the asymmetries between two locations can be tested. If there is no evidence to reject the null hypothesis ρS=0, that is, if rS≈0 with a large enough sample size n, then we set rS=0. The correlation matrix is symmetric with diagonal values 1.
4.5 Empirical conditional expectation
For integer-valued random variables U and V, and an integer a with \(\mathbb {P}\{V=a\}>0\), the conditional expectation \(\mathbb {E}(U|V=a)\) can be computed as
$$\begin{array}{*{20}l} \mathbb{E}(U|V=a) &=\sum_{u=-\infty }^{\infty} u \left(\frac{\mathbb{P}\{U=u,V=a\}}{\mathbb{P}\{V=a\}}\right). \end{array} $$
Analogously to this, assuming that \(\mathbb {P}\{V>a\}>0\), define
$$\begin{array}{*{20}l} \mathbb{E}(U|V>a) &=\sum_{u=-\infty }^{\infty }\!\! u\! \left(\frac{\mathbb{P}\{U=u,V>a\}}{\mathbb{P}\{V>a\}}\right)\\ &=\sum_{u=-\infty }^{\infty }\!\! u\! \left(\frac{\sum_{v>a} \mathbb{P}\{U=u,V=v\}}{\mathbb{P}\{V>a\}}\right). \end{array} $$
The last formula can be used to compute a sample version \(\mathbb {E}_{n}(U|V>a)\) as
$$ \frac{1}{\mathbb{P}_{n}\{V>a\}} \sum_{i=1}^{n} u_{i} \left(\sum_{v_{i}>a} \mathbb{P}_{n}\{U=u_{i},V=v_{i}\}\right), $$
(3)
where (ui,vi), i=1,…,n, is the bivariate sample of size n. The formula for \(\mathbb {E}_{n}(U|V\leq a)\) is obtained similarly. Two nested sums are involved, so some computation is required; the computational complexity is \({\mathcal {O}}(n^{2})\). The sample estimates of the joint probability mass function of the pair (U,V) and of the sample CDF of V are needed. The robustness properties of (3) should be better than those obtained by conditioning on the event {V=a}, but this is beyond the scope of this publication.
The sample version (3) is computed for all vmin<a<vmax, where vmin and vmax are the minimum and maximum observed values. Typically n<vmax−vmin, so implicit interpolation is included in the map \( a\mapsto \mathbb {E}_{n}(U|V> a)\), since a need not be an observed value. This map defines a piecewise constant curve. If U and V are independent, it follows from (3) that \(\mathbb {E}_{n}(U|V>a)=\mathbb {E}_{n}(U)\) for all a.
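Since the empirical pmf places mass 1/n on each observed pair, the double sum in (3) collapses to a plain conditional average; a minimal sketch:

```python
def cond_expectation_gt(pairs, a):
    """Sample version E_n(U | V > a) of (3): with the empirical pmf,
    the double sum reduces to the average of u_i over slots with v_i > a."""
    selected = [u for (u, v) in pairs if v > a]
    if not selected:
        raise ValueError("P_n{V > a} = 0 for this threshold a")
    return sum(selected) / len(selected)

# toy bivariate sample (u_i, v_i)
pairs = [(1, 10), (3, 20), (5, 30), (7, 40)]
```

For example, `cond_expectation_gt(pairs, 15)` averages u over the three pairs with v > 15. Evaluating the curve at all thresholds a keeps the overall cost at the \({\mathcal {O}}(n^{2})\) mentioned above.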
4.5.1 The multinormal model
Given a binormal or trinormal model for asymmetries at two or three locations, formulae for the conditional expectations (4) and (6) and, especially, their variances can be computed analytically. These are provided in this section. The formulae (5) and (7) of conditional variances are important since they quantify the confidence limits and, therefore, allow the automatic detection of observations that do not fit into (multi)normal models.
Linear regression, which is based on the conditional expectation \(\mathbb {E}(U|V=a)\) with multinormal models, is discussed, for example, in Section 4.3 of [22], in Section 11.3 of [23] and in Chapters 4 and 7 of [24]. Conditioning on an event of the type {V>a} is rarer, but it is used at least in [25] in the context of economic self-selection models. Similar mathematical formulae appear also in the context of truncated distributions [26–28], since truncating a distribution is equivalent to conditioning on an interval.
Assume (Z1,Z2) is binormally distributed with the mean vector μ=(μ1,μ2), variances \(\sigma _{1}^{2}>0\), \(\sigma _{2}^{2}>0\) and correlation −1<ρ<1. If \(a\in \mathbb {R}\), then
$$ \mathbb{E}(Z_{1}|Z_{2}>a)= \mu_{1}+\sigma_{1}\left(\frac{\rho\,\phi(\alpha)}{1-\Phi(\alpha)}\right), $$
(4)
in which, to simplify the notation, we define \(\alpha =\alpha (a)=\frac {a-\mu _{2}}{\sigma _{2}}\) for all \(a\in \mathbb {R}\). The formula for the conditional variance is
$$ \text{Var}(Z_{1}|Z_{2}\!>\!a)\,=\, \sigma_{1}^{2} \!\!\left[\!1\,+\,\frac{\rho^{2}\alpha\,\phi(\alpha)}{1\,-\,\Phi(\alpha)}\,-\,\! \left(\!\frac{\rho\,\phi(\alpha)}{1\,-\,\Phi(\alpha)}\!\right)^{\!\!2}\right]. $$
(5)
Assume (Z1,Z2,Z3) is trinormally distributed with the mean values μi, variances \(\sigma _{i}^{2}>0\), i=1,2,3 and pairwise correlations ρij, i<j. The conditional expectation \(\mathbb {E}\left (Z_{1}|Z_{2}=z_{2}, Z_{3}=z_{3}\right)\) can be computed as
$$\begin{array}{@{}rcl@{}} \sigma_{1}\left[\frac{(\rho_{12}-\rho_{13}\rho_{23})(z_{2}-\mu_{2})}{(1-\rho_{23}^{2})\sigma_{2}}\right.\\ +\left.\frac{(\rho_{13}-\rho_{12}\rho_{23})(z_{3}-\mu_{3})}{(1-\rho_{23}^{2})\sigma_{3}}\right]\!+\mu_{1}. \end{array} $$
(6)
The conditional variance Var(Z1|Z2=z2,Z3=z3) can be computed from
$$ \sigma_{1}^{2}\left(1-\frac{\rho_{12}^{2}+\rho_{13}^{2}-2 \rho_{13} \rho_{23} \rho_{12}}{1-\rho_{23}^{2}}\right). $$
(7)
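Formulas (4)–(7) can be implemented directly with the standard normal density φ and CDF Φ (the latter via `math.erf`); a sketch with symbol names matching the text:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def binormal_cond_gt(mu1, s1, mu2, s2, rho, a):
    """E(Z1 | Z2 > a) and Var(Z1 | Z2 > a), formulas (4) and (5)."""
    alpha = (a - mu2) / s2
    lam = phi(alpha) / (1 - Phi(alpha))  # inverse Mills ratio
    mean = mu1 + s1 * rho * lam
    var = s1 ** 2 * (1 + rho ** 2 * alpha * lam - (rho * lam) ** 2)
    return mean, var

def trinormal_cond_eq(mus, sigmas, r12, r13, r23, z2, z3):
    """E(Z1 | Z2=z2, Z3=z3) and Var(Z1 | Z2=z2, Z3=z3), formulas (6), (7)."""
    mu1, mu2, mu3 = mus
    s1, s2, s3 = sigmas
    d = 1 - r23 ** 2
    mean = mu1 + s1 * ((r12 - r13 * r23) * (z2 - mu2) / (d * s2)
                       + (r13 - r12 * r23) * (z3 - mu3) / (d * s3))
    var = s1 ** 2 * (1 - (r12 ** 2 + r13 ** 2 - 2 * r13 * r23 * r12) / d)
    return mean, var
```

Sanity checks follow from the formulas themselves: with ρ=0 the conditioning is irrelevant, so (4) and (5) reduce to μ1 and \(\sigma_{1}^{2}\), and with ρ13=ρ23=0 the trinormal case reduces to simple linear regression on Z2. (For very large α, `math.erfc` would be numerically safer than 1−Φ(α).)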