In this section, we describe our methodology for the automated analysis of traffic counts. We begin by fixing the notation.

The notations \(x_{i}^{(j)}\) and \(y_{i}^{(j)}\) denote the number of vehicles in the *i*:th *time slot* at *location* *j*, *j*=1,…,6. Here \(x_{i}^{(j)}\) always denotes the number of vehicles that enter the AoI and \(y_{i}^{(j)}\) always refers to exiting vehicles. Since all locations are handled in the same way, the location index (*j*) is sometimes dropped from the notation unless different locations are considered simultaneously. The time of the *i*:th slot is denoted by *t*_{i}, with the interpretation that the time stamp marks the end of the slot. For example, *t*_{i}=16:15 refers to the number of vehicles observed on the given day between 16:00 and 16:15.

### 4.1 Transformation of the data

At each location *j* we study the transformed data

$$ \left(x_{i}^{(j)},y_{i}^{(j)}\right)\mapsto\left(x_{i}^{(j)}-y_{i}^{(j)},x_{i}^{(j)}+y_{i}^{(j)}\right) $$

(1)

The transformation (1) is bijective:

$$\left\{\begin{array}{l} x_{i}+y_{i}=v_{i}\\ x_{i}-y_{i}=z_{i}\\ \end{array}\right. \text{if and only if}\; \left\{\begin{array}{l} x_{i}=(v_{i}+z_{i})/2\\ y_{i}=(v_{i}-z_{i})/2\\ \end{array}\right. $$

so there is no loss of information in this step. The difference *z*_{i}=*x*_{i}−*y*_{i} is called the (traffic) *asymmetry* and the sum *v*_{i}=*x*_{i}+*y*_{i} is called the (traffic) *volume*. Since *x*_{i}≥0 and *y*_{i}≥0, the inequality −*y*_{i}≤*z*_{i}≤*x*_{i} always holds. Due to limited space, in this paper we concentrate on applications of the asymmetry, but the same methodological framework can be applied to volumes as well. Figure 2 shows a scatter plot example of the transformation (1). The data in Fig. 2 consist of the first 10 000 available pairs \(\left (x_{i}^{(j)},y_{i}^{(j)}\right)\) from location *j*=Tampella.
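As a quick illustration, the transformation (1) and its inverse can be sketched in a few lines of Python (the counts below are made up for illustration only; they are not data from Fig. 2):

```python
import numpy as np

# Hypothetical counts per 15-minute slot at one location:
# x = vehicles entering the AoI, y = vehicles exiting.
x = np.array([120, 95, 210, 180])
y = np.array([100, 110, 150, 190])

z = x - y  # asymmetry
v = x + y  # volume

# The transformation is bijective: (x, y) is recovered from (z, v).
x_back = (v + z) // 2
y_back = (v - z) // 2
assert np.array_equal(x_back, x) and np.array_equal(y_back, y)
```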

If \(z_{i}^{(j)}=x_{i}^{(j)}-y_{i}^{(j)}>0\), the number of vehicles inside the AoI increased during the time slot *t*_{i} at location *j*, and if \(z_{i}^{(j)}<0\), the number of vehicles inside decreased. Thus, when the AoI is considered as a reservoir of vehicles, the asymmetry measures the excess or shortfall at location *j* during the time slot *t*_{i}. If \(z_{i}^{(j)}\approx 0\) then, no matter how large the volume \(x_{i}^{(j)}+y_{i}^{(j)}\) is, the total number of vehicles inside the AoI is essentially unaffected by the traffic at *j*. Another justification for this transformation is explained in Section 4.4. In [18] the asymmetry and volume transform was utilized with similar data but in a different problem set-up.

### 4.2 The Normal distribution as a baseline model

The empirical distributions of the asymmetries *z*_{i} (and volumes) are typically unimodal, symmetric, light-tailed and well approximated by the normal distribution. This is not a coincidence, since the observed asymmetry values *z*_{i} at any location *j* and at any time *t*_{i} can be considered as sums of a large number of possibly slightly correlated random variables with bounded small variances. The *Central Limit Theorem (CLT)* dictates that the distributions of the *z*_{i} should be approximately normal [19]. Therefore, the normal distribution will be used as *a baseline model* for the asymmetries.

In an automated analysis, the parameters of the normal distribution model must be estimated in a robust manner. The reason is that the CLT-based argument for the normal distribution model covers only non-mixed cases. The observed data also include observations that are mixed in the sense that any incident restricting traffic anywhere near a loop detector can increase or decrease the number of vehicles observed by the detector. Thus, the observed data are *a mixture* of *at least two* qualitatively different sources of randomness, and the use of ordinary sample means and sample variances is not justified for data from contaminated distributions; see [20] for further reasons. The next section discusses some technical details of the robust estimation used in the proposed framework.

### 4.3 Robust estimation of the parameters of the Normal distribution model

Robust statistics is a well-developed area of statistics. In [20] robust estimators are heuristically defined as follows. Robust estimators should be statistically *efficient* at the assumed model, *stable* in the sense that small deviations from the model assumptions impair the performance only slightly, and they should have a high *breakdown point*, meaning that somewhat larger deviations from the assumed model do not cause a catastrophe (see Section 1.2 in [20]). Any chosen robust estimator is always a compromise between these properties, as well as conceptual clarity and computational cost.

The sample median is a well-known robust estimate of the mean *μ* of the normal distribution. If the data vary symmetrically around *μ*, it is also reasonably efficient. Well-known robust estimators of the scale parameter *σ* (standard deviation) include *the interquartile range (IQR)*, which has the following justification. If *Φ*_{μ,σ} is the *cumulative distribution function (CDF)* of the normal distribution *N*(*μ*,*σ*^{2}) and *Φ*_{0,1} is the CDF of *N*(0,1), then the ratio of the IQRs of these distributions is

$$\frac{\Phi_{\mu,\sigma}^{-1}(0.75)-\Phi_{\mu,\sigma}^{-1}(0.25)}{\Phi_{0,1}^{-1}(0.75)-\Phi_{0,1}^{-1}(0.25)}=\sigma, $$

and, since \(\Phi _{0,1}^{-1}(0.75)-\Phi _{0,1}^{-1}(0.25)\approx 1.3489795\), an estimator of *σ* is simply IQR_{n}/1.3489795, where IQR_{n}=*Q*_{3}−*Q*_{1} is *the sample IQR*, the difference between the 3rd sample quartile *Q*_{3} and the 1st sample quartile *Q*_{1}. Again, symmetric variation around the mean is beneficial for the efficiency of this simple estimator. The 2nd sample quartile *Q*_{2} is the sample median.
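A minimal sketch of these robust estimates in Python (the contaminated sample is synthetic and the function name `robust_normal_params` is ours, not from the paper):

```python
import numpy as np

def robust_normal_params(data):
    """Robust estimates of (mu, sigma) for a normal baseline model:
    the sample median for mu, and IQR_n / 1.3489795 for sigma."""
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    return q2, (q3 - q1) / 1.3489795

# Contaminated sample: mostly N(0, 1), plus a few gross outliers.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 1000), [50.0, 60.0, -70.0]])
mu_hat, sigma_hat = robust_normal_params(data)
# The three outliers barely move the median- and IQR-based estimates,
# while they would inflate the ordinary sample standard deviation badly.
```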

Conceptually, three robust values, *the sample quartiles* *Q*_{1}, *Q*_{2} and *Q*_{3}, provide robust estimates of the two parameters *μ* and *σ* of the normal distribution, if the assumption of the normal distribution is valid and the data typically vary symmetrically around the mean. Moreover, the *quartile skewness*, defined as

$$\frac{ \frac{Q_{3}+Q_{1}}{2}-Q_{2}}{\frac{Q_{3}-Q_{1}}{2}}, $$

can be used to indicate a lack of symmetry on the quartile scale. In practice, this means that the fitted normal distribution typically fits the body of the empirical distribution well. There is deliberately no attempt to fit the normal model to the tails of the empirical distribution.
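The quartile skewness is straightforward to compute; a sketch (the data are synthetic and the function name is ours):

```python
import numpy as np

def quartile_skewness(data):
    """Quartile (Bowley) skewness: ((Q3 + Q1)/2 - Q2) / ((Q3 - Q1)/2).
    Near 0 for data symmetric on the quartile scale; bounded in [-1, 1]."""
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    return ((q3 + q1) / 2 - q2) / ((q3 - q1) / 2)

rng = np.random.default_rng(1)
sym = rng.normal(size=10000)       # symmetric -> value near 0
skewed = rng.exponential(size=10000)  # right-skewed -> clearly positive
```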

### 4.4 Robust estimation of the correlation

To understand the correlation (association, dependence) between the asymmetries of two locations, Spearman’s correlation coefficient is used as a complementary tool to the ordinary linear correlation coefficient. See [21] for the definition and Section 8.3 of [20] for the properties of Spearman’s correlation. Spearman’s correlation *ρ*_{S} is a measure of *monotone* correlation. If the data are possibly contaminated, it is better suited for data analysis than the linear correlation coefficient *ρ*. The sample version of *ρ*_{S} is denoted by *r*_{S}, and it is more robust against outliers than the sample version *r* of *ρ*. Linear correlation is a special case of monotone correlation, and if the correlation truly is linear, then usually *r*_{S}≈*r*. In that case there is a negligible bias towards 0 in *r*_{S}, and *r*_{S} has a slightly larger variance than *r*. In the binormal model case, this is straightforward to verify by simulation. If the assumption of linear correlation does not hold, the sample value *r* can be misleading, while the sample value *r*_{S} remains meaningful as long as monotone correlation, a much wider concept, is plausible. As long as we do not know whether there are bivariate outliers or contamination in the data, we place more trust in *r*_{S} than in *r*.
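The contrast between *r* and *r*_{S} under contamination can be demonstrated with a short simulation (a sketch; the data are synthetic, and Spearman's coefficient is computed directly as the Pearson correlation of the ranks, which ignores ties and is fine for continuous data):

```python
import numpy as np

def spearman_r(a, b):
    """Sample Spearman correlation r_S: Pearson correlation of the ranks.
    (No tie handling; adequate for continuous simulated data.)"""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.8 * x + 0.6 * rng.normal(size=500)   # true linear correlation 0.8
# Contaminate with three bivariate outliers that oppose the trend.
x_c = np.concatenate([x, [10.0, 12.0, -11.0]])
y_c = np.concatenate([y, [-10.0, -12.0, 11.0]])

r = np.corrcoef(x_c, y_c)[0, 1]   # sample linear correlation r: collapses
r_s = spearman_r(x_c, y_c)        # sample Spearman r_S: barely affected
```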

The simplest *assumption* for the main cause of the correlation is that the same vehicles are observed in two places, first entering and then exiting. The 15-minute slot is sufficiently long for a vehicle to enter and exit the AoI during the same slot at any two locations *j* and *k*. This specific kind of causality is the target of estimation. Note that an alternative explanation, in which the entering vehicles are entirely different from the exiting ones, is always possible. However, to be a significant and regular phenomenon, such coordinated common behaviour would require a more complicated explanation.

However, there are several other causes of correlation between any two traffic streams. These include effects that appear in the daily profile at every measurement point, for example: silent hours at night (00:00–06:00), rush hours around 8:00 and 16:00 on working days, and the relatively silent moment just before the early lunchtime at 11:00 on working days. In these cases the numbers of observed vehicles per slot increase or decrease everywhere, and this shows up as a consistently positive correlation in pairs like \(\left (x_{i}^{(j)},x_{i}^{(k)}\right)\), \(\left (y_{i}^{(j)},y_{i}^{(k)}\right)\) and \(\left (x_{i}^{(j)},y_{i}^{(k)}\right)\). This happens even in cases where the number of common vehicles at the two locations, in the directions in question, must be practically zero. With the 15-minute granularity these effects are practically simultaneous. Ideally, our causal assumption can be expected to produce linear correlations. However, this is not necessarily true for the other causes. Still, the other common causes listed above typically increase or decrease traffic amounts everywhere, so their combined effect should remain monotone. This further justifies the use of *r*_{S} to complement *r*.

Another justification for the transformation (1) can now be expressed as follows: while the other common causes of correlation affect both *x*_{i} and *y*_{i}, their effect diminishes when the difference *x*_{i}−*y*_{i} is considered and increases when the sum *x*_{i}+*y*_{i} is considered. That is, the differences *x*_{i}−*y*_{i} are less affected by the other common causes, and they are easier to use when the assumed cause of the correlation is studied. Therefore, we use the values

$$ r_{S}\left(x_{i}^{(j)}-y_{i}^{(j)},x_{i}^{(k)}-y_{i}^{(k)}\right) $$

(2)

with different locations *j* and *k*.

A negative correlation in (2) tells something about the dynamics of the traffic. For example, if there is an asymmetric burst of traffic coming in at Tampella then, simultaneously, there is likely an asymmetric burst of traffic going out at Santalahti, and *vice versa*. This holds independently of the time of day or the day of the week. It is plausible to assume that, whenever significantly non-zero, this correlation is caused by detecting some amount of the same vehicles at these locations during the same time slot. In [18] the asymmetry and volume were applied in a different context, and the directions were chosen so that a positive correlation was targeted.

The correlation matrices can be estimated between the asymmetries of the different locations. Moreover, *Spearman’s rank correlation test*, see [21], can be added to the estimation process so that the null hypothesis of independence of the asymmetries of two locations can be tested. If there is no evidence to reject the null hypothesis *ρ*_{S}=0, that is, if *r*_{S}≈0 with a large enough sample size *n*, then we set *r*_{S}=0. The correlation matrix is symmetric with diagonal values 1.
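A sketch of such a thresholded Spearman correlation matrix, using the large-sample normal approximation z = *r*_{S}·√(n−1) for the rank correlation test in place of the exact null distribution (a simplifying assumption of ours; the function and variable names are also ours):

```python
import numpy as np

def thresholded_spearman_matrix(data, z_crit=1.959963985):
    """Spearman correlation matrix between the columns of `data`
    (rows = time slots, columns = locations). Off-diagonal entries
    that fail the large-sample test |r_S| * sqrt(n - 1) >= z_crit
    are set to 0; the diagonal stays 1."""
    n, p = data.shape
    ranks = np.argsort(np.argsort(data, axis=0), axis=0)
    r = np.corrcoef(ranks, rowvar=False)
    off_diag = ~np.eye(p, dtype=bool)
    r[(np.abs(r) * np.sqrt(n - 1) < z_crit) & off_diag] = 0.0
    return r

# Three synthetic 'locations': columns 0 and 1 correlated, column 2 independent.
rng = np.random.default_rng(4)
a = rng.normal(size=(500, 3))
a[:, 1] = 0.8 * a[:, 0] + 0.6 * a[:, 1]
M = thresholded_spearman_matrix(a)
```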

### 4.5 Empirical conditional expectation

For integer-valued random variables *U* and *V*, and an integer *a* with \(\mathbb {P}\{V=a\}>0\), the conditional expectation \(\mathbb {E}(U|V=a)\) can be computed as

$$\begin{array}{*{20}l} \mathbb{E}(U|V=a) &=\sum_{u=-\infty }^{\infty} u \left(\frac{\mathbb{P}\{U=u,V=a\}}{\mathbb{P}\{V=a\}}\right). \end{array} $$

Analogously to this, assuming that \(\mathbb {P}\{V>a\}>0\), define

$$\begin{array}{*{20}l} \mathbb{E}(U|V>a) &=\sum_{u=-\infty }^{\infty }\!\! u\! \left(\frac{\mathbb{P}\{U=u,V>a\}}{\mathbb{P}\{V>a\}}\right)\\ &=\sum_{u=-\infty }^{\infty }\!\! u\! \left(\frac{\sum_{v>a} \mathbb{P}\{U=u,V=v\}}{\mathbb{P}\{V>a\}}\right). \end{array} $$

The last formula can be used to compute *a sample version* \(\mathbb {E}_{n}(U|V>a)\) as

$$ \frac{1}{\mathbb{P}_{n}\{V>a\}} \sum_{i=1}^{n} u_{i} \left(\sum_{v_{i}>a} \mathbb{P}_{n}\{U=u_{i},V=v_{i}\}\right), $$

(3)

where (*u*_{i},*v*_{i}), *i*=1,…,*n*, is the bivariate sample of size *n*. The formula for \(\mathbb {E}_{n}(U|V\leq a)\) is obtained similarly. Two sums are involved, so some computation is required; the computational complexity is \({\mathcal {O}}(n^{2})\). The sample estimates of the joint probability mass function of the pair (*U*,*V*) and of the sample CDF of *V* are needed. The robustness properties of (3) should be better than those obtained by conditioning on the event {*V*=*a*}, but this is out of the scope of this publication.

The sample version (3) is computed for all *v*_{min}<*a*<*v*_{max}, where *v*_{min} and *v*_{max} are the minimum and maximum observed values. Typically *n*<*v*_{max}−*v*_{min} so there is implicit *interpolation* included in the map \( a\mapsto \mathbb {E}_{n}(U|V> a)\) since *a* need not be an observed value. This map defines a piecewise constant curve. If *U* and *V* are independent, it follows from (3) that \(\mathbb {E}_{n}(U|V>a)=\mathbb {E}_{n}(U)\) for all *a*.
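For illustration, the sample version (3) reduces to averaging the *u*_{i} over the indices with *v*_{i}>*a*; a sketch (the tiny integer sample below is made up):

```python
import numpy as np

def cond_mean_above(u, v, a):
    """Sample conditional expectation E_n(U | V > a): the mean of the
    u_i over the indices i with v_i > a (equivalent to formula (3))."""
    mask = v > a
    if not mask.any():
        raise ValueError("P_n{V > a} = 0: no observations with v_i > a")
    return u[mask].mean()

u = np.array([1, 2, 3, 4, 5])
v = np.array([0, 1, 2, 3, 4])
# The piecewise-constant curve a -> E_n(U | V > a); a need not be observed.
curve = [cond_mean_above(u, v, a) for a in range(0, 4)]
```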

#### 4.5.1 The multinormal model

Given a binormal or trinormal model for the asymmetries at two or three locations, formulae for the conditional expectations (4) and (6) and, especially, for their variances can be computed analytically. These are provided in this section. The formulae (5) and (7) for the conditional variances are important since they quantify the confidence limits and, therefore, allow the automatic detection of observations that do not fit the (multi)normal models.

Linear regression, which is based on the conditional expectation \(\mathbb {E}(U|V=a)\) with multinormal models, is discussed, for example, in Section 4.3 of [22], in Section 11.3 of [23] and in Chapters 4 and 7 of [24]. Conditioning on an event of the type {*V*>*a*} is rarer, but it is used at least in [25] in the context of economic self-selection models. Similar mathematical formulae also appear in the context of *truncated* distributions [26–28], since truncating a distribution is equivalent to conditioning on an interval.

Assume (*Z*_{1},*Z*_{2}) is binormally distributed with the mean vector *μ*=(*μ*_{1},*μ*_{2}), variances \(\sigma _{1}^{2}>0\), \(\sigma _{2}^{2}>0\) and correlation −1<*ρ*<1. If \(a\in \mathbb {R}\), then

$$ \mathbb{E}(Z_{1}|Z_{2}>a)= \mu_{1}+\sigma_{1}\left(\frac{\rho\,\phi(\alpha)}{1-\Phi(\alpha)}\right), $$

(4)

in which, to simplify the notation, we define \(\alpha =\alpha (a)=\frac {a-\mu _{2}}{\sigma _{2}}\) for all \(a\in \mathbb {R}\). The formula for the conditional variance is

$$ \text{Var}(Z_{1}|Z_{2}\!>\!a)\,=\, \sigma_{1}^{2} \!\!\left[\!1\,+\,\frac{\rho^{2}\alpha\,\phi(\alpha)}{1\,-\,\Phi(\alpha)}\,-\,\! \left(\!\frac{\rho\,\phi(\alpha)}{1\,-\,\Phi(\alpha)}\!\right)^{\!\!2}\right]. $$

(5)
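A numerical sketch of (4) and (5), checked against Monte Carlo simulation (the parameter values are arbitrary and the function name `binormal_tail_moments` is ours):

```python
import math
import numpy as np

def phi(t):  # standard normal pdf
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def Phi(t):  # standard normal cdf
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def binormal_tail_moments(mu1, mu2, s1, s2, rho, a):
    """E(Z1 | Z2 > a) and Var(Z1 | Z2 > a) via formulas (4) and (5)."""
    alpha = (a - mu2) / s2
    lam = phi(alpha) / (1 - Phi(alpha))  # inverse Mills ratio
    mean = mu1 + s1 * rho * lam
    var = s1 ** 2 * (1 + rho ** 2 * alpha * lam - (rho * lam) ** 2)
    return mean, var

# Monte Carlo check with mu = (0, 0), sigma = (1, 1), rho = 0.6, a = 0.5.
rng = np.random.default_rng(3)
z2 = rng.normal(size=200000)
z1 = 0.6 * z2 + math.sqrt(1 - 0.36) * rng.normal(size=200000)
sel = z1[z2 > 0.5]
m, v = binormal_tail_moments(0.0, 0.0, 1.0, 1.0, 0.6, 0.5)
# sel.mean() and sel.var() should agree with m and v up to simulation noise.
```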

Assume (*Z*_{1},*Z*_{2},*Z*_{3}) is trinormally distributed with the mean values *μ*_{i}, variances \(\sigma _{i}^{2}>0\), *i*=1,2,3 and pairwise correlations *ρ*_{ij}, *i*<*j*. The conditional expectation \(\mathbb {E}\left (Z_{1}|Z_{2}=z_{2}, Z_{3}=z_{3}\right)\) can be computed as

$$\begin{array}{@{}rcl@{}} \sigma_{1}\left[\frac{(\rho_{12}-\rho_{13}\rho_{23})(z_{2}-\mu_{2})}{(1-\rho_{23}^{2})\sigma_{2}}\right.\\ +\left.\frac{(\rho_{13}-\rho_{12}\rho_{23})(z_{3}-\mu_{3})}{(1-\rho_{23}^{2})\sigma_{3}}\right]\!+\mu_{1}. \end{array} $$

(6)

The conditional variance Var(*Z*_{1}|*Z*_{2}=*z*_{2},*Z*_{3}=*z*_{3}) can be computed from

$$ \sigma_{1}^{2}\left(1-\frac{\rho_{12}^{2}+\rho_{13}^{2}-2 \rho_{13} \rho_{23} \rho_{12}}{1-\rho_{23}^{2}}\right). $$

(7)
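Formulas (6) and (7) translate directly into code; a sketch (function name ours). Setting *ρ*_{13}=*ρ*_{23}=0 reduces (6) and (7) to simple linear regression of *Z*_{1} on *Z*_{2}, which gives an easy sanity check:

```python
def trinormal_cond(mu, sigma, rho12, rho13, rho23, z2, z3):
    """E(Z1 | Z2 = z2, Z3 = z3) and Var(Z1 | Z2 = z2, Z3 = z3)
    via formulas (6) and (7)."""
    mu1, mu2, mu3 = mu
    s1, s2, s3 = sigma
    d = 1 - rho23 ** 2
    mean = mu1 + s1 * ((rho12 - rho13 * rho23) * (z2 - mu2) / (d * s2)
                       + (rho13 - rho12 * rho23) * (z3 - mu3) / (d * s3))
    var = s1 ** 2 * (1 - (rho12 ** 2 + rho13 ** 2
                          - 2 * rho12 * rho13 * rho23) / d)
    return mean, var

# Sanity check: rho13 = rho23 = 0 makes conditioning on Z3 irrelevant,
# so the result is mu1 + rho12 * z2 and sigma1^2 * (1 - rho12^2).
m, v = trinormal_cond((0, 0, 0), (1, 1, 1), 0.5, 0.0, 0.0, 2.0, 7.0)
```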