Mining User Behaviour from Smartphone Data: A Literature Review

To study users' travel behaviour and travel time between origin and destination, researchers employ travel surveys. Although there is consensus in the field about their potential, after more than ten years of research and field experimentation, smartphone-based travel surveys have still not taken off at a large scale. In these surveys, computer intelligence algorithms take the role that operators have in traditional travel surveys; since each algorithm is trained on data, performance rests on data quality, and thus on the ground truth. Inaccurate validations negatively affect labels, the algorithms' training, and the precision of the travel diaries, and therefore data validation itself, in a very critical loop. Interestingly, these boundaries prove burdensome to push even for machine learning methods. To support optimal investment decisions by practitioners, we expose the drivers they should consider when assessing what they need against what they get. This paper highlights and examines the critical aspects of the underlying research and provides recommendations: (i) from the device perspective, on the main physical limitations; (ii) from the application perspective, on the methodological framework deployed for the automatic generation of travel diaries; (iii) from the ground truth perspective, on the relationship between user interaction, methods, and data.


Introduction
Travel surveys capture an essential aspect of user behaviour. Such a knowledge base enables the development of user behaviour models that support planning, design and policy making, with the aim of delivering the best possible user experience through the transport network infrastructure and the transport system [1].
In Denmark, the National Travel Survey (TU, Transportvaneundersoegelsen) has collected data about travel behaviour since 1975. The Center for Transport Analytics at the Technical University of Denmark is now running the latest version of the survey. To remain statistically representative of the whole Danish population and up to date, TU requires the collection of multiple new interviews every day of the year [2], totalling an average of 12,000 interviews per year since 2010 [3].
Since the introduction of the first generation of smartphones equipped with an Assisted Global Positioning System (AGPS) sensor, the research community has considered Smartphone-based Travel Surveys (SBTS) a promising platform. The first experiment, run in Japan in 2005, has been attributed to a Personal Digital Assistant (PDA) [4]. In 2007, Samsung announced the i550, its first smartphone equipped with AGPS. To compare SBTS with Traditional Travel Surveys (TTS) similar to TU, researchers have produced a large body of literature since 2004 [5,4,6]. In this review, we consider two main threads:
1. The data preparation techniques and the machine learning methods used for mining user behaviour. Essentially, these allow the automatic collection of ground truth about user route choices and travel time variations.
2. The technologies and applications contributing to remote sensing and data collection. These extend the domain from sensor fusion within the smartphones' protocol stack (e.g. with AGPS and Smartphone Location Services) towards the Internet of Things.
Having in mind the needs of the subject responsible for the travel survey, we want to highlight opportunities as well as limitations, both old and new.
Therefore, the present review focuses on providing a qualitative analysis to help answering the following research questions: (i) What are the main Machine Learning (ML) methods in the field? (ii) What is the ground truth? (iii) What is the ground truth relationship with the ML methods? (iv) Which are the main data sets studied? (v) What characteristics do they have? (vi) What are the features we can extract from these data sets, and how can we extract them? (vii) What are the challenges for ML in the field of smartphone-based travel surveys?
To tackle the above questions we proceeded by snowballing, forward first and then backwards [29]. That is, in the first step we looked at the most cited literature in the field, while in the second step we looked at the references found within the literature harvested in the former step. The papers have been selected to cover deterministic and machine learning methods based on different typologies of data sets. We aimed at data sets collected in various geographical areas. Regarding models and algorithms, we looked at how they exploited various categories of data sources, such as GPS, Inertial Navigation Systems (INS), Geographic Information Systems (GIS), and the Internet of Things. Moreover, we paid further attention to the underlying variables, distinguishing between location- and person-agnostic or -specific ones. For example: (i) speed is a location- and person-agnostic variable; (ii) the distance between a GPS position and a bus station is a location-specific variable; (iii) the travel history of a user is a person-specific variable.
The paper begins by providing definitions for the main concepts in Sec. 2.
Sec. 3 provides the background of the TS field of research and application.
Sec. 4 provides an overview of the technology enabling the data collection, highlighting potential and constraints.
Sec. 5 presents the techniques relevant for data preparation and feature extraction. In particular, we want to introduce their potential impact on the subsequent assessment of users' route choices and travel times. To detect and compare travel time variations and user/context-specific relationships, we provide a distillation of ML methods for mining user behaviour from smartphone data.
To generate travel diaries automatically, these methods target why one travels, where on the transport network, and using which transport mode. Thereby, we review purpose imputation, map-matching, and mode detection methods.

Figure 1: Tour Components.
We conclude by discussing what relevant data sets are, and how their features could contribute to understanding the performance of an SBTS, and the underlying ML methods, including the challenges that future research might face to boost advancements in the field (see Sec. 6).

Definitions
This section defines the main terms used. We pay particular attention to aligning with current terminology describing users' journeys (see Fig. 1).

Definition 2.1. Tour. An aggregation of trips, such that the user's travel starts and ends at the same place, e.g. at home [2].

Definition 2.2. Trip. A travel entity identified by a set of attributes such as: start location, start time, purpose, transport mode, arrival time, arrival location. A trip can be composed of multiple trip legs [2].

Definition 2.3. Trip Leg. Also identified as a trip segment, it captures, e.g., intermediate short stops for purposes such as pick-up, drop-off, transfer or change of transportation mode. Each trip segment presents start and end times and locations, as well as the purpose of the intermediate stop [2,25].

Definition 2.4. Trip Purpose. Often identified with "activity", the trip purpose represents what triggers the trip from origin to destination. Normally the purpose is related to something to do at the destination, e.g. work, shopping, meeting, picking up or dropping off (somebody or something), eating, education, socialisation, exercise, second home, etc. (see Fig. 1, A, B, C, D).

Definition 2.5. Stop Purpose. It can be reduced to two categories. The first group represents the stops whose purpose is changing transportation mode (see Fig. 1, B). The second represents the stops whose purpose is performing the activity which triggered the trip (see Fig. 1, A, C, D).

Definition 2.6. Transport Mode. It refers to a trip leg and identifies the mode used to get from one end of the segment to the other, e.g. walking, bicycling, car, train, bus, light rail, etc. [2,30].

Definition 2.7. Mode Chain Type. The literature provides no strict consensus on this term. [2] provides an extensive list including: walk only, bicycle only, driver of passenger car, driver of other vehicle, car passenger, passenger in other vehicle (e.g. taxi, van, etc.), aeroplane, other (e.g. ferry, boat, horse), train only (including multiple trains), bus only (including multiple buses), train-bus in combination, train-bus in combination with bicycle, and train-bus in combination with car. In consideration of the car-sharing business present in the Copenhagen area and spreading across the main capitals worldwide, we might also consider train-bus-car in combination with bicycles and scooters.
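As an illustration, the hierarchy of tour, trip, and trip leg in Definitions 2.1-2.3 can be sketched with simple data structures; the field names below are ours and do not reproduce any specific survey schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class TripLeg:
    """A trip segment (Def. 2.3) with its own times, locations and mode."""
    start_time: datetime
    end_time: datetime
    start_location: Tuple[float, float]   # (latitude, longitude)
    end_location: Tuple[float, float]
    transport_mode: str                   # Def. 2.6, e.g. "walk", "bus"
    stop_purpose: Optional[str] = None    # Def. 2.5: mode change or activity

@dataclass
class Trip:
    """A trip (Def. 2.2), possibly composed of multiple legs."""
    purpose: str                          # Def. 2.4, e.g. "work", "shopping"
    legs: List[TripLeg] = field(default_factory=list)

    @property
    def start_time(self) -> datetime:
        return self.legs[0].start_time

    @property
    def arrival_time(self) -> datetime:
        return self.legs[-1].end_time

@dataclass
class Tour:
    """An aggregation of trips starting and ending at the same place (Def. 2.1)."""
    trips: List[Trip] = field(default_factory=list)

    def is_closed(self, tol_deg: float = 0.001) -> bool:
        # Compare the first departure and the last arrival location.
        first = self.trips[0].legs[0].start_location
        last = self.trips[-1].legs[-1].end_location
        return (abs(first[0] - last[0]) <= tol_deg
                and abs(first[1] - last[1]) <= tol_deg)
```

A travel diary (Def. 2.9) would then be little more than a dated collection of such tours, which is what the validation interfaces discussed later present to the user.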
[41] introduces the concept of "acceptable truth". We believe that the acceptable truth, rather than the theoretical ground truth, better represents the above definitions (i), and (ii).
Regarding map-matching, [42], [43], and [44] refer to a "hand match" process correcting visible errors in the output of the map-matching algorithm, or in the map itself. When the map-matching algorithms focus on public transportation, the ground truth can be extracted from the network: for example, [45] and [46] focus on buses and extract the ground truth from both the bus stop and intersection networks. In the case of synthetic data, the ground truth can be randomly selected among a number of alternatives. For example, [33] defines the ground truth of a synthetically generated trajectory between an origin and a destination as a random selection within a set of alternative shortest paths. In map-matching applications, there are studies using data gathered from an external receiver collecting "GPS traces with high sampling rate" [47,48].
Regarding mode detection studies, [11] gives an overview of various ground truth definitions: (i) prompted recall surveys, (ii) user input on mobile phones, (iii) travel diaries, (iv) experiments (e.g. with known mode). Regarding purpose imputation studies, [11] lists: (i) travel diaries, (ii) prompted recall surveys. [49] refers to the trips reported in situ by the users participating in the experiment. [50] refers to "counts" extracted from video recordings collected in the area relevant for the experiment. Finally, there are several cases where the authors do not mention what the ground truth is [11], or they mention scenarios where it is lacking [51,24,13].
Definition 2.9. Travel Diary. It can focus on "one day" or on "multiple days", and it describes the user's journeys through attributes such as: (i) date(s), (ii) trips of the day, (iii) destination of each trip, (iv) primary target destination (see Fig. 1, C), (v) purpose (see Fig. 1).

Background
Human decisions may be determined by different factors, such as (i) biology, e.g. endocrinology, genetics; (ii) society, e.g. culture, gender, religion, wealth; (iii) mental capacity, e.g. IQ or cognition [52]. [53] shows that user behaviour depends on perception, context, environment, prior knowledge, and interaction with others, concluding that human behaviour should be modelled taking into account the context in which users interact. "These contexts are sometimes referred to as spatial" [52], temporal [54], "personal and social aspects, or user context in context aware systems" [52]. To this extent, TTS are designed to observe the external effects of user decisions, based on the context in which one interacts. To find correlations between user behaviour and context, a multitude of methods aim at exploiting the data TTS generate. The notion of context here can be extended: in a mode choice model, we see variables such as travel time or cost as part of the context [55].
In a TTS, trained operators are actively interviewing users, interacting with them to collect the ground truth about trips, purpose of each trip, transport mode-chain, route choice, and much more. The operators' job is crucial for the quality of the data collected. Any error in the data collection process could undermine the research based on the collected data.
In Denmark, with a population of about 6 million, the National Travel Survey (TU) collects 20,000 interviews per year in total. The interviews are designed to cover each of the 365 days of the year (temporal representativeness) by focusing on a single day per interview and applying a stratified sampling containing more than 200 strata. The sample represents 2 genders, around 8 age groups between 10 and 84 (socio-demographic representativeness), and approximately 13 geographical groups (spatial representativeness). The information about the trips contains origins and destinations, purpose, mode (both non-motorised and motorised are considered), and temporal information for each trip [3].

Pioneering Smartphone Based Travel Surveys
Within the last 20 years, TTS methods have been subject to the pressure of disruptive technological evolution. The large penetration of smartphone devices equipped with low-cost sensors, and the introduction of Web 2.0 and its implications [56], such as Big Data, could mark a tipping point for this research area too, which is constantly pushing towards full context awareness [52,25,26]. In this regard, abundant academic literature assesses the impact of new technologies. We highlight the introduction of Global Positioning System (GPS) loggers in the '90s and, later, the large penetration of sensor-rich smartphones integrated in high-performance communication networks (3G, 4G, soon 5G) [57,58,59,60].
The reasons to complement and/or substitute TTS with smartphone-based technology are: (i) statistical representativeness, which is improvable or decreasing in some of the sampled strata [10]; (ii) a trend of unreported short trips, which users tend to forget or do not want to mention [27]; (iii) undetected behaviour variations of the same user, due to the design of traditional surveys, which collect a cross-sectional sample of the population, normally focusing on one single day for each respondent [23,24].
Looking only at the information on route choice and travel variability described in (iii), we can see that detecting behaviour variations within the same user would require extending each interview's time scope from 1 day to N days per user. Consequently, each interview would require an amount of time proportional to N, impacting negatively both the resources necessary for the task and the percentage of expected rejections, the latter due to the increased burden of such a survey design on each user. Moreover, even assuming this were possible, we should always keep in mind that respondents are likely to fail to recall important details of travel decisions if they refer back to trips too old to be remembered, and when these decisions deviate from their normal pattern [61].
To the best of our knowledge, all the above SBTS user interfaces have been designed assuming that users would either validate accurately generated travel diaries or correct the errors of the wrong ones. Nonetheless, by excluding from the interface any option, and related button, allowing users to report mistakes in the diaries, which are sometimes likely to be challenging to correct, the risk of including such incorrect data within the ground truth seems unavoidable for the time being. Whether the impact of this possibility is significant or not should be assessed with field research.

Mining user behaviour from smartphone data
Building upon the work of [52], we highlight the following main progress drivers of SBTS: (i) low-cost sensors; (ii) support to developers through the increasing availability of Software Development Kits (SDKs), Application Programming Interfaces (APIs), Machine Learning methods and GIS accessibility; (iii) the introduction of application stores for the distribution of developed applications on a worldwide scale (e.g. Apple App Store and Google Play); (iv) Graphics Processing Units (GPUs) and cloud processing power.
The introduction of smartphones in the field of travel surveys, similarly to what has been reported in other fields [67], shifts the data generation model from the subject running the survey to the participating user. However, in order to collect high quality data while shifting from the Person-to-Person (P2P) interaction of traditional surveys to SBTS, the user/respondent needs to be supported in different but equally effective ways.
The paradigm arising from the introduction of the technologies described above, where the data mining platform generates the travel diaries automatically from the smartphone data, has led to machine learning and artificial intelligence taking over the role of the operator: they provide predictions that the user has to validate or eventually correct, allowing the collection of the ground truth about the recorded trips.
For the subject running the smartphone-based survey, the new process, carried out via a dedicated mobile application, involves the following steps: (i) collecting most of the travel data passively, (ii) generating the users' travel diaries automatically, (iii) submitting the travel diaries to the user, (iv) collecting the user's validation and/or amendments (the ground truth).
For the user participating in the smartphone-based survey, the new process involves the following steps: (i) installing a survey application on her smartphone and authorising it for data collection, (ii) accessing the application regularly to review, validate and/or amend the travel diaries generated by the survey app.
The automatic generation of such travel diaries is based on machine learning algorithms relying on the background data collected to infer: (i) travel purpose, (ii) transportation mode chains, (iii) route choices (map-matching). These methods in turn rely on other methods targeting: (i) stop detection, (ii) trip segmentation. Each of the above families of methods can be implemented using multiple machine learning approaches, e.g. (i) Discriminant Analysis (Gaussian), (ii) Bayesian Networks, (iii) Hidden Markov Models, (iv) Support Vector Machines, (v) Decision Trees, (vi) Random Forests, (vii) Hierarchical Thresholds, (viii) Fuzzy Logic, (ix) Neural Networks, (x) various clustering and classification techniques (e.g. k-means and K-Nearest Neighbours) [68,52]. Depending on the ML method, better performance can be achieved through, e.g.: (i) feature extraction techniques [69]; or (ii) hyperparameter selection [70].
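As a minimal illustration of one of these families, the hierarchical-threshold approach (vii) can be sketched as a cascade of rules over location- and user-agnostic features. The thresholds below are illustrative guesses of ours, not values taken from the cited studies, which learn or calibrate them from labelled data.

```python
def classify_mode(median_speed_mps: float, p95_accel_ms2: float) -> str:
    """
    Toy hierarchical-threshold mode classifier using only location- and
    user-agnostic features. Thresholds are illustrative assumptions.
    """
    if median_speed_mps < 2.0:          # typical walking speed is ~1.4 m/s
        return "walk"
    if median_speed_mps < 7.0:          # cycling speed range
        return "bicycle"
    # Motorised modes: separate rail from road by acceleration smoothness.
    if p95_accel_ms2 < 0.8:
        return "train"
    return "car/bus"

print(classify_mode(1.3, 0.5))   # walking-like profile
print(classify_mode(5.0, 1.0))   # cycling-like profile
print(classify_mode(20.0, 0.4))  # fast and smooth, rail-like profile
```

A real implementation would learn such thresholds from validated diaries, or replace the cascade with one of the other listed families (e.g. a Random Forest over the same features).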
Iteration after iteration, user after user, the algorithms should increase their accuracy. Higher ML performance enables a higher quality of the ground truth collected from the survey, while reducing its burden on the user and thereby facilitating longitudinal data collection, which targets behavioural variation within the same user.
Methods able to handle large amounts of location data, with its inherent noise, have huge value [10,71]. "The discovery of certain mobility patterns from the big data offers us an opportunity to identify the links between microscopic individual choices and emergent macroscopic behaviours and to re-examine the decision rules used to model travel related choices." [19]. A key in this process is the enrichment of such location data sets with contextual information. In SBTS, essential components are user validation as well as external data sources (e.g. transport level of service, points of interest, accompanying travellers). A properly validated travel diary data set is thus valuable for investigating traveller behavioural and systemic patterns, such as: (i) choice models [55]; (ii) driving style [72]; (iii) travel time [73]; (iv) road safety [72]; (v) traffic congestion [74,75,76]; (vi) or even driver emotions [77].

Impact of interaction design and battery life of smartphones
Within the scope of smartphone-based travel surveys, there are also other, hitherto less obvious but important, implications of moving to this new technology, such as the following. (i) User interaction: simplicity and intuitiveness of the interaction design should reduce any potential distraction of the user while interacting with the survey application [10], as distractions could impact the quality of the data collected. Moreover, when the interaction is directed at amending inaccurate predictions of the algorithms involved in the survey, the impact of the interaction design on the quality of the ground truth collected from the respondents is even greater. A poor interaction between the user and the interface of an SBTS could trigger a catastrophic loop in which the user validates wrong predictions instead of correcting them. Since the quality of the predictions derives from the quality of the ground truth, in this scenario the SBTS would be a failure [78,79]. (ii) Device performance and battery life: any limitation of the phone's performance caused by the SBTS should be avoided through the efficient use of the sensors and of the data collection background process [10].

Smartphone capabilities and physical limitations for data validation
From the perspective of smartphone users, the most evident features of any SBTS emerge from the application installed on their device. The user interacts with the smartphone day after day while carrying it around. Thus, any personal perception is constrained by the experience coming from such an interaction.
According to [80], the main drivers determining the decision of a user to keep applications on his or her device are (i) the information conveyed through them, (ii) the ease of use, (iii) the perceived usefulness, (iv) the perceived risks and (v) the general satisfaction with the User Experience (UX). In (v) we include a broad and very relevant field of research which we will not discuss in this paper. However, we note that there is consensus about the concerns deriving from the negative impact of smartphone battery consumption on the UX. We observe the same consensus about battery concerns in the literature focusing on SBTS; in the latter case, the negative impact is on the quality of the data collectable with an SBTS. Although this conclusion might look trivial, there is a hierarchy of impacts on smartphone multi-sided platforms, where the platform owners, e.g. Apple and Google, allow independent developers to distribute native applications via app stores. For example, any battery optimisation strategy enforced by the Operating System (OS) providers takes priority over any strategy implemented by the application developers admitted to such multi-sided platforms. There is no exception for those who design and develop an SBTS. When it comes to SBTS, which require the protracted use of energy-intensive sensors and computer intelligence models, there are situations where the need for high resolution data clashes with the need for battery efficiency enforced by smartphone platform providers.
In Fig. 2 we present the abstraction of a smartphone-based travel survey platform. The main components are the client side, the App (A), and the server side, or back-end (B), of the application. The client (see Fig. 2, A) is specialised in allowing the human interaction (see Fig. 2, A.1). It might include computer intelligence algorithms necessary for the automatic generation of travel diaries (see Fig. 2, A.2). The client side can also be specialised in handling the data generated by sensors (e.g. location), computer intelligence models or users involved in the travel diary validation, ensuring persistence, preventing loss of information and maximising privacy by processing the data locally (see Fig. 2, A.3). Last but not least, a battery efficiency layer is responsible for tuning and optimising, e.g., data sampling or network I/O operations between client, server or external data sources (e.g. GIS or digital maps). The sensory system of the platform is the smartphone device, represented by:
• the main relevant hardware components (see Fig. 2, OS.5);
• the services exposed by the Operating System (see Fig. 2, OS, OS.1-OS.3);
• the operations beyond users' and developers' influence, such as those focusing on extending the battery life of the device (see Fig. 2, OS.4).
In the application field of SBTS, the energy consumption derives from the following drivers [82,83,69]:
1. GPU and screen: although these are among the most energy-hungry components, fortunately they are used only when the user interacts explicitly with the app (e.g. when validating previous observations, browsing his or her own data, or answering a required survey question).
2. CPU: an energy-hungry component, it can receive a high load of computations from one or more computer intelligence models, e.g. for mode classification. However, used properly, it can improve the general energy efficiency of the application. For example, the computation necessary to detect whether conditions allow switching off unnecessary sensors might require an amount of energy below the amount saved by switching off those sensors. One can also off-load tasks to the server (e.g. parts of the data analysis). Of course, off-loading implies transmitting data, which has its own energy cost.
3. AGPS: Assisted GPS is used extensively in smartphones, and in any GPS-capable mobile phone. The difference with standard GPS devices is that, while those depend exclusively on satellites to detect the position of the device, AGPS also uses cell tower data. This feature is particularly convenient when the GPS signal is weak or disturbed, but it also introduces challenges in the accuracy of the position. There is consensus about the high energy consumption of this fundamental sensor. The literature is rich in studies presenting effective strategies to provide the location of a smartphone while reducing the amount of time in which the AGPS is active [84,85,86,87,88]. For example, Apple iOS makes sure that developers can access the location information, but the control of the location updates is constrained by an API which involves the orchestration of multiple sensors and possibly a computer intelligence algorithm. In this way, accurate location information may require less GPS and more WiFi, for example indoors or in high WiFi-density areas.
Finding the best trade-off between location accuracy, resolution and energy consumption is not trivial. Interestingly, we observe a convergence between approaches developed for the OS, to improve the energy efficiency of smartphones, and for data mining, to fill data gaps coming from missing or highly uncertain GPS observations. Both provide location coordinates, reducing the need for the GPS sensor by leveraging data from INS, GIS, and the telecom network. Nevertheless, current smartphone OSs do not allow access to telecom-network data from independent applications such as SBTS.
4. Network: a fundamental component by design, ensuring the communication between client and server, and enabling off-loading strategies for computationally intensive operations. Data generated by the application and validated by the user also need off-loading to the back-end. Fine tuning of the data transfer strategies between front- and back-end is not optional. For example, important energy savings depend on finding optimal thresholds for handling: network selection (cellular or WiFi), data transfer frequency, battery status, or size of the data transfer.
5. Accelerometer, Gyroscope, Magnetometer: unlike the GPS, these sensors' raw data is accessible to developers on the main OS platforms. The literature presents multiple interesting strategies aiming at reducing the GPS up-time, involving local or remote CPUs besides the raw data generated by these sensors. The energy efficiency improvements reported suggest that such strategies, although apparently complex, are less energy hungry than the GPS sensor, even when involving multiple hardware components [84,85,88,89,87,90].
6. Beacon and Bluetooth Low Energy (BLE) technology: BLE beacon devices arise from the convergence of the Bluetooth and WiFi protocols in the Internet of Things (IoT) context. Unlike the classic Bluetooth protocol, in IoT applications BLE beacon communication is one-to-many, involves few bits of data broadcast frequently, and does not need any pairing operation. These properties are particularly suitable for proximity detection and interaction with smartphone devices for the purpose of activity sensing [91]. [92] experimented with BLE interaction in an SBTS. To the best of our knowledge, this is a pioneering work in which the ground truth has been collected Device-to-Device (D2D) in this field (see Fig. 3). Among the goals was the automatic detection of bus trips exploiting beacons installed at the bus stops and inside the buses running the Silver Line in Boston, USA. The authors report the challenge of finding the right signal strength in order to allow the detection of beacons by smartphones in conditions where the signal could be attenuated, e.g. by travellers' bodies or the devices' relative locations, while at the same time reducing the risk of interference with other beacons in range, e.g. when passing by a bus stop or bunching with other buses on the way.
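As a sketch of the GPS duty-cycling strategies mentioned in point 5, an application can keep the energy-hungry AGPS off while low accelerometer variance indicates a stationary device. The variance threshold below is an assumption for illustration, not a value taken from the cited studies, which tune such parameters per device and context.

```python
import statistics

def gps_should_be_on(accel_magnitudes, var_threshold=0.05):
    """
    Duty-cycling sketch: request AGPS fixes only while the variance of the
    acceleration magnitude (in m/s^2) suggests the device is moving.
    The threshold is an illustrative assumption.
    """
    return statistics.pvariance(accel_magnitudes) > var_threshold

# Resting on a desk: readings hover around gravity (~9.81 m/s^2).
print(gps_should_be_on([9.80, 9.81, 9.82, 9.80, 9.81]))  # low variance: keep GPS off
# Carried while moving: the magnitude fluctuates.
print(gps_should_be_on([9.2, 10.5, 8.7, 11.0, 9.9]))     # high variance: turn GPS on
```

Real implementations combine such a rule with WiFi and cell positioning, as in the OS-level strategies discussed above.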

Smartphone data mining
Due to the disparity of progress drivers, we see a trend of increasing fragmentation, inconsistency, availability and volume of travel data. In response to this challenge, two main branches seem to arise as the flip sides of the same coin [93,94,95,11]. The first focuses on data fusion, intended to compose and then mine high-dimensional data sets collected from multiple sources, including: (i) GIS, (ii) INS, (iii) GPS. The second targets the development of, e.g., very sophisticated computer intelligence models, feature extraction methodologies, and optimal hyperparameter selection, which are constantly improving and thus complementing traditional statistical methodologies, often substituting them for specific purposes [22].
The potential given by smartphones depends on both the high resolution of the collectable data and the large market penetration of the devices. These are two dimensions determining, as already mentioned, data sets that are quite complex to deal with. The consequence is a multi-step preparation process necessary before we can extract any knowledge from the data. Typically, smartphone data are affected by several errors. For example, the study presented in [96], which compares GPS points coming from a dedicated GPS logger with the data generated by a Nokia N95, shows the problem clearly. Since then, however, the situation has improved substantially. The resolution and accuracy achievable by the sensors and software operating in last-generation smartphones is definitely higher. Nevertheless, the raw measurements vary between smartphones, and within the same model of smartphone too [97]. Achieving consistency of machine learning methods across different smartphones requires a rigorous process of data preparation, cleansing and trajectory segmentation upfront.

Data preparation, trip segmentation and stop detection techniques
When approaching a data set generated by a smartphone, the first step is taking into account that any measurement is affected by noise. The noise is not necessarily random, since it may be correlated with weather conditions; building density, materials, and height; crowdedness; smartphone model; or software "bugs". The data preparation techniques need to be adapted to each type of measurement. For example: the combination of longitude, latitude and time stamps allows the computation of features such as speed, acceleration, and higher-order time derivatives, e.g. jerk, crackle, pop; gyroscope and accelerometer enable the calculation of the orientation and acceleration of the smartphone. On top of this laborious feature engineering process, we need to do data cleansing. We can assess the quality of the positions and filter out the worst points, e.g. when speed or acceleration is inconsistent with the context. However, there are different degrees of sophistication between rule-based filters (e.g. a threshold on the maximum speed), statistical filters (e.g. the median filter), and model-based filters (e.g. the Kalman filter). The resolution of the measurements is also a crucial factor. For example, accelerometer and gyroscope readings from smartphones should be retrieved with a resolution compatible with the motion frequency of human bodies in daily routines, which is above 20 Hz [52]. Consequently, on the one hand, simple filters can hardly be effective for mining such features. On the other hand, more sophisticated tools such as Kalman filters can be applied effectively, but at the cost of heavy computations.

Figure 4: Trip Segmentation using walking segments as separators [37]
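The first steps of this pipeline can be sketched as follows: speeds are derived from consecutive (latitude, longitude, timestamp) fixes via the great-circle distance, and a simple rule-based filter discards fixes implying an implausible speed. The 60 m/s ceiling is an illustrative assumption, not a value from the literature.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    R = 6_371_000.0  # mean Earth radius in metres
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

def speeds(points):
    """points: list of (lat, lon, unix_ts). Speed in m/s between consecutive fixes."""
    return [haversine_m(p[:2], q[:2]) / (q[2] - p[2])
            for p, q in zip(points, points[1:])]

def max_speed_filter(points, vmax=60.0):
    """Rule-based filter: drop fixes implying an implausible speed (> vmax m/s)."""
    kept = [points[0]]
    for q in points[1:]:
        if haversine_m(kept[-1][:2], q[:2]) / (q[2] - kept[-1][2]) <= vmax:
            kept.append(q)
    return kept
```

Acceleration and jerk follow by differentiating the speed series in the same way; statistical and model-based filters would replace `max_speed_filter` at higher computational cost.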
Once the trajectories are ready for the subsequent classification process, after data cleansing, we can apply ML methods. We can organise ML methods from the standpoint of their classifying purpose, combined with the nature of the underlying features. For each classifier, such as for mode detection and purpose imputation, the underlying features can be (i) location-agnostic versus location-specific; (ii) user-agnostic versus user-specific. For example, methods relying on user- or location-agnostic features can be trained on any geographic area, and then either deployed on a different area to classify the activities of another population, or reused to solve similar problems. The first case depends on the generalisation power of the model; the second case is identified as transfer learning, the discipline dedicated to bringing the knowledge gained in solving a problem in one domain to a different problem in another domain. Most of the literature reviewed works with location- and user-agnostic features (see fig. 5). In contrast, user- and location-specific data seem to enable more accurate classifications, at the cost of the volume of information to be handled, poor transferability, and poor generalisation power. This claim is hard to prove due to the heterogeneity of the results available in the relevant literature. Although these results are hardly comparable across related studies, within each relevant study we find evidence of the positive contribution of user- and location-specific data to the performance of the classifiers [61,25,14].
Beyond very different human trajectory classifiers, as those specialised either on mode detection or on purpose imputation, the common ground is human behaviour. People travel with a purpose, often to reach a location where they perform some activity, and their strategy to reach the site depends on the context. Trivially, let us assume that one needs to get to the centre of the city: (i) Monday-Friday at 8.00 for work; (ii) Friday at 16.00 for sport activities; (iii) Saturday at 22.00 for social activities. It is likely that the strategy to reach the same location will vary according to the purpose and the context, for example: (i) Public transport or bicycle depending on weather, as the cost is important. (ii) Car, as it might allow room for sport equipment. (iii) Taxi, as it could give freedom of consuming a drink.
Therefore, the distinction of trajectories reduces to two fundamental classes: (i) motion, which is the act of reaching the site of interest, (ii) and stop, which is the act allowing people to perform the activity of interest in such a place. In Sec. 5.2.1 we present how the motion class branches out, and the methods specialising on their classification; in Sec. 5.2.2, how the stop class branches out, and suitable classification methods.
The approaches and methods for stop detection and transition detection described below have also been applied for segmenting GPS trajectories with the purpose of, e.g., mode detection and/or activity inference. [14] focuses on Future Mobility Sensing (FMS). This work highlights the impact of stop-detection performance on the quality of the ground truth collected from users who validate their trips on the mobile device. The method presented consists of six steps: (i) trajectory cleansing based on the accuracy provided by the AGPS; (ii) rule-based detection of stop candidates, where stops are points within a 50-metre range and a 1-minute time window; (iii) checking stop candidates against the user's frequent stop locations; (iv) rule-based merging of the resulting stops applying various range/time thresholds; (v) detecting the still mode by applying a learned classifier based on acceleration measures; (vi) removing extra stops after the mode detection algorithm.
[98] applies a rule-based algorithm to detect activity points based on the on-off status of the GPS (because of the GPS units employed in the study), a speed/time threshold and a range/time threshold. Transition points are identified by applying thresholds on computed acceleration and speed as well as on time, based on the assumption that travellers walk to change mode.
[99] defines Stay-Points as a geographical area where travellers stay within a range for a certain time. Then, based on these two rules, they apply an "affinity propagation clustering method" [99]. Stay-Points follow a different definition than the transition points used to identify where travellers change mode in a complex travel mode chain; the latter are identified by a different set of rules based on speed, as well as on the assumption that the noisy data typically detected around transition points is temporary, while the changes in speed are permanent.
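The range/time rule behind stay-point definitions of this kind can be sketched in a few lines of Python. The sketch assumes points already projected to a local planar frame in metres (to sidestep geodesic maths); thresholds and function names are illustrative.

```python
def stay_points(points, dist_m=50.0, min_dur=60.0):
    """Rule-based stay-point detection: a stay point is the centroid of consecutive
    fixes that remain within dist_m metres of an anchor fix for at least min_dur
    seconds. points: list of (t_seconds, x_m, y_m)."""
    stays, i, n = [], 0, len(points)
    while i < n:
        j = i + 1
        # extend the window while fixes stay within range of the anchor fix
        while j < n and ((points[j][1] - points[i][1]) ** 2
                         + (points[j][2] - points[i][2]) ** 2) ** 0.5 <= dist_m:
            j += 1
        if points[j - 1][0] - points[i][0] >= min_dur:
            cluster = points[i:j]
            cx = sum(p[1] for p in cluster) / len(cluster)
            cy = sum(p[2] for p in cluster) / len(cluster)
            stays.append((points[i][0], points[j - 1][0], cx, cy))  # (t_start, t_end, centroid)
            i = j
        else:
            i += 1
    return stays
```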
[100] applies two algorithms in sequence. The first algorithm performs the identification of trips, which are detected by a rule-based algorithm finding stay points. They eliminate outliers by first performing the Kolmogorov-Smirnov test on a random sample of stay points, in order to verify whether these are normally distributed. Then, they apply the three-sigma rule to find and remove outliers. After the cleansing process, they compute the central stay point. Even though GPS error follows a bivariate Rayleigh distribution [96], the normal distribution is sometimes accepted as a suitable approximation.
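The three-sigma step of this pipeline is straightforward to sketch; a minimal Python version, assuming the normality check (the K-S test in [100]) has already been passed:

```python
from statistics import mean, stdev

def three_sigma_clean(xs):
    """Remove observations further than three standard deviations from the mean,
    assuming approximate normality of the sample."""
    m, s = mean(xs), stdev(xs)
    return [x for x in xs if abs(x - m) <= 3 * s]
```

After cleansing, the central stay point would simply be the mean of the surviving coordinates.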
[101] performs mode classification on the trajectories available in the Geolife data set [102]. The authors apply fixed-size segments of 200 points for both seen and unseen trajectories (where 200 is the median number of GPS points across all trips composing the data set). Then they concatenate consecutive segments with the same label. They discard segments with fewer than 10 GPS points. Finally, the trajectories are processed with a Savitzky-Golay filter for smoothing purposes.
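The fixed-size segmentation step can be sketched as follows; the 200-point and 10-point values are the thresholds reported in [101], while the function name is ours:

```python
def fixed_size_segments(points, seg_len=200, min_pts=10):
    """Split a trajectory into fixed-size segments of seg_len points, discarding
    any trailing segment shorter than min_pts."""
    segs = [points[i:i + seg_len] for i in range(0, len(points), seg_len)]
    return [s for s in segs if len(s) >= min_pts]
```

The subsequent smoothing in [101] uses a Savitzky-Golay filter (available, e.g., as scipy.signal.savgol_filter), which fits a low-order polynomial over a sliding window.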
[28] tests four segmentation methods: distance-, time-, bearing- and window-based. They highlight that while the last three are statistically equivalent, the first leads to varying sample sizes within each segment due to the different speeds in complex transport mode chains. Stop detection is not mentioned explicitly. The work aims directly at transportation mode detection. Thus, transition points might be identified where there is a discontinuity in the mode chain detected on these segments.
An efficient implementation, recently published on GitHub 7, allows stop detection and labelling of stationary events from a GPS trajectory. By building a network that links the stationary events identified, as nodes, within a critical space-time range, and clustering this network using two-level Infomap 8, the algorithm provides a label for each stop event.
We find an extensive list of methods for trajectory segmentation which are presented from different perspectives [103,11,104].
For example, [103] highlights the difficulty of comparing the performance of point- and segment-based methods. They therefore introduce a penalty system that looks at where these methods make mistakes, and a metric for improving the performance comparison between different segmentation techniques. In particular, with respect to the ground truth, while Precision and Recall identify "hits" and "misses" of the classifier, from such measurements we cannot understand how the error depends on over- or under-segmentation of the trajectory. Since errors in trajectory segmentation propagate to the classification of the trajectories, and classification performance depends on how the segmentation inference aligns with the ground truth, the penalties are proportional to the time and space of segments misaligned with the ground truth, in contrast to previous studies where a count of the editing operations was proposed [105]. Interestingly, with this metric, point-based Trajectory Segmentation Techniques (TST) outperform segment-based TST.
[11] reviews a large number of travel surveys worldwide and, when it comes to stop detection and trip segmentation, reports that rule-based stop detection techniques relying on range, time, speed or acceleration thresholds are the most common. The authors also highlight the challenges posed by signal loss and signal noise in the detection of short stops.
[104] argues "that the presence of nearby points in Euclidean space may be indicative of an activity, while the absence of nearby points may be indicative of travel" [104]. Therefore, referring to [106], in order to acquire a local density of points they suggest deploying a moving window that preserves the relationship with the 30 preceding and 30 succeeding points within a 15 m range. The range here seems too small compared to the GPS error, which is approximately 45 m [14].
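A minimal Python sketch of this moving-window local density, assuming points projected to a local planar frame in metres (function name and data layout are our assumptions):

```python
def local_density(points, window=30, radius=15.0):
    """For each point, count the neighbours among the `window` preceding and
    `window` succeeding points that lie within `radius` metres.
    points: list of (x_m, y_m); high counts suggest an activity, low counts travel."""
    dens = []
    for i, (x, y) in enumerate(points):
        lo, hi = max(0, i - window), min(len(points), i + window + 1)
        near = sum(
            ((px - x) ** 2 + (py - y) ** 2) ** 0.5 <= radius
            for j, (px, py) in enumerate(points[lo:hi], lo) if j != i
        )
        dens.append(near)
    return dens
```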

Human Activity Recognition in Mobility
In this section we focus on some of the building blocks necessary for mining user behaviour from smartphone data, in order to support downstream behaviour modelling, as well as to infer general insights useful for a range of applications, from individual services for the user to policy making and planning. At the basis are the fundamental questions of determining the route, transportation mode, waiting times, and trip purpose. It is important to note that in modern smartphones neither the AGPS nor the GPS is directly accessible. As mentioned in Sec. 4, smartphone operating systems provide locations through specific APIs which do not allow direct access to the underlying data sources (e.g., AGPS). At this point, the combination of feature extraction techniques and computer intelligence algorithms allows capturing the correlation between the features and the user's strategic choices. It is the shared belief in our field that, as technology evolves, the inference of the user's strategic choices in the form of a travel diary (see Def. 2.9) and the user validation by means of such a diary (see Fig. 6) allow the continuous improvement of the acceptable truth (see Def. 2.8), asymptotically approaching the theoretical ground truth.
To achieve the goal, we rely on a smartphone-based sensing platform (See Fig. 2), and we need to take into account the limitations of the smartphone device following the user during her journeys.
Often, computer intelligence algorithms are perceived as black boxes providing an inference as output given some input. For example, the mode detection box should infer the transportation mode, and the map-matching box should infer the route choice (see Fig. 6); at first glance these seem to be the output of the process. However, these black boxes are tightly coupled with the data necessary to enable and refine the inferences. Given an initial validated data set, their performance can be measured only by comparing the inferences with the data, in other words with the ground truth. In smartphone-based applications, the error propagating from trajectory segmentation to trajectory classification [103], and then to the diary generation, could finally propagate to the ground truth. From this standpoint, the output of this process might lead to systematically biased predictions. In SBTS, ML is just a tool to capture the information represented by the data. The quality of the models has a strong correlation with the quality of the ground truth we can collect, through the inferences behind the automatic generation of travel diaries.
There is consensus in the field about the lack of standardisation for validating and comparing the performance of competing classifiers. The work of [28] and [101] represents an evident example. Even though the classifications are performed on the same data set, the differences in the number and quality of predicted classes, and in the validation setup, are enough to make a comparison of F1 scores meaningless (see Tab. 1: data set, modes, cross validation). [101] computes F1 scores as the weighted average over a 5-fold cross validation. [28] picks a random sample of the users to compose training, validation, and test sets, then computes the F1 score on the test set only (leave-one-out method). To mitigate the problem, in Sec. 5.1 we find the penalisation solution proposed in [103] to link the F1 score with the distances represented by the classification errors.
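The support-weighted F1 score used by [101] can be sketched as follows; the function name is ours, and the point of the sketch is that the number itself says nothing about the validation setup that produced it:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with class-support weights. Two studies reporting
    this same score remain incomparable if their class sets or validation
    protocols (k-fold CV vs. a held-out user sample) differ."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c, n_c in support.items():
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (n_c / total) * f1
    return score
```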
Instead, [69] proposes both a data set and a workflow for cross validation, which could provide a standardised baseline. The data set includes 18 sensors' observations on 3 users, for a period of 2812 hours of labelled data. Labels include the position of the phone as Torso, Bag, Hand, and Hips. The workflow for cross validation includes 3 tasks: user-independent, phone-position-independent, and time-invariant. At the end of the 3 tasks, each one accomplished with manifold cross validation, the paper suggests as the performance driver to measure the predictive power of the model the standard deviation of the F1 score, computed across users, phone positions and time periods. However, most of the available data sets do not provide the same level of detail, and thus do not allow the same validation workflow. For example, the widely used Geolife data set provides GPS trajectories and transport mode labels only [102].

Figure 6: Smartphone-based user activity monitoring.
[69] explains that the feature extraction process should also be standardised, and proposes a standard workflow named Minimum Redundancy Maximum Relevance (see Sec. 5.2.1). For classifiers relying on Deep Learning (DL), though, the standard validation method described above is not effective, as the Neural Network extracts the features autonomously. Here, the new challenge is finding optimal hyperparameters for the network, such as architecture configuration, activation functions, batch size, regularisation factor and optimisation step. For example, [28], [101] and [107] select such hyperparameters manually, which is a time-consuming and ineffective process. [70] proposes an effective approach to select these hyperparameters automatically, moving a step toward the standardisation of DL-based classifier optimisation. However, the approach seems not to be used in this application field yet.
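The manual tuning done by [28,101,107] amounts to evaluating a handful of configurations by hand; a minimal sketch of the systematic alternative, an exhaustive grid search, where `train_eval` is a stand-in for training and validating a model on one configuration (all names here are ours):

```python
import itertools

def grid_search(train_eval, grid):
    """Exhaustively evaluate every combination in a small hyperparameter grid.
    train_eval maps a config dict (e.g., learning rate, batch size) to a
    validation score; returns the best config and its score."""
    keys = sorted(grid)
    best_cfg, best_score = None, float('-inf')
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_eval(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Approaches like the one in [70] replace this brute-force loop with a smarter search policy, which matters once the grid grows beyond a few dimensions.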

Mode detection
In Tab. 1 we present the summary of the review regarding transport mode detection methods. These methods aim at inferring the transportation mode chain as defined in Def. 2.6 and 2.7. We found a focus on the following modes (see Tab. 1): Walk, Bike, Electric Bike, Car, Bus, Rail (including both Train and Metro), Motorbike, Boat, Running, Plane, in-Vehicle, Stationary. In most of the cases, the ground truth of the data set comes from the validation of the respondents.
Training of the methods follows the data cleansing and segmentation described in Sec. 5. The research reviewed has been organised according to the person/location agnostic/specific features used to infer the mode of transportation, as in Fig. 5.
Location and Person Agnostic Features are the following.
• Speed: speed, average speed, average speed over a time interval, median speed, n%-percentile speed, n%-percentile speed over a time interval, low speed rate (defined as the ratio of points with a speed of less than a threshold), velocity change rate, low velocity rate, high velocity rate, medium velocity rate, max angular velocity, average angular velocity, maximum speed, maximum speed over a time interval, skewness of speed distribution, average change in speed over a time interval, speed variance, speed skewness, speed kurtosis, standard deviation speed.
• Acceleration: average absolute acceleration, n%-percentile acceleration, acceleration change rate, acceleration spectral entropy, acceleration range, maximum acceleration, acceleration variance, average change in acceleration over a time interval, variance change in acceleration over a time interval, acceleration skewness, acceleration kurtosis, Jerk, adjusted acceleration computed by removing the gravity acceleration [108], applying Fast Fourier Transform to assess the frequency domain (DC), where the DC term is the 0 Hz term and it is equivalent to the average of all the samples in the window, summation of spectral coefficients, energy of the signals.
• Distance: travel distance, share of travel time with the speed within a threshold, ratio of direct distance to travelled distance between origin and destination.
• Heading: average heading change, heading change rate, head direction change, bearing rate.
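Several of the speed features listed above follow directly from a per-point speed series; a minimal Python sketch (feature names mirror the list, while the low-speed threshold is an illustrative assumption):

```python
def speed_features(speeds, low_thr=1.0):
    """A few location- and person-agnostic speed features computed from a
    per-point speed series in m/s; low_speed_rate is the share of points
    slower than low_thr."""
    n = len(speeds)
    mean_v = sum(speeds) / n
    srt = sorted(speeds)
    return {
        'avg_speed': mean_v,
        'max_speed': max(speeds),
        'speed_variance': sum((v - mean_v) ** 2 for v in speeds) / n,
        'p95_speed': srt[min(n - 1, int(0.95 * n))],
        'low_speed_rate': sum(v < low_thr for v in speeds) / n,
    }
```

Acceleration, distance and heading features are derived analogously from consecutive differences of the speed, position and bearing series.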
Location Specific Features found are the following.
• Distance from motorway, from railway, from bicycle lane, from bus line, from bus stop, from railway station, from car parking, from bicycle parking; altitude, longitude, latitude, origin, destination.
Person Specific Features found are the following.
• Departure/arrival time, route, transportation mode [109], Personal Trip History [27].
We propose the above organisation from an application standpoint. From the General Data Protection Regulation 9 (GDPR) standpoint, all the above information would be person-specific, as any observation is collected and linked to the unique user identifier of the subject participating in the SBTS. [110] provides an extensive description of the main ML methods listed above.
As already mentioned, every method listed in this section could benefit from feature extraction, except for DNNs, or NNs in general, including AEs.
The minimum redundancy maximum relevance feature selection method (also known as maximum relevance minimum redundancy) is an effective way of selecting features, but it is also computationally very expensive [69] compared to, e.g., recursive feature elimination.
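Recursive feature elimination, the cheaper alternative mentioned here, can be sketched as a greedy backward search; `score_fn` is a stand-in for training and validating a classifier on a feature subset (all names in the sketch are ours):

```python
def recursive_feature_elimination(features, score_fn, keep=2):
    """Greedy backward elimination: repeatedly drop the feature whose removal
    hurts the validation score least, until `keep` features remain. Far cheaper
    than mRMR-style relevance/redundancy estimation over all feature pairs,
    at the price of being greedy."""
    current = list(features)
    while len(current) > keep:
        # the feature to drop is the one whose removal leaves the highest score
        worst = max(current, key=lambda f: score_fn([g for g in current if g != f]))
        current.remove(worst)
    return current
```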
Rule-based algorithms are simple and effective classifiers, but their limitation lies in generalisation power.
The success of FL depends on the possibility of applying rules for classification without losing generalisability. The flaw here is the cost of design, implementation, and above all of any addition required during operations, e.g., the inclusion of a new transportation mode.
BNM, PBMD and BC are particularly powerful in capturing the time dependency of GPS trajectories and time series. Bayesian methods can be generalised to take into account several time steps, but this brings the risk of over-smoothing, and thus of losing their classification power [111]. On the one hand, they can learn online, avoiding retraining when the system collects new observations; on the other hand, they are computationally very intensive.
In contrast, SVM, RF, and BM are light and effective, but in the presence of new observations these methods need to be retrained. Regarding retraining, DNNs have the same drawback; moreover, their training is quite complicated. However, their potential in handling multiple thresholds within the same class and across classes is excellent: DNNs learn multiple thresholds from the data while extracting relevant features. Consequently, the generalisation power of DNNs is probably the greatest. Again, the challenge is finding optimal hyperparameters.
Ironically, which method from the previous list is the ablest and most reliable depends on at least two elements: the complexity of the classification task, and the scale of the data.
The main sensors leveraged, already mentioned multiple times, are the following.
• AGPS, accelerometer (Acc), gyroscope (Gyro), magnetometer (Mag). According to [69], Mag contributes to improving Train and Subway detection; Gyro contributes to improving the detection of two-wheeled transportation modes, such as bicycles and motorbikes.
• Many studies also rely on Geographic Information Systems, except those based on Artificial Neural Networks (ANN), both Deep (DNN) and Recurrent (RNN).
AGPS represents the cornerstone for the fusion of any sensor with space. For example, Acc, Gyro, and Mag deliver time series, which can theoretically be fused on the time dimension with AGPS. AGPS also provides a trajectory that can be fused with GIS on the space dimension.
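Fusion on the time dimension reduces, in its simplest form, to a nearest-past join between the two time series; a minimal Python sketch (function name and data layout are illustrative assumptions):

```python
import bisect

def fuse_on_time(gps, imu):
    """Attach to each GPS fix the most recent IMU sample at or before it.
    gps: sorted list of (t, position); imu: sorted list of (t, sample).
    One simple way to align AGPS fixes with accelerometer/gyroscope series."""
    imu_t = [t for t, _ in imu]
    fused = []
    for t, pos in gps:
        k = bisect.bisect_right(imu_t, t) - 1  # index of last IMU sample <= t
        fused.append((t, pos, imu[k][1] if k >= 0 else None))
    return fused
```

In practice, IMU series sampled above 20 Hz would be aggregated (e.g., windowed statistics) before joining, since a raw nearest-sample join discards most of the signal.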
Under the assumption of having relevant information from GIS, compatible with the period of the observations of these sensors, the challenge in handling the resulting data set would derive from, e.g., the position where the user carries the smartphone, the environment interfering with the AGPS, the battery consumption, and the multidimensionality of the resulting data set.

Purpose Imputation
The analysis of the purpose imputation methods available in Tab. 6 highlights the use of activity-centred features, cluster-specific features, location-specific features, and person- and location-specific features. Unlike mode detection, this area is heavily populated by person- and location-specific methods.
The focus is on the following activities: work, study, shop, social visit, recreation, home, other [5,7,61], service, business meeting [7,9], paid work, daily shopping, non-daily shopping, help parents/children, voluntary work [7], change mode/transfer, meal/eating break, personal errand/task, medical/dental, entertainment, sports/exercise [61], eating out, pick up, drop off [30]. Among the location-specific features, we found land use [5,30], residential, administration and public services, commercial and business facilities, industrial, logistics and warehouse, street and transportation, municipal utilities, green space, water bodies, and others [30]. Relevant time indicators taken into account by the models are time of week, time of day, activity duration [9,30], start time, and end time.
Other features of interest are GPS point density and walk percentage [9]. To complete the person-specific features, we list age, gender, education, working hours, income, and mobility ownership [9,30].
The features in play seem well represented by the definitions provided in Sec. 2. Some are more explicit, but none is unexpected.
In this review we found only very specialised methods, working with data sets fusing GPS trajectories with GIS information.

Map matching
Following the high level classification of map-matching methods presented by [95], we focus on the subset of outdoor, multi-modal, offline methods for both low and high data sampling rate, which we present in Tab. 3. The large variety of approaches available has been organised in slightly different ways in the main reviews available on the subject. For example, [71] identifies four method categories: (i) Geometric analysis; (ii) Topological analysis; (iii) Probabilistic algorithms; (iv) Advanced Algorithms.
Each of the above reviews takes a methodological standpoint, and seems strongly influenced by the latest trend emerging during the year of the review. We do not find any mention of Deep Learning and Artificial Neural Network applications, which still represent a very exciting niche with applications of both Convolutional and Recurrent Neural Networks [113,114].
For map-matching methods that require the generation of short paths, [115] presents an interesting review of heuristics.
For SBTS applications, though, the above classifications do not communicate some essential aspects. From the application perspective, the method categories could provide decision-makers with further meaningful information, such as the following. Is the technique uni-modal or multi-modal? This relates to the ability to perform in either simple or complex transport mode chains. Is it global or incremental? This has implications for operating with or without knowing the destination of the trip. Does the method need the generation of short-path alternatives from which to choose the most likely, or can it classify each point directly? In the first case, the short-path generation method is critical; in the second, the point labelling process is. Is the technique rule-based or ML-based? Along this dimension, the distinction between the two ends is indefinite. However, moving from the first to the second category, we notice that heuristics gradually shift from values to distributions, and parameters from input to output of the models. We also have difficulties in interpreting the methods' performance. The problem is manifold. Lack of standardisation translates into different performance metrics, baselines, and ground truth collection methodologies. Data also play a crucial role in the performance of the methods. Even under the assumption of adopting the same standard procedure, the performance of the methods still depends on the data set rank, intended as the number of independent features we can extract, and on the data set size, intended as the number of independent observations. For example, synthetic trajectories (STR) are adopted in several studies. STR involve the random selection of short-path generators applied to a random origin-destination matrix within a road network, and perturbation with some noise distribution.
If, on the one hand, this methodology eases the standardisation of the subsequent experiments involving map-matching algorithms, on the other, the performance drop recorded when applying these algorithms to real-life data suggests that the rank is more important than the size of a data set. To support decision-makers in assessing different methods, we need to provide them at least with an idea of how the different components of the experimental design contribute to the performance estimation.

Figure 7: Cross disciplinary studies.
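The perturbation step of the STR recipe can be sketched in a few lines; the Gaussian noise model, sigma value and function name are illustrative assumptions, as the reviewed studies use various noise distributions.

```python
import random

def synthetic_trajectory(route, sigma=5.0, seed=42):
    """Perturb a shortest-path route (list of (x_m, y_m) vertices in a local
    planar frame) with Gaussian noise to emulate GPS error, producing one
    synthetic trajectory (STR) for map-matching experiments."""
    rng = random.Random(seed)  # fixed seed keeps experiments reproducible
    return [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma)) for x, y in route]
```

A full STR pipeline would first sample origin-destination pairs and run a shortest-path generator on the road network; the performance drop on real-life data suggests that this noise model misses much of what makes real trajectories hard.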

Conclusion and Future Directions
To orient the deployment of Travel Surveys on smartphones, this review provides an analysis of the relevant platforms and methods enabling data collection and data mining.
We reflect on the constraints and challenges deriving from smartphones' limitations. We evaluate the main approaches available, how each method contributes to the generation of the ground truth, how the ground truth affects the methods' performance, and we discuss the interaction models between the user and the SBTS. A "person to device" interaction to validate the data might introduce further errors; their magnitude and their impact on the performance of ML methods are not clear. The increasing market penetration of iBeacon technology and the Internet of Things, together with the positive results reported in the first tests in transport applications, raise expectations of a "device to device" ground truth evolution (see Fig. 3). In practice, we could achieve full automation of both travel diary generation and validation by introducing two new features: the exact location of an IoT device (e.g., an iBeacon), and the strength of the Bluetooth signal received by the smartphone. Meanwhile, where ML algorithms do not provide correct travel diaries to the user, "person to device" interaction could be enhanced by introducing the possibility for the user: (i) to trigger a new automatic evaluation of such segments; (ii) to flag whether he or she was unable to correct the mistakes. In the general field, we found consensus about the lack of standardisation in ML performance measurement and experiment design. A comparison based only on the accuracy level is superficial, as accuracy depends on the underlying data set first, and on the experiment design second. For example, analysing and testing the code 10 published by [28], it is evident that a specific split of the data set allows accuracy close to perfection. The authors accomplished the best performance over many other excellent classifiers trained on the same data set [102].
Because of their astonishing accuracy scores, they contribute to raising the question of how we can assess whether ML methods and data sets are realistic and applicable to real settings.
To provide a standardised way of comparing different methods, [41] introduces a penalisation system. [69] offers an open-source data set, a feature extraction method and a cross-validation algorithm. Yet, even assuming these methods were enough for comparing algorithms specialised in mode detection, purpose imputation and map-matching would remain out of their scope. Moreover, the share of stationary points present in these data sets, which is below 20% of the total, suggests that data cleansing downsampled the stationary class quite heavily. In fact, in a realistic setting, sleeping and working represent at least two-thirds of a day, during which one is unlikely to travel.
Since any standardisation requires intense and coordinated work across the research community and beyond, in this paper we can only select and summarise the information relevant, at least, for a qualitative comparison of the methods reviewed. To ease such a comparison, we organise the literature into tables, which include information about the classification objectives, the data sets employed in the experiments, and both the experiment and the data validation approach. The classification task is relatively more difficult with a larger number of classes. The accuracy bias is relatively lower when performing cross validation, and when processing larger data sets. Besides, by listing the sensors, features, and data sets that each of the related works depends on, we identify the main methods underlying the process of ground truth generation, which in SBTS are trip segmentation, mode detection, purpose imputation, and map-matching.
SBTS depend on a sophisticated multisided platform, which is subject to often conflicting interests over the resources available, such as the smartphone battery.
In the current versions, the Operating System orchestrates the applications' use of sensors and battery, and precludes direct access to the AGPS; therefore, developers have limited configuration possibilities. Furthermore, the data collected through these platforms are affected by severe errors and noise due to exogenous elements. For example, buildings' elevation, or the number of satellites in the line of sight, may negatively affect the data sets, and thus any method's classification performance, user validation and ground truth.
Nevertheless, we found excellent methods. Some perform best on low-resolution trajectories, where positions are sampled at large time intervals. Other classifiers are tied, e.g., to the location where trajectories are combined with data from GIS, or to personal information of the users' population. Among the best performers in terms of accuracy, in general, we find the following: Support Vector Machines, Fuzzy Logic, Random Forests, and Probabilistic Models (e.g., Hidden Markov Models). Classic rule-based algorithms might not perform at the same accuracy level as the methods just mentioned. However, they are still competitive for applications where execution speed is a priority over accuracy, and where the application scenario is stable.
We also found interesting studies that try either to combine multiple methods (see Fig. 7), or to leverage the output of other methods. For example, [51] uses the transport mode to improve the map-matching task, while [98] improves mode detection by map-matching GPS trajectories upfront. Similarly, [9] and [30] use the transport mode to accomplish purpose imputation.
In contrast, methods based on Deep Neural Networks (DNN), both convolutional and recurrent, are still in the early stages. For map-matching and purpose imputation, we found applications combining GPS with GIS, while for stop and mode detection, we found DNN applications with GPS only. Surprisingly, we did not find examples where the flexibility of DNNs has been exploited to perform multi-task learning for the classification of mode and purpose at the same time. [114] performs map-matching with a multi-task recurrent neural network; however, mode detection and purpose imputation are not included. It would be naive to think that using a DNN for multi-task classification could effectively target the three problems at the same time; nevertheless, the available data sets already enable an attempt in such a direction. Especially for map-matching, high-quality ground truth collection remains one of the biggest challenges for any supervised or semi-supervised method.