 Research
 Open Access
Scalable prediction by partial match (PPM) and its application to route prediction
 Vishnu Shankar Tiwari^{1},
 Arti Arya^{1} and
 Sudha Chaturvedi^{1}
 Received: 3 April 2018
 Accepted: 11 August 2018
 Published: 29 August 2018
Abstract
Route prediction plays a vital role in many important location-based applications such as resource prediction in grid computing, traffic congestion estimation, vehicular ad hoc networks, and travel recommendation. The goal of this work is to design a scalable route prediction application based on prediction by partial match (PPM) modeling of user travel data. PPM is a widely used technique for text compression, string sequence indexing, and prediction. In practice, constructing a PPM tree from a huge volume of data by sequential processing is time consuming. Existing techniques are designed for a single machine, and implementing them in a distributed environment is still a challenge. This work focuses on achieving horizontal scalability of PPM. It addresses various challenges in distributed construction, such as reducing I/O and computing sequences in parallel, and produces the final PPM tree in a distributed environment without sacrificing accuracy. A huge corpus of GPS data is map matched to the road network extracted from OpenStreetMap, and the PPM tree is built on the edges of the road network. A two-step construction of the PPM tree is proposed and then extended to execute over the MapReduce framework, running over the Hadoop distributed file system, for distributed processing. A horizontally scalable PPM model is built and evaluated for route prediction from a huge corpus of historical GPS traces. The data sets used, GPS traces and road networks, are both taken from openly available corpora. The distributed construction of PPM is evaluated on a Hadoop cluster using MapReduce, and detailed results are presented.
Keywords
 PPM
 Big data
 Scalability
 MapReduce
 Route prediction
Introduction
Route prediction is a key requirement in many important location-based applications such as vehicular ad hoc networks, traffic congestion estimation, resource prediction in grid computing, vehicular turn prediction, travel pattern similarity, and pattern mining. The route prediction problem is as follows: given a sequence of road network graph edges already traveled by the user, predict the most probable edge of the network to be traveled next. Our approach is to build a prediction by partial match (PPM) model from a huge corpus of sequential trajectories traveled by the user in the past. PPM is widely used in various applications in the area of data compression and machine learning (Begleiter et al. 2004). Timestamped GPS traces are collected over a long period of time. The long chronological sequence of GPS traces is broken down into smaller units called trips (Froehlich and Krumm 2008; Tiwari et al. 2013). Trips are mapped to the road network graph using a map matching process, which identifies the object’s location on the road network graph (Tiwari et al. 2014; Bernstein and Kornhauser 1996; Zhou and Golledge 2006). A PPM tree-based model is constructed from trips, each composed of an ordered sequence of road network edges. Given a trajectory traveled by the user, a lookup is done in the PPM tree-based model and the most likely next edge is found.
Cleary and Witten (1984) introduced PPM. Many versions of PPM evolved thereafter (Moffat 1990; Cleary et al. 1995; Teahan 1995; Schürmann and Grassberger 1996). PPM models learn from historical occurrences of sequences to predict the probability of a specific symbol appearing after a given data sequence. For the experiments in this work, the PPMC variant is used. We explain the construction of PPMC, followed by its distributed construction. Real applications using PPM deal with huge data sets, and processing such a volume sequentially to produce a PPM model is a bottleneck. Attempts have been made to achieve scalability by adding processors and memory (Gilchrist 2004; Joel and Sirota 2012; Effros 2000). However, distributed construction of PPM is still a challenge. In the proposed work, scalability is achieved by decomposing GPS traces into trips, processing them in parallel, and finally consolidating them to form the PPM model. A set of user trips is decomposed into smaller sets, which are passed to compute modules known as mappers. Mappers compute the variable order contexts as key–value pairs. In each pair, the key is the context and the value is its occurrence frequency in the training set. Key–value pairs from the various mappers are emitted to the reducer node. The reducer consolidates the occurrences of the various contexts and inserts them in the PPM trie. The final tree produced by the reducer is the PPM model, which is used for route prediction. The major contribution of this work is a technique for distributed computation of PPM and its application to route prediction. All experiments and implementations are done on real data sets openly available in the public domain.
PPM tree-related work and literature
Prediction by partial match (PPM) tree construction
Two-phase PPM tree construction
All contexts computed by Algorithm 1
S. no.  d  Context (s)  Symbol (σ)  sσ  Frequency (f)
1  2  e_{1}, e_{2}  e_{5}  e_{1}, e_{2}, e_{5}  2
2  2  e_{2}, e_{5}  e_{1}  e_{2}, e_{5}, e_{1}  2
3  2  e_{5}, e_{1}  e_{3}  e_{5}, e_{1}, e_{3}  1
4  2  e_{1}, e_{3}  e_{1}  e_{1}, e_{3}, e_{1}  1
5  2  e_{3}, e_{1}  e_{4}  e_{3}, e_{1}, e_{4}  1
6  2  e_{1}, e_{4}  e_{1}  e_{1}, e_{4}, e_{1}  1
7  2  e_{4}, e_{1}  e_{2}  e_{4}, e_{1}, e_{2}  1
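Algorithm 1 itself is not reproduced in this section, so the following is only a minimal sketch of how order-d contexts and their frequencies could be counted with a sliding window. The function name `count_contexts` and the edge sequence `trip` are assumptions made for illustration: the sequence is a hypothetical one chosen so that its order-2 counts agree with the table above, not the authors' actual input.

```python
from collections import Counter


def count_contexts(trip, d):
    """Count every (context of length d, next symbol) pair in one trip."""
    counts = Counter()
    for i in range(len(trip) - d):
        # key = context s (d edges) followed by the symbol sigma
        key = tuple(trip[i:i + d + 1])
        counts[key] += 1
    return counts


# Hypothetical edge sequence (an assumption, not taken from the paper),
# chosen so that the resulting counts match the table above.
trip = ["e1", "e2", "e5", "e1", "e3", "e1", "e4", "e1", "e2", "e5", "e1"]
counts = count_contexts(trip, d=2)

for key, freq in counts.items():
    print(", ".join(key), freq)
```

Each key corresponds to the sσ column and each value to the frequency column.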
Distributed construction of the PPM tree
All contexts with frequency computed by m_{1}
S. no.  d  Context (s)  Symbol (σ)  sσ  Frequency (f)  〈K,V〉
1  2  e_{1}, e_{2}  e_{5}  e_{1}, e_{2}, e_{5}  2  〈e_{1}, e_{2}, e_{5},2〉
2  2  e_{2}, e_{5}  e_{1}  e_{2}, e_{5}, e_{1}  2  〈e_{2}, e_{5}, e_{1},2〉
3  2  e_{5}, e_{1}  e_{3}  e_{5}, e_{1}, e_{3}  1  〈e_{5}, e_{1}, e_{3},1〉
4  2  e_{1}, e_{3}  e_{1}  e_{1}, e_{3}, e_{1}  1  〈e_{1}, e_{3}, e_{1},1〉
5  2  e_{3}, e_{1}  e_{4}  e_{3}, e_{1}, e_{4}  1  〈e_{3}, e_{1}, e_{4},1〉
6  2  e_{1}, e_{4}  e_{1}  e_{1}, e_{4}, e_{1}  1  〈e_{1}, e_{4}, e_{1},1〉
7  2  e_{4}, e_{1}  e_{2}  e_{4}, e_{1}, e_{2}  1  〈e_{4}, e_{1}, e_{2},1〉
All contexts with frequency computed by m_{2}
S. no.  d  Context (s)  Symbol (σ)  sσ  Frequency (f)  〈K,V〉
1  2  e_{5}, e_{1}  e_{3}  e_{5}, e_{1}, e_{3}  2  〈e_{5}, e_{1}, e_{3},2〉
2  2  e_{1}, e_{3}  e_{1}  e_{1}, e_{3}, e_{1}  2  〈e_{1}, e_{3}, e_{1},2〉
3  2  e_{3}, e_{1}  e_{4}  e_{3}, e_{1}, e_{4}  1  〈e_{3}, e_{1}, e_{4},1〉
4  2  e_{1}, e_{4}  e_{1}  e_{1}, e_{4}, e_{1}  1  〈e_{1}, e_{4}, e_{1},1〉
5  2  e_{4}, e_{1}  e_{2}  e_{4}, e_{1}, e_{2}  1  〈e_{4}, e_{1}, e_{2},1〉
6  2  e_{1}, e_{2}  e_{5}  e_{1}, e_{2}, e_{5}  1  〈e_{1}, e_{2}, e_{5},1〉
7  2  e_{2}, e_{5}  e_{1}  e_{2}, e_{5}, e_{1}  1  〈e_{2}, e_{5}, e_{1},1〉
Result of merging of intermediate key/value pairs by MapReduce framework
S. no.  d  Context (s)  Symbol (σ)  Key (k)  Frequencies  〈K,〈sum(occurrence)〉〉
1  2  e_{1}, e_{2}  e_{5}  e_{1}, e_{2}, e_{5}  2, 1  〈e_{1}, e_{2}, e_{5},〈3〉〉
2  2  e_{2}, e_{5}  e_{1}  e_{2}, e_{5}, e_{1}  2, 1  〈e_{2}, e_{5}, e_{1},〈3〉〉
3  2  e_{5}, e_{1}  e_{3}  e_{5}, e_{1}, e_{3}  1, 2  〈e_{5}, e_{1}, e_{3},〈3〉〉
4  2  e_{1}, e_{3}  e_{1}  e_{1}, e_{3}, e_{1}  1, 2  〈e_{1}, e_{3}, e_{1},〈3〉〉
5  2  e_{3}, e_{1}  e_{4}  e_{3}, e_{1}, e_{4}  1, 1  〈e_{3}, e_{1}, e_{4},〈2〉〉
6  2  e_{1}, e_{4}  e_{1}  e_{1}, e_{4}, e_{1}  1, 1  〈e_{1}, e_{4}, e_{1},〈2〉〉
7  2  e_{4}, e_{1}  e_{2}  e_{4}, e_{1}, e_{2}  1, 1  〈e_{4}, e_{1}, e_{2},〈2〉〉
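The grouping and summing step behind this table can be sketched in plain Python, without a Hadoop cluster. This is a simulation under stated assumptions, not the authors' MapReduce job: `m1` and `m2` hold the 〈K,V〉 pairs copied from the two mapper tables, while the helper names `shuffle` and `reduce_counts` are hypothetical and merely mimic what the framework does between the map and reduce phases.

```python
from collections import defaultdict

# <K,V> pairs emitted by mapper m1 (copied from its table above).
m1 = {
    ("e1", "e2", "e5"): 2, ("e2", "e5", "e1"): 2, ("e5", "e1", "e3"): 1,
    ("e1", "e3", "e1"): 1, ("e3", "e1", "e4"): 1, ("e1", "e4", "e1"): 1,
    ("e4", "e1", "e2"): 1,
}
# <K,V> pairs emitted by mapper m2 (copied from its table above).
m2 = {
    ("e5", "e1", "e3"): 2, ("e1", "e3", "e1"): 2, ("e3", "e1", "e4"): 1,
    ("e1", "e4", "e1"): 1, ("e4", "e1", "e2"): 1, ("e1", "e2", "e5"): 1,
    ("e2", "e5", "e1"): 1,
}


def shuffle(*mapper_outputs):
    """Group the values of identical keys, as the framework does before reduce."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, freq in output.items():
            grouped[key].append(freq)
    return grouped


def reduce_counts(grouped):
    """Sum the grouped frequencies of each key."""
    return {key: sum(freqs) for key, freqs in grouped.items()}


grouped = shuffle(m1, m2)
merged = reduce_counts(grouped)
for key, total in merged.items():
    print(", ".join(key), grouped[key], total)
```

Because summation is associative and commutative, the result does not depend on how trips are split across mappers, which is why the distributed construction can match the sequential one.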
Route prediction using the PPM tree
 Case I:

This is the case when the user is at the root node, which signifies that the user has not started traveling. We represent the user trajectory by \(S = \varepsilon\). From the PPM trie, it can be seen that the possible next edges are \(\left\{ {e_{1} ,e_{2} ,e_{3} ,e_{4} ,e_{5} } \right\}\). The probability for each case is as follows:
$$p (e_{1} \mid \varepsilon ) = \frac{8}{18},\quad p (e_{2} \mid \varepsilon ) = \frac{3}{18},\quad p (e_{3} \mid \varepsilon ) = \frac{2}{18}, \quad p (e_{4} \mid \varepsilon ) = \frac{2}{18},\quad p (e_{5} \mid \varepsilon ) = \frac{3}{18}.$$Hence, \(Route\_Predict\left( \varepsilon \right) \to e_{1} .\)
 Case II:

Another case we explore is when only edge e_{2} has been traversed so far, \(S = e_{2}\). The input trajectory has length 1 and consists of a single edge. The only candidate edge after e_{2} is e_{5}. In this case, the probability of e_{5} occurring with e_{2} as context is \(p (e_{5} \mid e_{2} ) = 1\). Hence, Route_Predict (e_{2}) → e_{5}.
 Case III:

The next case is when the input trajectory is \(S = \left\{ {e_{1} } \right\}\) and only one edge e_{1} has been traversed so far. Here there are multiple candidates ({e_{2},e_{3}}) with equally high probability after edge e_{1}. The probability of each candidate is as follows:
$$p (e_{2} \mid e_{1} ) = \frac{3}{8}, \quad p (e_{3} \mid e_{1} ) = \frac{3}{8}$$Hence, two edges are equally likely, and the tie will be resolved once more edges are traveled.
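As a worked check of these values, assume that each trie node stores the total frequency of all merged keys that pass through it. This counting convention is an assumption, though it is consistent with the probabilities quoted in the cases here. Summing the two mappers' frequencies for the keys that begin with e_{1} gives the counts below e_{1}, written as \(N(\cdot)\), a symbol introduced only for this illustration:
$$N(e_{1} e_{2}) = 2 + 1 = 3,\quad N(e_{1} e_{3}) = 1 + 2 = 3,\quad N(e_{1} e_{4}) = 1 + 1 = 2$$
$$p (e_{2} \mid e_{1} ) = \frac{3}{3 + 3 + 2} = \frac{3}{8},\quad p (e_{3} \mid e_{1} ) = \frac{3}{8},\quad p (e_{4} \mid e_{1} ) = \frac{2}{8}$$
The same total, 3 + 3 + 2 = 8, is the numerator of \(p (e_{1} \mid \varepsilon ) = \frac{8}{18}\) in Case I.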
 Case IV:

Next, we consider a case when multiple edges have been traveled and the input to the Route_Predict function is \(\left\{ {e_{1} ,e_{2} } \right\}\). Given that travel over \(\left\{ {e_{1} ,e_{2} } \right\}\) has already occurred, the only candidate for the next edge is e_{5}. Thus \(p (e_{5} \mid e_{1} ,e_{2} ) = 1\) and hence Route_Predict(e_{1},e_{2}) → e_{5}.
 Case V:

Next, we consider a case when the user has traveled a path which has not yet been seen by the PPM model. For example, suppose the user has traveled the path {e_{3},e_{4}} but no such path exists in the trie, meaning that it has not occurred in the past. Hence, the prediction function result is \(Route\_Predict\left( {e_{3} ,e_{4} } \right) \to \varepsilon\). This can happen either when the user has reached the destination and there is nothing to predict, or when the route is new. In the latter case, new routes should be sent to the model for learning when they are found.
 Case VI:

All the above cases focused on predicting the next edge, one hop ahead. The same model can be used to predict an end-to-end path as well. The input trajectory is \(\varepsilon\). The first edge selected is e_{1}. After e_{1}, the candidates e_{2} and e_{3} are tied (Case III); if e_{2} is taken, the next probable edge from e_{2} is e_{5}, and so on.
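The six cases can be reproduced with a small trie lookup. The sketch below is illustrative only and rests on several assumptions: the `MERGED` counts are the sums of the per-mapper frequencies in the tables above, every node on a key's path accumulates that key's frequency, ties are broken by insertion order, and the path length is capped because the example data contains cycles. The names `build_trie`, `route_predict`, and `predict_path` are hypothetical, not the authors' API.

```python
# Merged key frequencies (sums of the two mappers' frequencies per key).
MERGED = {
    ("e1", "e2", "e5"): 3, ("e2", "e5", "e1"): 3, ("e5", "e1", "e3"): 3,
    ("e1", "e3", "e1"): 3, ("e3", "e1", "e4"): 2, ("e1", "e4", "e1"): 2,
    ("e4", "e1", "e2"): 2,
}


def build_trie(counts):
    """Insert each key; every node on its path accumulates the frequency."""
    root = {}
    for key, freq in counts.items():
        children = root
        for edge in key:
            node = children.setdefault(edge, {"count": 0, "children": {}})
            node["count"] += freq
            children = node["children"]
    return root


def route_predict(trie, context):
    """Return the most probable next edges as (edge, probability) pairs.

    An empty list stands for epsilon: unseen context or nothing to predict.
    """
    children = trie
    for edge in context:
        if edge not in children:
            return []
        children = children[edge]["children"]
    if not children:
        return []
    total = sum(node["count"] for node in children.values())
    best = max(node["count"] for node in children.values())
    return [(edge, node["count"] / total)
            for edge, node in children.items() if node["count"] == best]


def predict_path(trie, order=2, max_len=4):
    """Greedy end-to-end prediction using the last `order` edges as context."""
    path = []
    while len(path) < max_len:
        candidates = route_predict(trie, path[-order:])
        if not candidates:
            break
        path.append(candidates[0][0])  # assumption: first candidate wins ties
    return path


trie = build_trie(MERGED)
print(route_predict(trie, []))            # Case I: e1 with probability 8/18
print(route_predict(trie, ["e2"]))        # Case II: [('e5', 1.0)]
print(route_predict(trie, ["e1"]))        # Case III: [('e2', 0.375), ('e3', 0.375)]
print(route_predict(trie, ["e3", "e4"]))  # Case V: []
print(predict_path(trie))                 # Case VI: ['e1', 'e2', 'e5', 'e1']
```

Each lookup walks a single branch of the trie, so its cost grows with the length of the context rather than with the size of the training corpus.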
Implementation and evaluation
Map data: spatial road network data
User location traces data
Conclusion
In this work, the focus was on the construction of the PPM model in a distributed way from a huge corpus of GPS location traces. This model was then used for building a route prediction application. The application required road network data and GPS traces. Both data sets were sourced from openly available sources: road network data from OSM and GPS data from the Geolife project. GPS location traces were decomposed into smaller units called user trips. User trips were map matched to the road network to convert the data into a set of edges. This step is part of data preparation, which is a one-time activity. Map matching GPS data to road network edges reduces the data size and makes model construction faster than building a model from raw GPS data. For distributed construction, data were stored in the HBase data store and the MapReduce framework was used for computation. The processing was designed as two steps that map naturally onto the MapReduce framework. The PPM model was constructed with the edges of the PPM tree annotated with the probability of their occurrence. The model was then used to predict the route given a partial trajectory. We observed that the model construction phase is the most time consuming, but on a distributed cluster the processing time decreases linearly with the addition of nodes to the cluster. Once the model is constructed, route prediction is not time consuming: it amounts to traversing a branch of a multiway rooted tree, and the search time is linear. All tools and data sets used in this work are openly available in the public domain. All the snapshots presented in this work were taken during implementation from real data sets.
Declarations
Authors’ contributions
VST, SC and AA discussed the idea of PPM with respect to route prediction and its implementation aspects. VST and SC implemented the idea and contributed toward the first draft of the paper under the guidance of AA. AA and SC thoroughly proofread the manuscript and made all vital corrections. All authors read and approved the final manuscript.
Authors’ information
Vishnu Shankar Tiwari is a postgraduate (master of technology, M Tech) in computer engineering from the Department of Computer Engineering, Indian Institute of Technology (IIT) Bombay, Mumbai, India. He also holds an M Tech (computer applications) degree from the YMCA University of Science and Technology, India, and a master of computer application (MCA) degree from the Maharshi Dayanand University, India. He currently works as Vice President in Technology at J.P. Morgan Chase & Co., with a total experience of more than 10 years in teaching and the software industry.
Arti Arya is Head of Department (HOD) and Professor at the Department of Computer Application, PES Institute of Technology, Bangalore South Campus. She holds a Ph.D. in computer science from the Faculty of Technology and Engineering, Maharshi Dayanand University, India. She has an M Tech degree in computer science from the Allahabad Agricultural Institute, and master of science (mathematics) and bachelor of science (mathematics) degrees from the Delhi University. Her areas of interest are spatial data mining, knowledge-based systems, machine learning, artificial intelligence, and data analysis. She has approximately 17 years of teaching experience (with 10 years of research) at the undergraduate and postgraduate levels. She is a Senior Member of IEEE, a Life Member of CSI, and a Life Member of IAENG.
Sudha Chaturvedi is Lead Software Engineer in Aricent Technologies with more than 10 years of experience in the software industry and teaching/research. She holds master of technology—M Tech degree in computer engineering. Additionally, she holds master of computer applications (MCA) and master of business administration (MBA) degrees. She has expertise in implementation of machine learning models.
Acknowledgements
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
All data and material used are open source. The GPS data points come mainly from the GPS trajectory data set collected in the (Microsoft Research Asia) Geolife project. The data set has been made available for research since 2012 by Microsoft Research (https://geotime.com/general/geolifeproject/). The map data used are from OpenStreetMap (OSM), which is an open project (https://www.openstreetmap.org).
Consent for publication
The authors consent to the publication of this article by Springer Open.
Ethics approval and consent to participate
This is the authors’ own personal research work. The authors self-approve the ethics of this work and provide consent for participation.
Funding
This work is purely the authors’ own work, and the authors themselves bore the funding required for publishing this research work.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 Begleiter R, El-Yaniv R, Yona G (2004) On prediction using variable order Markov models. J Artif Intell Res 22:385–421
 Bernstein D, Kornhauser A (1996) An introduction to map matching for personal navigation assistants. New Jersey TIDE Center Technical Report
 Celikel E (2005) A cryptographic approach to language identification: PPM. In: Proceedings of the 7th Int’l conference on enterprise information systems (ICEIS)
 Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst 26(2):1–26. https://doi.org/10.1145/1365815.1365816
 Cleary J, Witten I (1984) Data compression using adaptive coding and partial string matching. IEEE Trans Commun 32(4):396–402. https://doi.org/10.1109/tcom.1984.1096090
 Cleary JG, Teahan WJ, Witten IH (1995) Unbounded length contexts for PPM. In: Storer JA, Cohn M (eds). Proceedings DCC ‘95, data compression conference: 28–0 Mar 1995. IEEE Computer Society Press, Snowbird, pp 52–61. https://doi.org/10.1109/dcc.1995.515495
 Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design & implementation, December 06-08, 2004, San Francisco
 Effros M (2001) PPM performance with BWT complexity: a new method for lossless data compression. In: Proceedings of data compression conference, DCC 2000
 Froehlich J, Krumm J (2008) Route prediction from trip observations. Society of Automotive Engineers (SAE) 2008 World Congress, April 2008, Paper 2008-01-0201
 Gilchrist J (2004) Parallel data compression with bzip2. In: Proc. of IASTED Intl. Conf. on Par. and Distrib. Computing and Sys. pp 559–564
 Greenfeld JS (2002) Matching GPS observations to locations on a digital map. In: Proceedings of the 81st annual meeting of the transportation research board, Washington
 Hiroyuki A, Kazuhiro K, Takashi I, Shigeichi H (2005) A PPM* algorithm using context mixture. In the Journal of IEIC, pp 35–40
 Joel R, Sirota V (2012) FPGA-based data compressor based on Prediction by Partial Matching. In: IEEE 27th convention of electrical and electronics engineers, Israel
 Lammel R (2008) Google’s MapReduce Programming Model, revisited. Sci Comput Program 70:1–30
 Moffat A (1990) Implementing the PPM data compression scheme. IEEE Trans Commun 38(11):1917–1921. https://doi.org/10.1109/26.61469
 Quddus MA (2006) High integrity map-matching algorithms for advanced transport telematics applications, Ph.D. Thesis. Centre for Transport Studies, Imperial College London
 Quddus MA, Noland RB, Ochieng WY (2006) A high accuracy fuzzy logic based map matching algorithm for road transport. J Intell Trans Syst 10(3):103–115
 Schürmann T, Grassberger P (1996) Entropy estimation of symbol sequences. Chaos 6(3):414–427. https://doi.org/10.1063/1.166191
 Teahan WJ (1995) Probability estimation for PPM. In: Proceedings NZCSRSC’95
 Tiwari VS, Arya A, Chaturvedi SS (2013) Route Prediction using trip observations and map matching, advance computing conference (IACC). In: 2013 IEEE 3rd international, pp 583–587
 Tiwari VS, Arya A, Chaturvedi S (2014) Framework for horizontal scaling of map matching using MapReduce. In: IEEE, 13th international conference on information technology, ICIT 2014. http://www.icit2014.in/. Accessed 22–24 Dec 2014
 Zheng Y, Zhang L, Xie X, Ma W (2009) Mining interesting locations and travel sequences from GPS trajectories. In: Proceedings of international conference on World Wide Web (WWW 2009). ACM Press, Madrid, Spain, pp 791–800
 Zheng Y, Li Q, Chen Y, Xie X, Ma W (2008) Understanding mobility based on GPS data. In: Proceedings of ACM conference on Ubiquitous Computing (UbiComp 2008). ACM Press, Seoul, Korea, pp 312–321
 Zheng Y, Xie X, Ma W (2010) GeoLife: a collaborative social networking service among user, location and trajectory. IEEE Data Eng Bull 33(2):32–40
 Zhou J, Golledge R (2006) A three-step map matching methods in GIS environment: travel/transport study perspective. Int J Geogr Inf Syst. X, X