Data Processing, KNN Trials and Errors

Published in 99P Labs · 4 min read · May 3, 2021

Written by Adam Huth, Isabel Zavian, Ebru Odok and Charlie Duarte

Predicting how long a vehicle is going to remain parked (dwell time), and then where it’s going next, is no Sunday afternoon drive. In our last article, we discussed our journey of building a vehicle prediction model with 99P Labs. Millions of observations are collected for just one vehicle’s trip, and while having this much data lets us build more robust models, it has also led to various difficulties in setting up both models.

Our team successfully pulled in data using the 99P Labs API and pagination techniques, conditioning on query counts and geofencing within Ohio state limits. After testing the pagination method on smaller amounts of data, we pulled in approximately 200 million rows, which took about 16 hours and produced a 16 GB CSV file. Upon further inspection, we noticed that despite our efforts to test the pagination method, a majority of the rows in the file were duplicates. Fortunately, 99P Labs was able to work with us on this issue and provided a new SDK package that resolved the duplication errors.
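
As a rough illustration of the pagination approach, the sketch below keeps requesting pages until the API returns an empty result set. The endpoint, parameter names, and bounding-box values are placeholders, not the actual 99P Labs API or SDK calls.

```python
# Minimal sketch of a paginated pull. The endpoint and parameters are
# hypothetical placeholders, not the real 99P Labs API.
import requests
import pandas as pd

BASE_URL = "https://api.example.com/v1/telematics"  # placeholder endpoint
OHIO_BBOX = {"min_lat": 38.4, "max_lat": 42.0, "min_lon": -84.8, "max_lon": -80.5}

def pull_pages(page_size=10_000, max_pages=None):
    frames, offset, page = [], 0, 0
    while max_pages is None or page < max_pages:
        resp = requests.get(
            BASE_URL,
            params={"limit": page_size, "offset": offset, **OHIO_BBOX},
            timeout=60,
        )
        resp.raise_for_status()
        rows = resp.json().get("results", [])
        if not rows:  # stop when the API returns an empty page
            break
        frames.append(pd.DataFrame(rows))
        offset += page_size
        page += 1
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# df = pull_pages(max_pages=5)
# df.to_csv("ohio_trips.csv", index=False)
```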

After we resolved the looping issues, we were able to pull in over 35 million rows of data, containing over 47,000 unique trips for around 970 unique cars. From here, our team preprocessed this dataset to get started on our prediction model for vehicle dwell times. At the beginning of the semester, we wrote code to calculate the dwell time, but at that time our team only had access to a small subset of the data for a single car. To make the new data compatible with that older code, we grouped the new dataframe by unique vehicle identification number and unique sequence number to create a multi-index object. This allowed us to simply apply the function to each row of the multi-index to get the dwell times for each car and trip.
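
A minimal sketch of that grouping step is shown below, assuming illustrative column names ("vin", "sequence", "timestamp"); the real dataset’s schema and our dwell-time function differ.

```python
import pandas as pd

# Sketch: compute per-trip start/end times, then take the gap between one
# trip's end and the same vehicle's next trip start as the dwell time.
def trip_dwell_times(df: pd.DataFrame) -> pd.DataFrame:
    trips = (
        df.groupby(["vin", "sequence"])["timestamp"]
          .agg(trip_start="min", trip_end="max")
          .reset_index()
          .sort_values(["vin", "trip_start"])
    )
    # dwell time = gap between this trip's end and the next trip's start
    trips["next_start"] = trips.groupby("vin")["trip_start"].shift(-1)
    trips["dwell_time"] = trips["next_start"] - trips["trip_end"]
    return trips
```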

From here, we had a good starting point for the modelling process. Using the GeoPandas library, we plotted all of the trips on a map of Ohio and noticed that a majority of them were in Columbus, OH:

Visualization: Map of all routes in our dataset
Visualization: map of all trip ending locations
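
For reference, plots like the ones above can be produced with GeoPandas roughly as follows; the shapefile path and column names are illustrative assumptions.

```python
# Rough sketch of plotting trip points over an Ohio basemap with GeoPandas.
import geopandas as gpd
import matplotlib.pyplot as plt

def plot_trips(trips_df, ohio_shapefile="ohio_boundary.shp"):
    ohio = gpd.read_file(ohio_shapefile)  # state boundary polygon (placeholder path)
    points = gpd.GeoDataFrame(
        trips_df,
        geometry=gpd.points_from_xy(trips_df["longitude"], trips_df["latitude"]),
        crs="EPSG:4326",
    )
    ax = ohio.to_crs("EPSG:4326").plot(color="white", edgecolor="black", figsize=(8, 8))
    points.plot(ax=ax, markersize=1, alpha=0.2)
    plt.title("Trip locations in Ohio")
    plt.show()
```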

Given this information, we decided to narrow our geofence further to just Columbus. We used the latitude, longitude, and dwell-time duration columns of the dataframe to build clusters with Density-Based Spatial Clustering of Applications with Noise (DBSCAN). We then trained both a KNN and an XGBoost model to predict which cluster a given location belongs to. These predicted clusters act as an auxiliary step for dwell-time prediction: the model reports the median and mean dwell time of the predicted cluster. Instead of predicting a dwell-time duration on a continuous range, we opted to predict one of three intervals: 0–3 hours, 3–6 hours, and 6+ hours. To give context to the model’s predictions, we also report the cluster’s dwell-time variance and a confidence level. To summarise: we first create clusters in 3-dimensional space, then train a model to predict which cluster a new data point belongs to, and finally use the statistics of that predicted cluster to make a dwell-time prediction.

Visualization: DBSCAN location clusters based on dwell time
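
Putting the pieces together, here is a simplified sketch of the cluster-then-predict idea described above, using scikit-learn’s DBSCAN and KNeighborsClassifier; the column names, scaling, and eps/min_samples values are assumptions, not our tuned settings.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def fit_cluster_model(df: pd.DataFrame):
    # cluster in 3-D space: latitude, longitude, dwell time (hours)
    X = StandardScaler().fit_transform(df[["latitude", "longitude", "dwell_hours"]])
    labels = DBSCAN(eps=0.3, min_samples=20).fit_predict(X)
    df = df.assign(cluster=labels)
    core = df[df["cluster"] != -1]  # drop DBSCAN noise points
    # train a classifier to assign new locations to an existing cluster
    knn = KNeighborsClassifier(n_neighbors=5).fit(
        core[["latitude", "longitude"]], core["cluster"]
    )
    # per-cluster dwell statistics used to report an interval, not a point estimate
    stats = core.groupby("cluster")["dwell_hours"].agg(["mean", "median", "var"])
    return knn, stats

def predict_dwell_interval(knn, stats, lat, lon):
    cluster = knn.predict([[lat, lon]])[0]
    median = stats.loc[cluster, "median"]
    if median < 3:
        return "0-3 hours"
    elif median < 6:
        return "3-6 hours"
    return "6+ hours"
```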

Additionally, we trained an XGBoost model without any prior clustering, on the raw, geofenced location data. This model achieves high accuracy (near 80% on the test set) when trained on a binary classification problem: predicting whether the dwell time will exceed 10 hours. We then use DBSCAN post-prediction to cluster the decisions and provide informational context for the predictions.
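
A condensed sketch of that binary classifier is shown below; the feature columns and hyperparameters are illustrative rather than the exact configuration we used.

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_long_dwell_classifier(df):
    # raw, geofenced location features; column names are assumptions
    X = df[["latitude", "longitude"]]
    y = (df["dwell_hours"] > 10).astype(int)  # 1 = parked longer than 10 hours
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model = xgb.XGBClassifier(n_estimators=200, max_depth=6)
    model.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
    return model
```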

For this model, our next steps include fine-tuning the hyperparameters to achieve higher accuracy and confidence. Deepnote very kindly upgraded our cloud resources to their best and most capable GPU, allowing us to train and test tweaks to the prediction models faster than ever before. This will not only improve the accuracy of the predictions but also the overall run time of the model. At the moment, the dwell-time duration predictions are trained separately from the Markov model that is responsible for predicting the vehicle’s next location. Therefore, our next steps also include connecting the two so that each predicted next location comes with an associated probability and a predicted dwell time.

Our next article is going to dive deeper into the details about our Markov model, which took tremendous time and effort to implement from scratch. We have been able to make straightforward and relatively accurate predictions about a vehicle’s next destination and are currently in the process of improving it.

Hondezvous is a team project working with 99P Labs by Charlie Duarte, Nikhil Dutt, Adam Huth, Ebru Odok, Xuerui Song, and Isabel Zavian
