Accidents and Traffic Violations Analysis — Bogota

Iker
12 min read · Jan 23, 2023
Traffic Accidents — Map Visualization

Versión en Español

A couple of weeks ago, a casual trip across the city was delayed by a traffic accident; fortunately (for me, at least), I wasn’t involved, but it was severe. That got me thinking about traffic accidents in Bogota: the city ranks 4th worldwide on the TomTom index for urban congestion, with drivers losing 126 hours per year in traffic [i]. And that’s how this article was born.

The data used in this analysis is available through an ElasticCloud instance, which contains all the data, dashboards, and maps. The platform details are as follows:

- URL: Not Available
- User and Password: Not Available
- Active Period: Not Available
- Time Frame: Not Available
- Workspaces: Not Available
- Raw Data & Scripts Repository: https://github.com/IkerSaint/Accidentes-Bogota

Once the active period ends, the platform and all the information it contains will be deleted. If you want access to the Elastic data or want to explore it in another Elasticsearch instance, feel free to leave a private note (How to write a private note on Medium) with your email, and I will provide an index snapshot and the Kibana dashboards to restore into the new cluster.

Give me Data!

Datos Abiertos [ii] is a platform operated by the Bogotá District Mobility Office that provides information about traffic and mobility in Bogota; even if some of the visualizations don’t work well, we are more interested in the data. There are multiple categories that we can access easily on the portal. Still, most datasets are not that interesting or complete, or lack relevant data. For our specific use case, we select only the datasets associated with traffic violations and accidents. The selected datasets and their stats are the following:

The next logical step is an exploratory analysis. From the general view, we face a good amount of information. Even if it’s okay to start exploring the datasets using standard tools like Excel, it’s not going to be enough for the analysis (sorry, Excel fans, it’s never enough), so we have to use Python and pandas :).

import numpy as np
import pandas as pd

# Load the accident-related datasets downloaded from Datos Abiertos
accidentes = pd.read_csv('./data/Siniestros/Siniestros_Viales_Bogota_Accidente.csv')
lesionados = pd.read_csv('./data/Siniestros/Siniestros_Viales_Bogota_Lesionados.csv')
muertos = pd.read_csv('./data/Siniestros/Siniestros_Viales_Bogota_Muertos.csv')
vehiculos = pd.read_csv('./data/Siniestros/AnA1lisis_Siniestralidad_vehiculos.csv')
causas = pd.read_csv('./data/Siniestros/AnA1lisis_Siniestralidad_causas.csv')

# Show full column contents; print() each preview, since in a notebook
# only the last bare expression of a cell is displayed
pd.set_option('display.max_colwidth', None)
print(accidentes.head())
print(lesionados.head())
print(muertos.head())
print(vehiculos.head())
print(causas.head())

On the other hand, I suspect those datasets come from a SQL database and can be joined on different columns (FORMULARIO seems like the primary key), with the notable exception of the “Comparendos” datasets. Our first thought is to reverse the process and create a SQL database with all the info; after all, Grafana/Databox/Tableau/PowerBI + TimescaleDB can handle this data volume easily and with excellent performance. However, for time-series data, I usually lean towards NoSQL alternatives, and no, not our good friend InfluxDB, simply because choosing which columns are measurements, on top of the high cardinality, can very quickly become a complete nightmare. In this case, we want to use Elasticsearch or OpenSearch (choose your poison), not only because it’s easier to create the template but also because the analysis and visualizations are intuitive and easy to build.
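As a quick illustration of the SQL-style relationship, here is a minimal sketch of joining two of the datasets on FORMULARIO (toy rows and invented values, not the real data):

```python
import pandas as pd

# Hypothetical sample: accidents and injured persons share a FORMULARIO key
accidentes = pd.DataFrame({
    "FORMULARIO": ["A-001", "A-002"],
    "FECHA": ["2019-03-01", "2019-03-02"],
})
lesionados = pd.DataFrame({
    "FORMULARIO": ["A-001", "A-001", "A-002"],
    "EDAD": [34, 28, 51],
})

# Left join: one row per injured person, with the accident info repeated
joined = accidentes.merge(lesionados, on="FORMULARIO", how="left")
```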

What About Traffic Cameras?

There is a discrepancy between the Datos Abiertos platform and the National Road Safety Agency (ANSV) data: 92 and 129 camera records, respectively. Data from the ANSV appears to be more reliable and includes more information about the cameras. In the end, the information is not that helpful, as the traffic violations already include the detection device (maybe useful if we derive and calculate a new column, “DistanceToCamera”).

Data Aggregation

By checking all the columns and information available, we quickly realize we need to drop some columns and perform multiple operations, including aggregations, over the remaining ones. This may seem weird from a “Normalization” [iii] perspective. Still, in our NoSQL case, we want to perform the opposite, “Denormalization” [iv], to increase performance while sacrificing the ability to make joins easily. The final templates [v] for ingesting data into Elastic are the following:
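A minimal sketch of what one such index template body might look like; the field names below are my assumptions, and the real templates live in the project repository:

```python
# Hypothetical Elasticsearch index template body (composable template style);
# field names are illustrative, not the article's exact mappings.
accidentes_template = {
    "index_patterns": ["accidentes-*"],
    "template": {
        "mappings": {
            "properties": {
                "FECHA": {"type": "date"},         # accident timestamp
                "GRAVEDAD": {"type": "keyword"},   # severity, exact-match only
                "GEO": {"type": "geo_point"},      # lat/lon for map layers
                "GEOHASH": {"type": "keyword"},    # precomputed grouping key
            }
        }
    },
}
```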

I don’t pretend to bore you with all the transformations over the datasets; if you like, you can check the Jupyter notebook in the project repository. Here is the graphic summary:

Accidents Dataset
Traffic Violations Dataset
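The core denormalization move can be sketched like this (column names and rows are hypothetical): collapse the one-to-many “vehicles per accident” relation into a single list-valued column, ready for a document store.

```python
import pandas as pd

# Hypothetical sample of the vehicles-per-accident relation
vehiculos = pd.DataFrame({
    "FORMULARIO": ["A-001", "A-001", "A-002"],
    "CLASE": ["AUTOMOVIL", "MOTOCICLETA", "BUS"],
})

por_accidente = (
    vehiculos.groupby("FORMULARIO")["CLASE"]
    .apply(list)           # each group's values become one Python list
    .reset_index()
)
# Each accident now holds its vehicles in one cell; Elasticsearch indexes
# arrays naturally, but most standard pandas workflows no longer apply.
```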

It’s worth mentioning several things at this point:

  1. “Causas” data is not very reliable, as there are multiple “Otra/Otras” (“Other”) values listed as the traffic accident cause.
  2. “Comparendos” data has problems with latitude and longitude coordinates, hour, and violation code; to be honest, I don’t trust the accuracy of several columns: there are multiple records with the same lat/lon in different localities (“Localidades”), wrong hours, and a lot of other problems.
  3. Groupby operations also apply a lambda function to convert groups into objects, so we can assign a list of objects to a column. This is not even close to standard pandas practice. Be aware of this if you plan to use the resulting dataset for other purposes, like Machine Learning, statistical aggregations, etc.
  4. Deriving columns that are already present (month, day, day of the week) may seem unnecessary; in this case, it is just to make sure we have the correct information and format in each column.
  5. We derive a couple of new fields: one is the GEO field, a point representation of latitude and longitude for Elastic; the second is GEOHASH [vi], a special column that lets us group data very easily at a specific precision level (±610 meters in our case). This is very handy for multiple purposes, as we are no longer limited to exact lat and lon coordinates.

Finally, we can ingest the data into our cluster and start the analysis process…

Data Visualization and Analysis

There are two different indices, one for accidents, and one for traffic violations. Using those, we create three dashboards, one for accidents, one for traffic violations, and one that includes the general map, and the traffic accidents and violations ratio along with other columns (Localities, Month, Day of the Week, Geohash, etc.).

Welcome Dashboard

By playing a bit with the visualizations, we can extract many different insights, some obvious and others that came as a surprise (at least to me).

Traffic Accidents

  • Cars are the number one vehicle involved in accidents; however, motorcycles along with bicycles are way more likely to cause injuries and death in accidents. On the other end, SUVs/Pickups are the safest, even safer than public transport vehicles.
Traffic Accidents — Top 10 type of vehicles and Severity
  • The number one cause of accidents across all vehicles is not keeping a safe distance from other vehicles. The second one varies across vehicle types but, generally speaking, is either aggressive driving or failure to observe traffic signs or signals. The same three causes apply to accidents that involve injuries, but not to those that involve deaths; in that case, the top three causes are failure to observe traffic signs or signals, speeding, and driving between vehicles, and for every single one of those causes the top vehicles involved are motorcycles.
  • One very weird and, I have to say, almost ridiculous accident type that often involves injuries and deaths is the fall of a passenger, with a staggering 329 total injuries and 18 deaths caused by driving with the doors open.
Traffic Accidents — Accident Types and Top 5 Causes
  • The top three days of the week are the expected Friday and Saturday, with a rare addition: Tuesday. Obviously, with a high number of accidents comes a high number of injuries and deaths on those days; even more, if you are a woman, there is a greater chance of dying on a Tuesday or Monday than on a Saturday.
Traffic Accidents, Injuries and Deaths by Day of the Week and Hour
Be Careful Girls!
  • The top ten locations can be easily identified using the geohash field; they are fairly distributed, but some of them seem a little off and not very intuitive; probably worth a close-up inspection.
Traffic Accidents — Top 10 Locations
  • Finally, not all deaths by traffic accident are immediate: almost half of the deaths happen after the accident, and 95% of those occur between one day and six months after it.
Traffic Accidents — Time distribution of deaths after the accident

Even if some of the data can be a little weird, there are no huge surprises in the general view; the situation is not the same for traffic violations, as we will see.

Traffic Violations

With this particular dataset, there are several concerns; overall, I’m not so sure about data reliability, partly because there are a lot of discrepancies in multiple columns, but mostly because it just doesn’t make a lot of sense at a high level (also, almost two whole months of data are missing: 2021–05 and 2021–06).

  • The traffic violations timeline shows a shift in the detection devices being used; by the end of 2018 and the beginning of 2019, the DEAP (Electronic Devices for Police Assistance) became the predominant detection device; however, with the pandemic, and up to the end of 2021, the top detection devices shifted almost exclusively towards traffic cameras and on-the-road operations. This may be due to some under-reporting by the Traffic Office; after all, we are already in 2023, and there is zero data for 2022.
Traffic violations — Timeline by month breakdown by Detection Device
  • As with the previous point, this is a weird one; the most prevalent type of infraction has shifted from C02 (parking in prohibited places) to C29 (speeding); we could blame traffic cameras, but the proportion between C29 and C02 at the end of 2021 is 1 to 4.
Traffic Violations — Timeline by month breakdown by Violation Code
  • Traffic violations are mostly the same for different types of cars, the top two being C02 (parking in prohibited places) and C14 (violation of the infamous “Pico y Placa” restriction). For motorcycles, the top one is different, C24 (failure to observe traffic signs or signals while driving a motorcycle), and the second one is C02; it seems motorcyclists are lousy drivers.
Traffic Violations — Aggregation by Vehicle Type and Violation Code
  • Contrary to what we might think, most traffic violations occur on Wednesdays and Thursdays, with some exceptions in specific months; the days with the fewest infractions are Saturdays and Sundays. This doesn’t correlate with the data from traffic accidents and suggests a different prioritization by the Mobility Office, or maybe, just maybe, I’m too biased…
Traffic Violation — Aggregation by Month and Day of the Week

Maps and Correlation (or Trying To…)

The primary way I found to correlate traffic violations and traffic accidents was by their ratio; it can be computed per location, type of vehicle, or even time feature (year, month, day, etc.).
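The idea can be sketched with toy numbers (invented, not the real counts): aggregate both indices by the geohash cell and divide.

```python
import pandas as pd

# Hypothetical per-geohash counts of accidents and traffic violations
acc = pd.DataFrame({"GEOHASH": ["d2g6d", "d2g6e"], "accidents": [10, 4]})
vio = pd.DataFrame({"GEOHASH": ["d2g6d", "d2g6e"], "violations": [250, 1288]})

# Violations-to-accidents ratio per cell: "for each accident in this
# cell, how many violations were recorded?"
ratio = acc.merge(vio, on="GEOHASH")
ratio["ratio"] = ratio["violations"] / ratio["accidents"]
```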

  • The top ten locations with the most accidents don’t correlate with the top ten locations with the most traffic violations. We might think this is due to reactive action from the Mobility Office: they assess which locations have the highest number of accidents and act accordingly. That is partially true; such a trend exists for a minority of the top ten locations, at least within the same year or with a one-year difference, but it’s not completely clear how the decision process is made or whether those locations are getting any special attention (also, a pandemic year makes the analysis more difficult).
Top Ten — Accidents and Traffic Violations by Locations
  • The top location for traffic violations (by ratio) is the airport, with more than 43k traffic violations and a 325.11 ratio (for each accident there are 325.11 traffic violations); the second one is some godforsaken point way in the south of the city (a few hundred meters away from the Doña Juana landfill) with 6k traffic violations and a 322.22 ratio. For comparison, the third and fourth positions have a ratio below 150. Further investigation may include which traffic violations are associated with the top two locations.
Top ten locations by Ratio
  • The Ratio by locality doesn’t align with the traffic accidents, and, from the map visualization, it may seem that there is a large cluster right in the city center. However, as I mentioned before, the traffic violation location is unreliable, and it may be difficult to draw any conclusions from the data.
Ratio by Localities
Traffic Violations Map
  • The ratio is almost twice the median in the second week of January, maybe because there is a high amount of traffic on those days, along with special operations.
Ratio by Month and Day

Final Comments and Closing

Overall, Bogotá’s stats for traffic accidents, injuries, and deaths are not so bad compared to the world average [vii], probably because you don’t have many chances to travel at high speed, which reduces the risk of the most severe accidents. However, we can always try to improve; special attention to the top accident locations can help prevent many unnecessary deaths, as can a focus on preventive actions using Machine Learning algorithms or real-time insights from the data.

In the end, this article does not pretend to be the main reference for traffic violations and accidents in Bogota (I’m not a mobility expert), but a starting point. The platform and data are open to anyone for further analysis, drawing new conclusions, or even detecting new patterns. I invite you to play with the platform, dig deeper into the general findings or make your own, and share them with the world.

To close: if you have accessed the platform, you have probably already identified two different workspaces:

  • Base (Read Only): This space contains all the visualizations and dashboards used as a reference for this article. You can explore and filter all the data but cannot modify anything there.
  • Playground: This is a space for anyone to create their own visualizations and explore the data in detail; however, be aware that all users share the same account, which means that your visualizations, maps, work, or any other information are available for anyone to read, modify, or even delete. (I trust you will behave… no, I don’t.)

FAQ

  • When I try to access the platform, I get the message, “This site can’t be reached.”
    You are probably trying to access the platform outside of the time frame (23:00 and 05:00 UTC) or active period (18/01/2023–18/03/2023).
  • I try to access the platform but get “user or password is incorrect.”
    Check the credentials in this article and try again.
  • I have tried to create a new visualization, but I don’t know how
    Make sure that you are in the “Playground” space. You also may want to check the following guide https://www.elastic.co/guide/en/kibana/current/get-started.html#explore-the-data.
  • How can I get access to the raw data
    Raw data and scripts are available in the GitHub repository.
  • I’m trying to create a new dashboard/visualization, but it doesn’t seem to be working
    Leave a comment on this article, and hopefully, I (or someone) will assist you.
  • Platform feels slow and not responsive enough
    This is a tough one; it depends on the resources available and currently active users on the platform; please try on different timeframes.
  • Is it possible to predict Bogotá’s accident rates using parts of, or the entire, datasets?
    Sure, it’s not that hard, as time-series prediction algorithms are well known. You can find multiple examples (Time-related feature engineering — scikit-learn documentation — https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html#sphx-glr-auto-examples-applications-plot-cyclical-feature-engineering-py) that work very well on this dataset (tip: use the geohash field and change the precision to your preference), or you can use an AutoML platform. I have successfully tried a few, but this article is long enough, so I don’t want to include additional sections dedicated to ML.
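One core trick from that scikit-learn example can be sketched in a few lines: cyclical encoding maps hour-of-day onto a circle so that 23:00 and 00:00 end up close together, something plain integer hours cannot express.

```python
import numpy as np
import pandas as pd

# Cyclical encoding sketch (not the article's exact pipeline):
# project each hour onto the unit circle via sine and cosine.
hours = pd.Series(range(24), name="hour")
features = pd.DataFrame({
    "hour_sin": np.sin(2 * np.pi * hours / 24),
    "hour_cos": np.cos(2 * np.pi * hours / 24),
})
```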
My cat after a terrible accident — Tickling by a Human

[i] Bogota traffic report | TomTom Traffic Index — https://www.tomtom.com/traffic-index/bogota-traffic
[ii] Datos Abiertos Secretaría Distrital de Movilidad — https://datos.movilidadbogota.gov.co
[iii] Database Normalization — https://www.w3schools.in/dbms/database-normalization
[iv] Denormalization in Databases — GeeksforGeeks — https://www.geeksforgeeks.org/denormalization-in-databases/
[v] Index Templates | Elasticsearch — https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html
[vi] Geohash encoding/decoding — https://www.movable-type.co.uk/scripts/geohash.html
[vii] Global status report on road safety 2018 — https://www.who.int/publications/i/item/9789241565684


Iker

CyberSecurity, Information Security, Tech and Data Enthusiast, Amateur Developer