How to Audit a Presidential Election — Colombia Edition — Proof of Concept

Iker
9 min read · Jun 9, 2022
Final Results

This article is entirely apolitical; it favors no presidential candidate. The system is a proof of concept that counts votes and identifies possible anomalies, making it possible to audit Colombia’s presidential voting process; it covers the election held on May 29, 2022. Developing the system took approximately 60 hours.

On May 29, 2022, Colombia’s presidential elections took place. Beyond the result, followers of multiple candidates denounced fraud in the election process on most social networks, arguing that the forms used, called E-14, contain corrections that add votes to one candidate or another.

Presidential Election Results — First Round — Source: https://cnnespanol.cnn.com/2022/05/30/elecciones-presidenciales-colombia-2022-resultados-reacciones-primera-vuelta-orix/

E-14…

The curious thing about these “complaints” is that, although they come from all political fronts, most of them, at least from my point of view, come from followers of the candidate who finished first with more than 40% of the total votes.
The problem with these complaints is that carrying out a general audit from an external perspective is an incredibly complicated and manual process.

I do not intend to detail the Colombian vote counting process, which can be reviewed at this link. The point to remember is that one of the most critical elements is the E-14 form. This form contains all the information about the candidates, their political parties, and the votes obtained by each one. The National Registry publishes one of the three copies of this form on its website. The National Registry is the entity in charge of aggregating and publishing the official election results.

Presidential Elections 2022 E-14 Form — Source: https://www.infobae.com/america/colombia/2022/04/18/asi-sera-el-nuevo-formulario-e-14-para-las-elecciones-presidenciales-en-colombia/

Where are my Votes?

If we go to the National Registry’s page, it is possible to search for any of the roughly 103,000 E-14 forms, one for each polling table installed in the country (https://divulgacione14presidencia.registraduria.gov.co). However, because the portal is a Single Page Application (SPA) and uses reCAPTCHA, traditional tools cannot automate the download of the E-14 forms.

E-14 form publication portal, National Registry

At a logical level, counting the votes on the E-14 forms and identifying possible anomalies in them requires the following steps:

  1. Download the 103,364 E-14 PDF forms through the portal navigation
  2. Convert the PDF file to an image (PNG if possible)
  3. Extract the fields that contain the votes of each of the candidates
  4. Identify the number, asterisk, or the absence of a character in the corresponding fields
  5. Identify if there is any inconsistency within the forms/votes
  6. Store the information, correlate it and present it for audit (if necessary)
E-14 form, the votes obtained are highlighted in red

Ultron… There are no Strings on Me…

Given the restrictions that apply to each of these steps, specific technologies make it possible to get close to our goals of counting the votes and, at the same time, auditing the E-14 forms. To this end, Ultron was created as a proof of concept: a system with an architecture capable of overcoming the identified restrictions. The high-level architecture of the system is as follows:

Recommended soundtrack to analyze Architecture: https://www.youtube.com/watch?v=I1968HY4DKc

Sources: The National Registry’s official SPA-style page, protected by reCAPTCHA, where the E-14 forms are published. Keep in mind that this portal is relatively slow (I assume because of the high demand); elements and information take several seconds to load, which is the equivalent of years for a computer.

Extraction: Although there are solutions that bypass systems such as reCAPTCHA and query SPA pages, we opted for “Puppeteer”, a library that lets us drive Chrome/Chromium. To overcome the problems associated with reCAPTCHA, we used an instance of Google Chrome on a user’s computer (my computer). The Puppeteer script detects when reCAPTCHA is activated and waits a few seconds for the user to solve the captcha; since it is a browser with a regular user session, the captcha rarely appears (roughly two or three times for every 500 downloaded forms). In about 30 minutes, 1,016 E-14 forms were downloaded (approximately 1% of the total tables in the country), all from the department of Antioquia (it’s just easier) and from various municipalities and zones.
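The "detect the captcha, then wait for the human" behaviour can be sketched as a small polling helper. This is only an illustration in Python, not the Puppeteer code used in Ultron: `captcha_visible` is a hypothetical callable that would ask the browser whether the reCAPTCHA challenge is on screen, and the timeout and polling interval are assumed values.

```python
import time
from typing import Callable


def wait_for_captcha(
    captcha_visible: Callable[[], bool],
    timeout_s: float = 120.0,   # assumed: how long we give the user
    poll_s: float = 2.0,        # assumed: how often we re-check the page
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Block while the captcha is shown; return False if the timeout expires.

    `captcha_visible` is a hypothetical hook into the browser session.
    """
    deadline = clock() + timeout_s
    while captcha_visible():
        if clock() >= deadline:
            return False  # the user did not solve the captcha in time
        sleep(poll_s)
    return True  # no captcha on screen, the download can continue


# Toy run: the captcha is visible for two checks, then the user solves it.
states = iter([True, True, False])
print(wait_for_captcha(lambda: next(states), sleep=lambda _: None))
```

Passing `sleep` and `clock` as parameters keeps the helper easy to test without a real browser or real waiting.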

Ultron in Action!

Transformation: This component carries out three processes: extracting the fields that contain the votes for each candidate; identifying the number, asterisk, or absence of a character in each field; and identifying any inconsistency. It was initially developed with three microservices and a database:

  • p2i (PDF to Image): A microservice whose sole purpose is to convert files from PDF to PNG format; it also reduces the resolution of the resulting image to make the process more efficient.
  • icutter (Image Cutter): A microservice that takes a PNG file as input and makes one or more crops, returning them as new PNG images. Only the vote fields of the candidates Gustavo Petro, Rodolfo Hernández and Federico Gutiérrez were cropped.
  • mldigits (Machine Learning Digits): A microservice that exposes a machine learning model for number recognition. The base data set is MNIST-784, enlarged through expansion (moving the image a few pixels to the left, right, up, and down) and augmentation (rotating the image a few degrees to the left or right) techniques, plus 6,000 samples of asterisks and empty boxes extracted with the other microservices. A scikit-learn KNN classifier was trained on a total of 565 thousand samples of numbers and asterisks. The output of this microservice is a vector of numbers giving the probability that the input image corresponds to each of the options.
mldigits example output
  • Database: The database is Elasticsearch, since it includes the analytics tools for the process under the same interface. The information is stored in JSON format, with the following fields:
  1. Candidate: Candidate name.
  2. First, Second and Third Box Numbers: Numbers identified by mldigits in each of the respective boxes of the E-14 form.
  3. Number Probabilities: Probabilities generated by mldigits for each of the analyzed images. The number 10 was used as the label for an asterisk, cross-out, or empty cell on the E-14 forms.
  4. Votes: Number of votes according to the numbers identified in the boxes.
  5. Images: Names of the images obtained from each of the boxes corresponding to the votes. The system sends these images to an S3 bucket so they can be visualized within the analytics interface.
  6. Votes - Confidence: The ML model’s confidence level in each of the three identified numbers.
  7. General - Confidence: Average confidence level from the ML model on the three identified numbers.
Database entry for candidate Rodolfo Hernandez
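As a rough illustration of how the fields above relate to each other, the following Python sketch turns three probability vectors (one per box) into a record with digits, votes, and confidence levels. The field names, the `peaked` helper, and the rule that a box labelled 10 counts as a zero digit are assumptions made for this example, not the actual Ultron schema.

```python
from statistics import mean

ASTERISK = 10  # label the article uses for asterisk / cross-out / empty cell


def read_box(probabilities):
    """Return (label, confidence) for one box from its probability vector."""
    label = max(range(len(probabilities)), key=probabilities.__getitem__)
    return label, probabilities[label]


def build_record(candidate, box_probabilities):
    """Combine the three per-box predictions into one JSON-like document."""
    labels, confidences = zip(*(read_box(p) for p in box_probabilities))
    # Assumption: a box labelled ASTERISK contributes a zero digit.
    digits = [0 if label == ASTERISK else label for label in labels]
    votes = digits[0] * 100 + digits[1] * 10 + digits[2]
    return {
        "candidate": candidate,                    # hypothetical field names
        "boxes": list(labels),
        "votes": votes,
        "votes_confidence": list(confidences),
        "general_confidence": mean(confidences),   # average over the 3 boxes
    }


def peaked(label, p):
    """Toy probability vector: `p` on `label`, the rest spread evenly."""
    rest = (1.0 - p) / 10
    return [p if i == label else rest for i in range(11)]


# Toy form: first box empty, then a 4 and a 2.
record = build_record("Candidate A", [peaked(10, 0.9), peaked(4, 0.8), peaked(2, 0.7)])
print(record["boxes"], record["votes"])
```

Keeping the per-box confidence next to the averaged one makes it possible to see later which specific box pulled the overall confidence down.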

Analytics: Kibana, on top of Elasticsearch, provides the dashboards and aggregates results and analytics on the information collected.

These components run on a 4-node Kubernetes cluster and a server with Docker. Orchestration is performed by Node-RED, which calls the microservices and stores the information in Elasticsearch.

Node-Red flow used to orchestrate the process

To keep the article practical and short, and because it is not its goal, I do not go into the technical details of each component. In the future, I will probably write articles on AI/ML, microservices, Kubernetes, etc.

Results

Downloading and processing the 1,016 E-14 forms took approximately 50 minutes, although parallelizing the process a bit more could significantly reduce this time (I estimate 25–30 minutes).

Final Results

Comparing the aggregated data against a sample of 100 polling tables (about 10% of the downloaded forms, which in turn are about 1% of the polling tables in the country), the ML model reached about 95% accuracy for number recognition. Since errors were more common in the 2nd and 3rd digits, which weigh less in the total, the total votes of the tables differed from the actual value by about 2%.
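Why can the digit accuracy and the error in the vote totals differ so much? A short Python sketch makes the relation concrete. The tables below are made-up illustrative values, not data from the 100-table sample, and the two metrics are one plausible way to compute the figures quoted above, not necessarily how they were computed in Ultron.

```python
def digit_accuracy(predicted, actual):
    """Fraction of individual boxes read correctly across all tables."""
    pairs = [(p, a) for pred, act in zip(predicted, actual) for p, a in zip(pred, act)]
    return sum(p == a for p, a in pairs) / len(pairs)


def to_votes(digits):
    """Turn (hundreds, tens, units) into a vote count."""
    hundreds, tens, units = digits
    return hundreds * 100 + tens * 10 + units


def total_vote_error(predicted, actual):
    """Relative difference between predicted and actual vote totals."""
    predicted_total = sum(to_votes(d) for d in predicted)
    actual_total = sum(to_votes(d) for d in actual)
    return abs(predicted_total - actual_total) / actual_total


# Illustrative (made-up) tables: one misread box in each scenario.
actual = [(0, 4, 2), (1, 0, 5), (0, 8, 7), (0, 3, 1)]
units_misread = [(0, 4, 2), (1, 0, 5), (0, 8, 1), (0, 3, 1)]     # 87 read as 81
hundreds_misread = [(0, 4, 2), (7, 0, 5), (0, 8, 7), (0, 3, 1)]  # 105 read as 705

for name, predicted in [("units", units_misread), ("hundreds", hundreds_misread)]:
    print(name,
          f"digit accuracy {digit_accuracy(predicted, actual):.1%}",
          f"total error {total_vote_error(predicted, actual):.1%}")
```

Both scenarios misread exactly one box out of twelve, so the digit accuracy is identical, but the misread hundreds digit distorts the total far more than the misread units digit. That is why errors concentrated in the 2nd and 3rd boxes keep the total close to the real value.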

Count of Votes by Municipality (“Antioquia” is misplaced due to an error in the coordinates in the database)

One significant advantage of using an ML model to recognize the number of votes is that the model outputs not just the number but also a vector of probabilities. These probabilities are the model’s level of “confidence” that the number is 1, 2, asterisk, etc., a measure that says more about the ML model than about the data.
However, a difference of more than a couple of percentage points in the confidence level between candidates on the same form tells us that, although the same person supposedly wrote both results in the same handwriting, some are more easily recognizable than others, which gives us clues to anomalies. Similarly, a low or very low confidence level (below 75%, for example) also indicates possible problems, mainly of three types:

  • Data format: Errors mainly associated with misalignment of the form image, blurred numbers, ink stains, etc.
  • ML algorithm errors: Errors caused by factors such as the data set used and the handwriting of the jurors who fill out the forms; some numbers can be ambiguous, either because they are too slanted or because the handwriting makes it difficult to tell, for example, a 4 from a 9.
  • Errors in the form: Problems related to the information within the form itself; these can be strikeouts, amendments, and numbers that are poorly written, ambiguous, supplemented, or superimposed.
Examples of Low Confidence Results (Only one of these had explanatory notes on the E-14 form)
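The two signals described above (a low confidence level and a confidence gap between candidates on the same form) can be sketched as a simple rule in Python. The 0.75 threshold is the example value from the text; the 0.02 gap is an assumed reading of "a couple of percentage points", and the function and variable names are hypothetical.

```python
LOW_CONFIDENCE = 0.75  # example threshold mentioned in the text
MAX_GAP = 0.02         # assumed reading of "a couple of percentage points"


def flag_table(confidences, low=LOW_CONFIDENCE, max_gap=MAX_GAP):
    """Return the reasons one polling table deserves a manual look.

    `confidences` maps candidate name -> general confidence (0.0 to 1.0)
    for a single E-14 form.
    """
    reasons = []
    for candidate, confidence in confidences.items():
        if confidence < low:
            reasons.append(f"low confidence for {candidate}: {confidence:.2f}")
    gap = max(confidences.values()) - min(confidences.values())
    if gap > max_gap:
        reasons.append(f"confidence gap between candidates: {gap:.2f}")
    return reasons


# Toy forms: one clean, one where a single candidate's boxes are hard to read.
print(flag_table({"Candidate A": 0.97, "Candidate B": 0.96, "Candidate C": 0.97}))
print(flag_table({"Candidate A": 0.98, "Candidate B": 0.60}))
```

A flagged table is not evidence of fraud on its own; as the three error types show, it only tells the auditor which images to look at first.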

One of the most useful features, completely independent of the accuracy of the ML algorithm, is the ability to audit results quickly without going through each form, thanks to a unified view that shows only the candidate and the images. In case of any doubt, it is possible to check which specific form, department, municipality, zone, polling station, and table present the anomalies. For example, it is possible to audit more than 100 forms in just under a couple of minutes.

Anomaly Identification from the Audit View

Conclusions

  • From my point of view, complaining very, very loudly about things that “do not benefit me” is part of the political process, and I would dare to say even of our Colombian culture; it is precisely the only foundation of the majority of complaints that appear on social networks.
  • The reality is that, while developing this E-14 audit proof of concept, I couldn’t find a significant number of cross-outs or amendments that add (or remove) votes for the candidates, at least not enough to manipulate the result of the electoral process in the analyzed municipalities. Anomalies were found in 21 tables (about 2.07% of the 1,016 analyzed), all signed by at least 4 jurors and most with comments clarifying the error. Did these errors benefit any candidate in particular? Yes; however, I will not mention which candidate, because the errors were insignificant and to maintain the article’s apolitical tone.
  • Many improvements can be implemented within the project, particularly in extracting data from the E-14 form and identifying where the numbers, notes, and total votes are found. Also, further training the ML model or changing the algorithm (XGBoost, RandomForest, etc.), including moving to a deep learning model, could reduce its size and improve accuracy.
  • “Ultron” is a proof of concept and the sample is small, so it is not possible to extrapolate what is identified here to all the tables in the country; that would be like taking a glass of water from a lake, not finding a single fish, and concluding that there are no fish in the entire lake. This article does NOT aim to support any candidate, nor to defend the National Registry; it seeks to show strategies for auditing the process and how it can be done more efficiently and with the least amount of bias.
My cat bored by unsubstantiated posts on social media saying “The Election was stolen from the people!”

Iker

CyberSecurity, Information Security, Tech and Data Enthusiast, Amateur Developer