Developer Guide

Setup Development Environment

  1. Install pip and tox:

    sudo apt-get install python-pip
    sudo pip install tox
    
  2. Configure git pre-commit hook:

    sudo pip install flake8 pep8-naming
    flake8 --install-hook
    git config flake8.strict true
    

Launching executable from repository

To execute tse2sql from the repository you can:

  1. Install dependencies system-wide:

    sudo pip install -r requirements.txt
    PYTHONPATH=lib/ bin/tse2sql
    
  2. Load the py34 virtual environment:

    tox -e py34
    source .tox/py34/bin/activate
    PYTHONPATH=lib/ bin/tse2sql
    

Building Documentation

tox -e doc

Output will be available at .tox/doc/tmp/html. It is recommended to install the webdev package:

sudo pip install webdev

So a development web server can serve any location like this:

$ webdev .tox/doc/tmp/html

Running Test Suite

tox -e py27,py34

tse2sql.args reference

Argument management module.

Functions

parse_args([argv]) Argument parsing routine.
parse_args_scrapper([argv]) Scrapper argument parsing routine.
tse2sql.args.parse_args(argv=None)

Argument parsing routine.

Parameters:argv – A list of argument strings.
Rtype argv:list
Returns:A parsed and verified arguments namespace.
Return type:argparse.Namespace
tse2sql.args.parse_args_scrapper(argv=None)

Scrapper argument parsing routine.

Parameters:argv – A list of argument strings.
Rtype argv:list
Returns:A parsed and verified arguments namespace.
Return type:argparse.Namespace

tse2sql.main reference

Application entry point module.

Functions

main(args) Application main function.
main_scrapper(args) Scrapper main function.
tse2sql.main.main(args)

Application main function.

Parameters:args (argparse.Namespace) – An arguments namespace.
Returns:Exit code.
Return type:int
tse2sql.main.main_scrapper(args)

Scrapper main function.

Parameters:args (argparse.Namespace) – An arguments namespace.
Returns:Exit code.
Return type:int

tse2sql.readers reference

TSE files parsing / reading module.

Classes

DistrictsReader Read and parse the Distelec.txt file.
VotersReader Read and parse the PADRON_COMPLETO.txt file.
class tse2sql.readers.DistrictsReader(search_dir)

Read and parse the Distelec.txt file.

The Distelec.txt file is a CSV file in the form:

101001,SAN JOSE,CENTRAL,HOSPITAL
101002,SAN JOSE,CENTRAL,ZAPOTE
101003,SAN JOSE,CENTRAL,SAN FRANCISCO DE DOS RIOS
101004,SAN JOSE,CENTRAL,URUCA
101005,SAN JOSE,CENTRAL,MATA REDONDA
101006,SAN JOSE,CENTRAL,PAVAS
  • It list the provinces, cantons and districts of Costa Rica.

  • It is encoded in ISO-8859-15 and uses Windows CRLF line terminators.

  • It is quite stable. It will only change when Costa Rica districts change (quite uncommon, but happens from time to time).

  • It is relatively small. Costa Rica has 81 cantons, and ~6 or so districts per canton. As of 2016, Costa Rica has 478 districts. As this writting, the CSV file is 172KB in size.

  • The semantics of the code is as following:

    <province(1 digit)><canton(2 digits)><district(3 digits)>
    

    Please note that only the province code is unique. Both canton and districts codes are reused and thus depend on the value of the previous code.

This class will lookup for the file and will process it completely in main memory in order to build provinces, cantons and districts tables at the same time. Also, the file will be processed even if some lines are malformed. Any error will be logged as such.

Inheritance

Inheritance diagram of DistrictsReader

analyse()

Return a small report with some basic analysis of the tables.

Returns:A dictionary with the analysis of data provided by the parsed file. In particular, the amount of provinces, cantons and districts, the largest name of those, and the bad lines found.
Return type:A dict of the form:
analysis = {
    'provinces': ...,
    'provinces_extended': ...,
    'province_largest': ...,
    'cantons': ...,
    'cantons_extended': ...,
    'cantons_largest': ...,
    'districts': ...,
    'districts_extended': ...,
    'districts_largest': ...,
    'bad_data': ...
}
parse()

Open and parse the Distelec.txt file.

After parsing the following attributes will be available:

Variables:
  • provinces – Dictionary with province id as key and name as value.
  • cantons – Dictionary with a tuple (province id, canton id) as key and name as value.
  • districts – Dictionary with a tuple (province id, canton id, district id) as key and name as value.
class tse2sql.readers.VotersReader(search_dir, distelec)

Read and parse the PADRON_COMPLETO.txt file.

The PADRON_COMPLETO.txt file is a CSV file in the form:

100339724,109007,1,20231119,01031,JOSE                          ,DELGADO                   ,CORRALES
100429200,109006,2,20221026,01025,PAULA                         ,QUIROS                    ,QUIROS
100697455,101023,2,20150620,00073,CARMEN                        ,FALLAS                    ,GUEVARA
100697622,101020,2,20230219,00050,ANTONIA                       ,RAMIREZ                   ,CARDENAS
100720641,108002,2,20241119,00884,SOLEDAD                       ,SEQUEIRA                  ,MORA
100752764,403004,1,20151208,03731,EZEQUIEL                      ,LEON                      ,CALVO
100753244,210012,2,20161009,02599,CONSTANCIA                    ,ARIAS                     ,RIVERA
100753335,115001,2,20180211,01362,MARGARITA                     ,ALVARADO                  ,LAHMAN
100753618,111005,2,20220109,01168,ETELVINA                      ,PARRA                     ,SALAZAR
100763791,108007,1,20190831,00971,REINALDO                      ,MENDEZ                    ,BARBOZA
  • It lists all the voters in Costa Rica: their id, voting district, officialy sex (as if anyone should care), id expiration, voting site, name, first family name and second family name.

  • It is encoded in ISO-8859-15 and uses Windows CRLF line terminators.

  • It is quite unstable. Deaths and people passing 18 years are removed - added.

  • It is very large. As this writting, the CSV file is 364MB in size, with 3 178 364 lines (and thus, registered voters).

  • The semantics of the sex code is as following: 1: men, 2: women.

  • The format of the id expiration date is %Y%m%d as following:

    <year(4 digit)><month(2 digits)><day(2 digits)>
    

This class will interpret the file on the fly without loading it entirely on main memory. Also, the file will be processed even if some lines are malformed. Any error will be logged as such.

Inheritance

Inheritance diagram of VotersReader

open()

Open voters file for on-the-fly parsing.

tse2sql.render reference

SQL rendering module.

Functions

list_renderers() List availables templates.
render(payload, renderer, sqlfile) Render given payload using given renderer.
render_scrapped(data, renderer, sqlfile) Render given payload using given renderer.
tse2sql.render.list_renderers()

List availables templates.

Returns:The list of available templates.
Return type:list
tse2sql.render.render(payload, renderer, sqlfile)

Render given payload using given renderer.

Parameters:
  • payload (dict) – The payload to render.
  • renderer (str) – The name of the renderer to use.
  • sqlfile (file) – Output file descriptor to write to.
tse2sql.render.render_scrapped(data, renderer, sqlfile)

Render given payload using given renderer.

Parameters:
  • data (dict) – The scrapped data to render.
  • renderer (str) – The name of the renderer to use.
  • sqlfile (file) – Output file descriptor to write to.

tse2sql.scrapper reference

TSE voting center data scrapper module.

Functions

scrappe_data(samples) Scrapper main function.
tse2sql.scrapper.scrappe_data(samples)

Scrapper main function.

Parameters:samples (dict) – A dictionary with the ids samples.
Returns:A dictionary with the scrapped data of the form.
Return type:dict

tse2sql.utils reference

Utilities module.

Functions

is_url(url) Deterime if given string is an url.
ensure_dir(path) Ensure that a path exists.
download(url[, subdir]) Download given file in system temporal files folder.
sha256(filename) Calculate SHA256 of given filename.
unzip(filename) Unzip given filename.
get_file(search_dir, filename) Case-insensitive get file from a directory.
count_lines(filename) Count the number of lines in filename.
tse2sql.utils.is_url(url)

Deterime if given string is an url.

Parameters:url (str) – String to check if its a URL.
Returns:True if its an URL, False otherwise.
Return type:bool
tse2sql.utils.ensure_dir(path)

Ensure that a path exists.

Parameters:path (str) – Directory path to create.
tse2sql.utils.download(url, subdir=None)

Download given file in system temporal files folder.

Parameters:
  • url (str) – URL of the file to download.
  • subdir (str) – Subfolder name to store the downloaded file in the system temporal files folder.
Returns:

Local path where the file was stored.

Return type:

str

tse2sql.utils.sha256(filename)

Calculate SHA256 of given filename.

Parameters:filename (str) – Filename to calculate SHA256 from.
Returns:SHA256 hexidecimal digest.
Return type:str
tse2sql.utils.unzip(filename)

Unzip given filename.

The extraction folder will be determined by the archive filename removing the extension, including it’s parent folder.

Parameters:filename (str) – Path to the zip to extract.
Returns:The path where the archive was extracted.
Return type:str
tse2sql.utils.get_file(search_dir, filename)

Case-insensitive get file from a directory.

Parameters:
  • search_dir (str) – Directory to look for filename.
  • filename (str) – Case-insensitive filename to look for.
Returns:

The absolute path to the file.

Return type:

str

Raises:

Exception if file not found.

tse2sql.utils.count_lines(filename)

Count the number of lines in filename.

Parameters:filename (str) – Path to the filename.
Returns:The number of lines in the file.
Return type:int

tse2sql reference

tse2sql module entry point.

Modules

tse2sql.args Argument management module.
tse2sql.main Application entry point module.
tse2sql.readers TSE files parsing / reading module.
tse2sql.render SQL rendering module.
tse2sql.scrapper TSE voting center data scrapper module.
tse2sql.utils Utilities module.

Convertidor del Padron Electoral a SQL

_images/logo.png

SQL converter of the electoral registry published by the Costa Rican Supreme Electoral Tribunal.

Install

You require the Python packages installer for Python 3 pip3, see:

Then, install the tse2sql package:

sudo pip3 install tse2sql

Converter Usage

$ pip3 install tse2sql
$ tse2sql --help
usage: tse2sql [-h] [-v] [--version] [--renderer {mysql}] [archive]

SQL converter of the electoral registry published by the Costa Rican
Supreme Electoral Tribunal.

positional arguments:
  archive               URL or path to the voters database

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Increase verbosity level
  --version             show program's version number and exit
  --renderer {mysql}  SQL renderer to use

Where archive can be left empty and the known URL where the voters database is published will be used, currently:

http://www.tse.go.cr/zip/padron/padron_completo.zip

A local path to a .zip file can also be used to avoid download of the database if it is already in the local system.

When run, tse2sql will create a few files in the current working directory:

<digest>.<renderer>.sql:
 The SQL version of the database.
<digest>.data.json:
 Analysis of the Distelec.txt data. This JSON file provides a dictionary with the amount of provinces, cantons and districts, the largest name of those, and the bad lines found.
<digest>.samples.json:
 A fil with samples of voters ids for each district. This is the input file for the scrapper.

The whole process, downloading, extracting, parsing and writing the output will take several minutes to finish. tse2sql was optimized for memory usage, so expect high CPU usage while writing the outputs.

Scrapper Usage

tse2sql-scrapper --help
usage: tse2sql-scrapper [-h] [-v] [--version] [--renderer {mysql}] samples

TSE Voting Sites Scrapper

positional arguments:
  samples             Samples file with one id number per site id

optional arguments:
  -h, --help          show this help message and exit
  -v, --verbose       Increase verbosity level
  --version           show program's version number and exit
  --renderer {mysql}  SQL renderer to use

Some data isn’t published in CSV by the Supreme Electoral Tribunal:

  1. Voting center names. There is one voting center per district AFAWK.
  2. Voting centers addresses.
  3. Geographical coordinates of the voting center.

Currently there is a web service available at:

http://www.tse.go.cr/DondeVotarM/prRemoto.aspx/ObtenerDondeVotar

That allows to grab this information as following:

curl -i -X POST -H "Content-Type: application/json" -d '{"numeroCedula":"100763791"}' $WEB_SERVICE_ENDPOINT

This request will return the first two data as is and an url to Google Maps. The tse2sql-scrapper will parse that URL for the coordinates.

Databases

MySQL

The schema of the database is as follows:

_images/schema_mysql.png

You will need a functional MySQL server install, see:

This database uses FULLTEXT INDEX on a InnoDB engine, and thus, requires at least MySQL v5.6. On Ubuntu Linux:

sudo apt-get install mysql-server-5.6

Load the database and create a user for it:

$ mysql -u root -p
mysql> SET @start := NOW(); source <DIGEST>.mysql.sql; SET @end := NOW(); SELECT TIMEDIFF(@end, @start);
mysql> SET @start := NOW(); source <DIGEST>.scrapped.mysql.sql; SET @end := NOW(); SELECT TIMEDIFF(@end, @start);
mysql> GRANT ALL PRIVILEGES ON tsesql.* TO 'tse2sql'@'localhost' IDENTIFIED BY '<YOUR_PASSWORD>';

Sourcing the database will take several minutes. Once done you will most likely use the following query:

SELECT id_voter, name, family_name_1, family_name_2, sex, id_expiration,
    name_province, name_canton, name_district, site,
    voting_center_name, voting_center_address,
    voting_center_latitude, voting_center_longitude
FROM voter
    JOIN district ON voter.district_id_district = district.id_district
    JOIN canton ON district.canton_id_canton = canton.id_canton
    JOIN province ON canton.province_id_province = province.id_province
WHERE voter.id_voter = <id_voter>;

If you don’t have the voter id but you have the name you could use the following query (adding the + sign at the beginning of each word is important to get matching res):

SELECT id_voter, name, family_name_1, family_name_2, sex, id_expiration,
    name_province, name_canton, name_district, site,
    voting_center_name, voting_center_address,
    voting_center_latitude, voting_center_longitude
FROM voter
    JOIN district ON voter.district_id_district = district.id_district
    JOIN canton ON district.canton_id_canton = canton.id_canton
    JOIN province ON canton.province_id_province = province.id_province
WHERE MATCH(name, family_name_1, family_name_2)
    AGAINST ('+<name> +<family_name_1> +<family_name_2>' IN BOOLEAN MODE);

To implement the full search, MySQL uses Boolean logic, in which

+ stands for AND
- stands for NOT
[no operator] implies OR

The minimum default token size in InnoDB is 3 characters and the indexing engine ignores words shorter than this minimum size, then when the length of the token is minor than 3 no operator should be added to get more accurate results.

License

Copyright (C) 2016 Carolina Aguilar
Copyright (C) 2016 Carlos Jenkins

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.