pgreaper 1.0.0 Documentation¶
PostgreSQL Default Settings¶
Before uploading to Postgres, you may want to configure the default connection settings. If default settings are provided, PGReaper can use these to create new databases so you won’t have to create them manually.
Uploading TXTs and CSVs to Postgres¶
Loading TXTs and CSVs to PostgreSQL¶
Details¶
Loading Large Files¶
To conserve memory, all CSVs are read in chunks.
Schema Inference¶
PGReaper uses a custom CSV parser (written in C++) which simultaneously determines sanitizes and analyzes CSV files. It is capable of differentiating between strings, integers, and floats. The data type of an entire column is determined by this rule:
- If all values are integers, then the column type is bigint
- If all values are either floats or integers, then the column type is double precision
- Otherwise, the column type is text
- Minor Caveats:
- Trailing and leading whitespace is ignored when determining data types, so * ” 3.14 ” is considered a floating point number
- Quoted numeric fields are automatically unquoted (Postgres does not tolerate quoted numeric fields)
Corrections PGReaper Can Make¶
- Not all CSV files are perfect, but PGReaper is capable of making some corrections:
- Sanitizing column names
- Dropping rows that are too short or too long
- Unquoting numeric fields before copying
API¶
Reading JSON¶
Loading JSON¶
From Files¶
PGReaper can load arbitrarily large JSON files assuming they are structured as collections of JSON objects, e.g.
[
{
"Name": "Julia",
"Age": 29,
"Occupation": "Database Administrator"
}, {
"Name": "Mark",
"Age": 30,
"Occupation": "Barista",
"Phone": 999-999-9999
}
]
Because it uses a custom JSON reader (which cuts up potentially large files), it is also capable of reading other variants of JSON that Python’s json module can’t handle, such as newline delimited JSON.
Flattening¶
Currently, PGReaper can optionally flatten out JSON data by its outermost keys. In the example above, the resultant table would have two rows with the columns “name”, “age”, “occupation”, and “phone”. The majority of this work is done via Postgres’ JSON functions, which is much faster than anything done in pure Python.
API¶
Reading From Compressed (ZIP) Files¶
PGReaper is capable of copying specific files located in ZIP archives without decompressing them. (If you are interested in reading in GZIP, BZIP, or LZMA compressed files, use the compression parameter on the copy_csv() function.)
Reading ZIP Archives¶
PGReaper provides an intutive way to access files stored in ZIP archives. You can also pass the references to these files to PGReaper’s normal reader functions like copy_csv().
Step 1: Read the ZIP Archive¶
Step 2: Get the Specific File¶
Notes: File Opening Safety¶
Opening a file within a ZIP archive using the methods above creates a ZipReader object. These objects are like any other file-like objects in Python– supporting read() and readline() methods, but can only be used as context managers.
The Table Data Structure¶
PGReaper contains a two-dimensional data structure creatively named Table. These are lightweight structures which are built on Python’s list containers but contain a lot of features for easily mapping them into SQL tables.
Creating and Loading Table Objects¶
Motivation¶
The Table is similar in concept to the DataFrame (either in R or in pandas) but is optimized for fast appends, iteration, and copying into databases. Because they are structured as lists of lists, all of the Python methods for operating with lists apply to Tables. Furthermore, Table objects provide their own set of specialized methods, with an API inspired by R’s dplyr package.
The key difference between an R or pandas DataFrame and a pgreaper Table is that the Table is designed to be copied into a SQL database, rather than attempt to replace its functionality. As such, most of the methods are geared towards collecting, cleaning and restructuring data, rather than analyzing it.
Structure¶
Each list in a Table represents a row, while an item in each row represents a cell. If you had a table stored as–say–world_gdp_data, then world_gdp_data[0] would return a list representing the first row and world_gdp_data[0][1] would be the second column of the first row.
Type-Inference¶
Table objects keep track of the data types inserted into every column, and uses this information to determine the final column type. By default, PGReaper will attempt to treat unrecognized data types as text columns. Currently, Table is able to recognize text, integer, float, jsonb (list or dict), and datetime types as well as most numpy types.
Brief Example: Creating and Loading Table Objects¶
from pgreaper import Table, table_to_pg
planes = Table(
name='planes',
col_names=['weight', 'length', 'wingspan', 'cost', 'manufacturer']
)
planes.append(...)
table_to_pg(planes, dbname='vehicles')
Full Reference: Creating and Modifying Tables¶
Slicing¶
Exotic Methods¶
These are methods which be handy for some edge cases
Operations Which Create New Tables¶
Groupby¶
Adding Rows to a Table¶
Since a Table is really just a nested list, you can use the append() method. However, Table objects will refuse to add rows which are shorter or longer than the existing rows. This is motivated by the fact that a lot of input sources, especially HTML tables, are not very clean and do not guarantee consistent record lengths even if they should.
If you want to intentionally extend a table, you can use the add_col() method to create a new column, or add_dict() method which implicitly creates new columns.
Full Reference: Loading Tables to Postgres¶
Dumping to Files¶
It is also possible to dump the contents of Table objects into files instead of SQL databases.
Jupyter Notebooks¶
Table objects take advantage of Jupyter’s pretty HTML display.
pandas Integration¶
Loading pandas DataFrames¶
By supporting both INSERTs and UPSERTs for DataFrames, when combined with pandas’ read_sql() functionality, PGReaper turns any Postgres database into a robust store for your pandas-based projects. Compared with other methods for loading DataFrames such as to_sql() or a combination of to_csv() and copy_from(), PGReaper provides:
- Automatic type inference for basic Python types and most numpy types
- Properly encoding jsonb (dict or list) and timestamp (datetime) objects
- Automatic correction of problematic column names, e.g. those that are Postgres keywords
- Support for composite primary keys
- Support for both INSERT OR REPLACE and UPSERT operations
- Faster UPSERT performance using SELECT unnest() rather than slower batched INSERTs
Benchmark/Example: pandas DataFrame COPY and UPSERT¶
COPY¶
Here, we’re going to use the excellent mimesis package to generate 500,000 rows of fake data to populate a pandas DataFrame.
Note
For robustness, PGReaper uses both fake and real data in its test suite.
Structure¶
The resulting SQL table should have 4 text, 1 bigint, and 1 jsonb column.
id | Full Name | Age | Occupation | Contact | Nationality | |
---|---|---|---|---|---|---|
0 | 0 | Jimmy Vaughn | 25 | Leather Worker | {'email': 'inell_8213@yandex.com', 'phone': '(... | Uruguayan |
1 | 1 | Douglas Noel | 50 | Bricklayer | {'email': 'carter-3572@outlook.com', 'phone': ... | Uruguayan |
2 | 2 | Aaron Figueroa | 38 | Bank Messenger | {'email': 'leopoldo9374@live.com', 'phone': '0... | Russian |
3 | 3 | Lonnie Rose | 57 | Merchant Seaman | {'email': 'dudley-6367@live.com', 'phone': '1-... | Romanian |
4 | 4 | Mafalda Gross | 43 | Song Writer | {'email': 'magdalen_3735@outlook.com', 'phone'... | Swiss |
5 | 5 | Cammie Kirkland | 55 | Flour Miller | {'email': 'lorrine_2944@yahoo.com', 'phone': '... | Puerto Rican |
6 | 6 | Alfonzo Harper | 31 | Employee | {'email': 'len_3927@gmail.com', 'phone': '(068... | Australian |
7 | 7 | Vaughn Herman | 21 | Stage Hand | {'email': 'federico_3662@live.com', 'phone': '... | Salvadorian |
8 | 8 | My Hendricks | 58 | Mining Engineer | {'email': 'my_9446@gmail.com', 'phone': '068-8... | Finnish |
9 | 9 | Moses Moran | 58 | Sheriff Clerk | {'email': 'delmer-761@live.com', 'phone': '+1-... | British |
10 | 10 | Hyman Leach | 20 | Painter | {'email': 'reinaldo-2219@live.com', 'phone': '... | Australian |
11 | 11 | Elsy Ball | 50 | Records Supervisor | {'email': 'tu_847@live.com', 'phone': '(068) 8... | Bolivian |
12 | 12 | Crista Washington | 45 | Homeopath | {'email': 'lady9072@outlook.com', 'phone': '06... | Greek |
13 | 13 | Matthew Shaw | 63 | Machine Fitters | {'email': 'sol_3090@yahoo.com', 'phone': '1-06... | Jordanian |
14 | 14 | Versie Stephens | 56 | Underwriter | {'email': 'maragret-9589@live.com', 'phone': '... | Cambodian |
15 | 15 | Herb Gonzales | 63 | Sign Maker | {'email': 'elisha-258@outlook.com', 'phone': '... | Egyptian |
16 | 16 | Jerrod Peterson | 38 | Vehicle Engineer | {'email': 'len-8721@gmail.com', 'phone': '1-06... | Australian |
17 | 17 | Randal Wyatt | 50 | Purchase Clerk | {'email': 'mack-9125@yandex.com', 'phone': '06... | Puerto Rican |
18 | 18 | Sabine Powell | 42 | Buyer | {'email': 'lizzette-8591@live.com', 'phone': '... | Ukrainian |
19 | 19 | Ulrike Wyatt | 29 | Stonemason | {'email': 'pandora9731@outlook.com', 'phone': ... | Swiss |
20 | 20 | Emmie Hartman | 36 | Market Trader | {'email': 'eusebia9143@yandex.com', 'phone': '... | Egyptian |
21 | 21 | Liberty Willis | 38 | Technical Author | {'email': 'pearlene8013@yandex.com', 'phone': ... | Puerto Rican |
22 | 22 | Teddy Weaver | 35 | Materials Controller | {'email': 'frederic-2350@yandex.com', 'phone':... | Dominican |
23 | 23 | Maddie Malone | 61 | Stone Sawyer | {'email': 'miki2798@gmail.com', 'phone': '1-06... | Finnish |
24 | 24 | Olevia Mcdaniel | 22 | Playgroup Leader | {'email': 'retta-1501@yahoo.com', 'phone': '06... | Australian |
25 | 25 | Tinisha Christian | 64 | Assessor | {'email': 'amina_814@outlook.com', 'phone': '0... | Argentinian |
26 | 26 | Tova Sanchez | 51 | Assistant Nurse | {'email': 'pei-4002@live.com', 'phone': '068.8... | Mexican |
27 | 27 | Theo Williamson | 30 | Research Technician | {'email': 'lorenzo_6203@yandex.com', 'phone': ... | Japanese |
28 | 28 | Maurice Payne | 50 | Health Nurse | {'email': 'alease6289@gmail.com', 'phone': '06... | Taiwanese |
29 | 29 | Paulita Hughes | 38 | Investment Strategist | {'email': 'oneida9325@outlook.com', 'phone': '... | Ethiopian |
... | ... | ... | ... | ... | ... | ... |
499970 | 499970 | Wilford Rowe | 35 | Trade Union Official | {'email': 'burl9662@live.com', 'phone': '628-8... | Estonian |
499971 | 499971 | Arie Summers | 25 | Special Needs | {'email': 'carlee7311@gmail.com', 'phone': '62... | Irish |
499972 | 499972 | Foster Briggs | 20 | Taxi Controller | {'email': 'homer2132@gmail.com', 'phone': '+1-... | Cuban |
499973 | 499973 | Korey Pugh | 66 | Decorator | {'email': 'mathew1418@gmail.com', 'phone': '62... | Dominican |
499974 | 499974 | Berry Solis | 27 | Waiter | {'email': 'lory6921@gmail.com', 'phone': '1-62... | Puerto Rican |
499975 | 499975 | Tequila William | 38 | Sand Blaster | {'email': 'karly-9354@live.com', 'phone': '(62... | Chilean |
499976 | 499976 | Meridith Wright | 59 | Genealogist | {'email': 'cyndy-8184@outlook.com', 'phone': '... | Canadian |
499977 | 499977 | Jacqui Serrano | 51 | Tyre Builder | {'email': 'mercedez_8467@yahoo.com', 'phone': ... | Cameroonian |
499978 | 499978 | Clemente Powell | 22 | Music Teacher | {'email': 'carmen_5259@gmail.com', 'phone': '3... | Australian |
499979 | 499979 | Will Hale | 22 | Radio Presenter | {'email': 'jamison_5100@live.com', 'phone': '1... | Cuban |
499980 | 499980 | Lester Butler | 63 | Sportswoman | {'email': 'william7285@live.com', 'phone': '1-... | Colombian |
499981 | 499981 | Lianne Irwin | 47 | Applications Engineer | {'email': 'madison-7241@yahoo.com', 'phone': '... | Puerto Rican |
499982 | 499982 | Mistie Medina | 47 | Health Advisor | {'email': 'valene8224@outlook.com', 'phone': '... | Chinese |
499983 | 499983 | Keven Beck | 41 | Tyre Fitter | {'email': 'noel_2584@live.com', 'phone': '(981... | Argentinian |
499984 | 499984 | Tyler Beasley | 36 | School Inspector | {'email': 'antony2949@yandex.com', 'phone': '1... | Bolivian |
499985 | 499985 | Myung Sanford | 46 | Production Planner | {'email': 'naida_7091@yahoo.com', 'phone': '98... | Chilean |
499986 | 499986 | Esteban Lowe | 20 | Radiologist | {'email': 'williams_7486@yahoo.com', 'phone': ... | Mexican |
499987 | 499987 | Leia Cunningham | 49 | Orthoptist | {'email': 'marcell9206@yahoo.com', 'phone': '1... | Costa Rican |
499988 | 499988 | Elbert Rodriquez | 17 | Catering Staff | {'email': 'gregg-7474@gmail.com', 'phone': '96... | Ethiopian |
499989 | 499989 | Meri Mathews | 59 | Sales Executive | {'email': 'thomasena5180@live.com', 'phone': '... | Greek |
499990 | 499990 | Ron Velasquez | 45 | Security Officer | {'email': 'dan1402@yahoo.com', 'phone': '(963)... | Belgian |
499991 | 499991 | Adolfo Hickman | 35 | Professional Wrestler | {'email': 'blake-8646@live.com', 'phone': '(96... | Australian |
499992 | 499992 | Pearle Dotson | 20 | Seamstress | {'email': 'theo_1294@yahoo.com', 'phone': '1-9... | Salvadorian |
499993 | 499993 | Stefania Mays | 28 | Party Planner | {'email': 'evan_3309@yandex.com', 'phone': '28... | Afghan |
499994 | 499994 | Luis Bond | 29 | Area Manager | {'email': 'eduardo-2172@outlook.com', 'phone':... | French |
499995 | 499995 | Irina Gibbs | 64 | History Teacher | {'email': 'delmy_8959@outlook.com', 'phone': '... | Italian |
499996 | 499996 | Emery Anderson | 59 | Chambermaid | {'email': 'percy6103@live.com', 'phone': '288.... | Spanish |
499997 | 499997 | Camie Frazier | 38 | Technical Liaison | {'email': 'emelda-127@outlook.com', 'phone': '... | British |
499998 | 499998 | Jospeh Reid | 26 | Historian | {'email': 'ferdinand_5862@yandex.com', 'phone'... | Guatemalan |
499999 | 499999 | Argelia Payne | 35 | Station Manager | {'email': 'domonique5565@gmail.com', 'phone': ... | Australian |
500000 rows × 6 columns
UPSERT¶
Suppose now that we live in such an amazing economy that everybody past 50 has enough money to retire. This means we’ll need to update our data to reflect this. As you can see for yourself, this operation will affect about 160,000 rows.
id | full_name | age | occupation | contact | nationality | |
---|---|---|---|---|---|---|
0 | 1 | Douglas Noel | 50 | Retired | {'email': 'carter-3572@outlook.com', 'phone': ... | Uruguayan |
1 | 3 | Lonnie Rose | 57 | Retired | {'email': 'dudley-6367@live.com', 'phone': '1-... | Romanian |
2 | 5 | Cammie Kirkland | 55 | Retired | {'email': 'lorrine_2944@yahoo.com', 'phone': '... | Puerto Rican |
3 | 8 | My Hendricks | 58 | Retired | {'email': 'my_9446@gmail.com', 'phone': '068-8... | Finnish |
4 | 9 | Moses Moran | 58 | Retired | {'email': 'delmer-761@live.com', 'phone': '+1-... | British |
5 | 11 | Elsy Ball | 50 | Retired | {'email': 'tu_847@live.com', 'phone': '(068) 8... | Bolivian |
6 | 13 | Matthew Shaw | 63 | Retired | {'email': 'sol_3090@yahoo.com', 'phone': '1-06... | Jordanian |
7 | 14 | Versie Stephens | 56 | Retired | {'email': 'maragret-9589@live.com', 'phone': '... | Cambodian |
8 | 15 | Herb Gonzales | 63 | Retired | {'email': 'elisha-258@outlook.com', 'phone': '... | Egyptian |
9 | 17 | Randal Wyatt | 50 | Retired | {'email': 'mack-9125@yandex.com', 'phone': '06... | Puerto Rican |
10 | 23 | Maddie Malone | 61 | Retired | {'email': 'miki2798@gmail.com', 'phone': '1-06... | Finnish |
11 | 25 | Tinisha Christian | 64 | Retired | {'email': 'amina_814@outlook.com', 'phone': '0... | Argentinian |
12 | 26 | Tova Sanchez | 51 | Retired | {'email': 'pei-4002@live.com', 'phone': '068.8... | Mexican |
13 | 28 | Maurice Payne | 50 | Retired | {'email': 'alease6289@gmail.com', 'phone': '06... | Taiwanese |
14 | 32 | Delmer Saunders | 52 | Retired | {'email': 'lorenzo-4738@yandex.com', 'phone': ... | German |
15 | 33 | Kit Holcomb | 50 | Retired | {'email': 'ladawn8977@outlook.com', 'phone': '... | Latvian |
16 | 36 | Dylan Burgess | 56 | Retired | {'email': 'cliff_228@live.com', 'phone': '068.... | Argentinian |
17 | 38 | Francisco Wiley | 55 | Retired | {'email': 'georgine_4373@yandex.com', 'phone':... | Cambodian |
18 | 39 | Stuart Hendricks | 66 | Retired | {'email': 'andre_9255@live.com', 'phone': '068... | Romanian |
19 | 41 | Gerry Holt | 62 | Retired | {'email': 'kyle6356@outlook.com', 'phone': '06... | Chilean |
20 | 46 | Rosio Henson | 58 | Retired | {'email': 'marielle9323@yahoo.com', 'phone': '... | Afghan |
21 | 49 | Archie Vega | 62 | Retired | {'email': 'timothy-7344@outlook.com', 'phone':... | Brazilian |
22 | 51 | Dudley Richmond | 55 | Retired | {'email': 'bob_7237@yandex.com', 'phone': '(06... | Cuban |
23 | 52 | Harley Matthews | 58 | Retired | {'email': 'albert_9262@live.com', 'phone': '06... | Uruguayan |
24 | 53 | Blair Blake | 61 | Retired | {'email': 'kenton-562@outlook.com', 'phone': '... | Dominican |
25 | 56 | Kelly Logan | 66 | Retired | {'email': 'hubert-9681@live.com', 'phone': '+1... | Polish |
26 | 57 | Raymon Flowers | 62 | Retired | {'email': 'jonathon-8669@outlook.com', 'phone'... | Swiss |
27 | 58 | Vertie Cochran | 64 | Retired | {'email': 'vincenza-649@outlook.com', 'phone':... | Portuguese |
28 | 59 | Stacy Reed | 62 | Retired | {'email': 'keith6772@gmail.com', 'phone': '(06... | Chilean |
29 | 64 | Delbert Emerson | 50 | Retired | {'email': 'kraig-4725@outlook.com', 'phone': '... | French |
... | ... | ... | ... | ... | ... | ... |
166595 | 499896 | Argelia Robles | 53 | Retired | {'email': 'hildred_2878@yahoo.com', 'phone': '... | Egyptian |
166596 | 499904 | Janean Marshall | 61 | Retired | {'email': 'shawnee994@yahoo.com', 'phone': '93... | Danish |
166597 | 499906 | Sunday Morgan | 61 | Retired | {'email': 'wesley_648@live.com', 'phone': '533... | Cuban |
166598 | 499909 | Marty Cross | 61 | Retired | {'email': 'brooks_9153@yahoo.com', 'phone': '5... | Israeli |
166599 | 499912 | Kiana Abbott | 57 | Retired | {'email': 'charita-9682@live.com', 'phone': '7... | Bolivian |
166600 | 499913 | Thomasena Fowler | 52 | Retired | {'email': 'mitzie_7093@outlook.com', 'phone': ... | Chilean |
166601 | 499914 | Hobert Alford | 61 | Retired | {'email': 'chuck2797@live.com', 'phone': '1-70... | Swiss |
166602 | 499919 | Brice Arnold | 63 | Retired | {'email': 'malik_529@yandex.com', 'phone': '48... | Russian |
166603 | 499921 | Marcus Pearson | 56 | Retired | {'email': 'rickie-7134@outlook.com', 'phone': ... | Dutch |
166604 | 499923 | Gricelda Dillon | 65 | Retired | {'email': 'galina_1993@gmail.com', 'phone': '5... | Venezuelan |
166605 | 499925 | Idell Hopper | 51 | Retired | {'email': 'berenice9972@gmail.com', 'phone': '... | English |
166606 | 499930 | Francesco Anthony | 58 | Retired | {'email': 'rex_2601@outlook.com', 'phone': '61... | Chilean |
166607 | 499933 | Elroy Morton | 58 | Retired | {'email': 'brad4797@live.com', 'phone': '1-830... | Jordanian |
166608 | 499935 | Rosette Giles | 66 | Retired | {'email': 'lahoma-857@outlook.com', 'phone': '... | Canadian |
166609 | 499938 | Art Charles | 52 | Retired | {'email': 'dustin1542@live.com', 'phone': '074... | Brazilian |
166610 | 499941 | Dong Reyes | 65 | Retired | {'email': 'krystin-4097@yahoo.com', 'phone': '... | Swiss |
166611 | 499942 | Gino Dalton | 61 | Retired | {'email': 'henry_5319@live.com', 'phone': '074... | Guatemalan |
166612 | 499949 | Marshall White | 58 | Retired | {'email': 'aliza-7392@yandex.com', 'phone': '1... | Swedish |
166613 | 499954 | Rudy Gill | 50 | Retired | {'email': 'rudolph8986@yahoo.com', 'phone': '(... | Brazilian |
166614 | 499955 | Rhona Hubbard | 59 | Retired | {'email': 'chau_5207@yandex.com', 'phone': '49... | Puerto Rican |
166615 | 499965 | Kenny Best | 59 | Retired | {'email': 'chauncey-5684@gmail.com', 'phone': ... | Polish |
166616 | 499968 | Madalene Yates | 58 | Retired | {'email': 'kenisha604@gmail.com', 'phone': '06... | British |
166617 | 499969 | Beata Pugh | 52 | Retired | {'email': 'kit-2831@outlook.com', 'phone': '06... | German |
166618 | 499973 | Korey Pugh | 66 | Retired | {'email': 'mathew1418@gmail.com', 'phone': '62... | Dominican |
166619 | 499976 | Meridith Wright | 59 | Retired | {'email': 'cyndy-8184@outlook.com', 'phone': '... | Canadian |
166620 | 499977 | Jacqui Serrano | 51 | Retired | {'email': 'mercedez_8467@yahoo.com', 'phone': ... | Cameroonian |
166621 | 499980 | Lester Butler | 63 | Retired | {'email': 'william7285@live.com', 'phone': '1-... | Colombian |
166622 | 499989 | Meri Mathews | 59 | Retired | {'email': 'thomasena5180@live.com', 'phone': '... | Greek |
166623 | 499995 | Irina Gibbs | 64 | Retired | {'email': 'delmy_8959@outlook.com', 'phone': '... | Italian |
166624 | 499996 | Emery Anderson | 59 | Retired | {'email': 'percy6103@live.com', 'phone': '288.... | Spanish |
166625 rows × 6 columns
Results¶
1 loop, best of 1: 7.8 s per loop
Checking Our Work¶
50 rows affected.
id | full_name | age | occupation | contact | nationality |
---|---|---|---|---|---|
915 | Dorsey Shaffer | 53 | Retired | {'email': 'johnny-4695@gmail.com', 'phone': '163-554-3792'} | Swedish |
1012 | Russel Mccall | 66 | Retired | {'email': 'kenny139@yandex.com', 'phone': '1-910-688-4436'} | Finnish |
1219 | Guillermo Cooley | 53 | Retired | {'email': 'peter_7368@gmail.com', 'phone': '(429) 133-8709'} | Cambodian |
1486 | Scott Parker | 63 | Retired | {'email': 'valda-9966@gmail.com', 'phone': '1-139-964-0250'} | Latvian |
1695 | Merlene King | 61 | Retired | {'email': 'timika-8127@gmail.com', 'phone': '901.106.6537'} | Estonian |
4239 | Corrinne Frost | 59 | Retired | {'email': 'natisha5688@yahoo.com', 'phone': '1-804-764-7697'} | Saudi |
4717 | Donette Savage | 55 | Retired | {'email': 'page_1478@outlook.com', 'phone': '189-685-2682'} | Irish |
4769 | Omer Powell | 64 | Retired | {'email': 'tyler_4230@yandex.com', 'phone': '189.685.2682'} | Uruguayan |
5194 | Regan Joseph | 50 | Retired | {'email': 'azucena_6202@yahoo.com', 'phone': '1-130-663-8499'} | Salvadorian |
5838 | Anton Cannon | 60 | Retired | {'email': 'warren6487@yandex.com', 'phone': '429-086-9244'} | Irish |
6056 | Gregory Wiley | 50 | Retired | {'email': 'simon1733@live.com', 'phone': '1-314-624-3685'} | Danish |
6520 | Mason Dodson | 54 | Retired | {'email': 'wiley-8663@live.com', 'phone': '1-852-906-6575'} | Israeli |
8743 | Emmie Hamilton | 61 | Retired | {'email': 'ali7464@gmail.com', 'phone': '(397) 625-4962'} | Romanian |
8956 | Leonard Sosa | 62 | Retired | {'email': 'al-4742@outlook.com', 'phone': '(179) 891-5062'} | Chinese |
13782 | Brent Norris | 54 | Retired | {'email': 'garrett390@yahoo.com', 'phone': '232.837.7400'} | Afghan |
14126 | Son Wilson | 65 | Retired | {'email': 'raye8017@live.com', 'phone': '652-370-3678'} | Dominican |
14973 | Arcelia Haley | 56 | Retired | {'email': 'marni-2282@yahoo.com', 'phone': '1-946-957-7639'} | Saudi |
15369 | Euna Hahn | 50 | Retired | {'email': 'melita_5569@outlook.com', 'phone': '483.145.4282'} | English |
15757 | Lory Warner | 66 | Retired | {'email': 'rikki-510@yahoo.com', 'phone': '(026) 778-3770'} | Greek |
16060 | Jessia Blevins | 54 | Retired | {'email': 'phung-7621@yahoo.com', 'phone': '439.640.8052'} | Russian |
17177 | Von Mullen | 63 | Retired | {'email': 'ty-7154@yahoo.com', 'phone': '910-148-9413'} | Ukrainian |
17336 | Serita Gregory | 62 | Retired | {'email': 'ilda7522@yahoo.com', 'phone': '039-630-0154'} | Egyptian |
19998 | Arnold Mcmillan | 60 | Retired | {'email': 'dong6181@live.com', 'phone': '149.377.4981'} | Danish |
20482 | Jeneva Crosby | 53 | Retired | {'email': 'lizeth-3856@live.com', 'phone': '545.029.6990'} | Dutch |
22282 | Klara Sutton | 54 | Retired | {'email': 'nicholle9797@gmail.com', 'phone': '884-671-8260'} | Dutch |
22334 | Noe Fox | 51 | Retired | {'email': 'angelo-5064@gmail.com', 'phone': '271.013.9775'} | Irish |
22431 | Maye Newman | 52 | Retired | {'email': 'jeremy-5043@yahoo.com', 'phone': '1-579-666-7433'} | Venezuelan |
22694 | Chante Adams | 51 | Retired | {'email': 'kizzie_7226@live.com', 'phone': '(887) 174-3442'} | French |
22849 | Ron Nieves | 51 | Retired | {'email': 'teodoro7237@yahoo.com', 'phone': '823-299-5268'} | Argentinian |
23326 | Louis Watkins | 57 | Retired | {'email': 'marlin_6498@yahoo.com', 'phone': '1-582-372-1588'} | Chilean |
23752 | Lowell Burton | 57 | Retired | {'email': 'john4744@yahoo.com', 'phone': '969.176.7855'} | Danish |
24021 | Dotty English | 56 | Retired | {'email': 'waneta714@yahoo.com', 'phone': '519.705.0652'} | Saudi |
25127 | Rocky Scott | 60 | Retired | {'email': 'wendell5688@outlook.com', 'phone': '681.405.1298'} | Swedish |
25817 | Juan Chapman | 60 | Retired | {'email': 'basil3123@live.com', 'phone': '1-601-774-1365'} | Japanese |
26060 | Earleen Monroe | 51 | Retired | {'email': 'raelene_2099@gmail.com', 'phone': '390.346.9362'} | Irish |
26131 | Kurtis Bates | 55 | Retired | {'email': 'leigh-5232@gmail.com', 'phone': '(847) 681-7496'} | Dutch |
26411 | Stephane Witt | 53 | Retired | {'email': 'beata4873@outlook.com', 'phone': '(713) 298-5607'} | Irish |
26769 | Ja Wilkinson | 60 | Retired | {'email': 'sammy6558@yahoo.com', 'phone': '467.269.2270'} | Finnish |
27041 | Cordia Eaton | 60 | Retired | {'email': 'clarine2380@outlook.com', 'phone': '1-550-095-1353'} | Ukrainian |
27678 | Ricky Jacobs | 63 | Retired | {'email': 'robin-5899@live.com', 'phone': '560-128-1809'} | Mexican |
27830 | Phung Melendez | 59 | Retired | {'email': 'ka-105@live.com', 'phone': '948-647-5247'} | Salvadorian |
28624 | Rebecka Witt | 64 | Retired | {'email': 'hassie_5350@yahoo.com', 'phone': '539.580.3269'} | Finnish |
29365 | Tod Williamson | 55 | Retired | {'email': 'dong4168@live.com', 'phone': '1-497-083-3206'} | German |
33514 | Jess Rowland | 64 | Retired | {'email': 'nigel-433@outlook.com', 'phone': '588.713.9422'} | Polish |
34040 | Leigh Schneider | 61 | Retired | {'email': 'carey2562@gmail.com', 'phone': '1-888-226-2833'} | Latvian |
34190 | So Morales | 66 | Retired | {'email': 'lahoma-6033@yahoo.com', 'phone': '(636) 975-3145'} | Italian |
35249 | Alvaro Franks | 63 | Retired | {'email': 'cecil-5400@outlook.com', 'phone': '567.210.8698'} | German |
36898 | Minna Glass | 66 | Retired | {'email': 'caterina-9301@gmail.com', 'phone': '+1-(992)-032-8425'} | Estonian |
37215 | Jacelyn Knight | 59 | Retired | {'email': 'berneice2272@yahoo.com', 'phone': '(886) 773-4378'} | English |
37302 | Stanton Leblanc | 52 | Retired | {'email': 'elwood-4643@live.com', 'phone': '1-886-773-4378'} | Irish |
Where’s the Bottleneck¶
Apparently it only takes Python about 2.5 seconds to create the 160,000 row UPSERT statement (which includes properly encoding dicts, escaping quotes, and so on). Since psycopg2 (which PGReaper sends the UPSERT statement to) is basically a C library with Python bindings, and we’re only sending one statement, the 5 remaining seconds is most likely taken up primarily by Postgres itself.
1 loop, best of 3: 2.42 s per loop
HTML Parsing¶
pgreaper contains a rich HTML parsing module featuring automated <table> parsing and Jupyter notebook integration. Because it is a large module on its own, it has its own documentation page.
HTML Table Parsing¶
Introduction¶
- A lot of useful data is stored in HTML tables, but parsing HTML is an arduous task. Using Python’s standard library html.parser module, SQLify tries to simplify the task of parsing HTML tables by:
- Creating a list of separate HTML tables
- Trying to automatically find column headers
- Handling different table designs, i.e. handling
- <tbody> and <thead> tags
- rowspan and colspan attributes
Note
Using SQLify’s HTML parser in conjunction with Jupyter notebooks is recommended.
Step 1: Reading in HTML¶
There are two avenues for reading in HTML.
a) Locally Saved HTML Files¶
b) From the Web¶
Step 2: Reviewing the Output¶
The functions above return TableBrowser objects, which are basically lists of HTML tables that were found. If viewing in Jupyter Notebook, the code above will display every table with an index next to the name of the table, e.g. [5] Players of the week.
Step 3: Cleaning the Tables¶
If the tables you wanted were parsed 100% correctly and don’t require any further processing steps, proceed to step 4. Otherwise, read on.
As seen above, when using the indexing operator on a TableBrowser object, you will get a Table object back. Table objects support a small set of data cleaning methods and contain attributes you may want to use or modify.
Step 4: Saving the Results¶
- After you’ve cleaned the Table to your satisfaction, you can save the results as either a:
- CSV file
- JSON file
- PostgreSQL Table
PostgreSQL¶
When saving to a new PostgreSQL database, you can either manually create it, or tell SQLify your preferred default database which should be used to create new databases.
Internals¶
Information for maintainers and forkers of pgreaper.
PGReaper Internals: Schema Inference¶
This page documents the interals of PGReaper. Unless you are developing, maintaining, forking, or just really curious in PGReaper, this is not really going to be of interest to you.
ColumnList¶
Originally, the column names and types for a Table were stored as the col_names and col_types attributes respectively. Simply being lists of strings, this approach–while simple–had many limitations. As PGReaper expanded its capabilities, more and more were being demanded of these lists, such as:
- Making sure col_names and col_types were of the same length
- Returning SQL safe column names while somehow being to keep track of the original column names
- Validating primary keys, i.e. making sure they referred to columns that actually existed
- Comparing column lists to other column lists to see if one was a subset of another or not
- This comes up when performing UPSERTs against existing SQL tables
- Comparing column lists to other column lists to see if they were really the same columns in different orders (a common problem with JSON parsing)
- Having a method to return the integer index corresponding to a column name
- Taking a list of column names and mapping to to their respective integer indices (again, comes up in JSON parsing and also used to implement the add_dict() method
Adding all this code to the Table structure made it messy, confusing, and harder to test. Therefore, a standalone class that managed column information was created.
SQLType¶
The SQLType object is used to map Python types to SQL types.