tidyextractors¶
Overview: Data Extraction Made Simple¶
tidyextractors makes extracting data from supported sources as painless as possible, delivering a populated Pandas DataFrame in three lines of code. tidyextractors was inspired by Hadley Wickham’s (2014) paper, which introduces “tidy data” as a conceptual framework for data preparation.
Features¶
- Extracts data with minimal effort.
- Creates readable code that requires minimal explanation.
- Exports Pandas DataFrames to maximize compatibility with the Python data science ecosystem.
Data Sources Implemented¶
tidyextractors currently has submodules for extracting data from the following sources:
- Local Git repositories
- Twitter user data (including tweets) using the Twitter API
- Emails stored in the mbox file format
What is Tidy Data?¶
Hadley Wickham (2014) introduced “tidy data” to describe data that has been cleaned and reshaped so that it is ready for analysis. The concept of tidy data inspired tidyextractors, which provides a convenient interface for extracting data in a tidy format. But what is a tidy format?
Our Definition of Tidy Data¶
We consider data to be “tidy” if it satisfies the following constraints:
- Data values are atomic. No cell contains a collection of items (e.g. a list, set, or dictionary).
- Each row is a single observation. That is, each row represents a single “entity” (e.g. a commit, a change to a file, or a tweet) which can be uniquely identified by a primary key (e.g. MessageID for an email, or MessageID and recipient for an email “send”).
- Each column is a single variable.
This definition intentionally allows a certain degree of data redundancy, which would be eliminated by traditional database normal forms such as BCNF.
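For illustration, here is a minimal sketch of the difference between an untidy table with collection-valued cells and its tidy equivalent. The column names are invented for this example and are not the output of any tidyextractors extractor; explode requires pandas 0.25 or later.
import pandas as pd
# Untidy: the 'recipients' cells contain lists, so values are not atomic.
untidy = pd.DataFrame({
    'message_id': ['m1', 'm2'],
    'recipients': [['alice', 'bob'], ['carol']]
})
# Tidy: one row per (message, recipient) pair, uniquely identified by
# the (message_id, recipient) combination.
tidy = untidy.explode('recipients').rename(columns={'recipients': 'recipient'})
print(tidy)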
Choosing Output Formats¶
We have a few guiding principles for deciding which output formats to implement for a given extractor:
- If an entity would have its own table in a normalized database, it should be available as an output format.
- Datasets should include all variables that carry meaningful information about the table’s defining unit of observation, even if this data is redundant between rows.
- In general, more data is preferred to less data, so long as it is meaningful. It is easier to drop data than to integrate data.
Not Your Grandfather’s Tidy Data¶
If you are familiar with Hadley’s paper, you may notice that our definition is different from his. We did this because we find the original definition to be self-contradictory. Hadley claims that tidy data is “Codd’s 3rd normal form, but with the constraints framed in statistical language, and the focus put on a single dataset rather than the many connected datasets common in relational databases.” However, Codd’s 3rd normal form requires multiple tables.
We agree with Hadley’s claim that single table datasets are optimal for data analysis. We also agree with his practice of preferring single datasets at the cost of some data redundancy. We feel that our definition of tidy data is in the spirit of Hadley’s original paper.
Installation¶
With pip¶
Run pip install tidyextractors in your terminal. This will install the most recently released stable version of tidyextractors.
The Hard Way¶
Clone the project’s GitHub repository and run pip install . in the cloned directory.
Note that this installs the current state of the project as it appears on GitHub, which may not correspond to an official release and is not guaranteed to be stable.
Git Repository Data Extraction¶
The tidyextractors.tidygit submodule lets you extract Git log data from a local Git repository. This page will guide you through the process.
A Minimal Code Example¶
from tidyextractors.tidygit import GitExtractor
# Extract data from a local Git repo
gx = GitExtractor('./your/repo/dir/')
# Commit data in a Pandas DataFrame.
commits_df = gx.commits(drop_collections=True)
# Commit/file keyed change data in a Pandas DataFrame
changes_df = gx.changes()
Step 1: Prepare Your Git Repo¶
All you need to get started is the path to a local Git repository. If you want to extract data from a repository hosted on GitHub, download or clone the repository to your computer.
Step 2: Extract Data¶
You can extract data from any local Git repository using the GitExtractor:
from tidyextractors.tidygit import GitExtractor
gx = GitExtractor('./your/repo/dir/')
Extraction may take a while, but once it finishes all of the data is stored inside the extractor object. You just need a bit more code to get it into your preferred format.
Step 3: Get Pandas Data¶
Now you can call a GitExtractor method to return data in a Pandas DataFrame.
# Commit data in a Pandas DataFrame.
commits_df = gx.commits(drop_collections=True)
# Commit/file keyed change data in a Pandas DataFrame
changes_df = gx.changes()
Note
GitExtractor.commits() drops columns whose cells contain collections (i.e. list, set, and dict values), because “tidy data” requires that every cell holds an atomic value. If you don’t want this data dropped, set the optional drop_collections argument to False.
Mbox Data Extraction¶
Mbox is a file format used to store mailbox data on Unix operating systems. The tidyextractors.tidymbox submodule lets you extract email data from Mbox files with minimal effort. This page will guide you through the process.
A Minimal Code Example¶
from tidyextractors.tidymbox import MboxExtractor
# Extracts all mbox files in this directory.
mx = MboxExtractor('./your/mbox/dir/')
# Email messages in a Pandas DataFrame.
email_df = mx.emails(drop_collections=True)
# MessageID/receiver keyed Pandas DataFrame.
sends_df = mx.sends()
Step 1: Prepare Your Mbox Files¶
You can extract data from a single Mbox file, or multiple Mbox files. However, all these files must be in a single directory:
ls -1 ./your/mbox/dir/
file1.mbox
file2.mbox
file3.mbox
Step 2: Extract Data¶
Once you have consolidated your Mbox files, you can extract data from them using the MboxExtractor
:
from tidyextractors.tidymbox import MboxExtractor
# All mbox files in the directory
mx = MboxExtractor('./your/mbox/dir/')
# Only one mbox file
mx = MboxExtractor('./your/mbox/dir/file1.mbox')
Extraction may take a while, but once it finishes all of the data is stored inside the extractor object. You just need a bit more code to get it into your preferred format.
Step 3: Get Pandas Data¶
Now you can call an MboxExtractor method to return data in a Pandas DataFrame.
# Email messages in a Pandas DataFrame.
email_df = mx.emails(drop_collections=True)
# MessageID/receiver keyed Pandas DataFrame.
sends_df = mx.sends()
Note
MboxExtractor.emails() drops columns whose cells contain collections (i.e. list, set, and dict values), because “tidy data” requires that every cell holds an atomic value. If you don’t want this data dropped, set the optional drop_collections argument to False.
Note
This submodule’s internals were adapted from Phil Deutsch’s mbox-to-pandas script with his permission.
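As a sketch of downstream use, you could attach message-level columns to each sender/recipient row. This assumes both DataFrames share a MessageID column, as the key descriptions above suggest; inspect the actual column names before relying on this, and note that any other overlapping columns receive pandas’ default _x/_y suffixes.
import pandas as pd
# One row per sender/recipient pair, with message-level columns merged in.
merged_df = pd.merge(sends_df, email_df, on='MessageID', how='left')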
Twitter Data Extraction¶
The tidyextractors.tidytwitter submodule lets you extract user data from Twitter with minimal effort. This page will guide you through the process.
A Minimal Code Example¶
from tidyextractors.tidytwitter import TwitterExtractor
# Your Twitter API credentials. See below for how to get them!
credentials = {
'access_token': '',
'access_secret': '',
'consumer_key': '',
'consumer_secret': ''
}
# A list of users for data extraction.
users = ['user1','user2','user3']
# Extract Twitter data.
tx = TwitterExtractor(users, extract_tweets=True, **credentials)
# Twitter user profile data in a Pandas DataFrame
user_df = tx.users(drop_collections=True)
# User/tweet keyed Pandas DataFrame
tweet_df = tx.tweets()
Step 1: Get API Credentials¶
To extract data using the Twitter API, you will first need to obtain API credentials. Your API credentials consist of four pieces of information:
- access_token
- access_secret
- consumer_key
- consumer_secret
To get these credentials, check out the Twitter developer documentation: https://dev.twitter.com/oauth/overview/application-owner-access-tokens
Step 2: Extract Data¶
Once you have your API credentials, you can extract user data with the TwitterExtractor:
Warning
The Twitter API enforces rate limits, so be careful when downloading large amounts of data.
For a raw report on your remaining limit, call tx._api.rate_limit_status() after extraction.
Note
As per the limit imposed by the Twitter API, only the 3,200 most recent tweets will be downloaded for each user.
from tidyextractors.tidytwitter import TwitterExtractor
credentials = {
# Randomly generated example credentials for demonstration only
'access_token': '985689236-R0EjHQJZLya6gb82R5g8Odb4UMwkhQy4Q2AxzBnB',
'access_secret': 'CVuVV0LSf74PQt2HH6zt08aeumGdMvlZtKF7BbHvRmX4r',
'consumer_key': 'F47AzSRag0KvVFG4eJYexuDqB',
'consumer_secret': 'lovnyqIA1oKs0jI4A27VXLLSUWrKc0hnNzyTu39NWIjSiq1xxj'
}
# User names may have leading "@" but this is not required.
users = ['user1','user2','user3']
# Users' tweets are extracted by default, but this may be disabled.
tx = TwitterExtractor(users, extract_tweets=True, **credentials)
Extraction may take a while, but once it finishes all of the data is stored inside the extractor object. You just need a bit more code to get it into your preferred format.
Step 3: Get Pandas Data¶
Now you can call a TwitterExtractor method to return data in a Pandas DataFrame.
# Twitter user profile data in a Pandas DataFrame
user_df = tx.users(drop_collections=True)
# User/tweet keyed Pandas DataFrame
tweet_df = tx.tweets()
Note
TwitterExtractor.users() drops columns whose cells contain collections (i.e. list, set, and dict values), because “tidy data” requires that every cell holds an atomic value. If you don’t want this data dropped, set the optional drop_collections argument to False.
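As a quick downstream sketch, you could count extracted tweets per user with pandas. The column name screen_name is an assumption here; inspect tweet_df.columns for the names actually produced.
# Number of tweets extracted per user (at most 3,200 each, per the API limit).
tweet_counts = tweet_df.groupby('screen_name').size()
print(tweet_counts)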
API Documentation¶
An in-depth look at the tidyextractors interface.
BaseExtractor¶
The BaseExtractor is the foundation for all data extractors in this module.
Warning
The BaseExtractor should never be instantiated on its own. This page is intended for people who want to contribute to the tidyextractors project, or who simply want to understand some of what is going on under the hood.
For developers: note that the internal interface is not documented on this site. See the source code for full documentation of the internal interface.
class tidyextractors.BaseExtractor(source, auto_extract=True, *args, **kwargs)¶
BaseExtractor defines a basic interface, initialization routine, and data manipulation tools for extractor subclasses.
expand_on(col1, col2, rename1=None, rename2=None, drop=[], drop_collections=False)¶
Returns a reshaped version of the extractor’s data, where unique combinations of values from col1 and col2 are given individual rows.
Example function call from tidymbox:
self.expand_on('From', 'To', ['MessageID', 'Recipient'], rename1='From', rename2='Recipient')
Columns to be expanded upon should contain either atomic values or dictionaries of dictionaries. For example:
Input Data:
col1 (Atomic)   col2 (Dict of Dict)
value1          {valueA: {attr1: X1, attr2: Y1}, valueB: {attr1: X2, attr2: Y2}}
value2          {valueC: {attr1: X3, attr2: Y3}, valueD: {attr1: X4, attr2: Y4}}
Output Data:
col1_extended   col2_extended   attr1   attr2
value1          valueA          X1      Y1
value1          valueB          X2      Y2
value2          valueC          X3      Y3
value2          valueD          X4      Y4
Parameters:
- col1 (str) – The first column to expand on. May contain atomic values or dicts of dicts.
- col2 (str) – The second column to expand on. May contain atomic values or dicts of dicts.
- rename1 (str) – The name for col1 after expansion. Defaults to col1_extended.
- rename2 (str) – The name for col2 after expansion. Defaults to col2_extended.
- drop (list) – Column names to be dropped from the output.
- drop_collections (bool) – Should columns with compound values be dropped?
Returns: pandas.DataFrame
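The following is an illustrative sketch of the reshaping described above, written in plain pandas on a toy DataFrame. It is not the library’s implementation; it only demonstrates how a dict-of-dict column expands into rows.
import pandas as pd
df = pd.DataFrame({
    'col1': ['value1', 'value2'],
    'col2': [
        {'valueA': {'attr1': 'X1', 'attr2': 'Y1'}, 'valueB': {'attr1': 'X2', 'attr2': 'Y2'}},
        {'valueC': {'attr1': 'X3', 'attr2': 'Y3'}, 'valueD': {'attr1': 'X4', 'attr2': 'Y4'}},
    ],
})
# Give each (col1, inner key) combination its own row, carrying the inner attributes.
rows = []
for _, row in df.iterrows():
    for key, attrs in row['col2'].items():
        rows.append({'col1_extended': row['col1'], 'col2_extended': key, **attrs})
expanded = pd.DataFrame(rows)
print(expanded)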
raw(drop_collections=False)¶
Produces the extractor object’s data as it is stored internally.
Parameters: drop_collections (bool) – Defaults to False. Indicates whether columns containing lists/dicts/sets will be dropped.
Returns: pandas.DataFrame
GitExtractor API Documentation¶
class tidyextractors.tidygit.GitExtractor(source, auto_extract=True, *args, **kwargs)¶
The GitExtractor class is for extracting data from local git repositories. This class has methods for outputting data into the changes and commits tidy formats, and a raw untidy format.
Parameters:
- source (str) – The path to a local git repository.
- auto_extract (bool) – Defaults to True. If True, data is extracted automatically. Otherwise, extraction must be initiated through the internal interface.
changes()¶
Returns a table of git log data, with “changes” as rows/observations.
Note
drop_collections is not available for this method, since there are no meaningful collections to keep.
Returns: pandas.DataFrame
commits(drop_collections=True)¶
Returns a table of git log data, with “commits” as rows/observations.
Parameters: drop_collections (bool) – Defaults to True. Indicates whether columns containing lists/dicts/sets will be dropped.
Returns: pandas.DataFrame
raw(drop_collections=False)¶
Produces the extractor object’s data as it is stored internally.
Parameters: drop_collections (bool) – Defaults to False. Indicates whether columns containing lists/dicts/sets will be dropped.
Returns: pandas.DataFrame
MboxExtractor API Documentation¶
class tidyextractors.tidymbox.MboxExtractor(source, auto_extract=True, *args, **kwargs)¶
The MboxExtractor class is for extracting data from local Mbox files. This class has methods for outputting data into the emails and sends tidy formats, and a raw untidy format.
Parameters:
- source (str) – The path to either a single mbox file or a directory containing multiple mbox files.
- auto_extract (bool) – Defaults to True. If True, data is extracted automatically. Otherwise, extraction must be initiated through the internal interface.
emails(drop_collections=True)¶
Returns a table of mbox message data, with “messages” as rows/observations.
Parameters: drop_collections (bool) – Defaults to True. Indicates whether columns containing lists/dicts/sets will be dropped.
Returns: pandas.DataFrame
raw(drop_collections=False)¶
Produces the extractor object’s data as it is stored internally.
Parameters: drop_collections (bool) – Defaults to False. Indicates whether columns containing lists/dicts/sets will be dropped.
Returns: pandas.DataFrame
sends()¶
Returns a table of mbox message data, with “sender/recipient” pairs as rows/observations.
Note
Rows may have a recipient from either “TO” or “CC”. The SendType column specifies this for each row.
Note
drop_collections is not available for this method, since there are no meaningful collections to keep.
Returns: pandas.DataFrame
Note
This submodule’s internals were adapted from Phil Deutsch’s mbox-to-pandas script with his permission.
TwitterExtractor API Documentation¶
class tidyextractors.tidytwitter.TwitterExtractor(source, auto_extract=True, *args, **kwargs)¶
The TwitterExtractor class is for extracting user data from Twitter. This class has methods for outputting data into the users and tweets tidy formats, and a raw untidy format.
Parameters:
- source (list) – A list of user screen name strings.
- auto_extract (bool) – Defaults to True. If True, data is extracted automatically. Otherwise, extraction must be initiated through the internal interface.
- access_token (str) – One of four required keyword arguments that make up a complete set of Twitter API credentials.
- access_secret (str) – One of four required keyword arguments that make up a complete set of Twitter API credentials.
- consumer_key (str) – One of four required keyword arguments that make up a complete set of Twitter API credentials.
- consumer_secret (str) – One of four required keyword arguments that make up a complete set of Twitter API credentials.
raw(drop_collections=False)¶
Produces the extractor object’s data as it is stored internally.
Parameters: drop_collections (bool) – Defaults to False. Indicates whether columns containing lists/dicts/sets will be dropped.
Returns: pandas.DataFrame
tweets()¶
Returns a table of Twitter user data, with “tweets” as rows/observations.
Note
drop_collections is not available for this method, since there are no meaningful collections to keep.
Returns: pandas.DataFrame
users(drop_collections=True)¶
Returns a table of Twitter user data, with “users” as rows/observations.
Parameters: drop_collections (bool) – Defaults to True. Indicates whether columns containing lists/dicts/sets will be dropped.
Returns: pandas.DataFrame
Contributing¶
tidyextractors is a very new project, but it will grow quickly in the coming months. If there is a particular kind of data you are interested in extracting, or if you wish to contribute to the package, please contact Joel Becker (mail@joelbecker.ca) or Jillian Anderson (jillianderson8@gmail.com) and we will respond ASAP.
Creating an Extractor¶
Contributing a new extractor is relatively simple. Broadly speaking, you need to create a submodule with an extractor class that inherits from BaseExtractor. To create this class (e.g. NewExtractor), you need to do the following; a minimal sketch is given after the list:
- Define a NewExtractor._extract method, which should extract data and assign it to NewExtractor._data. This method will be called by BaseExtractor.__init__ during initialization.
- Create a method to return each data format (e.g. commits, changes).