
dbling: The Chrome OS Forensic Tool


dbling is a tool for performing forensic analysis on Chrome OS.

Please view the latest version of the documentation on Read the Docs and the latest version of the code on GitHub.

Publication

This work is based on the following publication:

Installation

Coming soon!

dbling Components

dbling is divided into the following main components:

Crawler

The Crawler downloads the list of extensions currently available on the Chrome Web Store, determines which extensions are at a version that has already been downloaded, downloads those that have not been, and adds information on the newly downloaded extensions to the database.

The code for the Crawler is under crawl: The Chrome Web Store Crawler.

Template Generator

The Template Generator runs concurrently with the Crawler. For each new extension downloaded by the Crawler, the Template Generator calculates the centroid of the extension and stores it in the database. The Template Generator does not run inside Chrome or Chrome OS, so it does not use the same mechanisms for unpacking and installing extensions that Chrome does natively. Instead, its primary function is to mimic, as closely as possible, Chrome's behavior when unpacking and installing extensions.

The code for the Template Generator is implemented alongside the Crawler, but the main function that creates templates is calc_centroid().

Profiler

Coming soon!

MERL Exporter

Coming soon!

gripper

Coming soon!

License

dbling is licensed under the MIT License.

dbling API

crawl: The Chrome Web Store Crawler

tasks

Tasks for Celery workers.

Beat Tasks

Beat tasks are those that run on a periodic basis, depending on the configuration in celeryconfig.py or any cron jobs set up in the Ansible playbooks. Beat tasks only initiate the workflow by creating the jobs; they don't actually do the work for each task.
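As a sketch of what such a periodic entry can look like (the real schedule lives in celeryconfig.py, which is not shown here; the task name and interval below are illustrative assumptions, not the project's actual values):

```python
from datetime import timedelta

# Hypothetical sketch: the actual schedule is defined in celeryconfig.py.
# The task name and interval here are illustrative assumptions.
beat_schedule = {
    'start-crawl': {
        'task': 'tasks.start_list_download',  # a beat task that creates the per-CRX jobs
        'schedule': timedelta(days=1),        # run once per day
    },
}
```

A schedule entry like this only enqueues the jobs; the per-CRX work is done by the entry-point and worker functions described below.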

Entry Points

Entry points are where an actual worker begins its work. A single task corresponds to a specific CRX file. The task function dictates what operations are performed on the CRX. Each operation is represented by a specific worker function (as described below).

Worker Functions

Worker functions each represent a discrete action to be taken on a CRX file.

Helper Tasks and Functions

These functions provide additional functionality that doesn't fit into any of the above categories.

db_iface
webstore_iface

Chrome Web Store interface for dbling.

exception crawl.webstore_iface.ListDownloadFailedError(*args, **kwargs)[source]

Raised when the list download fails.

Initialize RequestException with request and response objects.

exception crawl.webstore_iface.ExtensionUnavailable[source]

Raised when an extension isn’t downloadable.

exception crawl.webstore_iface.BadDownloadURL[source]

Raised when the ID is valid but we can’t download the extension.

exception crawl.webstore_iface.VersionExtractError[source]

Raised when extracting the version number from the URL fails.

class crawl.webstore_iface.DownloadCRXList(ext_url, *, return_count=False, session=None)[source]

Generate list of extension IDs downloaded from Google.

As a generator, this is designed to be used in a for loop. For example:

>>> crx_list = DownloadCRXList(download_url)
>>> for crx_id in crx_list:
...     print(crx_id)

The list of CRXs will be downloaded just prior to when the first item is generated. In other words, instantiating this class doesn't start the download; iterating over the instance does. This is significant because downloading the list is quite time consuming.

Parameters:
  • ext_url (str) – Specially crafted URL that will let us download the list of extensions.
  • return_count (bool) – When True, will return a tuple of the form: (crx_id, job_number), where job_number is the index of the ID plus 1. This way, the job number of the last ID returned will be the same as len(DownloadCRXList).
  • session (requests.Session) – Session object to use when downloading the list. If None, a new requests.Session object is created.
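The shape of the values yielded when return_count=True can be pictured with plain enumerate. This is an illustration of the documented (crx_id, job_number) form only, not the class's implementation, and the IDs are stand-ins:

```python
# Illustration only: real CRX IDs are 32-character strings; these are stand-ins.
crx_ids = ['aaaa', 'bbbb', 'cccc']

# With return_count=True, each item is (crx_id, job_number), where
# job_number is the 1-based index of the ID, so the last job number
# equals the length of the list.
pairs = [(crx_id, i + 1) for i, crx_id in enumerate(crx_ids)]
```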
download_ids()[source]

Starting point for downloading all CRX IDs.

This function actually creates an event loop and starts the downloads asynchronously.

Return type:None
_async_download_lists()[source]

Download, loop through the list of lists, combine IDs from each.

Return type:None
_dl_parse_id_list(list_url)[source]

Download the extension list at the given URL, return set of IDs.

Parameters:list_url (str) – URL of an individual extension list.
Returns:Set of CRX IDs.
Return type:set
crawl.webstore_iface.save_crx(crx_obj, download_url, save_path=None, session=None)[source]

Download the CRX, save in the save_path directory.

The saved file will have the format: <extension ID>_<version>.crx

If save_path isn’t given, this will default to a directory called “downloads” in the CWD.

Adds the following keys to crx_obj:

  • version: Version number of the extension, as obtained from the final URL of the download. This may differ from the version listed in the extension’s manifest.
  • filename: The basename of the CRX file (not the full path)
  • full_path: The location (full path) of the downloaded CRX file
Parameters:
  • crx_obj (munch.Munch) – Previously collected information about the extension.
  • download_url (str) – The URL template that already contains the correct Chrome version information and {} where the ID goes.
  • save_path (str or None) – Directory where the CRX should be saved.
  • session (requests.Session or None) – Optional Session object to use for HTTP requests.
Returns:

Updated version of crx_obj with version, filename, and full_path information added. If the download wasn’t successful, not all of these may have been added, depending on when it failed.

Return type:

munch.Munch
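The URL template and filename convention described above can be sketched as follows. The extension ID, version, and URL domain are made-up placeholders for illustration:

```python
# Illustrative values; the ID, version, and URL template are made up.
ext_id = 'aapocclcgogkmnckokdopfmhonfmgoek'
version = '0.10'

# download_url is described as a template with {} where the ID goes.
download_url_template = 'https://example.invalid/crx?id={}'
url = download_url_template.format(ext_id)

# The saved file has the format <extension ID>_<version>.crx
filename = '{}_{}.crx'.format(ext_id, version)
```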

merl: Matching Extension Ranking List Files

Google API Acquisition Tool

Access many APIs and acquire as much user-identifying information as possible.

Getting Started

Gripper depends on an active G Suite account, which requires a domain name for your organization (you can create a new domain while creating your G Suite account if you don’t already have one).

Gripper has only been tested in a Linux environment.

Install Python 3 / pip3

Install the Google API Python client:

pip3 install --upgrade google-api-python-client
Prerequisites
Installing
Running

Command line:

Usage: gripper.py drive [options] (created | revised | comment) ...
       gripper.py reports [options]

Options:
 -c --cached    Use a cached version of the data, if available.
 -e EMAIL --email=EMAIL
                The email address of a user to impersonate. This requires
                domain-wide delegation to be activated. See
                https://developers.google.com/admin-sdk/reports/v1/guides/delegation
                for instructions.
 --level=LEVEL  The granularity level of the resulting heat map [default: hr]
 --start=START -s START
                The earliest data to collect. Can be any kind of date string,
                as long as it is unambiguous (e.g. "2017"). It can even be
                slang, such as "a year ago". Be aware, however, that only the
                *day* of the date will be used, meaning time information will
                be discarded.
 --end=END -e END
                The latest data to collect. Same format rules apply for this
                as for --start.
 --tz=TIMEZONE  The timezone to convert all timestamps to before compiling.
                This should be a standard timezone name. For reference, the
                list that the timezone will be compared against is available
                at https://github.com/newvem/pytz/blob/master/pytz/__init__.py.
                If omitted, the local timezone of the computer will be used.

Note: If you start this script using ipython (recommended), you'll need to
invoke it like this:

    $ ipython3 gripper.py -- [typical arguments]

The reason for this is that ipython interprets any options *before* the ``--``
as being meant for it.
Bugs or Issues

If you receive this error:

Failed to start a local webserver listening on either port 8080
or port 8090. Please check your firewall settings and locally
running programs that may be blocking or using those ports.

run lsof -w -n -i tcp:8080 or lsof -w -n -i tcp:8090, respectively, to find the PID of the process using the port, then run kill -9 PID.

Or you can click the link provided in the terminal and then copy and paste the key from the webpage that is launched.

API Documentation

Please refer to the following documents for information on the API:

apis
admin
class google.apis.admin.ReportsAPI(http=None, impersonated_user_email=None, start=None, end=None, timezone=None)[source]

Class to interact with G Suite Admin Reports APIs.

Documentation for the Python API: https://developers.google.com/resources/api-libraries/documentation/admin/reports_v1/python/latest/

See also: https://developers.google.com/admin-sdk/reports/v1/quickstart/python

Parameters:
  • http (httplib2.Http) – An Http object for sending the requests. In general, this should be left as None, which will allow for auto-adjustment of the kind of Http object to create based on whether a user’s email address is to be impersonated.
  • impersonated_user_email (str) – The email address of a user to impersonate. This requires domain-wide delegation to be activated. See https://developers.google.com/admin-sdk/reports/v1/guides/delegation for instructions.
  • start (str) – The earliest data to collect. Can be any kind of date string, as long as it is unambiguous (e.g. “2017”). It can even be slang, such as “a year ago”. Be aware, however, that only the day of the date will be used, meaning time information will be discarded.
  • end (str) – The latest data to collect. Same format rules apply for this as for the start parameter.
  • timezone (str) – The timezone to convert all timestamps to before compiling. This should be a standard timezone name. For reference, the list that the timezone will be compared against is available at https://github.com/newvem/pytz/blob/master/pytz/__init__.py. If omitted, the local timezone of the computer will be used.
activity(user_key='all', app_name=None, **kwargs)[source]

Return the last 180 days of activities.

https://developers.google.com/admin-sdk/reports/v1/reference/activities/list

The application_name parameter specifies which events are to be retrieved. The possible values include:

  • admin – The Admin console application’s activity reports return account information about different types of administrator activity events.
  • calendar – The G Suite Calendar application’s activity reports return information about various Calendar activity events.
  • drive – The Google Drive application’s activity reports return information about various Google Drive activity events. The Drive activity report is only available for G Suite Business customers.
  • groups – The Google Groups application’s activity reports return information about various Groups activity events.
  • gplus – The Google+ application’s activity reports return information about various Google+ activity events.
  • login – The G Suite Login application’s activity reports return account information about different types of Login activity events.
  • mobile – The G Suite Mobile Audit activity reports return information about different types of Mobile Audit activity events.
  • rules – The G Suite Rules activity reports return information about different types of Rules activity events.
  • token – The G Suite Token application’s activity reports return account information about different types of Token activity events.
Parameters:
  • user_key (str) – The value can be 'all', which returns all administrator information, or a userKey, which represents a user’s unique G Suite profile ID or the primary email address of a person or entity.
  • app_name (str) – Name of application from the list above. If set to None, data will be retrieved from all the applications listed above.
Returns:

JSON

get_customer_usage_reports(date, customer_id=False)[source]

Get customer usage reports.

https://developers.google.com/admin-sdk/reports/v1/reference/customerUsageReports/get

Parameters:
  • date
  • customer_id
Returns:

JSON

get_user_usage_report(date, user_key='all')[source]

Get user usage report.

https://developers.google.com/admin-sdk/reports/v1/reference/userUsageReport/get

Parameters:
  • date
  • user_key
Returns:

JSON

class google.apis.admin.DirectoryAPI(**kwargs)[source]

Class to interact with G Suite Admin Directory APIs.

Documentation for the Python API: https://developers.google.com/resources/api-libraries/documentation/admin/directory_v1/python/latest/

See also: https://developers.google.com/admin-sdk/directory/v1/quickstart/python

list_chromeos_devices(fields='*')[source]

List up to 100 Chrome OS devices in the organization.

API: https://developers.google.com/resources/api-libraries/documentation/admin/directory_v1/python/latest/admin_directory_v1.chromeosdevices.html

Reference: https://developers.google.com/admin-sdk/directory/v1/reference/chromeosdevices/list

Parameters:fields (str) – Comma-separated list of metadata fields to request.
Returns:The list of Chrome OS devices. See one of the documentation links above for the format of the return value.
Return type:list
get_user(user_email)[source]

Get information for a single user specified by their email.

https://developers.google.com/admin-sdk/directory/v1/reference/users/get

Parameters:user_email (str) – user’s email
Returns:JSON
get_all_users(domain_name)[source]

Return all users in the domain.

https://developers.google.com/admin-sdk/directory/v1/reference/users/list

Parameters:domain_name (str) – the name of the domain
Returns:JSON
get_chrome_os_devices_properties(device_id)[source]

Get data pertaining to a single ChromeOS device.

https://developers.google.com/admin-sdk/directory/v1/reference/chromeosdevices/get

Parameters:device_id (str) – unique ID for the device.
Returns:JSON
list_customers_mobile_devices_properties()[source]

Get a list of mobile devices.

https://developers.google.com/admin-sdk/directory/v1/reference/mobiledevices/list

Returns:JSON
get_mobile_devices_properties(resource_id)[source]

Get data pertaining to a single mobile device.

https://developers.google.com/admin-sdk/directory/v1/reference/mobiledevices/get

Parameters:resource_id (str) – The unique ID the API service uses to identify the mobile device.
Returns:JSON
suspend_user_account(user_email)[source]

Suspend a user's account.

https://developers.google.com/admin-sdk/directory/v1/reference/users/update https://developers.google.com/admin-sdk/directory/v1/guides/manage-users

Parameters:user_email (str) – Email for the user to be suspended.
Returns:JSON
unsuspend_user_account(user_email)[source]

Un-suspend a user’s account.

https://developers.google.com/admin-sdk/directory/v1/reference/users/update https://developers.google.com/admin-sdk/directory/v1/guides/manage-users

Parameters:user_email (str) – Email for the user to be un-suspended.
Returns:JSON
drive
google.apis.drive.DRIVE_BACKUP_FILE = '/home/docs/checkouts/readthedocs.org/user_builds/dbling/checkouts/latest/google/apis/../drive_data_backup.pkl'

Location of pickled data when cached.

google.apis.drive.SEGMENT_SIZE = 4

Number of hours in a segment. Must evenly divide 24 to avoid issues.
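The divisibility constraint and the resulting segment count can be checked directly; the number of segments per day is 24 divided by SEGMENT_SIZE:

```python
SEGMENT_SIZE = 4  # hours per segment, as documented above

# SEGMENT_SIZE must evenly divide 24 so every day splits into whole segments.
assert 24 % SEGMENT_SIZE == 0
segments_per_day = 24 // SEGMENT_SIZE  # 6 segments of 4 hours each
```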

class google.apis.drive.DriveAPI(http=None, impersonated_user_email=None, start=None, end=None, timezone=None)[source]

Class to interact with Google Drive APIs.

Documentation for the Python API:

Quick start guide:

Parameters:
  • http (httplib2.Http) – An Http object for sending the requests. In general, this should be left as None, which will allow for auto-adjustment of the kind of Http object to create based on whether a user’s email address is to be impersonated.
  • impersonated_user_email (str) – The email address of a user to impersonate. This requires domain-wide delegation to be activated. See https://developers.google.com/admin-sdk/reports/v1/guides/delegation for instructions.
  • start (str) – The earliest data to collect. Can be any kind of date string, as long as it is unambiguous (e.g. “2017”). It can even be slang, such as “a year ago”. Be aware, however, that only the day of the date will be used, meaning time information will be discarded.
  • end (str) – The latest data to collect. Same format rules apply for this as for the start parameter.
  • timezone (str) – The timezone to convert all timestamps to before compiling. This should be a standard timezone name. For reference, the list that the timezone will be compared against is available at https://github.com/newvem/pytz/blob/master/pytz/__init__.py. If omitted, the local timezone of the computer will be used.
activity(level, what=('files', 'revisions'), use_cached=False, **kwargs)[source]

Compile the user’s activity.

Note about revision history: One of the metadata fields for file revisions is called “keepForever”. This indicates whether to keep the revision forever, even if it is no longer the head revision. If not set, the revision will be automatically purged 30 days after newer content is uploaded. This can be set on a maximum of 200 revisions for a file.

Parameters:
  • level (str) –

    Level of detail on the activity. Accepted values:

    • 'dy': Activity is summarized by day
    • 'hr': Activity is summarized by hour, X:00:00 to X:59:59
    • 'sg': Activity throughout the day is divided into a number of segments (24 divided by SEGMENT_SIZE).
  • what (tuple or list) –

    Indicates what kind of content to scan for activity. Accepted values:

    • 'created'
    • 'revisions'
    • 'comments'
  • use_cached (bool) – Whether or not to use cached data. When set, this avoids downloading all the file metadata from Google if a cached version of the data is available on disk.
Returns:

A dictionary containing three keys: x, y, and z. Each of these stores a list suitable for passing as the data set for a plot.

Return type:

dict(str, list)

Raises:

ValueError – When the level or what parameters have an unsupported format or value.

get_about(fields='*')[source]

Retrieve information about the user's Drive and system capabilities.

https://developers.google.com/drive/v3/reference/about

Parameters:fields (string) – fields to be returned
Returns:JSON
team_drives

A list of team drives associated with the user.

Return type:list(str)
get_changes(spaces='drive', include_team_drives=True, restrict_to_my_drive=False, include_corpus_removals=None, include_removed=None)[source]

Return the changes for a Google Drive account.

The set of changes returned by this method is better suited to a file-syncing application.

In the returned dict, the key for changes in the user’s regular Drive is an empty string (''). The data for each Team Drive (assuming include_team_drives is True) is stored using a key in the format 'team_drive_X', where X is the ID of the Team Drive. For the form of the JSON data, go to https://developers.google.com/resources/api-libraries/documentation/drive/v3/python/latest/drive_v3.teamdrives.html#list
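Consuming the returned dict might look like the sketch below; the keys follow the documented '' and 'team_drive_X' formats, but the Team Drive ID and change entries are fabricated for illustration:

```python
# Hypothetical return value following the documented key format; the
# Team Drive ID and file IDs are made up.
changes = {
    '': [{'fileId': '1'}],                    # the user's regular Drive
    'team_drive_0AbCdEf': [{'fileId': '2'}],  # one Team Drive, keyed by its ID
}

my_drive_changes = changes['']
team_drive_ids = [k[len('team_drive_'):] for k in changes
                  if k.startswith('team_drive_')]
```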

https://developers.google.com/drive/v3/reference/changes

Parameters:
  • spaces (str) – A comma-separated list of spaces to query within the user corpus. Supported values are ‘drive’, ‘appDataFolder’ and ‘photos’.
  • include_team_drives (bool) – Whether or not to include data from Team Drives as well as the user’s Drive.
  • restrict_to_my_drive (bool) – Whether to restrict the results to changes inside the My Drive hierarchy. This omits changes to files such as those in the Application Data folder or shared files which have not been added to My Drive.
  • include_corpus_removals (bool) – Whether changes should include the file resource if the file is still accessible by the user at the time of the request, even when a file was removed from the list of changes and there will be no further change entries for this file.
  • include_removed (bool) – Whether to include changes indicating that items have been removed from the list of changes, for example by deletion or loss of access.
Returns:

All data on changes by the user, in JSON format, stored in a dict.

Return type:

dict(str, dict)

gen_file_data(fields='*', spaces='drive', include_team_drives=True, corpora=None)[source]

Generate the metadata for the user’s Drive files.

This function is a generator, so it yields the metadata for one file at a time. For the format of the dict generated, see https://developers.google.com/resources/api-libraries/documentation/drive/v3/python/latest/drive_v3.files.html#list

Parameters:
  • fields (str) – The metadata fields to retrieve.
  • spaces (str) – A comma-separated list of spaces to query within the user corpus. Supported values are ‘drive’, ‘appDataFolder’ and ‘photos’.
  • include_team_drives (bool) – Whether or not to include data from Team Drives as well as the user’s Drive.
  • corpora (str) – Comma-separated list of bodies of items (files/documents) to which the query applies. Supported bodies are ‘user’, ‘domain’, ‘teamDrive’ and ‘allTeamDrives’. ‘allTeamDrives’ must be combined with ‘user’; all other values must be used in isolation. Prefer ‘user’ or ‘teamDrive’ to ‘allTeamDrives’ for efficiency.
Returns:

The file metadata.

Return type:

dict

export_drive_file(file_data, download_path)[source]

Export and convert .g* files to real files, then download them.

https://developers.google.com/drive/v3/reference/files/export

Parameters:
  • file_data (JSON) – List of file(s) to be downloaded
  • download_path – Path where the file will be downloaded
Returns:

True if the download succeeded, False if it failed.

export_real_file(file_data, download_path)[source]

Download real files, i.e., files that are not in a .g* Google format.

https://developers.google.com/drive/v3/reference/files/export

Parameters:
  • file_data (JSON) – List of file(s) to be downloaded
  • download_path – Path where the file will be downloaded
Returns:

Nothing

download_files(file_list_array=False)[source]

Download files from the user's Drive.

https://developers.google.com/drive/v3/web/manage-downloads

Parameters:file_list_array (array) – list of file(s) to be downloaded
Returns:Nothing
get_app_folder(fields='nextPageToken, files(id, name)')[source]

Return the data in the user's app data folder.

https://developers.google.com/drive/v3/reference/files/list

Parameters:fields (string) – fields to be returned
Returns:JSON
get_photo_data(fields='nextPageToken, files(id, name)')[source]

Return data about the user's photos.

https://developers.google.com/drive/v3/reference/files/list

Parameters:fields (string) – fields to be returned
Returns:JSON
google.apis.drive.crunch(level, **kwargs)[source]

Consolidate the data to the specified level.

Parameters:
  • data (CalDict) – The data from parsing the Drive metadata.
  • level (str) – Must be one of dy, sg, or hr. For an explanation of these options, see the docstring for DriveAPI.activity().
  • start (datetime.date) – The earliest data to collect.
  • end (datetime.date) – The latest data to collect.
Returns:

Tuple with two elements. The first is a DateRange object which stores the first and last days with activity (the range of dates that the data corresponds to) in its start and end attributes, respectively. Both of these attributes are date objects.

The second element in the returned tuple is a list containing the data for each day. The contents of this list vary based on the value of level:

  • dy: A single list of int s, one for each day.
  • sg: list s of int s. Each list corresponds to a segment, each int corresponds to a day. These lists are in reverse order, meaning the first list represents the last segment of a day.
  • hr: list s of int s. Each list corresponds to an hour, each int corresponds to a day. These lists are in reverse order, meaning the first list represents the last hour of a day.

Return type:

tuple(DateRange, list(list(int)))
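Because the hr-level (and sg-level) lists come back in reverse order, index arithmetic like the following sketch recovers which hour a given list represents; this helper is hypothetical, not part of the module:

```python
def hour_for_index(i):
    """Map a list index at level 'hr' to the hour of day it represents.

    The per-hour lists are in reverse order: the first list (index 0)
    holds the last hour of the day, so index i maps to hour 23 - i.
    """
    return 23 - i
```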

gmail
class google.apis.gmail.GmailAPI(http=None, impersonated_user_email=None, start=None, end=None, timezone=None)[source]

Class to interact with Google Gmail APIs.

Parameters:
  • http (httplib2.Http) – An Http object for sending the requests. In general, this should be left as None, which will allow for auto-adjustment of the kind of Http object to create based on whether a user’s email address is to be impersonated.
  • impersonated_user_email (str) – The email address of a user to impersonate. This requires domain-wide delegation to be activated. See https://developers.google.com/admin-sdk/reports/v1/guides/delegation for instructions.
  • start (str) – The earliest data to collect. Can be any kind of date string, as long as it is unambiguous (e.g. “2017”). It can even be slang, such as “a year ago”. Be aware, however, that only the day of the date will be used, meaning time information will be discarded.
  • end (str) – The latest data to collect. Same format rules apply for this as for the start parameter.
  • timezone (str) – The timezone to convert all timestamps to before compiling. This should be a standard timezone name. For reference, the list that the timezone will be compared against is available at https://github.com/newvem/pytz/blob/master/pytz/__init__.py. If omitted, the local timezone of the computer will be used.
get_labels()[source]

Return a list of mailbox labels.

Returns:JSON
get_all()[source]

Method used for testing.

Returns:nothing
google
class google.apis.google.GoogleAPI(http=None, impersonated_user_email=None, start=None, end=None, timezone=None)[source]

Interface to the Google API.

See the documentation for subclasses for more detailed information.

Parameters:
  • http (httplib2.Http) – An Http object for sending the requests. In general, this should be left as None, which will allow for auto-adjustment of the kind of Http object to create based on whether a user’s email address is to be impersonated.
  • impersonated_user_email (str) – The email address of a user to impersonate. This requires domain-wide delegation to be activated. See https://developers.google.com/admin-sdk/reports/v1/guides/delegation for instructions.
  • start (str) – The earliest data to collect. Can be any kind of date string, as long as it is unambiguous (e.g. “2017”). It can even be slang, such as “a year ago”. Be aware, however, that only the day of the date will be used, meaning time information will be discarded.
  • end (str) – The latest data to collect. Same format rules apply for this as for the start parameter.
  • timezone (str) – The timezone to convert all timestamps to before compiling. This should be a standard timezone name. For reference, the list that the timezone will be compared against is available at https://github.com/newvem/pytz/blob/master/pytz/__init__.py. If omitted, the local timezone of the computer will be used.
people
class google.apis.people.PeopleAPI(http=None, impersonated_user_email=None, start=None, end=None, timezone=None)[source]

Class to interact with Google People APIs.

Parameters:
  • http (httplib2.Http) – An Http object for sending the requests. In general, this should be left as None, which will allow for auto-adjustment of the kind of Http object to create based on whether a user’s email address is to be impersonated.
  • impersonated_user_email (str) – The email address of a user to impersonate. This requires domain-wide delegation to be activated. See https://developers.google.com/admin-sdk/reports/v1/guides/delegation for instructions.
  • start (str) – The earliest data to collect. Can be any kind of date string, as long as it is unambiguous (e.g. “2017”). It can even be slang, such as “a year ago”. Be aware, however, that only the day of the date will be used, meaning time information will be discarded.
  • end (str) – The latest data to collect. Same format rules apply for this as for the start parameter.
  • timezone (str) – The timezone to convert all timestamps to before compiling. This should be a standard timezone name. For reference, the list that the timezone will be compared against is available at https://github.com/newvem/pytz/blob/master/pytz/__init__.py. If omitted, the local timezone of the computer will be used.
get_contacts()[source]

Return a list of contacts for the authenticated user.

Returns:JSON
get_all()[source]

Method used for testing.

Returns:nothing
plus
class google.apis.plus.PlusAPI(http=None, impersonated_user_email=None, start=None, end=None, timezone=None)[source]

Class to interact with Google Plus APIs.

Parameters:
  • http (httplib2.Http) – An Http object for sending the requests. In general, this should be left as None, which will allow for auto-adjustment of the kind of Http object to create based on whether a user’s email address is to be impersonated.
  • impersonated_user_email (str) – The email address of a user to impersonate. This requires domain-wide delegation to be activated. See https://developers.google.com/admin-sdk/reports/v1/guides/delegation for instructions.
  • start (str) – The earliest data to collect. Can be any kind of date string, as long as it is unambiguous (e.g. “2017”). It can even be slang, such as “a year ago”. Be aware, however, that only the day of the date will be used, meaning time information will be discarded.
  • end (str) – The latest data to collect. Same format rules apply for this as for the start parameter.
  • timezone (str) – The timezone to convert all timestamps to before compiling. This should be a standard timezone name. For reference, the list that the timezone will be compared against is available at https://github.com/newvem/pytz/blob/master/pytz/__init__.py. If omitted, the local timezone of the computer will be used.
get_me()[source]

Return Google+ information for the current user.

Returns:
get_all()[source]

Method used for testing.

Returns:nothing
google.apis.get_api(api, **kwargs)[source]

Shortcut for creating an API object.

Parameters:
  • api (str) –

    Name of the API to instantiate. Acceptable values are:

    • 'drive'
    • 'plus'
    • 'people'
    • 'dir'
    • 'gmail'
    • 'reports'
  • kwargs (dict) – Set of keyword arguments to pass to the object’s constructor.
Returns:

An instance of the created object.

Return type:

DriveAPI or PlusAPI or PeopleAPI or DirectoryAPI or GmailAPI or ReportsAPI

Other Functions and Classes

The following functions and classes are helpers to the code documented elsewhere.

util

Google API Client Library Page https://developers.google.com/api-client-library/python/reference/pydoc Python Quick Start Page https://developers.google.com/drive/v3/web/quickstart/python

exception google.util.InvalidCredsError[source]

Raised when HTTP credentials don’t work.

class google.util.CalDict[source]

A dict-like class for storing hourly data for a year.

This is intended to have a set of keys that correspond to years. Since Python’s syntax dictates that objects cannot have attributes with names consisting only of numbers (e.g. cal.2017), one solution would be to name the year keys cal.y2017, cal.y2016, etc. This is the intended convention for CalDict objects and aligns with how month and day data is named.

Once you have created an instance of CalDict, you can easily create the structures necessary to store a year’s worth of data like so:

>>> cal = CalDict()
>>> cal[2017]

Just accessing the 2017 key (which is an int) assigns its value to be a dict with 12 keys, one for each month, numbered 1 through 12. Each of those keys points to a dict object with 31 keys, numbered 1 through 31. The day keys point to a list of 24 integers, initialized to 0. This allows you to increment the value for a particular hour immediately after instantiation, like the following, which increments the counter for the 2 PM hour block on August 31, 2016:

>>> cal2 = CalDict()
>>> y, m, d = 2016, 8, 31
>>> cal2[y][m][d][14] += 1

Since all months in a CalDict instance have 31 days, I recommend you use an external method of validating a particular date before storing or retrieving data.
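The auto-initializing behavior described above can be sketched with a dict subclass; CalDictSketch is a hypothetical re-implementation for illustration, not the actual class:

```python
class CalDictSketch(dict):
    """Hypothetical sketch of CalDict's auto-initializing behavior."""

    def __missing__(self, year):
        # First access to a year key builds 12 months -> 31 days -> 24 hourly
        # counters, all initialized to 0, matching the description above.
        self[year] = {m: {d: [0] * 24 for d in range(1, 32)} for m in range(1, 13)}
        return self[year]

cal = CalDictSketch()
cal[2016][8][31][14] += 1  # increment the 2 PM block on August 31, 2016
```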

google.util.get_credentials(scope=None, application_name=None, secret=None, credential_file=None)[source]

Create the credential file for accessing the Google APIs.

https://developers.google.com/drive/v3/web/quickstart/python

Parameters:
  • scope (str) – String of Scopes separated by spaces to give access to different Google APIs. Defaults to SCOPES.
  • application_name (str) – Name of this Application. Defaults to APPLICATION_NAME.
  • secret (str) – The secret file given from Google. Should be named client_secret.json. Defaults to CLIENT_SECRET_FILE.
  • credential_file (str) – Name of the credential file to be created. Defaults to CREDENTIAL_FILE.
Returns:

Credential object.

Raises:

InvalidCredsError – if the credential file is missing or invalid.

google.util.set_http(impersonated_user_email=None)[source]

Create and return the Http object used to communicate with Google.

https://developers.google.com/drive/v3/web/quickstart/python

Parameters:impersonated_user_email (str) – Email address of the User to be impersonated. This uses domain wide delegation to do the impersonation.
Returns:The Http object.
Return type:httplib2.Http
Raises:InvalidCredsError – if the credential file is missing or invalid.
google.util.print_json(obj, sort=False, indent=2)[source]

Print the JSON object in a human readable format.

Parameters:
  • obj (dict or list) – JSON-serializable object.
  • sort (bool) – Whether to sort the keys before printing.
  • indent (int) – Number of spaces to indent.
Return type:

None

google.util.convert_mime_type_and_extension(google_mime_type)[source]

Return the conversion type and extension for the given Google MIME type.

Converts the mimeType given by Google to one of our choosing for export conversion. This is necessary to download .g* files.

Information on MIME types can be found in the MIME Type Info section below.

Parameters:google_mime_type (str) – mimeType given from Google API
Returns:Tuple in the form (conversion type, extension). If no conversion is supported for the given MIME type, the tuple will be (False, False).
Return type:tuple
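The documented contract (a lookup that falls back to (False, False)) can be sketched as follows; the mapping entries here are illustrative assumptions drawn from the MIME Type Info tables below, not the function's actual table:

```python
# Hypothetical export map: Google MIME type -> (conversion MIME type, extension).
# The real table lives inside google.util; these entries are assumptions.
_EXPORT_MAP = {
    'application/vnd.google-apps.document':
        ('application/vnd.openxmlformats-officedocument.wordprocessingml.document', '.docx'),
    'application/vnd.google-apps.spreadsheet':
        ('application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', '.xlsx'),
    'application/vnd.google-apps.presentation':
        ('application/vnd.openxmlformats-officedocument.presentationml.presentation', '.pptx'),
}

def convert_mime_type_and_extension_sketch(google_mime_type):
    # (False, False) signals that no conversion is supported, per the docs above
    return _EXPORT_MAP.get(google_mime_type, (False, False))
```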
const
google.const.CLIENT_SECRET_FILE = 'client_secret.json'

This file is obtained from Google through the API pages. In the guide, look for the “Create authorization credentials” subsection.

google.const.CREDENTIAL_FILE = 'test_creds.json'

Name of the file that is made by get_credentials

google.const.APPLICATION_NAME = 'dbling'

Name of the application

google.const.DOWNLOAD_DIRECTORY = None

Optional. If set to a path, the user’s Drive files will be downloaded to that location.

google.const.PAGE_SIZE = 1000

Page size for requests. Specifies the number of records to be returned in a single reply. The accepted range for most requests is [1, 1000].

google.const.SCOPES = 'https://www.googleapis.com/auth/drive.readonly https://www.googleapis.com/auth/drive.appfolder https://www.googleapis.com/auth/plus.login https://www.googleapis.com/auth/gmail.readonly https://www.googleapis.com/auth/contacts.readonly https://www.googleapis.com/auth/admin.directory.device.chromeos https://www.googleapis.com/auth/admin.directory.user https://www.googleapis.com/auth/admin.directory.device.mobile.readonly https://www.googleapis.com/auth/admin.directory.customer.readonly https://www.googleapis.com/auth/admin.reports.audit.readonly https://www.googleapis.com/auth/admin.reports.usage.readonly'

Scope: https://developers.google.com/drive/v3/web/about-auth

plot
google.plot.HEATMAP_COLORS = ('#e7f0fa', '#c9e2f6', '#95cbee', '#0099dc', '#4ab04a', '#ffd73e', '#eec73a', '#e29421', '#e29421', '#f05336', '#ce472e')

Color scale used by the heat map

google.plot.STOP_FACTOR = 80

Controls how quickly the color gradient ramps to its maximum value

google.plot.stop(i)[source]

Return the i-th color stop.

In color gradients, the point where a defined color is (as opposed to in between the defined colors, where the colors are “graded”) is called a “stop”. This Python function defines an exponential math function that returns floating point values, in the range of 0 to 1, that define where the gradient stops should occur. In the heatmap() function, these values will be used to determine the color of each cell based on the normalized values of the z parameters.

The number of stops is determined by the number of colors defined in HEATMAP_COLORS. The math function used is below. In it, m = STOP_FACTOR and n = len(HEATMAP_COLORS).

\[\frac{m^{(i/(n - 1))} - 1}{m - 1}\]
Parameters:i (int) – The current stop number. Must be a value between 0 and len(HEATMAP_COLORS) - 1, i.e. [0, n).
Returns:Where the i-th color stop should occur. Will always be a value between 0 and 1.
Return type:float
Raises:ValueError – When i isn’t between 0 and len(HEATMAP_COLORS) - 1.
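The formula above translates directly into code; this is a minimal sketch (stop_sketch is a hypothetical name) using the documented values of HEATMAP_COLORS and STOP_FACTOR:

```python
HEATMAP_COLORS = ('#e7f0fa', '#c9e2f6', '#95cbee', '#0099dc', '#4ab04a',
                  '#ffd73e', '#eec73a', '#e29421', '#e29421', '#f05336', '#ce472e')
STOP_FACTOR = 80

def stop_sketch(i):
    """Sketch of stop(): (m^(i/(n-1)) - 1) / (m - 1), with m=STOP_FACTOR, n=len(HEATMAP_COLORS)."""
    n = len(HEATMAP_COLORS)
    if not 0 <= i <= n - 1:
        raise ValueError('i must be in [0, n)')
    m = STOP_FACTOR
    return (m ** (i / (n - 1)) - 1) / (m - 1)
```

The exponential spacing pushes most stops toward the low end of the scale, so small values get most of the color resolution.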
google.plot.heatmap(x, y, z, title='')[source]

Create and return a heat map figure object for the given data.

Parameters:
  • x (list or tuple) – Data for the x-axis.
  • y (list or tuple) – Data for the y-axis.
  • z (list or tuple) – Data for the z-axis.
  • title (str) – A title for the figure.
Returns:

The object with the data’s graph. With this object you can then call its iplot() method to show the graph.

Return type:

plotly.graph_objs.Figure

MIME Type Info

As specified in the Google Drive API documentation, G Suite and Google Drive use MIME types specific to those services, as follows:

MIME Type Description
application/vnd.google-apps.audio  
application/vnd.google-apps.document Google Docs
application/vnd.google-apps.drawing Google Drawing
application/vnd.google-apps.file Google Drive file
application/vnd.google-apps.folder Google Drive folder
application/vnd.google-apps.form Google Forms
application/vnd.google-apps.fusiontable Google Fusion Tables
application/vnd.google-apps.map Google My Maps
application/vnd.google-apps.photo  
application/vnd.google-apps.presentation Google Slides
application/vnd.google-apps.script Google Apps Scripts
application/vnd.google-apps.sites Google Sites
application/vnd.google-apps.spreadsheet Google Sheets
application/vnd.google-apps.unknown  
application/vnd.google-apps.video  
application/vnd.google-apps.drive-sdk 3rd party shortcut

In addition to the above MIME types, Google Doc formats can be exported as the following MIME types, as described in the Drive documentation:

Google Doc Format Conversion Format Corresponding MIME type
Documents HTML text/html
  HTML (zipped) application/zip
  Plain text text/plain
  Rich text application/rtf
  Open Office doc application/vnd.oasis.opendocument.text
  PDF application/pdf
  MS Word document application/vnd.openxmlformats-officedocument.wordprocessingml.document
  EPUB application/epub+zip
Spreadsheets MS Excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  Open Office sheet application/x-vnd.oasis.opendocument.spreadsheet
  PDF application/pdf
  CSV (1st sheet only) text/csv
  TSV (1st sheet only) text/tab-separated-values
  HTML (zipped) application/zip
Drawings JPEG image/jpeg
  PNG image/png
  SVG image/svg+xml
  PDF application/pdf
Presentations MS PowerPoint application/vnd.openxmlformats-officedocument.presentationml.presentation
  Open Office presentation application/vnd.oasis.opendocument.presentation
  PDF application/pdf
  Plain text text/plain
Apps Scripts JSON application/vnd.google-apps.script+json
Authors

common: Modules Used Throughout dbling

centroid: Representation of a Centroid
clr: Color Text Easily

Color text.

Typical usage:

>>> red('red text', False)

Returns the string “red text” where the text will be red and the background will be the default.

>>> red('red background')

Returns the string “red background” where the text will be the default color and the background will be red.

common.clr.add_color_log_levels(center=False)[source]

Alter log level names to be colored.

Levels are colored to have black text and a background colored as follows:

  • Level 50 (Critical): red
  • Level 40 (Error): magenta
  • Level 30 (Warning): yellow
  • Level 20 (Info): blue
  • Level 10 (Debug): green
  • Level 0 (Not Set): white
Parameters:center (bool) – If log text should be centered. When set to True, the text will be centered to the width of "CRITICAL", which is 8 characters. This makes it so the level in the log output always takes up the same number of characters.
Return type:None
common.clr.black(text, background=True)[source]

Set text (or its background) to be black.

common.clr.red(text, background=True)[source]

Set text (or its background) to be red.

common.clr.green(text, background=True)[source]

Set text (or its background) to be green.

common.clr.yellow(text, background=True)[source]

Set text (or its background) to be yellow.

common.clr.blue(text, background=True)[source]

Set text (or its background) to be blue.

common.clr.magenta(text, background=True)[source]

Set text (or its background) to be magenta.

common.clr.cyan(text, background=True)[source]

Set text (or its background) to be cyan.

common.clr.white(text, background=True)[source]

Set text (or its background) to be white.

const: Constant Values

Constant values used by dbling.

common.const.IN_PAT_VAULT = re.compile('^/?home/\\.shadow/[0-9a-z]*?/vault/user/')

Regular expression pattern for including only the user’s files

common.const.ENC_PAT = re.compile('/ECRYPTFS_FNEK_ENCRYPTED\\.([^/]*)$')

Regular expression pattern for identifying encrypted files

common.const.SLICE_PAT = re.compile('.*(/home.*)')
common.const.CRX_URL = 'https://chrome.google.com/webstore/detail/%s'

URL used for downloading CRXs

common.const.ISO_TIME = '%Y-%m-%dT%H:%M:%SZ'

ISO format for date time values

common.const.DENTRY_FIELD_BYTES = 8

Number of bytes used by the dir entry fields (preceding the filename)

class common.const.FType[source]

File types as stored in directory entries in ext2, ext3, and ext4.

common.const.MODE_UNIX = {32768: 1, 16384: 2, 24576: 4, 40960: 7, 4096: 5, 8192: 3, 49152: 6}

Maps the values returned by stat.S_IFMT (usually displayed in octal) to one of the regular Unix file types

common.const.TYPE_TO_NAME = {0: '-', 1: 'r', 2: 'd', 3: 'c', 4: 'b', 5: 'p', 6: 's', 7: 'l'}

Maps Unix file type numbers to the character used in DFXML to represent that file type

See: https://github.com/dfxml-working-group/dfxml_schema/blob/4c8aab566ea44d64313a5e559b1ecdce5348cecf/dfxml.xsd#L412

Other file types defined in DFXML schema

  • h - Shadow inode (Solaris)
  • w - Whiteout (OpenBSD)
  • v - Special (Used in The SleuthKit for added “Virtual” files, e.g. $FAT1)
class common.const.ModeTypeDT[source]

File types as stored in the file’s mode.

In Linux, fs.h defines these values and stores them in bits 12-15 of stat.st_mode, e.g. (i_mode >> 12) & 15. In fs.h, the names are prefixed with DT_, hence the name of this enum class. Here are the original definitions:

#define DT_UNKNOWN      0
#define DT_FIFO         1
#define DT_CHR          2
#define DT_DIR          4
#define DT_BLK          6
#define DT_REG          8
#define DT_LNK          10
#define DT_SOCK         12
#define DT_WHT          14
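The C definitions above can be mirrored as a Python IntEnum; a minimal sketch (ModeTypeDTSketch and dt_from_mode are hypothetical names, not the actual implementation):

```python
from enum import IntEnum

class ModeTypeDTSketch(IntEnum):
    """Mirror of the fs.h DT_* values stored in bits 12-15 of st_mode."""
    UNKNOWN = 0
    FIFO = 1
    CHR = 2
    DIR = 4
    BLK = 6
    REG = 8
    LNK = 10
    SOCK = 12
    WHT = 14

def dt_from_mode(i_mode):
    # The file type lives in bits 12-15 of the mode: (i_mode >> 12) & 15
    return ModeTypeDTSketch((i_mode >> 12) & 15)
```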
common.const.mode_to_unix(x)[source]

Return the UNIX version of the mode returned by stat.

common.const.ECRYPTFS_SIZE_THRESHOLDS = (84, 104, 124, 148, 168, 188, 212, 232, 252, -inf)

Each index i of this tuple corresponds to file name lengths where 16*i is the lower bound and (16*(i+1))-1 is the upper bound. Any name 16*9=144 characters or longer is invalid.

common.const.ECRYPTFS_FILE_HEADER_BYTES = 8192

Number of bytes used by eCryptfs for its header

common.const.USED_FIELDS = ('_c_num_child_dirs', '_c_num_child_files', '_c_mode', '_c_depth', '_c_type')

Fields used to calculate centroids

common.const.USED_TO_DB = {'_c_num_child_files': 'num_files', '_c_depth': 'depth', '_c_size': 'size', '_c_ctime': 'ctime', '_c_type': 'type', '_c_mode': 'perms', '_c_num_child_dirs': 'num_dirs'}

Mapping of USED_FIELDS to database column names. USED_TO_DB doesn’t have the ttl_files field because it’s not explicitly stored in the graph object.

graph: Customized Digraph Object
sync: Easy Mutex Creation

Context manager for easily using a pymemcache mutex.

The acquire_lock context manager makes it easy to use pymemcache (which uses memcached) to create a mutex for a certain portion of code. Of course, this requires the pymemcache library to be installed, which in turn requires memcached to be installed.

exception common.sync.LockUnavailable[source]

Raised when a cached lock is already in use.

common.sync.acquire_lock(lock_id, wait=0, max_retries=0)[source]

Acquire a lock on the given lock ID, or raise an exception.

This context manager can be used as a mutex by doing something like the following:

>>> from time import sleep
>>> job_done = False
>>> while not job_done:
...     try:
...         with acquire_lock('some id'):
...             sensitive_function()
...             job_done = True
...     except LockUnavailable:
...         # Sleep for a couple seconds while the other code runs and
...         # hopefully completes
...         sleep(2)

In the above example, sensitive_function() should only be run if no other code is also running it. A more concise way of writing the above example would be to use the other parameters, like this:

>>> with acquire_lock('some id', wait=2):
...     sensitive_function()
Parameters:
  • lock_id (str or bytes) – The ID for this lock. See pymemcache‘s documentation on key constraints for more info.
  • wait (int) – Indicates how many seconds after failing to acquire the lock to wait (sleep) before retrying. When set to 0 (default), will immediately raise a LockUnavailable exception.
  • max_retries (int) – Maximum number of times to retry to acquire the lock before raising a LockUnavailable exception. When set to 0 (default), will always retry. Has essentially no effect if wait is 0.
Raises:

LockUnavailable – when a lock with the same ID already exists and wait is set to 0.

util: Various Utilities for dbling
common.util.validate_crx_id(crx_id)[source]

Validate the given CRX ID.

Check that the Chrome extension ID has three important properties:

  1. It must be a string
  2. It must have alpha characters only (strictly speaking, these should be lowercase and only from a-p, but checking for this is a little overboard)
  3. It must be 32 characters long
Parameters:crx_id (str) – The ID to validate.
Raises:MalformedExtId – When the ID doesn’t meet the criteria listed above.
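The three checks listed above amount to a few lines; a minimal sketch (validate_crx_id_sketch is a hypothetical name):

```python
class MalformedExtId(Exception):
    """Raised when an ID doesn't have the correct form."""

def validate_crx_id_sketch(crx_id):
    # The three documented checks: a str, alphabetic characters only,
    # and exactly 32 characters long
    if not (isinstance(crx_id, str) and len(crx_id) == 32 and crx_id.isalpha()):
        raise MalformedExtId(crx_id)
```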
exception common.util.MalformedExtId[source]

Raised when an ID doesn’t have the correct form.

common.util.get_crx_version(crx_path)[source]

Extract and return the version number from the CRX’s path.

The return value from the download() function is in the form: <extension ID>_<version>.crx.

The <version> part of that format is “x_y_z” for version “x.y.z”. To convert to the latter, we need to 1) get the basename of the path, 2) take off the trailing “.crx”, 3) remove the extension ID and the ‘_’ after it, and 4) replace all occurrences of ‘_’ with ‘.’.

Parameters:crx_path (str) – The full path to the downloaded CRX, as returned by the download() function.
Returns:The version number in the form “x.y.z”.
Return type:str
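The four steps described above map one-to-one onto code; a minimal sketch (get_crx_version_sketch is a hypothetical name):

```python
from os import path

def get_crx_version_sketch(crx_path):
    name = path.basename(crx_path)   # 1) basename, e.g. "<id>_1_2_3.crx"
    name = name[:-len('.crx')]       # 2) strip the trailing ".crx"
    version = name.split('_', 1)[1]  # 3) drop the extension ID and its '_'
    return version.replace('_', '.') # 4) '_' -> '.'
```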
common.util.get_id_version(crx_path)[source]

From the path to a CRX, extract and return the ID and version as strings.

Parameters:crx_path (str) – The full path to the downloaded CRX.
Returns:The ID and version number as a tuple: (id, num)
Return type:tuple(str, str)
common.util.separate_mode_type(mode)[source]

Separate out the values for the mode (permissions) and the file type from the given mode.

Both returned values are integers. The mode is just the permissions (usually displayed in the octal format), and the type corresponds to the standard VFS types:

  • 0: Unknown file
  • 1: Regular file
  • 2: Directory
  • 3: Character device
  • 4: Block device
  • 5: Named pipe (identified by the Python stat library as a FIFO)
  • 6: Socket
  • 7: Symbolic link
Parameters:mode (int) – The mode value to be separated.
Returns:Tuple of ints in the form: (mode, type)
Return type:tuple(int, int)
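Using the stat module's mask helpers and the MODE_UNIX mapping documented above, the separation can be sketched as follows (separate_mode_type_sketch is a hypothetical name):

```python
import stat

# Same mapping as common.const.MODE_UNIX
MODE_UNIX = {32768: 1, 16384: 2, 24576: 4, 40960: 7, 4096: 5, 8192: 3, 49152: 6}

def separate_mode_type_sketch(mode):
    perms = stat.S_IMODE(mode)                   # just the permission bits
    ftype = MODE_UNIX.get(stat.S_IFMT(mode), 0)  # 0 = unknown file
    return perms, ftype
```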
common.util.calc_chrome_version(last_version, release_date, release_period=10)[source]

Calculate the most likely version number of Chrome.

The calculation is based on the last known version number and its release date, based on the number of weeks (release_period) it usually takes to release the next major version. A list of releases and their dates is available on Wikipedia.

Parameters:
  • last_version (str) – Last known version number, e.g. “43.0”. Should only have the major and minor version numbers and exclude the build and patch numbers.
  • release_date (list) – Release date of the last known version number. Must be a list of three integers: [YYYY, MM, DD].
  • release_period (int) – Typical number of weeks between releases.
Returns:

The most likely current version number of Chrome in the same format required of the last_version parameter.

Return type:

str
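The calculation described above (major-version bumps per release_period weeks elapsed) can be sketched like this; calc_chrome_version_sketch is a hypothetical name, and the today parameter is a sketch-only convenience for testability:

```python
from datetime import date

def calc_chrome_version_sketch(last_version, release_date, release_period=10, today=None):
    # Assumption: one major-version bump per release_period weeks since release_date
    major, minor = last_version.split('.')
    today = today or date.today()
    weeks = (today - date(*release_date)).days // 7
    return '{}.{}'.format(int(major) + weeks // release_period, minor)
```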

common.util.make_download_headers()[source]

Return a dict of headers to use when downloading a CRX.

Returns:Set of HTTP headers as a dict, where the key is the header type and the value is the header content.
Return type:dict[str, str]
common.util.dt_dict_now()[source]

Return a dict of the current time.

Returns:A dict with the following keys:
  • year
  • month
  • day
  • hour
  • minute
  • second
  • microsecond
Return type:dict[str, int]
common.util.dict_to_dt(dt_dict)[source]

Reverse of dt_dict_now().

Parameters:dt_dict (dict) – A dict (such as dt_dict_now() returns) that correspond with the keyword parameters of the datetime constructor.
Returns:A datetime object.
Return type:datetime.datetime
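Because the dict's keys line up with the keyword parameters of the datetime constructor, the round trip is short; a minimal sketch (the *_sketch names are hypothetical):

```python
from datetime import datetime

def dt_dict_now_sketch():
    now = datetime.now()
    keys = ('year', 'month', 'day', 'hour', 'minute', 'second', 'microsecond')
    return {k: getattr(now, k) for k in keys}

def dict_to_dt_sketch(dt_dict):
    # The keys match datetime's keyword parameters, so this round-trips
    return datetime(**dt_dict)
```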
class common.util.MunchyMunch(f)[source]

Wrapper class to munchify crx_obj parameters.

This wrapper converts either the kwarg crx_obj or the first positional argument (tests in that order) to a Munch object, which allows us to refer to keys in the Munch dictionary as if they were attributes. See the docs on the munch library for more information.

Example usage:

>>> @MunchyMunch
... def test_func(crx_obj):
...     # crx_obj will be converted to a Munch
...     print(crx_obj.id)
Parameters:f – The function to wrap.
common.util.byte_len(s)[source]

Return the length of s in number of bytes.

Parameters:s (str or bytes) – The string or bytes object to test.
Returns:The length of s in bytes.
Return type:int
Raises:TypeError – If s is not a str or bytes.
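A plausible sketch of this helper (byte_len_sketch is a hypothetical name; the choice of UTF-8 as the encoding is an assumption):

```python
def byte_len_sketch(s):
    if isinstance(s, bytes):
        return len(s)
    if isinstance(s, str):
        return len(s.encode('utf-8'))  # assumption: lengths measured in UTF-8
    raise TypeError('expected str or bytes, got {}'.format(type(s).__name__))
```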
common.util.ttl_files_in_dir(dir_path, pat='.')[source]

Count the files in the given directory.

Will count all files except . and .., including any files whose names begin with . (using the -A option of ls).

Parameters:
  • dir_path (str) – Path to the directory.
  • pat (str) – Pattern the files should match when searching. This is passed to grep, so when the default remains (.), it will match all files and thus not filter out anything.
Returns:

The number of files in the directory.

Return type:

int

Raises:

NotADirectoryError – When dir_path is not a directory.

common.util.chunkify(iterable, chunk_size)[source]

Split an iterable into smaller iterables of a certain size (chunk size).

For example, say you have a list that, for whatever reason, you don’t want to process all at once. You can use chunkify() to easily split up the list to whatever size of chunk you want. Here’s an example of what this might look like:

>>> my_list = range(1, 6)
>>> for sublist in chunkify(my_list, 2):
...     for i in sublist:
...         print(i, end=', ')
...     print()

The output of the above code would be:

1, 2,
3, 4,
5,

Idea borrowed from http://code.activestate.com/recipes/303279-getting-items-in-batches/.

Parameters:
  • iterable – The iterable to be split into chunks.
  • chunk_size (int) – Size of each chunk. See above for an example.
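One common way to implement the behavior shown above uses itertools.islice; a minimal sketch (chunkify_sketch is a hypothetical name, and it yields lists, whereas the real function may yield other iterables):

```python
from itertools import islice

def chunkify_sketch(iterable, chunk_size):
    it = iter(iterable)
    while True:
        # Pull up to chunk_size items; an empty chunk means we're done
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk
```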

Secret Files

The secret directory is used to store sensitive information specific to an installation of dbling. The files in this directory have been excluded from the repository for obvious reasons, but should include a creds.py file, which should have a form such as the one displayed below.

import yaml
from os import uname
from os.path import join, dirname

with open(join(dirname(__file__), 'passes.yml')) as passes:
    passwd = yaml.safe_load(passes)  # safe_load avoids constructing arbitrary objects

crx_save_path = ''  # Path where the CRXs should be saved when downloaded
db_info = {  # Database access information
    'uri': '',  # Full URI for accessing the DB. See SQLAlchemy docs for more info.
    'user': '',
    'pass': '',
    'nodes': ['host1', ],  # Host names of machines that should use 127.0.0.1 instead of the value for full_url below
    'full_url': '1.2.3.4',  # IP address of host with the database (usually dbling master)
}
# Login info for workers to access the celery server on the dbling master
celery_login = {'user': 'sample_username', 'pass': 'secure_password', 'port': 5672}
admin_emails = (  # Names and email addresses of admins that should receive emails from Celery
    ('Admin Name', 'admin_email@example.com'),
)
sender_email_addr = 'ubuntu@{}'.format(uname().nodename)  # Email address Celery should use when sending admin emails

The template above references another file that should be in the secret directory, passes.yml. This should have a form as shown below. Without this file, the Ansible playbooks will not function properly.

---

mysql_rt_pass: ''  # MySQL root user password
mysql_dbling_user: 'dbling_dbusr'  # MySQL regular user name
mysql_dbling_pass: ''  # MySQL regular user password

rabbit_user: 'dbling_crawler'  # RabbitMQ user name
rabbit_pass: ''  # RabbitMQ user password