Dye Score


Utilities to build the dye-score metric from OpenWPM javascript call data.

Quickstart

Install dye score: conda install dye-score -c conda-forge or pip install dye-score

Review usage notebook: https://github.com/mozilla/dye-score/blob/master/docs/usage.ipynb

Documentation Contents

Installation

Prerequisites

You will need Apache Spark including PySpark available on your system.

Stable release

To install Dye Score, run this command in your terminal:

$ pip install dye_score

From sources

The sources for Dye Score can be downloaded from the Github repo.

You can either clone the public repository:

$ git clone git://github.com/mozilla/dye_score

Or download the tarball:

$ curl -OL https://github.com/mozilla/dye_score/tarball/master

Once you have a copy of the source, you can install it with:

$ python setup.py install

Usage

This notebook walks through using the Dye Score library and methodology to score scripts.

The input data is generated by OpenWPM. A dataset that has been used with Dye Score is available at github.com/mozilla/overscripted.

This notebook was run on a small sample.

Dye Score expects a Spark context to be available for the initial data processing steps.

Additionally, set up a Dask Client however you choose. The cell below was generated by Dask’s JupyterLab extension.

Note: the warning below is a known issue tracked by the Dask team (https://github.com/dask/distributed/issues/2564).

[1]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:32829")
client
/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/dask/config.py:168: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  data = yaml.load(f.read()) or {}
/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/distributed/config.py:20: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  defaults = yaml.load(f)
[1]:

Client

Cluster

  • Workers: 4
  • Cores: 12
  • Memory: 33.35 GB
[2]:
import dask.dataframe as dd
import numpy as np

from dye_score import DyeScore
[3]:
ds = DyeScore('config.yaml', print_config=False)
[4]:
ds.validate_input_data()
[4]:
True
[5]:
df = ds.get_input_df()
df.head()
[5]:
top_level_url script_url func_name symbol
0 https://7ero.org/ https://forsiteid6346.tech/convert/scripts/cre... b.exec CanvasRenderingContext2D.fillStyle
1 https://www.stevinsonhyundai.com/ https://tag.contactatonce.com/le_secure_storag... r window.Storage.setItem
2 https://www.thecircle.com/us/ https://www.thecircle.com/k3/ruxitagentjs_ICA2... fc window.document.cookie
3 https://www.jcpportraits.com/ https://cdn.optimizely.com/js/8447592883.js be/< window.Storage.length
4 https://www.technik-profis.de/ https://cdn.optimizely.com/js/8323142798.js t.getUserAgent window.navigator.userAgent
[6]:
print(f'This sample is {len(df):,} rows')
This sample is 2,312,697 rows

Data Preparation

[7]:
%time ds.build_raw_snippet_df()
                       top_level_url  \
0                  https://7ero.org/
1  https://www.stevinsonhyundai.com/
2      https://www.thecircle.com/us/
3      https://www.jcpportraits.com/
4     https://www.technik-profis.de/

                                          script_url       func_name  \
0  https://forsiteid6346.tech/convert/scripts/cre...          b.exec
1  https://tag.contactatonce.com/le_secure_storag...               r
2  https://www.thecircle.com/k3/ruxitagentjs_ICA2...              fc
3        https://cdn.optimizely.com/js/8447592883.js            be/<
4        https://cdn.optimizely.com/js/8323142798.js  t.getUserAgent

                               symbol  \
0  CanvasRenderingContext2D.fillStyle
1              window.Storage.setItem
2              window.document.cookie
3               window.Storage.length
4          window.navigator.userAgent

                                         raw_snippet  called
0  forsiteid6346.tech||createjs-2015.11.26.min.js...       1
1  tag.contactatonce.com||storage.secure.min.html||r       1
2  www.thecircle.com||ruxitagentjs_ICA27SVfhjoqrx...       1
3            cdn.optimizely.com||8447592883.js||be/<       1
4  cdn.optimizely.com||8323142798.js||t.getUserAgent       1
CPU times: user 121 ms, sys: 43 ms, total: 164 ms
Wall time: 39.7 s
[7]:
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_data/raw_snippet_call_df.parquet'
[8]:
%time ds.build_snippet_map()
                                         raw_snippet              snippet
0  forsiteid6346.tech||createjs-2015.11.26.min.js...   792826184637634903
1  tag.contactatonce.com||storage.secure.min.html||r -3182365903651065472
2  www.thecircle.com||ruxitagentjs_ICA27SVfhjoqrx... -9027005229756292155
3            cdn.optimizely.com||8447592883.js||be/<  2248811367515630966
4  cdn.optimizely.com||8323142798.js||t.getUserAgent -6265856453346281252
CPU times: user 81 ms, sys: 10.6 ms, total: 91.6 ms
Wall time: 1.71 s
[8]:
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_data/snippet_lookup.parquet'

The next two methods require your Spark session to be available to pass in.

[9]:
%time ds.build_snippets(spark)
/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
Dataset has 216 unique symbols
<xarray.DataArray (snippet: 231057, symbol: 216)>
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
Coordinates:
  * snippet  (snippet) object '-1000043381057326421' ... '999736522860943363'
  * symbol   (symbol) object 'AnalyserNode.channelCount' ... 'window.sessionStorage'
CPU times: user 13.6 s, sys: 859 ms, total: 14.5 s
Wall time: 4min
[9]:
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_data/snippets.zarr'
[12]:
%time ds.build_snippet_snippet_dyeing_map(spark)
CPU times: user 39.1 ms, sys: 51.7 ms, total: 90.8 ms
Wall time: 9.8 s
[12]:
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_data/snippet_dyeing_map.parquet'

There are no further functions that depend on Spark.

Dyeing

Building the list of dye snippets is up to the user. Here we show an example using a keyword search for “fingerprint”.

[4]:
snippet_dyeing_map_file = ds.dye_score_data_file('snippet_dyeing_map')
snippet_data = dd.read_parquet(snippet_dyeing_map_file, engine='pyarrow')
snippet_data.head()
[4]:
top_level_url script_url func_name snippet clean_script
0 http://narvalife.ucoz.net/ https://usocial.pro/usocial/fingerprint2.min.js e.prototype.getNavigatorPlatform 4996125033026346492 usocial.pro/usocial/fingerprint2.min.js
1 https://sletaem.by/ https://sletaem.by/ updateTimer -6846198680163094774 sletaem.by/
2 http://realcoco.com/ http://fs.bizspring.net/fsn/bstrk.1.js _trkdp_getCookie 2578583411096044764 fs.bizspring.net/fsn/bstrk.1.js
3 https://www.trendydiscount.shop/ https://www.google-analytics.com/analytics.js zc 1695113790766404014 www.google-analytics.com/analytics.js
4 https://www.liveaquaria.com/ https://www.youtube.com/yts/jsbin/player-vflYg... hE 2066756695033721030 www.youtube.com/yts/jsbin/player-vflYgf3QU/en_...
[4]:
key = 'fingerprint'
filename_suffix = f'{key}_keyword'
thresholds = [0.15, 0.2, 0.23, 0.24, 0.25, 0.26, 0.3, 0.35]
[6]:
script_snippets = snippet_data[snippet_data.clean_script.str.contains(key, case=False)].snippet.unique().astype(str)
funcname_snippets = snippet_data[snippet_data.func_name.str.contains(key, case=False)].snippet.unique().astype(str)
dye_snippets = np.unique(np.append(script_snippets, funcname_snippets))

With the dye snippets in hand we can now use the DyeScore library to compute the dye scores for a range of thresholds.

[8]:
%time ds.compute_distances_for_dye_snippets(dye_snippets=dye_snippets, filename_suffix=filename_suffix)
/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/dask/array/blockwise.py:204: UserWarning: The da.atop function has moved to da.blockwise
  warnings.warn("The da.atop function has moved to da.blockwise")
<xarray.DataArray 'data' (snippet: 231057, dye_snippet: 553)>
dask.array<shape=(231057, 553), dtype=float64, chunksize=(10000, 100)>
Coordinates:
  * snippet      (snippet) object '-1000043381057326421' ... '999736522860943363'
  * dye_snippet  (dye_snippet) object '-1006661115172174629' ... '917589267078160730'
CPU times: user 453 ms, sys: 120 ms, total: 573 ms
Wall time: 1min 6s
[8]:
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_dye_distances_from_fingerprint_keyword'
[9]:
%time ds.compute_snippets_scores_for_thresholds(thresholds, filename_suffix=filename_suffix)
CPU times: user 1.55 s, sys: 253 ms, total: 1.8 s
Wall time: 13.1 s
[9]:
['/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.15',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.2',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.23',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.24',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.25',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.26',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.3',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.35']
[5]:
%time ds.compute_dye_scores_for_thresholds(thresholds, filename_suffix=filename_suffix)
CPU times: user 10.3 s, sys: 415 ms, total: 10.7 s
Wall time: 57 s
[5]:
['/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.15.csv.gz',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.2.csv.gz',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.23.csv.gz',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.24.csv.gz',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.25.csv.gz',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.26.csv.gz',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.3.csv.gz',
 '/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.35.csv.gz']

Evaluate scores

We now manually review the dye scores against the input dye list in order to select the best distance threshold.

The review process needs a list of clean scripts (clean_script values) to compare with the dye score list to produce the following plot. How this list is produced depends on how the dye snippets list was prepared.

[10]:
import pandas as pd

from bokeh.io import export_png, show
from bokeh.layouts import gridplot
from dye_score.plotting import get_pr_plot
from IPython.display import Image
[8]:
snippet_data.head()
[8]:
top_level_url script_url func_name snippet clean_script
0 http://narvalife.ucoz.net/ https://usocial.pro/usocial/fingerprint2.min.js e.prototype.getNavigatorPlatform 4996125033026346492 usocial.pro/usocial/fingerprint2.min.js
1 https://sletaem.by/ https://sletaem.by/ updateTimer -6846198680163094774 sletaem.by/
2 http://realcoco.com/ http://fs.bizspring.net/fsn/bstrk.1.js _trkdp_getCookie 2578583411096044764 fs.bizspring.net/fsn/bstrk.1.js
3 https://www.trendydiscount.shop/ https://www.google-analytics.com/analytics.js zc 1695113790766404014 www.google-analytics.com/analytics.js
4 https://www.liveaquaria.com/ https://www.youtube.com/yts/jsbin/player-vflYg... hE 2066756695033721030 www.youtube.com/yts/jsbin/player-vflYgf3QU/en_...
[9]:
compare_list = snippet_data[snippet_data.snippet.isin(dye_snippets)].clean_script.unique().compute()
compare_list.head()
[9]:
0              usocial.pro/usocial/fingerprint2.min.js
1    script.hotjar.com/modules-ab5ba0ccf53ded68dfc9...
2                   www.convertthepdf.co/js/landing.js
3     track.adabra.com/sbn_fingerprint.v1.16.47.min.js
4    www.bestwestern.com.br/modules/mod_rewards_but...
Name: clean_script, dtype: object
[10]:
%time plot_df_paths = ds.build_plot_data_for_thresholds(compare_list, thresholds, filename_suffix=filename_suffix)
CPU times: user 39.7 s, sys: 148 ms, total: 39.9 s
Wall time: 39.9 s
[7]:
plot_df_paths[0]
[7]:
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_plot_data_from_fingerprint_keyword_0.15.csv.gz'
[11]:
plots = []
plot_opts = dict(tools='', toolbar_location=None, width=300, height=200)
for threshold, pr_df_path in zip(thresholds, plot_df_paths):
    pr_df = pd.read_csv(pr_df_path)
    plots.append(get_pr_plot(pr_df,  title=f'{threshold}', plot_opts=plot_opts))
Image(export_png(gridplot(plots, ncols=3, toolbar_location=None)))
[11]:
(Image: grid of precision/recall plots, one per threshold)

The remaining analysis is up to the user, based on their preferred distance threshold.


API

DyeScore

class dye_score.dye_score.DyeScore(config_file_path, validate_config=True, print_config=False, sc=None)[source]
Parameters:
  • config_file_path (str) –

    The path of your config file that is used for dye score to interact with your environment. Holds references to file paths and private data such as AWS API keys. Expects a YAML file with the following keys:

    • INPUT_PARQUET_LOCATION - the location of the raw or sampled OpenWPM input parquet folder
    • DYESCORE_DATA_DIR - location where you would like dye score to store data assets
    • DYESCORE_RESULTS_DIR - location where you would like dye score to store results assets
    • USE_AWS - default False - set true if data store is AWS
    • AWS_ACCESS_KEY_ID - optional - for storing and retrieving data on AWS
    • AWS_SECRET_ACCESS_KEY - optional - for storing and retrieving data on AWS
    • SPARK_S3_PROTOCOL - default ‘s3’ - only ‘s3’ or ‘s3a’ are supported
    • PARQUET_ENGINE - default ‘pyarrow’ - pyarrow or fastparquet

    Locations can be a local file path or a bucket.

  • validate_config (bool, optional) – Run DyeScore.validate_config method. Defaults to True.
  • print_config (bool, optional) – Print out config once saved. Defaults to False.
  • sc (SparkContext, optional) – If accessing s3 via s3a, pass spark context to set aws credentials
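For illustration, a minimal config.yaml using the keys above might look like the following (all paths are hypothetical placeholders; adapt them to your environment):

```yaml
# Hypothetical config.yaml - only the location keys are required,
# the remaining keys fall back to their documented defaults.
INPUT_PARQUET_LOCATION: /data/openwpm/input.parquet
DYESCORE_DATA_DIR: /data/dyescore/data
DYESCORE_RESULTS_DIR: /data/dyescore/results
USE_AWS: false
PARQUET_ENGINE: pyarrow
```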
build_plot_data_for_thresholds(compare_list, thresholds, leaky_threshold, filename_suffix='dye_snippets', override=False)[source]

Builds a dataframe for evaluation

Contains the recall compared to the compare_list for scripts under the threshold.

Parameters:
  • compare_list (list) – List of dye scripts to compare for recall.
  • thresholds (list) – List of distances to compute snippet scores for e.g. [0.23, 0.24, 0.25]
  • leaky_threshold (float) – Remove all snippets which dye more than this fraction of all other snippets.
  • filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets
  • override (bool, optional) – Override output files. Defaults to False.
Returns:

list. Paths results were written to

build_raw_snippet_df(override=False, snippet_func=None)[source]

Builds raw_snippets from input data

Default snippet function is script_url.netloc||script_url.path_end||func_name. If script_url is missing, location is used.

Parameters:
  • override (bool) – True to replace any existing outputs
  • snippet_func (function) – Function that accepts row of data as input and computes the snippet value. Default provided.
Returns:

str. The file path where output is saved
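As a sketch of the snippet_func hook, a hypothetical custom function (assuming each row exposes script_url, location, and func_name fields, as in the input data) could coarsen snippets to domain plus function name:

```python
from urllib.parse import urlparse

def domain_snippet(row):
    # Hypothetical snippet_func: domain||func_name, ignoring the script path.
    # Falls back to location when script_url is missing, mirroring the default.
    url = row['script_url'] or row['location']
    return f"{urlparse(url).netloc}||{row['func_name']}"

# ds.build_raw_snippet_df(snippet_func=domain_snippet)
```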

build_snippet_map(override=False)[source]

Builds snippet ids and saves map of ids to raw snippets

xarray cannot handle arbitrary-length string indexes, so we need to build a set of unique ids to reference snippets. This method creates the ids and saves the map of ids to raw snippets.

Parameters:override (bool, optional) – True to replace any existing outputs. Defaults to False
Returns:str. The file path where output is saved
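The id scheme is internal to the library, but the idea can be sketched as a stable mapping from arbitrary-length strings to 64-bit integers (illustrative only, not the library's actual implementation):

```python
import hashlib

def snippet_id(raw_snippet):
    # Map an arbitrary-length raw snippet string to a stable signed 64-bit id,
    # which xarray can use as an index (sketch, not the library's scheme).
    digest = hashlib.sha256(raw_snippet.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], byteorder='big', signed=True)
```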
build_snippet_snippet_dyeing_map(spark, override=False)[source]

Build file used to join snippets to data for dyeing.

Adds clean_script field to dataset. Saves parquet file with:
  • snippet - the int version, not raw_snippet
  • document_url
  • script_url
  • clean_script
Parameters:
  • spark (pyspark.sql.session.SparkSession) – spark instance
  • override (bool, optional) – True to replace any existing outputs. Defaults to False
Returns:

str. The file path where output is saved

build_snippets(spark, na_value=0, override=False)[source]

Builds row-normalized snippet dataset

  • Dimensions are n snippets x s unique symbols in dataset.
  • Data is output in zarr format with processing by spark, dask, and xarray.
  • Creates an intermediate tmp file when converting from spark to dask.
  • Slow running operation - follow spark and dask status to see progress

We use Spark here because Dask cannot compute a pivot table memory-efficiently. This is the only function we need the Spark context for.

Parameters:
  • spark (pyspark.sql.session.SparkSession) – spark instance
  • na_value (int, optional) – The value to fill vector where there’s no call. Defaults to 0.
  • override (bool, optional) – True to replace any existing outputs. Defaults to False
Returns:

str. The file path where output is saved
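The pivot and row-normalization can be illustrated in miniature with pandas (a toy stand-in for the Spark/Dask pipeline; treating normalization as division by the row sum is an assumption here):

```python
import pandas as pd

# Toy call data: one row per (snippet, symbol) call
calls = pd.DataFrame({
    'snippet': [1, 1, 2],
    'symbol': ['document.cookie', 'navigator.userAgent', 'document.cookie'],
    'called': [1, 1, 1],
})

# Pivot to an n-snippets x s-symbols matrix, filling missing calls with 0
counts = calls.pivot_table(index='snippet', columns='symbol',
                           values='called', aggfunc='sum', fill_value=0)

# Row-normalize so each snippet's vector sums to 1
vectors = counts.div(counts.sum(axis=1), axis=0)
```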

compute_distances_for_dye_snippets(dye_snippets, filename_suffix='dye_snippets', snippet_chunksize=1000, dye_snippet_chunksize=1000, distance_function='chebyshev', override=False, **kwargs)[source]

Computes all pairwise distances from dye snippets to all other snippets.

  • Expects snippets file to exist.
  • Writes results to zarr with name snippets_dye_distances_from_{filename_suffix}
  • This is a long-running function - see dask for progress
Parameters:
  • dye_snippets (np.array) – Numpy array of snippets to be dyed. Must be a subset of snippets index.
  • filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets
  • snippet_chunksize (int, optional) – Set the chunk size for the snippet xarray input, along the snippet dimension (not the symbol dimension). Defaults to 1000.
  • dye_snippet_chunksize (int, optional) – Set the chunk size for dye snippet xarray input, along the snippet dimension (not the symbol dimension). Defaults to 1000.
  • distance_function (string or function, optional) – Provide a function to compute distances or a string to use a built-in distance function. See dye_score.distances.py for template for example distance functions. Default is "chebyshev". Alternatives are cosine, jaccard, cityblock.
  • override (bool, optional) – Override output files. Defaults to False.
  • kwargs – kwargs to pass to distance function if required e.g. mahalanobis requires vi
Returns:

str. Path results were written to
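For instance, the default Chebyshev distance between two snippet vectors is the largest per-symbol difference (toy vectors below):

```python
import numpy as np

a = np.array([0.5, 0.5, 0.0])  # toy snippet vector
b = np.array([0.2, 0.4, 0.4])  # toy dye snippet vector

# Chebyshev distance: maximum absolute difference across symbols
distance = np.abs(a - b).max()
```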

compute_dye_scores_for_thresholds(thresholds, leaky_threshold, filename_suffix='dye_snippets', override=False)[source]

Get dye scores for a range of distance thresholds.

  • Uses results from compute_snippets_scores_for_thresholds
  • Writes results to gzipped csv files with name dye_score_from_{filename_suffix}_{threshold}.csv.gz
Parameters:
  • thresholds (list) – List of distances to compute snippet scores for e.g. [0.23, 0.24, 0.25]
  • leaky_threshold (float) – Remove all snippets which dye more than this fraction of all other snippets.
  • filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets
  • override (bool, optional) – Override output files. Defaults to False.
Returns:

list. Paths results were written to

compute_leaky_snippet_data(thresholds_to_test, filename_suffix='dye_snippets', override=False)[source]

Compute leaky percentages for a range of thresholds. This enables the user to select the “leaky threshold” for following rounds.

  • Writes results to parquet files with name leak_test_{filename_suffix}_{threshold}
Parameters:
  • thresholds_to_test (list) – List of distances at which to compute the percentage of snippets dyed, e.g. [0.23, 0.24, 0.25]
  • filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets
  • override (bool, optional) – Override output files. Defaults to False.
Returns:

list. Paths results were written to

compute_snippets_scores_for_thresholds(thresholds, leaky_threshold, filename_suffix='dye_snippets', override=False)[source]

Get score for snippets for a range of distance thresholds.

  • Writes results to parquet files with name snippets_score_from_{filename_suffix}_{threshold}
Parameters:
  • thresholds (list) – List of distances to compute snippet scores for e.g. [0.23, 0.24, 0.25]
  • leaky_threshold (float) – Remove all snippets which dye more than this fraction of all other snippets.
  • filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets
  • override (bool, optional) – Override output files. Defaults to False.
Returns:

list. Paths results were written to
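The thresholding step can be sketched with NumPy (a simplified illustration of "dyeing", not the library's exact scoring):

```python
import numpy as np

# Toy distances: rows are snippets, columns are dye snippets
distances = np.array([[0.10, 0.50],
                      [0.30, 0.20],
                      [0.90, 0.80]])

threshold = 0.25
# A snippet is dyed at this threshold if it lies within `threshold`
# of at least one dye snippet
dyed = (distances < threshold).any(axis=1)
```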

config(option)[source]

Method to retrieve config values

Parameters:option (str) – The desired config option key
Returns:The config option value
dye_score_data_file(filename)[source]

Helper function to return standardized filename.

DyeScore class holds a dictionary to standardize the file names that DyeScore saves. This method looks up filenames by their short name.

Parameters:filename (str) – data file name
Returns:str. The path where the data file should reside
file_in_validation(inpath)[source]

Check path exists.

Raises ValueError if not. Used for input files, as these must exist to proceed.

Parameters:inpath (str) – Path of input file

file_out_validation(outpath, override)[source]

Check whether the path exists. If it does and override is False, raises ValueError; otherwise removes the existing file.

Parameters:
  • outpath (str) – Path of output file.
  • override (bool) – Whether to raise an error or remove existing data.

from_parquet_opts

Options used when saving to parquet.

get_input_df(columns=None)[source]

Helper function to return the input dataframe.

Parameters:columns (list, optional) – List of columns to retrieve. If None, all columns are returned.
Returns:dask.DataFrame. Input dataframe with subset of columns requested.
s3_storage_options

s3 storage options built from config

Returns:dict. if USE_AWS is True returns s3 options as dict, else None.
to_parquet_opts

Options used when saving to parquet.

validate_config()[source]

Validate the config data. Currently just checks that values are correct for aws.

Raises AssertionError if values are incorrect.

validate_input_data()[source]

Checks for expected columns and types in input data.

Plotting utils

The following plotting utils can be used directly, or may be useful as template code for reviewing your results.

dye_score.plotting.get_plots_for_thresholds(ds, thresholds, leaky_threshold, n_scripts_range, filename_suffix='dye_snippets', y_range=(0, 1), recall_color='black', n_scripts_color='firebrick', **extra_plot_opts)[source]
dye_score.plotting.get_pr_plot(pr_df, title, n_scripts_range, y_range=(0, 1), recall_color='black', n_scripts_color='firebrick', **extra_plot_opts)[source]

Example code for plotting dye score threshold plots

dye_score.plotting.get_threshold_summary_plot(ds)[source]
dye_score.plotting.plot_hist(title, hist, edges, y_axis_type='linear', bottom=0)[source]
dye_score.plotting.plot_key_leaky(percent_to_dye, key, y_axis_type='linear', bottom=0, bins=40)[source]

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions

Report Bugs

Report bugs at https://github.com/birdsarah/dye_score/issues.

If you are reporting a bug, please include:

  • Your operating system name and version.
  • Any details about your local setup that might be helpful in troubleshooting.
  • Detailed steps to reproduce the bug.
Fix Bugs

Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.

Implement Features

Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.

Write Documentation

Dye Score could always use more documentation, whether as part of the official Dye Score docs, in docstrings, or even on the web in blog posts, articles, and such.

Submit Feedback

The best way to send feedback is to file an issue at https://github.com/birdsarah/dye_score/issues.

If you are proposing a feature:

  • Explain in detail how it would work.
  • Keep the scope as narrow as possible, to make it easier to implement.
  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Get Started!

Ready to contribute? Here’s how to set up dye_score for local development.

  1. Fork the dye_score repo on GitHub.

  2. Clone your fork locally:

    $ git clone git@github.com:your_name_here/dye_score.git
    
  3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:

    $ mkvirtualenv dye_score
    $ cd dye_score/
    $ python setup.py develop
    
  4. Create a branch for local development:

    $ git checkout -b name-of-your-bugfix-or-feature
    

    Now you can make your changes locally.

  5. When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:

    $ flake8 dye_score tests
    $ python setup.py test or py.test
    $ tox
    

    To get flake8 and tox, just pip install them into your virtualenv.

  6. Commit your changes and push your branch to GitHub:

    $ git add .
    $ git commit -m "Your detailed description of your changes."
    $ git push origin name-of-your-bugfix-or-feature
    
  7. Submit a pull request through the GitHub website.

Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

  1. The pull request should include tests.
  2. If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
  3. The pull request should work for Python 2.7, 3.4, 3.5 and 3.6, and for PyPy. Check https://travis-ci.org/birdsarah/dye_score/pull_requests and make sure that the tests pass for all supported Python versions.

Tips

To run a subset of tests:

$ py.test tests.test_dye_score

Deploying

A reminder for the maintainers on how to deploy. Make sure all your changes are committed (including an entry in HISTORY.rst). Then run:

$ bumpversion patch # possible: major / minor / patch
$ git push
$ git push --tags
$ make release

History

  • v0.10.0
    • Add new distance functions and tests (#30)
  • v0.9.3
    • Small patches to leave 0.9 functional (#26)
  • v0.9.0
    • Numerous small improvements (#14)
    • Support fastparquet (#21)
    • fill_na option (#22)
    • Plotting new threshold plots (#24)
    • Add leaky threshold to filenames (#25)
  • v0.2.0 - S3 compatibility, and minor improvements
  • v0.1.0 - Initial Release

Community Participation Guidelines

This repository is governed by Mozilla’s code of conduct and etiquette guidelines. For more details, please read the Mozilla Community Participation Guidelines.

How to Report

For more information on how to report violations of the Community Participation Guidelines, please read our How to Report page.