Welcome to extendedMD’s documentation!

This repository contains a python implementation of the Extended Motif Discovery (EMD) algorithm for motif discovery in time-series.

Installation

PyPI

This packages is available on PyPI and thus can be directly installed with pip:

$ pip install extendedMD

Source

This package can also be installed directly from source:

Extracting motifs

extendedMD.bs

extendedMD.bs.extract_modified_bs_sequence(sax_sequence)

This functions extracts the modified Behaviour Subsequence (BS) sequence list, which is the original sax word sequence where every consecutive pairs of sax words are equal are fused into the same sax word.

Parameters:sax_sequence (list of str) – list of original sax words
Returns:
  • bs_sequence (list of str) - list of modified sax words
  • bs_lengths (list of int) - list of lengths of each modified sax word
extendedMD.bs.generate_bs_pointers(bs_lengths, bs_size)

It generates the pointers (i.e. time indexes) of each modified sax word into the original time-series data

Parameters:
  • bs_lengths (list of str) – list of modified sax words
  • bs_size (int) – window size (in the original time-series) of a single sax word
Returns:

list of pointers to the original time-series

Return type:

list of list of int

extendedMD.bs.get_bs_subsequences_dic_list(ts, bs_seq, bs_pointers, subseq_size)

This function extracts a list with all the BS subsequences with fixed size from a BS sequence

Parameters:
  • ts (list of float) – original 1-d
  • bs_seq (list of str) – list of modified sax words (i.e. BS sequence)
  • bs_pointers (list of list of int) – list of pointers to the original time-series
  • subseq_size (int) – number of sax words in a single BS subsequence
Returns:

list of dictionaries where each dic represents a single BS subsequence. The dic has 4 entries:

  • pattern - the list of sax words related to that BS subsequence
  • pointers - the pointers list to the original time-series related to that subsequence
  • ts - time-series subsequence related to that BS subsequence (numpy array!)
  • bs_position - list of indexes of the the bs_seq list related to that BS subsequence

Return type:

list of dic

extendedMD.dtwdist

extendedMD.dtwdist.add_distance_vec_to_pattern_dic(pattern_dic_list, r=None)

This function returns a new pattern_dic_list where each dic has an extra entry with the vector of distances between the BS sequence related to that dic and all the other BS subsequences in the pattern

Parameters:
  • pattern_dic_list (list of dic) – list of all the dictionaries related to BS sequences in the pattern
  • r (float) – distance upper bound (for DTW distance computation)
Returns:

new_pattern_dic_list

Return type:

list of dic

extendedMD.dtwdist.compute_dtw_dist_mat(ts_list, r=None)

This function computes the pairwise distance matrix of a list of time-series with Dynamic Time Warping distance. It is based on dtaidistance package

Parameters:
  • ts_list (list of 1D array) – list of time-series to compare pairwise
  • r (float) – distance upper bound - if distance is higher than R, then computation stops and the distance is set as inf this parameter serves merely for speeding up computation
Returns:

distance matrix

Return type:

2D array

extendedMD.dtwdist.compute_dwt_dist_between_ts_and_list(single_ts, ts_list, r)

This function computes the list of DTW distances between a single time-series and a list of time-series.

Parameters:
  • single_ts (1d array) – single time-series
  • ts_list (list of 1d array) – list of time-series
  • r (float) – distance upper bound (for DTW distance computation)
Returns:

list of DTW distances

Return type:

list of float

extendedMD.emd

extendedMD.emd.find_motifs_from_emd(ts, r, win_size, paa_size, alphabet_size, adaptive_break_points=True, z_threshold=0.01)

Returns the full list of motifs from either a multi-dimensional or a 1-dimensional time-series by running the extendedMD algorithm. If the time-series received is multi-dimensional (i.e. a pandas dataframe), then the algorithm starts by applying PCA to reduce it to a single dimension.

Parameters:
  • ts (Union[1d array, DataFrame]) – original time-series
  • r (float) – maximum distance to the center of the motif
  • win_size (int) – size fo the sliding window that generated each sax word
  • paa_size (int) – number of characters in a single sax word
  • alphabet_size (int) – number of unique characters to use in the sax representation
  • adaptive_break_points (bool) – Whether to use a representation with adaptive break-points
  • z_threshold (float) – z_threshold for the znorm method from saxpy
Returns:

  • motif_candidates_dic_list (list of dic) - list of motif dictionaries
  • ts_1d (1d array) - 1-dimensional time-series either resulting from the PCA method or the original 1-dimensional time-series

extendedMD.mdl

extendedMD.mdl.compute_motif_mdl_cost(members_dic_list, bs_len)

This function compute the MDL cost associated to the motif that generated the members in members_dic_list, as proposed by Tanaka et all

Parameters:
  • members_dic_list (list of dic) – list of dictionaries related to the BS subsequences that belong to a motif
  • bs_len (list of int) – list of the lengths of each BS sequence
Returns:

mdl cost

Return type:

float

extendedMD.mdl.compute_segmentation_mdl_cost(seg_len_list)

This function computes the MDL cost associated with the lengths split BS sequence. This is a middle step in the MDL cost computation proposed by Tanaka et all as proposed by Tanaka et all

Parameters:seg_len_list (list of list of int) – list of split BS lengths
Returns:mdl cost
Return type:float
extendedMD.mdl.split_bs_len(members_dic_list, bs_len)

This function splits the BS sequence based on the position of the motif members (i.e. subsequences in members_dic_list). This is a middle step in the MDL cost computation proposed by Tanaka et all

Parameters:
  • members_dic_list (list of dic) – list of dictionaries related to the BS subsequences that belong to a motif
  • bs_len (list of int) – list of the lengths of each BS sequence
Returns:

list of split BS lengths

Return type:

list of list of int

extendedMD.motifs

extendedMD.motifs.build_motif_dic_from_pattern(bs_subseq_dic_list, bs_len, pattern, r)

This function builds the motif dictionary related to the pattern given as input

Parameters:
  • bs_subseq_dic_list (list of dic) – list of dictionaries where each dic represents a single BS subsequence
  • bs_len (list of int) – list of the lengths of each BS sequence
  • pattern (list of str) – list of sax words that made up the pattern
  • r (float) – maximum distance to the center of the motif
Returns:

motif_dic

Return type:

dic

extendedMD.motifs.find_all_motif_candidates(ts, bs_seq, bs_len, bs_pointers, r)

This function finds all the motif candidates of all sizes and saves each motif instance in a dictionary. The result is a list of dictionaries.

Parameters:
  • ts (1d array) – original 1-d time-series
  • bs_seq (list of str) – list of modified sax words (i.e. BS sequence)
  • bs_len (list of int) – list of the lengths of each BS sequence
  • bs_pointers (list of list of int) – list of pointers to the original time-series
  • r (float) – maximum distance to the center of the motif
Returns:

motif_dic_list

Return type:

list of dic

extendedMD.motifs.transform_members_list_into_motif_dic(center_dic, members_dic_list, mdl_cost, mean_dist)

This functions creates the dictionary related to the motif composed by the subsequences in members_dic_list

Parameters:
  • center_dic (dic) – dictionary related to the center of motif
  • members_dic_list (list of dic) – list of dictionaries related to all the subsequences that belong to the motif
  • mdl_cost (float) – MDL cost of the motif (as defined by Tanaka et all)
  • mean_dist (float) – mean distance between all the pair of motif’s members
Returns:

motif_dic

Return type:

dic

extendedMD.patterns

extendedMD.patterns.find_all_pruned_members(center_index, pattern_dic_list, r)

This function finds all the motif members assuming that the center of the motif is the BS sequence with index center_index. It returns the dictionaries related to the all the non-overlapping BS subsequences that have a distance to the center subsequence lower than R

Parameters:
  • center_index (int) – index of the center motif in the pattern_dic_list
  • pattern_dic_list (list of dic) – list of dictionaries related to all the subsequences in the pattern
  • r (float) – maximum distance to the center
Returns:

list of dictionaries of the motif members

Return type:

list of dic

extendedMD.patterns.find_pattern_center_and_members(pattern_dic_list, r)

This function finds the all the members of a motif (all non-overlapping subsequences with a distance to the motif center less than R) and and the motif’s center (the subsequence with the maximal count of members and minimal mean distance)

Parameters:
  • pattern_dic_list (list of dic) – list of dictionaries related to all the subsequences in the pattern
  • r (float) – maximum distance to the center of the motif
Returns:

  • center_dic (dic) - dictionary of the motif’s center subsequence
  • members_dic_list (list of dic) - list of dictionaries related to all the motif’s members
  • min_mean_dist (float) - mean distance between the motif’s center and all the motif’s members

extendedMD.patterns.lists_overlap(l1, l2)

This functions return whether two lists overlap

Parameters:
  • l1 (list) – list 1
  • l2 (list) – list 2
Returns:

whether the list overlap. If TRUE, then the list overlap

Return type:

bool

extendedMD.pca

extendedMD.pca.extract_pca_ts(multi_dim_ts)

This function reduces a multi-dimensional time-series to 1d using PCA

Parameters:multi_dim_ts (DataFrame) – multi-dimensional time-series as a pandas dataframe
Returns:1-dimensional time-series
Return type:1d array

extendedMD.pruning

extendedMD.pruning.extract_ts_from_pointers(ts, pointers)
Parameters:
  • ts (1d array) – 1-dimensional time-series either resulting from the PCA method or the original 1-dimensional time-series
  • pointers (list of int) – list of indexes related to the subsequence one wishes to extract from ts
Returns:

time-series subsequence

Return type:

1d array

extendedMD.pruning.prune_motifs(ts, sorted_dic_list, r)
Parameters:
  • ts (1d array) – 1-dimensional time-series either resulting from the PCA method or the original 1-dimensional time-series
  • sorted_dic_list (list of dic) – list of motif dictionaries returned from the emd algorithm, ordered by relevance
  • r (float) – maximum distance to the center of the motif
Returns:

list of dictionaries with the most relevant motifs

Return type:

list of dic

extendedMD.pruning.prune_motifs_with_dist(ts, motif_dic_list, r, mdl_bins)
Parameters:
  • ts (1d array) – 1-dimensional time-series either resulting from the PCA method or the original 1-dimensional time-series
  • motif_dic_list (list of dic) – list of motif dictionaries returned from the emd algorithm
  • r (float) – maximum distance to the center of the motif
  • mdl_bins (int) – number of bins to break the MDL cost range
Returns:

list of dictionaries with the most relevant motifs. The list is ordered based on MDL cost and motif’s compactness

Return type:

list of dic

extendedMD.pruning.prune_motifs_with_mdl(ts, motif_dic_list, r)

This function returns the most relevant motifs from the original list of motif extracted from the emd algorithm, based on the computed MDL cost and avoiding overlapping motifs

Parameters:
  • ts (1d array) – 1-dimensional time-series either resulting from the PCA method or the original 1-dimensional time-series
  • motif_dic_list (list of dic) – list of motif dictionaries returned from the emd algorithm
  • r (float) – maximum distance to the center of the motif
Returns:

list of dictionaries with the most relevant motifs. The list is ordered based on the MDL cost

Return type:

list of dic

extendedMD.sax

extendedMD.sax.apply_adaptive_sax(ts, win_size, paa_size, alphabet_size, z_threshold)

This function applies the sax transformation to a 1-dim time series using adaptive break-points

Parameters:
  • ts (1D array) – 1-time series
  • win_size (int) – size fo the sliding window that generated each sax word
  • paa_size (int) – number of characters in a single sax word
  • alphabet_size (int) – number of unique characters to use in the sax representation
  • z_threshold (float) – z_threshold for the znorm method from saxpy
Returns:

the sax sequence, a list of strings, where each string represents a single sax word

Return type:

list of str

extendedMD.sax.apply_non_adaptive_sax(ts, win_size, paa_size, alphabet_size, z_threshold)

This function applies the sax transformation to a 1-dim time series using adaptive break-points

Parameters:
  • ts (1D array) – 1-time series
  • win_size (int) – size fo the sliding window that generated each sax word
  • paa_size (int) – number of characters in a single sax word
  • alphabet_size (int) – number of unique characters to use in the sax representation
  • z_threshold (float) – z_threshold for the znorm method from saxpy
Returns:

the sax sequence, a list of strings, where each string represents a single sax word

Return type:

list of str

extendedMD.sax.extract_sax_sequence(ts, win_size, paa_size, alphabet_size, adaptive_break_points, z_threshold=0.01)

This function applies the sax transformation to a 1-dim time series. Based on saxpy package

Parameters:
  • ts (1D array) – 1-time series
  • win_size (int) – size fo the sliding window that generated each sax word
  • paa_size (int) – number of characters in a single sax word
  • alphabet_size (int) – number of unique characters to use in the sax representation
  • adaptive_break_points (bool) – Whether to use a representation with adaptive break-points
  • z_threshold (float) – z_threshold for the znorm method from saxpy
Returns:

the sax sequence, a list of strings, where each string represents a single sax word

Return type:

list of str

extendedMD.viz

extendedMD.viz.create_motif_table(motif_dic_list)

This function creates a pandas dataframe with a summary of each motif in motif_dic_list

Parameters:motif_dic_list (list of dic) – list of dictionaries, where a dic is related to a single motif
Returns:motif_table_df
Return type:pandas.DataFrame
extendedMD.viz.plot_k_motifs(k, ts, events_ts, motif_dic_list, y_label='Full 1-d time-series')

This function prints the base visualisation for the first k motifs in motif_dic_list for the original 1-d time-series

Parameters:
  • k (int) – number of motifs to plot
  • ts (1d array) – original 1-dimensional time-series
  • events_ts (list) – list of labels for each entry in ts
  • motif_dic_list (list of dic) – list of dictionaries, where a dic is related to a single motif
  • y_label (str) – label to add in the y-axis of the first plot (optional and defaults to ‘Full 1-d time-series’)
Returns:

None

extendedMD.viz.plot_k_multdim_motifs(k, multidim_ts, events_ts, motif_dic_list)

This function prints the base visualisation for the first k motifs in motif_dic_list for the original multidimensional time-series. It shows one 1-d plot for each dimension in multidim_ts

Parameters:
  • k (int) – number of motifs to plot
  • multidim_ts (pandas.DataFrame) – original multidimensional time-series
  • events_ts (list) – list of labels for each entry in multidim_ts
  • motif_dic_list (list of dic) – list of dictionaries, where a dic is related to a single motif
Returns:

None

extendedMD.viz.plot_single_motif(ts, events_ts, motif_dic, y_label)
This function creates the base visualization for a single motif:
  1. plot with the whole time-series highlighting the labels and the position of each motif’s member
  2. plot with all the motif’s members
  3. plot with the motif’s center
Parameters:
  • ts (1d array) – original 1-dimensional time-series
  • events_ts (list) – list of labels for each entry in ts (max of 10 labels for avoiding color cycling in the plots) label == 0 is assumed to represent no events and thus it is not plotted
  • motif_dic (dic) – dictionary related to the motif
  • y_label (str) – label to add in the y-axis of the first plot
Returns:

fig - figure with the motif plot

Return type:

matplotlib.figure.Figure