Welcome to extendedMD’s documentation!¶
This repository contains a python implementation of the Extended Motif Discovery (EMD) algorithm for motif discovery in time-series.
Installation¶
PyPI¶
This packages is available on PyPI and thus can be directly installed with pip:
$ pip install extendedMD
Source¶
This package can also be installed directly from source:
- Clone the repository from https://github.com/misilva73/extendedMD
- Install it with the command
python setup.py install
Extracting motifs¶
extendedMD.bs¶
-
extendedMD.bs.
extract_modified_bs_sequence
(sax_sequence)¶ This functions extracts the modified Behaviour Subsequence (BS) sequence list, which is the original sax word sequence where every consecutive pairs of sax words are equal are fused into the same sax word.
Parameters: sax_sequence (list of str) – list of original sax words Returns: - bs_sequence (
list of str
) - list of modified sax words - bs_lengths (
list of int
) - list of lengths of each modified sax word
- bs_sequence (
-
extendedMD.bs.
generate_bs_pointers
(bs_lengths, bs_size)¶ It generates the pointers (i.e. time indexes) of each modified sax word into the original time-series data
Parameters: - bs_lengths (list of str) – list of modified sax words
- bs_size (int) – window size (in the original time-series) of a single sax word
Returns: list of pointers to the original time-series
Return type: list of list of int
-
extendedMD.bs.
get_bs_subsequences_dic_list
(ts, bs_seq, bs_pointers, subseq_size)¶ This function extracts a list with all the BS subsequences with fixed size from a BS sequence
Parameters: - ts (list of float) – original 1-d
- bs_seq (list of str) – list of modified sax words (i.e. BS sequence)
- bs_pointers (list of list of int) – list of pointers to the original time-series
- subseq_size (int) – number of sax words in a single BS subsequence
Returns: list of dictionaries where each dic represents a single BS subsequence. The dic has 4 entries:
- pattern - the list of sax words related to that BS subsequence
- pointers - the pointers list to the original time-series related to that subsequence
- ts - time-series subsequence related to that BS subsequence (numpy array!)
- bs_position - list of indexes of the the bs_seq list related to that BS subsequence
Return type: list of dic
extendedMD.dtwdist¶
-
extendedMD.dtwdist.
add_distance_vec_to_pattern_dic
(pattern_dic_list, r=None)¶ This function returns a new pattern_dic_list where each dic has an extra entry with the vector of distances between the BS sequence related to that dic and all the other BS subsequences in the pattern
Parameters: - pattern_dic_list (list of dic) – list of all the dictionaries related to BS sequences in the pattern
- r (float) – distance upper bound (for DTW distance computation)
Returns: new_pattern_dic_list
Return type: list of dic
-
extendedMD.dtwdist.
compute_dtw_dist_mat
(ts_list, r=None)¶ This function computes the pairwise distance matrix of a list of time-series with Dynamic Time Warping distance. It is based on dtaidistance package
Parameters: - ts_list (list of 1D array) – list of time-series to compare pairwise
- r (float) – distance upper bound - if distance is higher than R, then computation stops and the distance is set as inf this parameter serves merely for speeding up computation
Returns: distance matrix
Return type: 2D array
-
extendedMD.dtwdist.
compute_dwt_dist_between_ts_and_list
(single_ts, ts_list, r)¶ This function computes the list of DTW distances between a single time-series and a list of time-series.
Parameters: - single_ts (1d array) – single time-series
- ts_list (list of 1d array) – list of time-series
- r (float) – distance upper bound (for DTW distance computation)
Returns: list of DTW distances
Return type: list of float
extendedMD.emd¶
-
extendedMD.emd.
find_motifs_from_emd
(ts, r, win_size, paa_size, alphabet_size, adaptive_break_points=True, z_threshold=0.01)¶ Returns the full list of motifs from either a multi-dimensional or a 1-dimensional time-series by running the extendedMD algorithm. If the time-series received is multi-dimensional (i.e. a pandas dataframe), then the algorithm starts by applying PCA to reduce it to a single dimension.
Parameters: - ts (Union[1d array, DataFrame]) – original time-series
- r (float) – maximum distance to the center of the motif
- win_size (int) – size fo the sliding window that generated each sax word
- paa_size (int) – number of characters in a single sax word
- alphabet_size (int) – number of unique characters to use in the sax representation
- adaptive_break_points (bool) – Whether to use a representation with adaptive break-points
- z_threshold (float) – z_threshold for the znorm method from saxpy
Returns: - motif_candidates_dic_list (
list of dic
) - list of motif dictionaries - ts_1d (
1d array
) - 1-dimensional time-series either resulting from the PCA method or the original 1-dimensional time-series
extendedMD.mdl¶
-
extendedMD.mdl.
compute_motif_mdl_cost
(members_dic_list, bs_len)¶ This function compute the MDL cost associated to the motif that generated the members in members_dic_list, as proposed by Tanaka et all
Parameters: - members_dic_list (list of dic) – list of dictionaries related to the BS subsequences that belong to a motif
- bs_len (list of int) – list of the lengths of each BS sequence
Returns: mdl cost
Return type: float
-
extendedMD.mdl.
compute_segmentation_mdl_cost
(seg_len_list)¶ This function computes the MDL cost associated with the lengths split BS sequence. This is a middle step in the MDL cost computation proposed by Tanaka et all as proposed by Tanaka et all
Parameters: seg_len_list (list of list of int) – list of split BS lengths Returns: mdl cost Return type: float
-
extendedMD.mdl.
split_bs_len
(members_dic_list, bs_len)¶ This function splits the BS sequence based on the position of the motif members (i.e. subsequences in members_dic_list). This is a middle step in the MDL cost computation proposed by Tanaka et all
Parameters: - members_dic_list (list of dic) – list of dictionaries related to the BS subsequences that belong to a motif
- bs_len (list of int) – list of the lengths of each BS sequence
Returns: list of split BS lengths
Return type: list of list of int
extendedMD.motifs¶
-
extendedMD.motifs.
build_motif_dic_from_pattern
(bs_subseq_dic_list, bs_len, pattern, r)¶ This function builds the motif dictionary related to the pattern given as input
Parameters: - bs_subseq_dic_list (list of dic) – list of dictionaries where each dic represents a single BS subsequence
- bs_len (list of int) – list of the lengths of each BS sequence
- pattern (list of str) – list of sax words that made up the pattern
- r (float) – maximum distance to the center of the motif
Returns: motif_dic
Return type: dic
-
extendedMD.motifs.
find_all_motif_candidates
(ts, bs_seq, bs_len, bs_pointers, r)¶ This function finds all the motif candidates of all sizes and saves each motif instance in a dictionary. The result is a list of dictionaries.
Parameters: - ts (1d array) – original 1-d time-series
- bs_seq (list of str) – list of modified sax words (i.e. BS sequence)
- bs_len (list of int) – list of the lengths of each BS sequence
- bs_pointers (list of list of int) – list of pointers to the original time-series
- r (float) – maximum distance to the center of the motif
Returns: motif_dic_list
Return type: list of dic
-
extendedMD.motifs.
transform_members_list_into_motif_dic
(center_dic, members_dic_list, mdl_cost, mean_dist)¶ This functions creates the dictionary related to the motif composed by the subsequences in members_dic_list
Parameters: - center_dic (dic) – dictionary related to the center of motif
- members_dic_list (list of dic) – list of dictionaries related to all the subsequences that belong to the motif
- mdl_cost (float) – MDL cost of the motif (as defined by Tanaka et all)
- mean_dist (float) – mean distance between all the pair of motif’s members
Returns: motif_dic
Return type: dic
extendedMD.patterns¶
-
extendedMD.patterns.
find_all_pruned_members
(center_index, pattern_dic_list, r)¶ This function finds all the motif members assuming that the center of the motif is the BS sequence with index center_index. It returns the dictionaries related to the all the non-overlapping BS subsequences that have a distance to the center subsequence lower than R
Parameters: - center_index (int) – index of the center motif in the pattern_dic_list
- pattern_dic_list (list of dic) – list of dictionaries related to all the subsequences in the pattern
- r (float) – maximum distance to the center
Returns: list of dictionaries of the motif members
Return type: list of dic
-
extendedMD.patterns.
find_pattern_center_and_members
(pattern_dic_list, r)¶ This function finds the all the members of a motif (all non-overlapping subsequences with a distance to the motif center less than R) and and the motif’s center (the subsequence with the maximal count of members and minimal mean distance)
Parameters: - pattern_dic_list (list of dic) – list of dictionaries related to all the subsequences in the pattern
- r (float) – maximum distance to the center of the motif
Returns: - center_dic (
dic
) - dictionary of the motif’s center subsequence - members_dic_list (
list of dic
) - list of dictionaries related to all the motif’s members - min_mean_dist (
float
) - mean distance between the motif’s center and all the motif’s members
-
extendedMD.patterns.
lists_overlap
(l1, l2)¶ This functions return whether two lists overlap
Parameters: - l1 (list) – list 1
- l2 (list) – list 2
Returns: whether the list overlap. If TRUE, then the list overlap
Return type: bool
extendedMD.pca¶
-
extendedMD.pca.
extract_pca_ts
(multi_dim_ts)¶ This function reduces a multi-dimensional time-series to 1d using PCA
Parameters: multi_dim_ts (DataFrame) – multi-dimensional time-series as a pandas dataframe Returns: 1-dimensional time-series Return type: 1d array
extendedMD.pruning¶
-
extendedMD.pruning.
extract_ts_from_pointers
(ts, pointers)¶ Parameters: - ts (1d array) – 1-dimensional time-series either resulting from the PCA method or the original 1-dimensional time-series
- pointers (list of int) – list of indexes related to the subsequence one wishes to extract from ts
Returns: time-series subsequence
Return type: 1d array
-
extendedMD.pruning.
prune_motifs
(ts, sorted_dic_list, r)¶ Parameters: - ts (1d array) – 1-dimensional time-series either resulting from the PCA method or the original 1-dimensional time-series
- sorted_dic_list (list of dic) – list of motif dictionaries returned from the emd algorithm, ordered by relevance
- r (float) – maximum distance to the center of the motif
Returns: list of dictionaries with the most relevant motifs
Return type: list of dic
-
extendedMD.pruning.
prune_motifs_with_dist
(ts, motif_dic_list, r, mdl_bins)¶ Parameters: - ts (1d array) – 1-dimensional time-series either resulting from the PCA method or the original 1-dimensional time-series
- motif_dic_list (list of dic) – list of motif dictionaries returned from the emd algorithm
- r (float) – maximum distance to the center of the motif
- mdl_bins (int) – number of bins to break the MDL cost range
Returns: list of dictionaries with the most relevant motifs. The list is ordered based on MDL cost and motif’s compactness
Return type: list of dic
-
extendedMD.pruning.
prune_motifs_with_mdl
(ts, motif_dic_list, r)¶ This function returns the most relevant motifs from the original list of motif extracted from the emd algorithm, based on the computed MDL cost and avoiding overlapping motifs
Parameters: - ts (1d array) – 1-dimensional time-series either resulting from the PCA method or the original 1-dimensional time-series
- motif_dic_list (list of dic) – list of motif dictionaries returned from the emd algorithm
- r (float) – maximum distance to the center of the motif
Returns: list of dictionaries with the most relevant motifs. The list is ordered based on the MDL cost
Return type: list of dic
extendedMD.sax¶
-
extendedMD.sax.
apply_adaptive_sax
(ts, win_size, paa_size, alphabet_size, z_threshold)¶ This function applies the sax transformation to a 1-dim time series using adaptive break-points
Parameters: - ts (1D array) – 1-time series
- win_size (int) – size fo the sliding window that generated each sax word
- paa_size (int) – number of characters in a single sax word
- alphabet_size (int) – number of unique characters to use in the sax representation
- z_threshold (float) – z_threshold for the znorm method from saxpy
Returns: the sax sequence, a list of strings, where each string represents a single sax word
Return type: list of str
-
extendedMD.sax.
apply_non_adaptive_sax
(ts, win_size, paa_size, alphabet_size, z_threshold)¶ This function applies the sax transformation to a 1-dim time series using adaptive break-points
Parameters: - ts (1D array) – 1-time series
- win_size (int) – size fo the sliding window that generated each sax word
- paa_size (int) – number of characters in a single sax word
- alphabet_size (int) – number of unique characters to use in the sax representation
- z_threshold (float) – z_threshold for the znorm method from saxpy
Returns: the sax sequence, a list of strings, where each string represents a single sax word
Return type: list of str
-
extendedMD.sax.
extract_sax_sequence
(ts, win_size, paa_size, alphabet_size, adaptive_break_points, z_threshold=0.01)¶ This function applies the sax transformation to a 1-dim time series. Based on saxpy package
Parameters: - ts (1D array) – 1-time series
- win_size (int) – size fo the sliding window that generated each sax word
- paa_size (int) – number of characters in a single sax word
- alphabet_size (int) – number of unique characters to use in the sax representation
- adaptive_break_points (bool) – Whether to use a representation with adaptive break-points
- z_threshold (float) – z_threshold for the znorm method from saxpy
Returns: the sax sequence, a list of strings, where each string represents a single sax word
Return type: list of str
extendedMD.viz¶
-
extendedMD.viz.
create_motif_table
(motif_dic_list)¶ This function creates a pandas dataframe with a summary of each motif in motif_dic_list
Parameters: motif_dic_list (list of dic) – list of dictionaries, where a dic is related to a single motif Returns: motif_table_df Return type: pandas.DataFrame
-
extendedMD.viz.
plot_k_motifs
(k, ts, events_ts, motif_dic_list, y_label='Full 1-d time-series')¶ This function prints the base visualisation for the first k motifs in motif_dic_list for the original 1-d time-series
Parameters: - k (int) – number of motifs to plot
- ts (1d array) – original 1-dimensional time-series
- events_ts (list) – list of labels for each entry in ts
- motif_dic_list (list of dic) – list of dictionaries, where a dic is related to a single motif
- y_label (str) – label to add in the y-axis of the first plot (optional and defaults to ‘Full 1-d time-series’)
Returns: None
-
extendedMD.viz.
plot_k_multdim_motifs
(k, multidim_ts, events_ts, motif_dic_list)¶ This function prints the base visualisation for the first k motifs in motif_dic_list for the original multidimensional time-series. It shows one 1-d plot for each dimension in multidim_ts
Parameters: - k (int) – number of motifs to plot
- multidim_ts (pandas.DataFrame) – original multidimensional time-series
- events_ts (list) – list of labels for each entry in multidim_ts
- motif_dic_list (list of dic) – list of dictionaries, where a dic is related to a single motif
Returns: None
-
extendedMD.viz.
plot_single_motif
(ts, events_ts, motif_dic, y_label)¶ - This function creates the base visualization for a single motif:
- plot with the whole time-series highlighting the labels and the position of each motif’s member
- plot with all the motif’s members
- plot with the motif’s center
Parameters: - ts (1d array) – original 1-dimensional time-series
- events_ts (list) – list of labels for each entry in ts (max of 10 labels for avoiding color cycling in the plots) label == 0 is assumed to represent no events and thus it is not plotted
- motif_dic (dic) – dictionary related to the motif
- y_label (str) – label to add in the y-axis of the first plot
Returns: fig - figure with the motif plot
Return type: matplotlib.figure.Figure