Welcome to PLSA's documentation!

plsa.preprocessors module
Preprocessors for documents and words.
These preprocessors come in three flavours (functions, closures that return functions, and classes defining callable objects). The choice of flavour is motivated by the complexity of the preprocessor. If it doesn't need any parameters, a simple function will do. If it needs some parameter(s) but does not have to be manipulated interactively, then a closure is fine. If it would be convenient to alter the parameters of the preprocessor interactively, then a class is a good choice.
Preprocessors act either on an entire document string or, after splitting documents into individual words, on an iterable over the words contained in a single document. Therefore, they cannot be combined in arbitrary order but care must be taken to ensure that the return value of one matches the call signature of the next.
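As a minimal sketch of how the call signatures chain (using only preprocessors documented below), a document flows through the string-level preprocessors first and is tokenized before any word-level preprocessing:

    from plsa.preprocessors import remove_non_ascii, to_lower, tokenize

    doc = 'A Naïve Approach to Topic Modeling!'
    # String-level preprocessors map one string to another;
    # remove_non_ascii strips the non-ASCII character 'ï' here.
    clean = to_lower(remove_non_ascii(doc))
    # tokenize ends the string-level part of the chain, turning the
    # document into a tuple of words ready for word-level preprocessors.
    words = tokenize(clean)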
plsa.preprocessors.remove_non_ascii(doc: str) → str
    Removes non-ASCII characters (i.e., with unicode > 127) from a string.
    Parameters: doc (str) – A document given as a single string.
    Returns: The document as a single string with all characters of unicode > 127 removed.
    Return type: str
plsa.preprocessors.to_lower(doc: str) → str
    Converts a string to all-lowercase.
    Parameters: doc (str) – A document given as a single string.
    Returns: The document as a single string with all characters converted to lowercase.
    Return type: str
plsa.preprocessors.remove_numbers(doc: str) → str
    Removes digit/number characters from a string.
    Parameters: doc (str) – A document given as a single string.
    Returns: The document as a single string with all number/digit characters removed.
    Return type: str
plsa.preprocessors.remove_tags(exclude_regex: str) → Callable[[str], str]
    Returns a callable that removes matches to the given regular expression.
    Parameters: exclude_regex (str) – A regular expression specifying specific patterns to remove from a document.
    Returns: A callable that removes patterns matching the given regular expression from a string.
    Return type: function
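    A minimal usage sketch of the returned closure (the regular expression shown, matching HTML-style tags, is only an illustration):

        from plsa.preprocessors import remove_tags

        strip_tags = remove_tags(r'<[^>]*>')     # closure that deletes all matches
        clean = strip_tags('<p>Some text.</p>')  # 'Some text.'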
plsa.preprocessors.remove_punctuation(punctuation: Iterable[str]) → Callable[[str], str]
    Returns a callable that removes punctuation characters from a string.
    Parameters: punctuation (iterable of str) – An iterable over single-character strings specifying punctuation characters to remove from a document.
    Returns: A callable that removes the given punctuation characters from a string.
    Return type: function
plsa.preprocessors.tokenize(doc: str) → Tuple[str, ...]
    Splits a string into individual words.
    Parameters: doc (str) – A document given as a single string.
    Returns: The document as a tuple of individual words.
    Return type: tuple of str
class plsa.preprocessors.RemoveStopwords(stopwords: Union[str, Iterable[str]])
    Bases: object
    Instantiate callable objects that remove stopwords from a document.
    Parameters: stopwords (str or iterable of str) – Stopword(s) to remove from a document given as an iterable over words.
    Examples
    >>> from plsa.preprocessors import RemoveStopwords
    >>> remover = RemoveStopwords('is')
    >>> remover.words
    ('is',)
    >>> remover.words = 'the', 'are'
    >>> remover.words
    ('the', 'are')
    >>> remover += 'is', 'we'
    >>> remover.words
    ('is', 'we', 'the', 'are')
    >>> new_instance = remover + 'do'
    >>> new_instance.words
    ('are', 'we', 'is', 'do', 'the')
    words
        The current stopwords.
class
plsa.preprocessors.
LemmatizeWords
(*incl_pos)¶ Bases:
object
Instantiate callable objects that find the root form of words.
Parameters: *inc_pos (str) – One or more positional tag(s) indicating the type(s) of words to retain and to find the root form of. Must be one of ‘JJ’ (adjectives), ‘NN’ (nouns), ‘VB’ (verbs), or ‘RB’ (adverbs). Raises: KeyError
– If the given positional tags are not among the list of allowed ones.Examples
    >>> from plsa.preprocessors import LemmatizeWords
    >>> lemmatizer = LemmatizeWords('VB')
    >>> lemmatizer.types
    ('VB',)
    >>> lemmatizer.types = 'jj', 'nn'
    >>> lemmatizer.types
    ('JJ', 'NN')
    >>> lemmatizer += 'VB', 'NN'
    >>> lemmatizer.types
    ('JJ', 'NN', 'VB')
    >>> new_instance = lemmatizer + 'RB'
    >>> new_instance.types
    ('JJ', 'RB', 'NN', 'VB')
    types
        The current type(s) of words to retain.
plsa.preprocessors.
remove_short_words
(min_word_len: int) → Callable[[Iterable[str]], Tuple[str, ...]]¶ Returns a callable that removes short words from an iterable of strings.
Parameters: min_word_len (int) – Minimum number of characters in a word for it to be retained. Returns: A callable that removes words shorter than the given threshold from an iterable over strings. Return type: function
plsa.pipeline module

class plsa.pipeline.Pipeline(*preprocessors)
    Bases: object
    Encapsulates and applies multiple document preprocessors.
    Each preprocessor is assumed to be a callable that takes a single document as input and produces a single document as output. Importantly, each document fed to the first preprocessor in the chain is delivered as a single string, while the last preprocessor is required to return the document as an iterable over strings, with each element representing one word of that document.
    Other than that, preprocessors can be combined in any way, provided that the return value of one matches the call signature of the next. The order in which they are applied is the order in which they are specified, i.e., from left to right.
    Parameters: *preprocessors (callable) – Function(s) or other callable object(s) that each take a single document as input and produce a (processed) document as output.
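    A minimal usage sketch, assembled entirely from the preprocessors documented above (the stopword choice is only an illustration):

        from plsa.pipeline import Pipeline
        from plsa.preprocessors import (remove_non_ascii, to_lower,
                                        remove_numbers, tokenize,
                                        RemoveStopwords)

        # Order matters: string-level preprocessors first, then tokenize,
        # then word-level preprocessors.
        pipeline = Pipeline(remove_non_ascii, to_lower, remove_numbers,
                            tokenize, RemoveStopwords(['the', 'of']))
        words = pipeline.process('The 3 laws of robotics!')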
    process(doc: str) → Tuple[str, ...]
        Applies a chain of one or more preprocessors to a document.
        Parameters: doc (str) – A text document given as a single string.
        Returns: Each element represents one word of the document.
        Return type: tuple of str
plsa.corpus module

class plsa.corpus.Corpus(corpus: Iterable[str], pipeline: plsa.pipeline.Pipeline)
    Bases: object
    Processes raw document collections and provides numeric representations.
    Parameters:
    - corpus (iterable of str) – An iterable over documents, each given as a single string.
    - pipeline (Pipeline) – The preprocessing pipeline.
    classmethod from_csv(path: str, pipeline: plsa.pipeline.Pipeline, col: int = -1, encoding: str = 'latin_1', max_docs: int = 1000, **kwargs) → plsa.corpus.Corpus
        Instantiate a corpus from documents in a column of a CSV file.
        Parameters:
        - path (str) – Full path (incl. file name) to a CSV file with one column containing documents.
        - pipeline (Pipeline) – The preprocessing pipeline.
        - col (int) – Which column contains the documents. Numbering starts with 0 for the first column. Negative numbers count back from the last column (e.g., -1 for the last, -2 for the one before the last, etc.).
        - encoding (str) – A valid Python encoding used to read the documents.
        - max_docs (int) – The maximum number of documents to read from the file.
        - **kwargs – Keyword arguments are passed on to Python's own csv.reader function.
        Raises: StopIteration – If you do not have at least two lines in your CSV file.
        Notes
        If you set col to a value outside the range of columns present in the CSV file, it will be silently reset to the first or last column, depending on which side of the permitted range you exceed.
        A list of available encodings can be found at https://docs.python.org/3/library/codecs.html
        Formatting parameters for Python's csv.reader can be found at https://docs.python.org/3/library/csv.html#csv-fmt-params
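        A minimal usage sketch (the file name is hypothetical):

            from plsa.corpus import Corpus
            from plsa.pipeline import Pipeline
            from plsa.preprocessors import to_lower, tokenize

            pipeline = Pipeline(to_lower, tokenize)
            # Read at most 500 documents from the last column of a CSV file.
            corpus = Corpus.from_csv('reviews.csv', pipeline, col=-1, max_docs=500)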
    classmethod from_xml(directory: str, pipeline: plsa.pipeline.Pipeline, tag: str = 'post', encoding: str = 'latin_1', max_files: int = 100) → plsa.corpus.Corpus
        Instantiate a corpus from elements of XML files in a directory.
        Parameters:
        - directory (str) – Path to the directory with the XML files.
        - pipeline (Pipeline) – The preprocessing pipeline.
        - tag (str) – The XML tag that opens (<…>) and closes (</…>) the elements containing documents.
        - encoding (str) – A valid Python encoding used to read the documents.
        - max_files (int) – The maximum number of XML files to read.
        Notes
        A list of available encodings can be found at https://docs.python.org/3/library/codecs.html
    get_doc(tf_idf: bool) → numpy.ndarray
        The marginal probability that any word comes from a given document.
        This probability p(d) is obtained by summing the joint document-word probability p(d, w) over all words.
        Parameters: tf_idf (bool) – Whether to marginalize the term-frequency inverse-document-frequency or just the term-frequency matrix.
        Returns: The document probability p(d).
        Return type: ndarray
    get_doc_given_word(tf_idf: bool) → numpy.ndarray
        The conditional probability of a particular document given a word.
        This probability p(d|w) is obtained by dividing the joint document-word probability p(d, w) by the marginal word probability p(w).
        Parameters: tf_idf (bool) – Whether to base the conditional probability on the term-frequency inverse-document-frequency or just the term-frequency matrix.
        Returns: The conditional document probability p(d|w).
        Return type: ndarray
    get_doc_word(tf_idf: bool) → numpy.ndarray
        The normalized document-word count matrix.
        Also referred to as the term-frequency matrix. Because words (or terms) that occur in the majority of documents are the least helpful in discriminating types of documents, each column of this matrix can be multiplied by the logarithm of the total number of documents divided by the number of documents containing the given word. The result is then referred to as the term-frequency inverse-document-frequency or TF-IDF matrix.
        Either way, the returned matrix is always normalized such that it can be interpreted as the joint document-word probability p(d, w).
        Parameters: tf_idf (bool) – Whether to return the term-frequency inverse-document-frequency or just the term-frequency matrix.
        Returns: The normalized document (rows) by word (columns) matrix, either as pure counts (if tf_idf is False) or weighted by the inverse document frequency (if tf_idf is True).
        Return type: ndarray
    get_word(tf_idf: bool) → numpy.ndarray
        The marginal probability of a particular word.
        This probability p(w) is obtained by summing the joint document-word probability p(d, w) over all documents.
        Parameters: tf_idf (bool) – Whether to marginalize the term-frequency inverse-document-frequency or just the term-frequency matrix.
        Returns: The word probability p(w).
        Return type: ndarray
    idf
        Logarithm of inverse fraction of documents each word occurs in.
    index
        Mapping from actual word to numeric word index.
    n_docs
        The number of non-empty documents.
    n_occurrences
        Total number of times any word occurred in any document.
    n_words
        The number of unique words retained after preprocessing.
    pipeline
        The pipeline of preprocessors for each document.
    raw
        The raw documents as they were read from the source.
    vocabulary
        Mapping from numeric word index to actual word.
plsa.algorithms package

Submodules

plsa.algorithms.plsa module
class plsa.algorithms.plsa.PLSA(corpus: plsa.corpus.Corpus, n_topics: int, tf_idf: bool = True)
    Implements probabilistic latent semantic analysis (PLSA).
    At its core lies the assumption that the normalized document-word (or term-frequency) matrix p(d, w), weighted with the inverse document frequency or not, can be factorized as:

    \[p(d, w) \approx \sum_t \tilde{p}(d|t)\,\tilde{p}(w|t)\,\tilde{p}(t)\]

    Parameters:
    - corpus (Corpus) – The corpus of preprocessed and numerically represented documents.
    - n_topics (int) – The number of latent topics to identify.
    - tf_idf (bool) – Whether to use the term-frequency inverse-document-frequency or just the term-frequency matrix as the joint probability p(d, w) of documents and words.
    Raises: ValueError – If the number of topics is < 2, or if the number of words or the number of documents in the corpus is not greater than the number of topics.
    Notes
    The implementation follows algorithm 15.2 in Barber's book [1] to the letter. What is not said there is that, in order to update the conditional probability p(t|d, w) of a certain topic given a certain word in a certain document, one first needs to find the joint probability of all random variables as

    \[\tilde{p}(t, d, w) = \tilde{p}(d|t)\,\tilde{p}(w|t)\,\tilde{p}(t)\]

    and then divide by the marginal \(\tilde{p}(d, w)\).
    References
    [1] "Bayesian Reasoning and Machine Learning", David Barber (Cambridge University Press, 2012).
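    A minimal numpy sketch of one EM update built from these relations (the array names, shapes, and random initialization are illustrative only, not the library's internals):

        import numpy as np

        n_docs, n_words, n_topics = 8, 20, 3
        rng = np.random.default_rng(0)

        # Made-up joint document-word probability p(d, w).
        p_dw = rng.random((n_docs, n_words))
        p_dw /= p_dw.sum()

        # Randomly initialized factors p(d|t), p(w|t), and p(t).
        doc_given_topic = rng.random((n_docs, n_topics))
        doc_given_topic /= doc_given_topic.sum(axis=0)
        word_given_topic = rng.random((n_words, n_topics))
        word_given_topic /= word_given_topic.sum(axis=0)
        topic = np.full(n_topics, 1.0 / n_topics)

        # E-step: joint p~(t, d, w), then condition on (d, w) to get q(t|d, w).
        joint = np.einsum('dt,wt,t->tdw', doc_given_topic, word_given_topic, topic)
        q = joint / joint.sum(axis=0)

        # M-step: re-estimate the factors from q(t|d, w), weighted by p(d, w).
        topic = np.einsum('dw,tdw->t', p_dw, q)
        doc_given_topic = np.einsum('dw,tdw->dt', p_dw, q) / topic
        word_given_topic = np.einsum('dw,tdw->wt', p_dw, q) / topic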
    best_of(n_runs: int = 3, **kwargs) → plsa.algorithms.result.PlsaResult
        Finds the best PLSA model among the specified number of runs.
        As with any iterative algorithm, the probabilities in PLSA need to be (randomly) initialized prior to the first iteration step. Therefore, calling the fit method of two different instances operating on the same corpus with the same number of topics potentially leads to (slightly) different results, corresponding to different local minima of the Kullback-Leibler divergence between the true document-word probability and its approximate factorization. To mitigate this effect, perform multiple runs and pick the best model.
        Parameters:
        - n_runs (int, optional) – Number of runs to pick the best model of. Defaults to 3.
        - **kwargs – Keyword-only arguments are passed on to the fit method.
        Returns: Container class for the best result.
        Return type: PlsaResult
    fit(eps: float = 1e-05, max_iter: int = 200, warmup: int = 5) → plsa.algorithms.result.PlsaResult
        Run EM-style training to find latent topics in documents.
        Expectation-maximization (EM) iterates until either the maximum number of iterations is reached or the relative change of the Kullback-Leibler divergence between the actual document-word probability and its approximation falls below a certain threshold, whichever occurs first.
        Since all quantities are updated in place, calling the fit method again after a successful run (possibly with changed convergence criteria) will continue to add more iterations on top of the status quo rather than starting all over from scratch.
        Because a few EM iterations are needed to get things going, you can specify an initial warm-up period, during which progress in the Kullback-Leibler divergence is not tracked, and which does not count towards the maximum number of iterations.
        Parameters:
        - eps (float, optional) – The convergence cutoff for relative changes in the Kullback-Leibler divergence between the actual document-word probability and its approximation. Defaults to 1e-5.
        - max_iter (int, optional) – The maximum number of iterations to perform. Defaults to 200.
        - warmup (int, optional) – The number of iterations to perform before changes in the Kullback-Leibler divergence are tracked for convergence. Defaults to 5.
        Returns: Container class for the results of the latent semantic analysis.
        Return type: PlsaResult
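    A minimal end-to-end sketch (the raw documents are made up for illustration):

        from plsa.corpus import Corpus
        from plsa.pipeline import Pipeline
        from plsa.preprocessors import to_lower, tokenize
        from plsa.algorithms.plsa import PLSA

        docs = ['apples and oranges', 'training neural networks',
                'pears or apples', 'deep neural networks']
        corpus = Corpus(docs, Pipeline(to_lower, tokenize))
        plsa = PLSA(corpus, n_topics=2, tf_idf=False)
        # best_of passes keyword arguments such as max_iter on to fit.
        result = plsa.best_of(n_runs=5, max_iter=100)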
    n_topics
        The number of topics to find.
    tf_idf
        Use inverse document frequency to weigh the document-word counts?
plsa.algorithms.conditional_plsa module

class plsa.algorithms.conditional_plsa.ConditionalPLSA(corpus: plsa.corpus.Corpus, n_topics: int, tf_idf: bool = True)
    Implements conditional probabilistic latent semantic analysis (PLSA).
    Given that the normalized document-word (or term-frequency) matrix p(d, w), weighted with the inverse document frequency or not, can always be written as

    \[p(d, w) = p(d|w)\,p(w)\]

    the core of conditional PLSA is the assumption that the conditional p(d|w) can be factorized as:

    \[p(d|w) \approx \sum_t \tilde{p}(d|t)\,\tilde{p}(t|w)\]

    Parameters:
    - corpus (Corpus) – The corpus of preprocessed and numerically represented documents.
    - n_topics (int) – The number of latent topics to identify.
    - tf_idf (bool) – Whether to use the term-frequency inverse-document-frequency or just the term-frequency matrix as the joint probability p(d, w) of documents and words.
    Raises: ValueError – If the number of topics is < 2, or if the number of words or the number of documents in the corpus is not greater than the number of topics.
    Notes
    Importantly, the present implementation does not follow algorithm 15.3 in Barber's book [1]. The update equations there appear nonsensical. Following through the derivation that gives (non-conditional) PLSA, one arrives at the following updates:

    \[\begin{split}
    \tilde{p}(d|t) &= \sum_w p(d, w)\,q(t|d, w) \\
    \tilde{p}(t|w) &= \sum_d p(d, w)\,q(t|d, w) \\
    \tilde{p}(t, d, w) &= p(w)\,\tilde{p}(d|t)\,\tilde{p}(t|w) \\
    q(t|d, w) &= \tilde{p}(t, d, w)\,/\,\tilde{p}(d, w)
    \end{split}\]

    References
    [1] "Bayesian Reasoning and Machine Learning", David Barber (Cambridge University Press, 2012).
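    A minimal numpy sketch of these updates, reusing the conventions of the PLSA sketch above (all arrays are illustrative):

        import numpy as np

        n_docs, n_words, n_topics = 8, 20, 3
        rng = np.random.default_rng(1)

        # Made-up joint p(d, w) and its word marginal p(w).
        p_dw = rng.random((n_docs, n_words))
        p_dw /= p_dw.sum()
        p_w = p_dw.sum(axis=0)

        # Randomly initialized factors p(d|t) and p(t|w).
        doc_given_topic = rng.random((n_docs, n_topics))
        doc_given_topic /= doc_given_topic.sum(axis=0)
        topic_given_word = rng.random((n_topics, n_words))
        topic_given_word /= topic_given_word.sum(axis=0)

        # Joint p~(t, d, w) = p(w) p~(d|t) p~(t|w), then q(t|d, w).
        joint = np.einsum('w,dt,tw->tdw', p_w, doc_given_topic, topic_given_word)
        q = joint / joint.sum(axis=0)

        # Updates as above, normalized to proper conditional probabilities.
        doc_given_topic = np.einsum('dw,tdw->dt', p_dw, q)
        doc_given_topic /= doc_given_topic.sum(axis=0)
        topic_given_word = np.einsum('dw,tdw->tw', p_dw, q)
        topic_given_word /= topic_given_word.sum(axis=0)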
    best_of(n_runs: int = 3, **kwargs) → plsa.algorithms.result.PlsaResult
        Finds the best PLSA model among the specified number of runs.
        As with any iterative algorithm, the probabilities in PLSA need to be (randomly) initialized prior to the first iteration step. Therefore, calling the fit method of two different instances operating on the same corpus with the same number of topics potentially leads to (slightly) different results, corresponding to different local minima of the Kullback-Leibler divergence between the true document-word probability and its approximate factorization. To mitigate this effect, perform multiple runs and pick the best model.
        Parameters:
        - n_runs (int, optional) – Number of runs to pick the best model of. Defaults to 3.
        - **kwargs – Keyword-only arguments are passed on to the fit method.
        Returns: Container class for the best result.
        Return type: PlsaResult
    fit(eps: float = 1e-05, max_iter: int = 200, warmup: int = 5) → plsa.algorithms.result.PlsaResult
        Run EM-style training to find latent topics in documents.
        Expectation-maximization (EM) iterates until either the maximum number of iterations is reached or the relative change of the Kullback-Leibler divergence between the actual document-word probability and its approximation falls below a certain threshold, whichever occurs first.
        Since all quantities are updated in place, calling the fit method again after a successful run (possibly with changed convergence criteria) will continue to add more iterations on top of the status quo rather than starting all over from scratch.
        Because a few EM iterations are needed to get things going, you can specify an initial warm-up period, during which progress in the Kullback-Leibler divergence is not tracked, and which does not count towards the maximum number of iterations.
        Parameters:
        - eps (float, optional) – The convergence cutoff for relative changes in the Kullback-Leibler divergence between the actual document-word probability and its approximation. Defaults to 1e-5.
        - max_iter (int, optional) – The maximum number of iterations to perform. Defaults to 200.
        - warmup (int, optional) – The number of iterations to perform before changes in the Kullback-Leibler divergence are tracked for convergence. Defaults to 5.
        Returns: Container class for the results of the latent semantic analysis.
        Return type: PlsaResult
    n_topics
        The number of topics to find.
    tf_idf
        Use inverse document frequency to weigh the document-word counts?
plsa.algorithms.result module

class plsa.algorithms.result.PlsaResult(topic_given_doc: numpy.ndarray, word_given_topic: numpy.ndarray, topic_given_word: numpy.ndarray, topic: numpy.ndarray, kl_divergences: List[float], corpus: plsa.corpus.Corpus, tf_idf: bool)
    Bases: object
    Container for the results generated by a (conditional) PLSA run.
    Parameters:
    - topic_given_doc (ndarray) – The conditional probability p(t|d) as an \(n_{topics}\times n_{docs}\) array.
    - word_given_topic (ndarray) – The conditional probability p(w|t) as an \(n_{words}\times n_{topics}\) array.
    - topic_given_word (ndarray) – The conditional probability p(t|w) as an \(n_{topics}\times n_{words}\) array.
    - topic (ndarray) – The marginal topic probability p(t).
    - kl_divergences (list of float) – The Kullback-Leibler divergence between the original document-word probability p(d, w) and its approximation, for each iteration.
    - corpus (Corpus) – The original corpus the PLSA model was trained on.
    - tf_idf (bool) – Whether the document-word matrix was weighted with the inverse document frequencies or not.
    convergence
        The convergence of the Kullback-Leibler divergence.
    kl_divergence
        KL-divergence of approximate and true document-word probability.
    n_topics
        The number of latent topics identified.
    predict(doc: str) → Tuple[numpy.ndarray, int, Tuple[str, ...]]
        Predict the relative importance of latent topics in a new document.
        Parameters: doc (str) – A new document given as a single string.
        Returns:
        - ndarray – A 1-D array with the relative importance of the latent topics.
        - int – The number of words in the new document that were not present in the corpus the PLSA model was trained on.
        - tuple of str – Those words in the new document that were not present in the corpus the PLSA model was trained on.
        Raises: ValueError – If the document to predict on is an empty string, if there are no words left after preprocessing the document, or if there are no known words in the document.
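        A minimal usage sketch (it continues the PLSA fitting sketch above, so the name result refers to a trained PlsaResult; the new document is made up):

            topic_weights, n_unknown, unknown_words = result.predict('apples or pears')
            # topic_weights: 1-D array of relative topic importances
            # n_unknown / unknown_words: count and tuple of out-of-vocabulary words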
    tf_idf
        Used inverse document frequency to weigh the document-word counts?
    topic
        The relative importance of latent topics.
    topic_given_doc
        The relative importance of latent topics in each document.
        Dimensions are \(n_{docs}\times n_{topics}\).
    word_given_topic
        The words in each latent topic and their relative importance.
        Results are presented as a tuple of 2-tuples (word, word importance).
plsa.visualize module

class plsa.visualize.Visualize(result: plsa.algorithms.result.PlsaResult)
    Bases: object
    Visualize the results of probabilistic latent semantic analysis.
    Parameters: result (PlsaResult) – The results object returned by the fit method of a PLSA model object.
    convergence(axis: matplotlib.axes._subplots.AxesSubplot) → List[matplotlib.lines.Line2D]
        Plot the convergence of the PLSA run.
        The quantity to be minimized is the Kullback-Leibler divergence between the original document-word matrix and its approximation given by the (conditional) PLSA factorization.
        Parameters: axis (Subplot) – The matplotlib axis to plot into.
        Returns: The line object plotted into the given axis.
        Return type: list of Line2D
    prediction(doc: str, axis: matplotlib.axes._subplots.AxesSubplot) → matplotlib.container.BarContainer
        Plot the predicted relative weights of topics in a new document.
        Parameters:
        - doc (str) – A new document given as a single string.
        - axis (Subplot) – The matplotlib axis to plot into.
        Returns: The container for the bars plotted into the given axis.
        Return type: BarContainer
    topics(axis: matplotlib.axes._subplots.AxesSubplot) → matplotlib.container.BarContainer
        Plot the relative importance of the individual topics.
        Parameters: axis (Subplot) – The matplotlib axis to plot into.
        Returns: The container for the bars plotted into the given axis.
        Return type: BarContainer
    topics_in_doc(i_doc: int, axis: matplotlib.axes._subplots.AxesSubplot) → matplotlib.container.BarContainer
        Plot the relative weights of topics in a given document.
        Parameters:
        - i_doc (int) – Index of the document to plot. Numbering starts at 0.
        - axis (Subplot) – The matplotlib axis to plot into.
        Returns: The container for the bars plotted into the given axis.
        Return type: BarContainer
    wordclouds(figure: matplotlib.figure.Figure) → List[matplotlib.image.AxesImage]
        Plot the relative importance of words in all topics.
        Parameters: figure (Figure) – An empty matplotlib figure to plot into.
        Returns: List of images with the created word clouds.
        Return type: list of AxesImage
    words_in_topic(i_topic: int, axis: matplotlib.axes._subplots.AxesSubplot) → matplotlib.image.AxesImage
        Plot the relative importance of words in a given topic.
        Parameters:
        - i_topic (int) – Index of the topic to plot. Numbering starts at 0.
        - axis (Subplot) – The matplotlib axis to plot into.
        Returns: The image with the produced word cloud.
        Return type: AxesImage
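    A minimal plotting sketch (it continues the PLSA fitting sketch above, so the name result refers to a trained PlsaResult):

        import matplotlib.pyplot as plt
        from plsa.visualize import Visualize

        visualize = Visualize(result)
        fig, (top, bottom) = plt.subplots(2, 1)
        visualize.convergence(top)     # KL divergence per iteration
        visualize.topics(bottom)       # relative importance of each topic
        fig.tight_layout()
        plt.show()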