Archives.org Latin Toolkit
What?
This piece of software is intended to be used with the 11K Latin Texts produced by David Bamman (http://www.cs.cmu.edu/~dbamman/latin.html). It supports only the plain text formats and the metadata CSV file from the GitHub repository. It has been tested with Python 3 only. Contributions of new functions or backward-compatibility support are welcome.
How to install?
- With the development version:
    - Clone the repository:
      git clone https://github.com/ponteineptique/archives_org_latin_toolkit.git
    - Go to the directory:
      cd archives_org_latin_toolkit
    - Install the source with the develop option:
      python setup.py develop
- With pip:
    - Install from pip:
      pip install archives_org_latin_toolkit
Example
The following example should run with the data in test/test_data. It can be run with python example.py
# We import the main classes from the module
from archives_org_latin_toolkit import Repo, Metadata
from pprint import pprint

# We instantiate a Metadata object and a Repo object
metadata = Metadata("./test/test_data/latin_metadata.csv")
# We want the text to be set in lowercase
repo = Repo("./test/test_data/archive_org_latin/", metadata=metadata, lowercase=True)

# We define a list of tokens we want to search for
tokens = ["ecclesiastico", "ecclesia", "ecclesiis"]

# We instantiate a result storage
results = []

# We iterate over the texts containing those tokens.
# Note that we need to unpack the list with *
# Lower multiprocess to use fewer processors. Use None to use one processor only
for text_matching in repo.find(*tokens, multiprocess=4):
    # For each text, we iterate over the embeddings found in the text
    # We want 3 words left, 3 words right,
    # and we want to keep the original token (default behaviour)
    for embedding in text_matching.find_embedding(*tokens, window=3, ignore_center=False):
        # We add it to the results
        results.append(embedding)

# We print the result (list of list of strings)
pprint(results)
Archives.org Latin Toolkit Documentation
Classes
- class archives_org_latin_toolkit.Metadata(csv_file)
  Bases: object
  Metadata object for a file
  Parameters: csv_file (str) – Path to the CSV file to parse
- class archives_org_latin_toolkit.Text(file, metadata=None, lowercase=False)
  Bases: object
  Text reading object for archive_org
    - clean
      Clean version of the text: spaces normalized, new lines removed, words dehyphenated, punctuation and numbers removed.
    - composed
    - find_embedding(*strings, window=50, ignore_center=False, memory_efficient=True)
      Find the given strings in the text and retrieve the words around each match (its embedding)
      Parameters:
        - strings – Strings as multiple arguments
        - window – Number of words to retrieve on each side of the match
        - ignore_center – Remove the word found from the embedding
    - has_strings(*strings)
      Check whether the given strings are in the file
      Parameters: strings – Strings as multiple arguments
      Returns: If found, return True
      Return type: bool
    - name
    - random_embedding(grab, window=50, avoid=None, memory_efficient=True, _taken=None, _generator=True)
      Search for random sentences in the text. Can avoid certain words.
      Returns: Generator with random texts
      Note: Right now, newly found windows are not added to _taken, which is problematic.
    - raw
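The cleaning steps listed for Text.clean can be illustrated with a short standalone sketch. This is not the toolkit's own code: the function name clean_sketch, the order of the steps, and the restriction to ASCII punctuation are assumptions made for illustration only.

```python
import re
import string


def clean_sketch(raw):
    """Hypothetical illustration of the cleaning steps described for Text.clean."""
    # Dehyphenate: rejoin words that were split across a line break
    text = re.sub(r"-\s*\n\s*", "", raw)
    # Remove numbers
    text = re.sub(r"\d+", " ", text)
    # Remove punctuation (assumption: ASCII punctuation only)
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Normalize spaces and remove new lines
    return " ".join(text.split())


print(clean_sketch("Eccle-\nsia  sancta, 12 est.\nAmen"))
```

Replacing punctuation and digits with spaces rather than deleting them keeps neighbouring words from being glued together before the final whitespace normalization.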
- class archives_org_latin_toolkit.Repo(directory, metadata=None, lowercase=False)
  Bases: object
  Repo reading object for archive_org
    - find(*strings, multiprocess=None, memory_efficient=True)
      Find the files that contain the given strings
      Returns: Files that match the strings
      Return type: generator
    - get(identifier)
      Get the Text object given its identifier
      Parameters: identifier (str) – Filename or identifier
      Returns: Text object
      Return type: Text
    - metadata
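To give an idea of what a generator-based search such as Repo.find does, here is a minimal standalone sketch. It is not the toolkit's implementation: find_sketch is a hypothetical name, and the sketch assumes that texts are .txt files in a flat directory and that a file matches when it contains at least one of the strings. Multiprocessing is left out.

```python
from pathlib import Path


def find_sketch(directory, *strings, lowercase=False):
    """Hypothetical illustration: yield the paths of files containing any of the strings."""
    # Assumption: one plain text file per work, with a .txt extension
    for path in sorted(Path(directory).glob("*.txt")):
        content = path.read_text(encoding="utf-8", errors="ignore")
        if lowercase:
            content = content.lower()
        # Assumption: one matching string is enough for the file to match
        if any(s in content for s in strings):
            # Yielding lazily means the file content can be dropped right after the check
            yield path
```

Because the function is a generator, only one file's content is held in memory at a time, which is the kind of behaviour the memory_efficient option hints at.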
Helpers
- archives_org_latin_toolkit.period(x)
  Parse a period in metadata. If there are multiple dates, returns the mean.
  Parameters: x (str) – Value to parse
  Returns: Parsed numeral
  Return type: int
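The "mean of multiple dates" behaviour can be sketched as follows. The format of the period values in the metadata CSV is not shown here, so this sketch assumes that years appear as runs of digits (for example "1100-1200"); period_sketch is a hypothetical name and BCE handling is ignored.

```python
import re


def period_sketch(value):
    """Hypothetical illustration: parse a period string and return the mean year."""
    # Assumption: every run of digits in the value is a year
    years = [int(year) for year in re.findall(r"\d+", value)]
    if not years:
        raise ValueError("No year found in %r" % value)
    # Integer mean, so a single date is returned unchanged
    return sum(years) // len(years)


print(period_sketch("1100-1200"))
```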
- archives_org_latin_toolkit.bce(x)
  Format a BCE string
  Parameters: x (str) – Value to parse
  Returns: Parsed numeral
  Return type: str
- archives_org_latin_toolkit.__window__(array, window, i)
  Compute embedding using i
  Parameters:
    - strings –
    - window – Number of words to take left, then right [ len(result) = (2*window)+1 ]
    - i – Index of the word
    - memory_efficient (bool) – Drop the content of files to avoid filling the RAM with unused content
  Returns: List of words
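The window computation can be pictured with a small standalone sketch. It is an assumption about the general technique, not the toolkit's code: window_sketch is a hypothetical name, and the ignore_center argument is added here only to mirror the option exposed by find_embedding.

```python
def window_sketch(words, window, i, ignore_center=False):
    """Hypothetical illustration: words around index i, `window` on each side."""
    # Clamp the left bound so that a match near the start does not wrap around
    left = words[max(0, i - window):i]
    right = words[i + 1:i + 1 + window]
    center = [] if ignore_center else [words[i]]
    return left + center + right


words = "in principio erat verbum et verbum erat apud deum".split()
print(window_sketch(words, 2, 4))
```

Away from the edges of the text the result has (2*window)+1 words, or 2*window when the center is ignored; near the start or end it is simply shorter.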