Description

Corpora is a lightweight, fast, and scalable corpus library that stores a collection of raw text documents with additional key-value headers. It uses Berkeley DB (the bsddb3 module) to manage its index, which provides speed and reliability. Text is stored in chunked, flat, human-readable text files. This architecture can easily scale to collections of millions of documents and hundreds of gigabytes.
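
The storage model described above (an index pointing into chunked flat text files) can be illustrated with a minimal sketch. This is not the Corpora API: the class `TinyCorpus`, its methods, the chunk file naming, and the JSON header line are all hypothetical, and a plain in-memory dict stands in for the Berkeley DB index.

```python
import json
import os
import tempfile


class TinyCorpus:
    """Hypothetical illustration of an index over chunked flat files.

    Not the Corpora API; a dict stands in for the Berkeley DB index.
    """

    def __init__(self, path, chunk_size=1024):
        self.path = path
        self.chunk_size = chunk_size  # soft limit on bytes per chunk file
        self.index = {}               # doc_id -> (chunk_no, offset, length)
        self.current_chunk = 0
        os.makedirs(path, exist_ok=True)

    def _chunk_path(self, chunk_no):
        return os.path.join(self.path, "chunk_%d.txt" % chunk_no)

    def add(self, doc_id, text, **headers):
        # One record = a JSON header line followed by the raw text.
        record = (json.dumps(headers, sort_keys=True) + "\n" + text + "\n").encode("utf-8")
        chunk = self._chunk_path(self.current_chunk)
        size = os.path.getsize(chunk) if os.path.exists(chunk) else 0
        # Start a new chunk file when the current one would overflow.
        if size > 0 and size + len(record) > self.chunk_size:
            self.current_chunk += 1
            chunk = self._chunk_path(self.current_chunk)
            size = 0
        with open(chunk, "ab") as f:
            f.write(record)
        # The index only stores where the record lives, not the text itself.
        self.index[doc_id] = (self.current_chunk, size, len(record))

    def get(self, doc_id):
        chunk_no, offset, length = self.index[doc_id]
        with open(self._chunk_path(chunk_no), "rb") as f:
            f.seek(offset)
            raw = f.read(length).decode("utf-8")
        header_line, text = raw.split("\n", 1)
        return json.loads(header_line), text[:-1]  # drop the trailing newline


corpus = TinyCorpus(tempfile.mkdtemp(), chunk_size=64)
corpus.add("doc-1", "First document.", lang="en")
corpus.add("doc-2", "Second document, long enough to start a new chunk file.", lang="en")
headers, text = corpus.get("doc-2")
```

Because the index holds only small (chunk, offset, length) entries, a lookup costs one index read plus one seek, regardless of how large the collection grows; the chunk files themselves stay readable with any text editor.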

Repository

https://github.com/cypreess/corpora.git

Project Slug

corpora

Last Built

5 years, 10 months ago (build passed)

Home Page

https://github.com/cypreess/corpora

Tags

nlp, text, corpus, corpora

Short URLs

corpora.readthedocs.io
corpora.rtfd.io

Default Version

latest

'latest' Version

master