Paperweight¶
Paperweight is a Python package for hacking LaTeX documents.
Paperweight operates on LaTeX documents either on the local filesystem, or as a commit in a git repository. Paperweight can be used to manipulate documents, such as embedded input files and bibliographies, or can provide analytics such as rich information about references made in a document. See the API documentation below for further information.
Paperweight is developed on GitHub under an open BSD license.
Installation¶
You can install the latest Paperweight and its dependencies from PyPI:
pip install paperweight
You can also install the bleeding-edge from git:
pip install git+git://github.com/jonathansick/paperweight.git
Alternatively you can develop directly on the git source repository:
git clone https://github.com/jonathansick/paperweight.git
cd paperweight
python setup.py develop
python setup.py build_sphinx
Documentation¶
Tutorials¶
Paperweight Basics¶
This guide provides a brief demo of what you can do with Paperweight. Here we’ll use a LaTeX document that I have handy, but you may want to try the same commands on your own papers. There are some issues typesetting ipython notebooks on Read The Docs, so you may prefer to view this notebook with the ipython notebook viewer.
First, let’s set our current working directory to match the paper for convenience.
from pprint import pprint
import os
os.chdir("/Users/jsick/Dropbox/m31/writing/skysubpub")
!rm embedded_demo.tex
rm: embedded_demo.tex: No such file or directory
If you’re processing TeX documents in an automated fashion, you might
not know a priori what file contains the root of a LaTeX document. By
root we mean the file with a '\documentclass'
declaration.
Paperweight provides an API to find this. (Note that
find_root_tex_document()
can accept a directory path as an argument,
so you are not restricted to the current working directory)
from paperweight import texutils
root_tex_path = texutils.find_root_tex_document()
print(root_tex_path)
./skysub.tex
Working with LaTeX Documents¶
Now that we know where our LaTeX document is, we can open it as a
FilesystemTexDocument
. Note we also provide a GitTexDocument
class to work with documents stored in an arbitrary commit within a git
repo. We’ll get to that in a separate demo.
from paperweight.document import FilesystemTexDocument
doc = FilesystemTexDocument(root_tex_path)
With a FilesystemTexDocument
you can ask questions about the
document, such as what BibTeX file it uses:
print(doc.bib_name)
print(doc.bib_path)
master.bib
/Users/jsick/Library/texmf/bibtex/bib/master.bib
or what the names and word-count locations of the sections are:
doc.sections
[(908, u'Introduction'),
(2601, u'Observations'),
(4641, u'Image Preparation'),
(6186, u'Sky Flat Fielding and Median Sky Subtraction'),
(7762, u'Photometric calibration'),
(8317, u'Analysis of Sky Flat Fielding and Background Subtraction Methods'),
(12500, u'Sky Offset Optimization'),
(14284, u'Analysis of Scalar Sky Offsets'),
(17032, u'Systematic Uncertainties in Surface Brightness Reconstruction'),
(18456, u'Conclusions')]
or what other tex files it includes using \input{}
commands:
doc.find_input_documents()
[u'tables/nightset_medsky_offset_hierarchy.tex',
u'tables/nightset_medsky_scalar_resid_diffs.tex']
We can manipulate our LaTeX document too. For instance, we can embed the input TeX files and bibliography directly into the main text body:
doc.inline_inputs()
doc.inline_bbl()
Now you’ll see that we no longer reference other tex files or a bibtex file since all text content is embedded into the root TeX document. This can be handy for submitting the article to a journal (in fact the preprint tool uses Paperweight to do just that).
print(doc.find_input_documents())
print(doc.bib_name)
[]
None
We can delete comments from the LaTeX source as well. When we do that you’ll notice that the sections now appear at earlier word count locations.
doc.remove_comments()
doc.sections
[(812, u'Introduction'),
(2505, u'Observations'),
(4365, u'Image Preparation'),
(5530, u'Sky Flat Fielding and Median Sky Subtraction'),
(7011, u'Photometric calibration'),
(7548, u'Analysis of Sky Flat Fielding and Background Subtraction Methods'),
(10809, u'Sky Offset Optimization'),
(12497, u'Analysis of Scalar Sky Offsets'),
(14942, u'Systematic Uncertainties in Surface Brightness Reconstruction'),
(16082, u'Conclusions')]
You can easily write the modified TeX source back to the filesystem with
the write()
method:
doc.write("embedded_demo.tex")
Extracting Citation Information¶
One of the goals of Paperweight is to allow us to understand our scientific documents. A big part of that is understanding how we cite other papers.
With our document, we can ask for what references are made in the
document according to the cite keys used in \cite*{}
commands:
doc.bib_keys
[u'Saglia:2010',
u'Courteau:2014',
u'Courteau:2011',
u'Athanassoula:2006',
u'Nelder:1965',
u'de-Jong:1996b',
u'Maraston:1998',
u'Marigo:2008',
u'Conroy:2010b',
u'Barmby:2006',
u'Pforr:2012',
u'Maraston:2005',
u'Maraston:2006',
u'Williams:2003',
u'Williams:2002',
u'Bruzual:2007',
u'Brown:2009a',
u'Press:2007',
u'Olsen:2006',
u'Bertin:2002',
u'McConnachie:2005',
u'McConnachie:2009',
u'Bertin:1996',
u'Bertin:2006',
u'Brown:2008',
u'Adams:1996',
u'Dalcanton:2012',
u'Brown:2003',
u'Dutton:2005',
u'Sick:2013a',
u'Marmo:2008',
u'Massey:2006',
u'Irwin:2005',
u'Skrutskie:2006',
u'MacArthur:2004',
u'Conroy:2013',
u'Ibata:2005',
u'Berriman:2008',
u'Kormendy:2004',
u'Taylor:2011',
u'Beaton:2007',
u'Puget:2004',
u'Brown:2006',
u'Vaduvescu:2004',
u'Worthey:2005']
This is useful, but we can go deeper by understanding the context in
which these works are cited. To do this we can use the
extract_citation_context()
method to generate a dictionary, keyed by
bib keys, of all citation instances in the document. In this example
paper I’ve cited 45 works:
cites = doc.extract_citation_context()
print(len(cites))
45
Each entry in the cites
dictionary is a list of specific occurences
where that work was cited. Thus its easy to count the number of times
each work was cited:
for cite_key, instances in cites.iteritems():
print("{0} cited {1:d} time(s)".format(cite_key, len(instances)))
Saglia:2010 cited 1 time(s)
Courteau:2014 cited 2 time(s)
Courteau:2011 cited 1 time(s)
Athanassoula:2006 cited 1 time(s)
Nelder:1965 cited 1 time(s)
de-Jong:1996b cited 1 time(s)
Maraston:1998 cited 1 time(s)
Marigo:2008 cited 1 time(s)
Conroy:2010b cited 1 time(s)
Barmby:2006 cited 5 time(s)
Pforr:2012 cited 1 time(s)
Maraston:2005 cited 1 time(s)
Maraston:2006 cited 1 time(s)
Williams:2003 cited 1 time(s)
Williams:2002 cited 1 time(s)
Bruzual:2007 cited 1 time(s)
Brown:2009a cited 1 time(s)
Press:2007 cited 1 time(s)
Olsen:2006 cited 1 time(s)
Bertin:2002 cited 1 time(s)
McConnachie:2005 cited 1 time(s)
McConnachie:2009 cited 1 time(s)
Bertin:1996 cited 2 time(s)
Bertin:2006 cited 1 time(s)
Brown:2008 cited 1 time(s)
Adams:1996 cited 4 time(s)
Dalcanton:2012 cited 2 time(s)
Brown:2003 cited 1 time(s)
Dutton:2005 cited 1 time(s)
Sick:2013a cited 1 time(s)
Marmo:2008 cited 1 time(s)
Massey:2006 cited 1 time(s)
Irwin:2005 cited 1 time(s)
Skrutskie:2006 cited 2 time(s)
MacArthur:2004 cited 1 time(s)
Conroy:2013 cited 1 time(s)
Ibata:2005 cited 1 time(s)
Berriman:2008 cited 2 time(s)
Kormendy:2004 cited 1 time(s)
Taylor:2011 cited 1 time(s)
Beaton:2007 cited 2 time(s)
Puget:2004 cited 2 time(s)
Brown:2006 cited 1 time(s)
Vaduvescu:2004 cited 7 time(s)
Worthey:2005 cited 1 time(s)
It looks like I’ve cited Vaduvescu:2004
a lot. Lets look at where it
was cited:
print([c['section'] for c in cites['Vaduvescu:2004']])
[(812, u'Introduction'), (2505, u'Observations'), (2505, u'Observations'), (2505, u'Observations'), (7548, u'Analysis of Sky Flat Fielding and Background Subtraction Methods'), (7548, u'Analysis of Sky Flat Fielding and Background Subtraction Methods'), (12497, u'Analysis of Scalar Sky Offsets')]
In the list above, the first item lists the cumulative word count where the section starts, while the second item is the name of the section.
There’s a lot of other information associated with each citation
instance. Here’s metadata associated with the first reference to
Vaduvescu:2004
:
pprint(cites['Vaduvescu:2004'][0])
{'position': 2235,
'section': (812, u'Introduction'),
'wordsafter': u'also found detector systems , case ( decommissioned ) CFHT - IR camera , add time - varying background signal',
'wordsbefore': u'Spatial structures NIR sky leave residual shapes background subtracted disk images ultimately affect ability produce seamless NIR mosaic M31 .'}
API Reference¶
paperweight.document¶
Object oriented access to LaTeX documents.
The paperweight.document
module provides object-oriented interfaces
for manipulating or mining LaTeX documents.
Much of the functionality of the paperweight.texutils
,
paperweight.gitio
and paperweight.nlputils
modules can be
accessed through this interface.
Depending on how the LaTeX document is stored, you should use either of two
document classes.
paperweight.document.FilesystemTexDocument
should be used for regular
documents in the filesystem.
If you wish to operate on documents stored within a certain commit of a
checked-out Git repository, then use
paperweight.document.GitTexDocument
.
The interfaces for both classes are consistent since they inherit from
paperweight.document.TexDocument
under the hood.
-
class
paperweight.document.
FilesystemTexDocument
(path, recursive=True)¶ Bases:
paperweight.document.TexDocument
A TeX document derived from a file in the filesystem.
Parameters: filepath : unicode
Path to the ‘.tex’ on the filesystem.
recursive : bool
If True (default), then tex documents input by this root document will be opened.
Attributes
bib_keys
List of all bib keys in the document (and input documents). bib_name
Name of the BibTeX bibliography file (e.g., 'mybibliography.bib'
).bib_path
Absolute file path to the .bib bibliography document. bibitems
List of bibitem strings appearing in the document. sections
List with tuples of section names and positions. Methods
extract_citation_context
([n_words])Generate a dictionary of all bib keys in the document (and input documents), with rich of metadata about the context of each citation in the document. find_input_documents
()Find all tex documents input by this root document. inline_bbl
()Inline a compiled bibliography (.bbl) in place of a bibliography environment. inline_inputs
()Inline all input latex files references by this document. remove_comments
([recursive])Remove latex comments from document (modifies document in place). write
(path)Write the document’s text to a path
on the filesystem.-
bib_keys
¶ List of all bib keys in the document (and input documents).
-
bib_name
¶ Name of the BibTeX bibliography file (e.g.,
'mybibliography.bib'
).
-
bib_path
¶ Absolute file path to the .bib bibliography document.
-
bibitems
¶ List of bibitem strings appearing in the document.
-
extract_citation_context
(n_words=20)¶ Generate a dictionary of all bib keys in the document (and input documents), with rich of metadata about the context of each citation in the document.
For example, suppose
'Sick:2014'
is cited twice within a document. Then the dictionary returned by this method will have a length-2 list under the'Sick:2014'
key. Each item in this list will be a dictionary providing metadata of the context for that citation. Fields of this dictionary are:position
: (int) the cumulative word count at which the citation occurs.wordsbefore
: (unicode) text occuring before the citation.wordsafter
: (unicode) text occuring after the citation.section
: (unicode) name of the section in which the citation occurs.
Parameters: n_words : int
Number of words before and after the citation to extract for context.
Returns: bib_keys : dict
Dictionary, keyed by BibTeX cite key, where entires are lists of instances of citations. See above for the format of the instance metadata.
-
find_input_documents
()¶ Find all tex documents input by this root document.
Returns: paths : list
List of filepaths for input documents. Paths are relative to the document (i.e., as written in the latex document).
-
inline_bbl
()¶ Inline a compiled bibliography (.bbl) in place of a bibliography environment. The document is modified in place.
-
inline_inputs
()¶ Inline all input latex files references by this document. The inlining is accomplished recursively. The document is modified in place.
-
remove_comments
(recursive=True)¶ Remove latex comments from document (modifies document in place).
Parameters: recursive : bool
Remove comments from all input LaTeX documents (default
True
).
-
sections
¶ List with tuples of section names and positions. Positions of section names are measured by cumulative word count.
-
write
(path)¶ Write the document’s text to a
path
on the filesystem.
-
-
class
paperweight.document.
GitTexDocument
(git_path, git_hash, repo_dir='.', recursive=True)¶ Bases:
paperweight.document.TexDocument
A tex document derived from a file in the git repository.
Parameters: git_path : str
Path to the document in the git repository, relative to the root of the repository.
git_hash : str
Any SHA or git tag that can resolve into a commit in the git repository.
repo_dir : str
Path from current working directory to the root of the git repository.
Attributes
bib_keys
List of all bib keys in the document (and input documents). bib_name
Name of the BibTeX bibliography file (e.g., 'mybibliography.bib'
).bib_path
Absolute file path to the .bib bibliography document. bibitems
List of bibitem strings appearing in the document. sections
List with tuples of section names and positions. Methods
extract_citation_context
([n_words])Generate a dictionary of all bib keys in the document (and input documents), with rich of metadata about the context of each citation in the document. find_input_documents
()Find all tex documents input by this root document. remove_comments
([recursive])Remove latex comments from document (modifies document in place). write
(path)Write the document’s text to a path
on the filesystem.-
bib_keys
¶ List of all bib keys in the document (and input documents).
-
bib_name
¶ Name of the BibTeX bibliography file (e.g.,
'mybibliography.bib'
).
-
bib_path
¶ Absolute file path to the .bib bibliography document.
-
bibitems
¶ List of bibitem strings appearing in the document.
-
extract_citation_context
(n_words=20)¶ Generate a dictionary of all bib keys in the document (and input documents), with rich of metadata about the context of each citation in the document.
For example, suppose
'Sick:2014'
is cited twice within a document. Then the dictionary returned by this method will have a length-2 list under the'Sick:2014'
key. Each item in this list will be a dictionary providing metadata of the context for that citation. Fields of this dictionary are:position
: (int) the cumulative word count at which the citation occurs.wordsbefore
: (unicode) text occuring before the citation.wordsafter
: (unicode) text occuring after the citation.section
: (unicode) name of the section in which the citation occurs.
Parameters: n_words : int
Number of words before and after the citation to extract for context.
Returns: bib_keys : dict
Dictionary, keyed by BibTeX cite key, where entires are lists of instances of citations. See above for the format of the instance metadata.
-
find_input_documents
()¶ Find all tex documents input by this root document.
Returns: paths : list
List of filepaths for input documents. Paths are relative to the document (i.e., as written in the latex document).
-
remove_comments
(recursive=True)¶ Remove latex comments from document (modifies document in place).
Parameters: recursive : bool
Remove comments from all input LaTeX documents (default
True
).
-
sections
¶ List with tuples of section names and positions. Positions of section names are measured by cumulative word count.
-
write
(path)¶ Write the document’s text to a
path
on the filesystem.
-
-
class
paperweight.document.
TexDocument
(text)¶ Bases:
object
Baseclass for a tex document.
Parameters: text : unicode
Unicode-encoded text of the latex document.
Attributes
text (unicode) Text of the document as a unicode string. Methods
extract_citation_context
([n_words])Generate a dictionary of all bib keys in the document (and input documents), with rich of metadata about the context of each citation in the document. find_input_documents
()Find all tex documents input by this root document. remove_comments
([recursive])Remove latex comments from document (modifies document in place). write
(path)Write the document’s text to a path
on the filesystem.-
bib_keys
¶ List of all bib keys in the document (and input documents).
-
bib_name
¶ Name of the BibTeX bibliography file (e.g.,
'mybibliography.bib'
).
-
bib_path
¶ Absolute file path to the .bib bibliography document.
-
bibitems
¶ List of bibitem strings appearing in the document.
-
extract_citation_context
(n_words=20)¶ Generate a dictionary of all bib keys in the document (and input documents), with rich of metadata about the context of each citation in the document.
For example, suppose
'Sick:2014'
is cited twice within a document. Then the dictionary returned by this method will have a length-2 list under the'Sick:2014'
key. Each item in this list will be a dictionary providing metadata of the context for that citation. Fields of this dictionary are:position
: (int) the cumulative word count at which the citation occurs.wordsbefore
: (unicode) text occuring before the citation.wordsafter
: (unicode) text occuring after the citation.section
: (unicode) name of the section in which the citation occurs.
Parameters: n_words : int
Number of words before and after the citation to extract for context.
Returns: bib_keys : dict
Dictionary, keyed by BibTeX cite key, where entires are lists of instances of citations. See above for the format of the instance metadata.
-
find_input_documents
()¶ Find all tex documents input by this root document.
Returns: paths : list
List of filepaths for input documents. Paths are relative to the document (i.e., as written in the latex document).
-
remove_comments
(recursive=True)¶ Remove latex comments from document (modifies document in place).
Parameters: recursive : bool
Remove comments from all input LaTeX documents (default
True
).
-
sections
¶ List with tuples of section names and positions. Positions of section names are measured by cumulative word count.
-
write
(path)¶ Write the document’s text to a
path
on the filesystem.
-
paperweight.texutils¶
Utilities for maniuplating latex documents.
-
paperweight.texutils.
find_root_tex_document
(base_dir='.')¶ Find the tex article in the current directory that can be considered a root. We do this by searching contents for
'\documentclass'
.Parameters: base_dir : str
Directory to search for LaTeX documents, relative to the current working directory.
Returns: tex_path : str
Path to the root tex document relative to the current working directory.
-
paperweight.texutils.
iter_tex_documents
(base_dir='.')¶ Iterate through all .tex documents in the current directory.
-
paperweight.texutils.
inline
(root_text, base_dir='', replacer=None, ifexists_replacer=None)¶ Inline all input latex files.
The inlining is accomplished recursively. All files are opened as UTF-8 unicode files.
Parameters: root_txt : unicode
Text to process (and include in-lined files).
base_dir : str
Base directory of file containing
root_text
. Defaults to the current working directory.replacer : function
Function called by
re.sub()
to replace\input
expressions with a latex document. Changeable only for testing purposes.ifexists_replacer : function
Function called by
re.sub()
to replace\InputIfExists
expressions with a latex document. Changeable only for testing purposes.Returns: txt : unicode
Text with referenced files included.
-
paperweight.texutils.
inline_blob
(commit_ref, root_text, base_dir='.', repo_dir='')¶ Inline all input latex files that exist as git blobs in a tree object.
The inlining is accomplished recursively. All files are opened as UTF-8 unicode files.
Parameters: commit_ref : str
String identifying a git commit/tag.
root_text : unicode
Text of tex document where referenced files will be inlined.
base_dir : str
Directory of the master tex document, relative to the repo_dir.
repo_dir : str
Directory of the containing git repository.
Returns: txt : unicode
Text with referenced files included.
-
paperweight.texutils.
inline_bbl
(root_tex, bbl_tex)¶ Inline a compiled bibliography (.bbl) in place of a bibliography environment.
Parameters: root_tex : unicode
Text to process.
bbl_tex : unicode
Text of bibliography file.
Returns: txt : unicode
Text with bibliography included.
-
paperweight.texutils.
remove_comments
(tex)¶ Delete latex comments from a manuscript.
Parameters: tex : unicode
The latex manuscript
Returns: tex : unicode
The manuscript without comments.
paperweight.gitio¶
Utilities for reading content in git repositories.
-
paperweight.gitio.
read_git_blob
(commit_ref, path, repo_dir='.')¶ Get text from a git blob.
Parameters: commit_ref : str
Any SHA or git tag that can resolve into a commit in the git repository.
path : str
Path to the document in the git repository, relative to the root of the repository.
repo_dir : str
Path from current working directory to the root of the git repository.
Returns: text : unicode
The document text.
-
paperweight.gitio.
absolute_git_root_dir
(fpath='')¶ Absolute path to the git root directory containing a given file or directory.
Indices¶
Contributors¶
Paperweight was created as a hack for .Astronomy 6 (2014) in Chicago.