Python HOWTOs¶
Python HOWTOs are documents that cover a single, specific topic, and attempt to cover it fairly completely. Modelled on the Linux Documentation Project’s HOWTO collection, this collection is an effort to foster documentation that’s more detailed than the Python Library Reference.
Currently, the HOWTOs are:
Python Advocacy HOWTO¶
Author: A.M. Kuchling
Release: 0.03
Abstract
It’s usually difficult to get your management to accept open source software, and Python is no exception to this rule. This document discusses reasons to use Python, strategies for winning acceptance, facts and arguments you can use, and cases where you shouldn’t try to use Python.
Reasons to Use Python¶
There are several reasons to incorporate a scripting language into your development process, and this section will discuss them, and why Python has some properties that make it a particularly good choice.
Programmability¶
Programs are often organized in a modular fashion. Lower-level operations are grouped together, and called by higher-level functions, which may in turn be used as basic operations by still further upper levels.
For example, the lowest level might define a very low-level set of functions for accessing a hash table. The next level might use hash tables to store the headers of a mail message, mapping a header name like Date to a value such as Tue, 13 May 1997 20:00:54 -0400. A yet higher level may operate on message objects, without knowing or caring that message headers are stored in a hash table, and so forth.
Often, the lowest levels do very simple things; they implement a data structure such as a binary tree or hash table, or they perform some simple computation, such as converting a date string to a number. The higher levels then contain logic connecting these primitive operations. Using this approach, the primitives can be seen as basic building blocks which are then glued together to produce the complete product.
Why is this design approach relevant to Python? Because Python is well suited to functioning as such a glue language. A common approach is to write a Python module that implements the lower level operations; for the sake of speed, the implementation might be in C, Java, or even Fortran. Once the primitives are available to Python programs, the logic underlying higher level operations is written in the form of Python code. The high-level logic is then more understandable, and easier to modify.
John Ousterhout wrote a paper that explains this idea at greater length, entitled “Scripting: Higher Level Programming for the 21st Century”. I recommend that you read this paper; see the references for the URL. Ousterhout is the inventor of the Tcl language, and therefore argues that Tcl should be used for this purpose; he only briefly refers to other languages such as Python, Perl, and Lisp/Scheme, but in reality, Ousterhout’s argument applies to scripting languages in general, since you could equally write extensions for any of the languages mentioned above.
Prototyping¶
In The Mythical Man-Month, Frederick Brooks suggests the following rule when planning software projects: “Plan to throw one away; you will anyway.” Brooks is saying that the first attempt at a software design often turns out to be wrong; unless the problem is very simple or you’re an extremely good designer, you’ll find that new requirements and features become apparent once development has actually started. If these new requirements can’t be cleanly incorporated into the program’s structure, you’re presented with two unpleasant choices: hammer the new features into the program somehow, or scrap everything and write a new version of the program, taking the new features into account from the beginning.
Python provides you with a good environment for quickly developing an initial prototype. That lets you get the overall program structure and logic right, and you can fine-tune small details in the fast development cycle that Python provides. Once you’re satisfied with the GUI interface or program output, you can translate the Python code into C++, Fortran, Java, or some other compiled language.
Prototyping means you have to be careful not to use too many Python features that are hard to implement in your other language. Using eval(), or regular expressions, or the pickle module, means that you’re going to need C or Java libraries for formula evaluation, regular expressions, and serialization, for example. But it’s not hard to avoid such tricky code, and in the end the translation usually isn’t very difficult. The resulting code can be rapidly debugged, because any serious logical errors will have been removed from the prototype, leaving only more minor slip-ups in the translation to track down.
This strategy builds on the earlier discussion of programmability. Using Python as glue to connect lower-level components has obvious relevance for constructing prototype systems. In this way Python can help you with development, even if end users never come in contact with Python code at all. If the performance of the Python version is adequate and corporate politics allow it, you may not need to do a translation into C or Java, but it can still be faster to develop a prototype and then translate it, instead of attempting to produce the final version immediately.
One example of this development strategy is Microsoft Merchant Server. Version 1.0 was written in pure Python, by a company that subsequently was purchased by Microsoft. Version 2.0 began to translate the code into C++, shipping with some C++ code and some Python code. Version 3.0 didn’t contain any Python at all; all the code had been translated into C++. Even though the product doesn’t contain a Python interpreter, the Python language has still served a useful purpose by speeding up development.
This is a very common use for Python. Past conference papers have also described this approach for developing high-level numerical algorithms; see David M. Beazley and Peter S. Lomdahl’s paper “Feeding a Large-scale Physics Application to Python” in the references for a good example. If an algorithm’s basic operations are things like “Take the inverse of this 4000x4000 matrix”, and are implemented in some lower-level language, then Python has almost no additional performance cost; the extra time required for Python to evaluate an expression like m.invert() is dwarfed by the cost of the actual computation. It’s particularly good for applications where seemingly endless tweaking is required to get things right. GUI interfaces and Web sites are prime examples.
The Python code is also shorter and faster to write (once you’re familiar with Python), so it’s easier to throw it away if you decide your approach was wrong; if you’d spent two weeks working on it instead of just two hours, you might waste time trying to patch up what you’ve got out of a natural reluctance to admit that those two weeks were wasted. Truthfully, those two weeks haven’t been wasted, since you’ve learnt something about the problem and the technology you’re using to solve it, but it’s human nature to view this as a failure of some sort.
Simplicity and Ease of Understanding¶
Python is definitely not a toy language that’s only usable for small tasks. The language features are general and powerful enough to enable it to be used for many different purposes. It’s useful at the small end, for 10- or 20-line scripts, but it also scales up to larger systems that contain thousands of lines of code.
However, this expressiveness doesn’t come at the cost of an obscure or tricky syntax. While Python has some dark corners that can lead to obscure code, there are relatively few such corners, and proper design can isolate their use to only a few classes or modules. It’s certainly possible to write confusing code by using too many features with too little concern for clarity, but most Python code can look a lot like a slightly-formalized version of human-understandable pseudocode.
In The New Hacker’s Dictionary, Eric S. Raymond gives the following definition for “compact”:
Compact adj. Of a design, describes the valuable property that it can all be apprehended at once in one’s head. This generally means the thing created from the design can be used with greater facility and fewer errors than an equivalent tool that is not compact. Compactness does not imply triviality or lack of power; for example, C is compact and FORTRAN is not, but C is more powerful than FORTRAN. Designs become non-compact through accreting features and cruft that don’t merge cleanly into the overall design scheme (thus, some fans of Classic C maintain that ANSI C is no longer compact).
In this sense of the word, Python is quite compact, because the language has just a few ideas, which are used in lots of places. Take namespaces, for example. Import a module with import math, and you create a new namespace called math. Classes are also namespaces that share many of the properties of modules, and have a few of their own; for example, you can create instances of a class. Instances? They’re yet another namespace. Namespaces are currently implemented as Python dictionaries, so they have the same methods as the standard dictionary data type: .keys() returns all the keys, and so forth.
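For instance, you can watch all three kinds of namespace at work (a minimal sketch; the class and attribute names are invented for the example):

import math

print(math.__dict__['pi'])          # a module namespace is backed by a dictionary

class Greeter(object):
    default = 'Hello'               # class attributes live in Greeter.__dict__

g = Greeter()
g.name = 'World'                    # instance attributes live in g.__dict__
print(sorted(g.__dict__.keys()))    # the familiar dict methods work here too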
This simplicity arises from Python’s development history. The language syntax derives from different sources; ABC, a relatively obscure teaching language, is one primary influence, and Modula-3 is another. (For more information about ABC and Modula-3, consult their respective Web sites at http://www.cwi.nl/~steven/abc/ and http://www.m3.org.) Other features have come from C, Icon, Algol-68, and even Perl. Python hasn’t really innovated very much, but instead has tried to keep the language small and easy to learn, building on ideas that have been tried in other languages and found useful.
Simplicity is a virtue that should not be underestimated. It lets you learn the language more quickly, and then rapidly write code – code that often works the first time you run it.
Java Integration¶
If you’re working with Java, Jython (http://www.jython.org/) is definitely worth your attention. Jython is a re-implementation of Python in Java that compiles Python code into Java bytecodes. The resulting environment has very tight, almost seamless, integration with Java. It’s trivial to access Java classes from Python, and you can write Python classes that subclass Java classes. Jython can be used for prototyping Java applications in much the same way CPython is used, and it can also be used for test suites for Java code, or embedded in a Java application to add scripting capabilities.
Arguments and Rebuttals¶
Let’s say that you’ve decided upon Python as the best choice for your application. How can you convince your management, or your fellow developers, to use Python? This section lists some common arguments against using Python, and provides some possible rebuttals.
Python is freely available software that doesn’t cost anything. How good can it be?
Very good, indeed. These days Linux and Apache, two other pieces of open source software, are becoming more respected as alternatives to commercial software, but Python hasn’t had all the publicity.
Python has been around for several years, with many users and developers. Accordingly, the interpreter has been used by many people, and has gotten most of the bugs shaken out of it. While bugs are still discovered at intervals, they’re usually either quite obscure (they’d have to be, for no one to have run into them before) or they involve interfaces to external libraries. The internals of the language itself are quite stable.
Having the source code should be viewed as making the software available for peer review; people can examine the code, suggest (and implement) improvements, and track down bugs. To find out more about the idea of open source code, along with arguments and case studies supporting it, go to http://www.opensource.org.
Who’s going to support it?
Python has a sizable community of developers, and the number is still growing. The Internet community surrounding the language is an active one, and is worth being considered another one of Python’s advantages. Most questions posted to the comp.lang.python newsgroup are quickly answered by someone.
Should you need to dig into the source code, you’ll find it’s clear and well-organized, so it’s not very difficult to write extensions and track down bugs yourself. If you’d prefer to pay for support, there are companies and individuals who offer commercial support for Python.
Who uses Python for serious work?
Lots of people; one interesting thing about Python is the surprising diversity of applications that it’s been used for. People are using Python to:
- Run Web sites
- Write GUI interfaces
- Control number-crunching code on supercomputers
- Make a commercial application scriptable by embedding the Python interpreter inside it
- Process large XML data sets
- Build test suites for C or Java code
Whatever your application domain is, there’s probably someone who’s used Python for something similar. Yet, despite being usable for such high-end applications, Python’s still simple enough to use for little jobs.
See http://wiki.python.org/moin/OrganizationsUsingPython for a list of some of the organizations that use Python.
What are the restrictions on Python’s use?
They’re practically nonexistent. Consult the Misc/COPYRIGHT file in the source distribution, or the History and License section of the documentation, for the full language, but it boils down to three conditions:
- You have to leave the copyright notice on the software; if you don’t include the source code in a product, you have to put the copyright notice in the supporting documentation.
- Don’t claim that the institutions that have developed Python endorse your product in any way.
- If something goes wrong, you can’t sue for damages. Practically all software licenses contain this condition.
Notice that you don’t have to provide source code for anything that contains Python or is built with it. Also, the Python interpreter and accompanying documentation can be modified and redistributed in any way you like, and you don’t have to pay anyone any licensing fees at all.
Why should we use an obscure language like Python instead of well-known language X?
I hope this HOWTO, and the documents listed in the final section, will help convince you that Python isn’t obscure, and has a healthily growing user base. One word of advice: always present Python’s positive advantages, instead of concentrating on language X’s failings. People want to know why a solution is good, rather than why all the other solutions are bad. So instead of attacking a competing solution on various grounds, simply show how Python’s virtues can help.
Useful Resources¶
- http://www.pythonology.com/success
- The Python Success Stories are a collection of stories from successful users of Python, with the emphasis on business and corporate users.
- http://www.tcl.tk/doc/scripting.html
- John Ousterhout’s white paper on scripting is a good argument for the utility of scripting languages, though naturally enough, he emphasizes Tcl, the language he developed. Most of the arguments would apply to any scripting language.
- http://www.python.org/workshops/1997-10/proceedings/beazley.html
- The authors, David M. Beazley and Peter S. Lomdahl, describe their use of Python at Los Alamos National Laboratory. It’s another good example of how Python can help get real work done. This quotation from the paper has been echoed by many people: “Originally developed as a large monolithic application for massively parallel processing systems, we have used Python to transform our application into a flexible, highly modular, and extremely powerful system for performing simulation, data analysis, and visualization. In addition, we describe how Python has solved a number of important problems related to the development, debugging, deployment, and maintenance of scientific software.”
- http://pythonjournal.cognizor.com/pyj1/Everitt-Feit_interview98-V1.html
- This interview with Andy Feit, discussing Infoseek’s use of Python, can be used to show that choosing Python didn’t introduce any difficulties into a company’s development process, and provided some substantial benefits.
- http://www.python.org/workshops/1997-10/proceedings/stein.ps
- For the 6th Python conference, Greg Stein presented a paper that traced Python’s adoption and usage at a startup called eShop, and later at Microsoft.
- http://www.opensource.org
- Management may be doubtful of the reliability and usefulness of software that wasn’t written commercially. This site presents arguments that show how open source software can have considerable advantages over closed-source software.
- http://www.faqs.org/docs/Linux-mini/Advocacy.html
- The Linux Advocacy mini-HOWTO was the inspiration for this document, and is also well worth reading for general suggestions on winning acceptance for a new technology, such as Linux or Python. In general, you won’t make much progress by simply attacking existing systems and complaining about their inadequacies; this often ends up looking like unfocused whining. It’s much better to point out some of the many areas where Python is an improvement over other systems.
Porting Python 2 Code to Python 3¶
Author: Brett Cannon
Abstract
With Python 3 being the future of Python while Python 2 is still in active use, it is good to have your project available for both major releases of Python. This guide is meant to help you choose which strategy works best for your project to support both Python 2 & 3 along with how to execute that strategy.
If you are looking to port an extension module instead of pure Python code, please see Porting Extension Modules to 3.0.
Choosing a Strategy¶
When a project makes the decision that it’s time to support both Python 2 & 3, a decision needs to be made as to how to go about accomplishing that goal. The chosen strategy will depend on how large the project’s existing codebase is and how much divergence you want between your Python 2 codebase and your Python 3 one (e.g., starting a new version with Python 3).
If your project is brand-new or does not have a large codebase, then you may want to consider writing/porting all of your code for Python 3 and using 3to2 to port your code for Python 2.
If you would prefer to maintain a codebase which is semantically and syntactically compatible with Python 2 & 3 simultaneously, you can write Python 2/3 Compatible Source. While this tends to lead to somewhat non-idiomatic code, it does mean you keep a rapid development process for you, the developer.
Finally, you do have the option of using 2to3 to translate Python 2 code into Python 3 code (with some manual help). This can take the form of branching your code and using 2to3 to start a Python 3 branch. You can also have users perform the translation at installation time automatically so that you only have to maintain a Python 2 codebase.
Regardless of which approach you choose, porting is not as hard or time-consuming as you might initially think. You can also tackle the problem piecemeal as a good portion of porting is simply updating your code to follow current best practices in a Python 2/3 compatible way.
Universal Bits of Advice¶
Regardless of what strategy you pick, there are a few things you should consider.
One is to make sure you have a robust test suite. You need to make sure everything continues to work, just like when you support a new minor version of Python. This means making sure your test suite is thorough and is ported properly between Python 2 & 3. You will also most likely want to use something like tox to automate testing between both a Python 2 and Python 3 VM.
Two, once your project has Python 3 support, make sure to add the proper classifier on the Cheeseshop (PyPI). To have your project listed as Python 3 compatible it must have the Python 3 classifier (from http://techspot.zzzeek.org/2011/01/24/zzzeek-s-guide-to-python-3-porting/):
setup(
    name='Your Library',
    version='1.0',
    classifiers=[
        # make sure to use :: Python *and* :: Python :: 3 so
        # that pypi can list the package on the python 3 page
        'Programming Language :: Python',
        'Programming Language :: Python :: 3'
    ],
    packages=['yourlibrary'],
    # make sure to add custom_fixers to the MANIFEST.in
    include_package_data=True,
    # ...
)
Doing so will cause your project to show up in the Python 3 packages list. You will know you set the classifier properly as visiting your project page on the Cheeseshop will show a Python 3 logo in the upper-left corner of the page.
Three, the six project provides a library which helps iron out differences between Python 2 & 3. If you find there is a sticky point that is a continual point of contention in your translation or maintenance of code, consider using a source-compatible solution relying on six. If you have to create your own Python 2/3 compatible solution, you can use sys.version_info[0] >= 3 as a guard.
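For example, such a guard can select between two implementations at import time (a minimal sketch; to_text is an invented helper name):

import sys

if sys.version_info[0] >= 3:  # Python 3
    def to_text(data, encoding='utf-8'):
        return data.decode(encoding) if isinstance(data, bytes) else data
else:  # Python 2
    def to_text(data, encoding='utf-8'):
        return data.decode(encoding) if isinstance(data, str) else data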
Four, read all the approaches. Just because some bit of advice applies to one approach more than another doesn’t mean that some advice doesn’t apply to other strategies.
Five, drop support for older Python versions if possible. Python 2.5 introduced a lot of useful syntax and libraries which have become idiomatic in Python 3. Python 2.6 introduced future statements which make compatibility much easier if you are going from Python 2 to 3. Python 2.7 continues the trend in the stdlib. So choose the newest version of Python which you believe can be your minimum support version and work from there.
Python 3 and 3to2¶
If you are starting a new project or your codebase is small enough, you may want to consider writing your code for Python 3 and backporting to Python 2 using 3to2. Thanks to Python 3 being more strict about things than Python 2 (e.g., bytes vs. strings), the source translation can be easier and more straightforward than from Python 2 to 3. Plus it gives you more direct experience developing in Python 3 which, since it is the future of Python, is a good thing long-term.
A drawback of this approach is that 3to2 is a third-party project. This means that the Python core developers (and thus this guide) can make no promises about how well 3to2 works at any time. There is nothing to suggest, though, that 3to2 is not a high-quality project.
Python 2 and 2to3¶
Included with Python since 2.6, the 2to3 tool (and lib2to3 module) helps with porting Python 2 to Python 3 by performing various source translations. This is a perfect solution for projects which wish to branch their Python 3 code from their Python 2 codebase and maintain them as independent codebases. You can even begin preparing to use this approach today by writing future-compatible Python code which works cleanly in Python 2 in conjunction with 2to3; all steps outlined below will work with Python 2 code up to the point when the actual use of 2to3 occurs.
Use of 2to3 as an on-demand translation step at install time is also possible, preventing the need to maintain a separate Python 3 codebase, but this approach does come with some drawbacks. While users will only have to pay the translation cost once at installation, you as a developer will need to pay the cost regularly during development. If your codebase is sufficiently large then the translation step ends up acting like a compilation step, robbing you of the rapid development process you are used to with Python. Obviously the time required to translate a project will vary, so do an experimental translation just to see how long it takes to evaluate whether you prefer this approach compared to using Python 2/3 Compatible Source or simply keeping a separate Python 3 codebase.
Below are the typical steps taken by a project which uses a 2to3-based approach to supporting Python 2 & 3.
Support Python 2.7¶
As a first step, make sure that your project is compatible with Python 2.7. This is just good to do as Python 2.7 is the last release of Python 2 and thus will be used for a rather long time. It also allows for use of the -3 flag to Python to help discover places in your code which 2to3 cannot handle but are known to cause issues.
Try to Support Python 2.6 and Newer Only¶
While not possible for all projects, if you can support Python 2.6 and newer only, your life will be much easier. Various future statements, stdlib additions, etc. exist only in Python 2.6 and later which greatly assist in porting to Python 3. But if your project must keep support for Python 2.5 (or even Python 2.4) then it is still possible to port to Python 3.
Below are the benefits you gain if you only have to support Python 2.6 and newer. Some of these options are personal choice while others are strongly recommended (the ones that are more for personal choice are labeled as such). If you continue to support older versions of Python then you at least need to watch out for situations that these solutions fix.
from __future__ import print_function¶
This is a personal choice. 2to3 handles the translation from the print statement to the print function rather well so this is an optional step. This future statement does help, though, with getting used to typing print('Hello, World') instead of print 'Hello, World'.
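A minimal illustration of the statement in action:

from __future__ import print_function

print('Hello, World')            # works identically in Python 2 and 3
print('no newline here', end='') # keyword arguments now work in Python 2 too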
from __future__ import unicode_literals¶
Another personal choice. You can always mark what you want to be a (unicode) string with a u prefix to get the same effect. But regardless of whether you use this future statement or not, you must make sure you know exactly which Python 2 strings you want to be bytes, and which are to be strings. This means you should, at minimum, mark all strings that are meant to be text strings with a u prefix if you do not use this future statement.
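A short sketch (the literal values are invented); without the future statement you would write u'Report' to get the same effect:

from __future__ import unicode_literals

title = 'Report'       # a unicode string even on Python 2, thanks to the import
payload = b'\x00\x01'  # bytes must still carry the b prefix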
Bytes literals¶
This is a very important one. The ability to prefix Python 2 strings that are meant to contain bytes with a b prefix helps to very clearly delineate what is and is not a Python 3 string. When you run 2to3 on code, all Python 2 strings become Python 3 strings unless they are prefixed with b.
There are some differences between byte literals in Python 2 and those in Python 3 thanks to the bytes type just being an alias to str in Python 2. Probably the biggest “gotcha” is that indexing results in different values. In Python 2, the value of b'py'[1] is 'y', while in Python 3 it’s 121. You can avoid this disparity by always slicing at the size of a single element: b'py'[1:2] is 'y' in Python 2 and b'y' in Python 3 (i.e., close enough).
You cannot concatenate bytes and strings in Python 3. But since Python 2 aliases bytes to str, mixing them succeeds there: b'a' + u'b' works in Python 2, but b'a' + 'b' in Python 3 is a TypeError. A similar issue also comes about when doing comparisons between bytes and strings.
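The indexing difference condensed into a runnable snippet (the comments show the behaviour described above):

data = b'py'
print(repr(data[1]))    # 'y' under Python 2, 121 under Python 3
print(repr(data[1:2]))  # 'y' under Python 2, b'y' under Python 3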
Supporting Python 2.5 and Newer Only¶
If you are supporting Python 2.5 and newer there are still some features of Python that you can utilize.
from __future__ import absolute_import
¶
Implicit relative imports (e.g., importing spam.bacon from within spam.eggs with the statement import bacon) do not work in Python 3. This future statement moves away from that and allows the use of explicit relative imports (e.g., from . import bacon).
In Python 2.5 you must use the __future__ statement to get to use explicit relative imports and prevent implicit ones. In Python 2.6 explicit relative imports are available without the statement, but you still want the __future__ statement to prevent implicit relative imports. In Python 2.7 the __future__ statement is not needed. In other words, unless you are only supporting Python 2.7 or a version earlier than Python 2.5, use the __future__ statement.
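For instance, in a hypothetical package spam (the layout is invented for the example), the module spam/eggs.py would import its sibling spam/bacon.py like this:

from __future__ import absolute_import

from . import bacon   # explicit relative import; works in Python 2.5+ and Python 3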
Handle Common “Gotchas”¶
There are a few things that consistently come up as sticking points: things which 2to3 cannot handle automatically, or which can easily be done in Python 2 today to help modernize your code.
from __future__ import division¶
While the exact same outcome can be had by using the -Qnew argument to Python, using this future statement lifts the requirement that your users use the flag to get the expected behavior of division in Python 3 (e.g., 1/2 == 0.5; 1//2 == 0).
Specify when opening a file as binary¶
Unless you have been working on Windows, there is a chance you have not always bothered to add the b mode when opening a binary file (e.g., rb for binary reading). Under Python 3, binary files and text files are clearly distinct and mutually incompatible; see the io module for details. Therefore, you must make a decision of whether a file will be used for binary access (allowing to read and/or write bytes data) or text access (allowing to read and/or write unicode data).
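A sketch of the distinction (the file names are invented for the example):

with open('image.png', 'rb') as f:   # binary access: read() returns bytes
    header = f.read(8)

with open('notes.txt', 'r') as f:    # text access: read() returns text in Python 3
    first_line = f.readline()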
Text files¶
Text files created using open() under Python 2 return byte strings, while under Python 3 they return unicode strings. Depending on your porting strategy, this can be an issue.
If you want text files to return unicode strings in Python 2, you have two possibilities:
- Under Python 2.6 and higher, use io.open(). Since io.open() is essentially the same function in both Python 2 and Python 3, it will help iron out any issues that might arise.
- If pre-2.6 compatibility is needed, then you should use codecs.open() instead. This will make sure that you get back unicode strings in Python 2.
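For example (a minimal sketch; the file name and encoding are invented for illustration):

import io

# Returns unicode strings on both Python 2.6+ and Python 3.
with io.open('notes.txt', 'r', encoding='utf-8') as f:
    text = f.read()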
Subclass object¶
New-style classes have been around since Python 2.2. You need to make sure you are subclassing from object to avoid odd edge cases involving method resolution order, etc. This continues to be totally valid in Python 3 (although unneeded as all classes implicitly inherit from object).
Deal With the Bytes/String Dichotomy¶
One of the biggest issues people have when porting code to Python 3 is handling the bytes/string dichotomy. Because Python 2 allowed the str type to hold textual data, people have over the years been rather loose in their delineation of what str instances held text compared to bytes. In Python 3 you cannot be so care-free anymore and need to properly handle the difference. The key to handling this issue is to make sure that every string literal in your Python 2 code is either syntactically or functionally marked as either bytes or text data. After this is done you then need to make sure your APIs are designed to either handle a specific type or made to be properly polymorphic.
Mark Up Python 2 String Literals¶
First thing you must do is designate every single string literal in Python 2 as either textual or bytes data. If you are only supporting Python 2.6 or newer, this can be accomplished by marking bytes literals with a b prefix and then designating textual data with a u prefix or using the unicode_literals future statement.
If your project supports versions of Python pre-dating 2.6, then you should use the six project and its b() function to denote bytes literals. For text literals you can either use six’s u() function or use a u prefix.
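A sketch of what that markup looks like (assuming the six package is installed; six.b() and six.u() take native string literals and return bytes and text respectively):

import six

magic = six.b('\x89PNG')    # bytes on Python 2 and 3, no b prefix needed
banner = six.u('Welcome!')  # text on Python 2 and 3, no u prefix needed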
Decide what APIs Will Accept¶
In Python 2 it was very easy to accidentally create an API that accepted both bytes and textual data. But in Python 3, thanks to the more strict handling of disparate types, this loose usage of bytes and text together tends to fail.
Take the dict {b'a': 'bytes', u'a': 'text'} in Python 2.6. It creates the dict {u'a': 'text'} since b'a' == u'a'. But in Python 3 the equivalent dict creates {b'a': 'bytes', 'a': 'text'}, i.e., no lost data. Similar issues can crop up when transitioning Python 2 code to Python 3.
This means you need to choose what an API is going to accept and create and consistently stick to that API in both Python 2 and 3.
Bytes / Unicode Comparison¶
In Python 3, mixing bytes and unicode is forbidden in most situations; it will raise a TypeError where Python 2 would have attempted an implicit coercion between types. However, there is one case where it doesn’t and it can be very misleading:
>>> b"" == ""
False
This is because an equality comparison is required by the language to always succeed (and return False for incompatible types). However, this also means that code incorrectly ported to Python 3 can display buggy behaviour if such comparisons are silently executed. To detect such situations, Python 3 has a -b flag that will display a warning:
$ python3 -b
>>> b"" == ""
__main__:1: BytesWarning: Comparison between bytes and string
False
To turn the warning into an exception, use the -bb flag instead:
$ python3 -bb
>>> b"" == ""
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
BytesWarning: Comparison between bytes and string
Indexing bytes objects¶
Another potentially surprising change is the indexing behaviour of bytes objects in Python 3:
>>> b"xyz"[0]
120
Indeed, Python 3 bytes objects (as well as bytearray objects) are sequences of integers. But code converted from Python 2 will often assume that indexing a bytestring produces another bytestring, not an integer. To reconcile both behaviours, use slicing:
>>> b"xyz"[0:1]
b'x'
>>> n = 1
>>> b"xyz"[n:n+1]
b'y'
The only remaining gotcha is that an out-of-bounds slice returns an empty bytes object instead of raising IndexError:
>>> b"xyz"[3]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: index out of range
>>> b"xyz"[3:4]
b''
__str__()/__unicode__()¶
In Python 2, objects can specify both a string and unicode representation of themselves. In Python 3, though, there is only a string representation. This becomes an issue as people can inadvertently do things in their __str__() methods which have unpredictable results (e.g., infinite recursion if you happen to use the unicode(self).encode('utf8') idiom as the body of your __str__() method).
There are two ways to solve this issue. One is to use a custom 2to3 fixer. The blog post at http://lucumr.pocoo.org/2011/1/22/forwards-compatible-python/ specifies how to do this. That will allow 2to3 to change all instances of def __unicode__(self): ... to def __str__(self): .... This does require that you define your __str__() method in Python 2 before your __unicode__() method.
The other option is to use a mixin class. This allows you to only define a __unicode__() method for your class and let the mixin derive __str__() for you (code from http://lucumr.pocoo.org/2011/1/22/forwards-compatible-python/):
import sys

class UnicodeMixin(object):

    """Mixin class to handle defining the proper __str__/__unicode__
    methods in Python 2 or 3."""

    if sys.version_info[0] >= 3:  # Python 3
        def __str__(self):
            return self.__unicode__()
    else:  # Python 2
        def __str__(self):
            return self.__unicode__().encode('utf8')


class Spam(UnicodeMixin):

    def __unicode__(self):
        return u'spam-spam-bacon-spam'  # 2to3 will remove the 'u' prefix
Don’t Index on Exceptions¶
In Python 2, the following worked:
>>> exc = Exception(1, 2, 3)
>>> exc.args[1]
2
>>> exc[1] # Python 2 only!
2
But in Python 3, indexing directly on an exception is an error. You need to make sure to only index on the BaseException.args attribute which is a sequence containing all arguments passed to the __init__() method. Even better is to use the documented attributes the exception provides.
Don’t use __getslice__ & Friends¶
These methods have been deprecated for a while, but Python 3 finally drops support for __getslice__(), etc. Move completely over to __getitem__() and friends.
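A sketch of the replacement pattern (the class is invented for illustration): instead of defining __getslice__(), let __getitem__() accept slice objects.

class Window(object):

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, index):
        # Receives an int for w[i] and a slice object for w[i:j].
        if isinstance(index, slice):
            return Window(self.items[index])
        return self.items[index]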
Updating doctests¶
2to3 will attempt to generate fixes for doctests that it comes across. It’s not perfect, though. If you wrote a monolithic set of doctests (e.g., a single docstring containing all of your doctests), you should at least consider breaking the doctests up into smaller pieces to make it more manageable to fix. Otherwise it might very well be worth your time and effort to port your tests to unittest.
Eliminate -3 Warnings¶
When you run your application’s test suite, run it using the -3 flag passed to Python. This will cause various warnings to be raised during execution about things that 2to3 cannot handle automatically (e.g., modules that have been removed). Try to eliminate those warnings to make your code even more portable to Python 3.
Run 2to3¶
Once you have made your Python 2 code future-compatible with Python 3, it’s time to use 2to3 to actually port your code.
Manually¶
To manually convert source code using 2to3, you use the 2to3 script that is installed with Python 2.6 and later:
2to3 <directory or file to convert>
This will cause 2to3 to write out a diff with all of the fixers applied for the converted source code. If you would like 2to3 to go ahead and apply the changes you can pass it the -w flag:
2to3 -w <stuff to convert>
There are other flags available to control exactly which fixers are applied, etc.
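For instance (hedging on the exact flag set of your 2to3 version), -l lists the available fixers and -f selects individual fixers to run:
2to3 -l
2to3 -f idioms -w <stuff to convert>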
During Installation¶
When a user installs your project for Python 3, you can have either distutils or Distribute run 2to3 on your behalf.
For distutils, use the following idiom:
try:  # Python 3
    from distutils.command.build_py import build_py_2to3 as build_py
except ImportError:  # Python 2
    from distutils.command.build_py import build_py

setup(cmdclass={'build_py': build_py},
      # ...
      )
For Distribute:
setup(use_2to3=True,
      # ...
      )
This will allow you to not have to distribute a separate Python 3 version of your project. It does require, though, that when you perform development that you at least build your project and use the built Python 3 source for testing.
Verify & Test¶
At this point you should (hopefully) have your project converted in such a way that it works in Python 3. Verify it by running your unit tests and making sure nothing has gone awry. If you miss something then figure out how to fix it in Python 3, backport to your Python 2 code, and run your code through 2to3 again to verify the fix transforms properly.
Python 2/3 Compatible Source¶
While it may seem counter-intuitive, you can write Python code which is source-compatible between Python 2 & 3. It does lead to code that is not entirely idiomatic Python (e.g., having to extract the currently raised exception from sys.exc_info()[1]), but it can be run under Python 2 and Python 3 without using 2to3 as a translation step (although the tool should be used to help find potential portability problems). This allows you to continue to have a rapid development process regardless of whether you are developing under Python 2 or Python 3. Whether this approach or using Python 2 and 2to3 works best for you will be a per-project decision.
To get a complete idea of what issues you will need to deal with, see What’s New in Python 3.0. Others have reorganized the data in other formats such as http://docs.pythonsprints.com/python3_porting/py-porting.html.
The following are some steps to take to try to support both Python 2 & 3 from the same source code.
Follow The Steps for Using 2to3¶
All of the steps outlined in how to port Python 2 code with 2to3 apply to creating a Python 2/3 codebase. This includes trying to support only Python 2.6 or newer (the __future__ statements work in Python 3 without issue), eliminating warnings that are triggered by -3, etc.
You should even consider running 2to3 over your code (without committing the changes). This will let you know where potential pain points are within your code so that you can fix them properly before they become an issue.
Use six¶
The six project contains many things to help you write portable Python code. You should make sure to read its documentation from beginning to end and use any and all features it provides. That way you will minimize any mistakes you might make in writing cross-version code.
Capturing the Currently Raised Exception¶
One change between Python 2 and 3 that will require changing how you code (if you support Python 2.5 and earlier) is accessing the currently raised exception. In Python 2.5 and earlier the syntax to access the current exception is:
try:
    raise Exception()
except Exception, exc:
    # Current exception is 'exc'
    pass
This syntax changed in Python 3 (and was backported to Python 2.6 and later) to:
try:
    raise Exception()
except Exception as exc:
    # Current exception is 'exc'
    # In Python 3, 'exc' is restricted to the block; Python 2.6 will "leak"
    pass
Because of this syntax change, you must change how you capture the current exception to:
try:
    raise Exception()
except Exception:
    import sys
    exc = sys.exc_info()[1]
    # Current exception is 'exc'
    pass
You can get more information about the raised exception from sys.exc_info() than simply the current exception instance, but you most likely don’t need it.
Note
In Python 3, the traceback is attached to the exception instance through the __traceback__ attribute. If the instance is saved in a local variable that persists outside of the except block, the traceback will create a reference cycle with the current frame and its dictionary of local variables. This will delay reclaiming dead resources until the next cyclic garbage collection pass.
In Python 2, this problem only occurs if you save the traceback itself (e.g. the third element of the tuple returned by sys.exc_info()) in a variable.
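If you do need to keep the exception around, one hedged way to avoid the cycle is to clear the variable once you are done with it:

import sys

try:
    raise Exception('boom')
except Exception:
    exc = sys.exc_info()[1]
    message = str(exc)  # use the exception while you still hold it
    exc = None          # drop the reference so no cycle outlives the handler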
Other Resources¶
The authors of the following blog posts, wiki pages, and books deserve special thanks for making public their tips for porting Python 2 code to Python 3 (and thus helping provide information for this document):
- http://python3porting.com/
- http://docs.pythonsprints.com/python3_porting/py-porting.html
- http://techspot.zzzeek.org/2011/01/24/zzzeek-s-guide-to-python-3-porting/
- http://dabeaz.blogspot.com/2011/01/porting-py65-and-my-superboard-to.html
- http://lucumr.pocoo.org/2011/1/22/forwards-compatible-python/
- http://lucumr.pocoo.org/2010/2/11/porting-to-python-3-a-guide/
- http://wiki.python.org/moin/PortingPythonToPy3k
If you feel there is something missing from this document that should be added, please email the python-porting mailing list.
Porting Extension Modules to 3.0¶
Author: Benjamin Peterson
Abstract
Although changing the C-API was not one of Python 3.0’s objectives, the many Python level changes made leaving 2.x’s API intact impossible. In fact, some changes such as int() and long() unification are more obvious on the C level. This document endeavors to document incompatibilities and how they can be worked around.
Conditional compilation¶
The easiest way to compile only some code for 3.0 is to check if PY_MAJOR_VERSION is greater than or equal to 3.
#if PY_MAJOR_VERSION >= 3
#define IS_PY3K
#endif
API functions that are not present can be aliased to their equivalents within conditional blocks.
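For example, a 2.x-only name can be mapped onto its 3.0 equivalent inside such a block (a minimal sketch; pick the aliases your own module actually needs):

#ifdef IS_PY3K
/* The PyInt_* functions disappeared in 3.0; route the old names to the long APIs. */
#define PyInt_FromLong PyLong_FromLong
#define PyInt_AsLong PyLong_AsLong
#endif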
Changes to Object APIs¶
Python 3.0 merged together some types with similar functions while cleanly separating others.
str/unicode Unification¶
Python 3.0’s str() type is equivalent to 2.x’s unicode(); in the C API, both correspond to the PyUnicode_* functions. The old 8-bit string type has become bytes(). Python 2.6 and later provide a compatibility header, bytesobject.h, mapping PyBytes names to PyString ones. For best compatibility with 3.0, PyUnicode should be used for textual data and PyBytes for binary data. It’s also important to remember that PyBytes and PyUnicode in 3.0 are not interchangeable like PyString and PyUnicode are in 2.x. The following example shows best practices with regards to PyUnicode, PyString, and PyBytes.
#include "stdlib.h"
#include "Python.h"
#include "bytesobject.h"
/* text example */
static PyObject *
say_hello(PyObject *self, PyObject *args) {
PyObject *name, *result;
if (!PyArg_ParseTuple(args, "U:say_hello", &name))
return NULL;
result = PyUnicode_FromFormat("Hello, %S!", name);
return result;
}
/* just a forward */
static char * do_encode(PyObject *);
/* bytes example */
static PyObject *
encode_object(PyObject *self, PyObject *args) {
char *encoded;
PyObject *result, *myobj;
if (!PyArg_ParseTuple(args, "O:encode_object", &myobj))
return NULL;
encoded = do_encode(myobj);
if (encoded == NULL)
return NULL;
result = PyBytes_FromString(encoded);
free(encoded);
return result;
}
long/int Unification¶
In Python 3.0, there is only one integer type. It is called int() on the Python level, but actually corresponds to 2.x’s long() type. In the C-API, PyInt_* functions are replaced by their PyLong_* neighbors. The best course of action here is using the PyInt_* functions aliased to PyLong_* found in intobject.h. The abstract PyNumber_* APIs can also be used in some cases.
#include "Python.h"
#include "intobject.h"
static PyObject *
add_ints(PyObject *self, PyObject *args) {
int one, two;
PyObject *result;
if (!PyArg_ParseTuple(args, "ii:add_ints", &one, &two))
return NULL;
return PyInt_FromLong(one + two);
}
Module initialization and state¶
Python 3.0 has a revamped extension module initialization system. (See PEP 3121.) Instead of storing module state in globals, it should be stored in an interpreter-specific structure. Creating modules that act correctly in both 2.x and 3.0 is tricky. The following simple example demonstrates how.
#include "Python.h"
struct module_state {
PyObject *error;
};
#if PY_MAJOR_VERSION >= 3
#define GETSTATE(m) ((struct module_state*)PyModule_GetState(m))
#else
#define GETSTATE(m) (&_state)
static struct module_state _state;
#endif
static PyObject *
error_out(PyObject *m) {
struct module_state *st = GETSTATE(m);
PyErr_SetString(st->error, "something bad happened");
return NULL;
}
static PyMethodDef myextension_methods[] = {
{"error_out", (PyCFunction)error_out, METH_NOARGS, NULL},
{NULL, NULL}
};
#if PY_MAJOR_VERSION >= 3
static int myextension_traverse(PyObject *m, visitproc visit, void *arg) {
Py_VISIT(GETSTATE(m)->error);
return 0;
}
static int myextension_clear(PyObject *m) {
Py_CLEAR(GETSTATE(m)->error);
return 0;
}
static struct PyModuleDef moduledef = {
PyModuleDef_HEAD_INIT,
"myextension",
NULL,
sizeof(struct module_state),
myextension_methods,
NULL,
myextension_traverse,
myextension_clear,
NULL
};
#define INITERROR return NULL
PyObject *
PyInit_myextension(void)
#else
#define INITERROR return
void
initmyextension(void)
#endif
{
#if PY_MAJOR_VERSION >= 3
PyObject *module = PyModule_Create(&moduledef);
#else
PyObject *module = Py_InitModule("myextension", myextension_methods);
#endif
if (module == NULL)
INITERROR;
struct module_state *st = GETSTATE(module);
st->error = PyErr_NewException("myextension.Error", NULL, NULL);
if (st->error == NULL) {
Py_DECREF(module);
INITERROR;
}
#if PY_MAJOR_VERSION >= 3
return module;
#endif
}
CObject replaced with Capsule¶
The Capsule object was introduced in Python 3.1 and 2.7 to replace CObject. CObjects were useful, but the CObject API was problematic: it didn’t permit distinguishing between valid CObjects, which allowed mismatched CObjects to crash the interpreter, and some of its APIs relied on undefined behavior in C. (For further reading on the rationale behind Capsules, please see issue 5630 on the Python bug tracker.)
If you’re currently using CObjects, and you want to migrate to 3.1 or newer,
you’ll need to switch to Capsules.
CObject was deprecated in 3.1 and 2.7 and completely removed in Python 3.2. If you only support 2.7, or 3.1 and above, you can simply switch to Capsule. If you need to support 3.0 or versions of Python earlier than 2.7 you’ll have to support both CObjects and Capsules.
The following example header file capsulethunk.h may solve the problem for you; simply write your code against the Capsule API, include this header file after "Python.h", and you’ll automatically use CObjects in Python 3.0 or versions earlier than 2.7.
capsulethunk.h simulates Capsules using CObjects. However, CObject provides no place to store the capsule’s “name”. As a result the simulated Capsule objects created by capsulethunk.h behave slightly differently from real Capsules. Specifically:
- The name parameter passed in to PyCapsule_New() is ignored.
- The name parameter passed in to PyCapsule_IsValid() and PyCapsule_GetPointer() is ignored, and no error checking of the name is performed.
- PyCapsule_GetName() always returns NULL.
- PyCapsule_SetName() always throws an exception and returns failure. (Since there’s no way to store a name in a CObject, noisy failure of PyCapsule_SetName() was deemed preferable to silent failure here. If this is inconvenient, feel free to modify your local copy as you see fit.)
You can find capsulethunk.h in the Python source distribution in the Doc/includes directory.
Curses Programming with Python¶
Author: A.M. Kuchling, Eric S. Raymond
Release: 2.03
Abstract
This document describes how to write text-mode programs with Python 2.x, using the curses extension module to control the display.
What is curses?¶
The curses library supplies a terminal-independent screen-painting and keyboard-handling facility for text-based terminals; such terminals include VT100s, the Linux console, and the simulated terminal provided by X11 programs such as xterm and rxvt. Display terminals support various control codes to perform common operations such as moving the cursor, scrolling the screen, and erasing areas. Different terminals use widely differing codes, and often have their own minor quirks.
In a world of X displays, one might ask “why bother”? It’s true that character-cell display terminals are an obsolete technology, but there are niches in which being able to do fancy things with them are still valuable. One is on small-footprint or embedded Unixes that don’t carry an X server. Another is for tools like OS installers and kernel configurators that may have to run before X is available.
The curses library hides all the details of different terminals, and provides the programmer with an abstraction of a display, containing multiple non-overlapping windows. The contents of a window can be changed in various ways (adding text, erasing it, changing its appearance) and the curses library will automagically figure out what control codes need to be sent to the terminal to produce the right output.
The curses library was originally written for BSD Unix; the later System V versions of Unix from AT&T added many enhancements and new functions. BSD curses is no longer maintained, having been replaced by ncurses, which is an open-source implementation of the AT&T interface. If you’re using an open-source Unix such as Linux or FreeBSD, your system almost certainly uses ncurses. Since most current commercial Unix versions are based on System V code, all the functions described here will probably be available. The older versions of curses carried by some proprietary Unixes may not support everything, though.
No one has made a Windows port of the curses module. On a Windows platform, try the Console module written by Fredrik Lundh. The Console module provides cursor-addressable text output, plus full support for mouse and keyboard input, and is available from http://effbot.org/zone/console-index.htm.
The Python curses module¶
The Python module is a fairly simple wrapper over the C functions provided by curses; if you’re already familiar with curses programming in C, it’s really easy to transfer that knowledge to Python. The biggest difference is that the Python interface makes things simpler, by merging different C functions such as addstr(), mvaddstr(), and mvwaddstr() into a single addstr() method. You’ll see this covered in more detail later.
This HOWTO is simply an introduction to writing text-mode programs with curses and Python. It doesn’t attempt to be a complete guide to the curses API; for that, see the Python library guide’s section on ncurses, and the C manual pages for ncurses. It will, however, give you the basic ideas.
Starting and ending a curses application¶
Before doing anything, curses must be initialized. This is done by calling the initscr() function, which will determine the terminal type, send any required setup codes to the terminal, and create various internal data structures. If successful, initscr() returns a window object representing the entire screen; this is usually called stdscr, after the name of the corresponding C variable.
import curses
stdscr = curses.initscr()
Usually curses applications turn off automatic echoing of keys to the screen, in order to be able to read keys and only display them under certain circumstances. This requires calling the noecho() function.
curses.noecho()
Applications will also commonly need to react to keys instantly, without requiring the Enter key to be pressed; this is called cbreak mode, as opposed to the usual buffered input mode.
curses.cbreak()
Terminals usually return special keys, such as the cursor keys or navigation keys such as Page Up and Home, as a multibyte escape sequence. While you could write your application to expect such sequences and process them accordingly, curses can do it for you, returning a special value such as curses.KEY_LEFT. To get curses to do the job, you’ll have to enable keypad mode.
stdscr.keypad(1)
Terminating a curses application is much easier than starting one. You’ll need to call curses.nocbreak(); stdscr.keypad(0); curses.echo() to reverse the curses-friendly terminal settings. Then call the endwin() function to restore the terminal to its original operating mode.
curses.endwin()
A common problem when debugging a curses application is to get your terminal messed up when the application dies without restoring the terminal to its previous state. In Python this commonly happens when your code is buggy and raises an uncaught exception. Keys are no longer echoed to the screen when you type them, for example, which makes using the shell difficult.
In Python you can avoid these complications and make debugging much easier by importing the module curses.wrapper. It supplies a wrapper() function that takes a callable. It does the initializations described above, and also initializes colors if color support is present. It then runs your provided callable and finally deinitializes appropriately. The callable is called inside a try-except clause which catches exceptions, performs curses deinitialization, and then passes the exception upwards. Thus, your terminal won’t be left in a funny state on exception.
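A minimal sketch of the pattern (the message text is arbitrary):

import curses

def main(stdscr):
    stdscr.clear()
    stdscr.addstr(0, 0, "Hello from curses! Press any key to exit.")
    stdscr.refresh()
    stdscr.getch()

curses.wrapper(main)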
Windows and Pads¶
Windows are the basic abstraction in curses. A window object represents a rectangular area of the screen, and supports various methods to display text, erase it, allow the user to input strings, and so forth.
The stdscr object returned by the initscr() function is a window object that covers the entire screen. Many programs may need only this single window, but you might wish to divide the screen into smaller windows, in order to redraw or clear them separately. The newwin() function creates a new window of a given size, returning the new window object.
begin_x = 20 ; begin_y = 7
height = 5 ; width = 40
win = curses.newwin(height, width, begin_y, begin_x)
A word about the coordinate system used in curses: coordinates are always passed in the order y,x, and the top-left corner of a window is coordinate (0,0). This breaks a common convention for handling coordinates, where the x coordinate usually comes first. This is an unfortunate difference from most other computer applications, but it’s been part of curses since it was first written, and it’s too late to change things now.
When you call a method to display or erase text, the effect doesn’t immediately show up on the display. This is because curses was originally written with slow 300-baud terminal connections in mind; with these terminals, minimizing the time required to redraw the screen is very important. This lets curses accumulate changes to the screen, and display them in the most efficient manner. For example, if your program displays some characters in a window, and then clears the window, there’s no need to send the original characters because they’d never be visible.
Accordingly, curses requires that you explicitly tell it to redraw windows, using the refresh() method of window objects. In practice, this doesn’t really complicate programming with curses much. Most programs go into a flurry of activity, and then pause waiting for a keypress or some other action on the part of the user. All you have to do is to be sure that the screen has been redrawn before pausing to wait for user input, by simply calling stdscr.refresh() or the refresh() method of some other relevant window.
A pad is a special case of a window; it can be larger than the actual display screen, and only a portion of it displayed at a time. Creating a pad simply requires the pad’s height and width, while refreshing a pad requires giving the coordinates of the on-screen area where a subsection of the pad will be displayed.
pad = curses.newpad(100, 100)
# These loops fill the pad with letters; this is
# explained in the next section
for y in range(0, 100):
    for x in range(0, 100):
        try:
            pad.addch(y, x, ord('a') + (x*x + y*y) % 26)
        except curses.error:
            pass

# Displays a section of the pad in the middle of the screen
pad.refresh(0, 0, 5, 5, 20, 75)
The refresh()
call displays a section of the pad in the rectangle
extending from coordinate (5,5) to coordinate (20,75) on the screen; the upper
left corner of the displayed section is coordinate (0,0) on the pad. Beyond
that difference, pads are exactly like ordinary windows and support the same
methods.
If you have multiple windows and pads on screen there is a more efficient way to
go, which will prevent annoying screen flicker at refresh time. Use the
noutrefresh()
method of each window to update the data structure
representing the desired state of the screen; then change the physical screen to
match the desired state in one go with the function doupdate()
. The
normal refresh()
method calls doupdate()
as its last act.
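As a sketch, assuming two windows win1 and win2 have already been created and drawn into:

win1.noutrefresh()      # queue win1's changes in the virtual screen
win2.noutrefresh()      # queue win2's changes as well
curses.doupdate()       # repaint the physical screen in one pass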
Displaying Text¶
From a C programmer’s point of view, curses may sometimes look like a twisty
maze of functions, all subtly different. For example, addstr()
displays a
string at the current cursor location in the stdscr
window, while
mvaddstr()
moves to a given y,x coordinate first before displaying the
string. waddstr()
is just like addstr()
, but allows specifying a
window to use, instead of using stdscr
by default. mvwaddstr()
follows
similarly.
Fortunately the Python interface hides all these details; stdscr
is a window
object like any other, and methods like addstr()
accept multiple argument
forms. Usually there are four different forms.
Form | Description |
---|---|
str or ch | Display the string str or character ch at the current position |
str or ch, attr | Display the string str or character ch, using attribute attr at the current position |
y, x, str or ch | Move to position y,x within the window, and display str or ch |
y, x, str or ch, attr | Move to position y,x within the window, and display str or ch, using attribute attr |
Attributes allow displaying text in highlighted forms, such as in boldface, underline, reverse code, or in color. They’ll be explained in more detail in the next subsection.
The addstr()
function takes a Python string as the value to be displayed,
while the addch()
functions take a character, which can be either a Python
string of length 1 or an integer. If it’s a string, you’re limited to
displaying characters between 0 and 255. SVr4 curses provides constants for
extension characters; these constants are integers greater than 255. For
example, ACS_PLMINUS
is a +/- symbol, and ACS_ULCORNER
is the
upper left corner of a box (handy for drawing borders).
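For example, once curses has been initialized, you could draw that corner character like this:

stdscr.addch(0, 0, curses.ACS_ULCORNER)   # an integer constant greater than 255
stdscr.refresh()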
Windows remember where the cursor was left after the last operation, so if you
leave out the y,x coordinates, the string or character will be displayed
wherever the last operation left off. You can also move the cursor with the
move(y,x)
method. Because some terminals always display a flashing cursor,
you may want to ensure that the cursor is positioned in some location where it
won’t be distracting; it can be confusing to have the cursor blinking at some
apparently random location.
If your application doesn’t need a blinking cursor at all, you can call
curs_set(0)
to make it invisible. Equivalently, and for compatibility with
older curses versions, there’s a leaveok(bool)
function. When bool is
true, the curses library will attempt to suppress the flashing cursor, and you
won’t need to worry about leaving it in odd locations.
Attributes and Color¶
Characters can be displayed in different ways. Status lines in a text-based application are commonly shown in reverse video; a text viewer may need to highlight certain words. curses supports this by allowing you to specify an attribute for each cell on the screen.
An attribute is an integer, each bit representing a different attribute. You can try to display text with multiple attribute bits set, but curses doesn’t guarantee that all the possible combinations are available, or that they’re all visually distinct. That depends on the ability of the terminal being used, so it’s safest to stick to the most commonly available attributes, listed here.
Attribute | Description |
---|---|
A_BLINK | Blinking text |
A_BOLD | Extra bright or bold text |
A_DIM | Half bright text |
A_REVERSE | Reverse-video text |
A_STANDOUT | The best highlighting mode available |
A_UNDERLINE | Underlined text |
So, to display a reverse-video status line on the top line of the screen, you could code:
stdscr.addstr(0, 0, "Current mode: Typing mode",
curses.A_REVERSE)
stdscr.refresh()
The curses library also supports color on those terminals that provide it. The most common such terminal is probably the Linux console, followed by color xterms.
To use color, you must call the start_color()
function soon after calling
initscr()
, to initialize the default color set (the
curses.wrapper.wrapper()
function does this automatically). Once that’s
done, the has_colors()
function returns True if the terminal in use can
actually display color. (Note: curses uses the American spelling ‘color’,
instead of the Canadian/British spelling ‘colour’. If you’re used to the
British spelling, you’ll have to resign yourself to misspelling it for the sake
of these functions.)
The curses library maintains a finite number of color pairs, containing a
foreground (or text) color and a background color. You can get the attribute
value corresponding to a color pair with the color_pair()
function; this
can be bitwise-OR’ed with other attributes such as A_REVERSE
, but
again, such combinations are not guaranteed to work on all terminals.
An example, which displays a line of text using color pair 1:
stdscr.addstr( "Pretty text", curses.color_pair(1) )
stdscr.refresh()
As I said before, a color pair consists of a foreground and background color.
start_color()
initializes 8 basic colors when it activates color mode.
They are: 0:black, 1:red, 2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and
7:white. The curses module defines named constants for each of these colors:
curses.COLOR_BLACK
, curses.COLOR_RED
, and so forth.
The init_pair(n, f, b)
function changes the definition of color pair n, to
foreground color f and background color b. Color pair 0 is hard-wired to white
on black, and cannot be changed.
Let’s put all this together. To change color 1 to red text on a white background, you would call:
curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
When you change a color pair, any text already displayed using that color pair will change to the new colors. You can also display new text in this color with:
stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) )
Very fancy terminals can change the definitions of the actual colors to a given
RGB value. This lets you change color 1, which is usually red, to purple or
blue or any other color you like. Unfortunately, the Linux console doesn’t
support this, so I’m unable to try it out, and can’t provide any examples. You
can check if your terminal can do this by calling can_change_color()
,
which returns True if the capability is there. If you’re lucky enough to have
such a talented terminal, consult your system’s man pages for more information.
User Input¶
The curses library itself offers only very simple input mechanisms. Python’s support adds a text-input widget that makes up for some of this lack.
The most common way to get input to a window is to use its getch()
method.
getch()
pauses and waits for the user to hit a key, displaying it if
echo()
has been called earlier. You can optionally specify a coordinate
to which the cursor should be moved before pausing.
It’s possible to change this behavior with the method nodelay()
. After
nodelay(1)
, getch()
for the window becomes non-blocking and returns
curses.ERR
(a value of -1) when no input is ready. There’s also a
halfdelay()
function, which can be used to (in effect) set a timer on each
getch()
; if no input becomes available within a specified
delay (measured in tenths of a second), curses raises an exception.
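A sketch of a non-blocking input loop using nodelay() (the do_background_work() call is a hypothetical placeholder):

stdscr.nodelay(1)           # getch() now returns immediately
while 1:
    c = stdscr.getch()      # returns curses.ERR (-1) when no key is ready
    if c == curses.ERR:
        do_background_work()    # placeholder: do something else meanwhile
    elif c == ord('q'):
        break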
The getch()
method returns an integer; if it’s between 0 and 255, it
represents the ASCII code of the key pressed. Values greater than 255 are
special keys such as Page Up, Home, or the cursor keys. You can compare the
value returned to constants such as curses.KEY_PPAGE
,
curses.KEY_HOME
, or curses.KEY_LEFT
. Usually the main loop of
your program will look something like this:
while 1:
c = stdscr.getch()
if c == ord('p'): PrintDocument()
elif c == ord('q'): break # Exit the while()
elif c == curses.KEY_HOME: x = y = 0
The curses.ascii
module supplies ASCII class membership functions that
take either integer or 1-character-string arguments; these may be useful in
writing more readable tests for your command interpreters. It also supplies
conversion functions that take either integer or 1-character-string arguments
and return the same type. For example, curses.ascii.ctrl()
returns the
control character corresponding to its argument.
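For example:

import curses.ascii

curses.ascii.isdigit(ord('7'))   # True; accepts an integer...
curses.ascii.isdigit('7')        # ...or a 1-character string
curses.ascii.ctrl('g')           # returns '\x07', the BEL control character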
There’s also a method to retrieve an entire string, getstr()
. It isn’t
used very often, because its functionality is quite limited; the only editing
keys available are the backspace key and the Enter key, which terminates the
string. It can optionally be limited to a fixed number of characters.
curses.echo() # Enable echoing of characters
# Get a 15-character string, with the cursor on the top line
s = stdscr.getstr(0,0, 15)
The Python curses.textpad
module supplies something better. With it, you
can turn a window into a text box that supports an Emacs-like set of
keybindings. Various methods of Textbox
class support editing with
input validation and gathering the edit results either with or without trailing
spaces. See the library documentation on curses.textpad
for the
details.
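A brief sketch of the usual pattern, assuming curses has already been initialized:

import curses.textpad

editwin = curses.newwin(5, 60, 2, 1)      # height, width, begin_y, begin_x
box = curses.textpad.Textbox(editwin)
text = box.edit()     # the user edits until Ctrl-G; returns the window contents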
For More Information¶
This HOWTO didn’t cover some advanced topics, such as screen-scraping or capturing mouse events from an xterm instance. But the Python library page for the curses modules is now pretty complete. You should browse it next.
If you’re in doubt about the detailed behavior of any of the ncurses entry
points, consult the manual pages for your curses implementation, whether it’s
ncurses or a proprietary Unix vendor’s. The manual pages will document any
quirks, and provide complete lists of all the functions, attributes, and
ACS_*
characters available to you.
Because the curses API is so large, some functions aren’t supported in the Python interface, not because they’re difficult to implement, but because no one has needed them yet. Feel free to add them and then submit a patch. Also, we don’t yet have support for the menu library associated with ncurses; feel free to add that.
If you write an interesting little program, feel free to contribute it as another demo. We can always use more of them!
The ncurses FAQ: http://invisible-island.net/ncurses/ncurses.faq.html
Descriptor HowTo Guide¶
Author: | Raymond Hettinger |
---|---|
Contact: | <python at rcn dot com> |
Contents
Abstract¶
Defines descriptors, summarizes the protocol, and shows how descriptors are called. Examines a custom descriptor and several built-in Python descriptors: functions, properties, static methods, and class methods. Shows how each works by giving a pure Python equivalent and a sample application.
Learning about descriptors not only provides access to a larger toolset, it creates a deeper understanding of how Python works and an appreciation for the elegance of its design.
Definition and Introduction¶
In general, a descriptor is an object attribute with “binding behavior”, one
whose attribute access has been overridden by methods in the descriptor
protocol. Those methods are __get__()
, __set__()
, and
__delete__()
. If any of those methods are defined for an object, it is
said to be a descriptor.
The default behavior for attribute access is to get, set, or delete the
attribute from an object’s dictionary. For instance, a.x
has a lookup chain
starting with a.__dict__['x']
, then type(a).__dict__['x']
, and
continuing through the base classes of type(a)
excluding metaclasses. If the
looked-up value is an object defining one of the descriptor methods, then Python
may override the default behavior and invoke the descriptor method instead.
Where this occurs in the precedence chain depends on which descriptor methods
were defined. Note that descriptors are only invoked for new style objects or
classes (a class is new style if it inherits from object
or
type
).
Descriptors are a powerful, general purpose protocol. They are the mechanism
behind properties, methods, static methods, class methods, and super()
.
They are used throughout Python itself to implement the new style classes
introduced in version 2.2. Descriptors simplify the underlying C-code and offer
a flexible set of new tools for everyday Python programs.
Descriptor Protocol¶
descr.__get__(self, obj, type=None) --> value
descr.__set__(self, obj, value) --> None
descr.__delete__(self, obj) --> None
These three methods are the whole protocol. An object that defines any of them is considered a descriptor, and can override the default behavior when it is looked up as an attribute.
If an object defines both __get__() and __set__(), it is considered a data descriptor. Descriptors that only define __get__() are called non-data descriptors (they are typically used for methods, but other uses are possible).
Data and non-data descriptors differ in how overrides are calculated with respect to entries in an instance’s dictionary. If an instance’s dictionary has an entry with the same name as a data descriptor, the data descriptor takes precedence. If an instance’s dictionary has an entry with the same name as a non-data descriptor, the dictionary entry takes precedence.
To define a read-only data descriptor, define both __get__() and __set__(), with __set__() raising an AttributeError when called. Defining __set__() with an exception-raising placeholder is enough to make the object a data descriptor.
Invoking Descriptors¶
A descriptor can be called directly by its method name. For example,
d.__get__(obj)
.
Alternatively, it is more common for a descriptor to be invoked automatically
upon attribute access. For example, obj.d
looks up d
in the dictionary
of obj
. If d
defines the method __get__()
, then d.__get__(obj)
is invoked according to the precedence rules listed below.
The details of invocation depend on whether obj
is an object or a class.
Either way, descriptors only work for new style objects and classes. A class is
new style if it is a subclass of object
.
For objects, the machinery is in object.__getattribute__()
which
transforms b.x
into type(b).__dict__['x'].__get__(b, type(b))
. The
implementation works through a precedence chain that gives data descriptors
priority over instance variables, instance variables priority over non-data
descriptors, and assigns lowest priority to __getattr__()
if provided. The
full C implementation can be found in PyObject_GenericGetAttr()
in
Objects/object.c.
For classes, the machinery is in type.__getattribute__()
which transforms
B.x
into B.__dict__['x'].__get__(None, B)
. In pure Python, it looks
like:
def __getattribute__(self, key):
"Emulate type_getattro() in Objects/typeobject.c"
v = object.__getattribute__(self, key)
if hasattr(v, '__get__'):
return v.__get__(None, self)
return v
The important points to remember are:
- descriptors are invoked by the __getattribute__() method
- overriding __getattribute__() prevents automatic descriptor calls
- __getattribute__() is only available with new style classes and objects
- object.__getattribute__() and type.__getattribute__() make different calls to __get__()
- data descriptors always override instance dictionaries
- non-data descriptors may be overridden by instance dictionaries
The object returned by super()
also has a custom __getattribute__()
method for invoking descriptors. The call super(B, obj).m()
searches
obj.__class__.__mro__
for the base class A
immediately following B
and then returns A.__dict__['m'].__get__(obj, A)
. If not a descriptor,
m
is returned unchanged. If not in the dictionary, m
reverts to a
search using object.__getattribute__()
.
Note, in Python 2.2, super(B, obj).m()
would only invoke __get__()
if
m
was a data descriptor. In Python 2.3, non-data descriptors also get
invoked unless an old-style class is involved. The implementation details are
in super_getattro()
in
Objects/typeobject.c
and a pure Python equivalent can be found in Guido’s Tutorial.
The details above show that the mechanism for descriptors is embedded in the
__getattribute__()
methods for object
, type
, and
super()
. Classes inherit this machinery when they derive from
object
or if they have a meta-class providing similar functionality.
Likewise, classes can turn-off descriptor invocation by overriding
__getattribute__()
.
Descriptor Example¶
The following code creates a class whose objects are data descriptors which
print a message for each get or set. Overriding __getattribute__()
is
an alternative approach that could do this for every attribute. However, this
descriptor is useful for monitoring just a few chosen attributes:
class RevealAccess(object):
"""A data descriptor that sets and returns values
normally and prints a message logging their access.
"""
def __init__(self, initval=None, name='var'):
self.val = initval
self.name = name
def __get__(self, obj, objtype):
print 'Retrieving', self.name
return self.val
def __set__(self, obj, val):
print 'Updating' , self.name
self.val = val
>>> class MyClass(object):
x = RevealAccess(10, 'var "x"')
y = 5
>>> m = MyClass()
>>> m.x
Retrieving var "x"
10
>>> m.x = 20
Updating var "x"
>>> m.x
Retrieving var "x"
20
>>> m.y
5
The protocol is simple and offers exciting possibilities. Several use cases are so common that they have been packaged into individual function calls. Properties, bound and unbound methods, static methods, and class methods are all based on the descriptor protocol.
Properties¶
Calling property()
is a succinct way of building a data descriptor that
triggers function calls upon access to an attribute. Its signature is:
property(fget=None, fset=None, fdel=None, doc=None) -> property attribute
The documentation shows a typical use to define a managed attribute x
:
class C(object):
def getx(self): return self.__x
def setx(self, value): self.__x = value
def delx(self): del self.__x
x = property(getx, setx, delx, "I'm the 'x' property.")
To see how property()
is implemented in terms of the descriptor protocol,
here is a pure Python equivalent:
class Property(object):
"Emulate PyProperty_Type() in Objects/descrobject.c"
def __init__(self, fget=None, fset=None, fdel=None, doc=None):
self.fget = fget
self.fset = fset
self.fdel = fdel
self.__doc__ = doc
def __get__(self, obj, objtype=None):
if obj is None:
return self
if self.fget is None:
raise AttributeError, "unreadable attribute"
return self.fget(obj)
def __set__(self, obj, value):
if self.fset is None:
raise AttributeError, "can't set attribute"
self.fset(obj, value)
def __delete__(self, obj):
if self.fdel is None:
raise AttributeError, "can't delete attribute"
self.fdel(obj)
The property()
builtin helps whenever a user interface has granted
attribute access and then subsequent changes require the intervention of a
method.
For instance, a spreadsheet class may grant access to a cell value through
Cell('b10').value
. Subsequent improvements to the program require the cell
to be recalculated on every access; however, the programmer does not want to
affect existing client code accessing the attribute directly. The solution is
to wrap access to the value attribute in a property data descriptor:
class Cell(object):
. . .
def getvalue(self):
"Recalculate cell before returning value"
self.recalc()
return self._value
value = property(getvalue)
Functions and Methods¶
Python’s object oriented features are built upon a function based environment. Using non-data descriptors, the two are merged seamlessly.
Class dictionaries store methods as functions. In a class definition, methods
are written using def
and lambda
, the usual tools for
creating functions. The only difference from regular functions is that the
first argument is reserved for the object instance. By Python convention, the
instance reference is called self but may be called this or any other
variable name.
To support method calls, functions include the __get__()
method for
binding methods during attribute access. This means that all functions are
non-data descriptors which return bound or unbound methods depending whether
they are invoked from an object or a class. In pure Python, it works like
this:
class Function(object):
. . .
def __get__(self, obj, objtype=None):
"Simulate func_descr_get() in Objects/funcobject.c"
return types.MethodType(self, obj, objtype)
Running the interpreter shows how the function descriptor works in practice:
>>> class D(object):
def f(self, x):
return x
>>> d = D()
>>> D.__dict__['f'] # Stored internally as a function
<function f at 0x00C45070>
>>> D.f # Get from a class becomes an unbound method
<unbound method D.f>
>>> d.f # Get from an instance becomes a bound method
<bound method D.f of <__main__.D object at 0x00B18C90>>
The output suggests that bound and unbound methods are two different types.
While they could have been implemented that way, the actual C implementation of
PyMethod_Type
in
Objects/classobject.c
is a single object with two different representations depending on whether the
im_self
field is set or is NULL (the C equivalent of None).
Likewise, the effects of calling a method object depend on the im_self
field. If set (meaning bound), the original function (stored in the
im_func
field) is called as expected with the first argument set to the
instance. If unbound, all of the arguments are passed unchanged to the original
function. The actual C implementation of instancemethod_call()
is only
slightly more complex in that it includes some type checking.
Static Methods and Class Methods¶
Non-data descriptors provide a simple mechanism for variations on the usual patterns of binding functions into methods.
To recap, functions have a __get__()
method so that they can be converted
to a method when accessed as attributes. The non-data descriptor transforms a
obj.f(*args)
call into f(obj, *args)
. Calling klass.f(*args)
becomes f(*args)
.
This chart summarizes the binding and its two most useful variants:
Transformation | Called from an Object | Called from a Class |
---|---|---|
function | f(obj, *args) | f(*args) |
staticmethod | f(*args) | f(*args) |
classmethod | f(type(obj), *args) | f(klass, *args) |
Static methods return the underlying function without changes. Calling either
c.f
or C.f
is the equivalent of a direct lookup into
object.__getattribute__(c, "f")
or object.__getattribute__(C, "f")
. As a
result, the function becomes identically accessible from either an object or a
class.
Good candidates for static methods are methods that do not reference the
self
variable.
For instance, a statistics package may include a container class for
experimental data. The class provides normal methods for computing the average,
mean, median, and other descriptive statistics that depend on the data. However,
there may be useful functions which are conceptually related but do not depend
on the data. For instance, erf(x)
is a handy conversion routine that comes up
in statistical work but does not directly depend on a particular dataset.
It can be called either from an object or the class: s.erf(1.5) --> .9332
or
Sample.erf(1.5) --> .9332
.
Since staticmethods return the underlying function with no changes, the example calls are unexciting:
>>> class E(object):
def f(x):
print x
f = staticmethod(f)
>>> print E.f(3)
3
>>> print E().f(3)
3
Using the non-data descriptor protocol, a pure Python version of
staticmethod()
would look like this:
class StaticMethod(object):
"Emulate PyStaticMethod_Type() in Objects/funcobject.c"
def __init__(self, f):
self.f = f
def __get__(self, obj, objtype=None):
return self.f
Unlike static methods, class methods prepend the class reference to the argument list before calling the function. This format is the same whether the caller is an object or a class:
>>> class E(object):
def f(klass, x):
return klass.__name__, x
f = classmethod(f)
>>> print E.f(3)
('E', 3)
>>> print E().f(3)
('E', 3)
This behavior is useful whenever the function only needs to have a class
reference and does not care about any underlying data. One use for classmethods
is to create alternate class constructors. In Python 2.3, the classmethod
dict.fromkeys()
creates a new dictionary from a list of keys. The pure
Python equivalent is:
class Dict:
. . .
def fromkeys(klass, iterable, value=None):
"Emulate dict_fromkeys() in Objects/dictobject.c"
d = klass()
for key in iterable:
d[key] = value
return d
fromkeys = classmethod(fromkeys)
Now a new dictionary of unique keys can be constructed like this:
>>> Dict.fromkeys('abracadabra')
{'a': None, 'r': None, 'b': None, 'c': None, 'd': None}
Using the non-data descriptor protocol, a pure Python version of
classmethod()
would look like this:
class ClassMethod(object):
"Emulate PyClassMethod_Type() in Objects/funcobject.c"
def __init__(self, f):
self.f = f
def __get__(self, obj, klass=None):
if klass is None:
klass = type(obj)
def newfunc(*args):
return self.f(klass, *args)
return newfunc
Idioms and Anti-Idioms in Python¶
Author: | Moshe Zadka |
---|---|
This document is placed in the public domain.
Abstract
This document can be considered a companion to the tutorial. It shows how to use Python, and even more importantly, how not to use Python.
Language Constructs You Should Not Use¶
While Python has relatively few gotchas compared to other languages, it still has some constructs which are only useful in corner cases, or are plain dangerous.
from module import *¶
Inside Function Definitions¶
from module import *
is invalid inside function definitions. While many
versions of Python do not check for the invalidity, it does not make it more
valid, no more than having a smart lawyer makes a man innocent. Do not use it
like that ever. Even in versions where it was accepted, it made the function
execution slower, because the compiler could not be certain which names were
local and which were global. In Python 2.1 this construct causes warnings, and
sometimes even errors.
At Module Level¶
While it is valid to use from module import *
at module level it is usually
a bad idea. For one, this loses an important property Python otherwise has —
you can know where each toplevel name is defined by a simple “search” function
in your favourite editor. You also open yourself to trouble in the future, if
some module grows additional functions or classes.
One of the most awful questions asked on the newsgroup is why this code:
f = open("www")
f.read()
does not work. Of course, it works just fine (assuming you have a file called
“www”.) But it does not work if somewhere in the module, the statement from
os import *
is present. The os
module has a function called
open()
which returns an integer. While it is very useful, shadowing a
builtin is one of its least useful properties.
Remember, you can never know for sure what names a module exports, so either
take what you need — from module import name1, name2
, or keep them in the
module and access on a per-need basis — import module; print module.name
.
When It Is Just Fine¶
There are situations in which from module import *
is just fine:
- The interactive prompt. For example, from math import * makes Python an amazing scientific calculator.
- When extending a module in C with a module in Python.
- When the module advertises itself as from import * safe.
Unadorned exec
, execfile()
and friends¶
The word “unadorned” refers to the use without an explicit dictionary, in which
case those constructs evaluate code in the current environment. This is
dangerous for the same reasons from import *
is dangerous — it might step
over variables you are counting on and mess up things for the rest of your code.
Simply do not do that.
Bad examples:
>>> for name in sys.argv[1:]:
>>> exec "%s=1" % name
>>> def func(s, **kw):
>>> for var, val in kw.items():
>>> exec "s.%s=val" % var # invalid!
>>> execfile("handler.py")
>>> handle()
Good examples:
>>> d = {}
>>> for name in sys.argv[1:]:
>>> d[name] = 1
>>> def func(s, **kw):
>>> for var, val in kw.items():
>>> setattr(s, var, val)
>>> d={}
>>> execfile("handle.py", d, d)
>>> handle = d['handle']
>>> handle()
from module import name1, name2¶
This is a “don’t” which is much weaker than the previous “don’t”s but is still something you should not do if you don’t have good reasons to do that. The reason it is usually a bad idea is because you suddenly have an object which lives in two separate namespaces. When the binding in one namespace changes, the binding in the other will not, so there will be a discrepancy between them. This happens when, for example, one module is reloaded, or changes the definition of a function at runtime.
Bad example:
# foo.py
a = 1
# bar.py
from foo import a
if something():
a = 2 # danger: foo.a != a
Good example:
# foo.py
a = 1
# bar.py
import foo
if something():
foo.a = 2
except:¶
Python has the except:
clause, which catches all exceptions. Since every
error in Python raises an exception, using except:
can make many
programming errors look like runtime problems, which hinders the debugging
process.
The following code shows a great example of why this is bad:
try:
foo = opne("file") # misspelled "open"
except:
sys.exit("could not open file!")
The second line triggers a NameError
, which is caught by the except
clause. The program will exit, and the error message the program prints will
make you think the problem is the readability of "file"
when in fact
the real error has nothing to do with "file"
.
A better way to write the above is
try:
foo = opne("file")
except IOError:
sys.exit("could not open file")
When this is run, Python will produce a traceback showing the NameError
,
and it will be immediately apparent what needs to be fixed.
Because except:
catches all exceptions, including SystemExit
,
KeyboardInterrupt
, and GeneratorExit
(which is not an error and
should not normally be caught by user code), using a bare except:
is almost
never a good idea. In situations where you need to catch all “normal” errors,
such as in a framework that runs callbacks, you can catch the base class for
all normal exceptions, Exception
. Unfortunately in Python 2.x it is
possible for third-party code to raise exceptions that do not inherit from
Exception
, so in Python 2.x there are some cases where you may have to
use a bare except:
and manually re-raise the exceptions you don’t want
to catch.
Exceptions¶
Exceptions are a useful feature of Python. You should learn to raise them whenever something unexpected occurs, and catch them only where you can do something about them.
The following is a very popular anti-idiom:
def get_status(file):
if not os.path.exists(file):
print "file not found"
sys.exit(1)
return open(file).readline()
Consider the case where the file gets deleted between the time the call to
os.path.exists()
is made and the time open()
is called. In that
case the last line will raise an IOError
. The same thing would happen
if the file exists but has no read permission. Since testing this on a normal
machine on existent and non-existent files makes it seem bugless, the test
results will seem fine, and the code will get shipped. Later an unhandled
IOError
(or perhaps some other EnvironmentError
) escapes to the
user, who gets to watch the ugly traceback.
Here is a somewhat better way to do it.
def get_status(file):
try:
return open(file).readline()
except EnvironmentError as err:
print "Unable to open file: {}".format(err)
sys.exit(1)
In this version, either the file gets opened and the line is read (so it works even on flaky NFS or SMB connections), or an error message is printed that provides all the available information on why the open failed, and the application is aborted.
However, even this version of get_status()
makes too many assumptions —
that it will only be used in a short running script, and not, say, in a long
running server. Sure, the caller could do something like
try:
status = get_status(log)
except SystemExit:
status = None
But there is a better way. You should try to use as few except
clauses in
your code as you can — the ones you do use will usually be inside calls which
should always succeed, or a catch-all in a main function.
So, an even better version of get_status()
is probably
def get_status(file):
return open(file).readline()
The caller can deal with the exception if it wants (for example, if it tries several files in a loop), or just let the exception filter upwards to its caller.
But the last version still has a serious problem — due to implementation details in CPython, the file would not be closed when an exception is raised until the exception handler finishes; and, worse, in other implementations (e.g., Jython) it might not be closed at all regardless of whether or not an exception is raised.
The best version of this function uses the open()
call as a context
manager, which will ensure that the file gets closed as soon as the
function returns:
def get_status(file):
with open(file) as fp:
return fp.readline()
Using the Batteries¶
Every so often, people seem to be rewriting things that are already in the Python library, usually poorly. While the occasional module has a poor interface, it is usually much better to use the rich standard library and data types that come with Python than inventing your own.
A useful module very few people know about is os.path
. It always has the
correct path arithmetic for your operating system, and will usually be much
better than whatever you come up with yourself.
Compare:
# ugh!
return dir+"/"+file
# better
return os.path.join(dir, file)
More useful functions in os.path
: basename()
, dirname()
and
splitext()
.
There are also many useful built-in functions people seem not to be aware of
for some reason: min()
and max()
can find the minimum/maximum of
any sequence with comparable semantics, for example, yet many people write
their own max()
/min()
. Another highly useful function is
reduce()
which can be used to repeatedly apply a binary operation to a
sequence, reducing it to a single value. For example, compute a factorial
with a series of multiply operations:
>>> n = 4
>>> import operator
>>> reduce(operator.mul, range(1, n+1))
24
When it comes to parsing numbers, note that float()
, int()
and
long()
all accept string arguments and will reject ill-formed strings
by raising a ValueError.
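A few quick illustrations:

int("42")           # 42
long("42")          # 42L
float("4.25")       # 4.25
int("forty-two")    # raises ValueError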
Using Backslash to Continue Statements¶
Since Python treats a newline as a statement terminator, and since statements are often more than is comfortable to put in one line, many people do:
if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
calculate_number(10, 20) != forbulate(500, 360):
pass
You should realize that this is dangerous: a stray space after the \
would
make this line wrong, and stray spaces are notoriously hard to see in editors.
In this case, at least it would be a syntax error, but if the code was:
value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
+ calculate_number(10, 20)*forbulate(500, 360)
then it would just be subtly wrong.
It is usually much better to use the implicit continuation inside parentheses. This version is bulletproof:
value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9]
+ calculate_number(10, 20)*forbulate(500, 360))
Functional Programming HOWTO¶
Author: | A.M. Kuchling |
---|---|
Release: | 0.31 |
In this document, we’ll take a tour of Python’s features suitable for
implementing programs in a functional style. After an introduction to the
concepts of functional programming, we’ll look at language features such as
iterators and generators and relevant library modules such as
itertools
and functools
.
Introduction¶
This section explains the basic concept of functional programming; if you’re just interested in learning about Python language features, skip to the next section.
Programming languages support decomposing problems in several different ways:
- Most programming languages are procedural: programs are lists of instructions that tell the computer what to do with the program’s input. C, Pascal, and even Unix shells are procedural languages.
- In declarative languages, you write a specification that describes the problem to be solved, and the language implementation figures out how to perform the computation efficiently. SQL is the declarative language you’re most likely to be familiar with; a SQL query describes the data set you want to retrieve, and the SQL engine decides whether to scan tables or use indexes, which subclauses should be performed first, etc.
- Object-oriented programs manipulate collections of objects. Objects have internal state and support methods that query or modify this internal state in some way. Smalltalk and Java are object-oriented languages. C++ and Python are languages that support object-oriented programming, but don’t force the use of object-oriented features.
- Functional programming decomposes a problem into a set of functions. Ideally, functions only take inputs and produce outputs, and don’t have any internal state that affects the output produced for a given input. Well-known functional languages include the ML family (Standard ML, OCaml, and other variants) and Haskell.
The designers of some computer languages choose to emphasize one particular approach to programming. This often makes it difficult to write programs that use a different approach. Other languages are multi-paradigm languages that support several different approaches. Lisp, C++, and Python are multi-paradigm; you can write programs or libraries that are largely procedural, object-oriented, or functional in all of these languages. In a large program, different sections might be written using different approaches; the GUI might be object-oriented while the processing logic is procedural or functional, for example.
In a functional program, input flows through a set of functions. Each function operates on its input and produces some output. Functional style discourages functions with side effects that modify internal state or make other changes that aren’t visible in the function’s return value. Functions that have no side effects at all are called purely functional. Avoiding side effects means not using data structures that get updated as a program runs; every function’s output must only depend on its input.
Some languages are very strict about purity and don’t even have assignment
statements such as a=3
or c = a + b
, but it’s difficult to avoid all
side effects. Printing to the screen or writing to a disk file are side
effects, for example. In Python, a print
statement or a
time.sleep(1)
both return no useful value; they’re only called for their
side effects of sending some text to the screen or pausing execution for a
second.
Python programs written in functional style usually won’t go to the extreme of avoiding all I/O or all assignments; instead, they’ll provide a functional-appearing interface but will use non-functional features internally. For example, the implementation of a function will still use assignments to local variables, but won’t modify global variables or have other side effects.
Functional programming can be considered the opposite of object-oriented programming. Objects are little capsules containing some internal state along with a collection of method calls that let you modify this state, and programs consist of making the right set of state changes. Functional programming wants to avoid state changes as much as possible and works with data flowing between functions. In Python you might combine the two approaches by writing functions that take and return instances representing objects in your application (e-mail messages, transactions, etc.).
Functional design may seem like an odd constraint to work under. Why should you avoid objects and side effects? There are theoretical and practical advantages to the functional style:
- Formal provability.
- Modularity.
- Composability.
- Ease of debugging and testing.
Formal provability¶
A theoretical benefit is that it’s easier to construct a mathematical proof that a functional program is correct.
For a long time researchers have been interested in finding ways to mathematically prove programs correct. This is different from testing a program on numerous inputs and concluding that its output is usually correct, or reading a program’s source code and concluding that the code looks right; the goal is instead a rigorous proof that a program produces the right result for all possible inputs.
The technique used to prove programs correct is to write down invariants, properties of the input data and of the program’s variables that are always true. For each line of code, you then show that if invariants X and Y are true before the line is executed, the slightly different invariants X’ and Y’ are true after the line is executed. This continues until you reach the end of the program, at which point the invariants should match the desired conditions on the program’s output.
Functional programming’s avoidance of assignments arose because assignments are difficult to handle with this technique; assignments can break invariants that were true before the assignment without producing any new invariants that can be propagated onward.
Unfortunately, proving programs correct is largely impractical and not relevant to Python software. Even trivial programs require proofs that are several pages long; the proof of correctness for a moderately complicated program would be enormous, and few or none of the programs you use daily (the Python interpreter, your XML parser, your web browser) could be proven correct. Even if you wrote down or generated a proof, there would then be the question of verifying the proof; maybe there’s an error in it, and you wrongly believe you’ve proved the program correct.
Modularity¶
A more practical benefit of functional programming is that it forces you to break apart your problem into small pieces. Programs are more modular as a result. It’s easier to specify and write a small function that does one thing than a large function that performs a complicated transformation. Small functions are also easier to read and to check for errors.
Ease of debugging and testing¶
Testing and debugging a functional-style program is easier.
Debugging is simplified because functions are generally small and clearly specified. When a program doesn’t work, each function is an interface point where you can check that the data are correct. You can look at the intermediate inputs and outputs to quickly isolate the function that’s responsible for a bug.
Testing is easier because each function is a potential subject for a unit test. Functions don’t depend on system state that needs to be replicated before running a test; instead you only have to synthesize the right input and then check that the output matches expectations.
Composability¶
As you work on a functional-style program, you’ll write a number of functions with varying inputs and outputs. Some of these functions will be unavoidably specialized to a particular application, but others will be useful in a wide variety of programs. For example, a function that takes a directory path and returns all the XML files in the directory, or a function that takes a filename and returns its contents, can be applied to many different situations.
Over time you’ll form a personal library of utilities. Often you’ll assemble new programs by arranging existing functions in a new configuration and writing a few functions specialized for the current task.
Iterators¶
I’ll start by looking at a Python language feature that’s an important foundation for writing functional-style programs: iterators.
An iterator is an object representing a stream of data; this object returns the
data one element at a time. A Python iterator must support a method called
next()
that takes no arguments and always returns the next element of the
stream. If there are no more elements in the stream, next()
must raise the
StopIteration
exception. Iterators don’t have to be finite, though; it’s
perfectly reasonable to write an iterator that produces an infinite stream of
data.
The built-in iter()
function takes an arbitrary object and tries to return
an iterator that will return the object’s contents or elements, raising
TypeError
if the object doesn’t support iteration. Several of Python’s
built-in data types support iteration, the most common being lists and
dictionaries. An object is called an iterable object if you can get an
iterator for it.
You can experiment with the iteration interface manually:
>>> L = [1,2,3]
>>> it = iter(L)
>>> print it
<...iterator object at ...>
>>> it.next()
1
>>> it.next()
2
>>> it.next()
3
>>> it.next()
Traceback (most recent call last):
File "<stdin>", line 1, in ?
StopIteration
>>>
Python expects iterable objects in several different contexts, the most
important being the for
statement. In the statement for X in Y
, Y must
be an iterator or some object for which iter()
can create an iterator.
These two statements are equivalent:
for i in iter(obj):
print i
for i in obj:
print i
Iterators can be materialized as lists or tuples by using the list()
or
tuple()
constructor functions:
>>> L = [1,2,3]
>>> iterator = iter(L)
>>> t = tuple(iterator)
>>> t
(1, 2, 3)
Sequence unpacking also supports iterators: if you know an iterator will return N elements, you can unpack them into an N-tuple:
>>> L = [1,2,3]
>>> iterator = iter(L)
>>> a,b,c = iterator
>>> a,b,c
(1, 2, 3)
Built-in functions such as max()
and min()
can take a single
iterator argument and will return the largest or smallest element. The "in"
and "not in"
operators also support iterators: X in iterator
is true if
X is found in the stream returned by the iterator. You’ll run into obvious
problems if the iterator is infinite; max()
, min()
, and "not in"
will never return, and if the element X never appears in the stream, the
"in"
operator won’t return either.
Note that you can only go forward in an iterator; there’s no way to get the
previous element, reset the iterator, or make a copy of it. Iterator objects
can optionally provide these additional capabilities, but the iterator protocol
only specifies the next()
method. Functions may therefore consume all of
the iterator’s output, and if you need to do something different with the same
stream, you’ll have to create a new iterator.
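For example, list() will consume the whole stream, leaving nothing for a second pass:

>>> it = iter([1, 2, 3])
>>> list(it)
[1, 2, 3]
>>> list(it)    # the iterator is now exhausted
[]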
Data Types That Support Iterators¶
We’ve already seen how lists and tuples support iterators. In fact, any Python sequence type, such as strings, will automatically support creation of an iterator.
Calling iter()
on a dictionary returns an iterator that will loop over the
dictionary’s keys:
>>> m = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
... 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
>>> for key in m:
... print key, m[key]
Mar 3
Feb 2
Aug 8
Sep 9
Apr 4
Jun 6
Jul 7
Jan 1
May 5
Nov 11
Dec 12
Oct 10
Note that the order is essentially random, because it’s based on the hash ordering of the objects in the dictionary.
Applying iter()
to a dictionary always loops over the keys, but dictionaries
have methods that return other iterators. If you want to iterate over keys,
values, or key/value pairs, you can explicitly call the iterkeys()
,
itervalues()
, or iteritems()
methods to get an appropriate iterator.
The dict()
constructor can accept an iterator that returns a finite stream
of (key, value)
tuples:
>>> L = [('Italy', 'Rome'), ('France', 'Paris'), ('US', 'Washington DC')]
>>> dict(iter(L))
{'Italy': 'Rome', 'US': 'Washington DC', 'France': 'Paris'}
Files also support iteration by calling the readline()
method until there
are no more lines in the file. This means you can read each line of a file like
this:
for line in file:
# do something for each line
...
Sets can take their contents from an iterable and let you iterate over the set’s elements:
S = set((2, 3, 5, 7, 11, 13))
for i in S:
print i
Generator expressions and list comprehensions¶
Two common operations on an iterator’s output are 1) performing some operation for every element, and 2) selecting a subset of elements that meet some condition. For example, given a list of strings, you might want to strip off trailing whitespace from each line or extract all the strings containing a given substring.
List comprehensions and generator expressions (short form: “listcomps” and “genexps”) are a concise notation for such operations, borrowed from the functional programming language Haskell (http://www.haskell.org/). You can strip all the whitespace from a stream of strings with the following code:
line_list = [' line 1\n', 'line 2 \n', ...]
# Generator expression -- returns iterator
stripped_iter = (line.strip() for line in line_list)
# List comprehension -- returns list
stripped_list = [line.strip() for line in line_list]
You can select only certain elements by adding an "if"
condition:
stripped_list = [line.strip() for line in line_list
if line != ""]
With a list comprehension, you get back a Python list; stripped_list
is a
list containing the resulting lines, not an iterator. Generator expressions
return an iterator that computes the values as necessary, not needing to
materialize all the values at once. This means that list comprehensions aren’t
useful if you’re working with iterators that return an infinite stream or a very
large amount of data. Generator expressions are preferable in these situations.
Generator expressions are surrounded by parentheses (“()”) and list comprehensions are surrounded by square brackets (“[]”). Generator expressions have the form:
( expression for expr in sequence1
if condition1
for expr2 in sequence2
if condition2
for expr3 in sequence3 ...
if condition3
for exprN in sequenceN
if conditionN )
Again, for a list comprehension only the outside brackets are different (square brackets instead of parentheses).
The elements of the generated output will be the successive values of
expression
. The if
clauses are all optional; if present, expression
is only evaluated and added to the result when condition
is true.
Generator expressions always have to be written inside parentheses, but the parentheses signalling a function call also count. If you want to create an iterator that will be immediately passed to a function you can write:
obj_total = sum(obj.count for obj in list_all_objects())
The for...in
clauses contain the sequences to be iterated over. The
sequences do not have to be the same length, because they are iterated over from
left to right, not in parallel. For each element in sequence1
,
sequence2
is looped over from the beginning. sequence3
is then looped
over for each resulting pair of elements from sequence1
and sequence2
.
To put it another way, a list comprehension or generator expression is equivalent to the following Python code:
for expr1 in sequence1:
if not (condition1):
continue # Skip this element
for expr2 in sequence2:
if not (condition2):
continue # Skip this element
...
for exprN in sequenceN:
if not (conditionN):
continue # Skip this element
# Output the value of
# the expression.
This means that when there are multiple for...in
clauses but no if
clauses, the length of the resulting output will be equal to the product of the
lengths of all the sequences. If you have two lists of length 3, the output
list is 9 elements long:
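>>> seq1 = ['a', 'b', 'c']
>>> seq2 = [1, 2, 3]
>>> [(x, y) for x in seq1 for y in seq2]
[('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3), ('c', 1), ('c', 2), ('c', 3)]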
To avoid introducing an ambiguity into Python’s grammar, if expression
is
creating a tuple, it must be surrounded with parentheses. The first list
comprehension below is a syntax error, while the second one is correct:
# Syntax error
[ x,y for x in seq1 for y in seq2]
# Correct
[ (x,y) for x in seq1 for y in seq2]
Generators¶
Generators are a special class of functions that simplify the task of writing iterators. Regular functions compute a value and return it, but generators return an iterator that returns a stream of values.
You’re doubtless familiar with how regular function calls work in Python or C.
When you call a function, it gets a private namespace where its local variables
are created. When the function reaches a return
statement, the local
variables are destroyed and the value is returned to the caller. A later call
to the same function creates a new private namespace and a fresh set of local
variables. But, what if the local variables weren’t thrown away on exiting a
function? What if you could later resume the function where it left off? This
is what generators provide; they can be thought of as resumable functions.
Here’s the simplest example of a generator function:
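def generate_ints(N):
    for i in range(N):
        yield i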
Any function containing a yield
keyword is a generator function; this is
detected by Python’s bytecode compiler which compiles the function
specially as a result.
When you call a generator function, it doesn’t return a single value; instead it
returns a generator object that supports the iterator protocol. On executing
the yield
expression, the generator outputs the value of i
, similar to a
return
statement. The big difference between yield
and a return
statement is that on reaching a yield
the generator’s state of execution is
suspended and local variables are preserved. On the next call to the
generator’s .next()
method, the function will resume executing.
Here’s a sample usage of the generate_ints()
generator:
>>> gen = generate_ints(3)
>>> gen
<generator object generate_ints at ...>
>>> gen.next()
0
>>> gen.next()
1
>>> gen.next()
2
>>> gen.next()
Traceback (most recent call last):
File "stdin", line 1, in ?
File "stdin", line 2, in generate_ints
StopIteration
You could equally write for i in generate_ints(5)
, or a,b,c =
generate_ints(3)
.
Inside a generator function, the return
statement can only be used without a
value, and signals the end of the procession of values; after executing a
return
the generator cannot return any further values. return
with a
value, such as return 5
, is a syntax error inside a generator function. The
end of the generator’s results can also be indicated by raising
StopIteration
manually, or by just letting the flow of execution fall off
the bottom of the function.
You could achieve the effect of generators manually by writing your own class
and storing all the local variables of the generator as instance variables. For
example, returning a list of integers could be done by setting self.count
to
0, and having the next()
method increment self.count
and return it.
However, for a moderately complicated generator, writing a corresponding class
can be much messier.
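For example, a rough hand-written equivalent of generate_ints() might look like this (a sketch; the class name is illustrative):

class GenerateInts(object):
    "Iterator equivalent of the generate_ints() generator."
    def __init__(self, N):
        self.count = 0
        self.N = N
    def __iter__(self):
        return self
    def next(self):
        if self.count >= self.N:
            raise StopIteration
        self.count += 1
        return self.count - 1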
The test suite included with Python’s library, test_generators.py
, contains
a number of more interesting examples. Here’s one generator that implements an
in-order traversal of a tree using generators recursively.
# A recursive generator that generates Tree leaves in in-order.
def inorder(t):
if t:
for x in inorder(t.left):
yield x
yield t.label
for x in inorder(t.right):
yield x
Two other examples in test_generators.py
produce solutions for the N-Queens
problem (placing N queens on an NxN chess board so that no queen threatens
another) and the Knight’s Tour (finding a route that takes a knight to every
square of an NxN chessboard without visiting any square twice).
Passing values into a generator¶
In Python 2.4 and earlier, generators only produced output. Once a generator’s code was invoked to create an iterator, there was no way to pass any new information into the function when its execution is resumed. You could hack together this ability by making the generator look at a global variable or by passing in some mutable object that callers then modify, but these approaches are messy.
In Python 2.5 there’s a simple way to pass values into a generator.
yield
became an expression, returning a value that can be assigned to
a variable or otherwise operated on:
val = (yield i)
I recommend that you always put parentheses around a yield
expression
when you’re doing something with the returned value, as in the above example.
The parentheses aren’t always necessary, but it’s easier to always add them
instead of having to remember when they’re needed.
(PEP 342 explains the exact rules, which are that a yield
-expression must
always be parenthesized except when it occurs at the top-level expression on the
right-hand side of an assignment. This means you can write val = yield i
but have to use parentheses when there’s an operation, as in val = (yield i)
+ 12
.)
Values are sent into a generator by calling its send(value)
method. This
method resumes the generator’s code and the yield
expression returns the
specified value. If the regular next()
method is called, the yield
returns None
.
Here’s a simple counter that increments by 1 and allows changing the value of the internal counter.
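def counter(maximum):
    i = 0
    while i < maximum:
        val = (yield i)
        # If a value was passed in with send(), use it as the new counter
        if val is not None:
            i = val
        else:
            i += 1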
And here’s an example of changing the counter:
>>> it = counter(10)
>>> print it.next()
0
>>> print it.next()
1
>>> print it.send(8)
8
>>> print it.next()
9
>>> print it.next()
Traceback (most recent call last):
File "t.py", line 15, in ?
print it.next()
StopIteration
Because yield
will often be returning None
, you should always check for
this case. Don’t just use its value in expressions unless you’re sure that the
send()
method will be the only method used to resume your generator function.
In addition to send()
, there are two other new methods on generators:
- throw(type, value=None, traceback=None) is used to raise an exception inside the generator; the exception is raised by the yield expression where the generator's execution is paused.
- close() raises a GeneratorExit exception inside the generator to terminate the iteration. On receiving this exception, the generator's code must either raise GeneratorExit or StopIteration; catching the exception and doing anything else is illegal and will trigger a RuntimeError. close() will also be called by Python's garbage collector when the generator is garbage-collected.

If you need to run cleanup code when a GeneratorExit occurs, I suggest using a try: ... finally: suite instead of catching GeneratorExit.
The cumulative effect of these changes is to turn generators from one-way producers of information into both producers and consumers.
Generators also become coroutines, a more generalized form of subroutines.
Subroutines are entered at one point and exited at another point (the top of the
function, and a return
statement), but coroutines can be entered, exited,
and resumed at many different points (the yield
statements).
Built-in functions¶
Let’s look in more detail at built-in functions often used with iterators.
Two of Python’s built-in functions, map()
and filter()
, are somewhat
obsolete; they duplicate the features of list comprehensions but return actual
lists instead of iterators.
map(f, iterA, iterB, ...)
returns a list containing f(iterA[0], iterB[0]),
f(iterA[1], iterB[1]), f(iterA[2], iterB[2]), ...
.
>>> def upper(s):
... return s.upper()
>>> map(upper, ['sentence', 'fragment'])
['SENTENCE', 'FRAGMENT']
>>> [upper(s) for s in ['sentence', 'fragment']]
['SENTENCE', 'FRAGMENT']
As shown above, you can achieve the same effect with a list comprehension. The
itertools.imap()
function does the same thing but can handle infinite
iterators; it’ll be discussed later, in the section on the itertools
module.
filter(predicate, iter)
returns a list that contains all the sequence
elements that meet a certain condition, and is similarly duplicated by list
comprehensions. A predicate is a function that returns the truth value of
some condition; for use with filter()
, the predicate must take a single
value.
>>> def is_even(x):
... return (x % 2) == 0
>>> filter(is_even, range(10))
[0, 2, 4, 6, 8]
This can also be written as a list comprehension:
>>> [x for x in range(10) if is_even(x)]
[0, 2, 4, 6, 8]
filter()
also has a counterpart in the itertools
module,
itertools.ifilter()
, that returns an iterator and can therefore handle
infinite sequences just as itertools.imap()
can.
reduce(func, iter, [initial_value])
doesn’t have a counterpart in the
itertools
module because it cumulatively performs an operation on all the
iterable’s elements and therefore can’t be applied to infinite iterables.
func
must be a function that takes two elements and returns a single value.
reduce()
takes the first two elements A and B returned by the iterator and
calculates func(A, B)
. It then requests the third element, C, calculates
func(func(A, B), C)
, combines this result with the fourth element returned,
and continues until the iterable is exhausted. If the iterable returns no
values at all, a TypeError
exception is raised. If the initial value is
supplied, it’s used as a starting point and func(initial_value, A)
is the
first calculation.
>>> import operator
>>> reduce(operator.concat, ['A', 'BB', 'C'])
'ABBC'
>>> reduce(operator.concat, [])
Traceback (most recent call last):
...
TypeError: reduce() of empty sequence with no initial value
>>> reduce(operator.mul, [1,2,3], 1)
6
>>> reduce(operator.mul, [], 1)
1
If you use operator.add()
with reduce()
, you’ll add up all the
elements of the iterable. This case is so common that there’s a special
built-in called sum()
to compute it:
>>> reduce(operator.add, [1,2,3,4], 0)
10
>>> sum([1,2,3,4])
10
>>> sum([])
0
For many uses of reduce()
, though, it can be clearer to just write the
obvious for
loop:
# Instead of:
product = reduce(operator.mul, [1,2,3], 1)
# You can write:
product = 1
for i in [1,2,3]:
    product *= i
enumerate(iter)
counts off the elements in the iterable, returning 2-tuples
containing the count and each element.
>>> for item in enumerate(['subject', 'verb', 'object']):
... print item
(0, 'subject')
(1, 'verb')
(2, 'object')
enumerate()
is often used when looping through a list and recording the
indexes at which certain conditions are met:
f = open('data.txt', 'r')
for i, line in enumerate(f):
    if line.strip() == '':
        print 'Blank line at line #%i' % i
sorted(iterable, [cmp=None], [key=None], [reverse=False])
collects all the
elements of the iterable into a list, sorts the list, and returns the sorted
result. The cmp
, key
, and reverse
arguments are passed through to
the constructed list’s .sort()
method.
>>> import random
>>> # Generate 8 random numbers between [0, 10000)
>>> rand_list = random.sample(range(10000), 8)
>>> rand_list
[769, 7953, 9828, 6431, 8442, 9878, 6213, 2207]
>>> sorted(rand_list)
[769, 2207, 6213, 6431, 7953, 8442, 9828, 9878]
>>> sorted(rand_list, reverse=True)
[9878, 9828, 8442, 7953, 6431, 6213, 2207, 769]
(For a more detailed discussion of sorting, see the Sorting mini-HOWTO in the Python wiki at http://wiki.python.org/moin/HowTo/Sorting.)
The any(iter)
and all(iter)
built-ins look at the truth values of an
iterable’s contents. any()
returns True if any element in the iterable is
a true value, and all()
returns True if all of the elements are true
values:
>>> any([0,1,0])
True
>>> any([0,0,0])
False
>>> any([1,1,1])
True
>>> all([0,1,0])
False
>>> all([0,0,0])
False
>>> all([1,1,1])
True
Small functions and the lambda expression¶
When writing functional-style programs, you’ll often need little functions that act as predicates or that combine elements in some way.
If there’s a Python built-in or a module function that’s suitable, you don’t need to define a new function at all:
stripped_lines = [line.strip() for line in lines]
existing_files = filter(os.path.exists, file_list)
If the function you need doesn’t exist, you need to write it. One way to write
small functions is to use the lambda expression. lambda takes a number
of parameters and an expression combining these parameters, and creates a small
function that returns the value of the expression:
lowercase = lambda x: x.lower()
print_assign = lambda name, value: name + '=' + str(value)
adder = lambda x, y: x+y
An alternative is to just use the def
statement and define a function in the
usual way:
def lowercase(x):
    return x.lower()

def print_assign(name, value):
    return name + '=' + str(value)

def adder(x, y):
    return x + y
Which alternative is preferable? That’s a style question; my usual course is to
avoid using lambda
.
One reason for my preference is that lambda
is quite limited in the
functions it can define. The result has to be computable as a single
expression, which means you can’t have multiway if... elif... else
comparisons or try... except
statements. If you try to do too much in a
lambda, you'll end up with an overly complicated expression that's
hard to read. Quick, what’s the following code doing?
total = reduce(lambda a, b: (0, a[1] + b[1]), items)[1]
You can figure it out, but it takes time to disentangle the expression to figure
out what's going on. Using a short nested def statement makes things a
little bit better:
def combine(a, b):
    return 0, a[1] + b[1]

total = reduce(combine, items)[1]
But it would be best of all if I had simply used a for
loop:
total = 0
for a, b in items:
    total += b
Or the sum()
built-in and a generator expression:
total = sum(b for a,b in items)
Many uses of reduce()
are clearer when written as for
loops.
Fredrik Lundh once suggested the following set of rules for refactoring uses of
lambda
:
- Write a lambda function.
- Write a comment explaining what the heck that lambda does.
- Study the comment for a while, and think of a name that captures the essence of the comment.
- Convert the lambda to a def statement, using that name.
- Remove the comment.
I really like these rules, but you’re free to disagree about whether this lambda-free style is better.
The itertools module¶
The itertools
module contains a number of commonly-used iterators as well
as functions for combining several iterators. This section will introduce the
module’s contents by showing small examples.
The module’s functions fall into a few broad classes:
- Functions that create a new iterator based on an existing iterator.
- Functions for treating an iterator’s elements as function arguments.
- Functions for selecting portions of an iterator’s output.
- A function for grouping an iterator’s output.
Creating new iterators¶
itertools.count(n)
returns an infinite stream of integers, increasing by 1
each time. You can optionally supply the starting number, which defaults to 0:
itertools.count() =>
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
itertools.count(10) =>
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ...
itertools.cycle(iter)
saves a copy of the contents of a provided iterable
and returns a new iterator that returns its elements from first to last. The
new iterator will repeat these elements infinitely.
itertools.cycle([1,2,3,4,5]) =>
1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ...
itertools.repeat(elem, [n])
returns the provided element n
times, or
returns the element endlessly if n
is not provided.
itertools.repeat('abc') =>
abc, abc, abc, abc, abc, abc, abc, abc, abc, abc, ...
itertools.repeat('abc', 5) =>
abc, abc, abc, abc, abc
itertools.chain(iterA, iterB, ...)
takes an arbitrary number of iterables as
input, and returns all the elements of the first iterator, then all the elements
of the second, and so on, until all of the iterables have been exhausted.
itertools.chain(['a', 'b', 'c'], (1, 2, 3)) =>
a, b, c, 1, 2, 3
itertools.izip(iterA, iterB, ...)
takes one element from each iterable and
returns them in a tuple:
itertools.izip(['a', 'b', 'c'], (1, 2, 3)) =>
('a', 1), ('b', 2), ('c', 3)
It’s similar to the built-in zip()
function, but doesn’t construct an
in-memory list and exhaust all the input iterators before returning; instead
tuples are constructed and returned only if they’re requested. (The technical
term for this behaviour is lazy evaluation.)
This iterator is intended to be used with iterables that are all of the same length. If the iterables are of different lengths, the resulting stream will be the same length as the shortest iterable.
itertools.izip(['a', 'b'], (1, 2, 3)) =>
('a', 1), ('b', 2)
You should avoid doing this, though, because an element may be taken from the longer iterators and discarded. This means you can’t go on to use the iterators further because you risk skipping a discarded element.
itertools.islice(iter, [start], stop, [step])
returns a stream that’s a
slice of the iterator. With a single stop
argument, it will return the
first stop
elements. If you supply a starting index, you’ll get
stop-start
elements, and if you supply a value for step
, elements will
be skipped accordingly. Unlike Python’s string and list slicing, you can’t use
negative values for start
, stop
, or step
.
itertools.islice(range(10), 8) =>
0, 1, 2, 3, 4, 5, 6, 7
itertools.islice(range(10), 2, 8) =>
2, 3, 4, 5, 6, 7
itertools.islice(range(10), 2, 8, 2) =>
2, 4, 6
itertools.tee(iter, [n])
replicates an iterator; it returns n
independent iterators that will all return the contents of the source iterator.
If you don’t supply a value for n
, the default is 2. Replicating iterators
requires saving some of the contents of the source iterator, so this can consume
significant memory if the iterator is large and one of the new iterators is
consumed more than the others.
itertools.tee( itertools.count() ) =>
iterA, iterB
where iterA ->
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
and iterB ->
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
Calling functions on elements¶
Two functions are used for calling other functions on the contents of an iterable.
itertools.imap(f, iterA, iterB, ...)
returns a stream containing
f(iterA[0], iterB[0]), f(iterA[1], iterB[1]), f(iterA[2], iterB[2]), ...
:
itertools.imap(operator.add, [5, 6, 5], [1, 2, 3]) =>
6, 8, 8
The operator
module contains a set of functions corresponding to Python’s
operators. Some examples are operator.add(a, b)
(adds two values),
operator.ne(a, b)
(same as a!=b
), and operator.attrgetter('id')
(returns a callable that fetches the "id"
attribute).
itertools.starmap(func, iter)
assumes that the iterable will return a stream
of tuples, and calls func() using these tuples as the arguments:
itertools.starmap(os.path.join,
                  [('/usr', 'bin', 'java'), ('/bin', 'python'),
                   ('/usr', 'bin', 'perl'), ('/usr', 'bin', 'ruby')])
=>
/usr/bin/java, /bin/python, /usr/bin/perl, /usr/bin/ruby
Selecting elements¶
Another group of functions chooses a subset of an iterator’s elements based on a predicate.
itertools.ifilter(predicate, iter)
returns all the elements for which the
predicate returns true:
def is_even(x):
    return (x % 2) == 0
itertools.ifilter(is_even, itertools.count()) =>
0, 2, 4, 6, 8, 10, 12, 14, ...
itertools.ifilterfalse(predicate, iter)
is the opposite, returning all
elements for which the predicate returns false:
itertools.ifilterfalse(is_even, itertools.count()) =>
1, 3, 5, 7, 9, 11, 13, 15, ...
itertools.takewhile(predicate, iter)
returns elements for as long as the
predicate returns true. Once the predicate returns false, the iterator will
signal the end of its results.
def less_than_10(x):
    return (x < 10)
itertools.takewhile(less_than_10, itertools.count()) =>
0, 1, 2, 3, 4, 5, 6, 7, 8, 9
itertools.takewhile(is_even, itertools.count()) =>
0
itertools.dropwhile(predicate, iter)
discards elements while the predicate
returns true, and then returns the rest of the iterable’s results.
itertools.dropwhile(less_than_10, itertools.count()) =>
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ...
itertools.dropwhile(is_even, itertools.count()) =>
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...
Grouping elements¶
The last function I’ll discuss, itertools.groupby(iter, key_func=None)
, is
the most complicated. key_func(elem)
is a function that can compute a key
value for each element returned by the iterable. If you don’t supply a key
function, the key is simply each element itself.
groupby()
collects all the consecutive elements from the underlying iterable
that have the same key value, and returns a stream of 2-tuples containing a key
value and an iterator for the elements with that key.
city_list = [('Decatur', 'AL'), ('Huntsville', 'AL'), ('Selma', 'AL'),
             ('Anchorage', 'AK'), ('Nome', 'AK'),
             ('Flagstaff', 'AZ'), ('Phoenix', 'AZ'), ('Tucson', 'AZ'),
             ...
            ]

def get_state ((city, state)):
    return state
itertools.groupby(city_list, get_state) =>
('AL', iterator-1),
('AK', iterator-2),
('AZ', iterator-3), ...
where
iterator-1 =>
('Decatur', 'AL'), ('Huntsville', 'AL'), ('Selma', 'AL')
iterator-2 =>
('Anchorage', 'AK'), ('Nome', 'AK')
iterator-3 =>
('Flagstaff', 'AZ'), ('Phoenix', 'AZ'), ('Tucson', 'AZ')
groupby()
assumes that the underlying iterable’s contents will already be
sorted based on the key. Note that the returned iterators also use the
underlying iterable, so you have to consume the results of iterator-1 before
requesting iterator-2 and its corresponding key.
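As a minimal sketch of this consecutive-key behaviour, here's a runnable example using an invented list of plain integers (the key function is omitted, so each element is its own key):

import itertools

for key, group in itertools.groupby([1, 1, 2, 2, 2, 3, 1]):
    # consume each group before advancing to the next one
    print key, list(group)

# Output:
# 1 [1, 1]
# 2 [2, 2, 2]
# 3 [3]
# 1 [1]    <- the trailing 1 starts a new group: only consecutive
#             elements with equal keys are grouped, hence the need to sort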
The functools module¶
The functools
module in Python 2.5 contains some higher-order functions.
A higher-order function takes one or more functions as input and returns a
new function. The most useful tool in this module is the
functools.partial()
function.
For programs written in a functional style, you’ll sometimes want to construct
variants of existing functions that have some of the parameters filled in.
Consider a Python function f(a, b, c)
; you may wish to create a new function
g(b, c)
that’s equivalent to f(1, b, c)
; you’re filling in a value for
one of f()
‘s parameters. This is called “partial function application”.
The constructor for partial
takes the arguments (function, arg1, arg2,
... kwarg1=value1, kwarg2=value2)
. The resulting object is callable, so you
can just call it to invoke function
with the filled-in arguments.
Here’s a small but realistic example:
import functools
def log(message, subsystem):
    "Write the contents of 'message' to the specified subsystem."
    print '%s: %s' % (subsystem, message)
...
server_log = functools.partial(log, subsystem='server')
server_log('Unable to open socket')
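The final call prints server: Unable to open socket, exactly as if log('Unable to open socket', subsystem='server') had been written out in full.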
The operator module¶
The operator
module was mentioned earlier. It contains a set of
functions corresponding to Python’s operators. These functions are often useful
in functional-style code because they save you from writing trivial functions
that perform a single operation.
Some of the functions in this module are:
- Math operations: add(), sub(), mul(), div(), floordiv(), abs(), ...
- Logical operations: not_(), truth().
- Bitwise operations: and_(), or_(), invert().
- Comparisons: eq(), ne(), lt(), le(), gt(), and ge().
- Object identity: is_(), is_not().
Consult the operator module’s documentation for a complete list.
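Here's a small sketch of how these functions replace trivial lambdas; the input lists are invented for illustration:

import operator

# operator.mul stands in for "lambda a, b: a * b"
print reduce(operator.mul, [1, 2, 3, 4])      # prints 24

# operator.not_ is logical negation as a plain function
print map(operator.not_, [0, 1, '', 'x'])     # prints [True, False, True, False]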
Revision History and Acknowledgements¶
The author would like to thank the following people for offering suggestions, corrections and assistance with various drafts of this article: Ian Bicking, Nick Coghlan, Nick Efford, Raymond Hettinger, Jim Jewett, Mike Krell, Leandro Lameiro, Jussi Salmela, Collin Winter, Blake Winton.
Version 0.1: posted June 30 2006.
Version 0.11: posted July 1 2006. Typo fixes.
Version 0.2: posted July 10 2006. Merged genexp and listcomp sections into one. Typo fixes.
Version 0.21: Added more references suggested on the tutor mailing list.
Version 0.30: Adds a section on the functional
module written by Collin
Winter; adds short section on the operator module; a few other edits.
References¶
General¶
Structure and Interpretation of Computer Programs, by Harold Abelson and Gerald Jay Sussman with Julie Sussman. Full text at http://mitpress.mit.edu/sicp/. In this classic textbook of computer science, chapters 2 and 3 discuss the use of sequences and streams to organize the data flow inside a program. The book uses Scheme for its examples, but many of the design approaches described in these chapters are applicable to functional-style Python code.
http://www.defmacro.org/ramblings/fp.html: A general introduction to functional programming that uses Java examples and has a lengthy historical introduction.
http://en.wikipedia.org/wiki/Functional_programming: General Wikipedia entry describing functional programming.
http://en.wikipedia.org/wiki/Coroutine: Entry for coroutines.
http://en.wikipedia.org/wiki/Currying: Entry for the concept of currying.
Python-specific¶
http://gnosis.cx/TPiP/: The first chapter of David Mertz’s book Text Processing in Python discusses functional programming for text processing, in the section titled “Utilizing Higher-Order Functions in Text Processing”.
Mertz also wrote a 3-part series of articles on functional programming for IBM’s DeveloperWorks site; see
Logging HOWTO¶
Author: | Vinay Sajip <vinay_sajip at red-dove dot com> |
---|
Basic Logging Tutorial¶
Logging is a means of tracking events that happen when some software runs. The software’s developer adds logging calls to their code to indicate that certain events have occurred. An event is described by a descriptive message which can optionally contain variable data (i.e. data that is potentially different for each occurrence of the event). Events also have an importance which the developer ascribes to the event; the importance can also be called the level or severity.
When to use logging¶
Logging provides a set of convenience functions for simple logging usage. These
are debug()
, info()
, warning()
, error()
and
critical()
. To determine when to use logging, see the table below, which
states, for each of a set of common tasks, the best tool to use for it.
Task you want to perform | The best tool for the task |
---|---|
Display console output for ordinary usage of a command line script or program | print() |
Report events that occur during normal operation of a program (e.g. for status monitoring or fault investigation) | logging.info() (or logging.debug() for very detailed output for diagnostic purposes) |
Issue a warning regarding a particular runtime event | warnings.warn() in library code if the issue is avoidable and the client application should be modified to eliminate the warning; logging.warning() if there is nothing the client application can do about the situation, but the event should still be noted |
Report an error regarding a particular runtime event | Raise an exception |
Report suppression of an error without raising an exception (e.g. error handler in a long-running server process) | logging.error(), logging.exception() or logging.critical() as appropriate for the specific error and application domain |
The logging functions are named after the level or severity of the events they are used to track. The standard levels and their applicability are described below (in increasing order of severity):
Level | When it's used |
---|---|
DEBUG | Detailed information, typically of interest only when diagnosing problems. |
INFO | Confirmation that things are working as expected. |
WARNING | An indication that something unexpected happened, or indicative of some problem in the near future (e.g. 'disk space low'). The software is still working as expected. |
ERROR | Due to a more serious problem, the software has not been able to perform some function. |
CRITICAL | A serious error, indicating that the program itself may be unable to continue running. |
The default level is WARNING
, which means that only events of this level
and above will be tracked, unless the logging package is configured to do
otherwise.
Events that are tracked can be handled in different ways. The simplest way of handling tracked events is to print them to the console. Another common way is to write them to a disk file.
A simple example¶
A very simple example is:
import logging
logging.warning('Watch out!') # will print a message to the console
logging.info('I told you so') # will not print anything
If you type these lines into a script and run it, you’ll see:
WARNING:root:Watch out!
printed out on the console. The INFO
message doesn’t appear because the
default level is WARNING
. The printed message includes the indication of
the level and the description of the event provided in the logging call, i.e.
‘Watch out!’. Don’t worry about the ‘root’ part for now: it will be explained
later. The actual output can be formatted quite flexibly if you need that;
formatting options will also be explained later.
Logging to a file¶
A very common situation is that of recording logging events in a file, so let’s look at that next:
import logging
logging.basicConfig(filename='example.log',level=logging.DEBUG)
logging.debug('This message should go to the log file')
logging.info('So should this')
logging.warning('And this, too')
And now if we open the file and look at what we have, we should find the log messages:
DEBUG:root:This message should go to the log file
INFO:root:So should this
WARNING:root:And this, too
This example also shows how you can set the logging level which acts as the
threshold for tracking. In this case, because we set the threshold to
DEBUG
, all of the messages were printed.
If you want to set the logging level from a command-line option such as:
--log=INFO
and you have the value of the parameter passed for --log
in some variable
loglevel, you can use:
getattr(logging, loglevel.upper())
to get the value which you’ll pass to basicConfig()
via the level
argument. You may want to error check any user input value, perhaps as in the
following example:
# assuming loglevel is bound to the string value obtained from the
# command line argument. Convert to upper case to allow the user to
# specify --log=DEBUG or --log=debug
numeric_level = getattr(logging, loglevel.upper(), None)
if not isinstance(numeric_level, int):
    raise ValueError('Invalid log level: %s' % loglevel)
logging.basicConfig(level=numeric_level, ...)
The call to basicConfig()
should come before any calls to debug()
,
info()
etc. As it’s intended as a one-off simple configuration facility,
only the first call will actually do anything: subsequent calls are effectively
no-ops.
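A sketch of this one-off behaviour (the file names here are invented for illustration):

import logging

logging.basicConfig(filename='first.log')    # configures the root logger
logging.basicConfig(filename='second.log')   # no-op: root is already configured
logging.warning('this message ends up in first.log')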
If you run the above script several times, the messages from successive runs are appended to the file example.log. If you want each run to start afresh, not remembering the messages from earlier runs, you can specify the filemode argument, by changing the call in the above example to:
logging.basicConfig(filename='example.log', filemode='w', level=logging.DEBUG)
The output will be the same as before, but the log file is no longer appended to, so the messages from earlier runs are lost.
Logging from multiple modules¶
If your program consists of multiple modules, here’s an example of how you could organize logging in it:
# myapp.py
import logging
import mylib

def main():
    logging.basicConfig(filename='myapp.log', level=logging.INFO)
    logging.info('Started')
    mylib.do_something()
    logging.info('Finished')

if __name__ == '__main__':
    main()
# mylib.py
import logging

def do_something():
    logging.info('Doing something')
If you run myapp.py, you should see this in myapp.log:
INFO:root:Started
INFO:root:Doing something
INFO:root:Finished
which is hopefully what you were expecting to see. You can generalize this to multiple modules, using the pattern in mylib.py. Note that for this simple usage pattern, you won’t know, by looking in the log file, where in your application your messages came from, apart from looking at the event description. If you want to track the location of your messages, you’ll need to refer to the documentation beyond the tutorial level – see Advanced Logging Tutorial.
Logging variable data¶
To log variable data, use a format string for the event description message and append the variable data as arguments. For example:
import logging
logging.warning('%s before you %s', 'Look', 'leap!')
will display:
WARNING:root:Look before you leap!
As you can see, merging of variable data into the event description message
uses the old, %-style of string formatting. This is for backwards
compatibility: the logging package pre-dates newer formatting options such as
str.format()
and string.Template
. These newer formatting
options are supported, but exploring them is outside the scope of this
tutorial.
Changing the format of displayed messages¶
To change the format which is used to display messages, you need to specify the format you want to use:
import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)
logging.debug('This message should appear on the console')
logging.info('So should this')
logging.warning('And this, too')
which would print:
DEBUG:This message should appear on the console
INFO:So should this
WARNING:And this, too
Notice that the ‘root’ which appeared in earlier examples has disappeared. For a full set of things that can appear in format strings, you can refer to the documentation for logrecord-attributes, but for simple usage, you just need the levelname (severity), message (event description, including variable data) and perhaps to display when the event occurred. This is described in the next section.
Displaying the date/time in messages¶
To display the date and time of an event, you would place ‘%(asctime)s’ in your format string:
import logging
logging.basicConfig(format='%(asctime)s %(message)s')
logging.warning('is when this event was logged.')
which should print something like this:
2010-12-12 11:41:42,612 is when this event was logged.
The default format for date/time display (shown above) is ISO8601. If you need
more control over the formatting of the date/time, provide a datefmt
argument to basicConfig
, as in this example:
import logging
logging.basicConfig(format='%(asctime)s %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p')
logging.warning('is when this event was logged.')
which would display something like this:
12/12/2010 11:46:36 AM is when this event was logged.
The format of the datefmt argument is the same as supported by
time.strftime()
.
Next Steps¶
That concludes the basic tutorial. It should be enough to get you up and running with logging. There’s a lot more that the logging package offers, but to get the best out of it, you’ll need to invest a little more of your time in reading the following sections. If you’re ready for that, grab some of your favourite beverage and carry on.
If your logging needs are simple, then use the above examples to incorporate logging into your own scripts, and if you run into problems or don’t understand something, please post a question on the comp.lang.python Usenet group (available at http://groups.google.com/group/comp.lang.python) and you should receive help before too long.
Still here? You can carry on reading the next few sections, which provide a slightly more advanced/in-depth tutorial than the basic one above. After that, you can take a look at the Logging Cookbook.
Advanced Logging Tutorial¶
The logging library takes a modular approach and offers several categories of components: loggers, handlers, filters, and formatters.
- Loggers expose the interface that application code directly uses.
- Handlers send the log records (created by loggers) to the appropriate destination.
- Filters provide a finer grained facility for determining which log records to output.
- Formatters specify the layout of log records in the final output.
Logging is performed by calling methods on instances of the Logger
class (hereafter called loggers). Each instance has a name, and they are
conceptually arranged in a namespace hierarchy using dots (periods) as
separators. For example, a logger named ‘scan’ is the parent of loggers
‘scan.text’, ‘scan.html’ and ‘scan.pdf’. Logger names can be anything you want,
and indicate the area of an application in which a logged message originates.
A good convention to use when naming loggers is to use a module-level logger, in each module which uses logging, named as follows:
logger = logging.getLogger(__name__)
This means that logger names track the package/module hierarchy, and it’s intuitively obvious where events are logged just from the logger name.
The root of the hierarchy of loggers is called the root logger. That’s the
logger used by the functions debug()
, info()
, warning()
,
error()
and critical()
, which just call the same-named method of
the root logger. The functions and the methods have the same signatures. The
root logger’s name is printed as ‘root’ in the logged output.
It is, of course, possible to log messages to different destinations. Support is included in the package for writing log messages to files, HTTP GET/POST locations, email via SMTP, generic sockets, or OS-specific logging mechanisms such as syslog or the Windows NT event log. Destinations are served by handler classes. You can create your own log destination class if you have special requirements not met by any of the built-in handler classes.
By default, no destination is set for any logging messages. You can specify
a destination (such as console or file) by using basicConfig()
as in the
tutorial examples. If you call the functions debug()
, info()
,
warning()
, error()
and critical()
, they will check whether a destination is set; if it is not, they will set a destination
of the console (sys.stderr
) and a default format for the displayed
message before delegating to the root logger to do the actual message output.
The default format set by basicConfig()
for messages is:
severity:logger name:message
You can change this by passing a format string to basicConfig()
with the
format keyword argument. For all options regarding how a format string is
constructed, see formatter-objects.
Loggers¶
Logger
objects have a threefold job. First, they expose several
methods to application code so that applications can log messages at runtime.
Second, logger objects determine which log messages to act upon based upon
severity (the default filtering facility) or filter objects. Third, logger
objects pass along relevant log messages to all interested log handlers.
The most widely used methods on logger objects fall into two categories: configuration and message sending.
These are the most common configuration methods:
- Logger.setLevel() specifies the lowest-severity log message a logger will handle, where debug is the lowest built-in severity level and critical is the highest built-in severity. For example, if the severity level is INFO, the logger will handle only INFO, WARNING, ERROR, and CRITICAL messages and will ignore DEBUG messages.
- Logger.addHandler() and Logger.removeHandler() add and remove handler objects from the logger object. Handlers are covered in more detail in Handlers.
- Logger.addFilter() and Logger.removeFilter() add and remove filter objects from the logger object. Filters are covered in more detail in filter.
You don’t need to always call these methods on every logger you create. See the last two paragraphs in this section.
With the logger object configured, the following methods create log messages:
- Logger.debug(), Logger.info(), Logger.warning(), Logger.error(), and Logger.critical() all create log records with a message and a level that corresponds to their respective method names. The message is actually a format string, which may contain the standard string substitution syntax of %s, %d, %f, and so on. The rest of their arguments is a list of objects that correspond with the substitution fields in the message. With regard to **kwargs, the logging methods care only about a keyword of exc_info and use it to determine whether to log exception information.
- Logger.exception() creates a log message similar to Logger.error(). The difference is that Logger.exception() dumps a stack trace along with it. Call this method only from an exception handler.
- Logger.log() takes a log level as an explicit argument. This is a little more verbose for logging messages than using the log level convenience methods listed above, but this is how to log at custom log levels.
getLogger()
returns a reference to a logger instance with the specified
name if it is provided, or root
if not. The names are period-separated
hierarchical structures. Multiple calls to getLogger()
with the same name
will return a reference to the same logger object. Loggers that are further
down in the hierarchical list are children of loggers higher up in the list.
For example, given a logger with a name of foo
, loggers with names of
foo.bar
, foo.bar.baz
, and foo.bam
are all descendants of foo
.
Loggers have a concept of effective level. If a level is not explicitly set
on a logger, the level of its parent is used instead as its effective level.
If the parent has no explicit level set, its parent is examined, and so on -
all ancestors are searched until an explicitly set level is found. The root
logger always has an explicit level set (WARNING
by default). When deciding
whether to process an event, the effective level of the logger is used to
determine whether the event is passed to the logger’s handlers.
Child loggers propagate messages up to the handlers associated with their ancestor loggers. Because of this, it is unnecessary to define and configure handlers for all the loggers an application uses. It is sufficient to configure handlers for a top-level logger and create child loggers as needed. (You can, however, turn off propagation by setting the propagate attribute of a logger to False.)
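Here's a minimal sketch of both mechanisms; the logger names 'app' and 'app.db' are invented for illustration:

import logging
import sys

parent = logging.getLogger('app')
parent.addHandler(logging.StreamHandler(sys.stdout))
parent.setLevel(logging.INFO)

child = logging.getLogger('app.db')   # no level or handlers of its own
child.info('emitted: effective level INFO, dispatched by the app handler')
child.debug('dropped: DEBUG is below the inherited effective level')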
Handlers¶
Handler
objects are responsible for dispatching the
appropriate log messages (based on the log messages’ severity) to the handler’s
specified destination. Logger objects can add zero or more handler objects to
themselves with an addHandler()
method. As an example scenario, an
application may want to send all log messages to a log file, all log messages
of error or higher to stdout, and all messages of critical to an email address.
This scenario requires three individual handlers where each handler is
responsible for sending messages of a specific severity to a specific location.
The standard library includes quite a few handler types (see
Useful Handlers); the tutorials use mainly StreamHandler
and
FileHandler
in its examples.
There are very few methods in a handler for application developers to concern themselves with. The only handler methods that seem relevant for application developers who are using the built-in handler objects (that is, not creating custom handlers) are the following configuration methods:
- The Handler.setLevel() method, just as in logger objects, specifies the lowest severity that will be dispatched to the appropriate destination. Why are there two setLevel() methods? The level set in the logger determines which severity of messages it will pass to its handlers. The level set in each handler determines which messages that handler will send on.
- setFormatter() selects a Formatter object for this handler to use.
- addFilter() and removeFilter() respectively configure and deconfigure filter objects on handlers.
Application code should not directly instantiate and use instances of
Handler
. Instead, the Handler
class is a base class that
defines the interface that all handlers should have and establishes some
default behavior that child classes can use (or override).
Formatters¶
Formatter objects configure the final order, structure, and contents of the log
message. Unlike the base logging.Handler
class, application code may
instantiate formatter classes, although you could likely subclass the formatter
if your application needs special behavior. The constructor takes two
optional arguments – a message format string and a date format string.
- logging.Formatter.__init__(fmt=None, datefmt=None)¶
If there is no message format string, the default is to use the raw message. If there is no date format string, the default date format is:
%Y-%m-%d %H:%M:%S
with the milliseconds tacked on at the end.
The message format string uses %(<dictionary key>)s
styled string
substitution; the possible keys are documented in logrecord-attributes.
The following message format string will log the time in a human-readable format, the severity of the message, and the contents of the message, in that order:
'%(asctime)s - %(levelname)s - %(message)s'
Formatters use a user-configurable function to convert the creation time of a
record to a tuple. By default, time.localtime()
is used; to change this
for a particular formatter instance, set the converter
attribute of the
instance to a function with the same signature as time.localtime()
or
time.gmtime()
. To change it for all formatters, for example if you want
all logging times to be shown in GMT, set the converter
attribute in the
Formatter class (to time.gmtime
for GMT display).
Configuring Logging¶
Programmers can configure logging in three ways:
- Creating loggers, handlers, and formatters explicitly using Python code that calls the configuration methods listed above.
- Creating a logging config file and reading it using the
fileConfig()
function. - Creating a dictionary of configuration information and passing it
to the
dictConfig()
function.
For the reference documentation on the last two options, see logging-config-api. The following example configures a very simple logger, a console handler, and a simple formatter using Python code:
import logging
# create logger
logger = logging.getLogger('simple_example')
logger.setLevel(logging.DEBUG)
# create console handler and set level to debug
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
# create formatter
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
# add formatter to ch
ch.setFormatter(formatter)
# add ch to logger
logger.addHandler(ch)
# 'application' code
logger.debug('debug message')
logger.info('info message')
logger.warn('warn message')
logger.error('error message')
logger.critical('critical message')
Running this module from the command line produces the following output:
$ python simple_logging_module.py
2005-03-19 15:10:26,618 - simple_example - DEBUG - debug message
2005-03-19 15:10:26,620 - simple_example - INFO - info message
2005-03-19 15:10:26,695 - simple_example - WARNING - warn message
2005-03-19 15:10:26,697 - simple_example - ERROR - error message
2005-03-19 15:10:26,773 - simple_example - CRITICAL - critical message
The following Python module creates a logger, handler, and formatter nearly identical to those in the example listed above, with the only difference being the names of the objects:
import logging
import logging.config
logging.config.fileConfig('logging.conf')
# create logger
logger = logging.getLogger('simpleExample')
# 'application' code
logger.debug('debug message')
logger.info('info message')
logger.warn('warn message')
logger.error('error message')
logger.critical('critical message')
Here is the logging.conf file:
[loggers]
keys=root,simpleExample
[handlers]
keys=consoleHandler
[formatters]
keys=simpleFormatter
[logger_root]
level=DEBUG
handlers=consoleHandler
[logger_simpleExample]
level=DEBUG
handlers=consoleHandler
qualname=simpleExample
propagate=0
[handler_consoleHandler]
class=StreamHandler
level=DEBUG
formatter=simpleFormatter
args=(sys.stdout,)
[formatter_simpleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
datefmt=
The output is nearly identical to that of the non-config-file-based example:
$ python simple_logging_config.py
2005-03-19 15:38:55,977 - simpleExample - DEBUG - debug message
2005-03-19 15:38:55,979 - simpleExample - INFO - info message
2005-03-19 15:38:56,054 - simpleExample - WARNING - warn message
2005-03-19 15:38:56,055 - simpleExample - ERROR - error message
2005-03-19 15:38:56,130 - simpleExample - CRITICAL - critical message
You can see that the config file approach has a few advantages over the Python code approach, mainly separation of configuration and code and the ability of noncoders to easily modify the logging properties.
Note that the class names referenced in config files need to be either relative
to the logging module, or absolute values which can be resolved using normal
import mechanisms. Thus, you could use either
WatchedFileHandler
(relative to the logging module) or
mypackage.mymodule.MyHandler
(for a class defined in package mypackage
and module mymodule
, where mypackage
is available on the Python import
path).
In Python 2.7, a new means of configuring logging has been introduced, using dictionaries to hold configuration information. This provides a superset of the functionality of the config-file-based approach outlined above, and is the recommended configuration method for new applications and deployments. Because a Python dictionary is used to hold configuration information, and since you can populate that dictionary using different means, you have more options for configuration. For example, you can use a configuration file in JSON format, or, if you have access to YAML processing functionality, a file in YAML format, to populate the configuration dictionary. Or, of course, you can construct the dictionary in Python code, receive it in pickled form over a socket, or use whatever approach makes sense for your application.
Here’s an example of the same configuration as above, in YAML format for the new dictionary-based approach:
version: 1
formatters:
simple:
format: '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
handlers:
console:
class: logging.StreamHandler
level: DEBUG
formatter: simple
stream: ext://sys.stdout
loggers:
simpleExample:
level: DEBUG
handlers: [console]
propagate: no
root:
level: DEBUG
handlers: [console]
For more information about logging using a dictionary, see logging-config-api.
What happens if no configuration is provided¶
If no logging configuration is provided, it is possible to have a situation where a logging event needs to be output, but no handlers can be found to output the event. The behaviour of the logging package in these circumstances is dependent on the Python version.
For Python 2.x, the behaviour is as follows:
- If logging.raiseExceptions is False (production mode), the event is silently dropped.
- If logging.raiseExceptions is True (development mode), a message ‘No handlers could be found for logger X.Y.Z’ is printed once.
Configuring Logging for a Library¶
When developing a library which uses logging, you should take care to
document how the library uses logging - for example, the names of loggers
used. Some consideration also needs to be given to its logging configuration.
If the using application does not use logging, and library code makes logging
calls, then (as described in the previous section) events of severity
WARNING
and greater will be printed to sys.stderr
. This is regarded as
the best default behaviour.
If for some reason you don't want these messages printed in the absence of any logging configuration, you can attach a do-nothing handler to the top-level logger for your library. This avoids the message being printed, since a handler will always be found for the library's events: it just doesn't produce any output. If the library user configures logging for application use, presumably that configuration will add some handlers, and if levels are suitably configured then logging calls made in library code will send output to those handlers, as normal.
A do-nothing handler is included in the logging package:
NullHandler
(since Python 2.7). An instance of this handler
could be added to the top-level logger of the logging namespace used by the
library (if you want to prevent your library’s logged events being output to
sys.stderr
in the absence of logging configuration). If all logging by a
library foo is done using loggers with names matching ‘foo.x’, ‘foo.x.y’,
etc. then the code:
import logging
logging.getLogger('foo').addHandler(logging.NullHandler())
should have the desired effect. If an organisation produces a number of libraries, then the logger name specified can be ‘orgname.foo’ rather than just ‘foo’.
PLEASE NOTE: It is strongly advised that you do not add any handlers other
than NullHandler
to your library’s loggers. This is
because the configuration of handlers is the prerogative of the application
developer who uses your library. The application developer knows their target
audience and what handlers are most appropriate for their application: if you
add handlers ‘under the hood’, you might well interfere with their ability to
carry out unit tests and deliver logs which suit their requirements.
Logging Levels¶
The numeric values of logging levels are given in the following table. These are primarily of interest if you want to define your own levels, and need them to have specific values relative to the predefined levels. If you define a level with the same numeric value, it overwrites the predefined value; the predefined name is lost.
Level | Numeric value |
---|---|
CRITICAL | 50 |
ERROR | 40 |
WARNING | 30 |
INFO | 20 |
DEBUG | 10 |
NOTSET | 0 |
Levels can also be associated with loggers, being set either by the developer or through loading a saved logging configuration. When a logging method is called on a logger, the logger compares its own level with the level associated with the method call. If the logger’s level is higher than the method call’s, no logging message is actually generated. This is the basic mechanism controlling the verbosity of logging output.
Logging messages are encoded as instances of the LogRecord
class. When a logger decides to actually log an event, a
LogRecord
instance is created from the logging message.
Logging messages are subjected to a dispatch mechanism through the use of
handlers, which are instances of subclasses of the Handler
class. Handlers are responsible for ensuring that a logged message (in the form
of a LogRecord
) ends up in a particular location (or set of locations)
which is useful for the target audience for that message (such as end users,
support desk staff, system administrators, developers). Handlers are passed
LogRecord
instances intended for particular destinations. Each logger
can have zero, one or more handlers associated with it (via the
addHandler()
method of Logger
). In addition to any
handlers directly associated with a logger, all handlers associated with all
ancestors of the logger are called to dispatch the message (unless the
propagate flag for a logger is set to a false value, at which point the
passing to ancestor handlers stops).
Just as for loggers, handlers can have levels associated with them. A handler’s
level acts as a filter in the same way as a logger’s level does. If a handler
decides to actually dispatch an event, the emit()
method is used
to send the message to its destination. Most user-defined subclasses of
Handler
will need to override this emit()
.
Custom Levels¶
Defining your own levels is possible, but should not be necessary, as the existing levels have been chosen on the basis of practical experience. However, if you are convinced that you need custom levels, great care should be exercised when doing this, and it is possibly a very bad idea to define custom levels if you are developing a library. That’s because if multiple library authors all define their own custom levels, there is a chance that the logging output from such multiple libraries used together will be difficult for the using developer to control and/or interpret, because a given numeric value might mean different things for different libraries.
Useful Handlers¶
In addition to the base Handler
class, many useful subclasses are
provided:
- StreamHandler instances send messages to streams (file-like objects).
- FileHandler instances send messages to disk files.
- BaseRotatingHandler is the base class for handlers that rotate log files at a certain point. It is not meant to be instantiated directly. Instead, use RotatingFileHandler or TimedRotatingFileHandler.
- RotatingFileHandler instances send messages to disk files, with support for maximum log file sizes and log file rotation.
- TimedRotatingFileHandler instances send messages to disk files, rotating the log file at certain timed intervals.
- SocketHandler instances send messages to TCP/IP sockets.
- DatagramHandler instances send messages to UDP sockets.
- SMTPHandler instances send messages to a designated email address.
- SysLogHandler instances send messages to a Unix syslog daemon, possibly on a remote machine.
- NTEventLogHandler instances send messages to a Windows NT/2000/XP event log.
- MemoryHandler instances send messages to a buffer in memory, which is flushed whenever specific criteria are met.
- HTTPHandler instances send messages to an HTTP server using either GET or POST semantics.
- WatchedFileHandler instances watch the file they are logging to. If the file changes, it is closed and reopened using the file name. This handler is only useful on Unix-like systems; Windows does not support the underlying mechanism used.
- NullHandler instances do nothing with error messages. They are used by library developers who want to use logging, but want to avoid the 'No handlers could be found for logger XXX' message which can be displayed if the library user has not configured logging. See Configuring Logging for a Library for more information.
New in version 2.7: The NullHandler
class.
The NullHandler
, StreamHandler
and FileHandler
classes are defined in the core logging package. The other handlers are
defined in a sub- module, logging.handlers
. (There is also another
sub-module, logging.config
, for configuration functionality.)
Logged messages are formatted for presentation through instances of the
Formatter
class. They are initialized with a format string suitable for
use with the % operator and a dictionary.
For formatting multiple messages in a batch, instances of
BufferingFormatter
can be used. In addition to the format string (which
is applied to each message in the batch), there is provision for header and
trailer format strings.
When filtering based on logger level and/or handler level is not enough,
instances of Filter
can be added to both Logger
and
Handler
instances (through their addFilter()
method). Before
deciding to process a message further, both loggers and handlers consult all
their filters for permission. If any filter returns a false value, the message
is not processed further.
The basic Filter
functionality allows filtering by specific logger
name. If this feature is used, messages sent to the named logger and its
children are allowed through the filter, and all others dropped.
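A small sketch of name-based filtering; the logger name 'app.net' is invented for illustration:

import logging

handler = logging.StreamHandler()
# Only records from the 'app.net' logger and its children (such as
# 'app.net.http') pass this handler's filter; all others are dropped.
handler.addFilter(logging.Filter('app.net'))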
Exceptions raised during logging¶
The logging package is designed to swallow exceptions which occur while logging in production. This is so that errors which occur while handling logging events - such as logging misconfiguration, network or other similar errors - do not cause the application using logging to terminate prematurely.
SystemExit
and KeyboardInterrupt
exceptions are never
swallowed. Other exceptions which occur during the emit()
method of a
Handler
subclass are passed to its handleError()
method.
The default implementation of handleError()
in Handler
checks
to see if a module-level variable, raiseExceptions
, is set. If set, a
traceback is printed to sys.stderr
. If not set, the exception is swallowed.
Note: The default value of raiseExceptions
is True
. This is because
during development, you typically want to be notified of any exceptions that
occur. It’s advised that you set raiseExceptions
to False
for production
usage.
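For example, a production deployment might flip the flag at startup:

import logging

logging.raiseExceptions = False   # swallow errors raised while handling log events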
Using arbitrary objects as messages¶
In the preceding sections and examples, it has been assumed that the message
passed when logging the event is a string. However, this is not the only
possibility. You can pass an arbitrary object as a message, and its
__str__()
method will be called when the logging system needs to convert
it to a string representation. In fact, if you want to, you can avoid
computing a string representation altogether - for example, the
SocketHandler
emits an event by pickling it and sending it over the
wire.
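Here's a minimal sketch with an invented Point class; its __str__() is only called when the record is actually formatted:

import logging

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __str__(self):
        return 'Point(%r, %r)' % (self.x, self.y)

logging.warning(Point(1, 2))   # prints: WARNING:root:Point(1, 2)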
Optimization¶
Formatting of message arguments is deferred until it cannot be avoided.
However, computing the arguments passed to the logging method can also be
expensive, and you may want to avoid doing it if the logger will just throw
away your event. To decide what to do, you can call the isEnabledFor()
method which takes a level argument and returns true if the event would be
created by the Logger for that level of call. You can write code like this:
if logger.isEnabledFor(logging.DEBUG):
    logger.debug('Message with %s, %s', expensive_func1(),
                 expensive_func2())
so that if the logger’s threshold is set above DEBUG
, the calls to
expensive_func1()
and expensive_func2()
are never made.
There are other optimizations which can be made for specific applications which need more precise control over what logging information is collected. Here’s a list of things you can do to avoid processing during logging which you don’t need:
What you don’t want to collect | How to avoid collecting it |
---|---|
Information about where calls were made from. | Set logging._srcfile to None . |
Threading information. | Set logging.logThreads to 0 . |
Process information. | Set logging.logProcesses to 0 . |
Also note that the core logging module only includes the basic handlers. If
you don’t import logging.handlers
and logging.config
, they won’t
take up any memory.
See also
- Module logging: API reference for the logging module.
- Module logging.config: Configuration API for the logging module.
- Module logging.handlers: Useful handlers included with the logging module.
Logging Cookbook¶
Author: | Vinay Sajip <vinay_sajip at red-dove dot com> |
---|
This page contains a number of recipes related to logging, which have been found useful in the past.
Using logging in multiple modules¶
Multiple calls to logging.getLogger('someLogger')
return a reference to the
same logger object. This is true not only within the same module, but also
across modules as long as it is in the same Python interpreter process. It is
true for references to the same object; additionally, application code can
define and configure a parent logger in one module and create (but not
configure) a child logger in a separate module, and all logger calls to the
child will pass up to the parent. Here is a main module:
import logging
import auxiliary_module
# create logger with 'spam_application'
logger = logging.getLogger('spam_application')
logger.setLevel(logging.DEBUG)
# create file handler which logs even debug messages
fh = logging.FileHandler('spam.log')
fh.setLevel(logging.DEBUG)
# create console handler with a higher log level
ch = logging.StreamHandler()
ch.setLevel(logging.ERROR)
# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
ch.setFormatter(formatter)
# add the handlers to the logger
logger.addHandler(fh)
logger.addHandler(ch)
logger.info('creating an instance of auxiliary_module.Auxiliary')
a = auxiliary_module.Auxiliary()
logger.info('created an instance of auxiliary_module.Auxiliary')
logger.info('calling auxiliary_module.Auxiliary.do_something')
a.do_something()
logger.info('finished auxiliary_module.Auxiliary.do_something')
logger.info('calling auxiliary_module.some_function()')
auxiliary_module.some_function()
logger.info('done with auxiliary_module.some_function()')
Here is the auxiliary module:
import logging
# create logger
module_logger = logging.getLogger('spam_application.auxiliary')
class Auxiliary:
    def __init__(self):
        self.logger = logging.getLogger('spam_application.auxiliary.Auxiliary')
        self.logger.info('creating an instance of Auxiliary')

    def do_something(self):
        self.logger.info('doing something')
        a = 1 + 1
        self.logger.info('done doing something')

def some_function():
    module_logger.info('received a call to "some_function"')
The output looks like this:
2005-03-23 23:47:11,663 - spam_application - INFO -
creating an instance of auxiliary_module.Auxiliary
2005-03-23 23:47:11,665 - spam_application.auxiliary.Auxiliary - INFO -
creating an instance of Auxiliary
2005-03-23 23:47:11,665 - spam_application - INFO -
created an instance of auxiliary_module.Auxiliary
2005-03-23 23:47:11,668 - spam_application - INFO -
calling auxiliary_module.Auxiliary.do_something
2005-03-23 23:47:11,668 - spam_application.auxiliary.Auxiliary - INFO -
doing something
2005-03-23 23:47:11,669 - spam_application.auxiliary.Auxiliary - INFO -
done doing something
2005-03-23 23:47:11,670 - spam_application - INFO -
finished auxiliary_module.Auxiliary.do_something
2005-03-23 23:47:11,671 - spam_application - INFO -
calling auxiliary_module.some_function()
2005-03-23 23:47:11,672 - spam_application.auxiliary - INFO -
received a call to 'some_function'
2005-03-23 23:47:11,673 - spam_application - INFO -
done with auxiliary_module.some_function()
Multiple handlers and formatters¶
Loggers are plain Python objects. The addHandler() method has no minimum or
maximum quota for the number of handlers you may add. Sometimes it will be
beneficial for an application to log all messages of all severities to a text
file while simultaneously logging errors or above to the console. To set this
up, simply configure the appropriate handlers. The logging calls in the
application code will remain unchanged. Here is a slight modification to the
previous simple module-based configuration example:
import logging
logger = logging.getLogger('simple_example')
logger.setLevel(logging.DEBUG)
# create file handler which logs even debug messages
fh = logging.FileHandler('spam.log')
fh.setLevel(logging.DEBUG)
# create console handler with a higher log level
ch = logging.StreamHandler()
ch.setLevel(logging.ERROR)
# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
fh.setFormatter(formatter)
# add the handlers to logger
logger.addHandler(ch)
logger.addHandler(fh)
# 'application' code
logger.debug('debug message')
logger.info('info message')
logger.warning('warn message')
logger.error('error message')
logger.critical('critical message')
Notice that the ‘application’ code does not care about multiple handlers. All that changed was the addition and configuration of a new handler named fh.
The ability to create new handlers with higher- or lower-severity filters can be
very helpful when writing and testing an application. Instead of using many
print statements for debugging, use logger.debug: unlike the print statements,
which you will have to delete or comment out later, the logger.debug statements
can remain intact in the source code and stay dormant until you need them
again. At that point, the only change needed is to modify the severity level of
the logger and/or handler to DEBUG.
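Continuing the example above, that one change is just a couple of lines (assuming the logger and ch handler configured earlier):
# Re-activate the dormant debug messages by lowering the thresholds;
# no application code has to change.
logger.setLevel(logging.DEBUG)
ch.setLevel(logging.DEBUG)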
Logging to multiple destinations¶
Let’s say you want to log to console and file with different message formats and in differing circumstances. Say you want to log messages with levels of DEBUG and higher to file, and those messages at level INFO and higher to the console. Let’s also assume that the file should contain timestamps, but the console messages should not. Here’s how you can achieve this:
import logging
# set up logging to file - see previous section for more details
logging.basicConfig(level=logging.DEBUG,
format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
datefmt='%m-%d %H:%M',
filename='/temp/myapp.log',
filemode='w')
# define a Handler which writes INFO messages or higher to the sys.stderr
console = logging.StreamHandler()
console.setLevel(logging.INFO)
# set a format which is simpler for console use
formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
# tell the handler to use this format
console.setFormatter(formatter)
# add the handler to the root logger
logging.getLogger('').addHandler(console)
# Now, we can log to the root logger, or any other logger. First the root...
logging.info('Jackdaws love my big sphinx of quartz.')
# Now, define a couple of other loggers which might represent areas in your
# application:
logger1 = logging.getLogger('myapp.area1')
logger2 = logging.getLogger('myapp.area2')
logger1.debug('Quick zephyrs blow, vexing daft Jim.')
logger1.info('How quickly daft jumping zebras vex.')
logger2.warning('Jail zesty vixen who grabbed pay from quack.')
logger2.error('The five boxing wizards jump quickly.')
When you run this, on the console you will see
root : INFO Jackdaws love my big sphinx of quartz.
myapp.area1 : INFO How quickly daft jumping zebras vex.
myapp.area2 : WARNING Jail zesty vixen who grabbed pay from quack.
myapp.area2 : ERROR The five boxing wizards jump quickly.
and in the file you will see something like
10-22 22:19 root INFO Jackdaws love my big sphinx of quartz.
10-22 22:19 myapp.area1 DEBUG Quick zephyrs blow, vexing daft Jim.
10-22 22:19 myapp.area1 INFO How quickly daft jumping zebras vex.
10-22 22:19 myapp.area2 WARNING Jail zesty vixen who grabbed pay from quack.
10-22 22:19 myapp.area2 ERROR The five boxing wizards jump quickly.
As you can see, the DEBUG message only shows up in the file. The other messages are sent to both destinations.
This example uses console and file handlers, but you can use any number and combination of handlers you choose.
Configuration server example¶
Here is an example of a module using the logging configuration server:
import logging
import logging.config
import time
import os
# read initial config file
logging.config.fileConfig('logging.conf')
# create and start listener on port 9999
t = logging.config.listen(9999)
t.start()
logger = logging.getLogger('simpleExample')
try:
# loop through logging calls to see the difference
# new configurations make, until Ctrl+C is pressed
while True:
logger.debug('debug message')
logger.info('info message')
        logger.warning('warn message')
logger.error('error message')
logger.critical('critical message')
time.sleep(5)
except KeyboardInterrupt:
# cleanup
logging.config.stopListening()
t.join()
And here is a script that takes a filename and sends that file to the server, properly preceded with the binary-encoded length, as the new logging configuration:
#!/usr/bin/env python
import socket, sys, struct
with open(sys.argv[1], 'rb') as f:
data_to_send = f.read()
HOST = 'localhost'
PORT = 9999
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print('connecting...')
s.connect((HOST, PORT))
print('sending config...')
s.send(struct.pack('>L', len(data_to_send)))
s.send(data_to_send)
s.close()
print('complete')
Sending and receiving logging events across a network¶
Let's say you want to send logging events across a network, and handle them at
the receiving end. A simple way of doing this is attaching a SocketHandler
instance to the root logger at the sending end:
import logging, logging.handlers
rootLogger = logging.getLogger('')
rootLogger.setLevel(logging.DEBUG)
socketHandler = logging.handlers.SocketHandler('localhost',
logging.handlers.DEFAULT_TCP_LOGGING_PORT)
# don't bother with a formatter, since a socket handler sends the event as
# an unformatted pickle
rootLogger.addHandler(socketHandler)
# Now, we can log to the root logger, or any other logger. First the root...
logging.info('Jackdaws love my big sphinx of quartz.')
# Now, define a couple of other loggers which might represent areas in your
# application:
logger1 = logging.getLogger('myapp.area1')
logger2 = logging.getLogger('myapp.area2')
logger1.debug('Quick zephyrs blow, vexing daft Jim.')
logger1.info('How quickly daft jumping zebras vex.')
logger2.warning('Jail zesty vixen who grabbed pay from quack.')
logger2.error('The five boxing wizards jump quickly.')
At the receiving end, you can set up a receiver using the socketserver
module. Here is a basic working example:
import pickle
import logging
import logging.handlers
import socketserver
import struct
class LogRecordStreamHandler(socketserver.StreamRequestHandler):
"""Handler for a streaming logging request.
This basically logs the record using whatever logging policy is
configured locally.
"""
def handle(self):
"""
Handle multiple requests - each expected to be a 4-byte length,
followed by the LogRecord in pickle format. Logs the record
according to whatever policy is configured locally.
"""
while True:
chunk = self.connection.recv(4)
if len(chunk) < 4:
break
slen = struct.unpack('>L', chunk)[0]
chunk = self.connection.recv(slen)
while len(chunk) < slen:
chunk = chunk + self.connection.recv(slen - len(chunk))
obj = self.unPickle(chunk)
record = logging.makeLogRecord(obj)
self.handleLogRecord(record)
def unPickle(self, data):
return pickle.loads(data)
def handleLogRecord(self, record):
# if a name is specified, we use the named logger rather than the one
# implied by the record.
if self.server.logname is not None:
name = self.server.logname
else:
name = record.name
logger = logging.getLogger(name)
# N.B. EVERY record gets logged. This is because Logger.handle
# is normally called AFTER logger-level filtering. If you want
# to do filtering, do it at the client end to save wasting
# cycles and network bandwidth!
logger.handle(record)
class LogRecordSocketReceiver(socketserver.ThreadingTCPServer):
"""
Simple TCP socket-based logging receiver suitable for testing.
"""
allow_reuse_address = 1
def __init__(self, host='localhost',
port=logging.handlers.DEFAULT_TCP_LOGGING_PORT,
handler=LogRecordStreamHandler):
socketserver.ThreadingTCPServer.__init__(self, (host, port), handler)
self.abort = 0
self.timeout = 1
self.logname = None
def serve_until_stopped(self):
import select
abort = 0
while not abort:
rd, wr, ex = select.select([self.socket.fileno()],
[], [],
self.timeout)
if rd:
self.handle_request()
abort = self.abort
def main():
logging.basicConfig(
format='%(relativeCreated)5d %(name)-15s %(levelname)-8s %(message)s')
tcpserver = LogRecordSocketReceiver()
print('About to start TCP server...')
tcpserver.serve_until_stopped()
if __name__ == '__main__':
main()
First run the server, and then the client. On the client side, nothing is printed on the console; on the server side, you should see something like:
About to start TCP server...
59 root INFO Jackdaws love my big sphinx of quartz.
59 myapp.area1 DEBUG Quick zephyrs blow, vexing daft Jim.
69 myapp.area1 INFO How quickly daft jumping zebras vex.
69 myapp.area2 WARNING Jail zesty vixen who grabbed pay from quack.
69 myapp.area2 ERROR The five boxing wizards jump quickly.
Note that there are some security issues with pickle in some scenarios. If
these affect you, you can use an alternative serialization scheme by overriding
the makePickle() method and implementing your alternative there, as well as
adapting the above script to use your alternative serialization.
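As an illustrative sketch only (not part of the stdlib), a handler substituting JSON for pickle might look like this; the class name JsonSocketHandler is invented, and the receiving script would need a matching change to decode JSON instead of unpickling:
import json
import struct
import logging.handlers

class JsonSocketHandler(logging.handlers.SocketHandler):
    def makePickle(self, record):
        # Serialize the record's attribute dict as JSON rather than pickle,
        # keeping the same 4-byte big-endian length prefix as the default.
        data = json.dumps(record.__dict__, default=str).encode('utf-8')
        return struct.pack('>L', len(data)) + data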
Adding contextual information to your logging output¶
Sometimes you want logging output to contain contextual information in
addition to the parameters passed to the logging call. For example, in a
networked application, it may be desirable to include client-specific
information in the log (e.g. the remote client's username or IP address).
Although you could use the extra parameter to achieve this, it's not always
convenient to pass the information in this way. While it might be tempting to
create Logger instances on a per-connection basis, this is not a good idea,
because these instances are not garbage collected. While this is not a problem
in practice, when the number of Logger instances depends on the level of
granularity you want to use in logging an application, it could be hard to
manage if the number of Logger instances becomes effectively unbounded.
Using LoggerAdapters to impart contextual information¶
An easy way to pass contextual information to be output along with logging
event information is to use the LoggerAdapter class. This class is designed to
look like a Logger, so that you can call debug(), info(), warning(), error(),
exception(), critical() and log(). These methods have the same signatures as
their counterparts in Logger, so you can use the two types of instances
interchangeably.
When you create an instance of LoggerAdapter, you pass it a Logger instance
and a dict-like object which contains your contextual information. When you
call one of the logging methods on an instance of LoggerAdapter, it delegates
the call to the underlying instance of Logger passed to its constructor, and
arranges to pass the contextual information in the delegated call. Here's a
snippet from the code of LoggerAdapter:
def debug(self, msg, *args, **kwargs):
"""
Delegate a debug call to the underlying logger, after adding
contextual information from this adapter instance.
"""
msg, kwargs = self.process(msg, kwargs)
self.logger.debug(msg, *args, **kwargs)
The process() method of LoggerAdapter is where the contextual information is
added to the logging output. It's passed the message and keyword arguments of
the logging call, and it passes back (potentially) modified versions of these
to use in the call to the underlying logger. The default implementation of
this method leaves the message alone, but inserts an 'extra' key in the
keyword arguments whose value is the dict-like object passed to the
constructor. Of course, if you pass an 'extra' keyword argument in the call to
the adapter, it will be silently overwritten.
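If that default isn't what you want, process() can be overridden; here is a minimal sketch of an adapter that puts the context into the message itself rather than into 'extra'. The class name CtxAdapter and the conn_id key are invented for illustration, and a plain dict is assumed as the context:
import logging

class CtxAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        # Prepend the connection id from the adapter's context to the message.
        return '[%s] %s' % (self.extra.get('conn_id', '?'), msg), kwargs

adapter = CtxAdapter(logging.getLogger(__name__), {'conn_id': 'c42'})
adapter.info('connection opened')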
The advantage of using 'extra' is that the values in the dict-like object are
merged into the LogRecord instance's __dict__, allowing you to use customized
format strings with your Formatter instances which know about the keys of the
dict-like object. If you need a different method, e.g. if you want to prepend
or append the contextual information to the message string, you just need to
subclass LoggerAdapter and override process() to do what you need, as in the
sketch above. Here's an example script which uses this class, and which also
illustrates what dict-like behaviour is needed from an arbitrary 'dict-like'
object for use in the constructor:
import logging
class ConnInfo:
"""
An example class which shows how an arbitrary class can be used as
the 'extra' context information repository passed to a LoggerAdapter.
"""
def __getitem__(self, name):
"""
To allow this instance to look like a dict.
"""
from random import choice
if name == 'ip':
result = choice(['127.0.0.1', '192.168.0.1'])
elif name == 'user':
result = choice(['jim', 'fred', 'sheila'])
else:
result = self.__dict__.get(name, '?')
return result
def __iter__(self):
"""
To allow iteration over keys, which will be merged into
the LogRecord dict before formatting and output.
"""
keys = ['ip', 'user']
keys.extend(self.__dict__.keys())
return keys.__iter__()
if __name__ == '__main__':
from random import choice
levels = (logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL)
a1 = logging.LoggerAdapter(logging.getLogger('a.b.c'),
{ 'ip' : '123.231.231.123', 'user' : 'sheila' })
logging.basicConfig(level=logging.DEBUG,
format='%(asctime)-15s %(name)-5s %(levelname)-8s IP: %(ip)-15s User: %(user)-8s %(message)s')
a1.debug('A debug message')
a1.info('An info message with %s', 'some parameters')
a2 = logging.LoggerAdapter(logging.getLogger('d.e.f'), ConnInfo())
for x in range(10):
lvl = choice(levels)
lvlname = logging.getLevelName(lvl)
a2.log(lvl, 'A message at %s level with %d %s', lvlname, 2, 'parameters')
When this script is run, the output should look something like this:
2008-01-18 14:49:54,023 a.b.c DEBUG IP: 123.231.231.123 User: sheila A debug message
2008-01-18 14:49:54,023 a.b.c INFO IP: 123.231.231.123 User: sheila An info message with some parameters
2008-01-18 14:49:54,023 d.e.f CRITICAL IP: 192.168.0.1 User: jim A message at CRITICAL level with 2 parameters
2008-01-18 14:49:54,033 d.e.f INFO IP: 192.168.0.1 User: jim A message at INFO level with 2 parameters
2008-01-18 14:49:54,033 d.e.f WARNING IP: 192.168.0.1 User: sheila A message at WARNING level with 2 parameters
2008-01-18 14:49:54,033 d.e.f ERROR IP: 127.0.0.1 User: fred A message at ERROR level with 2 parameters
2008-01-18 14:49:54,033 d.e.f ERROR IP: 127.0.0.1 User: sheila A message at ERROR level with 2 parameters
2008-01-18 14:49:54,033 d.e.f WARNING IP: 192.168.0.1 User: sheila A message at WARNING level with 2 parameters
2008-01-18 14:49:54,033 d.e.f WARNING IP: 192.168.0.1 User: jim A message at WARNING level with 2 parameters
2008-01-18 14:49:54,033 d.e.f INFO IP: 192.168.0.1 User: fred A message at INFO level with 2 parameters
2008-01-18 14:49:54,033 d.e.f WARNING IP: 192.168.0.1 User: sheila A message at WARNING level with 2 parameters
2008-01-18 14:49:54,033 d.e.f WARNING IP: 127.0.0.1 User: jim A message at WARNING level with 2 parameters
Using Filters to impart contextual information¶
You can also add contextual information to log output using a user-defined
Filter. Filter instances are allowed to modify the LogRecords passed to them,
including adding additional attributes which can then be output using a
suitable format string, or if needed a custom Formatter. For example, in a web
application, the request being processed (or at least, the interesting parts
of it) can be stored in a thread-local (threading.local) variable, and then
accessed from a Filter to add information from the request (say, the remote IP
address and the remote user's username) to the LogRecord, using the attribute
names 'ip' and 'user' as in the LoggerAdapter example above. In that case, the
same format string can be used to get similar output to that shown above.
Here's an example script:
import logging
from random import choice
class ContextFilter(logging.Filter):
"""
This is a filter which injects contextual information into the log.
Rather than use actual contextual information, we just use random
data in this demo.
"""
USERS = ['jim', 'fred', 'sheila']
IPS = ['123.231.231.123', '127.0.0.1', '192.168.0.1']
def filter(self, record):
record.ip = choice(ContextFilter.IPS)
record.user = choice(ContextFilter.USERS)
return True
if __name__ == '__main__':
levels = (logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL)
logging.basicConfig(level=logging.DEBUG,
format='%(asctime)-15s %(name)-5s %(levelname)-8s IP: %(ip)-15s User: %(user)-8s %(message)s')
a1 = logging.getLogger('a.b.c')
a2 = logging.getLogger('d.e.f')
f = ContextFilter()
a1.addFilter(f)
a2.addFilter(f)
a1.debug('A debug message')
a1.info('An info message with %s', 'some parameters')
for x in range(10):
lvl = choice(levels)
lvlname = logging.getLevelName(lvl)
a2.log(lvl, 'A message at %s level with %d %s', lvlname, 2, 'parameters')
which, when run, produces something like:
2010-09-06 22:38:15,292 a.b.c DEBUG IP: 123.231.231.123 User: fred A debug message
2010-09-06 22:38:15,300 a.b.c INFO IP: 192.168.0.1 User: sheila An info message with some parameters
2010-09-06 22:38:15,300 d.e.f CRITICAL IP: 127.0.0.1 User: sheila A message at CRITICAL level with 2 parameters
2010-09-06 22:38:15,300 d.e.f ERROR IP: 127.0.0.1 User: jim A message at ERROR level with 2 parameters
2010-09-06 22:38:15,300 d.e.f DEBUG IP: 127.0.0.1 User: sheila A message at DEBUG level with 2 parameters
2010-09-06 22:38:15,300 d.e.f ERROR IP: 123.231.231.123 User: fred A message at ERROR level with 2 parameters
2010-09-06 22:38:15,300 d.e.f CRITICAL IP: 192.168.0.1 User: jim A message at CRITICAL level with 2 parameters
2010-09-06 22:38:15,300 d.e.f CRITICAL IP: 127.0.0.1 User: sheila A message at CRITICAL level with 2 parameters
2010-09-06 22:38:15,300 d.e.f DEBUG IP: 192.168.0.1 User: jim A message at DEBUG level with 2 parameters
2010-09-06 22:38:15,301 d.e.f ERROR IP: 127.0.0.1 User: sheila A message at ERROR level with 2 parameters
2010-09-06 22:38:15,301 d.e.f DEBUG IP: 123.231.231.123 User: fred A message at DEBUG level with 2 parameters
2010-09-06 22:38:15,301 d.e.f INFO IP: 123.231.231.123 User: fred A message at INFO level with 2 parameters
Logging to a single file from multiple processes¶
Although logging is thread-safe, and logging to a single file from multiple
threads in a single process is supported, logging to a single file from
multiple processes is not supported, because there is no standard way to
serialize access to a single file across multiple processes in Python. If you
need to log to a single file from multiple processes, one way of doing this is
to have all the processes log to a SocketHandler, and have a separate process
which implements a socket server which reads from the socket and logs to file.
(If you prefer, you can dedicate one thread in one of the existing processes
to perform this function.) The previous section documents this approach in
more detail and includes a working socket receiver which can be used as a
starting point for you to adapt in your own applications.
If you are using a recent version of Python which includes the multiprocessing
module, you could write your own handler which uses the Lock class from this
module to serialize access to the file from your processes. The existing
FileHandler and subclasses do not make use of multiprocessing at present,
though they may do so in the future. Note that at present, the multiprocessing
module does not provide working lock functionality on all platforms (see
http://bugs.python.org/issue3770).
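A minimal sketch of such a handler follows (the class name LockingFileHandler is invented); the important caveat is that the lock must be created before the worker processes are spawned, so that they all share the same lock object:
import logging
import multiprocessing

class LockingFileHandler(logging.FileHandler):
    """Sketch of a file handler which serializes emit() across processes."""
    def __init__(self, filename, mode='a'):
        logging.FileHandler.__init__(self, filename, mode)
        # This lock must be inherited by child processes (e.g. via fork)
        # for the cross-process serialization to actually work.
        self.mp_lock = multiprocessing.Lock()

    def emit(self, record):
        with self.mp_lock:
            logging.FileHandler.emit(self, record)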
Using file rotation¶
Sometimes you want to let a log file grow to a certain size, then open a new
file and log to that. You may want to keep a certain number of these files, and
when that many files have been created, rotate the files so that the number of
files and the size of the files both remain bounded. For this usage pattern, the
logging package provides a RotatingFileHandler:
import glob
import logging
import logging.handlers
LOG_FILENAME = 'logging_rotatingfile_example.out'
# Set up a specific logger with our desired output level
my_logger = logging.getLogger('MyLogger')
my_logger.setLevel(logging.DEBUG)
# Add the log message handler to the logger
handler = logging.handlers.RotatingFileHandler(
LOG_FILENAME, maxBytes=20, backupCount=5)
my_logger.addHandler(handler)
# Log some messages
for i in range(20):
    my_logger.debug('i = %d', i)
# See what files are created
logfiles = glob.glob('%s*' % LOG_FILENAME)
for filename in logfiles:
print(filename)
The result should be 6 separate files, each with part of the log history for the application:
logging_rotatingfile_example.out
logging_rotatingfile_example.out.1
logging_rotatingfile_example.out.2
logging_rotatingfile_example.out.3
logging_rotatingfile_example.out.4
logging_rotatingfile_example.out.5
The most current file is always logging_rotatingfile_example.out, and each
time it reaches the size limit it is renamed with the suffix .1. Each of the
existing backup files is renamed to increment the suffix (.1 becomes .2,
etc.) and the .6 file is erased.
Obviously this example sets the log size much too small, as an extreme example. In practice, you would set maxBytes to an appropriate value.
An example dictionary-based configuration¶
Below is an example of a logging configuration dictionary; it's taken from the
documentation of the Django project. This dictionary is passed to dictConfig()
to put the configuration into effect:
LOGGING = {
'version': 1,
'disable_existing_loggers': True,
'formatters': {
'verbose': {
'format': '%(levelname)s %(asctime)s %(module)s %(process)d %(thread)d %(message)s'
},
'simple': {
'format': '%(levelname)s %(message)s'
},
},
'filters': {
'special': {
'()': 'project.logging.SpecialFilter',
'foo': 'bar',
}
},
'handlers': {
'null': {
'level':'DEBUG',
'class':'django.utils.log.NullHandler',
},
'console':{
'level':'DEBUG',
'class':'logging.StreamHandler',
'formatter': 'simple'
},
'mail_admins': {
'level': 'ERROR',
'class': 'django.utils.log.AdminEmailHandler',
'filters': ['special']
}
},
'loggers': {
'django': {
'handlers':['null'],
'propagate': True,
'level':'INFO',
},
'django.request': {
'handlers': ['mail_admins'],
'level': 'ERROR',
'propagate': False,
},
'myproject.custom': {
'handlers': ['console', 'mail_admins'],
'level': 'INFO',
'filters': ['special']
}
}
}
For more information about this configuration, you can see the relevant section of the Django documentation.
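Putting the dictionary into effect is then a single call to dictConfig(); a minimal sketch, assuming the classes referenced in the dictionary (such as project.logging.SpecialFilter and the Django handlers) are importable:
import logging.config

# LOGGING is the dictionary defined above.
logging.config.dictConfig(LOGGING)
logging.getLogger('myproject.custom').info('Configured via dictConfig.')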
Regular Expression HOWTO¶
Author: | A.M. Kuchling <amk@amk.ca> |
---|
Abstract
This document is an introductory tutorial to using regular expressions in
Python with the re module. It provides a gentler introduction than the
corresponding section in the Library Reference.
Introduction¶
The re module was added in Python 1.5, and provides Perl-style regular
expression patterns. Earlier versions of Python came with the regex module,
which provided Emacs-style patterns. The regex module was removed completely
in Python 2.5.
Regular expressions (called REs, or regexes, or regex patterns) are essentially
a tiny, highly specialized programming language embedded inside Python and made
available through the re module. Using this little language, you specify the
rules for the set of possible strings that you want to match; this set might
contain English sentences, or e-mail addresses, or TeX commands, or anything
you like. You can then ask questions such as "Does this string match the
pattern?", or "Is there a match for the pattern anywhere in this string?". You
can also use REs to modify a string or to split it apart in various ways.
Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster. Optimization isn’t covered in this document, because it requires that you have a good understanding of the matching engine’s internals.
The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.
Simple Patterns¶
We’ll start by learning about the simplest possible regular expressions. Since regular expressions are used to operate on strings, we’ll begin with the most common task: matching characters.
For a detailed explanation of the computer science underlying regular expressions (deterministic and non-deterministic finite automata), you can refer to almost any textbook on writing compilers.
Matching Characters¶
Most letters and characters will simply match themselves. For example, the
regular expression test will match the string test exactly. (You can enable a
case-insensitive mode that would let this RE match Test or TEST as well; more
about this later.)
There are exceptions to this rule; some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning. Much of this document is devoted to discussing various metacharacters and what they do.
Here’s a complete list of the metacharacters; their meanings will be discussed in the rest of this HOWTO.
. ^ $ * + ? { } [ ] \ | ( )
The first metacharacters we'll look at are [ and ]. They're used for
specifying a character class, which is a set of characters that you wish to
match. Characters can be listed individually, or a range of characters can be
indicated by giving two characters and separating them by a '-'. For example,
[abc] will match any of the characters a, b, or c; this is the same as [a-c],
which uses a range to express the same set of characters. If you wanted to
match only lowercase letters, your RE would be [a-z].
Metacharacters are not active inside classes. For example, [akm$] will match
any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter,
but inside a character class it's stripped of its special nature.
You can match the characters not listed within the class by complementing the
set. This is indicated by including a '^' as the first character of the class;
'^' outside a character class will simply match the '^' character. For
example, [^5] will match any character except '5'.
Perhaps the most important metacharacter is the backslash, \. As in Python
string literals, the backslash can be followed by various characters to signal
various special sequences. It's also used to escape all the metacharacters so
you can still match them in patterns; for example, if you need to match a [
or \, you can precede them with a backslash to remove their special meaning:
\[ or \\.
Some of the special sequences beginning with '\' represent predefined sets of
characters that are often useful, such as the set of digits, the set of
letters, or the set of anything that isn't whitespace. The following
predefined special sequences are a subset of those available. The equivalent
classes are for byte string patterns. For a complete list of sequences and
expanded class definitions for Unicode string patterns, see the last part of
Regular Expression Syntax.
\d
- Matches any decimal digit; this is equivalent to the class [0-9].
\D
- Matches any non-digit character; this is equivalent to the class [^0-9].
\s
- Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S
- Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w
- Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W
- Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
These sequences can be included inside a character class. For example, [\s,.]
is a character class that will match any whitespace character, or ',' or '.'.
The final metacharacter in this section is '.' (the dot). It matches anything
except a newline character, and there's an alternate mode (re.DOTALL) where it
will match even a newline. '.' is often used where you want to match "any
character".
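A short interactive sketch of the difference:
>>> import re
>>> print(re.match('.', '\n'))              # '.' doesn't match a newline...
None
>>> re.match('.', '\n', re.DOTALL).group()  # ...unless re.DOTALL is set
'\n'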
Repeating Things¶
Being able to match varying sets of characters is the first thing regular expressions can do that isn’t already possible with the methods available on strings. However, if that was the only additional capability of regexes, they wouldn’t be much of an advance. Another capability is that you can specify that portions of the RE must be repeated a certain number of times.
The first metacharacter for repeating things that we'll look at is *. *
doesn't match the literal character *; instead, it specifies that the previous
character can be matched zero or more times, instead of exactly once. For
example, ca*t will match ct (0 a characters), cat (1 a), caaat (3 a
characters), and so forth. The RE engine has various internal limitations
stemming from the size of C's int type that will prevent it from matching over
2 billion a characters; you probably don't have enough memory to construct a
string that large, so you shouldn't run into that limit.
Repetitions such as * are greedy; when repeating a RE, the matching engine
will try to repeat it as many times as possible. If later portions of the
pattern don't match, the matching engine will then back up and try again with
fewer repetitions.
A step-by-step example will make this more obvious. Let's consider the
expression a[bcd]*b. This matches the letter 'a', zero or more letters from
the class [bcd], and finally ends with a 'b'. Now imagine matching this RE
against the string abcbd.
Step | Matched | Explanation |
---|---|---|
1 | a | The a in the RE matches. |
2 | abcbd | The engine matches [bcd]*, going as far as it can, which is to the end of the string. |
3 | Failure | The engine tries to match b, but the current position is at the end of the string, so it fails. |
4 | abcb | Back up, so that [bcd]* matches one less character. |
5 | Failure | Try b again, but the current position is at the last character, which is a 'd'. |
6 | abc | Back up again, so that [bcd]* is only matching bc. |
7 | abcb | Try b again. This time the character at the current position is 'b', so it succeeds. |
The end of the RE has now been reached, and it has matched abcb. This
demonstrates how the matching engine goes as far as it can at first, and if no
match is found it will then progressively back up and retry the rest of the RE
again and again. It will back up until it has tried zero matches for [bcd]*,
and if that subsequently fails, the engine will conclude that the string
doesn't match the RE at all.
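You can verify the table's outcome interactively (a quick sketch):
>>> import re
>>> re.match('a[bcd]*b', 'abcbd').group()
'abcb'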
Another repeating metacharacter is +, which matches one or more times. Pay
careful attention to the difference between * and +; * matches zero or more
times, so whatever's being repeated may not be present at all, while +
requires at least one occurrence. To use a similar example, ca+t will match
cat (1 a) and caaat (3 a's), but won't match ct.
There are two more repeating qualifiers. The question mark character, ?,
matches either once or zero times; you can think of it as marking something as
being optional. For example, home-?brew matches either homebrew or home-brew.
The most complicated repeated qualifier is {m,n}, where m and n are decimal
integers. This qualifier means there must be at least m repetitions, and at
most n. For example, a/{1,3}b will match a/b, a//b, and a///b. It won't match
ab, which has no slashes, or a////b, which has four.
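A brief interactive check (a sketch):
>>> import re
>>> re.match('a/{1,3}b', 'a//b').group()
'a//b'
>>> print(re.match('a/{1,3}b', 'a////b'))
None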
You can omit either m or n; in that case, a reasonable value is assumed for the missing value. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity — actually, the upper bound is the 2-billion limit mentioned earlier, but that might as well be infinity.
Readers of a reductionist bent may notice that the three other qualifiers can
all be expressed using this notation. {0,} is the same as *, {1,} is
equivalent to +, and {0,1} is the same as ?. It's better to use *, +, or ?
when you can, simply because they're shorter and easier to read.
Using Regular Expressions¶
Now that we've looked at some simple regular expressions, how do we actually
use them in Python? The re module provides an interface to the regular
expression engine, allowing you to compile REs into objects and then perform
matches with them.
Compiling Regular Expressions¶
Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.
>>> import re
>>> p = re.compile('ab*')
>>> print(p)
<_sre.SRE_Pattern object at 0x...>
re.compile() also accepts an optional flags argument, used to enable various
special features and syntax variations. We'll go over the available settings
later, but for now a single example will do:
>>> p = re.compile('ab*', re.IGNORECASE)
The RE is passed to re.compile() as a string. REs are handled as strings
because regular expressions aren't part of the core Python language, and no
special syntax was created for expressing them. (There are applications that
don't need REs at all, so there's no need to bloat the language specification
by including them.) Instead, the re module is simply a C extension module
included with Python, just like the socket or zlib modules.
Putting REs in strings keeps the Python language simpler, but has one disadvantage which is the topic of the next section.
The Backslash Plague¶
As stated earlier, regular expressions use the backslash character ('\') to
indicate special forms or to allow special characters to be used without
invoking their special meaning. This conflicts with Python's usage of the same
character for the same purpose in string literals.
Let's say you want to write a RE that matches the string \section, which
might be found in a LaTeX file. To figure out what to write in the program
code, start with the desired string to be matched. Next, you must escape any
backslashes and other metacharacters by preceding them with a backslash,
resulting in the string \\section; this is the string that must be passed to
re.compile(). However, to express this as a Python string literal, both
backslashes must be escaped again.
Characters | Stage |
---|---|
\section | Text string to be matched |
\\section | Escaped backslash for re.compile() |
"\\\\section" | Escaped backslashes for a string literal |
In short, to match a literal backslash, one has to write '\\\\' as the RE
string, because the regular expression must be \\, and each backslash must be
expressed as \\ inside a regular Python string literal. In REs that feature
backslashes repeatedly, this leads to lots of repeated backslashes and makes
the resulting strings difficult to understand.
The solution is to use Python's raw string notation for regular expressions;
backslashes are not handled in any special way in a string literal prefixed
with 'r', so r"\n" is a two-character string containing '\' and 'n', while
"\n" is a one-character string containing a newline. Regular expressions will
often be written in Python code using this raw string notation.
Regular String | Raw string |
---|---|
"ab*" | r"ab*" |
"\\\\section" | r"\\section" |
"\\w+\\s+\\1" | r"\w+\s+\1" |
Performing Matches¶
Once you have an object representing a compiled regular expression, what do
you do with it? Pattern objects have several methods and attributes. Only the
most significant ones will be covered here; consult the re docs for a complete
listing.
Method/Attribute | Purpose |
---|---|
match() | Determine if the RE matches at the beginning of the string. |
search() | Scan through a string, looking for any location where this RE matches. |
findall() | Find all substrings where the RE matches, and return them as a list. |
finditer() | Find all substrings where the RE matches, and return them as an iterator. |
match() and search() return None if no match can be found. If they're
successful, a MatchObject instance is returned, containing information about
the match: where it starts and ends, the substring it matched, and more.
You can learn about this by interactively experimenting with the re module.
If you have Tkinter available, you may also want to look at
Tools/scripts/redemo.py, a demonstration program included with the Python
distribution. It allows you to enter REs and strings, and displays whether the
RE matches or fails. redemo.py can be quite useful when trying to debug a
complicated RE. Phil Schwartz's Kodos is also an interactive tool for
developing and testing RE patterns.
This HOWTO uses the standard Python interpreter for its examples. First, run
the Python interpreter, import the re module, and compile a RE:
Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
<_sre.SRE_Pattern object at 0x...>
Now, you can try matching various strings against the RE [a-z]+. An empty
string shouldn't match at all, since + means 'one or more repetitions'.
match() should return None in this case, which will cause the interpreter to
print no output. You can explicitly print the result of match() to make this
clear.
>>> p.match("")
>>> print p.match("")
None
Now, let's try it on a string that it should match, such as tempo. In this
case, match() will return a MatchObject, so you should store the result in a
variable for later use.
>>> m = p.match('tempo')
>>> print(m)
<_sre.SRE_Match object at 0x...>
Now you can query the MatchObject for information about the matching string.
MatchObject instances also have several methods and attributes; the most
important ones are:
Method/Attribute | Purpose |
---|---|
group() | Return the string matched by the RE |
start() | Return the starting position of the match |
end() | Return the ending position of the match |
span() | Return a tuple containing the (start, end) positions of the match |
Trying these methods will soon clarify their meaning:
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
group() returns the substring that was matched by the RE. start() and end()
return the starting and ending index of the match. span() returns both start
and end indexes in a single tuple. Since the match() method only checks if the
RE matches at the start of a string, start() will always be zero. However, the
search() method of patterns scans through the string, so the match may not
start at zero in that case.
>>> print(p.match('::: message'))
None
>>> m = p.search('::: message'); print(m)
<_sre.SRE_Match object at 0x...>
>>> m.group()
'message'
>>> m.span()
(4, 11)
In actual programs, the most common style is to store the MatchObject in a
variable, and then check if it was None. This usually looks like:
p = re.compile( ... )
m = p.match('string goes here')
if m:
    print('Match found:', m.group())
else:
    print('No match')
Two pattern methods return all of the matches for a pattern. findall()
returns a list of matching strings:
>>> p = re.compile(r'\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
findall() has to create the entire list before it can be returned as the
result. The finditer() method returns a sequence of MatchObject instances as
an iterator. [1]
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)
Module-Level Functions¶
You don't have to create a pattern object and call its methods; the re module
also provides top-level functions called match(), search(), findall(), sub(),
and so forth. These functions take the same arguments as the corresponding
pattern method, with the RE string added as the first argument, and still
return either None or a MatchObject instance.
>>> print(re.match(r'From\s+', 'Fromage amk'))
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
<_sre.SRE_Match object at 0x...>
Under the hood, these functions simply create a pattern object for you and call the appropriate method on it. They also store the compiled object in a cache, so future calls using the same RE are faster.
Should you use these module-level functions, or should you get the pattern and
call its methods yourself? That choice depends on how frequently the RE will
be used, and on your personal coding style. If the RE is being used at only
one point in the code, then the module functions are probably more convenient.
If a program contains a lot of regular expressions, or re-uses the same ones
in several locations, then it might be worthwhile to collect all the
definitions in one place, in a section of code that compiles all the REs ahead
of time. To take an example from the standard library, here's an extract from
xmllib.py:
ref = re.compile( ... )
entityref = re.compile( ... )
charref = re.compile( ... )
starttagopen = re.compile( ... )
I generally prefer to work with the compiled object, even for one-time uses, but few people will be as much of a purist about this as I am.
Compilation Flags¶
Compilation flags let you modify some aspects of how regular expressions work.
Flags are available in the re module under two names, a long name such as
IGNORECASE and a short, one-letter form such as I. (If you're familiar with
Perl's pattern modifiers, the one-letter forms use the same letters; the short
form of re.VERBOSE is re.X, for example.) Multiple flags can be specified by
bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.
Here’s a table of the available flags, followed by a more detailed explanation of each one.
Flag | Meaning |
---|---|
DOTALL, S | Make . match any character, including newlines |
IGNORECASE, I | Do case-insensitive matches |
LOCALE, L | Do a locale-aware match |
MULTILINE, M | Multi-line matching, affecting ^ and $ |
VERBOSE, X | Enable verbose REs, which can be organized more cleanly and understandably. |
UNICODE, U | Makes several escapes like \w, \b, \s and \d dependent on the Unicode character database. |
I, IGNORECASE
- Perform case-insensitive matching; character classes and literal strings will match letters by ignoring case. For example, [A-Z] will match lowercase letters, too, and Spam will match Spam, spam, or spAM. This lowercasing doesn't take the current locale into account; it will if you also set the LOCALE flag.
L, LOCALE
- Make \w, \W, \b, and \B dependent on the current locale.
Locales are a feature of the C library intended to help in writing programs that take account of language differences. For example, if you're processing French text, you'd want to be able to write \w+ to match words, but \w only matches the character class [A-Za-z]; it won't match 'é' or 'ç'. If your system is configured properly and a French locale is selected, certain C functions will tell the program that 'é' should also be considered a letter. Setting the LOCALE flag when compiling a regular expression will cause the resulting compiled object to use these C functions for \w; this is slower, but also enables \w+ to match French words as you'd expect.
M, MULTILINE
- (^ and $ haven't been explained yet; they'll be introduced in section More Metacharacters.)
Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches at the end of the string and at the end of each line (immediately preceding each newline).
S, DOTALL
- Makes the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
U, UNICODE
- Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.
X, VERBOSE
- This flag allows you to write regular expressions that are more readable by granting you more flexibility in how you can format them. When this flag has been specified, whitespace within the RE string is ignored, except when the whitespace is in a character class or preceded by an unescaped backslash; this lets you organize and indent the RE more clearly. This flag also lets you put comments within a RE that will be ignored by the engine; comments are marked by a '#' that's neither in a character class nor preceded by an unescaped backslash.
For example, here's a RE that uses re.VERBOSE; see how much easier it is to read?
charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)
Without the verbose setting, the RE would look like this:
charref = re.compile("&#(0[0-7]+"
                     "|[0-9]+"
                     "|x[0-9a-fA-F]+);")
In the above example, Python's automatic concatenation of string literals has been used to break up the RE into smaller pieces, but it's still more difficult to understand than the version using re.VERBOSE.
More Pattern Power¶
So far we’ve only covered a part of the features of regular expressions. In this section, we’ll cover some new metacharacters, and how to use groups to retrieve portions of the text that was matched.
More Metacharacters¶
There are some metacharacters that we haven’t covered yet. Most of them will be covered in this section.
Some of the remaining metacharacters to be discussed are zero-width
assertions. They don't cause the engine to advance through the string;
instead, they consume no characters at all, and simply succeed or fail. For
example, \b is an assertion that the current position is located at a word
boundary; the position isn't changed by the \b at all. This means that
zero-width assertions should never be repeated, because if they match once at
a given location, they can obviously be matched an infinite number of times.
|
Alternation, or the "or" operator. If A and B are regular expressions, A|B will match any string that matches either A or B. | has very low precedence in order to make it work reasonably when you're alternating multi-character strings. Crow|Servo will match either Crow or Servo, not Cro, a 'w' or an 'S', and ervo.
To match a literal '|', use \|, or enclose it inside a character class, as in [|].
^
Matches at the beginning of lines. Unless the MULTILINE flag has been set, this will only match at the beginning of the string. In MULTILINE mode, this also matches immediately after each newline within the string.
For example, if you wish to match the word From only at the beginning of a line, the RE to use is ^From.
>>> print(re.search('^From', 'From Here to Eternity'))
<_sre.SRE_Match object at 0x...>
>>> print(re.search('^From', 'Reciting From Memory'))
None
$
Matches at the end of a line, which is defined as either the end of the string, or any location followed by a newline character.
>>> print(re.search('}$', '{block}'))
<_sre.SRE_Match object at 0x...>
>>> print(re.search('}$', '{block} '))
None
>>> print(re.search('}$', '{block}\n'))
<_sre.SRE_Match object at 0x...>
To match a literal '$', use \$ or enclose it inside a character class, as in [$].
\A
- Matches only at the start of the string. When not in MULTILINE mode, \A and ^ are effectively the same. In MULTILINE mode, they're different: \A still matches only at the beginning of the string, but ^ may match at any location inside the string that follows a newline character.
\Z
- Matches only at the end of the string.
\b
Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.
The following example matches class only when it's a complete word; it won't match when it's contained inside another word.
>>> p = re.compile(r'\bclass\b')
>>> print(p.search('no class at all'))
<_sre.SRE_Match object at 0x...>
>>> print(p.search('the declassified algorithm'))
None
>>> print(p.search('one subclass is'))
None
There are two subtleties you should remember when using this special sequence. First, this is the worst collision between Python's string literals and regular expression sequences. In Python's string literals, \b is the backspace character, ASCII value 8. If you're not using raw strings, then Python will convert the \b to a backspace, and your RE won't match as you expect it to. The following example looks the same as our previous RE, but omits the 'r' in front of the RE string.
>>> p = re.compile('\bclass\b')
>>> print(p.search('no class at all'))
None
>>> print(p.search('\b' + 'class' + '\b'))
<_sre.SRE_Match object at 0x...>
Second, inside a character class, where there's no use for this assertion, \b represents the backspace character, for compatibility with Python's string literals.
\B
- Another zero-width assertion, this is the opposite of \b, only matching when the current position is not at a word boundary.
Grouping¶
Frequently you need to obtain more information than just whether the RE matched
or not. Regular expressions are often used to dissect strings by writing a RE
divided into several subgroups which match different components of interest.
For example, an RFC-822 header line is divided into a header name and a value,
separated by a ':', like this:
From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com
This can be handled by writing a regular expression which matches an entire header line, and has one group which matches the header name, and another group which matches the header’s value.
Groups are marked by the '(', ')' metacharacters. '(' and ')' have much the
same meaning as they do in mathematical expressions; they group together the
expressions contained inside them, and you can repeat the contents of a group
with a repeating qualifier, such as *, +, ?, or {m,n}. For example, (ab)* will
match zero or more repetitions of ab.
>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)
Groups indicated with '(', ')' also capture the starting and ending index of
the text that they match; this can be retrieved by passing an argument to
group(), start(), end(), and span(). Groups are numbered starting with 0.
Group 0 is always present; it's the whole RE, so MatchObject methods all have
group 0 as their default argument. Later we'll see how to express groups that
don't capture the span of text that they match.
>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'
Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
group() can be passed multiple group numbers at a time, in which case it will
return a tuple containing the corresponding values for those groups.
>>> m.group(2,1,2)
('b', 'abc', 'b')
The groups() method returns a tuple containing the strings for all the
subgroups, from 1 up to however many there are.
>>> m.groups()
('abc', 'b')
Backreferences in a pattern allow you to specify that the contents of an
earlier capturing group must also be found at the current location in the
string. For example, \1 will succeed if the exact contents of group 1 can be
found at the current position, and fails otherwise. Remember that Python's
string literals also use a backslash followed by numbers to allow including
arbitrary characters in a string, so be sure to use a raw string when
incorporating backreferences in a RE.
For example, the following RE detects doubled words in a string.
>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'
Backreferences like this aren’t often useful for just searching through a string — there are few text formats which repeat data in this way — but you’ll soon find out that they’re very useful when performing string substitutions.
Non-capturing and Named Groups¶
Elaborate REs may use many groups, both to capture substrings of interest, and to group and structure the RE itself. In complex REs, it becomes difficult to keep track of the group numbers. There are two features which help with this problem. Both of them use a common syntax for regular expression extensions, so we’ll look at that first.
Perl 5 added several additional features to standard regular expressions, and
the Python re module supports most of them. It would have been difficult to
choose new single-keystroke metacharacters or new special sequences beginning
with \ to represent the new features without making Perl's regular expressions
confusingly different from standard REs. If you chose & as a new
metacharacter, for example, old expressions would be assuming that & was a
regular character and wouldn't have escaped it by writing \& or [&].
The solution chosen by the Perl developers was to use (?...) as the extension
syntax. ? immediately after a parenthesis was a syntax error because the ?
would have nothing to repeat, so this didn't introduce any compatibility
problems. The characters immediately after the ? indicate what extension is
being used, so (?=foo) is one thing (a positive lookahead assertion) and
(?:foo) is something else (a non-capturing group containing the subexpression
foo).
Python adds an extension syntax to Perl's extension syntax. If the first
character after the question mark is a P, you know that it's an extension
that's specific to Python. Currently there are two such extensions:
(?P<name>...) defines a named group, and (?P=name) is a backreference to a
named group. If future versions of Perl 5 add similar features using a
different syntax, the re module will be changed to support the new syntax,
while preserving the Python-specific syntax for compatibility's sake.
Now that we’ve looked at the general extension syntax, we can return to the features that simplify working with groups in complex REs. Since groups are numbered from left to right and a complex expression may use many groups, it can become difficult to keep track of the correct numbering. Modifying such a complex RE is annoying, too: insert a new group near the beginning and you change the numbers of everything that follows it.
Sometimes you'll want to use a group to collect a part of a regular
expression, but aren't interested in retrieving the group's contents. You can
make this fact explicit by using a non-capturing group: (?:...), where you can
replace the ... with any other regular expression.
>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()
Except for the fact that you can’t retrieve the contents of what the group
matched, a non-capturing group behaves exactly the same as a capturing group;
you can put anything inside it, repeat it with a repetition metacharacter such
as *
, and nest it within other groups (capturing or non-capturing).
(?:...)
is particularly useful when modifying an existing pattern, since you
can add new groups without changing how all the other groups are numbered. It
should be mentioned that there’s no performance difference in searching between
capturing and non-capturing groups; neither form is any faster than the other.
A more significant feature is named groups: instead of referring to them by numbers, groups can be referenced by a name.
The syntax for a named group is one of the Python-specific extensions:
(?P<name>...)
. name is, obviously, the name of the group. Named groups
also behave exactly like capturing groups, and additionally associate a name
with a group. The MatchObject
methods that deal with capturing groups
all accept either integers that refer to the group by number or strings that
contain the desired group’s name. Named groups are still given numbers, so you
can retrieve information about a group in two ways:
>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search( '(((( Lots of punctuation )))' )
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'
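Match objects also provide a groupdict() method, which returns all the named groups at once as a dictionary mapping group names to matched substrings:
>>> m.groupdict()
{'word': 'Lots'}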
Named groups are handy because they let you use easily-remembered names, instead
of having to remember numbers. Here’s an example RE from the imaplib
module:
InternalDate = re.compile(r'INTERNALDATE "'
r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
r'(?P<year>[0-9][0-9][0-9][0-9])'
r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
r'"')
It’s obviously much easier to retrieve m.group('zonem')
, instead of having
to remember to retrieve group 9.
The syntax for backreferences in an expression such as (...)\1
refers to the
number of the group. There’s naturally a variant that uses the group name
instead of the number. This is another Python extension: (?P=name)
indicates
that the contents of the group called name should again be matched at the
current point. The regular expression for finding doubled words,
(\b\w+)\s+\1
can also be written as (?P<word>\b\w+)\s+(?P=word)
:
>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'
Lookahead Assertions¶
Another zero-width assertion is the lookahead assertion. Lookahead assertions are available in both positive and negative form, and look like this:
(?=...)
- Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., successfully matches at the current location, and fails otherwise. But, once the contained expression has been tried, the matching engine doesn’t advance at all; the rest of the pattern is tried right where the assertion started.
(?!...)
- Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn’t match at the current position in the string.
To make this concrete, let’s look at a case where a lookahead is useful.
Consider a simple pattern to match a filename and split it apart into a base
name and an extension, separated by a .
. For example, in news.rc
,
news
is the base name, and rc
is the filename’s extension.
The pattern to match this is quite simple:
.*[.].*$
Notice that the .
needs to be treated specially because it’s a
metacharacter; I’ve put it inside a character class. Also notice the trailing
$
; this is added to ensure that all the rest of the string must be included
in the extension. This regular expression matches foo.bar
and
autoexec.bat
and sendmail.cf
and printers.conf
.
Now, consider complicating the problem a bit; what if you want to match
filenames where the extension is not bat
? Some incorrect attempts:
.*[.][^b].*$
The first attempt above tries to exclude bat
by requiring
that the first character of the extension is not a b
. This is wrong,
because the pattern also doesn’t match foo.bar
.
.*[.]([^b]..|.[^a].|..[^t])$
The expression gets messier when you try to patch up the first solution by
requiring one of the following cases to match: the first character of the
extension isn’t b
; the second character isn’t a
; or the third character
isn’t t
. This accepts foo.bar
and rejects autoexec.bat
, but it
requires a three-letter extension and won’t accept a filename with a two-letter
extension such as sendmail.cf
. We’ll complicate the pattern again in an
effort to fix it.
.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$
In the third attempt, the second and third letters are all made optional in
order to allow matching extensions shorter than three characters, such as
sendmail.cf
.
The pattern’s getting really complicated now, which makes it hard to read and
understand. Worse, if the problem changes and you want to exclude both bat
and exe
as extensions, the pattern would get even more complicated and
confusing.
A negative lookahead cuts through all this confusion:
.*[.](?!bat$).*$
The negative lookahead means: if the expression bat
doesn’t match at this point, try the rest of the pattern; if bat$
does
match, the whole pattern will fail. The trailing $
is required to ensure
that something like sample.batch
, where the extension only starts with
bat
, will be allowed.
Excluding another filename extension is now easy; simply add it as an
alternative inside the assertion. The following pattern excludes filenames that
end in either bat
or exe
:
.*[.](?!bat$|exe$).*$
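As a quick check (a small sketch exercising the pattern above), this expression rejects autoexec.bat but accepts sendmail.cf:
>>> p = re.compile(r'.*[.](?!bat$|exe$).*$')
>>> print p.match('autoexec.bat')
None
>>> p.match('sendmail.cf').group()
'sendmail.cf'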
Modifying Strings¶
Up to this point, we’ve simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods:
Method/Attribute | Purpose
---|---
split() | Split the string into a list, splitting it wherever the RE matches
sub() | Find all substrings where the RE matches, and replace them with a different string
subn() | Does the same thing as sub(), but returns the new string and the number of replacements
Splitting Strings¶
The split()
method of a pattern splits a string apart
wherever the RE matches, returning a list of the pieces. It’s similar to the
split()
method of strings but provides much more generality in the
delimiters that you can split by; split()
only supports splitting by
whitespace or by a fixed string. As you’d expect, there’s a module-level
re.split()
function, too.
.split(string[, maxsplit=0])
Split string by the matches of the regular expression. If capturing parentheses are used in the RE, then their contents will also be returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits are performed.
You can limit the number of splits made, by passing a value for maxsplit. When maxsplit is nonzero, at most maxsplit splits will be made, and the remainder of the string is returned as the final element of the list. In the following example, the delimiter is any sequence of non-alphanumeric characters.
>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
Sometimes you’re not only interested in what the text between delimiters is, but also need to know what the delimiter was. If capturing parentheses are used in the RE, then their values are also returned as part of the list. Compare the following calls:
>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.')
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
The module-level function re.split()
adds the RE to be used as the first
argument, but is otherwise the same.
>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', 1)
['Words', 'words, words.']
Search and Replace¶
Another common task is to find all the matches for a pattern, and replace them
with a different string. The sub()
method takes a replacement value,
which can be either a string or a function, and the string to be processed.
.sub(replacement, string[, count=0])
Returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string by the replacement replacement. If the pattern isn’t found, string is returned unchanged.
The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. The default value of 0 means to replace all occurrences.
Here’s a simple example of using the sub()
method. It replaces colour
names with the word colour
:
>>> p = re.compile( '(blue|white|red)')
>>> p.sub( 'colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub( 'colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'
The subn()
method does the same work, but returns a 2-tuple containing the
new string value and the number of replacements that were performed:
>>> p = re.compile( '(blue|white|red)')
>>> p.subn( 'colour', 'blue socks and red shoes')
('colour socks and colour shoes', 2)
>>> p.subn( 'colour', 'no colours at all')
('no colours at all', 0)
Empty matches are replaced only when they’re not adjacent to a previous match.
>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b-d-'
If replacement is a string, any backslash escapes in it are processed. That
is, \n
is converted to a single newline character, \r
is converted to a
carriage return, and so forth. Unknown escapes such as \j
are left alone.
Backreferences, such as \6
, are replaced with the substring matched by the
corresponding group in the RE. This lets you incorporate portions of the
original text in the resulting replacement string.
This example matches the word section
followed by a string enclosed in
{
, }
, and changes section
to subsection
:
>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}','section{First} section{second}')
'subsection{First} subsection{second}'
There’s also a syntax for referring to named groups as defined by the
(?P<name>...)
syntax. \g<name>
will use the substring matched by the
group named name
, and \g<number>
uses the corresponding group number.
\g<2>
is therefore equivalent to \2
, but isn’t ambiguous in a
replacement string such as \g<2>0
. (\20
would be interpreted as a
reference to group 20, not a reference to group 2 followed by the literal
character '0'
.) The following substitutions are all equivalent, but use all
three variations of the replacement string.
>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<1>}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<name>}','section{First}')
'subsection{First}'
replacement can also be a function, which gives you even more control. If
replacement is a function, the function is called for every non-overlapping
occurrence of pattern. On each call, the function is passed a
MatchObject
argument for the match and can use this information to
compute the desired replacement string and return it.
In the following example, the replacement function translates decimals into hexadecimal:
>>> def hexrepl( match ):
... "Return the hex string for a decimal number"
... value = int( match.group() )
... return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'
When using the module-level re.sub()
function, the pattern is passed as
the first argument. The pattern may be provided as an object or as a string; if
you need to specify regular expression flags, you must either use a
pattern object as the first parameter, or use embedded modifiers in the
pattern string, e.g. sub("(?i)b+", "x", "bbbb BBBB")
returns 'x x'
.
Common Problems¶
Regular expressions are a powerful tool for some applications, but in some ways their behaviour isn’t intuitive and at times they don’t behave the way you may expect them to. This section will point out some of the most common pitfalls.
Use String Methods¶
Sometimes using the re
module is a mistake. If you’re matching a fixed
string, or a single character class, and you’re not using any re
features
such as the IGNORECASE
flag, then the full power of regular expressions
may not be required. Strings have several methods for performing operations with
fixed strings and they’re usually much faster, because the implementation is a
single small C loop that’s been optimized for the purpose, instead of the large,
more generalized regular expression engine.
One example might be replacing a single fixed string with another one; for
example, you might replace word
with deed
. re.sub()
seems like the
function to use for this, but consider the replace()
method. Note that
replace()
will also replace word
inside words, turning swordfish
into sdeedfish
, but the naive RE word
would have done that, too. (To
avoid performing the substitution on parts of words, the pattern would have to
be \bword\b
, in order to require that word
have a word boundary on
either side. This takes the job beyond replace()
‘s abilities.)
Another common task is deleting every occurrence of a single character from a
string or replacing it with another single character. You might do this with
something like re.sub('\n', ' ', S)
, but translate()
is capable of
doing both tasks and will be faster than any regular expression operation can
be.
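For example (assuming Python 2.6 or later, where the translation table may be None so that translate() simply deletes the listed characters):
>>> 'one\ntwo\nthree'.replace('\n', ' ')
'one two three'
>>> 'one\ntwo\nthree'.translate(None, '\n')
'onetwothree'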
In short, before turning to the re
module, consider whether your problem
can be solved with a faster and simpler string method.
match() versus search()¶
The match()
function only checks if the RE matches at the beginning of the
string while search()
will scan forward through the string for a match.
It’s important to keep this distinction in mind. Remember, match()
will
only report a successful match which will start at 0; if the match wouldn’t
start at zero, match()
will not report it.
>>> print re.match('super', 'superstition').span()
(0, 5)
>>> print re.match('super', 'insuperable')
None
On the other hand, search()
will scan forward through the string,
reporting the first match it finds.
>>> print re.search('super', 'superstition').span()
(0, 5)
>>> print re.search('super', 'insuperable').span()
(2, 7)
Sometimes you’ll be tempted to keep using re.match()
, and just add .*
to the front of your RE. Resist this temptation and use re.search()
instead. The regular expression compiler does some analysis of REs in order to
speed up the process of looking for a match. One such analysis figures out what
the first character of a match must be; for example, a pattern starting with
Crow
must match starting with a 'C'
. The analysis lets the engine
quickly scan through the string looking for the starting character, only trying
the full match if a 'C'
is found.
Adding .*
defeats this optimization, requiring scanning to the end of the
string and then backtracking to find a match for the rest of the RE. Use
re.search()
instead.
Greedy versus Non-Greedy¶
When repeating a regular expression, as in a*
, the resulting action is to
consume as much of the pattern as possible. This fact often bites you when
you’re trying to match a pair of balanced delimiters, such as the angle brackets
surrounding an HTML tag. The naive pattern for matching a single HTML tag
doesn’t work because of the greedy nature of .*
.
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print re.match('<.*>', s).span()
(0, 32)
>>> print re.match('<.*>', s).group()
<html><head><title>Title</title>
The RE matches the '<'
in <html>
, and the .*
consumes the rest of
the string. There’s still more left in the RE, though, and the >
can’t
match at the end of the string, so the regular expression engine has to
backtrack character by character until it finds a match for the >
. The
final match extends from the '<'
in <html>
to the '>'
in
</title>
, which isn’t what you want.
In this case, the solution is to use the non-greedy qualifiers *?
, +?
,
??
, or {m,n}?
, which match as little text as possible. In the above
example, the '>'
is tried immediately after the first '<'
matches, and
when it fails, the engine advances a character at a time, retrying the '>'
at every step. This produces just the right result:
>>> print re.match('<.*?>', s).group()
<html>
(Note that parsing HTML or XML with regular expressions is painful. Quick-and-dirty patterns will handle common cases, but HTML and XML have special cases that will break the obvious regular expression; by the time you’ve written a regular expression that handles all of the possible cases, the patterns will be very complicated. Use an HTML or XML parser module for such tasks.)
Using re.VERBOSE¶
By now you’ve probably noticed that regular expressions are a very compact notation, but they’re not terribly readable. REs of moderate complexity can become lengthy collections of backslashes, parentheses, and metacharacters, making them difficult to read and understand.
For such REs, specifying the re.VERBOSE
flag when compiling the regular
expression can be helpful, because it allows you to format the regular
expression more clearly.
The re.VERBOSE
flag has several effects. Whitespace in the regular
expression that isn’t inside a character class is ignored. This means that an
expression such as dog | cat
is equivalent to the less readable dog|cat
,
but [a b]
will still match the characters 'a'
, 'b'
, or a space. In
addition, you can also put comments inside a RE; comments extend from a #
character to the next newline. When used with triple-quoted strings, this
enables REs to be formatted more neatly:
pat = re.compile(r"""
\s* # Skip leading whitespace
(?P<header>[^:]+) # Header name
\s* : # Whitespace, and a colon
(?P<value>.*?) # The header's value -- *? used to
# lose the following trailing whitespace
\s*$ # Trailing whitespace to end-of-line
""", re.VERBOSE)
This is far more readable than:
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
Feedback¶
Regular expressions are a complicated topic. Did this document help you understand them? Were there parts that were unclear, or problems you encountered that weren’t covered here? If so, please send suggestions for improvements to the author.
The most complete book on regular expressions is almost certainly Jeffrey
Friedl’s Mastering Regular Expressions, published by O’Reilly. Unfortunately,
it exclusively concentrates on Perl and Java’s flavours of regular expressions,
and doesn’t contain any Python material at all, so it won’t be useful as a
reference for programming in Python. (The first edition covered Python’s
now-removed regex
module, which won’t help you much.) Consider checking
it out from your library.
Socket Programming HOWTO¶
Author: Gordon McMillan
Abstract
Sockets are used nearly everywhere, but are one of the most severely misunderstood technologies around. This is a 10,000 foot overview of sockets. It’s not really a tutorial - you’ll still have work to do in getting things operational. It doesn’t cover the fine points (and there are a lot of them), but I hope it will give you enough background to begin using them decently.
Sockets¶
I’m only going to talk about INET sockets, but they account for at least 99% of the sockets in use. And I’ll only talk about STREAM sockets - unless you really know what you’re doing (in which case this HOWTO isn’t for you!), you’ll get better behavior and performance from a STREAM socket than anything else. I will try to clear up the mystery of what a socket is, as well as some hints on how to work with blocking and non-blocking sockets. But I’ll start by talking about blocking sockets. You’ll need to know how they work before dealing with non-blocking sockets.
Part of the trouble with understanding these things is that “socket” can mean a number of subtly different things, depending on context. So first, let’s make a distinction between a “client” socket - an endpoint of a conversation, and a “server” socket, which is more like a switchboard operator. The client application (your browser, for example) uses “client” sockets exclusively; the web server it’s talking to uses both “server” sockets and “client” sockets.
History¶
Of the various forms of IPC, sockets are by far the most popular. On any given platform, there are likely to be other forms of IPC that are faster, but for cross-platform communication, sockets are about the only game in town.
They were invented in Berkeley as part of the BSD flavor of Unix. They spread like wildfire with the Internet. With good reason — the combination of sockets with INET makes talking to arbitrary machines around the world unbelievably easy (at least compared to other schemes).
Creating a Socket¶
Roughly speaking, when you clicked on the link that brought you to this page, your browser did something like the following:
#create an INET, STREAMing socket
s = socket.socket(
socket.AF_INET, socket.SOCK_STREAM)
#now connect to the web server on port 80
# - the normal http port
s.connect(("www.mcmillan-inc.com", 80))
When the connect
completes, the socket s
can be used to send
in a request for the text of the page. The same socket will read the
reply, and then be destroyed. That’s right, destroyed. Client sockets
are normally only used for one exchange (or a small set of sequential
exchanges).
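As a minimal sketch of that one exchange (the request text is hypothetical, and error handling is omitted):
#send a request and read the reply
s.send('GET / HTTP/1.0\r\nHost: www.mcmillan-inc.com\r\n\r\n')
reply = s.recv(4096)   # may return fewer bytes than the full reply; see below
s.close()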
What happens in the web server is a bit more complex. First, the web server creates a “server socket”:
#create an INET, STREAMing socket
serversocket = socket.socket(
socket.AF_INET, socket.SOCK_STREAM)
#bind the socket to a public host,
# and a well-known port
serversocket.bind((socket.gethostname(), 80))
#become a server socket
serversocket.listen(5)
A couple things to notice: we used socket.gethostname()
so that the socket
would be visible to the outside world. If we had used s.bind(('', 80))
or
s.bind(('localhost', 80))
or s.bind(('127.0.0.1', 80))
we would still
have a “server” socket, but one that was only visible within the same machine.
A second thing to note: low number ports are usually reserved for “well known” services (HTTP, SNMP etc). If you’re playing around, use a nice high number (4 digits).
Finally, the argument to listen
tells the socket library that we want it to
queue up as many as 5 connect requests (the normal max) before refusing outside
connections. If the rest of the code is written properly, that should be plenty.
Now that we have a “server” socket, listening on port 80, we can enter the mainloop of the web server:
while 1:
#accept connections from outside
(clientsocket, address) = serversocket.accept()
#now do something with the clientsocket
#in this case, we'll pretend this is a threaded server
ct = client_thread(clientsocket)
ct.run()
There are actually three general ways in which this loop could work - dispatching a thread to handle clientsocket, creating a new process to handle clientsocket, or restructuring this app to use non-blocking sockets and multiplexing between our “server” socket and any active clientsockets using select. More about that later. The important thing to understand now is this: this is all a “server” socket does. It doesn’t send any data. It doesn’t receive any data. It just produces “client” sockets. Each clientsocket is created in response to some other “client” socket doing a connect() to the host and port we’re bound to. As soon as we’ve created that clientsocket, we go back to listening for more connections. The two “clients” are free to chat it up - they are using some dynamically allocated port which will be recycled when the conversation ends.
IPC¶
If you need fast IPC between two processes on one machine, you should look into whatever form of shared memory the platform offers. A simple protocol based around shared memory and locks or semaphores is by far the fastest technique.
If you do decide to use sockets, bind the “server” socket to 'localhost'
. On
most platforms, this will take a shortcut around a couple of layers of network
code and be quite a bit faster.
Using a Socket¶
The first thing to note, is that the web browser’s “client” socket and the web
server’s “client” socket are identical beasts. That is, this is a “peer to peer”
conversation. Or to put it another way, as the designer, you will have to
decide what the rules of etiquette are for a conversation. Normally, the
connect
ing socket starts the conversation, by sending in a request, or
perhaps a signon. But that’s a design decision - it’s not a rule of sockets.
Now there are two sets of verbs to use for communication. You can use send
and recv
, or you can transform your client socket into a file-like beast and
use read
and write
. The latter is the way Java presents its sockets.
I’m not going to talk about it here, except to warn you that you need to use
flush
on sockets. These are buffered “files”, and a common mistake is to
write
something, and then read
for a reply. Without a flush
in
there, you may wait forever for the reply, because the request may still be in
your output buffer.
Now we come to the major stumbling block of sockets - send
and recv
operate
on the network buffers. They do not necessarily handle all the bytes you hand
them (or expect from them), because their major focus is handling the network
buffers. In general, they return when the associated network buffers have been
filled (send
) or emptied (recv
). They then tell you how many bytes they
handled. It is your responsibility to call them again until your message has
been completely dealt with.
When a recv
returns 0 bytes, it means the other side has closed (or is in
the process of closing) the connection. You will not receive any more data on
this connection. Ever. You may be able to send data successfully; I’ll talk
about that some on the next page.
A protocol like HTTP uses a socket for only one transfer. The client sends a request, then reads a reply. That’s it. The socket is discarded. This means that a client can detect the end of the reply by receiving 0 bytes.
But if you plan to reuse your socket for further transfers, you need to realize
that there is no EOT on a socket. I repeat: if a socket
send
or recv
returns after handling 0 bytes, the connection has been
broken. If the connection has not been broken, you may wait on a recv
forever, because the socket will not tell you that there’s nothing more to
read (for now). Now if you think about that a bit, you’ll come to realize a
fundamental truth of sockets: messages must either be fixed length (yuck), or
be delimited (shrug), or indicate how long they are (much better), or end by
shutting down the connection. The choice is entirely yours, (but some ways are
righter than others).
Assuming you don’t want to end the connection, the simplest solution is a fixed length message:
class mysocket:
'''demonstration class only
- coded for clarity, not efficiency
'''
def __init__(self, sock=None):
if sock is None:
self.sock = socket.socket(
socket.AF_INET, socket.SOCK_STREAM)
else:
self.sock = sock
def connect(self, host, port):
self.sock.connect((host, port))
def mysend(self, msg):
totalsent = 0
while totalsent < MSGLEN:
sent = self.sock.send(msg[totalsent:])
if sent == 0:
raise RuntimeError("socket connection broken")
totalsent = totalsent + sent
def myreceive(self):
msg = ''
while len(msg) < MSGLEN:
chunk = self.sock.recv(MSGLEN-len(msg))
if chunk == '':
raise RuntimeError("socket connection broken")
msg = msg + chunk
return msg
The sending code here is usable for almost any messaging scheme - in Python you
send strings, and you can use len()
to determine its length (even if it has
embedded \0
characters). It’s mostly the receiving code that gets more
complex. (And in C, it’s not much worse, except you can’t use strlen
if the
message has embedded \0
s.)
The easiest enhancement is to make the first character of the message an
indicator of message type, and have the type determine the length. Now you have
two recv
s - the first to get (at least) that first character so you can
look up the length, and the second in a loop to get the rest. If you decide to
go the delimited route, you’ll be receiving in some arbitrary chunk size, (4096
or 8192 is frequently a good match for network buffer sizes), and scanning what
you’ve received for a delimiter.
One complication to be aware of: if your conversational protocol allows multiple
messages to be sent back to back (without some kind of reply), and you pass
recv
an arbitrary chunk size, you may end up reading the start of a
following message. You’ll need to put that aside and hold onto it, until it’s
needed.
Prefixing the message with its length (say, as 5 numeric characters) gets more
complex, because (believe it or not), you may not get all 5 characters in one
recv
. In playing around, you’ll get away with it; but in high network loads,
your code will very quickly break unless you use two recv
loops - the first
to determine the length, the second to get the data part of the message. Nasty.
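Here is a sketch of that two-loop approach, assuming the 5-character decimal length prefix just described (the function name is invented, and like the class above it is coded for clarity, not efficiency):
def receive_prefixed(sock):
    # first loop: collect exactly 5 characters of length prefix
    header = ''
    while len(header) < 5:
        chunk = sock.recv(5 - len(header))
        if chunk == '':
            raise RuntimeError("socket connection broken")
        header = header + chunk
    length = int(header)
    # second loop: collect the data part of the message
    data = ''
    while len(data) < length:
        chunk = sock.recv(length - len(data))
        if chunk == '':
            raise RuntimeError("socket connection broken")
        data = data + chunk
    return data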
This is also when you’ll discover that send
does not always manage to get
rid of everything in one pass. And despite having read this, you will eventually
get bit by it!
In the interests of space, building your character, (and preserving my competitive position), these enhancements are left as an exercise for the reader. Let’s move on to cleaning up.
Binary Data¶
It is perfectly possible to send binary data over a socket. The major problem is
that not all machines use the same formats for binary data. For example, a
Motorola chip will represent a 16 bit integer with the value 1 as the two hex
bytes 00 01. Intel and DEC, however, are byte-reversed - that same 1 is 01 00.
Socket libraries have calls for converting 16 and 32 bit integers - ntohl,
htonl, ntohs, htons
where “n” means network and “h” means host, “s” means
short and “l” means long. Where network order is host order, these do
nothing, but where the machine is byte-reversed, these swap the bytes around
appropriately.
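In Python, the struct module provides the same conversions portably; the '!' format character means network (big-endian) byte order:
>>> import struct
>>> struct.pack('!H', 1)       # 16 bit integer in network order
'\x00\x01'
>>> struct.unpack('!H', '\x00\x01')
(1,)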
In these days of 32 bit machines, the ASCII representation of binary data is frequently smaller than the binary representation. That’s because a surprising amount of the time, all those longs have the value 0, or maybe 1. The string “0” would be two bytes, while binary is four. Of course, this doesn’t fit well with fixed-length messages. Decisions, decisions.
Disconnecting¶
Strictly speaking, you’re supposed to use shutdown
on a socket before you
close
it. The shutdown
is an advisory to the socket at the other end.
Depending on the argument you pass it, it can mean “I’m not going to send
anymore, but I’ll still listen”, or “I’m not listening, good riddance!”. Most
socket libraries, however, are so used to programmers neglecting to use this
piece of etiquette that normally a close
is the same as shutdown();
close()
. So in most situations, an explicit shutdown
is not needed.
One way to use shutdown
effectively is in an HTTP-like exchange. The client
sends a request and then does a shutdown(1)
. This tells the server “This
client is done sending, but can still receive.” The server can detect “EOF” by
a receive of 0 bytes. It can assume it has the complete request. The server
sends a reply. If the send
completes successfully then, indeed, the client
was still receiving.
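A sketch of the client side of that exchange (request is a hypothetical string holding the complete request):
s.send(request)
s.shutdown(1)          # done sending, but still receiving
reply = ''
while 1:
    chunk = s.recv(4096)
    if chunk == '':
        break          # server closed: the reply is complete
    reply = reply + chunk
s.close()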
Python takes the automatic shutdown a step further, and says that when a socket
is garbage collected, it will automatically do a close
if it’s needed. But
relying on this is a very bad habit. If your socket just disappears without
doing a close
, the socket at the other end may hang indefinitely, thinking
you’re just being slow. Please close
your sockets when you’re done.
When Sockets Die¶
Probably the worst thing about using blocking sockets is what happens when the
other side comes down hard (without doing a close
). Your socket is likely to
hang. TCP is a reliable protocol, and it will wait a long, long time
before giving up on a connection. If you’re using threads, the entire thread is
essentially dead. There’s not much you can do about it. As long as you aren’t
doing something dumb, like holding a lock while doing a blocking read, the
thread isn’t really consuming much in the way of resources. Do not try to kill
the thread - part of the reason that threads are more efficient than processes
is that they avoid the overhead associated with the automatic recycling of
resources. In other words, if you do manage to kill the thread, your whole
process is likely to be screwed up.
Non-blocking Sockets¶
If you’ve understood the preceding, you already know most of what you need to know about the mechanics of using sockets. You’ll still use the same calls, in much the same ways. It’s just that, if you do it right, your app will be almost inside-out.
In Python, you use socket.setblocking(0)
to make it non-blocking. In C, it’s
more complex, (for one thing, you’ll need to choose between the BSD flavor
O_NONBLOCK
and the almost indistinguishable Posix flavor O_NDELAY
, which
is completely different from TCP_NODELAY
), but it’s the exact same idea. You
do this after creating the socket, but before using it. (Actually, if you’re
nuts, you can switch back and forth.)
The major mechanical difference is that send
, recv
, connect
and
accept
can return without having done anything. You have (of course) a
number of choices. You can check return code and error codes and generally drive
yourself crazy. If you don’t believe me, try it sometime. Your app will grow
large, buggy and suck CPU. So let’s skip the brain-dead solutions and do it
right.
Use select.
In C, coding select
is fairly complex. In Python, it’s a piece of cake, but
it’s close enough to the C version that if you understand select
in Python,
you’ll have little trouble with it in C:
ready_to_read, ready_to_write, in_error = \
select.select(
potential_readers,
potential_writers,
potential_errs,
timeout)
You pass select
three lists: the first contains all sockets that you might
want to try reading; the second all the sockets you might want to try writing
to, and the last (normally left empty) those that you want to check for errors.
You should note that a socket can go into more than one list. The select
call is blocking, but you can give it a timeout. This is generally a sensible
thing to do - give it a nice long timeout (say a minute) unless you have good
reason to do otherwise.
In return, you will get three lists. They contain the sockets that are actually readable, writable and in error. Each of these lists is a subset (possibly empty) of the corresponding list you passed in.
If a socket is in the output readable list, you can be
as-close-to-certain-as-we-ever-get-in-this-business that a recv
on that
socket will return something. Same idea for the writable list. You’ll be able
to send something. Maybe not all you want to, but something is better than
nothing. (Actually, any reasonably healthy socket will return as writable - it
just means outbound network buffer space is available.)
If you have a “server” socket, put it in the potential_readers list. If it comes
out in the readable list, your accept
will (almost certainly) work. If you
have created a new socket to connect
to someone else, put it in the
potential_writers list. If it shows up in the writable list, you have a decent
chance that it has connected.
One very nasty problem with select
: if somewhere in those input lists of
sockets is one which has died a nasty death, the select
will fail. You then
need to loop through every single damn socket in all those lists and do a
select([sock],[],[],0)
until you find the bad one. That timeout of 0 means
it won’t take long, but it’s ugly.
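A sketch of that probing loop (all_sockets is a hypothetical list holding every socket you were passing to select):
import select

for sock in list(all_sockets):
    try:
        select.select([sock], [], [], 0)
    except select.error:
        all_sockets.remove(sock)    # found a dead socket; drop it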
Actually, select
can be handy even with blocking sockets. It’s one way of
determining whether you will block - the socket returns as readable when there’s
something in the buffers. However, this still doesn’t help with the problem of
determining whether the other end is done, or just busy with something else.
Portability alert: On Unix, select
works both with the sockets and
files. Don’t try this on Windows. On Windows, select
works with sockets
only. Also note that in C, many of the more advanced socket options are done
differently on Windows. In fact, on Windows I usually use threads (which work
very, very well) with my sockets. Face it, if you want any kind of performance,
your code will look very different on Windows than on Unix.
Performance¶
There’s no question that the fastest sockets code uses non-blocking sockets and select to multiplex them. You can put together something that will saturate a LAN connection without putting any strain on the CPU. The trouble is that an app written this way can’t do much of anything else - it needs to be ready to shuffle bytes around at all times.
Assuming that your app is actually supposed to do something more than that, threading is the optimal solution, (and using non-blocking sockets will be faster than using blocking sockets). Unfortunately, threading support in Unixes varies both in API and quality. So the normal Unix solution is to fork a subprocess to deal with each connection. The overhead for this is significant (and don’t do this on Windows - the overhead of process creation is enormous there). It also means that unless each subprocess is completely independent, you’ll need to use another form of IPC, say a pipe, or shared memory and semaphores, to communicate between the parent and child processes.
Finally, remember that even though blocking sockets are somewhat slower than
non-blocking, in many cases they are the “right” solution. After all, if your
app is driven by the data it receives over a socket, there’s not much sense in
complicating the logic just so your app can wait on select
instead of
recv
.
Sorting HOW TO¶
Author: Andrew Dalke and Raymond Hettinger
Release: 0.1
Python lists have a built-in list.sort()
method that modifies the list
in-place. There is also a sorted()
built-in function that builds a new
sorted list from an iterable.
In this document, we explore the various techniques for sorting data using Python.
Sorting Basics¶
A simple ascending sort is very easy: just call the sorted()
function. It
returns a new sorted list:
>>> sorted([5, 2, 3, 1, 4])
[1, 2, 3, 4, 5]
You can also use the list.sort()
method of a list. It modifies the list
in-place (and returns None to avoid confusion). Usually it’s less convenient
than sorted()
- but if you don’t need the original list, it’s slightly
more efficient.
>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> a
[1, 2, 3, 4, 5]
Another difference is that the list.sort()
method is only defined for
lists. In contrast, the sorted()
function accepts any iterable.
>>> sorted({1: 'D', 2: 'B', 3: 'B', 4: 'E', 5: 'A'})
[1, 2, 3, 4, 5]
Key Functions¶
Starting with Python 2.4, both list.sort()
and sorted()
added a
key parameter to specify a function to be called on each list element prior to
making comparisons.
For example, here’s a case-insensitive string comparison:
>>> sorted("This is a test string from Andrew".split(), key=str.lower)
['a', 'Andrew', 'from', 'is', 'string', 'test', 'This']
The value of the key parameter should be a function that takes a single argument and returns a key to use for sorting purposes. This technique is fast because the key function is called exactly once for each input record.
A common pattern is to sort complex objects using some of the object’s indices as keys. For example:
>>> student_tuples = [
('john', 'A', 15),
('jane', 'B', 12),
('dave', 'B', 10),
]
>>> sorted(student_tuples, key=lambda student: student[2]) # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
The same technique works for objects with named attributes. For example:
>>> class Student:
def __init__(self, name, grade, age):
self.name = name
self.grade = grade
self.age = age
def __repr__(self):
return repr((self.name, self.grade, self.age))
>>> student_objects = [
Student('john', 'A', 15),
Student('jane', 'B', 12),
Student('dave', 'B', 10),
]
>>> sorted(student_objects, key=lambda student: student.age) # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
Operator Module Functions¶
The key-function patterns shown above are very common, so Python provides
convenience functions to make accessor functions easier and faster. The operator
module has operator.itemgetter()
, operator.attrgetter()
, and
starting in Python 2.5 an operator.methodcaller()
function.
Using those functions, the above examples become simpler and faster:
>>> from operator import itemgetter, attrgetter
>>> sorted(student_tuples, key=itemgetter(2))
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
>>> sorted(student_objects, key=attrgetter('age'))
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
The operator module functions allow multiple levels of sorting. For example, to sort by grade then by age:
>>> sorted(student_tuples, key=itemgetter(1,2))
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]
>>> sorted(student_objects, key=attrgetter('grade', 'age'))
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]
The operator.methodcaller()
function makes method calls with fixed
parameters for each object being sorted. For example, the str.count()
method could be used to compute message priority by counting the
number of exclamation marks in a message:
>>> from operator import methodcaller
>>> messages = ['critical!!!', 'hurry!', 'standby', 'immediate!!']
>>> sorted(messages, key=methodcaller('count', '!'))
['standby', 'hurry!', 'immediate!!', 'critical!!!']
Ascending and Descending¶
Both list.sort()
and sorted()
accept a reverse parameter with a
boolean value. This is used to flag descending sorts. For example, to get the
student data in reverse age order:
>>> sorted(student_tuples, key=itemgetter(2), reverse=True)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
>>> sorted(student_objects, key=attrgetter('age'), reverse=True)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
Sort Stability and Complex Sorts¶
Starting with Python 2.2, sorts are guaranteed to be stable. That means that when multiple records have the same key, their original order is preserved.
>>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
>>> sorted(data, key=itemgetter(0))
[('blue', 1), ('blue', 2), ('red', 1), ('red', 2)]
Notice how the two records for blue retain their original order so that
('blue', 1)
is guaranteed to precede ('blue', 2)
.
This wonderful property lets you build complex sorts in a series of sorting steps. For example, to sort the student data by descending grade and then ascending age, do the age sort first and then sort again using grade:
>>> s = sorted(student_objects, key=attrgetter('age')) # sort on secondary key
>>> sorted(s, key=attrgetter('grade'), reverse=True) # now sort on primary key, descending
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
The Timsort algorithm used in Python does multiple sorts efficiently because it can take advantage of any ordering already present in a dataset.
The Old Way Using Decorate-Sort-Undecorate¶
This idiom is called Decorate-Sort-Undecorate after its three steps:
- First, the initial list is decorated with new values that control the sort order.
- Second, the decorated list is sorted.
- Finally, the decorations are removed, creating a list that contains only the initial values in the new order.
For example, to sort the student data by grade using the DSU approach:
>>> decorated = [(student.grade, i, student) for i, student in enumerate(student_objects)]
>>> decorated.sort()
>>> [student for grade, i, student in decorated] # undecorate
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
This idiom works because tuples are compared lexicographically; the first items are compared; if they are the same then the second items are compared, and so on.
It is not strictly necessary in all cases to include the index i in the decorated list, but including it gives two benefits:
- The sort is stable – if two items have the same key, their order will be preserved in the sorted list.
- The original items do not have to be comparable because the ordering of the decorated tuples will be determined by at most the first two items. So for example the original list could contain complex numbers which cannot be sorted directly.
Another name for this idiom is Schwartzian transform, after Randal L. Schwartz, who popularized it among Perl programmers.
For large lists and lists where the comparison information is expensive to calculate, and Python versions before 2.4, DSU is likely to be the fastest way to sort the list. For 2.4 and later, key functions provide the same functionality.
The Old Way Using the cmp Parameter¶
Many constructs given in this HOWTO assume Python 2.4 or later. Before that,
there was no sorted()
builtin and list.sort()
took no keyword
arguments. Instead, all of the Py2.x versions supported a cmp parameter to
handle user specified comparison functions.
In Py3.0, the cmp parameter was removed entirely (as part of a larger effort to
simplify and unify the language, eliminating the conflict between rich
comparisons and the __cmp__()
magic method).
In Py2.x, sort allowed an optional function which can be called for doing the comparisons. That function should take two arguments to be compared and then return a negative value for less-than, return zero if they are equal, or return a positive value for greater-than. For example, we can do:
>>> def numeric_compare(x, y):
return x - y
>>> sorted([5, 2, 4, 1, 3], cmp=numeric_compare)
[1, 2, 3, 4, 5]
Or you can reverse the order of comparison with:
>>> def reverse_numeric(x, y):
return y - x
>>> sorted([5, 2, 4, 1, 3], cmp=reverse_numeric)
[5, 4, 3, 2, 1]
When porting code from Python 2.x to 3.x, the situation can arise when you have the user supplying a comparison function and you need to convert that to a key function. The following wrapper makes that easy to do:
def cmp_to_key(mycmp):
'Convert a cmp= function into a key= function'
class K(object):
def __init__(self, obj, *args):
self.obj = obj
def __lt__(self, other):
return mycmp(self.obj, other.obj) < 0
def __gt__(self, other):
return mycmp(self.obj, other.obj) > 0
def __eq__(self, other):
return mycmp(self.obj, other.obj) == 0
def __le__(self, other):
return mycmp(self.obj, other.obj) <= 0
def __ge__(self, other):
return mycmp(self.obj, other.obj) >= 0
def __ne__(self, other):
return mycmp(self.obj, other.obj) != 0
return K
To convert to a key function, just wrap the old comparison function:
>>> sorted([5, 2, 4, 1, 3], key=cmp_to_key(reverse_numeric))
[5, 4, 3, 2, 1]
In Python 2.7, the functools.cmp_to_key()
function was added to the
functools module.
Odds and Ends¶
- For locale aware sorting, use locale.strxfrm() for a key function or locale.strcoll() for a comparison function.
- The reverse parameter still maintains sort stability (so that records with equal keys retain their original order). Interestingly, that effect can be simulated without the parameter by using the builtin reversed() function twice:
>>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
>>> assert sorted(data, reverse=True) == list(reversed(sorted(reversed(data))))
- To create a standard sort order for a class, just add the appropriate rich comparison methods:
>>> Student.__eq__ = lambda self, other: self.age == other.age
>>> Student.__ne__ = lambda self, other: self.age != other.age
>>> Student.__lt__ = lambda self, other: self.age < other.age
>>> Student.__le__ = lambda self, other: self.age <= other.age
>>> Student.__gt__ = lambda self, other: self.age > other.age
>>> Student.__ge__ = lambda self, other: self.age >= other.age
>>> sorted(student_objects)
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
- For general purpose comparisons, the recommended approach is to define all six rich comparison operators. The functools.total_ordering() class decorator makes this easy to implement.
- Key functions need not depend directly on the objects being sorted. A key function can also access external resources. For instance, if the student grades are stored in a dictionary, they can be used to sort a separate list of student names:
>>> students = ['dave', 'john', 'jane']
>>> grades = {'john': 'F', 'jane': 'A', 'dave': 'C'}
>>> sorted(students, key=grades.__getitem__)
['jane', 'dave', 'john']
Unicode HOWTO¶
Release: 1.03
This HOWTO discusses Python 2.x’s support for Unicode, and explains various problems that people commonly encounter when trying to work with Unicode. (This HOWTO has not yet been updated to cover the 3.x versions of Python.)
Introduction to Unicode¶
History of Character Codes¶
In 1968, the American Standard Code for Information Interchange, better known by its acronym ASCII, was standardized. ASCII defined numeric codes for various characters, with the numeric values running from 0 to 127. For example, the lowercase letter ‘a’ is assigned 97 as its code value.
ASCII was an American-developed standard, so it only defined unaccented characters. There was an ‘e’, but no ‘é’ or ‘Í’. This meant that languages which required accented characters couldn’t be faithfully represented in ASCII. (Actually the missing accents matter for English, too, which contains words such as ‘naïve’ and ‘café’, and some publications have house styles which require spellings such as ‘coöperate’.)
For a while people just wrote programs that didn’t display accents. I remember looking at Apple ][ BASIC programs, published in French-language publications in the mid-1980s, that had lines like these:
PRINT "FICHIER EST COMPLETE."
PRINT "CARACTERE NON ACCEPTE."
Those messages should contain accents, and they just look wrong to someone who can read French.
In the 1980s, almost all personal computers were 8-bit, meaning that bytes could hold values ranging from 0 to 255. ASCII codes only went up to 127, so some machines assigned values between 128 and 255 to accented characters. Different machines had different codes, however, which led to problems exchanging files. Eventually various commonly used sets of values for the 128-255 range emerged. Some were true standards, defined by the International Standards Organization, and some were de facto conventions that were invented by one company or another and managed to catch on.
255 characters aren’t very many. For example, you can’t fit both the accented characters used in Western Europe and the Cyrillic alphabet used for Russian into the 128-255 range because there are more than 127 such characters.
You could write files using different codes (all your Russian files in a coding system called KOI8, all your French files in a different coding system called Latin1), but what if you wanted to write a French document that quotes some Russian text? In the 1980s people began to want to solve this problem, and the Unicode standardization effort began.
Unicode started out using 16-bit characters instead of 8-bit characters. 16 bits means you have 2^16 = 65,536 distinct values available, making it possible to represent many different characters from many different alphabets; an initial goal was to have Unicode contain the alphabets for every single human language. It turns out that even 16 bits isn’t enough to meet that goal, and the modern Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in base-16).
There’s a related ISO standard, ISO 10646. Unicode and ISO 10646 were originally separate efforts, but the specifications were merged with the 1.1 revision of Unicode.
(This discussion of Unicode’s history is highly simplified. I don’t think the average Python programmer needs to worry about the historical details; consult the Unicode consortium site listed in the References for more information.)
Definitions¶
A character is the smallest possible component of a text. ‘A’, ‘B’, ‘C’, etc., are all different characters. So are ‘È’ and ‘Í’. Characters are abstractions, and vary depending on the language or context you’re talking about. For example, the symbol for ohms (Ω) is usually drawn much like the capital letter omega (Ω) in the Greek alphabet (they may even be the same in some fonts), but these are two different characters that have different meanings.
The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points:
0061 'a'; LATIN SMALL LETTER A
0062 'b'; LATIN SMALL LETTER B
0063 'c'; LATIN SMALL LETTER C
...
007B '{'; LEFT CURLY BRACKET
Strictly, these definitions imply that it’s meaningless to say ‘this is character U+12ca’. U+12ca is a code point, which represents some particular character; in this case, it represents the character ‘ETHIOPIC SYLLABLE WI’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
A character is represented on a screen or on paper by a set of graphical elements that’s called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details will depend on the font being used. Most Python code doesn’t need to worry about glyphs; figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal’s font renderer.
Encodings¶
To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff. This sequence needs to be represented as a set of bytes (meaning, values from 0-255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.
The first encoding you might think of is an array of 32-bit integers. In this representation, the string “Python” would look like this:
P y t h o n
0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
This representation is straightforward but using it presents a number of problems.
- It’s not portable; different processors order the bytes differently.
- It’s very wasteful of space. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by zero bytes. The above string takes 24 bytes compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn’t matter too much (desktop computers have megabytes of RAM, and strings aren’t usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable.
- It’s not compatible with existing C functions such as
strlen()
, so a new family of wide string functions would need to be used. - Many Internet standards are defined in terms of textual data, and can’t handle content with embedded zero bytes.
Generally people don’t use this encoding, instead choosing other encodings that are more efficient and convenient. UTF-8 is probably the most commonly supported encoding; it will be discussed below.
Encodings don’t have to handle every possible Unicode character, and most encodings don’t. For example, Python’s default encoding is the ‘ascii’ encoding. The rules for converting a Unicode string into the ASCII encoding are simple; for each code point:
- If the code point is < 128, each byte is the same as the value of the code point.
- If the code point is 128 or greater, the Unicode string can’t be represented
in this encoding. (Python raises a
UnicodeEncodeError
exception in this case.)
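For instance (a quick illustration in Python 2, where u'\xe9' is ‘é’):
>>> u'caf\xe9'.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)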
Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points 0-255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can’t be encoded into Latin-1.
Encodings don’t have to be simple one-to-one mappings like Latin-1. Consider IBM’s EBCDIC, which was used on IBM mainframes. Letter values weren’t in one block: ‘a’ through ‘i’ had values from 129 to 137, but ‘j’ through ‘r’ were 145 through 153. If you wanted to use EBCDIC as an encoding, you’d probably use some sort of lookup table to perform the conversion, but this is largely an internal detail.
UTF-8 is one of the most commonly used encodings. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit numbers are used in the encoding. (There’s also a UTF-16 encoding, but it’s less frequently used than UTF-8.) UTF-8 uses the following rules:
- If the code point is <128, it’s represented by the corresponding byte value.
- If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
- Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
UTF-8 has several convenient properties:
- It can handle any Unicode code point.
- A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as `strcpy()` and sent through protocols that can’t handle zero bytes.
- A string of ASCII text is also valid UTF-8 text.
- UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
- If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.
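To see these size rules in action, here is a minimal sketch in Python 2 (the `unicode` type and the `.encode()` method it uses are covered in the next section; the sample characters are arbitrary choices):

# Each code point costs a different number of bytes in UTF-8.
for ch in (u'a', u'\xe9', u'\u20ac'):
    encoded = ch.encode('utf-8')
    print 'U+%04x -> %d byte(s): %r' % (ord(ch), len(encoded), encoded)

# Encodings that can't represent a character raise UnicodeEncodeError:
try:
    u'\u20ac'.encode('latin-1')
except UnicodeEncodeError, exc:
    print exc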
References¶
The Unicode Consortium site at <http://www.unicode.org> has character charts, a glossary, and PDF versions of the Unicode specification. Be prepared for some difficult reading. <http://www.unicode.org/history/> is a chronology of the origin and development of Unicode.
To help understand the standard, Jukka Korpela has written an introductory guide to reading the Unicode character tables, available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
Another good introductory article was written by Joel Spolsky <http://www.joelonsoftware.com/articles/Unicode.html>. If this introduction didn’t make things clear to you, you should try reading this alternate article before continuing.
Wikipedia entries are often helpful; see the entries for “character encoding” <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8 <http://en.wikipedia.org/wiki/UTF-8>, for example.
Python 2.x’s Unicode Support¶
Now that you’ve learned the rudiments of Unicode, we can look at Python’s Unicode features.
The Unicode Type¶
Unicode strings are expressed as instances of the `unicode` type, one of Python’s repertoire of built-in types. It derives from an abstract type called `basestring`, which is also an ancestor of the `str` type; you can therefore check if a value is a string type with `isinstance(value, basestring)`. Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled.
The `unicode()` constructor has the signature `unicode(string[, encoding, errors])`. All of its arguments should be 8-bit strings. The first argument is converted to Unicode using the specified encoding; if you leave off the `encoding` argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors:
>>> unicode('abcdef')
u'abcdef'
>>> s = unicode('abcdef')
>>> type(s)
<type 'unicode'>
>>> unicode('abcdef' + chr(255))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
ordinal not in range(128)
The `errors` argument specifies the response when the input string can’t be converted according to the encoding’s rules. Legal values for this argument are ‘strict’ (raise a `UnicodeDecodeError` exception), ‘replace’ (add U+FFFD, ‘REPLACEMENT CHARACTER’), or ‘ignore’ (just leave the character out of the Unicode result). The following examples show the differences:
>>> unicode('\x80abc', errors='strict')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
>>> unicode('\x80abc', errors='ignore')
u'abc'
Encodings are specified as strings containing the encoding’s name. Python 2.7 comes with roughly 100 different encodings; see the Python Library Reference at standard-encodings for a list. Some encodings have multiple names; for example, ‘latin-1’, ‘iso_8859_1’ and ‘8859’ are all synonyms for the same encoding.
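To check which codec a name resolves to, you can ask `codecs.lookup()` for its canonical name (a minimal sketch; the `name` attribute of the returned `CodecInfo` object is available in Python 2.5 and later):

import codecs

# All three aliases resolve to the same codec.
for name in ('latin-1', 'iso_8859_1', '8859'):
    print name, '->', codecs.lookup(name).name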
One-character Unicode strings can also be created with the `unichr()` built-in function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in `ord()` function that takes a one-character Unicode string and returns the code point value:
>>> unichr(40960)
u'\ua000'
>>> ord(u'\ua000')
40960
Instances of the `unicode` type have many of the same methods as the 8-bit string type for operations such as searching and formatting:
>>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
>>> s.count('e')
5
>>> s.find('feather')
9
>>> s.find('bird')
-1
>>> s.replace('feather', 'sand')
u'Was ever sand so lightly blown to and fro as this multitude?'
>>> s.upper()
u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
Note that the arguments to these methods can be Unicode strings or 8-bit strings. 8-bit strings will be converted to Unicode before carrying out the operation; Python’s default ASCII encoding will be used, so characters greater than 127 will cause an exception:
>>> s.find('Was\x9f')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
>>> s.find(u'Was\x9f')
-1
Much Python code that operates on strings will therefore work with Unicode strings without requiring any changes to the code. (Input and output code needs more updating for Unicode; more on this later.)
Another important method is `.encode([encoding], [errors='strict'])`, which returns an 8-bit string version of the Unicode string, encoded in the requested encoding. The `errors` parameter is the same as the parameter of the `unicode()` constructor, with one additional possibility; as well as ‘strict’, ‘ignore’, and ‘replace’, you can also pass ‘xmlcharrefreplace’ which uses XML’s character references. The following example shows the different results:
>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'ꀀabcd޴'
Python’s 8-bit strings have a `.decode([encoding], [errors])` method that interprets the string using the given encoding:
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
>>> type(utf8_version), utf8_version
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True
The low-level routines for registering and accessing the available encodings are found in the `codecs` module. However, the encoding and decoding functions returned by this module are usually more low-level than is comfortable, so I’m not going to describe the `codecs` module here. If you need to implement a completely new encoding, you’ll need to learn about the `codecs` module interfaces, but implementing encodings is a specialized task that also won’t be covered here. Consult the Python documentation to learn more about this module.

The most commonly used part of the `codecs` module is the `codecs.open()` function which will be discussed in the section on input and output.
Unicode Literals in Python Source Code¶
In Python source code, Unicode literals are written as strings prefixed with the ‘u’ or ‘U’ character: `u'abcdefghijk'`. Specific code points can be written using the `\u` escape sequence, which is followed by four hex digits giving the code point. The `\U` escape sequence is similar, but expects 8 hex digits, not 4.
Unicode literals can also use the same escape sequences as 8-bit strings, including `\x`, but `\x` only takes two hex digits so it can’t express an arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777.
>>> s = u"a\xac\u1234\u20ac\U00008000"
^^^^ two-digit hex escape
^^^^^^ four-digit Unicode escape
^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s: print ord(c),
...
97 172 4660 8364 32768
Using escape sequences for code points greater than 127 is fine in small doses, but becomes an annoyance if you’re using many accented characters, as you would in a program with messages in French or some other accent-using language. You can also assemble strings using the `unichr()` built-in function, but this is even more tedious.
Ideally, you’d want to be able to write literals in your language’s natural encoding. You could then edit Python source code with your favorite editor which would display the accented characters naturally, and have the right characters used at runtime.
Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
u = u'abcdé'
print ord(u[-1])
The syntax is inspired by Emacs’s notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports ‘coding’. The `-*-` symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for `coding: name` or `coding=name` in the comment.
If you don’t include such a comment, the default encoding used will be ASCII. Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default encoding for string literals; in Python 2.4, characters greater than 127 still work but result in a warning. For example, the following program has no encoding declaration:
#!/usr/bin/env python
u = u'abcdé'
print ord(u[-1])
When you run it with Python 2.4, it will output the following warning:
amk:~$ python2.4 p263.py
sys:1: DeprecationWarning: Non-ASCII character '\xe9'
in file p263.py on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Python 2.5 and higher are stricter and will produce a syntax error:
amk:~$ python2.5 p263.py
File "/tmp/p263.py", line 2
SyntaxError: Non-ASCII character '\xc3' in file /tmp/p263.py
on line 2, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details
Unicode Properties¶
The Unicode specification includes a database of information about code points. For each code point that’s defined, the information includes the character’s name, its category, the numeric value if applicable (Unicode has characters representing the Roman numerals and fractions such as one-third and four-fifths). There are also properties related to the code point’s use in bidirectional text and other display-related properties.
The following program displays some information about several characters, and prints the numeric value of one particular character:
import unicodedata

u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

for i, c in enumerate(u):
    print i, '%04x' % ord(c), unicodedata.category(c),
    print unicodedata.name(c)

# Get numeric value of second character
print unicodedata.numeric(u[1])
When run, this prints:
0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0
The category codes are abbreviations describing the nature of the character. These are grouped into categories such as “Letter”, “Number”, “Punctuation”, or “Symbol”, which in turn are broken up into subcategories. To take the codes from the above output, `'Ll'` means ‘Letter, lowercase’, `'No'` means “Number, other”, `'Mn'` is “Mark, nonspacing”, and `'So'` is “Symbol, other”. See <http://www.unicode.org/reports/tr44/#General_Category_Values> for a list of category codes.
References¶
The Unicode and 8-bit string types are described in the Python library reference at typesseq.
The documentation for the `unicodedata` module.

The documentation for the `codecs` module.
Marc-André Lemburg gave a presentation at EuroPython 2002 titled “Python and Unicode”. A PDF version of his slides is available at <http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>, and is an excellent overview of the design of Python’s Unicode features.
Reading and Writing Unicode Data¶
Once you’ve written some code that works with Unicode data, the next problem is input/output. How do you get Unicode strings into your program, and how do you convert Unicode into a form suitable for storage or transmission?
It’s possible that you may not need to do anything depending on your input sources and output destinations; you should check whether the libraries used in your application support Unicode natively. XML parsers often return Unicode data, for example. Many relational databases also support Unicode-valued columns and can return Unicode values from an SQL query.
Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It’s possible to do all the work yourself: open a file, read an 8-bit string from it, and convert the string with `unicode(str, encoding)`. However, the manual approach is not recommended.
One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1K or 4K), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk. One solution would be to read the entire file into memory and then perform the decoding, but that prevents you from working with files that are extremely large; if you need to read a 2 GB file, you need 2 GB of RAM. (More, really, since for at least a moment you’d need to have both the encoded string and its Unicode version in memory.)
The solution would be to use the low-level decoding interface to catch the case of partial coding sequences. The work of implementing this has already been done for you: the `codecs` module includes a version of the `open()` function that returns a file-like object that assumes the file’s contents are in a specified encoding and accepts Unicode parameters for methods such as `.read()` and `.write()`.
The function’s parameters are `open(filename, mode='rb', encoding=None, errors='strict', buffering=1)`. `mode` can be `'r'`, `'w'`, or `'a'`, just like the corresponding parameter to the regular built-in `open()` function; add a `'+'` to update the file. `buffering` is similarly parallel to the standard function’s parameter. `encoding` is a string giving the encoding to use; if it’s left as `None`, a regular Python file object that accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and data written to or read from the wrapper object will be converted as needed. `errors` specifies the action for encoding errors and can be one of the usual values of ‘strict’, ‘ignore’, and ‘replace’.
Reading Unicode from a file is therefore simple:
import codecs

f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
It’s also possible to open files in update mode, allowing both reading and writing:
f = codecs.open('test', encoding='utf-8', mode='w+')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
print repr(f.readline()[:1])
f.close()
Unicode character U+FEFF is used as a byte-order mark (BOM), and is often written as the first character of a file in order to assist with autodetection of the file’s byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as ‘utf-16-le’ and ‘utf-16-be’ for little-endian and big-endian encodings, that specify one particular byte ordering and don’t skip the BOM.
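You can observe the difference directly by encoding a short string with each variant (a minimal sketch; the BOM bytes shown are for a little-endian machine):

# 'utf-16' prepends a BOM in the machine's native byte order;
# the endian-specific variants do not.
print repr(u'abc'.encode('utf-16'))     # '\xff\xfe' + 'a\x00b\x00c\x00'
print repr(u'abc'.encode('utf-16-le'))  # 'a\x00b\x00c\x00', no BOM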
Unicode filenames¶
Most of the operating systems in common use today support filenames that contain arbitrary Unicode characters. Usually this is implemented by converting the Unicode string into some encoding that varies depending on the system. For example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on Windows, Python uses the name “mbcs” to refer to whatever the currently configured encoding is. On Unix systems, there will only be a filesystem encoding if you’ve set the `LANG` or `LC_CTYPE` environment variables; if you haven’t, the default encoding is ASCII.
The `sys.getfilesystemencoding()` function returns the encoding to use on your current system, in case you want to do the encoding manually, but there’s not much reason to bother. When opening a file for reading or writing, you can usually just provide the Unicode string as the filename, and it will be automatically converted to the right encoding for you:
filename = u'filename\u4500abc'
f = open(filename, 'w')
f.write('blah\n')
f.close()
Functions in the `os` module such as `os.stat()` will also accept Unicode filenames.

`os.listdir()`, which returns filenames, raises an issue: should it return the Unicode version of filenames, or should it return 8-bit strings containing the encoded versions? `os.listdir()` will do both, depending on whether you provided the directory path as an 8-bit string or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem’s encoding and a list of Unicode strings will be returned, while passing an 8-bit path will return the 8-bit versions of the filenames. For example, assuming the default filesystem encoding is UTF-8, running the following program:
fn = u'filename\u4500abc'
f = open(fn, 'w')
f.close()
import os
print os.listdir('.')
print os.listdir(u'.')
will produce the following output:
amk:~$ python t.py
['.svn', 'filename\xe4\x94\x80abc', ...]
[u'.svn', u'filename\u4500abc', ...]
The first list contains UTF-8-encoded filenames, and the second list contains the Unicode versions.
Tips for Writing Unicode-aware Programs¶
This section provides some suggestions on writing software that deals with Unicode.
The most important tip is:
Software should only work with Unicode strings internally, converting to a particular encoding on output.
If you attempt to write processing functions that accept both Unicode and 8-bit strings, you will find your program vulnerable to bugs wherever you combine the two different kinds of strings. Python’s default encoding is ASCII, so whenever a character with an ASCII value > 127 is in the input data, you’ll get a `UnicodeDecodeError` because that character can’t be handled by the ASCII encoding.
It’s easy to miss such problems if you only test your software with data that doesn’t contain any accents; everything will seem to work, but there’s actually a bug in your program waiting for the first user who attempts to use characters > 127. A second tip, therefore, is:
Include characters > 127 and, even better, characters > 255 in your test data.
When using data coming from a web browser or some other untrusted source, a common technique is to check for illegal characters in a string before using the string in a generated command line or storing it in a database. If you’re doing this, be careful to check the string once it’s in the form that will be used or stored; it’s possible for encodings to be used to disguise characters. This is especially true if the input data also specifies the encoding; many encodings leave the commonly checked-for characters alone, but Python includes some encodings such as `'base64'` that modify every single character.
For example, let’s say you have a content management system that takes a Unicode filename, and you want to disallow paths with a ‘/’ character. You might write this code:
def read_file(filename, encoding):
    if '/' in filename:
        raise ValueError("'/' not allowed in filenames")
    unicode_name = filename.decode(encoding)
    f = open(unicode_name, 'r')
    # ... return contents of file ...
However, if an attacker could specify the `'base64'` encoding, they could pass `'L2V0Yy9wYXNzd2Q='`, which is the base-64 encoded form of the string `'/etc/passwd'`, to read a system file. The above code looks for `'/'` characters in the encoded form and misses the dangerous character in the resulting decoded form.
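A safer version of this hypothetical function, sketched below, decodes first and then checks the string that will actually be used:

def read_file(filename, encoding):
    # Decode first; validate the decoded form, not the encoded one.
    unicode_name = filename.decode(encoding)
    if u'/' in unicode_name:
        raise ValueError("'/' not allowed in filenames")
    f = open(unicode_name, 'r')
    # ... return contents of file ...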
References¶
The PDF slides for Marc-André Lemburg’s presentation “Writing Unicode-aware Applications in Python” are available at <http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf> and discuss questions of character encodings as well as how to internationalize and localize an application.
Revision History and Acknowledgements¶
Thanks to the following people who have noted errors or offered suggestions on this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
Version 1.0: posted August 5 2005.
Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds several links.
Version 1.02: posted August 16 2005. Corrects factual errors.
Version 1.03: posted June 20 2010. Notes that Python 3.x is not covered, and that the HOWTO only covers 2.x.
HOWTO Fetch Internet Resources Using urllib2¶
Author: | Michael Foord |
---|
Note
There is a French translation of an earlier revision of this HOWTO, available at urllib2 - Le Manuel manquant.
Introduction¶
urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function. This is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations - like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers.
urllib2 supports fetching URLs for many “URL schemes” (identified by the string before the ”:” in the URL - for example “ftp” is the URL scheme of “ftp://python.org/”) using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.
For straightforward situations urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616. This is a technical document and not intended to be easy to read. This HOWTO aims to illustrate using urllib2, with enough detail about HTTP to help you through. It is not intended to replace the `urllib2` docs, but is supplementary to them.
Fetching URLs¶
The simplest way to use urllib2 is as follows:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
Many uses of urllib2 will be that simple (note that instead of an ‘http:’ URL we could have used a URL starting with ‘ftp:’, ‘file:’, etc.). However, it’s the purpose of this tutorial to explain the more complicated cases, concentrating on HTTP.
HTTP is based on requests and responses - the client makes requests and servers send responses. urllib2 mirrors this with a `Request` object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling `urlopen` with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can for example call `.read()` on the response:
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
Note that urllib2 makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:
req = urllib2.Request('ftp://example.com/')
In the case of HTTP, there are two extra things that Request objects allow you to do: First, you can pass data to be sent to the server. Second, you can pass extra information (“metadata”) about the data or about the request itself, to the server - this information is sent as HTTP “headers”. Let’s look at each of these in turn.
Data¶
Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script [1] or other web application). With HTTP, this is often done using what’s known as a POST request. This is often what your browser does when you submit an HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the `data` argument. The encoding is done using a function from the `urllib` library, not from `urllib2`.
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
Note that other encodings are sometimes required (e.g. for file upload from HTML forms - see HTML Specification, Form Submission for more details).
If you do not pass the `data` argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have “side-effects”: they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects, and GET requests never to cause side-effects, nothing prevents a GET request from having side-effects, nor a POST request from having no side-effects. Data can also be passed in an HTTP GET request by encoding it in the URL itself.
This is done as follows:
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
Notice that the full URL is created by adding a `?` to the URL, followed by the encoded values.
Headers¶
We’ll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.
Some websites [2] dislike being browsed by programs, or send different versions to different browsers [3]. By default urllib2 identifies itself as `Python-urllib/x.y` (where `x` and `y` are the major and minor version numbers of the Python release, e.g. `Python-urllib/2.5`), which may confuse the site, or just plain not work. The way a browser identifies itself is through the `User-Agent` header [4]. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [5].
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
The response also has two useful methods. See the section on info and geturl which comes after we have a look at what happens when things go wrong.
Handling Exceptions¶
urlopen raises `URLError` when it cannot handle a response (though as usual with Python APIs, built-in exceptions such as `ValueError`, `TypeError` etc. may also be raised).

`HTTPError` is the subclass of `URLError` raised in the specific case of HTTP URLs.
URLError¶
Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn’t exist. In this case, the exception raised will have a ‘reason’ attribute, which is a tuple containing an error code and a text error message.
e.g.
>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try:
...     urllib2.urlopen(req)
... except URLError, e:
...     print e.reason
...
(4, 'getaddrinfo failed')
HTTPError¶
Every HTTP response from the server contains a numeric “status code”. Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a “redirection” that requests the client fetch the document from a different URL, urllib2 will handle that for you). For those it can’t handle, urlopen will raise an `HTTPError`. Typical errors include ‘404’ (page not found), ‘403’ (request forbidden), and ‘401’ (authentication required).
See section 10 of RFC 2616 for a reference on all the HTTP error codes.
The `HTTPError` instance raised will have an integer ‘code’ attribute, which corresponds to the error sent by the server.
Error Codes¶
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
`BaseHTTPServer.BaseHTTPRequestHandler.responses` is a useful dictionary of response codes that shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience:
# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),

    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),

    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),

    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),

    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
    }
When an error is raised the server responds by returning an HTTP error code and an error page. You can use the `HTTPError` instance as a response on the page returned. This means that as well as the code attribute, it also has the `read()`, `geturl()`, and `info()` methods.
>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
...     urllib2.urlopen(req)
... except HTTPError, e:
...     print e.code
...     print e.read()
...
404
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<?xml-stylesheet href="./css/ht2html.css"
type="text/css"?>
<html><head><title>Error 404: File Not Found</title>
...... etc...
Wrapping it Up¶
So if you want to be prepared for `HTTPError` or `URLError` there are two basic approaches. I prefer the second approach.
Number 1¶
from urllib2 import Request, urlopen, URLError, HTTPError

req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
Note

The `except HTTPError` must come first, otherwise `except URLError` will also catch an `HTTPError`.
Number 2¶
from urllib2 import Request, urlopen, URLError

req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
info and geturl¶
The response returned by urlopen (or the `HTTPError` instance) has two useful methods, `info()` and `geturl()`.

geturl - this returns the real URL of the page fetched. This is useful because `urlopen` (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.

info - this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an `httplib.HTTPMessage` instance.
Typical headers include ‘Content-length’, ‘Content-type’, and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.
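For example, a quick interactive session (a sketch; the header values shown are illustrative and depend on the server’s response):

>>> import urllib2
>>> response = urllib2.urlopen('http://www.python.org/')
>>> response.geturl()
'http://www.python.org/'
>>> response.info().getheader('Content-Type')
'text/html'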
Openers and Handlers¶
When you fetch a URL you use an opener (an instance of the perhaps confusingly-named `urllib2.OpenerDirector`). Normally we have been using the default opener - via `urlopen` - but you can create custom openers. Openers use handlers. All the “heavy lifting” is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.
You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.
To create an opener, instantiate an `OpenerDirector`, and then call `.add_handler(some_handler_instance)` repeatedly.
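A minimal sketch of this manual approach (an opener built this way has only the handlers you add, so errors and redirects are not processed unless you add those handlers too):

import urllib2

opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())
opener.add_handler(urllib2.HTTPRedirectHandler())

# The opener can now fetch 'http:' URLs.
response = opener.open('http://www.python.org/')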
Alternatively, you can use `build_opener`, which is a convenience function for creating opener objects with a single function call. `build_opener` adds several handlers by default, but provides a quick way to add more and/or override the default handlers.
Other handlers you might want can handle proxies, authentication, and other common but slightly specialised situations.
`install_opener` can be used to make an opener object the (global) default opener. This means that calls to `urlopen` will use the opener you have installed.
Opener objects have an `open` method, which can be called directly to fetch URLs in the same way as the `urlopen` function: there’s no need to call `install_opener`, except as a convenience.
Basic Authentication¶
To illustrate creating and installing a handler we will use the `HTTPBasicAuthHandler`. For a more detailed discussion of this subject - including an explanation of how Basic Authentication works - see the Basic Authentication Tutorial.

When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a ‘realm’. The header looks like: `WWW-Authenticate: SCHEME realm="REALM"`.
e.g.
WWW-Authenticate: Basic realm="cPanel Users"
The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is ‘basic authentication’. In order to simplify this process we can create an instance of `HTTPBasicAuthHandler` and an opener to use this handler.

The `HTTPBasicAuthHandler` uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use an `HTTPPasswordMgr`. Frequently one doesn’t care what the realm is. In that case, it is convenient to use `HTTPPasswordMgrWithDefaultRealm`. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing `None` as the realm argument to the `add_password` method.
The top-level URL is the first URL that requires authentication. URLs “deeper” than the URL you pass to .add_password() will also match.
# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)

# use the opener to fetch a URL
opener.open(a_url)

# Install the opener.
# Now all calls to urllib2.urlopen use our opener.
urllib2.install_opener(opener)
Note
In the above example we only supplied our `HTTPBasicAuthHandler` to `build_opener`. By default openers have the handlers for normal situations - `ProxyHandler`, `UnknownHandler`, `HTTPHandler`, `HTTPDefaultErrorHandler`, `HTTPRedirectHandler`, `FTPHandler`, `FileHandler`, `HTTPErrorProcessor`.
`top_level_url` is in fact either a full URL (including the ‘http:’ scheme component and the hostname and optionally the port number) e.g. “http://example.com/” or an “authority” (i.e. the hostname, optionally including the port number) e.g. “example.com” or “example.com:8080” (the latter example includes a port number). The authority, if present, must NOT contain the “userinfo” component - for example “joe:password@example.com” is not correct.
Proxies¶
urllib2 will auto-detect your proxy settings and use those. This is through the `ProxyHandler`, which is part of the normal handler chain. Normally that’s a good thing, but there are occasions when it may not be helpful [6]. One way to do this is to set up our own `ProxyHandler`, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler:
>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
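Conversely, passing a non-empty dictionary forces a specific proxy to be used (a sketch; the proxy address here is made up):

>>> proxy_support = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128'})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)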
Note
Currently `urllib2` does not support fetching of `https` locations through a proxy. However, this can be enabled by extending urllib2 as shown in the recipe [7].
Sockets and Layers¶
The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.
As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set the default timeout globally for all sockets using:
import socket
import urllib2
# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)
# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
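In Python 2.6 and later, `urllib2.urlopen` also accepts a per-call `timeout` argument, which avoids changing the global socket default:

# Python 2.6+ only: time out just this request after 10 seconds
response = urllib2.urlopen('http://www.voidspace.org.uk', timeout=10)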
Footnotes¶
This document was reviewed and revised by John Lee.
[1] | For an introduction to the CGI protocol see Writing Web Applications in Python. |
[2] | Like Google for example. The proper way to use google from a program is to use PyGoogle of course. See Voidspace Google for some examples of using the Google API. |
[3] | Browser sniffing is a very bad practice for website design - building sites using web standards is much more sensible. Unfortunately a lot of sites still send different versions to different browsers. |
[4] | The user agent for MSIE 6 is ‘Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)’ |
[5] | For details of more HTTP request headers, see Quick Reference to HTTP Headers. |
[6] | In my case I have to use a proxy to access the internet at work. If you attempt to fetch localhost URLs through this proxy it blocks them. IE is set to use the proxy, which urllib2 picks up on. In order to test scripts with a localhost server, I have to prevent urllib2 from using the proxy. |
[7] | urllib2 opener for SSL proxy (CONNECT method): ASPN Cookbook Recipe. |
HOWTO Use Python in the web¶
Author: | Marek Kubica |
---|
Abstract
This document shows how Python fits into the web. It presents some ways to integrate Python with a web server, and general practices useful for developing web sites.
Programming for the Web has become a hot topic since the rise of “Web 2.0”, which focuses on user-generated content on web sites. It has always been possible to use Python for creating web sites, but it was a rather tedious task. Therefore, many frameworks and helper tools have been created to assist developers in creating faster and more robust sites. This HOWTO describes some of the methods used to combine Python with a web server to create dynamic content. It is not meant as a complete introduction, as this topic is far too broad to be covered in one single document. However, a short overview of the most popular libraries is provided.
See also
While this HOWTO tries to give an overview of Python in the web, it cannot always be as up to date as desired. Web development in Python is rapidly moving forward, so the wiki page on Web Programming may be more in sync with recent development.
The Low-Level View¶
When a user enters a web site, their browser makes a connection to the site’s web server (this is called the request). The server looks up the file in the file system and sends it back to the user’s browser, which displays it (this is the response). This is roughly how the underlying protocol, HTTP, works.
Dynamic web sites are not based on files in the file system, but rather on programs which are run by the web server when a request comes in, and which generate the content that is returned to the user. They can do all sorts of useful things, like display the postings of a bulletin board, show your email, configure software, or just display the current time. These programs can be written in any programming language the server supports. Since most servers support Python, it is easy to use Python to create dynamic web sites.
Most HTTP servers are written in C or C++, so they cannot execute Python code directly – a bridge is needed between the server and the program. These bridges, or rather interfaces, define how programs interact with the server. There have been numerous attempts to create the best possible interface, but there are only a few worth mentioning.
Not every web server supports every interface. Many web servers only support old, now-obsolete interfaces; however, they can often be extended using third-party modules to support newer ones.
Common Gateway Interface¶
This interface, most commonly referred to as “CGI”, is the oldest, and is supported by nearly every web server out of the box. Programs using CGI to communicate with their web server need to be started by the server for every request. So, every request starts a new Python interpreter – which takes some time to start up – thus making the whole interface only usable for low load situations.
The upside of CGI is that it is simple – writing a Python program which uses CGI is a matter of about three lines of code. This simplicity comes at a price: it does very few things to help the developer.
Writing CGI programs, while still possible, is no longer recommended. With WSGI, a topic covered later in this document, it is possible to write programs that emulate CGI, so they can be run as CGI if no better option is available.
See also
The Python standard library includes some modules that are helpful for creating plain CGI programs:
- `cgi` – Handling of user input in CGI scripts
- `cgitb` – Displays nice tracebacks when errors happen in CGI applications, instead of presenting a “500 Internal Server Error” message
The Python wiki features a page on CGI scripts with some additional information about CGI in Python.
Simple script for testing CGI¶
To test whether your web server works with CGI, you can use this short and simple CGI program:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# enable debugging
import cgitb
cgitb.enable()
print "Content-Type: text/plain;charset=utf-8"
print
print "Hello World!"
Depending on your web server configuration, you may need to save this code with a `.py` or `.cgi` extension. Additionally, this file may also need to be in a `cgi-bin` folder, for security reasons.
You might wonder what the `cgitb` line is about. This line makes it possible to display a nice traceback instead of just crashing and displaying an “Internal Server Error” in the user’s browser. This is useful for debugging, but it might risk exposing some confidential data to the user. You should not use `cgitb` in production code for this reason. You should always catch exceptions, and display proper error pages - end-users don’t like to see nondescript “Internal Server Errors” in their browsers.
Setting up CGI on your own server¶
If you don’t have your own web server, this does not apply to you. You can check whether it works as-is, and if not you will need to talk to the administrator of your web server. If it is a big host, you can try filing a ticket asking for Python support.
If you are your own administrator or want to set up CGI for testing purposes on your own computers, you have to configure it by yourself. There is no single way to configure CGI, as there are many web servers with different configuration options. Currently the most widely used free web server is Apache HTTPd, or Apache for short. Apache can be easily installed on nearly every system using the system’s package management tool. lighttpd is another alternative and is said to have better performance. On many systems this server can also be installed using the package management tool, so manually compiling the web server may not be needed.
- On Apache you can take a look at the Dynamic Content with CGI tutorial, where everything is described. Most of the time it is enough just to set `+ExecCGI`. The tutorial also describes the most common gotchas that might arise.
- On lighttpd you need to use the CGI module, which can be configured in a straightforward way. It boils down to setting `cgi.assign` properly.
Common problems with CGI scripts¶
Using CGI sometimes leads to small annoyances while trying to get these scripts to run. Sometimes a seemingly correct script does not work as expected, the cause being some small hidden problem that’s difficult to spot.
Some of these potential problems are:
- The Python script is not marked as executable. When CGI scripts are not executable most web servers will let the user download it, instead of running it and sending the output to the user. For CGI scripts to run properly on Unix-like operating systems, the `+x` bit needs to be set. Using `chmod a+x your_script.py` may solve this problem.
- On a Unix-like system, the line endings in the program file must be Unix style line endings. This is important because the web server checks the first line of the script (called shebang) and tries to run the program specified there. It gets easily confused by Windows line endings (Carriage Return & Line Feed, also called CRLF), so you have to convert the file to Unix line endings (only Line Feed, LF). This can be done automatically by uploading the file via FTP in text mode instead of binary mode, but the preferred way is just telling your editor to save the files with Unix line endings. Most editors support this.
- Your web server must be able to read the file, and you need to make sure the permissions are correct. On Unix-like systems, the server often runs as user and group `www-data`, so it might be worth a try to change the file ownership, or making the file world readable by using `chmod a+r your_script.py`.
- The web server must know that the file you’re trying to access is a CGI script. Check the configuration of your web server, as it may be configured to expect a specific file extension for CGI scripts.
- On Unix-like systems, the path to the interpreter in the shebang (`#!/usr/bin/env python`) must be correct. This line calls `/usr/bin/env` to find Python, but it will fail if there is no `/usr/bin/env`, or if Python is not in the web server’s path. If you know where your Python is installed, you can also use that full path. The commands `whereis python` and `type -p python` could help you find where it is installed. Once you know the path, you can change the shebang accordingly: `#!/usr/bin/python`.
- The file must not contain a BOM (Byte Order Mark). The BOM is meant for determining the byte order of UTF-16 and UTF-32 encodings, but some editors write this also into UTF-8 files. The BOM interferes with the shebang line, so be sure to tell your editor not to write the BOM.
- If the web server is using mod_python, `mod_python` may be having problems. `mod_python` is able to handle CGI scripts by itself, but it can also be a source of issues.
mod_python¶
People coming from PHP often find it hard to grasp how to use Python in the web. Their first thought is mostly mod_python, because they think that this is the equivalent to `mod_php`. Actually, there are many differences. What `mod_python` does is embed the interpreter into the Apache process, thus speeding up requests by not having to start a Python interpreter for each request. On the other hand, it is not “Python intermixed with HTML” in the way that PHP is often intermixed with HTML. The Python equivalent of that is a template engine. `mod_python` itself is much more powerful and provides more access to Apache internals. It can emulate CGI, work in a “Python Server Pages” mode (similar to JSP) which is “HTML intermingled with Python”, and it has a “Publisher” which designates one file to accept all requests and decide what to do with them.
`mod_python` does have some problems. Unlike the PHP interpreter, the Python interpreter uses caching when executing files, so changes to a file will require the web server to be restarted. Another problem is the basic concept - Apache starts child processes to handle the requests, and unfortunately every child process needs to load the whole Python interpreter even if it does not use it. This makes the whole web server slower. Another problem is that, because `mod_python` is linked against a specific version of `libpython`, it is not possible to switch from an older version to a newer (e.g. 2.4 to 2.5) without recompiling `mod_python`. `mod_python` is also bound to the Apache web server, so programs written for `mod_python` cannot easily run on other web servers.
These are the reasons why `mod_python` should be avoided when writing new programs. In some circumstances it still might be a good idea to use `mod_python` for deployment, but WSGI makes it possible to run WSGI programs under `mod_python` as well.
FastCGI and SCGI¶
FastCGI and SCGI try to solve the performance problem of CGI in another way. Instead of embedding the interpreter into the web server, they create long-running background processes. There is still a module in the web server which makes it possible for the web server to “speak” with the background process. As the background process is independent of the server, it can be written in any language, including Python. The language just needs to have a library which handles the communication with the webserver.
The difference between FastCGI and SCGI is very small, as SCGI is essentially just a “simpler FastCGI”. As the web server support for SCGI is limited, most people use FastCGI instead, which works the same way. Almost everything that applies to SCGI also applies to FastCGI as well, so we’ll only cover the latter.
These days, FastCGI is never used directly. Just like `mod_python`, it is only used for the deployment of WSGI applications.
See also
- FastCGI, SCGI, and Apache: Background and Future is a discussion on why the concept of FastCGI and SCGI is better than that of mod_python.
Setting up FastCGI¶
Each web server requires a specific module.
- Apache has both mod_fastcgi and mod_fcgid. `mod_fastcgi` is the original one, but it has some licensing issues, which is why it is sometimes considered non-free. `mod_fcgid` is a smaller, compatible alternative. One of these modules needs to be loaded by Apache.
- lighttpd ships its own FastCGI module as well as an SCGI module.
- nginx also supports FastCGI.
Once you have installed and configured the module, you can test it with the following WSGI-application:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from cgi import escape
import sys, os
from flup.server.fcgi import WSGIServer

def app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/html')])

    yield '<h1>FastCGI Environment</h1>'
    yield '<table>'
    for k, v in sorted(environ.items()):
        yield '<tr><th>%s</th><td>%s</td></tr>' % (escape(k), escape(v))
    yield '</table>'

WSGIServer(app).run()
This is a simple WSGI application, but you need to install flup first, as flup handles the low level FastCGI access.
See also
There is some documentation on setting up Django with FastCGI, most of which can be reused for other WSGI-compliant frameworks and libraries. Only the `manage.py` part has to be changed, the example used here can be used instead. Django does more or less the exact same thing.
mod_wsgi¶
mod_wsgi is an attempt to get rid of the low level gateways. Given that FastCGI, SCGI, and mod_python are mostly used to deploy WSGI applications, mod_wsgi was started to directly embed WSGI applications into the Apache web server. mod_wsgi is specifically designed to host WSGI applications. It makes the deployment of WSGI applications much easier than deployment using other low level methods, which need glue code. The downside is that mod_wsgi is limited to the Apache web server; other servers would need their own implementations of mod_wsgi.
mod_wsgi supports two modes: embedded mode, in which it integrates with the Apache process, and daemon mode, which is more FastCGI-like. Unlike FastCGI, mod_wsgi handles the worker-processes by itself, which makes administration easier.
Step back: WSGI¶
WSGI has already been mentioned several times, so it has to be something important. In fact it really is, and now it is time to explain it.
The Web Server Gateway Interface, or WSGI for short, is defined in PEP 333 and is currently the best way to do Python web programming. While it is great for programmers writing frameworks, a normal web developer does not need to get in direct contact with it. When choosing a framework for web development it is a good idea to choose one which supports WSGI.
The big benefit of WSGI is the unification of the application programming interface. When your program is compatible with WSGI – which at the outer level means that the framework you are using has support for WSGI – your program can be deployed via any web server interface for which there are WSGI wrappers. You do not need to care about whether the application user uses mod_python or FastCGI or mod_wsgi – with WSGI your application will work on any gateway interface. The Python standard library contains its own WSGI server, wsgiref, which is a small web server that can be used for testing.
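For example, the following sketch serves a trivial application on a local port with wsgiref; the application, host, and port are of course placeholders:

# A minimal sketch: serving a WSGI application for testing with the
# standard library's wsgiref server (not meant for production use).
from wsgiref.simple_server import make_server

def simple_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['Hello from wsgiref!\n']

httpd = make_server('localhost', 8000, simple_app)
httpd.serve_forever()    # then visit http://localhost:8000/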
A really great WSGI feature is middleware. Middleware is a layer around your program which can add various functionality to it. There is quite a bit of middleware already available. For example, instead of writing your own session management (HTTP is a stateless protocol, so to associate multiple HTTP requests with a single user your application must create and manage such state via a session), you can just download middleware which does that, plug it in, and get on with coding the unique parts of your application. The same goes for compression – there is existing middleware which handles compressing your HTML using gzip to save on your server’s bandwidth. Authentication is another problem easily solved using existing middleware.
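To make the idea concrete, here is a minimal middleware sketch; it wraps any WSGI application and adds one extra response header on the way out (the class and header names are invented for the example, not taken from an existing package):

# A minimal middleware sketch: because the result is again a WSGI
# application, layers like this can be stacked freely.
class AddHeader(object):
    def __init__(self, app, name, value):
        self.app = app
        self.header = (name, value)

    def __call__(self, environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Append our header before passing the response on.
            return start_response(status, headers + [self.header], exc_info)
        return self.app(environ, patched_start_response)

# Usage: app = AddHeader(app, 'X-Served-By', 'my-middleware')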
Although WSGI may seem complex, the initial phase of learning can be very rewarding because WSGI and the associated middleware already have solutions to many problems that might arise while developing web sites.
WSGI Servers¶
The code that is used to connect to various low-level gateways like CGI or mod_python is called a WSGI server. One of these servers is flup, which supports FastCGI and SCGI, as well as AJP. Some of these servers are written in Python, as flup is, but there also exist others which are written in C and can be used as drop-in replacements.
There are many servers already available, so a Python web application can be deployed nearly anywhere. This is one big advantage that Python has compared with other web technologies.
See also
A good overview of WSGI-related code can be found in the WSGI homepage, which contains an extensive list of WSGI servers which can be used by any application supporting WSGI.
You might be interested in some WSGI-supporting modules already contained in the standard library, namely:
- wsgiref – some tiny utilities and servers for WSGI
Case study: MoinMoin¶
What does WSGI give the web application developer? Let’s take a look at an application that’s been around for a while, which was written in Python without using WSGI.
One of the most widely used wiki software packages is MoinMoin. It was created in 2000, so it predates WSGI by about three years. Older versions needed separate code to run on CGI, mod_python, FastCGI and standalone.
It now includes support for WSGI. Using WSGI, it is possible to deploy MoinMoin on any WSGI-compliant server, with no additional glue code. Unlike the pre-WSGI versions, this could include WSGI servers that the authors of MoinMoin know nothing about.
Model-View-Controller¶
The term MVC is often encountered in statements such as “framework foo supports MVC”. MVC is more about the overall organization of code, rather than any particular API. Many web frameworks use this model to help the developer bring structure to their program. Bigger web applications can have lots of code, so it is a good idea to have an effective structure right from the beginning. That way, even users of other frameworks (or even other languages, since MVC is not Python-specific) can easily understand the code, given that they are already familiar with the MVC structure.
MVC stands for three components:
- The model. This is the data that will be displayed and modified. In Python frameworks, this component is often represented by the classes used by an object-relational mapper.
- The view. This component’s job is to display the data of the model to the user. Typically this component is implemented via templates.
- The controller. This is the layer between the user and the model. The controller reacts to user actions (like opening some specific URL), tells the model to modify the data if necessary, and tells the view code what to display.
While one might think that MVC is a complex design pattern, in fact it is not. It is used in Python because it has turned out to be useful for creating clean, maintainable web sites.
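To make the pattern concrete, here is a tiny, framework-free sketch of the three components; every name in it is invented for the example:

from string import Template

# Model: the data to be displayed and modified; in a real framework
# this would typically be an ORM-mapped class.
class Article(object):
    def __init__(self, title, body):
        self.title = title
        self.body = body

# View: knows how to display model data; here just a string template.
def render_article(article):
    view = Template("<h1>${title}</h1><p>${body}</p>")
    return view.substitute(title=article.title, body=article.body)

# Controller: reacts to a user action, consults the model, and tells
# the view what to display.
def show_article(article_id, database):
    article = database[article_id]
    return render_article(article)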
Note
While not all Python frameworks explicitly support MVC, it is often trivial to create a web site which uses the MVC pattern by separating the data logic (the model) from the user interaction logic (the controller) and the templates (the view). That’s why it is important not to write unnecessary Python code in the templates – it works against the MVC model and creates chaos in the code base, making it harder to understand and modify.
See also
The English Wikipedia has an article about the Model-View-Controller pattern. It includes a long list of web frameworks for various programming languages.
Ingredients for Websites¶
Websites are complex constructs, so tools have been created to help web developers make their code easier to write and more maintainable. Tools like these exist for all web frameworks in all languages. Developers are not forced to use these tools, and often there is no “best” tool. It is worth learning about the available tools because they can greatly simplify the process of developing a web site.
See also
There are far more components than can be presented here. The Python wiki has a page about these components, called Web Components.
Templates¶
Mixing of HTML and Python code is made possible by a few libraries. While convenient at first, it leads to horribly unmaintainable code. That’s why templates exist. Templates are, in the simplest case, just HTML files with placeholders. The HTML is sent to the user’s browser after filling in the placeholders.
Python already includes two ways to build simple templates:
>>> template = "<html><body><h1>Hello %s!</h1></body></html>"
>>> print template % "Reader"
<html><body><h1>Hello Reader!</h1></body></html>
>>> from string import Template
>>> template = Template("<html><body><h1>Hello ${name}!</h1></body></html>")
>>> print template.substitute(dict(name='Dinsdale'))
<html><body><h1>Hello Dinsdale!</h1></body></html>
To generate complex HTML based on non-trivial model data, conditional and looping constructs like Python’s for and if are generally needed. Template engines support templates of this complexity.
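As a taste of such constructs, here is a short sketch using the Jinja engine, one of the engines listed below; this assumes the third-party jinja2 package is installed:

# A loop in a Jinja template renders one list item per entry
# (requires the third-party jinja2 package).
from jinja2 import Template

template = Template(
    "<ul>{% for user in users %}<li>{{ user }}</li>{% endfor %}</ul>")
print template.render(users=['Eric', 'Graham', 'Terry'])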
There are a lot of template engines available for Python which can be used with or without a framework. Some of these define a plain-text programming language which is easy to learn, partly because it is limited in scope. Others use XML, and the template output is guaranteed to always be valid XML. There are many other variations.
Some frameworks ship their own template engine or recommend one in particular. In the absence of a reason to use a different template engine, using the one provided by or recommended by the framework is a good idea.
Popular template engines include:
- Mako
- Genshi
- Jinja
See also
There are many template engines competing for attention, because it is pretty easy to create them in Python. The page Templating in the wiki lists a big, ever-growing number of these. The three listed above are considered “second generation” template engines and are a good place to start.
Data persistence¶
Data persistence, while sounding very complicated, is just about storing data. This data might be the text of blog entries, the postings on a bulletin board or the text of a wiki page. There are, of course, a number of different ways to store information on a web server.
Often, relational database engines like MySQL or PostgreSQL are used because of their good performance when handling very large databases consisting of millions of entries. There is also a small database engine called SQLite, which is bundled with Python in the sqlite3 module, and which uses only one file. It has no other dependencies. For smaller sites SQLite is just enough.
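For instance, a small guestbook could store its entries using nothing but the standard library; the file name and table layout here are made up for the sketch:

# A minimal sketch using the bundled sqlite3 module; the whole
# database lives in the single file "guestbook.db".
import sqlite3

conn = sqlite3.connect('guestbook.db')
conn.execute('CREATE TABLE IF NOT EXISTS entries (author TEXT, text TEXT)')
conn.execute('INSERT INTO entries VALUES (?, ?)', ('Dinsdale', 'Hello!'))
conn.commit()

for author, text in conn.execute('SELECT author, text FROM entries'):
    print '%s wrote: %s' % (author, text)
conn.close()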
Relational databases are queried using a language called SQL. Python programmers in general do not like SQL too much, as they prefer to work with objects. It is possible to save Python objects into a database using a technology called ORM (Object Relational Mapping). ORM translates all object-oriented access into SQL code under the hood, so the developer does not need to think about it. Most frameworks use ORMs, and it works quite well.
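As a rough sketch of what ORM code looks like in practice, here is how objects might be saved with SQLAlchemy, one of the mappers listed at the end of this section; the class and column names are invented for the example:

# A rough ORM sketch using SQLAlchemy's declarative extension; the
# class Entry is mapped to the table "entries" behind the scenes.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Entry(Base):
    __tablename__ = 'entries'
    id = Column(Integer, primary_key=True)
    title = Column(String(100))

engine = create_engine('sqlite:///blog.db')   # SQLite as the backend
Base.metadata.create_all(engine)

Session = sessionmaker(bind=engine)
session = Session()
session.add(Entry(title='Hello world'))   # plain object manipulation...
session.commit()                          # ...the SQL happens under the hood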
A second possibility is storing data in normal, plain text files (sometimes called “flat files”). This is very easy for simple sites, but can be difficult to get right if the web site is performing many updates to the stored data.
A third possibility is object oriented databases (also called “object databases”). These databases store the object data in a form that closely parallels the way the objects are structured in memory during program execution. (By contrast, ORMs store the object data as rows of data in tables and relations between those rows.) Storing the objects directly has the advantage that nearly all objects can be saved in a straightforward way, unlike in relational databases where some objects are very hard to represent.
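For a flavor of this approach, here is a rough sketch using ZODB, one of the object databases listed below; note that no table definitions are needed (file name and keys are invented for the example):

# A rough sketch using the ZODB object database (third-party
# packages "ZODB3"/"transaction"); objects reachable from the root
# mapping are stored as-is, without any table definitions.
from ZODB import FileStorage, DB
import transaction

db = DB(FileStorage.FileStorage('data.fs'))   # whole database in one file
connection = db.open()
root = connection.root()

root['articles'] = ['first post', 'second post']
transaction.commit()       # persist everything reachable from the root
db.close()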
Frameworks often give hints on which data storage method to choose. It is usually a good idea to stick to the data store recommended by the framework unless the application has special requirements better satisfied by an alternate storage mechanism.
See also
- Persistence Tools lists ways to save data in the file system. Some of these modules are part of the standard library.
- Database Programming helps with choosing a method for saving data
- SQLAlchemy, the most powerful OR-Mapper for Python, and Elixir, which makes SQLAlchemy easier to use
- SQLObject, another popular OR-Mapper
- ZODB and Durus, two object oriented databases
Frameworks¶
The process of creating code to run web sites involves writing code to provide various services. The code to provide a particular service often works the same way regardless of the complexity or purpose of the web site in question. Abstracting these common solutions into reusable code produces what are called “frameworks” for web development. Perhaps the most well-known framework for web development is Ruby on Rails, but Python has its own frameworks. Some of these were partly inspired by Rails, or borrowed ideas from Rails, but many existed a long time before Rails.
Originally Python web frameworks tended to incorporate all of the services needed to develop web sites as a giant, integrated set of tools. No two web frameworks were interoperable: a program developed for one could not be deployed on a different one without considerable re-engineering work. This led to the development of “minimalist” web frameworks that provided just the tools to communicate between the Python code and the HTTP protocol, with all other services to be added on top via separate components. Some ad hoc standards were developed that allowed for limited interoperability between frameworks, such as a standard that allowed different template engines to be used interchangeably.
Since the advent of WSGI, the Python web framework world has been evolving toward interoperability based on the WSGI standard. Now many web frameworks, whether “full stack” (providing all the tools one needs to deploy the most complex web sites) or minimalist, or anything in between, are built from collections of reusable components that can be used with more than one framework.
The majority of users will probably want to select a “full stack” framework that has an active community. These frameworks tend to be well documented, and provide the easiest path to producing a fully functional web site in minimal time.
Some notable frameworks¶
There are an incredible number of frameworks, so they cannot all be covered here. Instead we will briefly touch on some of the most popular.
Django¶
Django is a framework consisting of several tightly coupled elements which were written from scratch and work together very well. It includes an ORM which is quite powerful while being simple to use, and has a great online administration interface which makes it possible to edit the data in the database with a browser. The template engine is text-based and is designed to be usable for page designers who cannot write Python. It supports template inheritance and filters (which work like Unix pipes). Django has many handy features bundled, such as creation of RSS feeds or generic views, which make it possible to create web sites almost without writing any Python code.
It has a big, international community, the members of which have created many web sites. There are also a lot of add-on projects which extend Django’s normal functionality. This is partly due to Django’s well-written online documentation and the Django book.
Note
Although Django is an MVC-style framework, it names the elements differently, which is described in the Django FAQ.
TurboGears¶
Another popular web framework for Python is TurboGears. TurboGears takes the approach of using already existing components and combining them with glue code to create a seamless experience. TurboGears gives the user flexibility in choosing components. For example, the ORM and template engine can be changed to use packages different from those used by default.
The documentation can be found in the TurboGears wiki, which also links to screencasts. TurboGears also has an active user community which can respond to most related questions. There is also a published TurboGears book, which is a good starting point.
The newest version of TurboGears, version 2.0, moves even further in the direction of WSGI support and a component-based architecture. TurboGears 2 is based on the WSGI stack of another popular component-based web framework, Pylons.
Zope¶
The Zope framework is one of the “old original” frameworks. Its current incarnation, Zope 2, is a tightly integrated full-stack framework. One of its most interesting features is its tight integration with a powerful object database called the ZODB (Zope Object Database). Because of its highly integrated nature, Zope wound up in a somewhat isolated ecosystem: code written for Zope wasn’t very usable outside of Zope, and vice-versa. To solve this problem the Zope 3 effort was started. Zope 3 re-engineers Zope as a set of more cleanly isolated components. This effort was started before the advent of the WSGI standard, but there is WSGI support for Zope 3 from the Repoze project. Zope components have many years of production use behind them, and the Zope 3 project gives access to these components to the wider Python community. There is even a separate framework based on the Zope components: Grok.
Zope is also the infrastructure used by the Plone content management system, one of the most powerful and popular content management systems available.
Other notable frameworks¶
Of course these are not the only frameworks that are available. There are many other frameworks worth mentioning.
Another framework that’s already been mentioned is Pylons. Pylons is much like TurboGears, but with an even stronger emphasis on flexibility, which comes at the cost of being more difficult to use. Nearly every component can be exchanged, which makes it necessary to use the documentation of every single component, of which there are many. Pylons builds upon Paste, an extensive set of tools which are handy for WSGI.
And that’s still not everything. The most up-to-date information can always be found in the Python wiki.
See also
The Python wiki contains an extensive list of web frameworks.
Most frameworks also have their own mailing lists and IRC channels; look out for these on the projects’ web sites. There is also a general “Python in the Web” IRC channel on freenode called #python.web.