Get the code at https://github.com/sontek/python-feedparser
This documentation claims to describe the behavior of Universal Feed Parser 4.2. It does not claim to describe the behavior of any other version.
This documentation lives at http://readthedocs.org/docs/universal-feedparser/en/latest/. If you’re reading it somewhere else, you may not have the latest version.
This documentation is provided by the author as is without any express or implied warranties. See license for more details.
Introduction
Universal Feed Parser is a Python module for downloading and parsing syndicated feeds. It can handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3, Atom 1.0, and CDF feeds. It also parses several popular extension modules, including Dublin Core and Apple’s iTunes extensions.
To use Universal Feed Parser, you will need Python 2.4 or later (Python 3 is supported). Universal Feed Parser is not meant to run standalone; it is a module for you to use as part of a larger Python program.
Universal Feed Parser is easy to use; the module is self-contained in a single file, feedparser.py, and it has one primary public function, parse. parse takes a number of arguments, but only one is required, and it can be a URL, a local filename, or a raw string containing feed data in any format.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
>>> d['feed']['title']
u'Sample Feed'
The following example assumes you are on Windows, and that you have saved a feed at c:\incoming\atom10.xml.
Note
Universal Feed Parser works on any platform that can run Python; use the path syntax appropriate for your platform.
>>> import feedparser
>>> d = feedparser.parse(r'c:\incoming\atom10.xml')
>>> d['feed']['title']
u'Sample Feed'
Universal Feed Parser can also parse a feed in memory.
>>> import feedparser
>>> rawdata = """<rss version="2.0">
<channel>
<title>Sample Feed</title>
</channel>
</rss>"""
>>> d = feedparser.parse(rawdata)
>>> d['feed']['title']
u'Sample Feed'
Values are returned as Python Unicode strings (except when they’re not – see advanced.encoding for all the gory details).
Common RSS Elements
The most commonly used elements in RSS feeds (regardless of version) are title, link, description, modified date, and entry ID. The modified date comes from the pubDate element, and the entry ID comes from the guid element.
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Sample Feed</title>
<description>For documentation &lt;em&gt;only&lt;/em&gt;</description>
<link>http://example.org/</link>
<pubDate>Sat, 07 Sep 2002 0:00:01 GMT</pubDate>
<!-- other elements omitted from this example -->
<item>
<title>First item title</title>
<link>http://example.org/item/1</link>
<description>Watch out for &lt;span style="background:
url(javascript:window.location='http://example.org/')"&gt;nasty
tricks&lt;/span&gt;</description>
<pubDate>Thu, 05 Sep 2002 0:00:01 GMT</pubDate>
<guid>http://example.org/guid/1</guid>
<!-- other elements omitted from this example -->
</item>
</channel>
</rss>
The channel elements are available in d.feed.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
>>> d.feed.title
u'Sample Feed'
>>> d.feed.link
u'http://example.org/'
>>> d.feed.description
u'For documentation <em>only</em>'
>>> d.feed.date
u'Sat, 07 Sep 2002 0:00:01 GMT'
>>> d.feed.date_parsed
(2002, 9, 7, 0, 0, 1, 5, 250, 0)
The items are available in d.entries, which is a list. You access items in the list in the same order in which they appear in the original feed, so the first item is available in d.entries[0].
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
>>> d.entries[0].title
u'First item title'
>>> d.entries[0].link
u'http://example.org/item/1'
>>> d.entries[0].description
u'Watch out for <span>nasty tricks</span>'
>>> d.entries[0].date
u'Thu, 05 Sep 2002 0:00:01 GMT'
>>> d.entries[0].date_parsed
(2002, 9, 5, 0, 0, 1, 3, 248, 0)
>>> d.entries[0].id
u'http://example.org/guid/1'
Tip
You can also access data from RSS feeds using Atom terminology. See advanced.normalization for details.
Common Atom Elements
Atom feeds generally contain more information than RSS feeds (because more elements are required), but the most commonly used elements are still title, link, subtitle/description, various dates, and ID.
This sample Atom feed is at http://feedparser.org/docs/examples/atom10.xml.
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
xml:base="http://example.org/"
xml:lang="en">
<title type="text">Sample Feed</title>
<subtitle type="html">
For documentation &lt;em&gt;only&lt;/em&gt;
</subtitle>
<link rel="alternate" href="/"/>
<link rel="self"
type="application/atom+xml"
href="http://www.example.org/atom10.xml"/>
<rights type="html">
&lt;p&gt;Copyright 2005, Mark Pilgrim&lt;/p&gt;
</rights>
<id>tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml</id>
<generator
uri="http://example.org/generator/"
version="4.0">
Sample Toolkit
</generator>
<updated>2005-11-09T11:56:34Z</updated>
<entry>
<title>First entry title</title>
<link rel="alternate"
href="/entry/3"/>
<link rel="related"
type="text/html"
href="http://search.example.com/"/>
<link rel="via"
type="text/html"
href="http://toby.example.com/examples/atom10"/>
<link rel="enclosure"
type="video/mpeg4"
href="http://www.example.com/movie.mp4"
length="42301"/>
<id>tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml:3</id>
<published>2005-11-09T00:23:47Z</published>
<updated>2005-11-09T11:56:34Z</updated>
<summary type="text/plain" mode="escaped">Watch out for nasty tricks</summary>
<content type="application/xhtml+xml" mode="xml"
xml:base="http://example.org/entry/3" xml:lang="en-US">
<div xmlns="http://www.w3.org/1999/xhtml">Watch out for
<span style="background: url(javascript:window.location='http://example.org/')">
nasty tricks</span></div>
</content>
</entry>
</feed>
The feed elements are available in d.feed.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
>>> d.feed.title
u'Sample Feed'
>>> d.feed.link
u'http://example.org/'
>>> d.feed.subtitle
u'For documentation <em>only</em>'
>>> d.feed.updated
u'2005-11-09T11:56:34Z'
>>> d.feed.updated_parsed
(2005, 11, 9, 11, 56, 34, 2, 313, 0)
>>> d.feed.id
u'tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml'
Entries are available in d.entries, which is a list. You access entries in the order in which they appear in the original feed, so the first entry is d.entries[0].
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
>>> d.entries[0].title
u'First entry title'
>>> d.entries[0].link
u'http://example.org/entry/3'
>>> d.entries[0].id
u'tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml:3'
>>> d.entries[0].published
u'2005-11-09T00:23:47Z'
>>> d.entries[0].published_parsed
(2005, 11, 9, 0, 23, 47, 2, 313, 0)
>>> d.entries[0].updated
u'2005-11-09T11:56:34Z'
>>> d.entries[0].updated_parsed
(2005, 11, 9, 11, 56, 34, 2, 313, 0)
>>> d.entries[0].summary
u'Watch out for nasty tricks'
>>> d.entries[0].content
[{'type': u'application/xhtml+xml',
'base': u'http://example.org/entry/3',
'language': u'en-US',
'value': u'<div>Watch out for <span>nasty tricks</span></div>'}]
Note
The parsed summary and content are not the same as they appear in the original feed. The original elements contained dangerous HTML markup which was sanitized. See advanced.sanitization for details.
Because Atom entries can have more than one content element, d.entries[0].content is a list of dictionaries. Each dictionary contains metadata about a single content element. The two most important values in the dictionary are the content type, in d.entries[0].content[0].type, and the actual content value, in d.entries[0].content[0].value.
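As a rough sketch of how an application might use these two values, the helper below returns the value of the first content dictionary whose type appears in a preference list. The name pick_content and the preference order are assumptions made for illustration; the sample data mirrors the output shown above rather than coming from a live parse.

```python
def pick_content(contents, preferred_types=("text/html",
                                            "application/xhtml+xml",
                                            "text/plain")):
    """Return the value of the first content dict matching a preferred type.

    Falls back to the first content element, or None if the list is empty.
    """
    for wanted in preferred_types:
        for item in contents:
            if item.get("type") == wanted:
                return item.get("value")
    return contents[0].get("value") if contents else None


# Sample data shaped like d.entries[0].content in the example above.
contents = [{
    "type": "application/xhtml+xml",
    "base": "http://example.org/entry/3",
    "language": "en-US",
    "value": "<div>Watch out for <span>nasty tricks</span></div>",
}]

chosen = pick_content(contents)
```

Walking the preference list first, rather than the content list, means the caller's preferred type wins even when it is not the first content element in the entry.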
You can get this level of detail on other Atom elements too.
Atom Detail
Several Atom elements share the Atom content model: title, subtitle, rights, summary, and of course content. (Atom 0.3 also had an info element which shared this content model.) Universal Feed Parser captures all relevant metadata about these elements, most importantly the content type and the value itself.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
>>> d.feed.title_detail
{'type': u'text/plain',
'base': u'http://example.org/',
'language': u'en',
'value': u'Sample Feed'}
>>> d.feed.subtitle_detail
{'type': u'text/html',
'base': u'http://example.org/',
'language': u'en',
'value': u'For documentation <em>only</em>'}
>>> d.feed.rights_detail
{'type': u'text/html',
'base': u'http://example.org/',
'language': u'en',
'value': u'<p>Copyright 2005, Mark Pilgrim</p>'}
>>> d.entries[0].title_detail
{'type': u'text/plain',
'base': u'http://example.org/',
'language': u'en',
'value': u'First entry title'}
>>> d.entries[0].summary_detail
{'type': u'text/plain',
'base': u'http://example.org/',
'language': u'en',
'value': u'Watch out for nasty tricks'}
>>> len(d.entries[0].content)
1
>>> d.entries[0].content[0]
{'type': u'application/xhtml+xml',
'base': u'http://example.org/entry/3',
'language': u'en-US',
'value': u'<div>Watch out for <span>nasty tricks</span></div>'}
Uncommon RSS Elements
These elements are less common, but are useful for niche applications and may be present in any RSS feed.
An RSS feed can specify a small image which some aggregators display as a logo.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
>>> d.feed.image
{'title': u'Example banner',
'href': u'http://example.org/banner.png',
'width': 80,
'height': 15,
'link': u'http://example.org/'}
Feeds and entries can be assigned to multiple categories, and in some versions of RSS, categories can be associated with a domain. Both are free-form strings. For historical reasons, Universal Feed Parser makes multiple categories available as a list of tuples, rather than a list of dictionaries.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
>>> d.feed.categories
[(u'Syndic8', u'1024'),
(u'dmoz', u'Top/Society/People/Personal_Homepages/P/')]
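Because the categories arrive as (domain, value) tuples, a small amount of code is enough to regroup them. The following is a minimal sketch, assuming the tuple layout shown above; group_categories is a hypothetical helper, not part of Universal Feed Parser.

```python
from collections import defaultdict


def group_categories(categories):
    """Group (domain, value) tuples into a {domain: [values]} dictionary."""
    grouped = defaultdict(list)
    for domain, value in categories:
        grouped[domain].append(value)
    return dict(grouped)


# Sample data shaped like d.feed.categories in the example above.
categories = [
    ("Syndic8", "1024"),
    ("dmoz", "Top/Society/People/Personal_Homepages/P/"),
]

by_domain = group_categories(categories)
```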
Each item in an RSS feed can have an enclosure, a delightful misnomer that is simply a link to an external file (usually a music or video file, but any type of file can be “enclosed”). Once rare, this element has recently gained popularity due to the rise of podcasting. Some clients (such as Apple’s iTunes) may automatically download enclosures; others (such as the web-based Bloglines) may simply render each enclosure as a link.
The RSS specification states that there can be at most one enclosure per item. However, Atom entries may contain more than one enclosure per entry, so Universal Feed Parser captures all of them and makes them available as a list.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
>>> e = d.entries[0]
>>> len(e.enclosures)
1
>>> e.enclosures[0]
{'type': u'audio/mpeg',
'length': u'1069871',
'href': u'http://example.org/audio/demo.mp3'}
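Note that the enclosure length is returned as a string. A podcast client might filter the list by type and convert the length defensively, along the lines of this sketch; enclosure_size is a hypothetical helper, and the sample data copies the output above.

```python
def enclosure_size(enclosure):
    """Return the enclosure length as an int, or None if missing or malformed."""
    try:
        return int(enclosure.get("length", ""))
    except (TypeError, ValueError):
        return None


# Sample data shaped like e.enclosures in the example above.
enclosures = [{
    "type": "audio/mpeg",
    "length": "1069871",
    "href": "http://example.org/audio/demo.mp3",
}]

# Keep only audio enclosures, pairing each URL with its size in bytes.
audio = [
    (enc["href"], enclosure_size(enc))
    for enc in enclosures
    if enc.get("type", "").startswith("audio/")
]
```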
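The cloud element names a web service that clients can register with to be notified when the feed changes. It is rarely used, and its semantics are not widely understood.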
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
>>> d.feed.cloud
{'domain': u'rpc.example.com',
'port': u'80',
'path': u'/RPC2',
'registerprocedure': u'pingMe',
'protocol': u'soap'}
Note
For more examples of accessing RSS elements, see the annotated examples: annotated.rss10, annotated.rss20, and annotated.rss20dc.
Uncommon Atom Elements
These elements are less common, but are useful for niche applications and may be present in any Atom feed.
Besides an author, each Atom feed or entry can have an arbitrary number of contributors. Universal Feed Parser makes these available as a list.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
>>> e = d.entries[0]
>>> len(e.contributors)
2
>>> e.contributors[0]
{'name': u'Joe',
'href': u'http://example.org/joe/',
'email': u'joe@example.org'}
>>> e.contributors[1]
{'name': u'Sam',
'href': u'http://example.org/sam/',
'email': u'sam@example.org'}
Besides an alternate link, each Atom feed or entry can have an arbitrary number of other links. Each link is distinguished by its type attribute, which is a MIME-style content type, and its rel attribute.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
>>> e = d.entries[0]
>>> len(e.links)
4
>>> e.links[0]
{'rel': u'alternate',
'type': u'text/html',
'href': u'http://example.org/entry/3'}
>>> e.links[1]
{'rel': u'related',
'type': u'text/html',
'href': u'http://search.example.com/'}
>>> e.links[2]
{'rel': u'via',
'type': u'text/html',
'href': u'http://toby.example.com/examples/atom10'}
>>> e.links[3]
{'rel': u'enclosure',
'type': u'video/mpeg4',
'href': u'http://www.example.com/movie.mp4',
'length': u'42301'}
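Since the links are plain dictionaries, selecting them by relation is a one-line filter. The sketch below assumes the dictionary layout shown above; links_with_rel is a hypothetical helper name.

```python
def links_with_rel(links, rel):
    """Return every link dictionary whose rel attribute equals rel."""
    return [link for link in links if link.get("rel") == rel]


# Sample data shaped like e.links in the example above.
links = [
    {"rel": "alternate", "type": "text/html",
     "href": "http://example.org/entry/3"},
    {"rel": "related", "type": "text/html",
     "href": "http://search.example.com/"},
    {"rel": "via", "type": "text/html",
     "href": "http://toby.example.com/examples/atom10"},
    {"rel": "enclosure", "type": "video/mpeg4",
     "href": "http://www.example.com/movie.mp4", "length": "42301"},
]

related_hrefs = [link["href"] for link in links_with_rel(links, "related")]
```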
Note
For more examples of accessing Atom elements, see the annotated examples annotated.atom10 and annotated.atom03.
Testing for Existence
Feeds in the real world may be missing elements, even elements that are required by the specification. You should always test for the existence of an element before getting its value. Never assume an element is present.
Use standard Python dictionary operations to test whether an element exists. The in operator and the get method work on both Python 2 and Python 3; has_key is a Python 2 idiom.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
>>> 'title' in d.feed
True
>>> 'ttl' in d.feed
False
>>> d.feed.get('title', 'No title')
u'Sample Feed'
>>> d.feed.get('ttl', 60)
60
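When several elements could serve the same purpose, the existence test can be wrapped in a small helper. This is a minimal sketch using an ordinary dictionary in place of parsed results; first_present is a hypothetical name, not a Universal Feed Parser function.

```python
def first_present(mapping, keys, default=None):
    """Return the first non-empty value found under any of keys."""
    for key in keys:
        if key in mapping and mapping[key]:
            return mapping[key]
    return default


# A plain dictionary standing in for d.feed.
feed = {"title": "Sample Feed", "link": "http://example.org/"}

label = first_present(feed, ("subtitle", "title"), "No title")
ttl = first_present(feed, ("ttl",), 60)
```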
Advanced Features
Date Parsing
Different feed types and versions use wildly different date formats. Universal Feed Parser will attempt to auto-detect the date format used in any date element, and parse it into a standard Python 9-tuple, as documented in the Python time module (http://docs.python.org/lib/module-time.html).
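A 9-tuple is easy to convert into other standard-library date types. The sketch below assumes the tuple is in UTC and uses the updated_parsed value from the Atom sample; the variable names are illustrative.

```python
import calendar
from datetime import datetime, timezone

# A 9-tuple as returned in *_parsed values:
# (year, month, day, hour, minute, second, weekday, yearday, isdst)
parsed = (2005, 11, 9, 11, 56, 34, 2, 313, 0)

# The first six fields map directly onto the datetime constructor.
as_datetime = datetime(*parsed[:6], tzinfo=timezone.utc)

# calendar.timegm interprets the tuple as UTC and returns seconds since the epoch.
epoch_seconds = calendar.timegm(parsed)
```

calendar.timegm is used instead of time.mktime because mktime would interpret the tuple in the local timezone.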
The following elements are parsed as dates:
Here is a brief history of feed date formats:
Here is a representative list of the formats that Universal Feed Parser can recognize in any date element:
Description | Example | Parsed Value |
---|---|---|
valid RFC 822 (2-digit year) | Thu, 01 Jan 04 19:48:21 GMT | (2004, 1, 1, 19, 48, 21, 3, 1, 0) |
valid RFC 822 (4-digit year) | Thu, 01 Jan 2004 19:48:21 GMT | (2004, 1, 1, 19, 48, 21, 3, 1, 0) |
invalid RFC 822 (no time) | 01 Jan 2004 | (2004, 1, 1, 0, 0, 0, 3, 1, 0) |
invalid RFC 822 (no seconds) | 01 Jan 2004 00:00 GMT | (2004, 1, 1, 0, 0, 0, 3, 1, 0) |
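The parsed values in the two valid rows can be cross-checked with the Python standard library. This sketch uses email.utils rather than Universal Feed Parser's own date handling, so it only illustrates the expected 9-tuple; the standard-library parser is stricter and may reject inputs that Universal Feed Parser accepts. The helper name rfc822_to_tuple is an assumption.

```python
from email.utils import parsedate_to_datetime


def rfc822_to_tuple(text):
    """Parse an RFC 822 date string into a UTC 9-tuple."""
    return tuple(parsedate_to_datetime(text).utctimetuple())


four_digit = rfc822_to_tuple("Thu, 01 Jan 2004 19:48:21 GMT")
two_digit = rfc822_to_tuple("Thu, 01 Jan 04 19:48:21 GMT")
```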
Universal Feed Parser recognizes all character-based timezone abbreviations defined in RFC 822. In addition, Universal Feed Parser recognizes the following invalid timezones:
Universal Feed Parser supports many different date formats, but there are probably many more in the wild that are still unsupported. If you find other date formats, you can support them by registering them with registerDateHandler. It takes a single argument, a callback function. The callback function should take a single argument, a string, and return a single value, a 9-tuple Python date in UTC.
Registering a third-party date handler
import feedparser
import re

_my_date_pattern = re.compile(
    r'(\d{,2})/(\d{,2})/(\d{4}) (\d{,2}):(\d{2}):(\d{2})')

def myDateHandler(aDateString):
    """parse a UTC date in MM/DD/YYYY HH:MM:SS format"""
    month, day, year, hour, minute, second = \
        _my_date_pattern.search(aDateString).groups()
    return (int(year), int(month), int(day), \
        int(hour), int(minute), int(second), 0, 0, 0)

feedparser.registerDateHandler(myDateHandler)
d = feedparser.parse(...)
Your newly-registered date handler will be tried before all the other date handlers built into Universal Feed Parser. (More specifically, all date handlers are tried in “last in, first out” order; i.e. the last handler to be registered is the first one tried, and so on in reverse order of registration.)
If your date handler returns None, or anything other than a Python 9-tuple date, or raises an exception of any kind, the error will be silently ignored and the other registered date handlers will be tried in order. If no date handlers succeed, then the date is not parsed, and the *_parsed value will not be present in the results dictionary. The original date string will still be available in the appropriate element in the results dictionary.
Tip
If you write a new date handler, you are encouraged (but not required) to submit a patch so it can be integrated into the next version of Universal Feed Parser.
HTML Sanitization
Most feeds embed HTML markup within feed elements. Some feeds even embed other types of markup, such as SVG or MathML. Since many feed aggregators use a web browser (or browser component) to display content, Universal Feed Parser sanitizes embedded markup to remove things that could pose security risks.
These elements are sanitized by default:
Note
The unit tests for HTML sanitizing show many different examples of dangerous markup that Universal Feed Parser sanitizes by default.
The following HTML elements are allowed by default (all others are stripped): a, abbr, acronym, address, area, article, aside, audio, b, big, blockquote, br, button, canvas, caption, center, cite, code, col, colgroup, command, datagrid, datalist, dd, del, details, dfn, dialog, dir, div, dl, dt, em, event-source, fieldset, figure, footer, font, form, header, h1, h2, h3, h4, h5, h6, hr, i, img, input, ins, keygen, kbd, label, legend, li, m, map, menu, meter, multicol, nav, nextid, noscript, ol, output, optgroup, option, p, pre, progress, q, s, samp, section, select, small, sound, source, spacer, span, strike, strong, sub, sup, table, tbody, td, textarea, time, tfoot, th, thead, tr, tt, u, ul, var, video
The following HTML attributes are allowed by default (all others are stripped): abbr, accept, accept-charset, accesskey, action, align, alt, autoplay, autocomplete, autofocus, axis, background, balance, bgcolor, bgproperties, border, bordercolor, bordercolordark, bordercolorlight, bottompadding, cellpadding, cellspacing, ch, challenge, char, charoff, choff, charset, checked, cite, class, clear, color, cols, colspan, compact, contenteditable, coords, data, datafld, datapagesize, datasrc, datetime, default, delay, dir, disabled, draggable, dynsrc, enctype, end, face, for, form, frame, galleryimg, gutter, headers, height, hidefocus, hidden, high, href, hreflang, hspace, icon, id, inputmode, ismap, keytype, label, leftspacing, lang, list, longdesc, loop, loopcount, loopend, loopstart, low, lowsrc, max, maxlength, media, method, min, multiple, name, nohref, noshade, nowrap, open, optimum, pattern, ping, point-size, prompt, pqg, radiogroup, readonly, rel, repeat-max, repeat-min, replace, required, rev, rightspacing, rows, rowspan, rules, scope, selected, shape, size, span, src, start, step, summary, suppress, tabindex, target, template, title, toppadding, type, unselectable, usemap, urn, valign, value, variable, volume, vspace, vrml, width, wrap, xml:lang
The following SVG elements are allowed by default (all others are stripped): a, animate, animateColor, animateMotion, animateTransform, circle, defs, desc, ellipse, foreignObject, font-face, font-face-name, font-face-src, g, glyph, hkern, linearGradient, line, marker, metadata, missing-glyph, mpath, path, polygon, polyline, radialGradient, rect, set, stop, svg, switch, text, title, tspan, use
The following SVG attributes are allowed by default (all others are stripped): accent-height, accumulate, additive, alphabetic, arabic-form, ascent, attributeName, attributeType, baseProfile, bbox, begin, by, calcMode, cap-height, class, color, color-rendering, content, cx, cy, d, dx, dy, descent, display, dur, end, fill, fill-opacity, fill-rule, font-family, font-size, font-stretch, font-style, font-variant, font-weight, from, fx, fy, g1, g2, glyph-name, gradientUnits, hanging, height, horiz-adv-x, horiz-origin-x, id, ideographic, k, keyPoints, keySplines, keyTimes, lang, mathematical, marker-end, marker-mid, marker-start, markerHeight, markerUnits, markerWidth, max, min, name, offset, opacity, orient, origin, overline-position, overline-thickness, panose-1, path, pathLength, points, preserveAspectRatio, r, refX, refY, repeatCount, repeatDur, requiredExtensions, requiredFeatures, restart, rotate, rx, ry, slope, stemh, stemv, stop-color, stop-opacity, strikethrough-position, strikethrough-thickness, stroke, stroke-dasharray, stroke-dashoffset, stroke-linecap, stroke-linejoin, stroke-miterlimit, stroke-opacity, stroke-width, systemLanguage, target, text-anchor, to, transform, type, u1, u2, underline-position, underline-thickness, unicode, unicode-range, units-per-em, values, version, viewBox, visibility, width, widths, x, x-height, x1, x2, xlink:actuate, xlink:arcrole, xlink:href, xlink:role, xlink:show, xlink:title, xlink:type, xml:base, xml:lang, xml:space, xmlns, xmlns:xlink, y, y1, y2, zoomAndPan
The following MathML elements are allowed by default (all others are stripped): annotation, annotation-xml, maction, math, merror, mfenced, mfrac, mi, mmultiscripts, mn, mo, mover, mpadded, mphantom, mprescripts, mroot, mrow, mspace, msqrt, mstyle, msub, msubsup, msup, mtable, mtd, mtext, mtr, munder, munderover, none, semantics
The following MathML attributes are allowed by default (all others are stripped): actiontype, align, columnalign, close, columnlines, columnspacing, columnspan, depth, display, displaystyle, encoding, equalcolumns, equalrows, fence, fontstyle, fontweight, frame, height, linethickness, lspace, mathbackground, mathcolor, mathvariant, maxsize, minsize, open, other, rowalign, rowlines, rowspacing, rowspan, rspace, scriptlevel, selection, separator, separators, stretchy, width, xlink:href, xlink:show, xlink:type, xmlns, xmlns:xlink
The following CSS properties are allowed by default in style attributes (all others are stripped): azimuth, background-color, border-bottom-color, border-collapse, border-color, border-left-color, border-right-color, border-top-color, clear, color, cursor, direction, display, elevation, float, font, font-family, font-size, font-style, font-variant, font-weight, height, letter-spacing, line-height, overflow, pause, pause-after, pause-before, pitch, pitch-range, richness, speak, speak-header, speak-numeral, speak-punctuation, speech-rate, stress, text-align, text-decoration, text-indent, unicode-bidi, vertical-align, voice-family, volume, white-space, width
Note
Not all possible CSS values are allowed for these properties. The allowable values are restricted by a whitelist and a regular expression that allows color values and lengths. URIs are not allowed, to prevent platypus attacks. See the _HTMLSanitizer class for more details.
I am often asked why Universal Feed Parser is so hard-assed about HTML and CSS sanitizing. To illustrate the problem, here is an incomplete list of potentially dangerous HTML tags and attributes:
style? Yes, style. CSS definitions can contain executable code.
This sample is taken from http://feedparser.org/docs/examples/rss20.xml:
<description>Watch out for <span style="background: url(javascript:window.location='http://example.org/')"> nasty tricks</span></description>
This sample is more advanced, and does not contain the keyword javascript: that many naive HTML sanitizers scan for:
<description>Watch out for <span style="any: expression(window.location='http://example.org/')"> nasty tricks</span></description>
Internet Explorer for Windows will execute the Javascript in both of these examples.
Now consider that in HTML, attribute values may be entity-encoded in several different ways.
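To a browser, this:
<span style="any: expression(window.location='http://example.org/')">
is the same as this (without the line breaks):
<span style="&#97;&#110;&#121;&#58;&#32;&#101;&#120;&#112;&#114;&#101;
&#115;&#115;&#105;&#111;&#110;&#40;&#119;&#105;&#110;&#100;&#111;&#119;
&#46;&#108;&#111;&#99;&#97;&#116;&#105;&#111;&#110;&#61;&#39;&#104;
&#116;&#116;&#112;&#58;&#47;&#47;&#101;&#120;&#97;&#109;&#112;&#108;
&#101;&#46;&#111;&#114;&#103;&#47;&#39;&#41;">
which is the same as this (without the line breaks):
<span style="&#x61;&#x6e;&#x79;&#x3a;&#x20;&#x65;&#x78;&#x70;&#x72;
&#x65;&#x73;&#x73;&#x69;&#x6f;&#x6e;&#x28;&#x77;&#x69;&#x6e;
&#x64;&#x6f;&#x77;&#x2e;&#x6c;&#x6f;&#x63;&#x61;&#x74;&#x69;
&#x6f;&#x6e;&#x3d;&#x27;&#x68;&#x74;&#x74;&#x70;&#x3a;&#x2f;
&#x2f;&#x65;&#x78;&#x61;&#x6d;&#x70;&#x6c;&#x65;&#x2e;&#x6f;
&#x72;&#x67;&#x2f;&#x27;&#x29;">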
And so on, plus several other variations, plus every combination of every variation.
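The equivalence of these encodings is easy to confirm with the Python standard library. This sketch encodes the same attribute value as decimal and as hexadecimal character references and decodes both with html.unescape; the variable names are illustrative.

```python
from html import unescape

plain = "any: expression(window.location='http://example.org/')"

# Encode every character as a decimal, then as a hexadecimal, reference.
decimal_form = "".join("&#%d;" % ord(ch) for ch in plain)
hex_form = "".join("&#x%x;" % ord(ch) for ch in plain)

# Both encodings decode back to exactly the same string.
decoded_decimal = unescape(decimal_form)
decoded_hex = unescape(hex_form)

# A naive scan for a dangerous keyword misses both encoded forms.
keyword_visible = "expression" in decimal_form or "expression" in hex_form
```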
The more I investigate, the more cases I find where Internet Explorer for Windows will treat seemingly innocuous markup as code and blithely execute it. This is why Universal Feed Parser uses a whitelist and not a blacklist. I am reasonably confident that none of the elements or attributes on the whitelist are security risks. I am not at all confident about elements or attributes that I have not explicitly investigated. And I have no confidence at all in my ability to detect strings within attribute values that Internet Explorer for Windows will treat as executable code.
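To make the whitelist idea concrete, here is a minimal sketch built on the standard-library HTMLParser. It is not the _HTMLSanitizer class; the tiny ALLOWED_TAGS and ALLOWED_ATTRS sets and the names WhitelistSanitizer and sanitize are assumptions for illustration. It also omits things a real sanitizer needs, such as checking URI schemes in href values and discarding the contents of script elements.

```python
from html import escape
from html.parser import HTMLParser

# Deliberately tiny, hypothetical whitelists for illustration only.
ALLOWED_TAGS = {"a", "b", "div", "em", "p", "span"}
ALLOWED_ATTRS = {"class", "href", "rel", "title"}


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # unknown tag: drop it rather than guess that it is safe
        kept = [(k, v) for k, v in attrs
                if k in ALLOWED_ATTRS and v is not None]
        rendered = "".join(' %s="%s"' % (k, escape(v, quote=True))
                           for k, v in kept)
        self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


dirty = ('Watch out for <span style="background: url(javascript:'
         "window.location='http://example.org/')\">nasty tricks</span>")
clean = sanitize(dirty)
```

The style attribute disappears because it is not on the whitelist, not because the code recognized its value as dangerous. That is the point of the approach: anything that has not been explicitly approved is removed.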
Content Normalization
Universal Feed Parser can parse many different types of feeds: Atom, CDF, and nine different versions of RSS. You should not be forced to learn the differences between these formats. Universal Feed Parser does its best to ensure that you can treat all feeds the same way, regardless of format or version.
You can access the basic elements of an Atom feed using RSS terminology.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
>>> d['channel']['title']
u'Sample Feed'
>>> d['channel']['link']
u'http://example.org/'
>>> d['channel']['description']
u'For documentation <em>only</em>'
>>> len(d['items'])
1
>>> e = d['items'][0]
>>> e['title']
u'First entry title'
>>> e['link']
u'http://example.org/entry/3'
>>> e['description']
u'Watch out for nasty tricks'
>>> e['author']
u'Mark Pilgrim (mark@example.org)'
The same thing works in reverse: you can access RSS feeds as if they were Atom feeds.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
>>> d.feed.subtitle_detail
{'type': 'text/html',
'base': 'http://feedparser.org/docs/examples/rss20.xml',
'language': None,
'value': u'For documentation <em>only</em>'}
>>> len(d.entries)
1
>>> e = d.entries[0]
>>> e.links
[{'rel': 'alternate',
'type': 'text/html',
'href': u'http://example.org/item/1'}]
>>> e.summary_detail
{'type': 'text/html',
'base': 'http://feedparser.org/docs/examples/rss20.xml',
'language': u'en',
'value': u'Watch out for <span>nasty tricks</span>'}
>>> e.updated_parsed
(2002, 9, 5, 0, 0, 1, 3, 248, 0)
Note
For more examples of how Universal Feed Parser normalizes content from different formats, see annotated.
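One way to picture this kind of normalization is a dictionary that resolves alternative key names before lookup. The sketch below is a simplified illustration and not Universal Feed Parser's implementation; AliasDict and its small ALIASES table are assumptions, covering only the channel, items, and guid names that appear in this documentation.

```python
class AliasDict(dict):
    """Dictionary that resolves RSS-style key names to Atom-style ones."""

    # Hypothetical alias table for this sketch: RSS name -> Atom name.
    ALIASES = {"channel": "feed", "items": "entries", "guid": "id"}

    def _resolve(self, key):
        return self.ALIASES.get(key, key)

    def __getitem__(self, key):
        return dict.__getitem__(self, self._resolve(key))

    def __contains__(self, key):
        return dict.__contains__(self, self._resolve(key))

    def get(self, key, default=None):
        return self[key] if key in self else default


entry = AliasDict(id="http://example.org/guid/1", title="First item title")
result = AliasDict(feed={"title": "Sample Feed"}, entries=[entry])

channel_title = result["channel"]["title"]
first_guid = result["items"][0]["guid"]
```

Storing the data under one canonical name and translating on access keeps the two vocabularies from drifting apart, because there is only ever one copy of each value.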
Microformats
An emerging trend in feed syndication is the inclusion of microformats. Besides the semantics defined by individual feed formats, publishers can add additional semantics using rel and class attributes in embedded HTML content.
Note
To parse microformats, Universal Feed Parser relies on a third-party library called Beautiful Soup, which is distributed separately. If Beautiful Soup is not installed, Universal Feed Parser will silently skip microformats parsing.
The following elements are parsed for microformats:
The rel=enclosure microformat provides a way for embedded HTML content to specify that a certain link should be treated as an enclosure. Universal Feed Parser looks for links within embedded markup that meet any of the following conditions:
When Universal Feed Parser finds a link that satisfies any of these conditions, it adds it to reference.entry.enclosures.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/rel-enclosure.xml')
>>> d.entries[0].enclosures
[{u'href': u'http://example.com/movie.mp4', 'title': u'awesome movie'}]
The rel=tag microformat allows you to define tags within embedded HTML content. Universal Feed Parser looks for these attribute values in embedded markup and maps them to reference.entry.tags.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/rel-tag.xml')
>>> d.entries[0].tags
[{'term': u'tech', 'scheme': u'http://del.icio.us/tag/', 'label': u'Technology'}]
The XFN microformat allows you to define human relationships between URIs. For example, you could link from your weblog to your spouse’s weblog with the rel="spouse" relation. It is intended primarily for “blogrolls” or other static lists of links, but the relations can occur anywhere in HTML content. If found, Universal Feed Parser will return the XFN information in reference.entry.xfn.
Universal Feed Parser supports all of the relationships listed in the XFN 1.1 profile, as well as the following variations:
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/xfn.xml')
>>> person = d.entries[0].xfn[0]
>>> person.name
u'John Doe'
>>> person.href
u'http://example.com/johndoe'
>>> person.relationships
[u'coworker', u'friend']
The hCard microformat allows you to embed address book information within HTML content. If Universal Feed Parser finds an hCard within supported elements, it converts it into an RFC 2426-compliant vCard and returns it in reference.entry.vcard.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/hcard.xml')
>>> print d.entries[0].vcard
BEGIN:vCard
VERSION:3.0
FN:Frank Dawson
N:Dawson;Frank
ADR;TYPE=work,postal,parcel:;;6544 Battleford Drive;Raleigh;NC;27613-3502;U
 .S.A.
TEL;TYPE=WORK,VOICE,MSG:+1-919-676-9515
TEL;TYPE=WORK,FAX:+1-919-676-9564
EMAIL;TYPE=internet,pref:Frank_Dawson@Lotus.com
EMAIL;TYPE=internet:fdawson@earthlink.net
ORG:Lotus Development Corporation
URL:http://home.earthlink.net/~fdawson
END:vCard
BEGIN:vCard
VERSION:3.0
FN:Tim Howes
N:Howes;Tim
ADR;TYPE=work:;;501 E. Middlefield Rd.;Mountain View;CA;94043;U.S.A.
TEL;TYPE=WORK,VOICE,MSG:+1-415-937-3419
TEL;TYPE=WORK,FAX:+1-415-528-4164
EMAIL;TYPE=internet:howes@netscape.com
ORG:Netscape Communications Corp.
END:vCard
Note
There are a growing number of microformats, and Universal Feed Parser does not parse all of them. However, both the rel and class attributes survive HTML sanitizing, so applications built on Universal Feed Parser that wish to parse additional microformat content are free to do so.
Namespace Handling
Universal Feed Parser attempts to expose all possible data in feeds, including elements in extension namespaces.
Some common namespaced elements are mapped to core elements. For further information about these mappings, see reference.
Other namespaced elements are available as prefix_element; for example, a prism:issn element is available as prism_issn.
The namespaces defined in the feed are available in the parsed results as namespaces, a dictionary of {prefix: namespaceURI}. If the feed defines a default namespace, it is listed as namespaces[''].
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/prism.rdf')
>>> d.feed.prism_issn
u'0028-0836'
>>> d.namespaces
{'': u'http://purl.org/rss/1.0/',
'prism': u'http://prismstandard.org/namespaces/1.2/basic/',
'rdf': u'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}
The prefix used to construct the variable name is not guaranteed to be the same as the prefix of the namespaced element in the original feed. If Universal Feed Parser recognizes the namespace, it will use the namespace’s preferred prefix to construct the variable name. It will also list the namespace in the namespaces dictionary using the namespace’s preferred prefix.
In the previous example, the namespace (http://prismstandard.org/namespaces/1.2/basic/) was defined with the namespace’s preferred prefix (prism), so the prism:issn element was accessible as the variable d.feed.prism_issn. However, if the namespace is defined with a non-standard prefix, Universal Feed Parser will still construct the variable name using the preferred prefix, not the actual prefix that is used in the feed.
This will become clear with an example.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/nonstandard_prefix.rdf')
>>> d.feed.prism_issn
u'0028-0836'
>>> d.feed.foo_issn
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "feedparser.py", line 158, in __getattr__
    raise AttributeError, "object has no attribute '%s'" % key
AttributeError: object has no attribute 'foo_issn'
>>> d.namespaces
{'': u'http://purl.org/rss/1.0/',
 'prism': u'http://prismstandard.org/namespaces/1.2/basic/',
 'rdf': u'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}
This is the complete list of namespaces that Universal Feed Parser recognizes and uses to construct the variable names for data in these namespaces:
Recognized Namespaces
---------------------
Prefix           Namespace
admin            http://webns.net/mvcb/
ag               http://purl.org/rss/1.0/modules/aggregation/
annotate         http://purl.org/rss/1.0/modules/annotate/
audio            http://media.tangent.org/rss/1.0/
blogChannel      http://backend.userland.com/blogChannelModule
cc               http://web.resource.org/cc/
creativeCommons  http://backend.userland.com/creativeCommonsRssModule
co               http://purl.org/rss/1.0/modules/company
content          http://purl.org/rss/1.0/modules/content/
cp               http://my.theinfo.org/changed/1.0/rss/
dc               http://purl.org/dc/elements/1.1/
dcterms          http://purl.org/dc/terms/
email            http://purl.org/rss/1.0/modules/email/
ev               http://purl.org/rss/1.0/modules/event/
feedburner       http://rssnamespace.org/feedburner/ext/1.0
fm               http://freshmeat.net/rss/fm/
foaf             http://xmlns.com/foaf/0.1/
geo              http://www.w3.org/2003/01/geo/wgs84_pos#
icbm             http://postneo.com/icbm/
image            http://purl.org/rss/1.0/modules/image/
itunes           http://www.itunes.com/DTDs/PodCast-1.0.dtd
itunes           http://example.com/DTDs/PodCast-1.0.dtd
l                http://purl.org/rss/1.0/modules/link/
media            http://search.yahoo.com/mrss
pingback         http://madskills.com/public/xml/rss/module/pingback/
prism            http://prismstandard.org/namespaces/1.2/basic/
rdf              http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs             http://www.w3.org/2000/01/rdf-schema#
ref              http://purl.org/rss/1.0/modules/reference/
reqv             http://purl.org/rss/1.0/modules/richequiv/
search           http://purl.org/rss/1.0/modules/search/
slash            http://purl.org/rss/1.0/modules/slash/
soap             http://schemas.xmlsoap.org/soap/envelope/
ss               http://purl.org/rss/1.0/modules/servicestatus/
str              http://hacks.benhammersley.com/rss/streaming/
sub              http://purl.org/rss/1.0/modules/subscription/
sy               http://purl.org/rss/1.0/modules/syndication/
szf              http://schemas.pocketsoap.com/rss/myDescModule/
taxo             http://purl.org/rss/1.0/modules/taxonomy/
thr              http://purl.org/rss/1.0/modules/threading/
ti               http://purl.org/rss/1.0/modules/textinput/
trackback        http://madskills.com/public/xml/rss/module/trackback/
wfw              http://wellformedweb.org/CommentAPI/
wiki             http://purl.org/rss/1.0/modules/wiki/
xhtml            http://www.w3.org/1999/xhtml
xlink            http://www.w3.org/1999/xlink
xml              http://www.w3.org/XML/1998/namespace
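The preferred-prefix rule can be pictured as a lookup keyed on the namespace URI rather than on the prefix the feed happens to use. The following is only a sketch in Python 3, not Universal Feed Parser's own code: PREFERRED_PREFIXES is a hand-picked subset of the recognized namespaces, variable_name is a hypothetical helper, and falling back to the feed's own prefix for unrecognized namespaces is an assumption.

```python
# Hypothetical subset of the recognized namespaces, keyed by namespace URI.
PREFERRED_PREFIXES = {
    "http://purl.org/dc/elements/1.1/": "dc",
    "http://prismstandard.org/namespaces/1.2/basic/": "prism",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
}


def variable_name(namespace_uri, feed_prefix, element):
    """Return the name under which a namespaced element would be exposed."""
    # Compare namespace URIs case-insensitively.
    lookup = {uri.lower(): prefix for uri, prefix in PREFERRED_PREFIXES.items()}
    # Assumption: an unrecognized namespace keeps the prefix used in the feed.
    prefix = lookup.get(namespace_uri.lower(), feed_prefix)
    return "%s_%s" % (prefix, element)
```

With this sketch, an issn element bound to the PRISM namespace under the prefix foo still maps to prism_issn, which mirrors the nonstandard_prefix.rdf session.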
Note
Universal Feed Parser treats namespaces as case-insensitive to match the behavior of certain versions of iTunes.
Caution
Data from namespaced elements is not sanitized (even if it contains HTML markup).
filename=”resolving-relative-links.html”
Many feed elements and attributes are URIs. Universal Feed Parser resolves relative URIs according to the XML:Base specification. We’ll see how that works in a minute, but first let’s talk about which values are treated as URIs.
These feed elements are treated as URIs, and resolved if they are relative:
In addition, several feed elements may contain HTML or XHTML markup. Certain elements and attributes in HTML can be relative URIs, and Universal Feed Parser will resolve these URIs according to the same rules as the feed elements listed above.
These feed elements may contain HTML or XHTML markup. In Atom feeds, whether these elements are treated as HTML depends on the value of the type attribute. In RSS feeds, these values are always treated as HTML.
When any of these feed elements contains HTML or XHTML markup, the following HTML elements are treated as URIs and are resolved if they are relative:
Universal Feed Parser resolves relative URIs according to the XML:Base specification. This defines a hierarchical inheritance system, where one element can define the base URI for itself and all of its child elements, using an xml:base attribute. A child element can then override its parent’s base URI by redeclaring xml:base to a different value.
If no xml:base is specified, the feed has a default base URI defined in the Content-Location HTTP header.
If no Content-Location HTTP header is present, the URL used to retrieve the feed itself is the default base URI for all relative links within the feed. If the feed was retrieved via an HTTP redirect (any HTTP 3xx status code), then the final URL of the feed is the default base URI.
For example, an xml:base on the root-level element sets the base URI for all URIs in the feed.
>>> import feedparser
>>> d = feedparser.parse("http://feedparser.org/docs/examples/base.xml")
>>> d.feed.link
u'http://example.org/index.html'
>>> d.feed.generator_detail.href
u'http://example.org/generator/'
An xml:base attribute on an <entry> overrides the xml:base on the parent <feed>.
>>> import feedparser
>>> d = feedparser.parse("http://feedparser.org/docs/examples/base.xml")
>>> d.entries[0].link
u'http://example.org/archives/000001.html'
>>> d.entries[0].author_detail.href
u'http://example.org/about/'
An xml:base on <content> overrides the xml:base on the parent <entry>. In addition, whatever the base URI is for the <content> element (whether defined directly on the <content> element, or inherited from the parent element) is used as the base URI for the embedded HTML or XHTML markup within the content.
>>> import feedparser
>>> d = feedparser.parse("http://feedparser.org/docs/examples/base.xml")
>>> d.entries[0].content[0].value
u'<p id="anchor1"><a href="http://example.org/archives/000001.html#anchor2">skip to anchor 2</a></p>
<p>Some content</p>
<p id="anchor2">This is anchor 2</p>'
The xml:base affects other attributes in the element in which it is declared.
>>> import feedparser
>>> d = feedparser.parse("http://feedparser.org/docs/examples/base.xml")
>>> d.entries[0].links[1].rel
u'service.edit'
>>> d.entries[0].links[1].href
u'http://example.com/api/client/37'
If no xml:base is specified on the root-level element, the default base URI is given in the Content-Location HTTP header. This can still be overridden by any child element that declares an xml:base attribute.
>>> import feedparser
>>> d = feedparser.parse("http://feedparser.org/docs/examples/http_base.xml")
>>> d.feed.link
u'http://example.org/index.html'
>>> d.entries[0].link
u'http://example.org/archives/000001.html'
Finally, if no root-level xml:base is declared, and no Content-Location HTTP header is present, the URL of the feed itself is the default base URI. Again, this can still be overridden by any element that declares an xml:base attribute.
>>> import feedparser
>>> d = feedparser.parse("http://feedparser.org/docs/examples/no_base.xml")
>>> d.feed.link
u'http://feedparser.org/docs/examples/index.html'
>>> d.entries[0].link
u'http://example.org/archives/000001.html'
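The precedence described in this section (xml:base first, then the Content-Location header, then the URL of the feed itself) can be approximated with urljoin from the Python 3 standard library. This is a minimal sketch under stated assumptions, not the resolver Universal Feed Parser actually uses: effective_base and resolve are hypothetical helpers, and xml_bases is assumed to list the xml:base values from the root element down to the current element.

```python
from urllib.parse import urljoin


def effective_base(feed_url, content_location=None, xml_bases=()):
    """Work out the base URI that applies to one element.

    xml_bases holds the xml:base values seen from the root element down to
    the element itself, outermost first (an assumption of this sketch).
    """
    base = feed_url                      # last resort: the feed's final URL
    if content_location:
        base = urljoin(base, content_location)
    for value in xml_bases:              # inner declarations override outer ones
        base = urljoin(base, value)
    return base


def resolve(relative_uri, feed_url, content_location=None, xml_bases=()):
    """Resolve a possibly relative URI against the effective base."""
    return urljoin(effective_base(feed_url, content_location, xml_bases), relative_uri)
```

For instance, resolve("index.html", "http://feedparser.org/docs/examples/no_base.xml") yields the same http://feedparser.org/docs/examples/index.html value shown for d.feed.link in the no_base.xml session.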
Though not recommended, it is possible to disable Universal Feed Parser's relative URI resolution by setting feedparser.RESOLVE_RELATIVE_URIS to 0.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/base.xml')
>>> d.entries[0].content[0].base
u'http://example.org/archives/000001.html'
>>> print d.entries[0].content[0].value
<p id="anchor1"><a href="http://example.org/archives/000001.html#anchor2">skip to anchor 2</a></p>
<p>Some content</p>
<p id="anchor2">This is anchor 2</p>
>>> feedparser.RESOLVE_RELATIVE_URIS = 0
>>> d2 = feedparser.parse('http://feedparser.org/docs/examples/base.xml')
>>> d2.entries[0].content[0].base
u'http://example.org/archives/000001.html'
>>> print d2.entries[0].content[0].value
<p id="anchor1"><a href="#anchor2">skip to anchor 2</a></p>
<p>Some content</p>
<p id="anchor2">This is anchor 2</p>
filename=”version-detection.html”
Universal Feed Parser attempts to autodetect the type and version of the feeds it parses. There are many subtle and not-so-subtle differences between the different versions of RSS, and applications may choose to handle different feed types in different ways.
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
>>> d.version
'atom10'
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom03.xml')
>>> d.version
'atom03'
>>> d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
>>> d.version
'rss20'
>>> d = feedparser.parse('http://feedparser.org/docs/examples/rss20dc.xml')
>>> d.version
'rss20'
>>> d = feedparser.parse('http://feedparser.org/docs/examples/rss10.rdf')
>>> d.version
'rss10'
Here is the complete list of known feed types and versions that may be returned in version:
If the feed type is completely unknown, version will be an empty string.
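An application that handles feed types differently might branch on the version string. The sketch below (Python 3) relies only on the values shown in the session above and on the empty-string case; feed_family is a hypothetical helper, and the "other" bucket is an assumption covering any version string that does not start with atom or rss.

```python
def feed_family(version):
    """Map a version string such as 'atom10' or 'rss20' to a coarse family."""
    if not version:
        return "unknown"    # empty string: the feed type could not be detected
    if version.startswith("atom"):
        return "atom"
    if version.startswith("rss"):
        return "rss"
    return "other"          # assumption: anything else is handled generically
```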
filename=”character-encoding.html”
Tip
Feeds may be published in any character encoding. Python supports only a few character encodings by default. To support the maximum number of character encodings (and be able to parse the maximum number of feeds), you should install cjkcodecs and iconv_codec. Both are available at http://cjkpython.i18n.org/.
RFC 3023 defines the interaction between XML and HTTP as it relates to character encoding. XML and HTTP have different ways of specifying character encoding and different defaults in case no encoding is specified, and determining which value takes precedence depends on a variety of factors.
In XML, the character encoding is optional and may be given in the XML declaration in the first line of the document, like this: <?xml version="1.0" encoding="utf-8"?>
If no encoding is given, XML supports the use of a Byte Order Mark to identify the document as some flavor of UTF-32, UTF-16, or UTF-8. Section F of the XML specification outlines the process for determining the character encoding based on unique properties of the Byte Order Mark in the first two to four bytes of the document.
If no encoding is specified and no Byte Order Mark is present, XML defaults to UTF-8.
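Byte Order Mark detection can be sketched with the constants in the standard codecs module. This Python 3 sketch covers only the Byte Order Mark itself, not the rest of the Section F heuristics, and sniff_bom is a hypothetical helper rather than part of Universal Feed Parser.

```python
import codecs

# Check the four-byte marks first: the UTF-32 little-endian mark begins
# with the same two bytes as the UTF-16 little-endian mark.
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(data):
    """Return the encoding implied by a leading Byte Order Mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```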
HTTP uses MIME to define a method of specifying the character encoding, as part of the Content-Type HTTP header, which looks like this: Content-Type: text/html; charset="utf-8"
If no charset is specified, HTTP defaults to iso-8859-1, but only for text/* media types. For other media types, the default encoding is undefined, which is where RFC 3023 comes in.
According to RFC 3023, if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, then the encoding is
# the encoding given in the charset parameter of the Content-Type HTTP header, or
# the encoding given in the encoding attribute of the XML declaration within the document, or
# utf-8.
On the other hand, if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is
# the encoding given in the charset parameter of the Content-Type HTTP header, or
# us-ascii.
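The two precedence lists above can be written as a small function. This is a sketch in Python 3 of the rules as this section states them, not a complete RFC 3023 implementation and not Universal Feed Parser's code; rfc3023_encoding is a hypothetical helper, and returning None for media types that are not XML is an assumption of the sketch.

```python
from email.message import Message

_APPLICATION_XML = ("xml", "xml-dtd", "xml-external-parsed-entity")
_TEXT_XML = ("xml", "xml-external-parsed-entity")


def rfc3023_encoding(content_type, xml_encoding=None):
    """Pick an encoding from a Content-Type value and the XML declaration."""
    # Reuse the standard library's header parser for the media type and charset.
    msg = Message()
    msg["Content-Type"] = content_type
    maintype, _, subtype = msg.get_content_type().partition("/")
    charset = msg.get_param("charset")

    if maintype == "application" and (subtype in _APPLICATION_XML or subtype.endswith("+xml")):
        return charset or xml_encoding or "utf-8"
    if maintype == "text" and (subtype in _TEXT_XML or subtype.endswith("+xml")):
        # The XML declaration is ignored entirely for text/* XML types.
        return charset or "us-ascii"
    return None  # assumption: not an XML media type, so these rules do not apply
```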
Universal Feed Parser initially uses the rules specified in RFC 3023 to determine the character encoding of the feed. If parsing succeeds, then that’s that. If parsing fails, Universal Feed Parser sets the bozo bit to 1 and sets bozo_exception to feedparser.CharacterEncodingOverride. Then it tries to reparse the feed with the following character encodings:
# the encoding specified in the XML declaration
# the encoding sniffed from the first four bytes of the document (as per Section F)
# the encoding auto-detected by the Universal Encoding Detector (http://chardet.feedparser.org/), if installed
# utf-8
# windows-1252
If the character encoding cannot be determined, Universal Feed Parser sets the bozo bit to 1 and sets bozo_exception to feedparser.CharacterEncodingUnknown. In this case, parsed values will be strings, not Unicode strings.
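The retry order can be illustrated as a loop over candidate encodings. This Python 3 sketch simplifies the real behavior: it only checks whether the bytes decode, whereas Universal Feed Parser reparses the feed with each candidate. decode_with_fallbacks is a hypothetical helper.

```python
def decode_with_fallbacks(data, candidates):
    """Try each candidate encoding in order.

    Returns (text, encoding) for the first candidate that decodes cleanly,
    or (None, None) if none of them does.
    """
    for encoding in candidates:
        if not encoding:
            continue  # e.g. no XML declaration, or no detector installed
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    return None, None
```

A caller would pass the candidates in the order listed above, ending with "utf-8" and "windows-1252".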
RFC 3023 only applies when the feed is served over HTTP with a Content-Type that declares the feed to be some kind of XML. However, some web servers are severely misconfigured and serve feeds with a Content-Type of text/plain, application/octet-stream, or some completely bogus media type.
Universal Feed Parser will attempt to parse such feeds, but it will set the bozo bit to 1 and set bozo_exception to feedparser.NonXMLContentType.
filename=”bozo.html”
Universal Feed Parser can parse feeds whether they are well-formed XML or not. However, since some applications may wish to reject or warn users about non-well-formed feeds, Universal Feed Parser sets the bozo bit when it detects that a feed is not well-formed. Thanks to Tim Bray for suggesting this terminology.
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
>>> d.bozo
0
>>> d = feedparser.parse('http://feedparser.org/tests/illformed/rss/aaa_illformed.xml')
>>> d.bozo
1
>>> d.bozo_exception
<xml.sax._exceptions.SAXParseException instance at 0x00BAAA08>
>>> exc = d.bozo_exception
>>> exc.getMessage()
"expected '>'\n"
>>> exc.getLineNumber()
6
There are many reasons an XML document could be non-well-formed besides the one in this example (incomplete end tags). See advanced.encoding for some other ways to trip the bozo bit.
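An application that wants to warn about such feeds only needs to look at bozo and, when it is set, at bozo_exception. The sketch below (Python 3) uses stand-in objects so it runs without network access; describe_feed, good, and bad are hypothetical names, and real results would come from feedparser.parse.

```python
from types import SimpleNamespace


def describe_feed(result):
    """Return a short status line for a parsed result."""
    if not result.bozo:
        return "well-formed"
    # bozo_exception is only expected to be present when bozo is set.
    exc = getattr(result, "bozo_exception", None)
    return "not well-formed: %s" % type(exc).__name__


# Stand-ins for parse results (assumptions for illustration only).
good = SimpleNamespace(bozo=0)
bad = SimpleNamespace(bozo=1, bozo_exception=ValueError("expected '>'"))
```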
filename=”http.html”
filename=”http-etag.html”
ETags and Last-Modified headers are two ways that feed publishers can save bandwidth, but they only work if clients take advantage of them. Universal Feed Parser gives you the ability to take advantage of these features, but you must use them properly.
The basic concept is that a feed publisher may provide a special HTTP header, called an ETag, when it publishes a feed. You should send this ETag back to the server on subsequent requests. If the feed has not changed since the last time you requested it, the server will return a special HTTP status code (304) and no feed data.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
>>> d.etag
'"6c132-941-ad7e3080"'
>>> d2 = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml', etag=d.etag)
>>> d2.status
304
>>> d2.feed
{}
>>> d2.entries
[]
>>> d2.debug_message
'The feed has not changed since you last checked, so
the server sent no data. This is a feature, not a bug!'
There is a related concept which accomplishes the same thing, but slightly differently. In this case, the server publishes the last-modified date of the feed in the HTTP header. You can send this back to the server on subsequent requests, and if the feed has not changed, the server will return HTTP status code 304 and no feed data.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
>>> d.modified
(2004, 6, 11, 23, 0, 34, 4, 163, 0)
>>> d2 = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml', modified=d.modified)
>>> d2.status
304
>>> d2.feed
{}
>>> d2.entries
[]
>>> d2.debug_message
'The feed has not changed since you last checked, so
the server sent no data. This is a feature, not a bug!'
Clients should support both ETag and Last-Modified headers, as some servers support one but not the other.
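Supporting both validators mostly means remembering whatever the server sent and passing it back on the next request. The following Python 3 sketch shows one way to do that; poll and cache are hypothetical names, and the parse argument is assumed to behave like feedparser.parse as shown above (it accepts etag and modified keyword arguments and returns an object with a status attribute).

```python
def poll(parse, url, cache):
    """Fetch url, sending back any validators remembered in cache (a dict).

    Returns the parsed result, or None when the server answered 304.
    """
    result = parse(url, etag=cache.get("etag"), modified=cache.get("modified"))
    if getattr(result, "status", None) == 304:
        return None  # unchanged: keep using the data stored last time
    # Remember whichever validators this server provided, if any.
    cache["etag"] = getattr(result, "etag", None)
    cache["modified"] = getattr(result, "modified", None)
    return result
```

In an application, parse would be feedparser.parse and cache would be stored per feed between runs.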
Important
If you do not support ETag and Last-Modified headers, you will repeatedly download feeds that have not changed. This wastes your bandwidth and the publisher’s bandwidth, and the publisher may ban you from accessing their server.
Note
You can control the behaviour of HTTP caches between your application and the origin server by using the extra_headers parameter. For example, you may want to send Cache-control: max-age=60 to make the caches revalidate against the origin server unless their cached copy is less than a minute old. Again, this should be used with consideration.
filename=”http-useragent.html”
Universal Feed Parser sends a default User-Agent string when it requests a feed from a web server.
The default User-Agent string looks like this:
UniversalFeedParser/4.2 +http://feedparser.org/
If you are embedding Universal Feed Parser in a larger application, you should change the User-Agent to your application name and URL.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml', \
...     agent='MyApp/1.0 +http://example.com/')
You can also set the User-Agent once, globally, and then call the parse function normally.
>>> import feedparser
>>> feedparser.USER_AGENT = "MyApp/1.0 +http://example.com/"
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
Universal Feed Parser also lets you set the referrer when you download a feed from a web server. This is discouraged, because it is a violation of RFC 2616. The default behavior is to send a blank referrer, and you should never need to override this.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml', \
...     referrer='http://example.com/')
filename=”http-redirect.html”
When you download a feed from a remote web server, Universal Feed Parser exposes the HTTP status code. You need to understand the different codes, including permanent and temporary redirects, and feeds that have been marked gone.
When a feed has temporarily moved to a new location, the web server will return a 302 status code. Universal Feed Parser makes this available in d.status.
There is nothing special you need to do with temporary redirects; by the time you learn about it, Universal Feed Parser has already followed the redirect to the new location (available in d.href), downloaded the feed, and parsed it. Since the redirect is temporary, you should continue requesting the original URL the next time you want to parse the feed.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/temporary.xml')
>>> d.status
302
>>> d.href
'http://feedparser.org/docs/examples/atom10.xml'
>>> d.feed.title
u'Sample Feed'
When a feed has permanently moved to a new location, the web server will return a 301 status code. Again, Universal Feed Parser makes this available in d.status.
If you are polling a feed on a regular basis, it is very important to check the status code (d.status) every time you download. If the feed has been permanently redirected, you should update your database or configuration file with the new address (d.href). Repeatedly requesting the original address of a feed that has been permanently redirected is very rude, and may get you banned from the server.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/permanent.xml')
>>> d.status
301
>>> d.href
'http://feedparser.org/docs/examples/atom10.xml'
>>> d.feed.title
u'Sample Feed'
When a feed has been permanently deleted, the web server will return a 410 status code. If you ever receive a 410, you should stop polling the feed and inform the end user that the feed is gone for good.
Repeatedly requesting a feed that has been marked as gone is very rude, and may get you banned from the server.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/gone.xml')
>>> d.status
410
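The advice in this section boils down to a three-way decision after each download. The sketch below (Python 3) encodes it as a pure function; next_action and its return values are hypothetical, and in a real poller status and final_url would come from d.status and d.href.

```python
def next_action(status, requested_url, final_url):
    """Decide what a poller should do with a subscription after a fetch.

    Returns (action, url_to_request_next_time).
    """
    if status == 410:
        return "stop", None             # gone for good: stop polling
    if status == 301:
        return "update", final_url      # permanent move: store the new address
    return "keep", requested_url        # includes 302: keep the original URL
```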
filename=”http-authentication.html”
Universal Feed Parser supports downloading and parsing password-protected feeds that are protected by HTTP authentication. Both basic and digest authentication are supported.
The easiest way is to embed the username and password in the feed URL itself.
In this example, the username is test and the password is basic.
>>> import feedparser
>>> d = feedparser.parse('http://test:basic@feedparser.org/docs/examples/basic_auth.xml')
>>> d.feed.title
u'Sample Feed'
The same technique works for digest authentication. (Technically, Universal Feed Parser will attempt basic authentication first, but if that fails and the server indicates that it requires digest authentication, Universal Feed Parser will automatically re-request the feed with the appropriate digest authentication headers. This means that this technique will send your password to the server in an easily decryptable form.)
In this example, the username is test and the password is digest.
>>> import feedparser
>>> d = feedparser.parse('http://test:digest@feedparser.org/docs/examples/digest_auth.xml')
>>> d.feed.title
u'Sample Feed'
You can also construct an HTTPBasicAuthHandler that contains the password information, then pass that as a handler to the parse function. HTTPBasicAuthHandler is part of the standard urllib2 module (http://docs.python.org/lib/module-urllib2.html).
Downloading a feed protected by HTTP basic authentication (the hard way)
------------------------------------------------------------------------
import urllib2, feedparser

# Construct the authentication handler
auth = urllib2.HTTPBasicAuthHandler()

# Add password information: realm, host, user, password.
# A single handler can contain passwords for multiple sites;
# urllib2 will sort out which passwords get sent to which sites
# based on the realm and host of the URL you're retrieving
auth.add_password('BasicTest', 'feedparser.org', 'test', 'basic')

# Pass the authentication handler to the feed parser.
# handlers is a list because there might be more than one
# type of handler (urllib2 defines lots of different ones,
# and you can build your own)
d = feedparser.parse('http://feedparser.org/docs/examples/basic_auth.xml', \
                     handlers=[auth])
Digest authentication is handled in much the same way, by constructing an HTTPDigestAuthHandler and populating it with the necessary realm, host, user, and password information. This is more secure than stuffing the username and password in the URL, since the password will be encrypted before being sent to the server.
Downloading a feed protected by HTTP digest authentication (the secure way)
---------------------------------------------------------------------------
import urllib2, feedparser

auth = urllib2.HTTPDigestAuthHandler()
auth.add_password('DigestTest', 'feedparser.org', 'test', 'digest')
d = feedparser.parse('http://feedparser.org/docs/examples/digest_auth.xml', \
                     handlers=[auth])
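On Python 3, the urllib2 module became urllib.request, and the handler classes kept their names. The following sketch shows the same handler construction there and checks which credentials the handler would offer for the example URL. It makes no network request, and whether a given Universal Feed Parser release accepts Python 3 handlers in the same way is an assumption to verify against your installed version.

```python
import urllib.request

# Same realm, host, user and password as in the digest listing above.
auth = urllib.request.HTTPDigestAuthHandler()
auth.add_password("DigestTest", "feedparser.org", "test", "digest")

# The handler would then be passed along as before, for example:
#   d = feedparser.parse(url, handlers=[auth])
# The lookup below only shows what the handler holds for a given URL.
credentials = auth.passwd.find_user_password(
    "DigestTest", "http://feedparser.org/docs/examples/digest_auth.xml")
```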
Caution
Prior to Python 2.3.3, urllib2 did not properly support digest authentication. Mac OS X 10.3 Panther ships with Python 2.3. Users of previous versions of Mac OS X will need to upgrade to the latest version of Python in order to use digest authentication.
The examples so far have assumed that you know in advance that the feed is password-protected. But what if you don’t know?
If you try to download a password-protected feed without sending all the proper password information, the server will return an HTTP status code 401. Universal Feed Parser makes this status code available in d.status.
Details on the authentication scheme are in d.headers['www-authenticate']. Universal Feed Parser does not do any further parsing on this field; you will need to parse it yourself. Everything before the first space is the type of authentication (probably Basic or Digest), which controls which type of handler you'll need to construct. The realm name is given as realm="foo", so foo would be your first argument to auth.add_password. Other information in the www-authenticate header is probably safe to ignore; the urllib2 module will handle it for you.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/basic_auth.xml')
>>> d.status
401
>>> d.headers['www-authenticate']
'Basic realm="Use test/basic"'
>>> d = feedparser.parse('http://feedparser.org/docs/examples/digest_auth.xml')
>>> d.status
401
>>> d.headers['www-authenticate']
'Digest realm="DigestTest",
nonce="+LV/uLLdAwA=5d77397291261b9ef256b034e19bcb94f5b7992a",
algorithm=MD5,
qop="auth"'
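Parsing the header as described (scheme before the first space, realm inside realm="...") takes only a few lines. This is a minimal Python 3 sketch that handles the two header shapes shown above, not a general WWW-Authenticate parser; parse_www_authenticate is a hypothetical helper.

```python
import re


def parse_www_authenticate(value):
    """Split a WWW-Authenticate value into (scheme, realm)."""
    # Everything before the first space is the authentication type.
    scheme, _, params = value.strip().partition(" ")
    # The realm is given as realm="..."; other parameters are ignored here.
    match = re.search(r'realm="([^"]*)"', params)
    return scheme, (match.group(1) if match else None)
```

The scheme tells you which handler class to construct, and the realm is the first argument to add_password.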
filename=”http-other.html”
You can specify extra HTTP request headers as a dictionary. When you download a feed from a remote web server, Universal Feed Parser exposes the complete set of HTTP response headers as a dictionary.
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom03.xml',
...     extra_headers={'Cache-control': 'max-age=0'})
>>> import feedparser
>>> d = feedparser.parse('http://feedparser.org/docs/examples/atom03.xml')
>>> d.headers
{'date': 'Fri, 11 Jun 2004 23:57:50 GMT',
 'server': 'Apache/2.0.49 (Debian GNU/Linux)',
 'last-modified': 'Fri, 11 Jun 2004 23:00:34 GMT',
 'etag': '"6c132-941-ad7e3080"',
 'accept-ranges': 'bytes',
 'vary': 'Accept-Encoding,User-Agent',
 'content-encoding': 'gzip',
 'content-length': '883',
 'connection': 'close',
 'content-type': 'application/xml'}
filename=”command-line.html”
filename=”command-line-basic.html”
TODO
TODO
filename=”command-line-arguments.html”
TODO
TODO
mark@atlantis$
filename=”command-line-output-formats.html”
TODO
filename=”annotated-atom10.html”
This is a sample Atom 1.0 feed, annotated with links that show how each value can be accessed once the feed is parsed.
Caution
Even though many of these elements are required according to the specification, real-world feeds may be missing any element. If an element is not present in the feed, it will not be present in the parsed results. You should not rely on any particular element being present.
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://example.org/" xml:lang="en">
  <title type="text">Sample Feed</title>
  <subtitle type="html">For documentation <em>only</em></subtitle>
  <link rel="alternate" type="html" href="/"/>
  <link rel="self" type="application/atom+xml" href="http://www.example.org/atom10.xml"/>
  <rights type="html"><p>Copyright 2005, Mark Pilgrim</p></rights>
  <generator uri="http://example.org/generator/" version="4.0">Sample Toolkit</generator>
  <id>:ref:`tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml <reference.feed.id>`</id>
  <updated>:ref:`2005-11-09T11:56:34Z <reference.feed.updated>`</updated>
  <entry>
    <title>:ref:`First entry title <reference.entry.title>`</title>
    <link rel=":ref:`alternate <reference.entry.links.type>`" href="/entry/3"/>
    <link rel="related" type="text/html" href="http://search.example.com/"/>
    <link rel="via" type="text/html" href="http://toby.example.com/examples/atom10"/>
    <link rel="enclosure" type="video/mpeg4" href="http://www.example.com/movie.mp4" length="42301"/>
    <id>:ref:`tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml:3 <reference.entry.id>`</id>
    <published>:ref:`2005-11-09T00:23:47Z <reference.entry.published>`</published>
    <updated>:ref:`2005-11-09T11:56:34Z <reference.entry.updated>`</updated>
    <author>
      <name>:ref:`Mark Pilgrim <reference.entry.author_detail.name>`</name>
      <uri>:ref:`http://diveintomark.org/ <reference.entry.author_detail.href>`</uri>
      <email>:ref:`mark@example.org <reference.entry.author_detail.email>`</email>
    </author>
    <contributor>
      <name>:ref:`Joe <reference.entry.contributors.name>`</name>
      <url>:ref:`http://example.org/joe/ <reference.entry.contributors.href>`</url>
      <email>:ref:`joe@example.org <reference.entry.contributors.email>`</email>
    </contributor>
    <contributor>
      <name>:ref:`Sam <reference.entry.contributors.name>`</name>
      <url>:ref:`http://example.org/sam/ <reference.entry.contributors.href>`</url>
      <email>:ref:`sam@example.org <reference.entry.contributors.email>`</email>
    </contributor>
    <summary type=":ref:`text <reference.entry.summary_detail.type>`">Watch out for nasty tricks</summary>
    <content type="xhtml" xml:base="http://example.org/entry/3" xml:lang="en-US">:ref:`<div xmlns="http://www.w3.org/1999/xhtml">Watch out for <span style="background-image: url(javascript:window.location='http://example.org/')"> nasty tricks</span></div> <reference.entry.content.value>`</content>
  </entry>
</feed>
filename=”annotated-atom03.html”
This is a sample Atom 0.3 feed, annotated with links that show how each value can be accessed once the feed is parsed.
Caution
Even though many of these elements are required according to the specification, real-world feeds may be missing any element. If an element is not present in the feed, it will not be present in the parsed results. You should not rely on any particular element being present.
Annotated Atom 0.3 feed
-----------------------
<?xml version="1.0" encoding="utf-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#" xml:base="http://example.org/" xml:lang="en">
  <title type="text/plain" mode="escaped">Sample Feed</title>
  <tagline type="text/html" mode="escaped">For documentation <em>only</em></tagline>
  <link rel="alternate" type="text/html" href="/"/>
  <copyright type="text/html" mode="escaped"><p>Copyright 2004, Mark Pilgrim</p>< </copyright>
  <generator url="http://example.org/generator/" version="3.0">Sample Toolkit</generator>
  <id>:ref:`tag:feedparser.org,2004-04-20:/docs/examples/atom03.xml <reference.feed.id>`</id>
  <modified>:ref:`2004-04-20T11:56:34Z <reference.feed.updated>`</modified>
  <info type=":ref:`application/xhtml+xml <reference.feed.info_detail.type>`" mode="xml">
    <div xmlns="http://www.w3.org/1999/xhtml">
  </info>
  <entry>
    <title>:ref:`First entry title <reference.entry.title>`</title>
    <link rel=":ref:`alternate <reference.entry.links.rel>`" type="text/html" href="/entry/3"/>
    <link rel="service.edit" type="application/atom+xml" title="Atom API entrypoint to edit this entry" href="/api/edit/3"/>
    <link rel="service.post" type="application/atom+xml" title="Atom API entrypoint to add comments to this entry" href="/api/comment/3"/>
    <id>:ref:`tag:feedparser.org,2004-04-20:/docs/examples/atom03.xml:3 <reference.entry.id>`</id>
    <created>:ref:`2004-04-19T07:45:00Z <reference.entry.created>`</created>
    <issued>:ref:`2004-04-20T00:23:47Z <reference.entry.published>`</issued>
    <modified>:ref:`2004-04-20T11:56:34Z <reference.entry.updated>`</modified>
    <author>
      <name>:ref:`Mark Pilgrim <reference.entry.author_detail.name>`</name>
      <url>:ref:`http://diveintomark.org/ <reference.entry.author_detail.href>`</url>
      <email>:ref:`mark@example.org <reference.entry.author_detail.email>`</email>
    </author>
    <contributor>
      <name>:ref:`Joe <reference.entry.contributors.name>`</name>
      <url>:ref:`http://example.org/joe/ <reference.entry.contributors.href>`</url>
      <email>:ref:`joe@example.org <reference.entry.contributors.email>`</email>
    </contributor>
    <contributor>
      <name>:ref:`Sam <reference.entry.contributors.name>`</name>
      <url>:ref:`http://example.org/sam/ <reference.entry.contributors.href>`</url>
      <email>:ref:`sam@example.org <reference.entry.contributors.email>`</email>
    </contributor>
    <summary type=":ref:`text/plain <reference.entry.summary_detail.type>`" mode="escaped">Watch out for nasty tricks</summary>
    <content type="application/xhtml+xml" mode="xml" xml:base="http://example.org/entry/3" xml:lang="en-US">:ref:`<div xmlns="http://www.w3.org/1999/xhtml">Watch out for <span style="background-image: url(javascript:window.location='http://example.org/')"> nasty tricks</span></div> <reference.entry.content.value>`</content>
  </entry>
</feed>
filename=”annotated-rss20.html”
This is a sample RSS 2.0 feed, annotated with links that show how each value can be accessed once the feed is parsed.
Caution
Even though many of these elements are required according to the specification, real-world feeds may be missing any element. If an element is not present in the feed, it will not be present in the parsed results. You should not rely on any particular element being present.
Annotated RSS 2.0 feed
----------------------
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>:ref:`Sample Feed <reference.feed.title>`</title>
    <description>:ref:`For documentation <em>only</em> <reference.feed.subtitle>`</description>
    <link>:ref:`http://example.org/ <reference.feed.link>`</link>
    <language>:ref:`en <reference.feed.language>`</language>
    <copyright>:ref:`Copyright 2004, Mark Pilgrim <reference.feed.rights>`</copyright>
    <managingEditor>:ref:`editor@example.org <reference.feed.author>`</managingEditor>
    <webMaster>:ref:`webmaster@example.org <reference.feed.publisher>`</webMaster>
    <pubDate>:ref:`Sat, 07 Sep 2002 0:00:01 GMT <reference.feed.updated>`</pubDate>
    <category>:ref:`Examples <reference.feed.tags.term>`</category>
    <generator>:ref:`Sample Toolkit <reference.feed.generator>`</generator>
    <docs>:ref:`http://feedvalidator.org/docs/rss2.html <reference.feed.docs>`</docs>
    <cloud domain=":ref:`rpc.example.com <reference.feed.cloud.domain>`" port="80" path="/RPC2" registerProcedure="pingMe" protocol="soap"/>
    <ttl>:ref:`60 <reference.feed.ttl>`</ttl>
    <image>
      <url>:ref:`http://example.org/banner.png <reference.feed.image.href>`</url>
      <title>:ref:`Example banner <reference.feed.image.title>`</title>
      <link>:ref:`http://example.org/ <reference.feed.image.link>`</link>
      <width>:ref:`80 <reference.feed.image.width>`</width>
      <height>:ref:`15 <reference.feed.image.height>`</height>
    </image>
    <textInput>
      <title>:ref:`Search <reference.feed.textinput.title>`</title>
      <description>:ref:`Search this site: <reference.feed.textinput.description>`</description>
      <name>:ref:`q <reference.feed.textinput.name>`</name>
      <link>:ref:`http://example.org/mt/mt-search.cgi <reference.feed.textinput.link>`</link>
    </textInput>
    <item>
      <title>:ref:`First item title <reference.entry.title>`</title>
      <link>:ref:`http://example.org/item/1 <reference.entry.link>`</link>
      <description>:ref:`Watch out for <span style="background: url(javascript:window.location='http://example.org/')"> nasty tricks</span> <reference.entry.summary>`</description>
      <author>:ref:`mark@example.org <reference.entry.author>`</author>
      <category>:ref:`Miscellaneous <reference.entry.tags.term>`</category>
      <comments>:ref:`http://example.org/comments/1 <reference.entry.comments>`</comments>
      <enclosure url=":ref:`http://example.org/audio/demo.mp3 <reference.entry.enclosures.href>`" length="1069871" type="audio/mpeg"/>
      <guid>:ref:`http://example.org/guid/1 <reference.entry.id>`</guid>
      <pubDate>:ref:`Thu, 05 Sep 2002 0:00:01 GMT <reference.entry.updated>`</pubDate>
    </item>
  </channel>
</rss>
filename=”annotated-rss20-dc.html”
This is a sample RSS 2.0 feed that uses several allowable extension modules in namespaces. The feed is annotated with links that show how each value can be accessed once the feed is parsed.
Caution
Even though many of these elements are required according to the specification, real-world feeds may be missing any element. If an element is not present in the feed, it will not be present in the parsed results. You should not rely on any particular element being present.
Annotated RSS 2.0 feed with namespaces
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:admin="http://webns.net/mvcb/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
  <title>Sample Feed</title> <!-- reference.feed.title -->
  <link>http://example.org/</link> <!-- reference.feed.link -->
  <description>For documentation only</description> <!-- reference.feed.subtitle -->
  <dc:language>en-us</dc:language> <!-- reference.feed.language -->
  <dc:creator>Mark Pilgrim (mark@example.org)</dc:creator> <!-- name: reference.feed.author_detail.name -->
  <dc:rights>Copyright 2004 Mark Pilgrim</dc:rights> <!-- reference.feed.rights -->
  <dc:date>2004-06-04T17:40:33-05:00</dc:date> <!-- reference.feed.updated -->
  <admin:generatorAgent rdf:resource="http://www.exampletoolkit.org/"/> <!-- rdf:resource: reference.feed.generator_detail.href -->
  <admin:errorReportsTo rdf:resource="mailto:mark@example.org"/>
  <item>
    <title>First of all</title> <!-- reference.entry.title -->
    <link>http://example.org/archives/2002/09/04.html#first_of_all</link> <!-- reference.entry.link -->
    <guid isPermaLink="false">1983@example.org</guid> <!-- reference.entry.id -->
    <description>Americans are fat. Smokers are stupid. People who don't speak Perl are irrelevant.</description> <!-- reference.entry.summary -->
    <dc:subject>Quotes</dc:subject> <!-- reference.entry.tags.term -->
    <dc:date>2002-09-04T13:54:20-05:00</dc:date> <!-- reference.entry.updated -->
    <content:encoded><![CDATA[<cite>Ian Hickson</cite>: <q><a href="http://ln.hixie.ch/?start=1030823786&count=1"> Americans are fat. Smokers are stupid. People who don't speak Perl are irrelevant. </a></q>]]></content:encoded> <!-- reference.entry.content.value -->
  </item>
</channel>
</rss>
filename=”annotated-rss10.html”
This is a sample RSS 1.0 feed, annotated with links that show how each value can be accessed once the feed is parsed.
Caution
Even though many of these elements are required according to the specification, real-world feeds may be missing any element. If an element is not present in the feed, it will not be present in the parsed results. You should not rely on any particular element being present.
Annotated RSS 1.0 feed
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:admin="http://webns.net/mvcb/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:cc="http://web.resource.org/cc/"
  xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://www.example.org/index.rdf">
  <title>Sample Feed</title> <!-- reference.feed.title -->
  <link>http://www.example.org/</link> <!-- reference.feed.link -->
  <description>For documentation only</description> <!-- reference.feed.subtitle -->
  <dc:language>en</dc:language> <!-- reference.feed.language -->
  <cc:license rdf:resource="http://web.resource.org/cc/PublicDomain"/> <!-- rdf:resource: reference.feed.license -->
  <dc:creator>Mark Pilgrim (mark@example.org)</dc:creator> <!-- name: reference.feed.author_detail.name -->
  <dc:date>2004-06-04T17:40:33-05:00</dc:date> <!-- reference.feed.updated -->
  <admin:generatorAgent rdf:resource="http://www.exampletoolkit.org/"/> <!-- rdf:resource: reference.feed.generator_detail.href -->
  <admin:errorReportsTo rdf:resource="mailto:mark@example.org"/>
  <items>
    <rdf:Seq>
      <rdf:li rdf:resource="http://www.example.org/1"/>
    </rdf:Seq>
  </items>
</channel>
<item rdf:about="http://www.example.org/1">
  <title>First of all</title> <!-- reference.entry.title -->
  <link>http://example.org/archives/2002/09/04.html#first_of_all</link> <!-- reference.entry.link -->
  <description>Americans are fat. Smokers are stupid. People who don't speak Perl are irrelevant.</description> <!-- reference.entry.summary -->
  <dc:subject>Quotes</dc:subject> <!-- reference.entry.tags.term -->
  <dc:date>2004-05-30T14:23:54-06:00</dc:date> <!-- reference.entry.updated -->
  <content:encoded><![CDATA[<cite>Ian Hickson</cite>: <q><a href="http://ln.hixie.ch/?start=1030823786&count=1"> Americans are fat. Smokers are stupid. People who don't speak Perl are irrelevant. </a></q>]]></content:encoded> <!-- reference.entry.content -->
</item>
</rdf:RDF>
filename=”history.html”
filename=”changes-42.html”
Universal Feed Parser 4.2 was released on March 12, 2008.
filename=”changes-41.html”
Universal Feed Parser 4.1 was released on January 11, 2006.
filename=”changes-402.html”
Universal Feed Parser 4.0.2 was released on December 24, 2005.
filename=”changes-401.html”
Universal Feed Parser 4.0.1 was released on December 24, 2005.
filename=”changes-40.html”
Universal Feed Parser 4.0 was released on December 23, 2005.
filename=”changes-33.html”
Universal Feed Parser 3.3 was released on July 15, 2004.
filename=”changes-32.html”
Universal Feed Parser 3.2 was released on July 3, 2004.
filename=”changes-31.html”
Universal Feed Parser 3.1 was released on June 28, 2004.
filename=”changes-301.html”
Universal Feed Parser 3.0.1 was released on June 21, 2004.
filename=”changes-30.html”
This got a little out of hand.
Changes in version 3.0
Universal Feed Parser 3.0 was released on June 21, 2004.
Changes in version 3.0fc3
Universal Feed Parser 3.0fc3 was released on June 18, 2004.
Changes in version 3.0fc2
Universal Feed Parser 3.0fc2 was released on May 10, 2004.
Changes in version 3.0fc1
Universal Feed Parser 3.0fc1 was released on April 23, 2004.
Changes in version 3.0b23
Universal Feed Parser 3.0b23 was released on April 21, 2004.
Changes in version 3.0b22
Universal Feed Parser 3.0b22 was released on April 19, 2004.
Changes in version 3.0b21
Universal Feed Parser 3.0b21 was released on April 14, 2004.
Changes in version 3.0b20
Universal Feed Parser 3.0b20 was released on April 7, 2004.
Changes in version 3.0b19
Universal Feed Parser 3.0b19 was released on March 15, 2004.
Changes in version 3.0b18
Universal Feed Parser 3.0b18 was released on February 17, 2004.
Changes in version 3.0b17
Universal Feed Parser 3.0b17 was released on February 13, 2004.
Changes in version 3.0b16
Universal Feed Parser 3.0b16 was released on February 12, 2004.
Changes in version 3.0b15
Universal Feed Parser 3.0b15 was released on February 11, 2004.
Changes in version 3.0b14
Universal Feed Parser 3.0b14 was released on February 8, 2004.
Changes in version 3.0b13
Universal Feed Parser 3.0b13 was released on February 8, 2004.
Changes in version 3.0b12
Universal Feed Parser 3.0b12 was released on February 6, 2004.
Changes in version 3.0b11
Universal Feed Parser 3.0b11 was released on February 2, 2004.
Changes in version 3.0b10
Universal Feed Parser 3.0b10 was released on January 31, 2004.
Changes in version 3.0b9
Universal Feed Parser 3.0b9 was released on January 29, 2004.
Changes in version 3.0b8
Universal Feed Parser 3.0b8 was released on January 28, 2004.
Changes in version 3.0b7
Universal Feed Parser 3.0b7 was released on January 28, 2004.
Changes in version 3.0b6
Universal Feed Parser 3.0b6 was released on January 27, 2004.
Changes in version 3.0b5
Universal Feed Parser 3.0b5 was released on January 26, 2004.
Changes in version 3.0b4
Universal Feed Parser 3.0b4 was released on January 26, 2004.
Changes in version 3.0b3
Universal Feed Parser 3.0b3 was released on January 23, 2004.
Changes in version 3.0b2 and 3.0b1
Universal Feed Parser 3.0b2 and 3.0b1 have been lost in the mists of time.
filename=”changes-27.html”
The 2.7 series was a brief but necessary transition towards some of the core ideas in version 3.0.
Changes in version 2.7.6
Ultra-liberal Feed Parser 2.7.6 was released on January 16, 2004.
Changes in version 2.7.5
Ultra-liberal Feed Parser 2.7.5 was released on January 15, 2004.
Changes in version 2.7.4
Ultra-liberal Feed Parser 2.7.4 was released on January 14, 2004.
Changes in version 2.7.3
Ultra-liberal Feed Parser 2.7.3 was released on January 14, 2004.
Changes in version 2.7.2
Ultra-liberal Feed Parser 2.7.2 was released on January 13, 2004.
Changes in version 2.7.1
Ultra-liberal Feed Parser 2.7.1 was released on January 9, 2004.
Changes in version 2.7
Ultra-liberal Feed Parser 2.7 was released on January 5, 2004.
filename=”changes-26.html”
Ultra-liberal Feed Parser 2.6 was released on January 1, 2004.
filename=”changes-early.html”
Universal Feed Parser began as an “ultra-liberal RSS parser” named rssparser.py. It was written as a weapon for battles that no one remembers, to work around problems that no longer exist.
Changes in version 2.5.3
Ultra-liberal Feed Parser 2.5.3 was released on August 3, 2003.
Changes in version 2.5.2
Ultra-liberal Feed Parser 2.5.2 was released on July 28, 2003.
Changes in version 2.5.1
Ultra-liberal Feed Parser 2.5.1 was released on July 26, 2003.
Changes in version 2.5
Ultra-liberal Feed Parser 2.5 was released on July 25, 2003.
Changes in version 2.4
Ultra-liberal Feed Parser 2.4 was released on July 9, 2003.
Changes in version 2.3.1
Ultra-liberal RSS Parser 2.3.1 was released on June 12, 2003.
Changes in version 2.3
Ultra-liberal RSS Parser 2.3 was released on June 11, 2003.
Changes in version 2.2
Ultra-liberal RSS Parser 2.2 was released on January 27, 2003.
Changes in version 2.1
Ultra-liberal RSS Parser 2.1 was released on November 14, 2002.
Changes in version 2.0.2
Ultra-liberal RSS Parser 2.0.2 was released on October 21, 2002.
Changes in version 2.0.1
Ultra-liberal RSS Parser 2.0.1 was released on October 21, 2002.
Changes in version 2.0
Ultra-liberal RSS Parser 2.0 was released on October 19, 2002.
Changes in version 1.1
Ultra-liberal RSS Parser 1.1 was released on September 27, 2002.
Changes in version 1.0
Ultra-liberal RSS Parser 1.0 was released on September 27, 2002.
Initial release
Ultra-liberal RSS Parser (unversioned) was released on August 13, 2002.
Aaron Swartz has been looking for an ultra-liberal RSS parser. Now that I’m experimenting with a homegrown RSS-to-email news aggregator, so am I.
You see, most RSS feeds suck. Invalid characters, unescaped ampersands (Blogger feeds), invalid entities (Radio feeds), unescaped and invalid HTML (The Register’s feed most days). Or just a bastardized mix of RSS 0.9x elements with RSS 1.0 elements (Movable Type feeds).
Then there are feeds, like Aaron’s feed, which are too bleeding edge. He puts an excerpt in the description element but puts the full text in the content:encoded element (as CDATA). This is valid RSS 1.0, but nobody actually uses it (except Aaron), few news aggregators support it, and many parsers choke on it. Other parsers are confused by the new elements (guid) in RSS 0.94 (see Dave Winer’s feed for an example). And then there’s Jon Udell’s feed, with the fullitem element that he just sort of made up.
rssparser.py. GPL-licensed. Tested on 5000 active feeds.
filename=”reference.html”
filename=”reference-feed.html”
A dictionary of data about the feed.
Tip
This element always exists, although it may be an empty dictionary.
filename=”reference-feed-title.html”
The title of the feed.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
filename=”reference-feed-title_detail.html”
A dictionary with details about the feed title.
Same as reference.feed.title.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
The content type of the feed title.
Most likely values for type: text/plain, text/html, and application/xhtml+xml.
For Atom feeds, the content type is taken from the type attribute, which defaults to text/plain if not specified. For RSS feeds, the content type is auto-determined by inspecting the content, and defaults to text/html. Note that this may cause silent data loss if the value contains plain text with angle brackets. There is nothing I can do about this problem; it is a limitation of RSS.
Future enhancement: some versions of RSS clearly specify that certain values default to text/plain, and Universal Feed Parser should respect this, but it doesn’t yet.
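When you display a value yourself, the content type tells you whether the value is markup or literal text. The following is a minimal sketch, not part of Universal Feed Parser: the to_html helper is hypothetical, and it assumes a detail dictionary with value and type keys.

```python
import html


def to_html(detail):
    """Return markup suitable for embedding, based on the declared type."""
    value = detail.get("value", "")
    if detail.get("type") == "text/plain":
        # Plain text: angle brackets and ampersands are literal characters,
        # so escape them before placing the value in an HTML page.
        return html.escape(value)
    # text/html and application/xhtml+xml values are already markup.
    return value


plain = {"type": "text/plain", "value": "if a < b & b > c"}
marked_up = {"type": "text/html", "value": "<em>only</em>"}

print(to_html(plain))
print(to_html(marked_up))
```

This also shows why the RSS default matters: a plain-text value that is treated as text/html is passed through unescaped, and its angle brackets are interpreted as tags.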
The language of the feed title.
language is supposed to be a language code, as specified by RFC 3066, but publishers have been known to publish random values like English or German. Universal Feed Parser does not do any parsing or normalization of language codes.
language may come from the element’s xml:lang attribute, or it may inherit from a parent element’s xml:lang, or the Content-Language HTTP header. If the feed does not specify a language, language will be None, the Python null value.
The original base URI for links within the feed title.
base is only useful in rare situations and can usually be ignored. It is the original base URI for this value, as specified by the element’s xml:base attribute, or a parent element’s xml:base, or the appropriate HTTP header, or the URI of the feed. (See advanced.base for more details.) By the time you see it, Universal Feed Parser has already resolved relative links in all values where it makes sense to do so. Clients should never need to manually resolve relative links.
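To make "resolving a relative link against a base URI" concrete, here is a small illustration using the standard library's urljoin. It is not Universal Feed Parser's own code, and the base URI is a made-up example; the parser has already done the equivalent work by the time you read a value.

```python
from urllib.parse import urljoin

# Hypothetical base URI, as might come from xml:base or the feed's own URI.
base = "http://example.org/archives/2002/"

relative = urljoin(base, "09/04.html")        # path relative to the base
rooted = urljoin(base, "/banner.png")         # path relative to the site root
absolute = urljoin(base, "http://other.example.com/a")  # already absolute

print(relative)
print(rooted)
print(absolute)
```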
filename=”reference-feed-link.html”
The URL of the HTML page associated with this feed.
For site feeds, this is probably the home page of the site. For category feeds, this is probably the category’s archive page. For search feeds, this is probably the web page that displays the search results for the given search parameters.
If this is a relative URI, it is resolved according to a set of rules.
filename=”reference-feed-links.html”
A list of dictionaries with details on the links associated with the feed. Each link has a rel (relationship), type (content type), and href (the URL that the link points to). Some links may also have a title.
The relationship of this feed link.
Atom 1.0 defines five standard link relationships and describes the process for registering others. The five standard rel values are alternate, enclosure, related, self, and via.
The content type of the page that this feed link points to.
The URL of the page that this feed link points to.
If this is a relative URI, it is resolved according to a set of rules.
The title of this feed link.
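Because links is a list, finding a particular link usually means filtering on rel and, optionally, on type. A minimal sketch follows, with a hypothetical find_links helper and made-up sample data; it assumes only the rel, type, href, and title keys described above.

```python
def find_links(links, rel, content_type=None):
    """Return the href of every link matching rel (and type, if given)."""
    matches = []
    for link in links:
        if link.get("rel") != rel:
            continue
        if content_type is not None and link.get("type") != content_type:
            continue
        if link.get("href"):
            matches.append(link["href"])
    return matches


# Made-up sample data in the shape described above.
links = [
    {"rel": "alternate", "type": "text/html", "href": "http://example.org/"},
    {"rel": "self", "type": "application/atom+xml",
     "href": "http://example.org/atom10.xml"},
    {"rel": "alternate", "type": "text/plain",
     "href": "http://example.org/index.txt", "title": "Plain text"},
]

html_pages = find_links(links, "alternate", "text/html")
all_alternates = find_links(links, "alternate")
print(html_pages)
print(all_alternates)
```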
filename=”reference-feed-subtitle.html”
A subtitle, tagline, slogan, or other short description of the feed.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
filename=”reference-feed-subtitle_detail.html”
A dictionary with details about the feed subtitle.
Same as reference.feed.subtitle.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
The content type of the feed subtitle.
Most likely values for type: text/plain, text/html, and application/xhtml+xml.
For Atom feeds, the content type is taken from the type attribute, which defaults to text/plain if not specified. For RSS feeds, the content type is auto-determined by inspecting the content, and defaults to text/html. Note that this may cause silent data loss if the value contains plain text with angle brackets. There is nothing I can do about this problem; it is a limitation of RSS.
Future enhancement: some versions of RSS clearly specify that certain values default to text/plain, and Universal Feed Parser should respect this, but it doesn’t yet.
The language of the feed subtitle.
language is supposed to be a language code, as specified by RFC 3066, but publishers have been known to publish random values like English or German. Universal Feed Parser does not do any parsing or normalization of language codes.
language may come from the element’s xml:lang attribute, or it may inherit from a parent element’s xml:lang, or the Content-Language HTTP header. If the feed does not specify a language, language will be None, the Python null value.
The original base URI for links within the feed subtitle.
base is only useful in rare situations and can usually be ignored. It is the original base URI for this value, as specified by the element’s xml:base attribute, or a parent element’s xml:base, or the appropriate HTTP header, or the URI of the feed. (See advanced.base for more details.) By the time you see it, Universal Feed Parser has already resolved relative links in all values where it makes sense to do so. Clients should never need to manually resolve relative links.
filename=”reference-feed-rights.html”
A human-readable copyright statement for the feed.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
Note
For machine-readable copyright information, see reference.feed.license.
filename=”reference-feed-rights_detail.html”
A dictionary with details on the feed copyright.
Same as reference.feed.rights.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
The content type of the feed copyright.
Most likely values for type: text/plain, text/html, and application/xhtml+xml.
For Atom feeds, the content type is taken from the type attribute, which defaults to text/plain if not specified. For RSS feeds, the content type is auto-determined by inspecting the content, and defaults to text/html. Note that this may cause silent data loss if the value contains plain text with angle brackets. There is nothing I can do about this problem; it is a limitation of RSS.
Future enhancement: some versions of RSS clearly specify that certain values default to text/plain, and Universal Feed Parser should respect this, but it doesn’t yet.
The language of the feed copyright.
language is supposed to be a language code, as specified by RFC 3066, but publishers have been known to publish random values like English or German. Universal Feed Parser does not do any parsing or normalization of language codes.
language may come from the element’s xml:lang attribute, or it may inherit from a parent element’s xml:lang, or the Content-Language HTTP header. If the feed does not specify a language, language will be None, the Python null value.
The original base URI for links within the feed copyright.
base is only useful in rare situations and can usually be ignored. It is the original base URI for this value, as specified by the element’s xml:base attribute, or a parent element’s xml:base, or the appropriate HTTP header, or the URI of the feed. (See advanced.base for more details.) By the time you see it, Universal Feed Parser has already resolved relative links in all values where it makes sense to do so. Clients should never need to manually resolve relative links.
filename=”reference-feed-generator.html”
A human-readable name of the application used to generate the feed.
filename=”reference-feed-generator_detail.html”
A dictionary with details about the feed generator.
Same as reference.feed.generator.
The URL of the application used to generate the feed.
If this is a relative URI, it is resolved according to a set of rules.
The version number of the application used to generate the feed. There is no required format for this, but most applications use a MAJOR.MINOR version number.
filename=”reference-feed-info.html”
Free-form human-readable description of the feed format itself. Intended for people who view the feed in a browser, to explain what they just clicked on. This element is generally ignored by feed readers.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
filename=”reference-feed-info-detail.html”
A dictionary with details about the feed info.
Same as reference.feed.info.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
The content type of the feed info.
Most likely values for type: text/plain, text/html, and application/xhtml+xml.
For Atom feeds, the content type is taken from the type attribute, which defaults to text/plain if not specified. For RSS feeds, the content type is auto-determined by inspecting the content, and defaults to text/html. Note that this may cause silent data loss if the value contains plain text with angle brackets. There is nothing I can do about this problem; it is a limitation of RSS.
Future enhancement: some versions of RSS clearly specify that certain values default to text/plain, and Universal Feed Parser should respect this, but it doesn’t yet.
The language of the feed info.
language is supposed to be a language code, as specified by RFC 3066, but publishers have been known to publish random values like English or German. Universal Feed Parser does not do any parsing or normalization of language codes.
language may come from the element’s xml:lang attribute, or it may inherit from a parent element’s xml:lang, or the Content-Language HTTP header. If the feed does not specify a language, language will be None, the Python null value.
The original base URI for links within the feed info.
base is only useful in rare situations and can usually be ignored. It is the original base URI for this value, as specified by the element’s xml:base attribute, or a parent element’s xml:base, or the appropriate HTTP header, or the URI of the feed. (See advanced.base for more details.) By the time you see it, Universal Feed Parser has already resolved relative links in all values where it makes sense to do so. Clients should never need to manually resolve relative links.
filename=”reference-feed-updated.html”
The date the feed was last updated, as a string in the same format as it was published in the original feed.
This element is parsed as a date and stored in reference.feed.updated_parsed.
filename=”reference-feed-updated_parsed.html”
The date the feed was last updated, as a standard Python 9-tuple.
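A 9-tuple is awkward to compare or format, so you may want to convert it to a datetime. The sketch below assumes that the tuple is expressed in UTC; the sample value is made up and corresponds to 2004-06-04T17:40:33-05:00.

```python
import calendar
import time
from datetime import datetime, timezone

# Made-up stand-in for d.feed.updated_parsed (assumed to be in UTC):
# year, month, day, hour, minute, second, weekday, day of year, DST flag.
updated_parsed = time.struct_time((2004, 6, 4, 22, 40, 33, 4, 156, 0))

# calendar.timegm() treats the tuple as UTC, unlike time.mktime(),
# which would interpret it in the local time zone.
timestamp = calendar.timegm(updated_parsed)
updated = datetime.fromtimestamp(timestamp, tz=timezone.utc)

print(updated.isoformat())
```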
filename=”reference-feed-id.html”
A globally unique identifier for this feed.
If this is a relative URI, it is resolved according to a set of rules.
The author of this feed.
A dictionary with details about the feed author.
The name of the feed author.
The URL of the feed author. This can be the author’s home page, or a contact page with a webmail form.
If this is a relative URI, it is resolved according to a set of rules.
The email address of the feed author.
filename=”reference-feed-contributors.html”
A list of contributors (secondary authors) to this feed.
The name of this contributor.
The URL of this contributor. This can be the contributor’s home page, or a contact page with a webmail form.
If this is a relative URI, it is resolved according to a set of rules.
The email address of this contributor.
filename=”reference-feed-image.html”
A dictionary with details about the feed image. A feed image can be a logo, banner, or a picture of the author.
The alternate text of the feed image, which would go in the alt attribute if you rendered the feed image as an HTML img element.
The URL of the feed image itself, which would go in the src attribute if you rendered the feed image as an HTML img element.
If this is a relative URI, it is resolved according to a set of rules.
The URL which the feed image would point to. If you rendered the feed image as an HTML img element, you would wrap it in an a element and put this in the href attribute.
If this is a relative URI, it is resolved according to a set of rules.
The width of the feed image, which would go in the width attribute if you rendered the feed image as an HTML img element.
The height of the feed image, which would go in the height attribute if you rendered the feed image as an HTML img element.
A short description of the feed image, which would go in the title attribute if you rendered the feed image as an HTML img element. This element is rare; it was available in Netscape RSS 0.91 but was dropped from Userland RSS 0.91.
This is a feed image:
<image>
  <title>Feed logo</title>
  <url>http://example.org/logo.png</url>
  <link>http://example.org/</link>
  <width>80</width>
  <height>15</height>
  <description>Visit my home page</description>
</image>
This feed image could be rendered in HTML as this:
<a href="http://example.org/">
  <img src="http://example.org/logo.png" width="80" height="15" alt="Feed logo" title="Visit my home page">
</a>
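The mapping from feed image values to HTML attributes can be sketched as a small function. The render_feed_image helper is hypothetical, and it assumes the image dictionary uses the keys title, href, link, width, height, and description. Every value is escaped, and missing values are simply skipped.

```python
import html


def render_feed_image(image):
    """Build an HTML snippet from a feed image dictionary."""
    # (HTML attribute, key in the image dictionary)
    attr_map = [
        ("src", "href"),
        ("width", "width"),
        ("height", "height"),
        ("alt", "title"),
        ("title", "description"),
    ]
    attrs = "".join(
        ' %s="%s"' % (attr, html.escape(str(image[key]), quote=True))
        for attr, key in attr_map
        if key in image
    )
    img = "<img%s>" % attrs
    if "link" in image:
        # Wrap the image in a link only when the feed supplies one.
        href = html.escape(str(image["link"]), quote=True)
        return '<a href="%s">%s</a>' % (href, img)
    return img


image = {
    "title": "Feed logo",
    "href": "http://example.org/logo.png",
    "link": "http://example.org/",
    "width": "80",
    "height": "15",
    "description": "Visit my home page",
}
markup = render_feed_image(image)
print(markup)
```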
filename=”reference-feed-icon.html”
A URL to a small icon representing the feed.
If this is a relative URI, it is resolved according to a set of rules.
filename=”reference-feed-logo.html”
A URL to a graphic representing a logo for the feed.
If this is a relative URI, it is resolved according to a set of rules.
filename=”reference-feed-textinput.html”
A text input form. No one actually uses this. Why are you?
The title of the text input form, which would go in the value attribute of the form’s submit button.
The link of the script which processes the text input form, which would go in the action attribute of the form.
If this is a relative URI, it is resolved according to a set of rules.
The name of the text input box in the form, which would go in the name attribute of the form’s input box.
A short description of the text input form, which would go in the label element of the form.
This is a text input in a feed:
<textInput>
  <title>Go!</title>
  <link>http://example.org/search</link>
  <name>keyword</name>
  <description>Search this site:</description>
</textInput>
This is how it could be rendered in HTML:
<form method="get" action="http://example.org/search">
  <label for="keyword">Search this site:</label>
  <input type="text" id="keyword" name="keyword" value="">
  <input type="submit" value="Go!">
</form>
filename=”reference-feed-cloud.html”
No one really knows what a cloud is. It is vaguely documented in SOAP meets RSS.
The domain of the cloud. Should be just the domain name, not including the http:// protocol. All clouds are presumed to operate over HTTP. The cloud specification does not support secure clouds over HTTPS, nor can clouds operate over other protocols.
The port of the cloud. Should be an integer, but Universal Feed Parser currently returns it as a string.
The URL path of the cloud.
The name of the procedure to call on the cloud.
The protocol of the cloud. Documentation differs on what the acceptable values are. Acceptable values definitely include xml-rpc and soap, although only in lowercase, despite both being acronyms.
There is no way for a publisher to specify the version number of the protocol to use. soap refers to SOAP 1.1; the cloud interface does not support SOAP 1.0 or 1.2.
post or http-post might also be acceptable values; nobody really knows for sure.
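Since the port arrives as a string and any part of the cloud may be missing, code that builds the cloud's URL should convert and validate first. Here is a minimal sketch with a hypothetical cloud_endpoint helper; falling back to port 80 and path / are assumptions of this example, not documented behavior.

```python
def cloud_endpoint(cloud):
    """Return the cloud's HTTP URL, or None if it cannot be built."""
    domain = cloud.get("domain")
    if not domain:
        return None
    try:
        # The port is a string in the parsed results; "80" is an assumed default.
        port = int(cloud.get("port", "80"))
    except (TypeError, ValueError):
        return None
    path = cloud.get("path", "/")
    return "http://%s:%d%s" % (domain, port, path)


# Values taken from the annotated RSS 2.0 feed.
cloud = {
    "domain": "rpc.example.com",
    "port": "80",
    "path": "/RPC2",
    "registerProcedure": "pingMe",
    "protocol": "soap",
}
endpoint = cloud_endpoint(cloud)
print(endpoint)
```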
filename=”reference-feed-publisher.html”
The publisher of the feed.
filename=”reference-feed-publisher_detail.html”
A dictionary with details about the feed publisher.
The name of this feed’s publisher.
The URL of this feed’s publisher. This can be the publisher’s home page, or a contact page with a webmail form.
If this is a relative URI, it is resolved according to a set of rules.
The email address of this feed’s publisher.
A list of dictionaries that contain details of the categories for the feed.
Note
Prior to version 4.0, Universal Feed Parser exposed categories in feed.category (the primary category) and feed.categories (a list of tuples containing the domain and term of each category). These uses are still supported for backward compatibility, but you will not see them in the parsed results unless you explicitly ask for them.
The category term (keyword).
The category scheme (domain).
A human-readable label for the category.
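A common task is to flatten the list of category dictionaries into a list of terms, optionally restricted to one scheme. This is a sketch with a hypothetical tag_terms helper and made-up sample data; it assumes the keys term, scheme, and label described above.

```python
def tag_terms(tags, scheme=None):
    """Return the term of every tag, optionally limited to one scheme."""
    return [
        tag["term"]
        for tag in tags
        if tag.get("term") and (scheme is None or tag.get("scheme") == scheme)
    ]


# Made-up sample data in the shape described above.
tags = [
    {"term": "Examples", "scheme": None, "label": None},
    {"term": "Quotes", "scheme": "http://example.org/categories",
     "label": "Quotations"},
    {"scheme": "http://example.org/categories", "label": "No term here"},
]

all_terms = tag_terms(tags)
scheme_terms = tag_terms(tags, "http://example.org/categories")
print(all_terms)
print(scheme_terms)
```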
filename=”reference-feed-docs.html”
A URL pointing to the specification which this feed conforms to.
This element is rare. The reasoning was that in 25 years, someone will stumble on an RSS feed and not know what it is, so we should waste everyone’s bandwidth with useless links until then. Most publishers skip it, and all clients ignore it.
If this is a relative URI, it is resolved according to a set of rules.
filename=”reference-feed-ttl.html”
According to the RSS specification, ttl stands for time to live. It’s a number of minutes that indicates how long a channel can be cached before refreshing from the source. This makes it possible for RSS sources to be managed by a file-sharing network such as Gnutella.
No one is quite sure what this means, and no one publishes feeds via file-sharing networks.
Some clients have interpreted this element to be some sort of inline caching mechanism, albeit one that completely ignores the underlying HTTP protocol, its robust caching mechanisms, and the huge amount of HTTP-savvy network infrastructure that understands them. Given the vague documentation, it is impossible to say that this interpretation is any more ridiculous than the element itself.
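If you do decide to honor ttl as a caching hint, the arithmetic is simple. The sketch below is one possible interpretation, not anything the specification or Universal Feed Parser prescribes; the cache_is_fresh helper is hypothetical, and it assumes ttl arrives as a string that may be missing or malformed.

```python
from datetime import datetime, timedelta


def cache_is_fresh(fetched_at, ttl, now):
    """Return True if fewer than ttl minutes have passed since fetched_at."""
    try:
        minutes = int(ttl)
    except (TypeError, ValueError):
        # Missing or malformed ttl: treat the cached copy as stale.
        return False
    return now - fetched_at < timedelta(minutes=minutes)


fetched = datetime(2004, 6, 4, 12, 0, 0)
fresh = cache_is_fresh(fetched, "60", datetime(2004, 6, 4, 12, 30, 0))
stale = cache_is_fresh(fetched, "60", datetime(2004, 6, 4, 13, 30, 0))
print(fresh, stale)
```

HTTP caching headers remain the better mechanism; this only shows what a client that reads ttl literally would compute.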
filename=”reference-feed-language.html”
The primary language of the feed.
filename=”reference-feed-license.html”
A URL of the license under which this feed is distributed.
If this is a relative URI, it is resolved according to a set of rules.
filename=”reference-feed-errorreportsto.html”
An email address for reporting errors in the feed itself.
filename=”reference-entry.html”
A list of dictionaries. Each dictionary contains data from a different entry. Entries are listed in the order in which they appear in the original feed.
Tip
This element always exists, although it may be an empty list.
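Because the list always exists, you can loop over it without first checking for it; only the values inside each entry need defaults. A minimal sketch, again using a plain dictionary with made-up data as a stand-in for parsed results:

```python
# A plain dict standing in for parsed results; the second entry has no title.
d = {
    "feed": {},
    "entries": [
        {"title": "First item title", "link": "http://example.org/item/1"},
        {"link": "http://example.org/item/2"},
    ],
}

lines = []
for number, entry in enumerate(d["entries"], start=1):
    title = entry.get("title", "(no title)")
    link = entry.get("link", "")
    lines.append("%d. %s %s" % (number, title, link))

print("\n".join(lines))
```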
filename=”reference-entry-title.html”
The title of the entry.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
filename=”reference-entry-title_detail.html”
A dictionary with details about the entry title.
Same as reference.entry.title.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
The content type of the entry title.
Most likely values for type: text/plain, text/html, and application/xhtml+xml.
For Atom feeds, the content type is taken from the type attribute, which defaults to text/plain if not specified. For RSS feeds, the content type is auto-determined by inspecting the content, and defaults to text/html. Note that this may cause silent data loss if the value contains plain text with angle brackets. There is nothing I can do about this problem; it is a limitation of RSS.
Future enhancement: some versions of RSS clearly specify that certain values default to text/plain, and Universal Feed Parser should respect this, but it doesn’t yet.
The language of the entry title.
language is supposed to be a language code, as specified by RFC 3066, but publishers have been known to publish random values like English or German. Universal Feed Parser does not do any parsing or normalization of language codes.
language may come from the element’s xml:lang attribute, or it may inherit from a parent element’s xml:lang, or the Content-Language HTTP header. If the feed does not specify a language, language will be None, the Python null value.
The original base URI for links within the entry title.
base is only useful in rare situations and can usually be ignored. It is the original base URI for this value, as specified by the element’s xml:base attribute, or a parent element’s xml:base, or the appropriate HTTP header, or the URI of the feed. (See advanced.base for more details.) By the time you see it, Universal Feed Parser has already resolved relative links in all values where it makes sense to do so. Clients should never need to manually resolve relative links.
filename=”reference-entry-link.html”
The primary link of this entry. Most feeds use this as the permanent link to the entry in the site’s archives.
If this is a relative URI, it is resolved according to a set of rules.
Some RSS feeds use guid when they mean link. guid can also be used as an opaque identifier that has nothing to do with links. If an RSS feed uses guid as the entry link and no link is present, Universal Feed Parser detects this and makes the guid available in d.entries[i].link.
In other words, you can always use ``d.entries[i].link`` to get the entry link, regardless of how the feed is actually structured.
filename=”reference-entry-links.html”
A list of dictionaries with details on the links associated with the entry. Each link has a rel (relationship), type (content type), and href (the URL that the link points to). Some links may also have a title.
The relationship of this entry link.
Atom 1.0 defines five standard link relationships and describes the process for registering others. Here are the five standard rel values:
The content type of the page that this entry link points to.
The URL of the page that this entry link points to.
If this is a relative URI, it is resolved according to a set of rules.
The title of this entry link.
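As a minimal sketch of how the link dictionaries described above might be consumed, the hypothetical helper below returns the href of the first link with a given rel. The helper name and the sample data are assumptions made for illustration; plain dictionaries stand in for d.entries[i].links so the sketch runs without fetching a feed.

```python
def find_link(links, rel, content_type=None):
    """Return the href of the first link matching rel (and type, if given)."""
    for link in links:
        if link.get('rel') != rel:
            continue
        if content_type is not None and link.get('type') != content_type:
            continue
        return link.get('href')
    # No link with the requested relationship was found.
    return None


# Made-up data shaped like d.entries[i].links (an assumption, not real output).
sample_links = [
    {'rel': 'alternate', 'type': 'text/html',
     'href': 'http://example.org/entry/3'},
    {'rel': 'related', 'type': 'text/html',
     'href': 'http://example.org/related'},
    {'rel': 'enclosure', 'type': 'video/mpeg4',
     'href': 'http://example.org/movie.mp4'},
]

print(find_link(sample_links, 'alternate'))
```

Using get rather than direct key access keeps the helper tolerant of links that omit type or title.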
filename=”reference-entry-summary.html”
A summary of the entry.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
Some publishing systems auto-generate this value from the first few words or first paragraph of the entry. Other publishing systems misuse it to include the full content. In the latter case, Universal Feed Parser ought to detect it and put the value in reference.entry.content instead, but it doesn’t.
Note
Some feeds include both a summary and description element for each entry. In this case, the first element will be available in entry['summary'] and the second will be available in entry['content'][0].
filename=”reference-entry-summary_detail.html”
A dictionary with details about the entry summary.
Same as reference.entry.summary.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
If this contains HTML or XHTML, it will be parsed for microformats.
The content type of the entry summary.
Most likely values for type:
For Atom feeds, the content type is taken from the type attribute, which defaults to text/plain if not specified. For RSS feeds, the content type is auto-determined by inspecting the content, and defaults to text/html. Note that this may cause silent data loss if the value contains plain text with angle brackets. There is nothing I can do about this problem; it is a limitation of RSS.
Future enhancement: some versions of RSS clearly specify that certain values default to text/plain, and Universal Feed Parser should respect this, but it doesn’t yet.
The language of the entry summary.
language is supposed to be a language code, as specified by RFC 3066, but publishers have been known to publish random values like English or German. Universal Feed Parser does not do any parsing or normalization of language codes.
language may come from the element’s xml:lang attribute, or it may inherit from a parent element’s xml:lang, or the Content-Language HTTP header. If the feed does not specify a language, language will be None, the Python null value.
The original base URI for links within the entry summary.
base is only useful in rare situations and can usually be ignored. It is the original base URI for this value, as specified by the element’s xml:base attribute, or a parent element’s xml:base, or the appropriate HTTP header, or the URI of the feed. (See advanced.base for more details.) By the time you see it, Universal Feed Parser has already resolved relative links in all values where it makes sense to do so. Clients should never need to manually resolve relative links.
filename=”reference-entry-content.html”
A list of dictionaries with details about the full content of the entry.
Atom feeds may contain multiple content elements. Clients should render as many of them as possible, based on the type and the client’s abilities.
The value of this piece of content.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
If this contains HTML or XHTML, it will be parsed for microformats.
The content type of this piece of content.
Most likely values for type:
For Atom feeds, the content type is taken from the type attribute, which defaults to text/plain if not specified. For RSS feeds, the content type is auto-determined by inspecting the content, and defaults to text/html. Note that this may cause silent data loss if the value contains plain text with angle brackets. There is nothing I can do about this problem; it is a limitation of RSS.
Future enhancement: some versions of RSS clearly specify that certain values default to text/plain, and Universal Feed Parser should respect this, but it doesn’t yet.
The language of this piece of content.
language is supposed to be a language code, as specified by RFC 3066, but publishers have been known to publish random values like English or German. Universal Feed Parser does not do any parsing or normalization of language codes.
language may come from the element’s xml:lang attribute, or it may inherit from a parent element’s xml:lang, or the Content-Language HTTP header. If the feed does not specify a language, language will be None, the Python null value.
The original base URI for links within this piece of content.
base is only useful in rare situations and can usually be ignored. It is the original base URI for this value, as specified by the element’s xml:base attribute, or a parent element’s xml:base, or the appropriate HTTP header, or the URI of the feed. (See advanced.base for more details.) By the time you see it, Universal Feed Parser has already resolved relative links in all values where it makes sense to do so. Clients should never need to manually resolve relative links.
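For a client that can show only one body per entry, the choice among several content elements could be sketched as follows. This is one possible policy, not the library's behavior: the helper name, the preference order, and the fallback to summary are all assumptions, and the sample entries are made-up dictionaries shaped like d.entries[i].

```python
# Assumed preference order; a client would adjust this to what it can render.
PREFERRED_TYPES = ('application/xhtml+xml', 'text/html', 'text/plain')


def best_body(entry, preferred=PREFERRED_TYPES):
    """Pick one displayable body from an entry-like dictionary."""
    contents = entry.get('content', [])
    for wanted in preferred:
        for item in contents:
            if item.get('type') == wanted:
                return item.get('value')
    # No content element of a preferred type: fall back to the summary, if any.
    return entry.get('summary')


# Made-up entries shaped like d.entries[i] (assumptions, not real output).
entry_with_content = {
    'summary': 'Short summary',
    'content': [
        {'type': 'text/plain', 'value': 'Plain body'},
        {'type': 'text/html', 'value': '<p>HTML body</p>'},
    ],
}
entry_with_summary_only = {'summary': 'Short summary'}

print(best_body(entry_with_content))
print(best_body(entry_with_summary_only))
```

A client able to render several types would instead loop over every item in content and render each one it understands.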
filename=”reference-entry-published.html”
The date this entry was first published, as a string in the same format as it was published in the original feed.
This element is parsed as a date and stored in reference.entry.published_parsed.
filename=”reference-entry-published_parsed.html”
The date this entry was first published, as a standard Python 9-tuple.
filename=”reference-entry-updated.html”
The date this entry was last updated, as a string in the same format as it was published in the original feed.
This element is parsed as a date and stored in reference.entry.updated_parsed.
filename=”reference-entry-updated_parsed.html”
The date this entry was last updated, as a standard Python 9-tuple.
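The 9-tuple returned in the *_parsed values can be turned into a datetime object with the standard library. The sketch below assumes the tuple is expressed in UTC (an assumption stated here, not confirmed by the text above) and uses a hand-built time.struct_time in place of d.entries[i].updated_parsed. The helper name is hypothetical.

```python
import calendar
import time
from datetime import datetime, timezone


def to_datetime(parsed):
    """Convert a 9-tuple (assumed to be in UTC) to an aware datetime."""
    if parsed is None:
        # The date was missing or could not be parsed.
        return None
    # calendar.timegm interprets the tuple as UTC, unlike time.mktime.
    return datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)


# Hand-built stand-in for d.entries[i].updated_parsed.
sample = time.struct_time((2005, 11, 9, 11, 56, 34, 2, 313, 0))
print(to_datetime(sample).isoformat())  # 2005-11-09T11:56:34+00:00
```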
filename=”reference-entry-created.html”
The date this entry was first created (drafted), as a string in the same format as it was published in the original feed.
This element is parsed as a date and stored in reference.entry.created_parsed.
filename=”reference-entry-created_parsed.html”
The date this entry was first created (drafted), as a standard Python 9-tuple.
filename=”reference-entry-expired.html”
The date this entry is set to expire, as a string in the same format as it was published in the original feed.
This element is parsed as a date and stored in reference.entry.expired_parsed.
This element is rare. It only existed in RSS 0.93, and it was never widely implemented by publishers. Most clients ignore it in favor of user-defined expiration algorithms.
filename=”reference-entry-expired_parsed.html”
The date this entry is set to expire, as a standard Python 9-tuple.
This element is rare. It only existed in RSS 0.93, and it was never widely implemented by publishers. Most clients ignore it in favor of user-defined expiration algorithms.
filename=”reference-entry-id.html”
A globally unique identifier for this entry.
If this is a relative URI, it is resolved according to a set of rules.
The author of this entry.
A dictionary with details about the author of this entry.
The name of this entry’s author.
The URL of this entry’s author. This can be the author’s home page, or a contact page with a webmail form.
If this is a relative URI, it is resolved according to a set of rules.
The email address of this entry’s author.
filename=”reference-entry-contributors.html”
A list of contributors (secondary authors) to this entry.
The name of this contributor.
The URL of this contributor. This can be the contributor’s home page, or a contact page with a webmail form.
If this is a relative URI, it is resolved according to a set of rules.
The email address of this contributor.
filename=”reference-entry-enclosures.html”
A list of links to external files associated with this entry.
Some aggregators automatically download enclosures (although this technique has known problems). Some aggregators render each enclosure as a link. Most aggregators ignore them.
The RSS specification states that there can be at most one enclosure per item. However, because some feeds break this rule, Universal Feed Parser captures all of them and makes them available as a list.
The URL of the linked file.
If this is a relative URI, it is resolved according to a set of rules.
The length of the linked file.
The content type of the linked file.
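Because enclosures is always handled as a list, client code can loop over it without special-casing feeds that contain more than one enclosure per item. The sketch below is illustrative only: the key names href, length, and type, and the idea that length may arrive as a string or be missing, are assumptions, and the sample data is made up.

```python
def enclosure_rows(enclosures):
    """Return (href, type, size_in_bytes_or_None) for each enclosure."""
    rows = []
    for enc in enclosures:
        try:
            # Assumed: length may be a string, empty, or absent.
            size = int(enc.get('length', ''))
        except (TypeError, ValueError):
            size = None
        rows.append((enc.get('href'), enc.get('type'), size))
    return rows


# Made-up data shaped like d.entries[i].enclosures.
sample_enclosures = [
    {'href': 'http://example.org/movie.mp4', 'length': '1234',
     'type': 'video/mpeg4'},
    {'href': 'http://example.org/audio.mp3', 'type': 'audio/mpeg'},
]

for row in enclosure_rows(sample_enclosures):
    print(row)
```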
filename=”reference-entry-publisher.html”
The publisher of the entry.
filename=”reference-entry-publisher_detail.html”
A dictionary with details about the entry publisher.
The name of this entry’s publisher.
The URL of this entry’s publisher. This can be the publisher’s home page, or a contact page with a webmail form.
If this is a relative URI, it is resolved according to a set of rules.
The email address of this entry’s publisher.
A list of dictionaries that contain details of the categories for the entry.
Note
Prior to version 4.0, Universal Feed Parser exposed categories in feed.category (the primary category) and feed.categories (a list of tuples containing the domain and term of each category). These uses are still supported for backward compatibility, but you will not see them in the parsed results unless you explicitly ask for them.
The category term (keyword).
The category scheme (domain).
A human-readable label for the category.
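A short sketch of reading the category dictionaries follows. It assumes the keys are named term, scheme, and label, matching the descriptions above, and it prefers the human-readable label when one is present. The helper name and sample data are hypothetical.

```python
def category_names(tags):
    """Return a display name for each category: label if set, else term."""
    names = []
    for tag in tags:
        name = tag.get('label') or tag.get('term')
        if name:
            names.append(name)
    return names


# Made-up data shaped like the list of category dictionaries.
sample_tags = [
    {'term': 'python', 'scheme': 'http://example.org/tags', 'label': 'Python'},
    {'term': 'feeds', 'scheme': None, 'label': None},
]

print(category_names(sample_tags))  # ['Python', 'feeds']
```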
filename=”reference-entry-source.html”
A dictionary with details about the source of the entry.
The author of the source of this entry.
A dictionary containing details about the author of the source of this entry.
The name of the author of the source of this entry.
The URL of the author of the source of this entry. This can be the author’s home page, or a contact page with a webmail form.
If this is a relative URI, it is resolved according to a set of rules.
The email address of the author of the source of this entry.
A list of contributors to the source of this entry.
The name of a contributor to the source of this entry.
The URL of a contributor to the source of this entry. This can be the contributor’s home page, or a contact page with a webmail form.
If this is a relative URI, it is resolved according to a set of rules.
The email address of a contributor to the source of this entry.
The URL of an icon representing the source of this entry.
If this is a relative URI, it is resolved according to a set of rules.
A globally unique identifier for the source of this entry.
The primary permanent link of the source of this entry.
A list of all links defined by the source of this entry.
The relationship of a link defined by the source of this entry.
Atom 1.0 defines five standard link relationships and describes the process for registering others. Here are the five standard rel values:
The content type of the page pointed to by a link defined by the source of this entry.
The URL of the page pointed to by a link defined by the source of this entry.
If this is a relative URI, it is resolved according to a set of rules.
The title of a link defined by the source of this entry.
The URL of a logo representing the source of this entry.
If this is a relative URI, it is resolved according to a set of rules.
A human-readable copyright statement for the source of this entry.
A dictionary containing details about the copyright statement for the source of this entry.
Same as entries[i].source.rights.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
The content type of the copyright statement for the source of this entry.
Most likely values for type:
For Atom feeds, the content type is taken from the type attribute, which defaults to text/plain if not specified. For RSS feeds, the content type is auto-determined by inspecting the content, and defaults to text/html. Note that this may cause silent data loss if the value contains plain text with angle brackets. There is nothing I can do about this problem; it is a limitation of RSS.
Future enhancement: some versions of RSS clearly specify that certain values default to text/plain, and Universal Feed Parser should respect this, but it doesn’t yet.
The language of the copyright statement for the source of this entry.
language is supposed to be a language code, as specified by RFC 3066, but publishers have been known to publish random values like English or German. Universal Feed Parser does not do any parsing or normalization of language codes.
language may come from the element’s xml:lang attribute, or it may inherit from a parent element’s xml:lang, or the Content-Language HTTP header. If the feed does not specify a language, language will be None, the Python null value.
The original base URI for links within the copyright statement for the source of this entry.
base is only useful in rare situations and can usually be ignored. It is the original base URI for this value, as specified by the element’s xml:base attribute, or a parent element’s xml:base, or the appropriate HTTP header, or the URI of the feed. (See advanced.base for more details.) By the time you see it, Universal Feed Parser has already resolved relative links in all values where it makes sense to do so. Clients should never need to manually resolve relative links.
A subtitle, tagline, slogan, or other short description of the source of this entry.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
A dictionary containing details about the subtitle for the source of this entry.
Same as entries[i].source.subtitle.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
The content type of the subtitle of the source of this entry.
Most likely values for type:
For Atom feeds, the content type is taken from the type attribute, which defaults to text/plain if not specified. For RSS feeds, the content type is auto-determined by inspecting the content, and defaults to text/html. Note that this may cause silent data loss if the value contains plain text with angle brackets. There is nothing I can do about this problem; it is a limitation of RSS.
Future enhancement: some versions of RSS clearly specify that certain values default to text/plain, and Universal Feed Parser should respect this, but it doesn’t yet.
The language of the subtitle of the source of this entry.
language is supposed to be a language code, as specified by RFC 3066, but publishers have been known to publish random values like English or German. Universal Feed Parser does not do any parsing or normalization of language codes.
language may come from the element’s xml:lang attribute, or it may inherit from a parent element’s xml:lang, or the Content-Language HTTP header. If the feed does not specify a language, language will be None, the Python null value.
The original base URI for links within the subtitle of the source of this entry.
base is only useful in rare situations and can usually be ignored. It is the original base URI for this value, as specified by the element’s xml:base attribute, or a parent element’s xml:base, or the appropriate HTTP header, or the URI of the feed. (See advanced.base for more details.) By the time you see it, Universal Feed Parser has already resolved relative links in all values where it makes sense to do so. Clients should never need to manually resolve relative links.
The title of the source of this entry.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
A dictionary containing details about the title for the source of this entry.
Same as entries[i].source.title.
If this contains HTML or XHTML, it is sanitized by default.
If this contains HTML or XHTML, certain (X)HTML elements within this value may contain relative URIs. If so, they are resolved according to a set of rules.
The content type of the title of the source of this entry.
Most likely values for type:
For Atom feeds, the content type is taken from the type attribute, which defaults to text/plain if not specified. For RSS feeds, the content type is auto-determined by inspecting the content, and defaults to text/html. Note that this may cause silent data loss if the value contains plain text with angle brackets. There is nothing I can do about this problem; it is a limitation of RSS.
Future enhancement: some versions of RSS clearly specify that certain values default to text/plain, and Universal Feed Parser should respect this, but it doesn’t yet.
The language of the title of the source of this entry.
language is supposed to be a language code, as specified by RFC 3066, but publishers have been known to publish random values like English or German. Universal Feed Parser does not do any parsing or normalization of language codes.
language may come from the element’s xml:lang attribute, or it may inherit from a parent element’s xml:lang, or the Content-Language HTTP header. If the feed does not specify a language, language will be None, the Python null value.
The original base URI for links within the title of the source of this entry.
base is only useful in rare situations and can usually be ignored. It is the original base URI for this value, as specified by the element’s xml:base attribute, or a parent element’s xml:base, or the appropriate HTTP header, or the URI of the feed. (See advanced.base for more details.) By the time you see it, Universal Feed Parser has already resolved relative links in all values where it makes sense to do so. Clients should never need to manually resolve relative links.
The date the source of this entry was last updated, as a string in the same format as it was published in the original feed.
This element is parsed as a date and stored in entries[i].source.updated_parsed.
The date the source of this entry was last updated, as a standard Python 9-tuple.
filename=”reference-entry-comments.html”
A URL of the HTML comment submission page associated with this entry.
If this is a relative URI, it is resolved according to a set of rules.
filename=”reference-entry-license.html”
A URL of the license under which this entry is distributed.
If this is a relative URI, it is resolved according to a set of rules.
filename=”reference-entry-xfn.html”
A list of XFN relationships found in this entry’s HTML content.
entries[i].xfn is a list. Each list item represents a single person and may contain the following values:
A list of relationships for this person. Each list item is a string, either one of the constants defined in the XFN 1.1 profile or one of these variations.
The URI for this person.
If this is a relative URI, it is resolved according to a set of rules.
The name of this person, a string.
filename=”reference-entry-vcard.html”
An RFC 2426-compliant vCard derived from hCard information found in this entry’s HTML content.
filename=”reference-version.html”
The format and version of the feed.
Here is the complete list of known feed types and versions that may be returned in version:
If the feed type is completely unknown, version will be an empty string.
Tip
This element always exists, although it may be an empty string if the version cannot be determined.
filename=”reference-namespaces.html”
A dictionary of all XML namespaces defined in the feed, as {prefix: namespaceURI}.
Note
The prefixes listed in the namespaces dictionary may not match the prefixes defined in the original feed. See advanced.namespaces for more details.
Tip
This element always exists, although it may be an empty dictionary if the feed does not define any namespaces (such as an RSS 2.0 feed with no extensions).
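Since the prefixes in the dictionary may differ from those in the original feed, it is usually safer to look up a namespace by its URI than by its prefix. The following sketch shows one way to do that; the helper name and the sample dictionary are assumptions made for illustration.

```python
def prefix_for(namespaces, uri):
    """Return the prefix mapped to uri, or None if the namespace is absent."""
    for prefix, namespace_uri in namespaces.items():
        if namespace_uri == uri:
            return prefix
    return None


# Made-up data shaped like d.namespaces ({prefix: namespaceURI}).
sample_namespaces = {
    '': 'http://www.w3.org/2005/Atom',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

print(prefix_for(sample_namespaces, 'http://purl.org/dc/elements/1.1/'))  # dc
```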
filename=”reference-encoding.html”
The character encoding that was used to parse the feed.
Note
The process by which Universal Feed Parser determines the character encoding of the feed is explained in advanced.encoding.
Tip
This element always exists, although it may be an empty string if the character encoding cannot be determined.
filename=”reference-status.html”
The HTTP status code that was returned by the web server when the feed was fetched.
If the feed was redirected from its original URL, status will contain the redirect status code, not the final status code.
If status is 301, the feed was permanently redirected to a new URL. Clients should update their address book to request the new URL from now on.
If status is 410, the feed is gone. Clients should stop polling the feed.
Tip
status will only be present if the feed was retrieved from a web server. If the feed was parsed from a local file or from a string in memory, status will not be present.
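The rules above for 301 and 410 could be applied in a polling client roughly as follows. This is a sketch under stated assumptions: the helper name and the returned action labels are hypothetical, the result is treated as a dictionary-like object, and href is read as the final URL of the feed.

```python
def plan_next_fetch(result, requested_url):
    """Decide what a polling client should do after a fetch.

    Returns a pair (action, url) where action is 'keep', 'update' or 'stop'.
    """
    status = result.get('status')
    if status is None:
        # Parsed from a local file or a string: there is no HTTP status.
        return ('keep', requested_url)
    if status == 410:
        # The feed is gone: stop polling it.
        return ('stop', None)
    if status == 301:
        # Permanent redirect: remember the new address from now on.
        return ('update', result.get('href', requested_url))
    return ('keep', requested_url)


# Made-up results shaped like the object returned by the parser.
moved = {'status': 301, 'href': 'http://example.org/new.xml'}
gone = {'status': 410}

print(plan_next_fetch(moved, 'http://example.org/old.xml'))
print(plan_next_fetch(gone, 'http://example.org/old.xml'))
```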
filename=”reference-href.html”
The final URL of the feed that was parsed.
If the feed was redirected from the original requested address, href will contain the final (redirected) address.
Tip
href will only be present if the feed was retrieved from a web server. If the feed was parsed from a local file or from a string in memory, href will not be present.
filename=”reference-etag.html”
The ETag of the feed, as specified in the HTTP headers.
The purpose of etag is explained more fully in http.etag.
Tip
etag will only be present if the feed was retrieved from a web server, and only if the web server provided an ETag HTTP header for the feed. If the feed was parsed from a local file or from a string in memory, etag will not be present.
filename=”reference-modified.html”
The last-modified date of the feed, as specified in the HTTP headers.
The purpose of modified is explained more fully in http.etag.
Tip
modified will only be present if the feed was retrieved from a web server, and only if the web server provided a Last-Modified HTTP header for the feed. If the feed was parsed from a local file or from a string in memory, modified will not be present.
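Because etag and modified may each be absent, code that reuses them on the next request should check for both. The sketch below builds keyword arguments for a later call to parse, which accepts etag and modified arguments; the helper name is hypothetical and the sample results are made-up dictionaries.

```python
def conditional_get_kwargs(previous):
    """Collect etag/modified from an earlier result, skipping missing ones."""
    kwargs = {}
    for key in ('etag', 'modified'):
        value = previous.get(key)
        if value:
            kwargs[key] = value
    return kwargs


# Made-up results shaped like the object returned by an earlier fetch.
with_etag = {'etag': '"6c132-941-ad7e3080"'}
from_local_file = {}

print(conditional_get_kwargs(with_etag))
print(conditional_get_kwargs(from_local_file))

# Intended use (not executed here, since it needs network access):
#     d2 = feedparser.parse(url, **conditional_get_kwargs(d))
```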
filename=”reference-headers.html”
A dictionary of all the HTTP headers received from the web server when retrieving the feed.
Tip
headers will only be present if the feed was retrieved from a web server. If the feed was parsed from a local file or from a string in memory, headers will not be present.
filename=”reference-bozo.html”
An integer, either 1 or 0. Set to 1 if the feed is not well-formed XML, and 0 otherwise.
See advanced.bozo for more details on the bozo bit.
Tip
bozo may not be present. Some platforms, such as Mac OS X 10.2 and some versions of FreeBSD, do not include an XML parser in their Python distributions. Universal Feed Parser will still work on these platforms, but it will not be able to detect whether a feed is well-formed. However, it can detect whether a feed’s character encoding is incorrectly declared. (This is done in Python, not by the XML parser.) See advanced.encoding for details.
filename=”reference-bozo_exception.html”
The exception raised when attempting to parse a non-well-formed feed.
See advanced.bozo for more details.
Tip
bozo_exception will only be present if bozo is 1.
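Since bozo may be missing entirely and bozo_exception exists only when bozo is 1, a defensive check might look like the sketch below. The helper name and the three state labels are assumptions made for illustration, and the sample results are made-up dictionaries.

```python
def well_formedness(result):
    """Return (state, exception) for a parsed result.

    state is 'unknown' when bozo is absent, 'malformed' when bozo is 1,
    and 'well-formed' when bozo is 0.
    """
    if 'bozo' not in result:
        # No XML parser was available, so well-formedness was not checked.
        return ('unknown', None)
    if result['bozo']:
        return ('malformed', result.get('bozo_exception'))
    return ('well-formed', None)


# Made-up results shaped like the object returned by the parser.
broken = {'bozo': 1, 'bozo_exception': ValueError('mismatched tag')}
clean = {'bozo': 0}

print(well_formedness(broken)[0])  # malformed
print(well_formedness(clean)[0])   # well-formed
```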
filename=”license.html”
Universal Feed Parser documentation is copyright 2004-2008 Mark Pilgrim. All rights reserved.
Redistribution and use in source (XML DocBook) and compiled forms (SGML, HTML, PDF, PostScript, RTF and so forth) with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code (XML DocBook) must retain the above copyright notice, this list of conditions and the following disclaimer unmodified.
2. Redistributions in compiled form (transformed to other DTDs, converted to PDF, PostScript, RTF and other formats) must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS DOCUMENTATION IS PROVIDED BY THE AUTHOR AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENTATION, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.