WeKeyPedia python toolkit documentation

The wekeypedia python toolkit is a set of classes and helpers written over the course of the wekeypedia project. Its main purpose is to give some shortcuts back to the scientific community. We hope this work will save future data scientists and web scrapers time on the tedious parts of the work, so that they can spend more of it on the fun parts and on conducting studies with wikipedia material.

Its main features are:

  • data retrieval from the wikipedia API and third-party services (statistics, semantic web, etc.)
  • information extraction from API contents
  • network modeling of graph structures included in pages architecture
  • computation of various metrics (readability, convergence, lsm, etc.)
  • generation of reading maps based on a recommendation system

contents:

Information retrieval

Retrieve and extract information from a wikipedia page

Methods
Create a page handler
WikipediaPage.__init__([title, lang])
WikipediaPage.fetch_info(title[, ...])
Retrieving revisions
WikipediaPage.get_revision([revid, force, ...]) Retrieve the content of a revision by its revision id
WikipediaPage.get_revisions_list([extra_params]) Retrieve all the revisions and their info
WikipediaPage.get_current() Retrieve the content of the current revision
Retrieving parts
WikipediaPage.get_links([extra_params]) Retrieve links contained by a wikipedia page according to the API
WikipediaPage.get_categories([extra_params]) Retrieve a list of all categories used on the provided pages
WikipediaPage.get_langlinks() Retrieve the list of hyperlinks to translations of the current page
WikipediaPage.get_pageviews([fr, to]) Retrieve daily page view statistics from http://stats.grok.se/
Extracting editors
WikipediaPage.get_editors([revisions_list]) Retrieve revisions and extract editors
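Extracting editors amounts to reading the "user" field of each revision record. The snippet below is a minimal sketch of that idea, not the toolkit's implementation: the revision dicts are hand-written stand-ins that borrow the "revid", "user" and "timestamp" keys visible in the API sample of this page, and `count_editors` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical revision records; only the "revid", "user" and "timestamp"
# keys are assumed, and the values are made up for illustration.
revisions = [
    {"revid": 100036516, "user": "Alice", "timestamp": "2007-01-11T18:40:10Z"},
    {"revid": 100042918, "user": "129.24.51.153", "timestamp": "2007-01-11T19:06:02Z"},
    {"revid": 100043000, "user": "Alice", "timestamp": "2007-01-11T19:10:00Z"},
]

def count_editors(revisions):
    """Count how many revisions each user made, skipping records without a user."""
    return Counter(r["user"] for r in revisions if "user" in r)

editors = count_editors(revisions)
```

The resulting counter maps each user name (or IP address for anonymous edits) to a number of revisions, which is a convenient starting point for contribution statistics.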
Retrieving and parsing diff
{
  "comment": "/* Overview */",
  "timestamp": "2007-01-11T19:06:02Z",
  "revid": 100042918,
  "anon": "",
  "user": "129.24.51.153",
  "parentid": 100036516,
  "diff": {
    "to": 100042918,
    "*": "<tr>\n  <td colspan=\"2\" class=\"diff-lineno\">Line 21:</td>\n  <td colspan=\"2\" class=\"diff-lineno\">Line 21:</td>\n</tr>\n<tr>\n  <td class=\"diff-marker\">&#160;</td>\n  <td class=\"diff-context\"></td>\n  <td class=\"diff-marker\">&#160;</td>\n  <td class=\"diff-context\"></td>\n</tr>\n<tr>\n  <td class=\"diff-marker\">&#160;</td>\n  <td class=\"diff-context\"><div>In ordinary use, ''love'' usually refers to interpersonal love, an experience felt by a person for another person. Love often involves caring for or identifying with a person or thing, including oneself (cf. [[narcissism]]).In this use, love is actually the greatest proportion on selfishness, in that one only wants one thing, and that is for another being to be happy, and the one in love will do anything to fulfill this wish. This case however, does not necessarily end in marriage, because another may make the loved one happier, meaning the lover will give the other up.</div></td>\n  <td class=\"diff-marker\">&#160;</td>\n  <td class=\"diff-context\"><div>In ordinary use, ''love'' usually refers to interpersonal love, an experience felt by a person for another person. Love often involves caring for or identifying with a person or thing, including oneself (cf. [[narcissism]]).In this use, love is actually the greatest proportion on selfishness, in that one only wants one thing, and that is for another being to be happy, and the one in love will do anything to fulfill this wish. 
This case however, does not necessarily end in marriage, because another may make the loved one happier, meaning the lover will give the other up.</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>19:06, 11 January 2007 (UTC)19:06, 11 January 2007 (UTC)19:06, 11 January 2007 (UTC)19:06, 11 January 2007 (UTC)19:06, 11 January 2007 (UTC)19:06, 11 January 2007 (UTC)[[User:129.24.51.153|129.24.51.153]] 19:06, 11 January 2007 (UTC)</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>See \"Love is the Real Thing\" on the web addresses below[[User:129.24.51.153|129.24.51.153]]</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>* </div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>[[User:129.24.51.153|129.24.51.153]]NEWS FLASH[[User:129.24.51.153|129.24.51.153]] </div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>WASHINGTON (AP)---\"Using anti-depressants Increases the risk of </div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>Suicidal thoughts and behavior among young people\"---12/06/2006 </div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>19:06, 11 January 2007 (UTC)19:06, 11 January 2007 (UTC)19:06, 11 January 2007 (UTC)~~ </div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>* 
</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div> </div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>Please help volunteer effort to save our impulsive youths: TeenAnswers is for Everyone!</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>19:06, 11 January 2007 (UTC)19:06, 11 January 2007 (UTC)19:06, 11 January 2007 (UTC)19:06, 11 January 2007 (UTC)19:06, 11 January 2007 (UTC)19:06, 11 January 2007 (UTC)~</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>http://groups.google.com/group/TeenAnswers</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>http://groups.google.com/group/answers-for-teens</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>http://groups.yahoo.com/group/TeenAnswers</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>http://groups.yahoo.com/group/answers-for-teens</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>[All \"groups\": 5 permanent, proven monographs &amp; no chat!] 
</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>*</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>[[User:129.24.51.153|129.24.51.153]]Ending suicide/impulsive depression/self-injury is NOW possible!</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>It works, even though you might not agree, but other people's lives</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>are more important than Anyone's subjective opinion![[User:129.24.51.153|129.24.51.153]]</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>*</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>http://s2.excoboard.com/exco/index.php?boardid=24582</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>http://CaptainChurch.proboards57.com</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>http://b4.boards2go.com/boards/board.cgi?user=ChurchCaptain</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>http://www.bev.net/users/homepages/JamesSorrell</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>*</div></td>\n</tr>\n<tr>\n  
<td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>sorrell.james@gmail.com</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>*</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>WASHINGTON - Teens increasingly are getting high with legal drugs like painkillers and mood stimulants, and they're turning to cough syrup as well, says a government survey released Thursday. </div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>The annual study by the National Institute on Drug Abuse, conducted by the University of Michigan, showed mixed results in the nation's longtime campaign against teen drug abuse.</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>It found that while fewer teens overall drank alcohol or used illegal drugs in the last year, a small but growing number were popping prescription painkillers like OxyContin and Vicodin and stimulants like </div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>Ritalin.</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>As many as one in every 14 high school seniors said they used cold medicine \"fairly recently\" to get high, the study found.</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>It was the first year that the government tracked the frequency of teens who 
reported getting high from over-the-counter medicine for coughs and colds.</div></td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"></td>\n</tr>\n<tr>\n  <td class=\"diff-marker\">&#160;</td>\n  <td class=\"diff-context\"></td>\n  <td class=\"diff-marker\">&#160;</td>\n  <td class=\"diff-context\"></td>\n</tr>\n<tr>\n  <td class=\"diff-marker\">&#160;</td>\n  <td class=\"diff-context\"><div>The very existence of love is itself subject to debate. Some categorically reject the notion as false or meaningless. Others call it a recently-invented abstraction, sometimes dating the \"invention\" to courtly Europe during or after the middle ages, although this is contradicted by the sizable body of ancient love poetry.&lt;ref&gt;[http://www.TrueOpenLove.org/reference/AncientLovePoetry.html Ancient Love Poetry] - TrueOpenLove.org&lt;/ref&gt; Others maintain that love really exists, and is not an abstraction, but is undefinable, being an essence which is [[spirituality|spiritual]] or [[metaphysics|metaphysical]] in nature. Some psychologists maintain that love is the action of lending one's \"boundary\" or \"[[self-esteem]]\" to another. Others attempt to define love by applying the definition to everyday life.</div></td>\n  <td class=\"diff-marker\">&#160;</td>\n  <td class=\"diff-context\"><div>The very existence of love is itself subject to debate. Some categorically reject the notion as false or meaningless. 
Others call it a recently-invented abstraction, sometimes dating the \"invention\" to courtly Europe during or after the middle ages, although this is contradicted by the sizable body of ancient love poetry.&lt;ref&gt;[http://www.TrueOpenLove.org/reference/AncientLovePoetry.html Ancient Love Poetry] - TrueOpenLove.org&lt;/ref&gt; Others maintain that love really exists, and is not an abstraction, but is undefinable, being an essence which is [[spirituality|spiritual]] or [[metaphysics|metaphysical]] in nature. Some psychologists maintain that love is the action of lending one's \"boundary\" or \"[[self-esteem]]\" to another. Others attempt to define love by applying the definition to everyday life.</div></td>\n</tr>\n\n<!-- diff cache key enwiki:diff:version:1.11a:oldid:100036516:newid:100042918 -->\n",
    "from": 100036516
  }
}
WikipediaPage.get_diff([rev_id]) Retrieve diff content between a revision and its predecessor.
WikipediaPage.get_diff_full([rev_id]) Retrieve the full json response from a request for diff.
WikipediaPage.extract_plusminus(diff_html) Transform HTML Wikipedia API response into a plus/minus dict.
WikipediaPage.count_stems(sentences[, ...]) Count the number of stems in a list of sentences.
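To give an idea of what turning the diff HTML into a plus/minus structure involves, here is a minimal sketch built on the standard library only. It is not the code behind `extract_plusminus`: the class name, the function name and the "added"/"deleted" keys are assumptions made for this illustration, and the `diff-deletedline` class is assumed to mirror the `diff-addedline` class seen in the sample response.

```python
from html.parser import HTMLParser

class PlusMinusParser(HTMLParser):
    """Collect the text of added and deleted lines from a diff table."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.result = {"added": [], "deleted": []}
        self._bucket = None   # list currently being filled, if any
        self._chunks = []     # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        css = dict(attrs).get("class") or ""
        if "diff-addedline" in css:
            self._bucket = self.result["added"]
        elif "diff-deletedline" in css:
            self._bucket = self.result["deleted"]
        else:
            self._bucket = None   # context, marker or empty cell
        self._chunks = []

    def handle_data(self, data):
        if self._bucket is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._bucket is not None:
            text = "".join(self._chunks).strip()
            if text:
                self._bucket.append(text)
            self._bucket = None

def extract_plusminus_sketch(diff_html):
    """Return {"added": [...], "deleted": [...]} for a diff HTML fragment."""
    parser = PlusMinusParser()
    parser.feed(diff_html)
    parser.close()
    return parser.result

# Shortened, hand-written fragment in the shape of the sample response.
sample = (
    '<tr><td class="diff-marker">+</td>'
    '<td class="diff-addedline"><div>See "Love is the Real Thing"</div></td></tr>'
    '<tr><td class="diff-marker">-</td>'
    '<td class="diff-deletedline"><div>old sentence</div></td></tr>'
    '<tr><td class="diff-context"><div>unchanged</div></td></tr>'
)

plusminus = extract_plusminus_sketch(sample)
```

Context cells and marker cells are ignored on purpose, so only the text that actually changed between the two revisions is kept.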
Page views
WikipediaPage.get_pageviews([fr, to]) Retrieve daily page view statistics from http://stats.grok.se/
Function helpers
url2title(url)

Transform a URL into a title

Parameters:url (string) –
Returns:title
Return type:string
url2lang(url)

Extract the language code from a URL

Parameters:url (string) –
Returns:lang
Return type:string
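
As an illustration of what these two helpers do, the sketch below derives a title and a language code from a wikipedia URL with the standard library. It assumes URLs of the form `https://<lang>.wikipedia.org/wiki/<Title>` and Python 3; the names `url_to_title` and `url_to_lang` are hypothetical, and the toolkit's own helpers may treat underscores or other URL shapes differently.

```python
from urllib.parse import urlparse, unquote

def url_to_title(url):
    """Return the page title encoded in a wikipedia article URL."""
    path = urlparse(url).path           # e.g. "/wiki/Amour_courtois"
    title = path.rsplit("/", 1)[-1]     # keep the last path segment
    return unquote(title).replace("_", " ")

def url_to_lang(url):
    """Return the language subdomain of a wikipedia URL, e.g. "fr"."""
    host = urlparse(url).netloc         # e.g. "fr.wikipedia.org"
    return host.split(".")[0]

title = url_to_title("https://fr.wikipedia.org/wiki/Amour_courtois")
lang = url_to_lang("https://fr.wikipedia.org/wiki/Amour_courtois")
```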

Make custom queries to the wikipedia api

The wekeypedia toolkit includes a class for issuing finer-grained queries tailored to specific information needs that do not generalize well. For example, most of the implemented classes handle objects one at a time, whereas for optimization purposes it is sometimes necessary to refine queries in order to reduce their number.

class api(lang='en')
Parameters:lang (string, optional) –
get(query, method='get')
Parameters:query (dict) –
Returns:result
Return type:dict
Examples

Here is a piece of code that retrieves all the links included in the Wisdom page and checks whether each of these links (n=184) has an equivalent in the French wikipedia. It does so by asking for the langlinks of 50 pages at once instead of building one query per link. In this case, the network load drops from 184 queries to 4. #win

from __future__ import division
from math import ceil
from collections import defaultdict

import wekeypedia
from wekeypedia.wikipedia.api import api as api

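def api_bunch(page_titles, lang, req):
  """Query the API in batches of 50 titles and merge their langlinks."""
  results = defaultdict(list)

  w = api(lang)

  for i in range(0, int(ceil(len(page_titles)/50))):
    # copy the request so continuation values never leak into the next batch
    param = dict(req)
    # a slice end is exclusive: i*50+50 keeps all 50 titles of the batch
    param["titles"] = "|".join(page_titles[i*50:i*50+50])

    while True:
      r = w.get(param)

      # extend instead of overwrite so continued results accumulate
      for pageid, p in r["query"]["pages"].items():
        if "langlinks" in p:
          results[p["title"]].extend(p["langlinks"])

      if "continue" in r:
        param.update(r["continue"])
      else:
        break

  return results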

def get_lang_projection(pages, source, target):
  """
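  Retrieve all correspondences from a set of pages into another language

  Parameters
  ----------
  pages : list
    List of page titles
  source : string
    Language code of the wikipedia the pages belong to
  target : string
    Language code of the wikipedia to project the pages onto

  Returns
  -------
  correspondences : list
    List of `(redirect(initial page), corresponding page)`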
  """

  params = {
    "redirects": "",
    "format": "json",
    "action": "query",
    "prop": "info|langlinks",
    "lllimit": 500,
    "lllang": target,
    "continue":""
  }

  r = api_bunch(pages, source, params)

  return [ (page, t["*"]) for page,tt in r.items() for t in tt if t["lang"] == target ]

u = wekeypedia.WikipediaPage("Wisdom")
pages = list(set([ x["title"] for x in u.get_links() ]))

get_lang_projection(pages, "en", "fr")
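
The batching arithmetic behind this example can be checked in isolation, without any network access. The sketch below uses stand-in titles and a hypothetical `batches` helper; it only shows that 184 titles split into groups of at most 50 give 4 queries.

```python
from math import ceil

def batches(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

titles = ["Page %d" % n for n in range(184)]   # stand-in titles
chunks = list(batches(titles))
sizes = [len(c) for c in chunks]
expected_queries = int(ceil(len(titles) / 50))
```

Stepping the range by the batch size avoids computing slice bounds by hand, which is where off-by-one mistakes tend to creep in.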

wikipedia user

class WikipediaUser(lang='en', name=None)

create a new wikipedia user object

Keyword Arguments:
 
  • name (string)
  • lang (string)

Example

>>> from wekeypedia.wikipedia_user import WikipediaUser as User
>>>
>>> u = User(name="taniki")
fetch_contribs()

get all contributions from a user

Computing metrics

Linguistic Style Matching

lsm.compare(text1, text2) Compare two texts using the Linguistic Style Matching (LSM) [1] method
lsm.extract_categories(text) Extract percentages of LSM word categories over total words counting
lsm.extract_categories_raw(text) Extract raw counting of LSM word categories

installation (with virtual env)

$ virtualenv e/py --no-site-packages
$ source e/py/bin/activate
(py)$ pip install wekeypedia

todo list: