Content Of Infobox Of Wikipedia
Solution 1:
Another great MediaWiki parser is mwparserfromhell.
In [1]: import mwparserfromhell
In [2]: import pywikibot
In [3]: enwp = pywikibot.Site('en','wikipedia')
In [4]: page = pywikibot.Page(enwp, 'Waking Life')
In [5]: wikitext = page.get()
In [6]: wikicode = mwparserfromhell.parse(wikitext)
In [7]: templates = wikicode.filter_templates()
In [8]: templates?
Type: list
String Form:[u'{{Use mdy dates|date=September 2012}}', u"{{Infobox film\n| name = Waking Life\n| im <...> critic film|waking-life|Waking Life}}', u'{{Richard Linklater}}', u'{{DEFAULTSORT:Waking Life}}']
Length: 31
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items
In [10]: templates[:2]
Out[10]:
[u'{{Use mdy dates|date=September 2012}}',
u"{{Infobox film\n| name = Waking Life\n| image = Waking-Life-Poster.jpg\n| image_size = 220px\n| alt =\n| caption = Theatrical release poster\n| director = [[Richard Linklater]]\n| producer = [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West\n| writer = Richard Linklater\n| starring = [[Wiley Wiggins]]\n| music = Glover Gill\n| cinematography = Richard Linklater<br />[[Tommy Pallotta]]\n| editing = Sandra Adair\n| studio = [[Thousand Words]]\n| distributor = [[Fox Searchlight Pictures]]\n| released = {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}\n| runtime = 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>\n| country = United States\n| language = English\n| budget =\n| gross = $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>\n}}"]
In [11]: infobox_film = templates[1]
In [12]: for param in infobox_film.params:
   ....:     print(param.name, param.value)
name Waking Life
image Waking-Life-Poster.jpg
image_size 220px
alt
caption Theatrical release poster
director [[Richard Linklater]]
producer [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West
writer Richard Linklater
starring [[Wiley Wiggins]]
music Glover Gill
cinematography Richard Linklater<br />[[Tommy Pallotta]]
editing Sandra Adair
studio [[Thousand Words]]
distributor [[Fox Searchlight Pictures]]
released {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}
runtime 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>
country United States
language English
budget
gross $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>
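Don't forget that each param's name and value are mwparserfromhell Wikicode objects too, so you can call methods such as strip_code() or filter_wikilinks() on them.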
Solution 2:
Instead of reinventing the wheel, check out DBpedia, which has already extracted all Wikipedia infoboxes into an easily parsable database format.
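As a rough sketch of what querying DBpedia could look like, the snippet below builds a request URL for DBpedia's public SPARQL endpoint that asks for the properties of one resource. The helper name build_dbpedia_query_url, the Waking_Life resource, and the filter on the http://dbpedia.org/property/ namespace (where raw infobox properties are usually published) are assumptions for illustration. The network call itself is left commented out so the code runs offline.

```python
from urllib.parse import quote, urlencode

DBPEDIA_SPARQL_ENDPOINT = "https://dbpedia.org/sparql"


def build_dbpedia_query_url(article_title):
    """Build a SPARQL request URL listing infobox-style properties.

    Hypothetical helper: the name and the query shape are illustrative only.
    """
    # DBpedia resource names follow Wikipedia titles, with spaces as underscores.
    resource = "http://dbpedia.org/resource/" + quote(article_title.replace(" ", "_"))
    query = (
        "SELECT ?property ?value WHERE { "
        "<%s> ?property ?value . "
        'FILTER(STRSTARTS(STR(?property), "http://dbpedia.org/property/")) '
        "}" % resource
    )
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_SPARQL_ENDPOINT + "?" + urlencode(params)


url = build_dbpedia_query_url("Waking Life")
print(url)

# To actually run the query (requires network access):
# import json, urllib.request
# with urllib.request.urlopen(url) as response:
#     rows = json.load(response)["results"]["bindings"]
```

The property names DBpedia exposes do not always match the infobox field names one to one, so it is worth inspecting the returned rows before relying on a particular key.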
Solution 3:
An infobox is a template transcluded with double curly brackets. Let's look at a template and how it is transcluded in wikitext:
Infobox film
{{Infobox film
| name = Actresses
| image = Actrius film poster.jpg
| alt =
| caption = Catalan language film poster
| native_name = ([[Catalan language|Catalan]]: '''''Actrius''''')
| director = [[Ventura Pons]]
| producer = Ventura Pons
| writer = [[Josep Maria Benet i Jornet]]
| screenplay = Ventura Pons
| story =
| based_on = {{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}
| starring = {{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna Lizaran]]|[[Mercè Pons]]}}
| narrator = <!-- or: |narrators = -->
| music = Carles Cases
| cinematography = Tomàs Pladevall
| editing = Pere Abadal
| production_companies = {{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de Cultura]]|[[Televisión Española]]}}
| distributor = [[Buena Vista International]]
| released = {{film date|df=yes|1997|1|17|[[Spain]]}}
| runtime = 100 minutes
| country = Spain
| language = Catalan
| budget =
| gross = <!--(please use condensed and rounded values, e.g. "£11.6 million" not "£11,586,221")-->
}}
Pywikibot offers two high-level ways on a Page object to parse the content of any template inside the wikitext. Both use mwparserfromhell if it is installed; otherwise a regex is used, but the regex may fail for nested templates with depth > 3:
raw_extracted_templates
raw_extracted_templates is a Page property which returns a list of tuples with two items each. The first item is the template identifier as a str, 'Infobox film' for example. The second item is an OrderedDict with the template parameter identifiers as keys and their assignments as values. For example, the template fields
| name = FILM TITLE
| image = FILM TITLE poster.jpg
| caption = Theatrical release poster
result in an OrderedDict such as
OrderedDict([('name', 'FILM TITLE'), ('image', 'FILM TITLE poster.jpg'), ('caption', 'Theatrical release poster')])
Now, how do you get it with Pywikibot?
from pprint import pprint
import pywikibot

site = pywikibot.Site('wikipedia:en')  # or pywikibot.Site('en', 'wikipedia') for older releases
page = pywikibot.Page(site, 'Actrius')
all_templates = page.raw_extracted_templates
# Each entry is a (template name, OrderedDict of parameters) tuple
for tmpl, params in all_templates:
    if tmpl == 'Infobox film':
        pprint(params)
This will print
OrderedDict([('name', 'Actresses'),
             ('image', 'Actrius film poster.jpg'),
             ('alt', ''),
             ('caption', 'Catalan language film poster'),
             ('native_name',
              "([[Catalan language|Catalan]]: '''''Actrius''''')"),
             ('director', '[[Ventura Pons]]'),
             ('producer', 'Ventura Pons'),
             ('writer', '[[Josep Maria Benet i Jornet]]'),
             ('screenplay', 'Ventura Pons'),
             ('story', ''),
             ('based_on',
              "{{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}"),
             ('starring',
              '{{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
              'Lizaran]]|[[Mercè Pons]]}}'),
             ('narrator', ''),
             ('music', 'Carles Cases'),
             ('cinematography', 'Tomàs Pladevall'),
             ('editing', 'Pere Abadal'),
             ('production_companies',
              '{{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
              'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - '
              'Departament de Cultura]]|[[Televisión Española]]}}'),
             ('distributor', '[[Buena Vista International]]'),
             ('released', '{{film date|df=yes|1997|1|17|[[Spain]]}}'),
             ('runtime', '100 minutes'),
             ('country', 'Spain'),
             ('language', 'Catalan'),
             ('budget', ''),
             ('gross', '')])
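The values in this OrderedDict are still raw wikitext. As a minimal sketch, assuming only simple [[Target]] and [[Target|Label]] links matter, a regex can reduce links to their visible text. The helper name strip_wikilinks is hypothetical, and nested templates such as {{ubl|...}} are deliberately left untouched:

```python
import re
from collections import OrderedDict

# Matches [[Target]] and [[Target|Label]]; group 1 is the visible text.
WIKILINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def strip_wikilinks(value):
    """Replace each wikilink in value with its visible label (hypothetical helper)."""
    return WIKILINK_RE.sub(r"\1", value)


# A few entries copied from the printed OrderedDict.
params = OrderedDict([
    ("director", "[[Ventura Pons]]"),
    ("producer", "Ventura Pons"),
    ("distributor", "[[Buena Vista International]]"),
    ("released", "{{film date|df=yes|1997|1|17|[[Spain]]}}"),
])

plain = OrderedDict((key, strip_wikilinks(val)) for key, val in params.items())
for key, val in plain.items():
    print(key, "=", val)
```

For anything beyond simple links (references, comments, nested templates), a real parser such as mwparserfromhell is the safer choice.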
templatesWithParams()
This is similar to the raw_extracted_templates property, but the method returns a list of tuples, again with two items each. The first item is the template as a Page object. The second item is a list of template parameters. Have a look at the sample:
Sample code
from pprint import pprint
import pywikibot

site = pywikibot.Site('wikipedia:en')  # or pywikibot.Site('en', 'wikipedia') for older releases
page = pywikibot.Page(site, 'Actrius')
all_templates = page.templatesWithParams()
# Each entry is a (template Page, list of 'key=value' strings) tuple
for tmpl, params in all_templates:
    if tmpl.title(with_ns=False) == 'Infobox film':
        pprint(params)
This will print the list:
['alt=',
 "based_on={{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}",
 'budget=',
 'caption=Catalan language film poster',
 'cinematography=Tomàs Pladevall',
 'country=Spain',
 'director=[[Ventura Pons]]',
 'distributor=[[Buena Vista International]]',
 'editing=Pere Abadal',
 'gross=',
 'image=Actrius film poster.jpg',
 'language=Catalan',
 'music=Carles Cases',
 'name=Actresses',
 'narrator=',
 "native_name=([[Catalan language|Catalan]]: '''''Actrius''''')",
 'producer=Ventura Pons',
 'production_companies={{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
 'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de '
 'Cultura]]|[[Televisión Española]]}}',
 'released={{film date|df=yes|1997|1|17|[[Spain]]}}',
 'runtime=100 minutes',
 'screenplay=Ventura Pons',
 'starring={{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
 'Lizaran]]|[[Mercè Pons]]}}',
 'story=',
 'writer=[[Josep Maria Benet i Jornet]]']
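Because the parameters arrive as plain 'key=value' strings, a lookup by field name needs one more step. The following is a small sketch with a hypothetical helper, params_to_dict, that splits each item on its first '='. It assumes that items without '=' are positional parameters and numbers them '1', '2', and so on; a positional value that itself contains '=' would be misread as a named one.

```python
def params_to_dict(params):
    """Turn ['key=value', ...] into a dict (hypothetical helper).

    Items without '=' are treated as positional and stored under '1', '2', ...
    """
    result = {}
    position = 0
    for item in params:
        # Split on the first '=' only, so values may contain further '=' signs.
        key, sep, value = item.partition("=")
        if sep:
            result[key.strip()] = value.strip()
        else:
            position += 1
            result[str(position)] = item.strip()
    return result


# A few items copied from the printed list.
sample = [
    "alt=",
    "country=Spain",
    "director=[[Ventura Pons]]",
    "released={{film date|df=yes|1997|1|17|[[Spain]]}}",
]
infobox = params_to_dict(sample)
print(infobox["director"])
```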
Solution 4:
You can get the wiki page content with pywikipediabot, and then search for the infobox with a regex or a parser like mwlib [0], or stick with pywikipediabot and use one of its template tools. For example, in textlib you'll find some functions to deal with templates (hint: search for "# Functions dealing with templates"). [1]
[0] - http://pypi.python.org/pypi/mwlib
[1] - http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikipedia/pywikibot/textlib.py?view=markup
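If you go the hand-rolled route, note that a plain regex tends to break on nested templates such as {{film date|...}} inside the infobox. A minimal sketch that tracks brace depth instead is shown here. The function names extract_template and split_params are hypothetical, and the code assumes well-formed wikitext: it ignores triple-brace parameters, HTML comments and <nowiki> sections, drops positional parameters, and matches the template name by prefix only.

```python
def extract_template(wikitext, name):
    """Return the full '{{name ...}}' span of the first match, or None."""
    start = wikitext.find("{{" + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start:i]
        else:
            i += 1
    return None  # unbalanced braces


def split_params(template):
    """Split on '|' outside nested {{...}} or [[...]] and return named params."""
    body = template[2:-2]
    parts, current, depth = [], [], 0
    i = 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    params = {}
    for part in parts[1:]:  # parts[0] is the template name
        key, sep, value = part.partition("=")
        if sep:
            params[key.strip()] = value.strip()
    return params


sample = (
    "Intro {{Infobox film\n| name = Actresses\n"
    "| director = [[Ventura Pons]]\n"
    "| released = {{film date|df=yes|1997|1|17|[[Spain]]}}\n"
    "| budget =\n}} trailing text"
)
fields = split_params(extract_template(sample, "Infobox film"))
print(fields["released"])
```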