Web Data(wiki) Scraping Python

I am trying to obtain the latitude and longitude of some universities from Wikipedia. I have a base URL = 'https://de.wikipedia.org/wiki/Liste_altsprachlicher_Gymnasien' with a list of universities and I ...

Solution 1:

Sorry for not answering directly, but I always prefer to use MediaWiki's API. And we're lucky to have mwclient in Python, which makes working with the API even easier.

So, for what it's worth, here's how I would do it with mwclient:

import re
import mwclient

site = mwclient.Site('de.wikipedia.org')
start_page = site.Pages['Liste_altsprachlicher_Gymnasien']

# Compile the patterns once, outside the loop.
lat_pattern = re.compile(r'Breitengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/N')
lon_pattern = re.compile(r'Längengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/E')

results = {}
for link in start_page.links():
    page = site.Pages[link['title']]
    text = page.text()

    try:
        breiten = [float(b) for b in lat_pattern.search(text).group(1).split('/')]
        langen = [float(b) for b in lon_pattern.search(text).group(1).split('/')]
    except (AttributeError, ValueError):
        # No match (search() returned None) or a malformed number: skip this page.
        continue

    results[link['title']] = breiten, langen

For each linked page where coordinates are found, this gives a tuple of two lists, latitude and longitude, each in the form [deg, min, sec]:

>>> results

{'Akademisches Gymnasium (Wien)': ([48.0, 12.0, 5.0], [16.0, 22.0, 34.0]),
 'Akademisches Gymnasium Salzburg': ([47.0, 47.0, 39.9], [13.0, 2.0, 2.9]),
 'Albertus-Magnus-Gymnasium (Friesoythe)': ([53.0, 1.0, 19.13], [7.0, 51.0, 46.44]),
 'Albertus-Magnus-Gymnasium Regensburg': ([49.0, 1.0, 23.95], [12.0, 4.0, 32.88]),
 'Albertus-Magnus-Gymnasium Viersen-Dülken': ([51.0, 14.0, 46.29], [6.0, 19.0, 42.1]),
 ...
}
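
If you want to see what the two patterns capture without querying the wiki, here is a minimal offline sketch. The infobox fragment is made up for illustration (the real wikitext of a page may be laid out differently), and `extract_dms` is a hypothetical helper name, not part of mwclient.

```python
import re

# Same patterns as in the loop above.
LAT_PATTERN = re.compile(r'Breitengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/N')
LON_PATTERN = re.compile(r'Längengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/E')

# Hypothetical wikitext fragment, made up for illustration only.
SAMPLE_TEXT = """{{Infobox Schule
| Breitengrad = 48/12/05/N
| Längengrad = 16/22/34/E
}}"""


def extract_dms(text, pattern):
    """Return [deg, min, sec] as floats, or None if the pattern does not match."""
    match = pattern.search(text)
    if match is None:
        return None
    return [float(part) for part in match.group(1).split('/')]


lat = extract_dms(SAMPLE_TEXT, LAT_PATTERN)
lon = extract_dms(SAMPLE_TEXT, LON_PATTERN)
print(lat, lon)
```

Checking for `None` explicitly does the same job as the `try`/`except` in the loop; which one you prefer is mostly a matter of taste.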

You can format the results any way you like:

for uni, location in results.items():
    lat, lon = location
    string = """University {} is at {}°{}'{}"N, {}°{}'{}"E"""
    print(string.format(uni, *(lat + lon)))

Or convert the DMS coordinates to decimal degrees:

def dms_to_dec(coord):
    d, m, s = coord
    return d + m/60 + s/(60*60)

decimal = {uni: (dms_to_dec(b), dms_to_dec(l)) for uni, (b, l) in results.items()}
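
As a quick sanity check, here is the conversion applied to the first entry of the results shown earlier. The `dms_to_signed_dec` helper is a hypothetical extension of mine: the patterns above only match /N and /E, so the sign handling for southern and western coordinates is an assumption about how you might want to generalise this, not something the original code needs.

```python
def dms_to_dec(coord):
    """Convert [deg, min, sec] to decimal degrees."""
    d, m, s = coord
    return d + m / 60 + s / 3600


def dms_to_signed_dec(coord, hemisphere):
    """Hypothetical extension: negate the value for 'S' and 'W'."""
    value = dms_to_dec(coord)
    return -value if hemisphere in ('S', 'W') else value


# First entry of the results: Akademisches Gymnasium (Wien).
lat = dms_to_dec([48.0, 12.0, 5.0])
lon = dms_to_dec([16.0, 22.0, 34.0])
print(round(lat, 6), round(lon, 6))
```

Here 48°12'5" works out to 48 + 12/60 + 5/3600, or roughly 48.2014 degrees.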

Note that not all of the linked pages are necessarily universities; I didn't check that carefully.
