Web Data(wiki) Scraping Python
I am trying to obtain the latitude and longitude of some universities from Wikipedia. I have a base URL, 'https://de.wikipedia.org/wiki/Liste_altsprachlicher_Gymnasien', with a list of universities and I ...
Solution 1:
Sorry for not answering directly, but I always prefer to use MediaWiki's API. And we're lucky to have mwclient in Python, which makes working with the API even easier.
So, for what it's worth, here's how I would do it with mwclient:
import re
import mwclient

site = mwclient.Site('de.wikipedia.org')
start_page = site.Pages['Liste_altsprachlicher_Gymnasien']

# Coordinates appear in the wikitext as deg/min/sec followed by /N or /E.
lat_pattern = re.compile(r'Breitengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/N')
lon_pattern = re.compile(r'Längengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/E')

results = {}
for link in start_page.links():
    page = site.Pages[link['title']]
    text = page.text()
    try:
        breiten = [float(b) for b in lat_pattern.search(text).group(1).split('/')]
        langen = [float(b) for b in lon_pattern.search(text).group(1).split('/')]
    except (AttributeError, ValueError):
        # No match (search() returned None) or a malformed number: skip this page.
        continue
    results[link['title']] = breiten, langen
For each link where coordinates are found, this gives a tuple of two lists (latitude, longitude), each of the form [deg, min, sec]:
>>> results
{'Akademisches Gymnasium (Wien)': ([48.0, 12.0, 5.0], [16.0, 22.0, 34.0]),
'Akademisches Gymnasium Salzburg': ([47.0, 47.0, 39.9], [13.0, 2.0, 2.9]),
'Albertus-Magnus-Gymnasium (Friesoythe)': ([53.0, 1.0, 19.13], [7.0, 51.0, 46.44]),
'Albertus-Magnus-Gymnasium Regensburg': ([49.0, 1.0, 23.95], [12.0, 4.0, 32.88]),
'Albertus-Magnus-Gymnasium Viersen-Dülken': ([51.0, 14.0, 46.29], [6.0, 19.0, 42.1]),
...
}
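To see what the two regular expressions pick up without calling the API, here is a minimal sketch run against a made-up infobox fragment. The sample wikitext, its field layout, and the `extract_dms` helper are assumptions for illustration only; real pages may format these fields differently. The patterns are the same as in the code above and only cover N/E coordinates.

```python
import re

# Hypothetical infobox fragment; real pages may lay these fields out differently.
sample_wikitext = """{{Infobox Schule
| Schulname = Beispielgymnasium
| Breitengrad = 48/12/05/N
| Längengrad = 16/22/34/E
}}"""

LAT_PATTERN = re.compile(r'Breitengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/N')
LON_PATTERN = re.compile(r'Längengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/E')


def extract_dms(text):
    """Return ([deg, min, sec], [deg, min, sec]), or None if a field is missing."""
    lat_match = LAT_PATTERN.search(text)
    lon_match = LON_PATTERN.search(text)
    if lat_match is None or lon_match is None:
        return None
    lat = [float(part) for part in lat_match.group(1).split('/')]
    lon = [float(part) for part in lon_match.group(1).split('/')]
    return lat, lon


print(extract_dms(sample_wikitext))
print(extract_dms("no coordinates here"))
```

Checking for `None` explicitly is one way to skip pages without coordinates instead of relying on an exception.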
You could format any way you like:
for uni, location in results.items():
    lat, lon = location
    string = """University {} is at {}°{}'{}"N, {}°{}'{}"E"""
    print(string.format(uni, *lat, *lon))
Or convert the DMS coordinates to decimal degrees:
def dms_to_dec(coord):
    d, m, s = coord
    return d + m/60 + s/(60*60)

decimal = {uni: (dms_to_dec(b), dms_to_dec(l)) for uni, (b, l) in results.items()}
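As a quick sanity check of the conversion, here is a small worked example using the first entry of the sample output above (Akademisches Gymnasium (Wien)). By hand, 48 + 12/60 + 5/3600 is roughly 48.2014 and 16 + 22/60 + 34/3600 is roughly 16.3761. The variable names are just for illustration.

```python
def dms_to_dec(coord):
    """Convert [degrees, minutes, seconds] to decimal degrees."""
    d, m, s = coord
    return d + m / 60 + s / 3600


# DMS values copied from the sample output for 'Akademisches Gymnasium (Wien)'.
lat_dms = [48.0, 12.0, 5.0]
lon_dms = [16.0, 22.0, 34.0]

lat = dms_to_dec(lat_dms)
lon = dms_to_dec(lon_dms)
print(round(lat, 4), round(lon, 4))
```

Note that this sketch assumes northern and eastern coordinates, as the regular expressions do; southern or western values would need a negative sign.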
Note that not all of the linked pages are necessarily universities; I didn't check that carefully.