Scraping Instagram With Beautifulsoup

July 15, 2023 Post a Comment

I'm trying to get a particular string from the 'search by tag' in Instagram. I'd like to get the url img from here: #yeşil #manzara #doğa #yayla #nature #naturelo</div><h2 id="solution_1">Solution 1:

</h2><div class="answer-desc"><p>The reason you can

#yeşil #manzara #doğa #yayla #nature #naturelo</div><h2 id="solution_1">Solution 1:

</h2><div class="answer-desc"><p>The reason you can

Selenium.

But, there's one more way to scrape that. Looking at the page source, the data you're after, is available in a <script> tag in the form of JSON. The relevant data is in the form of:

"thumbnail_resources": [
    {
        "src": "https://instagram.fpnq3-1.fna.fbcdn.net/vp/a3ed0ee1af581f1c1fe6170b8c080e7c/5B2CA660/t51.2885-15/s150x150/e35/28433503_571483933190064_5347634166450094080_n.jpg",
         "config_width": 150,
         "config_height": 150
     },
     {
         "src": "https://instagram.fpnq3-1.fna.fbcdn.net/vp/7a0bb4fb1b5d5e3b179c58a2b9472b9f/5B2C535F/t51.2885-15/s240x240/e35/28433503_571483933190064_5347634166450094080_n.jpg",
         "config_width": 240,
         "config_height": 240
     },

To get the JSON, you can use this (code taken from this answer):

script = soup.find('script', text=lambda t: t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)

Code to get image link for all the images:

import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.instagram.com/explore/tags/nature/')
soup = BeautifulSoup(r.text, 'lxml')

script = soup.find('script', text=lambda t: t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)

for post indata['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']:
    image_src = post['node']['thumbnail_resources'][1]['src']
    print(image_src)

Partial output:

https://instagram.fpnq3-1.fna.fbcdn.net/vp/e8a78407fb61de834cad7f10eca830fc/5A9DC375/t51.2885-15/s240x240/e15/c0.80.640.640/28766397_174603559842180_1092148752455565312_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/3a20f36647c86c2196f259b5d14ebf82/5A9D5BC9/t51.2885-15/s240x240/e15/28433802_283862648812409_3322859933120069632_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/82216be4596dd9da862ba267cdeab517/5B144226/t51.2885-15/s240x240/e35/c0.135.1080.1080/28157436_941679549319762_5605299824451649536_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/e50eab90b2e0951d67922e49b495e1fc/5B3EC9B8/t51.2885-15/s240x240/e35/c135.0.810.810/28754107_179533402825352_1137703808411893760_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/d3a13e7b81a65421b4318b57fb8ee24e/5B4D9EFF/t51.2885-15/s240x240/e35/28433583_375555202918683_1951892035636035584_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/1b0aeea1b9be983498192d350e039aa0/5B43C583/t51.2885-15/s240x240/e35/28156427_154249191953160_9219472301039288320_n.jpg
...

Note: The [1] in the line image_src = post['node']['thumbnail_resources'][1]['src'] is for 240w. You can use 0, 1, 2, 3 or 4 for 150w, 240w, 320w, 480w or 640w respectively. Also, if you want any other data regarding any image, like, number of likes, comments, caption, etc; everything is available in this JSON (data variable).

Getting Started with Python

Scraping Instagram With Beautifulsoup

Post a Comment for "Scraping Instagram With Beautifulsoup"