
Can't Modify A Function To Work Independently Instead Of Depending On A Returned Result

I've written a script in Python that makes use of proxies while sending requests to some links in order to parse the product name from there. My current attempt does the job flawlessly.

Solution 1:

If you don't need multithreading support (your edits suggest you don't), you can make it work with the following minor changes. proxyVault keeps both the entire proxy pool and the active proxy (the last one) after shuffling the list (your code had both shuffle and choice, but just one of them is enough). pop()-ing from the list changes the active proxy, until there are none left.

import random
import requests
from random import choice
from urllib.parse import urljoin
from bs4 import BeautifulSoup

linklist = [
    'https://www.amazon.com/dp/B00OI0RGGO',
    'https://www.amazon.com/dp/B00TPKOPWA',
    'https://www.amazon.com/dp/B00TH42HWE'
]

proxyVault = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128', '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312', '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251', '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080', '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243']
random.shuffle(proxyVault)


class NoMoreProxies(Exception):
    pass


def skip_proxy():
    global proxyVault
    if len(proxyVault) == 0:
        raise NoMoreProxies()
    proxyVault.pop()


def get_proxy():
    global proxyVault
    if len(proxyVault) == 0:
        raise NoMoreProxies()
    proxy_url = proxyVault[-1]
    proxy = {'https': f'http://{proxy_url}'}
    return proxy


def parse_product(link):
    try:
        proxy = get_proxy()
        print("checking the proxy:", proxy)
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception:
            product_name = ""

        return product_name

    except Exception:
        # When the request fails, drop the current (bad) proxy from the vault
        # and retry the same link with the next one.
        skip_proxy()
        return parse_product(link)


if __name__ == '__main__':
    for url in linklist:
        result = parse_product(url)
        print(result)

I would also suggest changing the last try/except clause to catch a RequestException instead of Exception.
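For example, only the outer handler needs to change (a minimal sketch; requests.exceptions.RequestException is the base class for the network-level errors raised by requests, such as ConnectionError and Timeout):

from requests.exceptions import RequestException

def parse_product(link):
    try:
        proxy = get_proxy()
        print("checking the proxy:", proxy)
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception:
            product_name = ""
        return product_name
    except RequestException:
        # Only network/HTTP-level failures trigger a proxy change;
        # parsing problems are handled by the inner try/except above.
        skip_proxy()
        return parse_product(link)

This way, an unexpected programming error won't silently burn through the whole proxy pool.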

Solution 2:

Perhaps you can put the proxy handling logic inside a class, and pass an instance to parse_product(). Then, parse_product() will invoke the necessary methods of the instance to get and/or reset the proxy. The class can look something like this:

class ProxyHandler:
    proxyVault = [
        "103.110.37.244:36022",
        "180.254.218.229:8080",  # and so on
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Initialize proxy
        proxy_url = choice(self.proxyVault)
        self.proxy = {"https": f"http://{proxy_url}"}

    def get_proxy(self):
        return self.proxy

    def renew_proxy(self):
        # Remove the current proxy from the vault
        proxy_pattern = self.proxy.get("https").split("//")[-1]
        if proxy_pattern in self.proxyVault:
            self.proxyVault.remove(proxy_pattern)

        # Set a new proxy
        random.shuffle(self.proxyVault)
        proxy_url = choice(self.proxyVault)
        self.proxy = {"https": f"http://{proxy_url}"}

Then, parse_product() might look something like this:

def parse_product(link, proxy_handler):
    try:
        if not proxy_handler:
            raise ValueError("a ProxyHandler instance is required")
        proxy = proxy_handler.get_proxy()
        print("checking the proxy:", proxy)
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception:
            product_name = ""

        return product_name

    except Exception:
        # When the request fails, ask the handler to drop the bad proxy,
        # pick a new one, and retry.
        proxy_handler.renew_proxy()
        return parse_product(link, proxy_handler)

I think you can pass the same ProxyHandler instance to all threads and parallelize too.
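A minimal usage sketch along those lines (it reuses linklist from the first snippet and the ProxyHandler/parse_product() definitions above; since all workers mutate the same instance, in practice you may want to guard renew_proxy() with a threading.Lock):

from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':
    handler = ProxyHandler()  # one shared instance for all workers
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Each worker fetches one URL, sharing the same proxy handler
        results = executor.map(lambda url: parse_product(url, handler), linklist)
    for name in results:
        print(name)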

Solution 3:

I might be missing something crucial here (as it's pretty late), but this seems like a simple problem that was extremely overcomplicated. It almost tends to be an XY Problem. I'm going to post some thoughts, questions (wonderings of mine), observations, and suggestions:

  • The end goal is that for each link, access it (once, or as many times as possible? If it's the latter, it seems like a DoS attempt, so I'll assume it's the former :) ) using each of the proxies (when a proxy fails, move to the next one). If that works, get some product name (the product seems to be some kind of electric motor)
  • Why the recursion? It's limited by the stack (in Python by [Python 3.Docs]: sys.getrecursionlimit())
  • No need to declare variables as global if not assigning values to them (there are exceptions, but I don't think it's the case here)
  • process_proxy (question variant) isn't behaving well when proxyVault gets empty
  • global proxy (from the answer) is just ugly
  • Why random instead of simply picking the next proxy from the list?
  • parse_product_info (parse_product) behavior is not consistent, in some cases returns something, in others it doesn't
  • Parallelization occurs only at the target URL level. It can be improved a bit more (but more logic needs to be added to the code) by also working at the proxy level (a sketch of this is included after the output below)

Below is a simplified (and cleaner) version.

code00.py:

#!/usr/bin/env python3

import sys
import random
import requests
from bs4 import BeautifulSoup


urls = [
    "https://www.amazon.com/dp/B00OI0RGGO",
    "https://www.amazon.com/dp/B00TPKOPWA",
    "https://www.amazon.com/dp/B00TH42HWE",
    "https://www.amazon.com/dp/B00TPKNREM",
]

proxies = [
    "103.110.37.244:36022",
    "180.254.218.229:8080",
    "110.74.197.207:50632",
    "1.20.101.95:49001",
    "200.10.193.90:8080",
    "173.164.26.117:3128",
    "103.228.118.66:43002",
    "178.128.231.201:3128",
    "1.2.169.54:55312",
    "181.52.85.249:31487",
    "97.64.135.4:8080",
    "190.96.214.123:53251",
    "52.144.107.142:31923",
    "45.5.224.145:52035",
    "89.218.22.178:8080",
    "192.241.143.186:80",
    "113.53.29.218:38310",
    "36.78.131.182:39243"
]


def parse_product_info(link):  # Can also pass proxies as argument
    local_proxies = proxies[:]  # Make own copy of the global proxies (in case you want to shuffle them and not affect other parallel processing workers)
    #random.shuffle(local_proxies)  # Makes no difference, but if you really want to shuffle it, decomment this line
    for proxy in local_proxies:
        try:
            proxy_dict = {"https": f"http://{proxy}"}  # http or https?
            print(f"    Proxy to be used: {proxy_dict['https']}")
            response = requests.get(link, proxies=proxy_dict, timeout=5)
            if not response:
                print(f"    HTTP request returned {response.status_code} code")
                continue  # Move to next proxy
            soup = BeautifulSoup(response.text, "html5lib")
            try:
                product_name = soup.select_one("#productTitle").get_text(strip=True)
                return product_name  # Information retrieved, return it.
            except Exception as e:  # Might want to use specific exceptions
                print(f"ERROR: {e}")
                # URL was accessible, but the info couldn't be parsed.
                # Return, as it will probably be the same using any other proxy.
                return None  # Replace by `continue` if you want to try the other proxies
        except Exception as e:
            #print(f"    {e}")
            continue  # Some exception occurred, move to next proxy


def main():
    for url in urls:
        print(f"\nAttempting url: {url}...")
        product_name = parse_product_info(url)
        if product_name:
            print(f"{url} yielded product name:\n[{product_name}]\n")


if __name__ == "__main__":
    print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(item.strip() for item in sys.version.split("\n")), 64if sys.maxsize > 0x100000000else32, sys.platform))
    main()
    print("\nDone.")

Output (partial, as I didn't let it go through all proxies / URLs):

[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q058796837]> "e:\Work\Dev\VEnvs\py_064_03.07.03_test0\Scripts\python.exe" code00.py
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] 64bit on win32


Attempting url: https://www.amazon.com/dp/B00OI0RGGO...
    Proxy to be used: http://103.110.37.244:36022
    Proxy to be used: http://180.254.218.229:8080
    Proxy to be used: http://110.74.197.207:50632
    Proxy to be used: http://1.20.101.95:49001
    Proxy to be used: http://200.10.193.90:8080
    Proxy to be used: http://173.164.26.117:3128
    ...
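Regarding the last bullet in the list above (working at the proxy level as well), here is a minimal sketch built on top of code00.py. fetch_via_proxy() is a hypothetical helper (not part of the original code) that performs one attempt through one proxy, and parse_product_info_parallel() keeps the first usable answer:

from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_via_proxy(link, proxy):
    # One attempt through one proxy; returns the product name or None.
    try:
        proxy_dict = {"https": f"http://{proxy}"}
        response = requests.get(link, proxies=proxy_dict, timeout=5)
        if not response:
            return None
        soup = BeautifulSoup(response.text, "html5lib")
        node = soup.select_one("#productTitle")
        return node.get_text(strip=True) if node else None
    except Exception:
        return None


def parse_product_info_parallel(link, max_workers=4):
    # Try several proxies concurrently and return the first non-empty result.
    # Note: the remaining attempts still run to completion (or time out)
    # before the executor shuts down.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch_via_proxy, link, proxy) for proxy in proxies]
        for future in as_completed(futures):
            result = future.result()
            if result:
                return result
    return None

This trades extra requests (several proxies are tried at once) for a shorter wall-clock time per URL, which may or may not be acceptable depending on how gentle you need to be with the target site.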
