Can't Modify A Function To Work Independently Instead Of Depending On A Returned Result
Solution 1:
If you don't need multithreading support (your edits suggest you don't), you can make it work with the following minor changes. proxyVault keeps both the entire proxy pool and the active proxy (the last one) after shuffling the list (your code had both shuffle and choice, but just one of them is enough). pop()-ing from the list changes the active proxy, until there are none left.
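In miniature, the idea is just this (a standalone sketch with made-up proxy addresses):

import random

vault = ['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080']  # hypothetical proxies
random.shuffle(vault)

active = vault[-1]  # the active proxy is always the last element
vault.pop()         # a failing proxy is discarded; vault[-1] becomes the new active one

Applied to your code, the full version looks like this: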
import random
import requests
from random import choice
from urllib.parse import urljoin
from bs4 import BeautifulSoup
linklist = [
'https://www.amazon.com/dp/B00OI0RGGO',
'https://www.amazon.com/dp/B00TPKOPWA',
'https://www.amazon.com/dp/B00TH42HWE'
]
proxyVault = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128', '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312', '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251', '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080', '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243']
random.shuffle(proxyVault)
class NoMoreProxies(Exception):
    pass


def skip_proxy():
    global proxyVault
    if len(proxyVault) == 0:
        raise NoMoreProxies()
    proxyVault.pop()


def get_proxy():
    global proxyVault
    if len(proxyVault) == 0:
        raise NoMoreProxies()
    proxy_url = proxyVault[-1]
    proxy = {'https': f'http://{proxy_url}'}
    return proxy


def parse_product(link):
    try:
        proxy = get_proxy()
        print("checking the proxy:", proxy)
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception:
            product_name = ""
        return product_name
    except Exception:
        # When this is hit, the bad proxy is dropped and the call is retried with a new one
        skip_proxy()
        return parse_product(link)


if __name__ == '__main__':
    for url in linklist:
        result = parse_product(url)
        print(result)
I would also suggest changing the last try/except clause to catch a RequestException instead of Exception.
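For example, that handler could be narrowed like this (a sketch reusing get_proxy()/skip_proxy() from above; requests.exceptions.RequestException is the base class of the connection, proxy and timeout errors raised by requests):

from requests.exceptions import RequestException

def parse_product(link):
    proxy = get_proxy()
    print("checking the proxy:", proxy)
    try:
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        tag = soup.select_one("#productTitle")
        return tag.get_text(strip=True) if tag else ""
    except RequestException:
        # Only network/proxy errors discard the proxy; parsing problems no longer do
        skip_proxy()
        return parse_product(link)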
Solution 2:
Perhaps you can put the proxy handling logic inside a class, and pass an instance to parse_product(). Then parse_product() will invoke the necessary methods of the instance to get and/or reset the proxy. The class can look something like this:
import random
from random import choice


class ProxyHandler:
    proxyVault = [
        "103.110.37.244:36022",
        "180.254.218.229:8080",
        # and so on
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Initialize proxy
        proxy_url = choice(self.proxyVault)
        self.proxy = {"https": f"http://{proxy_url}"}

    def get_proxy(self):
        return self.proxy

    def renew_proxy(self):
        # Remove the current (bad) proxy from the vault
        proxy_pattern = self.proxy.get("https").split("//")[-1]
        if proxy_pattern in self.proxyVault:
            self.proxyVault.remove(proxy_pattern)
        # Set a new proxy
        random.shuffle(self.proxyVault)
        proxy_url = choice(self.proxyVault)
        self.proxy = {"https": f"http://{proxy_url}"}
Then, parse_product() might look something like this:
def parse_product(link, proxy_handler):
    try:
        if not proxy_handler:
            raise ValueError("a ProxyHandler instance is required")
        proxy = proxy_handler.get_proxy()
        print("checking the proxy:", proxy)
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception:
            product_name = ""
        return product_name
    except Exception:
        # When this is hit, the bad proxy is removed and a new one is picked
        proxy_handler.renew_proxy()
        return parse_product(link, proxy_handler)
I think you can pass the same ProxyHandler instance to all threads and parallelize too.
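A minimal threaded driver could look like this (a sketch; it assumes the linklist from the question, and note that renew_proxy() mutates shared state, so a real implementation may want a lock around it):

from concurrent.futures import ThreadPoolExecutor

proxy_handler = ProxyHandler()  # one shared instance for all workers

with ThreadPoolExecutor(max_workers=3) as executor:
    # Each worker gets one link plus the shared handler
    results = executor.map(lambda link: parse_product(link, proxy_handler), linklist)

for name in results:
    print(name)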
Solution 3:
I might be missing something crucial here (as it's pretty late), but it seems like a simple problem that has been extremely overcomplicated. It almost looks like an XY Problem. I'm going to post some thoughts, questions (wanderings of mine), observations, and suggestions:
- The end goal is, for each link, to access it (once, or as many times as possible? If it's the latter, it looks like a DoS attempt, so I'll assume it's the former :) ) using each of the proxies (when a proxy fails, move to the next one). If that works, get the product name (the product seems to be some kind of electric motor)
- Why the recursion? It's limited by the stack (in Python, by [Python 3.Docs]: sys.getrecursionlimit())
- No need to declare variables as global if not assigning values to them (there are exceptions, but I don't think it's the case here)
- process_proxy (question variant) doesn't behave well when proxyVault becomes empty
- global proxy (from the answer) is just ugly
- Why random instead of simply picking the next proxy from the list?
- parse_product_info (parse_product) behavior is not consistent: in some cases it returns something, in others it doesn't
- Parallelization occurs only at target URL level. It can be improved a bit more (but more logic needs to be added to the code) if also working at proxy level (a minimal sketch for restoring URL-level parallelism is included after the output at the end of this answer)
Below is a simplified (and cleaner) version.
code00.py:
#!/usr/bin/env python3

import sys
import random
import requests
from bs4 import BeautifulSoup
urls = [
"https://www.amazon.com/dp/B00OI0RGGO",
"https://www.amazon.com/dp/B00TPKOPWA",
"https://www.amazon.com/dp/B00TH42HWE",
"https://www.amazon.com/dp/B00TPKNREM",
]
proxies = [
"103.110.37.244:36022",
"180.254.218.229:8080",
"110.74.197.207:50632",
"1.20.101.95:49001",
"200.10.193.90:8080",
"173.164.26.117:3128",
"103.228.118.66:43002",
"178.128.231.201:3128",
"1.2.169.54:55312",
"181.52.85.249:31487",
"97.64.135.4:8080",
"190.96.214.123:53251",
"52.144.107.142:31923",
"45.5.224.145:52035",
"89.218.22.178:8080",
"192.241.143.186:80",
"113.53.29.218:38310",
"36.78.131.182:39243"
]
def parse_product_info(link):  # Can also pass proxies as argument
    local_proxies = proxies[:]  # Make own copy of the global proxies (in case you want to shuffle them and not affect other parallel processing workers)
    #random.shuffle(local_proxies)  # Makes no difference, but if you really want to shuffle it, uncomment this line
    for proxy in local_proxies:
        try:
            proxy_dict = {"https": f"http://{proxy}"}  # http or https?
            print(f"    Proxy to be used: {proxy_dict['https']}")
            response = requests.get(link, proxies=proxy_dict, timeout=5)
            if not response:
                print(f"    HTTP request returned {response.status_code} code")
                continue  # Move to next proxy
            soup = BeautifulSoup(response.text, "html5lib")
            try:
                product_name = soup.select_one("#productTitle").get_text(strip=True)
                return product_name  # Information retrieved, return it.
            except Exception as e:  # Might want to use specific exceptions
                print(f"ERROR: {e}")
                # URL was accessible, but the info couldn't be parsed.
                # return, as it will probably be the same using any other proxy.
                return None  # Replace by `continue` if you want to try the other proxies
        except Exception as e:
            #print(f"    {e}")
            continue  # Some exception occurred, move to next proxy


def main():
    for url in urls:
        print(f"\nAttempting url: {url}...")
        product_name = parse_product_info(url)
        if product_name:
            print(f"{url} yielded product name:\n[{product_name}]\n")
if __name__ == "__main__":
print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(item.strip() for item in sys.version.split("\n")), 64if sys.maxsize > 0x100000000else32, sys.platform))
main()
print("\nDone.")
Output (partial, as I didn't let it go through all proxies / URLs):
[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q058796837]> "e:\Work\Dev\VEnvs\py_064_03.07.03_test0\Scripts\python.exe" code00.py
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] 64bit on win32

Attempting url: https://www.amazon.com/dp/B00OI0RGGO...
    Proxy to be used: http://103.110.37.244:36022
    Proxy to be used: http://180.254.218.229:8080
    Proxy to be used: http://110.74.197.207:50632
    Proxy to be used: http://1.20.101.95:49001
    Proxy to be used: http://200.10.193.90:8080
    Proxy to be used: http://173.164.26.117:3128
...
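Regarding the parallelization bullet above: if you want to run the URLs in parallel again (the simplified version processes them sequentially), a minimal sketch would be to replace main with a thread pool over the URLs; proxy-level parallelism would still need extra logic:

from concurrent.futures import ThreadPoolExecutor


def main():
    # Run parse_product_info for all URLs on a small thread pool;
    # executor.map preserves the input order, so zip pairs results with their URLs
    with ThreadPoolExecutor(max_workers=4) as executor:
        for url, product_name in zip(urls, executor.map(parse_product_info, urls)):
            if product_name:
                print(f"{url} yielded product name:\n[{product_name}]\n")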