Data Collection: Scraping for Free Proxies (Python)

Griffin Hundley
Feb 27, 2021

In my previous article, I described the use of proxies and user-agents in Python to assist in the process of web-scraping. For professional, production-scale web-scraping, paid proxies and VPNs will be necessary for both the volume and the security of the data being scraped. Public proxies will have significant downtime, as their resources can be utilized by anyone. Additionally, the data that passes through them will be unsecured. These public free proxies are often misused, and may themselves be blacklisted by some web services.


The use of paid, private proxies will ensure the most uptime and data security. However, for relatively small data projects, fast and reliable proxies can be cost prohibitive. Manually searching public lists for free proxies will prove to be an exercise in patience, as many of them do not work, and for those that do work there is no telling how much longer they will continue to work. Using free proxies in a small data project therefore calls for a function that acquires them automatically.

# import libraries
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup

# define a function that will return a list of elite https proxies
def get_proxies():
    # generate a random user agent for the request
    ua = UserAgent()
    user_agent = ua.random
    # store the url of the proxy site as a string
    url = 'https://free-proxy-list.net/'
    # structure the request header
    ua_header = {'User-Agent': user_agent}
    # make a GET request to the url
    content = requests.get(url, headers=ua_header).text
    # turn the html text into a BeautifulSoup object for parsing
    soup = BeautifulSoup(content, 'html.parser')
    # fill a list with each row in the table
    rows = []
    for row in soup.findAll('tr'):
        rows.append(row)
    # find the rows that are elite proxy and 'yes' in the https column
    elite_https_proxies = []
    for row in rows:
        i = row.findAll('td')
        try:
            if i[4].text == 'elite proxy' and i[6].text == 'yes':
                # append IP:port to the list (column 0 is the IP, column 1 is the port)
                elite_https_proxies.append(i[0].text + ':' + i[1].text)
        except IndexError:
            # header rows have no <td> cells, so skip them
            continue
    # return the proxy addresses as a list
    return elite_https_proxies

To do so in Python 3, I will be using the requests, BeautifulSoup4, and fake_useragent libraries. I first make a GET request to a free public proxy site that displays its proxies in a table. The HTML content of the response can then be parsed with BeautifulSoup. Because the values are stored in HTML tables, calling the bs4 findAll('tr') method returns all of the rows in the table.

To maximize security and obscure the original IP, a proxy should be both elite and https. Elite proxies provide the most anonymous form of request, and https connections encrypt the traffic so it cannot be read in transit. In my example, the table has an ‘Anonymity’ column and an ‘Https’ column; their respective positions in the list of table cells are 4 and 6. To filter for proxies that meet both conditions, I iterate through the rows and keep the ones where Anonymity is ‘elite proxy’ and Https is ‘yes’. Running this function once will generate a list of proxies, but they will probably only work for a limited amount of time. The website I reference updates automatically, so running the function again at a later point will return different proxies (that hopefully still work).
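Because any given public proxy can stop responding at any moment, it can be worth testing a proxy before relying on it. Below is a minimal sketch of such a check; the check_proxy helper and the httpbin.org test URL are my own illustrative choices, not part of the original code.

import requests

# quick health check for a single proxy string like '1.2.3.4:8080'
# (helper name and test endpoint are illustrative assumptions)
def check_proxy(proxy, timeout=5):
    proxies = {
        'http': 'http://' + proxy,
        'https': 'http://' + proxy,
    }
    try:
        # httpbin.org/ip simply echoes back the IP the request came from
        response = requests.get('https://httpbin.org/ip',
                                proxies=proxies,
                                timeout=timeout)
        return response.ok
    except requests.exceptions.RequestException:
        # connection errors and timeouts mean the proxy is not usable right now
        return False

# keep only the proxies that currently respond
working_proxies = [p for p in get_proxies() if check_proxy(p)]

The timeout keeps dead proxies from stalling the whole filtering pass, which matters when most entries on a public list are already stale.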

To wrap up, the itertools library offers a helpful way to rotate through the list. Passing the list of proxies to cycle() creates an iterator, and each call to next() returns the next proxy in sequence, wrapping around to the start when it reaches the end.

from itertools import cycle

# get the list of proxies
proxy_list = get_proxies()
# create an iterator of the list
cycler = cycle(proxy_list)
# returns the next proxy in line
next_proxy = next(cycler)
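
As a rough sketch of how the cycled proxy might then be used, the proxy string can be passed to requests through its proxies argument; the target URL here is only a placeholder.

# route a request through the current proxy
# (example.com is a placeholder target)
proxies = {'http': 'http://' + next_proxy,
           'https': 'http://' + next_proxy}
response = requests.get('https://example.com',
                        proxies=proxies,
                        timeout=5)
print(response.status_code)
# rotate to the next proxy for the following request
next_proxy = next(cycler)

Rotating proxies this way spreads requests across the list, which reduces the chance of any single IP being rate-limited or blocked.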
