Data Collection: Proxies and User-Agents (Python)

Griffin Hundley
Feb 19, 2021


Taking on a new data project comes with many challenges. After identifying the problem you want to solve, the next step is getting viable data. Not every project will have a nicely compiled .csv download link with all of the exact features you need. In that situation you need only turn to your good friend, the web scraper.

So you boot up the script, the status code returns 200, and everything works great for a short while… until it doesn’t. If you tried to gather any nontrivial amount of data in a robotic, repetitive fashion, it’s very likely that you broke a site rule and the app’s security has put you on timeout.

Photo by Markus Spiske on Unsplash

Certain web-crawling scripts can generate a ton of artificial load and put strain on servers, so websites have methods of detecting and preventing such activity. Most websites publish a robots.txt file that indicates which paths are off-limits to crawlers. Additionally, some websites restrict the number of requests an IP address can make per minute. Ethically, it’s important to respect these rules and limits. You do not want to selfishly cripple someone’s website with a DDoS-style attack.
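
As a quick courtesy check before crawling, you can read a site’s robots.txt programmatically. Here is a minimal sketch using Python’s built-in urllib.robotparser (this step is my own addition and uses a placeholder domain; substitute the site you actually intend to scrape):

# check whether crawlers are allowed to fetch a given path
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
# True if robots.txt allows any user-agent ('*') to fetch this URL
print(rp.can_fetch('*', 'https://example.com/some/page.html'))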

For this demonstration, I’ll be using Python, and the test website will be the scraping-friendly site https://books.toscrape.com. A typical, unblocked request will look something like this:

# import libraries
import requests
# make a get request to the desired URL
response = requests.get('https://books.toscrape.com/')
# print the response status code
print(response)
<Response [200]>
# print the contents of the response
print(response.content)
b'<!DOCTYPE html> etc...'

If your default IP has been timed out from making requests from a server, fret not. There is a way to continue the data collection process, which involves masking your default IP address with a proxy.

To make the same request through a proxy, pass the IP and the port to the proxies argument in a dict like so:

# store the proxy IP and port as a string in a variable
proxy = '00.00.000.000:0000'
response = requests.get('https://books.toscrape.com/',
                        proxies={"http": proxy, "https": proxy})

This routes your request through the proxy IP to the URL. In this case I’m just using a dummy IP. There are free proxies available online; just be aware that there is no guarantee of privacy, as the proxy operator has full access to any data that passes through it.
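
Free proxies in particular can be slow or dead, so it’s worth adding a timeout and some basic error handling around the request. A minimal sketch (this defensive wrapper is my own addition, not part of the original example):

# guard against dead or slow proxies with a timeout and exception handling
try:
    response = requests.get('https://books.toscrape.com/',
                            proxies={"http": proxy, "https": proxy},
                            timeout=10)
    response.raise_for_status()  # raise an error for 4xx/5xx status codes
except requests.exceptions.RequestException as e:
    # ProxyError, ConnectTimeout, HTTPError, etc. all inherit from RequestException
    print('Request through the proxy failed:', e)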

The above get-request is sent without any browser identification. Some servers require a user-agent header, so that only clients that look like real browsers can view the content. This can be simulated in the Python environment with the fake_useragent library.

from fake_useragent import UserAgent
# create a UserAgent object
ua = UserAgent()
# ask the UserAgent object for a random Chrome user-agent string
fake_chrome_browser = ua.chrome
# print the user-agent string
print(fake_chrome_browser)
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'

To use the user-agent in our get-request, simply pass it in the headers parameter. While we are at it, store the base URL in a variable just to clean up the inputs.

# store the user-agent in a dictionary to pass to the get-request
useragent_header = {'User-Agent': fake_chrome_browser}
url_base = 'https://books.toscrape.com/'
response = requests.get(url_base,
                        headers=useragent_header,
                        proxies={"http": proxy, "https": proxy})

Now it is possible to bypass that pesky IP timeout by using a proxy, and you can obfuscate things further by randomizing the user-agent on each request with UserAgent().random. Putting everything together:

# import libraries
import requests
from fake_useragent import UserAgent
# declare variables
url_base = 'https://books.toscrape.com/'
proxy = '00.00.000.000:0000'
ua = UserAgent()
fake_browser = ua.random
useragent_header = {'User-Agent': fake_browser}
response = requests.get(url_base,
                        headers=useragent_header,
                        proxies={"http": proxy, "https": proxy})
print(response.content)
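
If you’re making many requests, a further refinement (my own addition, not shown in the snippet above) is to pull a fresh random user-agent, and optionally a different proxy, for every request, with a polite pause in between. A rough sketch, assuming a small placeholder pool of proxies you have gathered elsewhere and illustrative catalogue URLs:

# rotate the user-agent (and optionally the proxy) on every request
import time

# a hypothetical pool of proxies gathered and verified beforehand
proxy_pool = ['00.00.000.000:0000', '11.11.111.111:1111']

for i in range(1, 6):
    proxy = proxy_pool[i % len(proxy_pool)]
    header = {'User-Agent': ua.random}  # fresh random user-agent each time
    response = requests.get(f'{url_base}catalogue/page-{i}.html',
                            headers=header,
                            proxies={"http": proxy, "https": proxy},
                            timeout=10)
    print(i, response.status_code)
    time.sleep(1)  # stay polite: pause between requests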

Hopefully this will be helpful to those of you with personal data projects involving web scraping. As a sequel to this tutorial, I’ll break down the process of scraping free proxies for use in data collection.
