Proxies

Learn how to configure a proxy with Urlbox

When rendering certain sites, you may be blocked from rendering or scraping the content that they serve.

For example, a site may use Cloudflare to protect itself from bots.

In order to get around these protections, the Urlbox API supports the use of proxies.

How proxies work

When you make a request to the Urlbox API, you can specify a proxy to use. The Urlbox API will then make the request to the target site using the proxy you specified.

This has the benefit of making the request appear to come from the IP address of the proxy, rather than from Urlbox's data center IP address.

This reduces the chance that the target site will be able to detect that the request is coming from the Urlbox API.

Proxy providers

Urlbox does not provide proxies for you to use. Instead, you must bring your own proxy by using a proxy provider. There are many proxy providers available, and you can use any provider that you like.

Using a proxy with Urlbox

Taking Bright Data as an example, you can sign up for an account there and then create a proxy.

They offer several solution types. The proxies that usually work best are web unlockers, residential proxies, or 4G / mobile proxies.

In this example I've created a proxy using the Web Unlocker solution type, which also gives the following benefits:

  • Bypass CAPTCHAs, blocks, and restrictions
  • Only pay for successful requests
  • Automated IP address rotation
  • User emulation & fingerprints

Click on the proxy you created, go to the Access parameters tab, then click the Check out code and integration examples button. With the API type selected and Language set to Node.js, you can copy the proxy URL.

The proxy URL should look something like: http://brd-customer-hl_3f08b01c-zone-social_networks:password@brd.superproxy.io:22225
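To show how the parts of a proxy URL fit together, here is a minimal sketch that assembles one from a username, password, host and port. The function name build_proxy_url is hypothetical, the values are placeholders (the host and port match the curl examples later on this page), and it assumes the credentials contain no characters that would need percent-encoding.

```shell
# Assemble a proxy URL from its parts. All values here are placeholders.
build_proxy_url() {
  # $1 = username, $2 = password, $3 = host, $4 = port
  # Assumption: none of the values need percent-encoding.
  printf 'http://%s:%s@%s:%s\n' "$1" "$2" "$3" "$4"
}

# Print an example proxy URL.
build_proxy_url "brd-customer-hl_3f08b01c-zone-social_networks" "password" "brd.superproxy.io" "22225"
```

If your password contains characters such as @ or :, it would likely need to be percent-encoded before being placed in the URL.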

To use this with Urlbox, you can pass it in directly in the request:

Request

curl -X POST \
	https://api.urlbox.io/v1/render/sync \
	-H 'Authorization: Bearer YOUR_URLBOX_SECRET' \
	-H 'Content-Type: application/json' \
	-d '
{
	"url": "https://www.google.com",
	"proxy": "http://brd-customer-hl_3f08b01c-zone-social_networks:password@brd.superproxy.io:22225"
}
'

Now Urlbox will make the request to the target URL using the proxy you specified.
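If you would rather keep credentials out of the command itself, one option is to read them from environment variables. The sketch below builds the JSON body with a small helper and shows, in a comment, how it could be passed to curl. The helper name render_payload and the variable names PROXY_URL and URLBOX_SECRET are assumptions for illustration, not names that Urlbox requires.

```shell
# Build the JSON body for a render request that goes through a proxy.
# Values are inserted verbatim, so they must not contain double quotes
# or backslashes.
render_payload() {
  # $1 = target URL, $2 = proxy URL
  printf '{"url":"%s","proxy":"%s"}\n' "$1" "$2"
}

# Placeholder used when PROXY_URL is not set in the environment.
PROXY_URL="${PROXY_URL:-http://user:password@proxy.example.com:22225}"

# Print the body so it can be inspected before sending.
render_payload "https://www.google.com" "$PROXY_URL"

# To send it (URLBOX_SECRET is an assumed environment variable name):
# curl -X POST https://api.urlbox.io/v1/render/sync \
#   -H "Authorization: Bearer $URLBOX_SECRET" \
#   -H 'Content-Type: application/json' \
#   -d "$(render_payload "https://www.google.com" "$PROXY_URL")"
```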

Solving ERR_TUNNEL_CONNECTION_FAILED error

If you are using a proxy and you get an error like ERR_TUNNEL_CONNECTION_FAILED then it is likely that the proxy you are using is blocking requests to certain domains.

When using Bright Data residential proxies, some domains such as linkedin.com are blocked unless you go through their full verification process.

  1. If the URL you're sending to Urlbox begins with https://, try changing it to http:// instead, and see whether you get a more detailed message.
  2. For example, accessing https://linkedin.com/ with an unverified residential proxy will give the ERR_TUNNEL_CONNECTION_FAILED error. However, if you change this to http://linkedin.com/ you will get a more helpful error message: Forbidden: requests to this domain are blocked using the proxy networks, please get access via a Web unlocker zone or IDE tools, or contact your account manager to assist
  3. This means that you must either go through their full verification process, or use a different proxy zone to access that domain.
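The scheme swap from step 1 can be scripted when you need to test several URLs. This is only a sketch; the function name to_http is hypothetical, and it is intended for diagnosis only, not for production requests.

```shell
# Rewrite an https:// URL to http:// for diagnostic requests.
# URLs that do not start with https:// are returned unchanged.
to_http() {
  case "$1" in
    https://*) printf 'http://%s\n' "${1#https://}" ;;
    *)         printf '%s\n' "$1" ;;
  esac
}

to_http "https://linkedin.com/"
```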

Check proxy connection from command line

You can check whether the proxy works directly from your terminal. For example, here is a request to https://linkedin.com using a proxy with curl:

curl --proxy brd.superproxy.io:22225 \
     --proxy-user brd-customer-hl_2f08d01c-zone-residential:password \
     -k "https://linkedin.com"
curl: (56) CONNECT tunnel failed, response 403

Here is the same request using http:// in place of https://, which returns a more descriptive message:

curl --proxy brd.superproxy.io:22225 \
     --proxy-user brd-customer-hl_2f08d01c-zone-residential:password \
     "http://linkedin.com"
Forbidden: requests to this domain are blocked using the proxy networks, please get access via a Web unlocker zone or IDE tools, or contact your account manager to assist

After switching to a web unlocker type of proxy, the request works:

curl -I --proxy brd.superproxy.io:22225 \
        --proxy-user brd-customer-hl_2f08d01c-zone-social_networks:password \
        -k "https://linkedin.com"
HTTP/1.1 200 OK
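When running checks like these from a script, it can help to translate curl's exit code into a hint. The sketch below maps a few documented curl exit codes, including the 56 shown in the failed request above. The function names explain_curl_exit and check_proxy are hypothetical, and the hints are suggestions, not a definitive diagnosis.

```shell
# Map a few curl exit codes to hints for proxy debugging.
explain_curl_exit() {
  case "$1" in
    0)  echo "transfer completed: check the HTTP status and body separately" ;;
    5)  echo "could not resolve the proxy host: check the proxy hostname" ;;
    7)  echo "could not connect: check the proxy port and any IP whitelist" ;;
    28) echo "timed out: the proxy or target may be slow or unreachable" ;;
    56) echo "receive failure: the proxy may have refused the CONNECT tunnel" ;;
    *)  echo "curl exit code $1: see the EXIT CODES section of the curl man page" ;;
  esac
}

# Run a proxied request and explain the outcome.
# $1 = proxy host:port, $2 = user:password, $3 = target URL
check_proxy() {
  curl -sS -o /dev/null --max-time 30 --proxy "$1" --proxy-user "$2" "$3"
  explain_curl_exit "$?"
}

# Example: explain the exit code from the failed request above.
explain_curl_exit 56
```

To run a live check, call check_proxy with your own proxy host, credentials and target URL.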

Check proxy blacklist and whitelist settings

Also check on the proxy's settings page that a blacklist or whitelist is not accidentally preventing IPs from accessing the proxy.

At this moment, Urlbox cannot share its IP addresses, as they are dynamic and subject to change. We run on a mixture of Google Kubernetes Engine (GKE) and Cloud Run and use Google Cloud's IP ranges, so you can whitelist those if you need to.

Check proxy provider's status page

As a last resort, it is often worth checking the status page of the proxy provider you are using, as they may be experiencing issues.

For example, Bright Data's status page is here: https://status.brightdata.com/

Geolocation

Sometimes it is also beneficial to have an IP address from a specific country. For example, if you are rendering a site that has different content for different countries, you may want to use a proxy from that country.

Many proxy providers allow you to target locations down to country and city level, or even ZIP code. You can also use proxies that originate from certain ASNs.

Using Bright Data again as an example, you can specify the city and country as part of the proxy URL.

Here's an example of a proxy that would originate from New York, USA:

brd-customer-{YOUR_CUSTOMER_ID}-zone-{YOUR_ZONE}-country-us-city-newyork
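A small helper can build this username for different locations. The sketch below follows the pattern shown above. The function name geo_proxy_user is hypothetical, and lowercasing the city and removing its spaces is an assumption based only on the "newyork" example, so check your provider's documentation for the exact format.

```shell
# Build a geo-targeted proxy username following the pattern shown above.
geo_proxy_user() {
  # $1 = customer ID, $2 = zone, $3 = country code, $4 = city name
  local city
  # Assumption: cities are written lowercase without spaces ("New York" -> "newyork").
  city=$(printf '%s' "$4" | tr -d ' ' | tr '[:upper:]' '[:lower:]')
  printf 'brd-customer-%s-zone-%s-country-%s-city-%s\n' "$1" "$2" "$3" "$city"
}

geo_proxy_user "YOUR_CUSTOMER_ID" "YOUR_ZONE" "us" "New York"
```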

Proxy gotchas

When using proxies alongside Urlbox, expect slower render times, as the request has to go through the proxy before it reaches the target site.

When using residential proxies, there is a higher chance of request failures, as some devices may suddenly go offline or the connection to the proxy may be unstable. Some domains are still blocked by proxy providers, especially high-value scraping targets such as LinkedIn and Amazon, so you may need to go through the provider's verification process to get access to those domains, or use a web unlocker zone.

Sites may still block proxies, so you may need to try a few different providers and proxy types before you find one that works for you.

Extraneous requests

It is also worth noting that when using a proxy, you may see some extraneous requests to domains such as accounts.google.com in your proxy request logs.

This is because when booting the headless Chrome browser, Chrome makes some requests to Google domains to check for account logins or updates. We try to eliminate as many of these extraneous requests as possible, but some cannot be removed.

You can use the various block_* options to block many of these requests from happening.

Blocking requests by domain

You can use the block_urls option to block specific domains from being requested.

Request

curl -X POST \
	https://api.urlbox.io/v1/render/sync \
	-H 'Authorization: Bearer YOUR_URLBOX_SECRET' \
	-H 'Content-Type: application/json' \
	-d '
{
	"url": "https://urlbox.io",
	"proxy": "http://brd-customer-hl_3f08b01c-zone-social_networks:password@brd.superproxy.io:22225",
	"block_urls": [
		"*facebook*",
		"*fullstory*",
		"*crisp*",
		"*intercom*",
		"*getdrip*",
		"*olark*",
		"*optimizely.com*",
		"https://shift.com/images/*",
		"*segment.com*",
		"*.optimizely.com",
		"everesttech.net",
		"userzoom.com",
		"doubleclick.net",
		"googleadservices.com",
		"adservice.google.com/*",
		"connect.facebook.com",
		"connect.facebook.net",
		"sp.analytics.yahoo.com"
	]
}
'

Block requests by resource type

You can also block all requests of a certain resource type, to reduce the amount of bandwidth used by the proxy. For example, you can block images and fonts using the following options:

Request

curl -X POST \
	https://api.urlbox.io/v1/render/sync \
	-H 'Authorization: Bearer YOUR_URLBOX_SECRET' \
	-H 'Content-Type: application/json' \
	-d '
{
	"url": "https://urlbox.io",
	"proxy": "http://brd-customer-hl_3f08b01c-zone-social_networks:password@brd.superproxy.io:22225",
	"block_images": true,
	"block_fonts": true
}
'