Web scraping with a VPN and how not to get my account blocked

I wrote a bot in Node.js that makes a request and gets JSON as a response. It started as a web scraper using Puppeteer, but later I realized I could just make a fetch request and get the data I needed. Besides, Chrome uses too many resources, and I could only get 4 instances running at the same time on the old PC that I’m using as a server.

I was using Surfshark as a VPN and I had the script containerized with Docker, so each container would connect to a different VPN server, make 10 requests and then connect to another server. The response has an Age header that resets every 30 seconds, so the script waits accordingly in order to get a fresh response each time before connecting to a different server.
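A rough sketch of that waiting logic, assuming Node 18+ (global `fetch`) and that the cache really does refresh every 30 seconds. The names `msUntilFresh` and `fetchThenWait` are made up for illustration and are not from the actual script:

```javascript
// Assumption: the cached response is refreshed every 30 seconds (as described above).
const CACHE_WINDOW_SECONDS = 30;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Given the Age header (seconds the response has spent in the cache),
// return how many milliseconds are left until the next refresh.
function msUntilFresh(ageHeader, windowSeconds = CACHE_WINDOW_SECONDS) {
  const age = Number.parseInt(ageHeader ?? "", 10);
  // Missing or unparsable header: nothing to wait for.
  if (Number.isNaN(age) || age < 0) return 0;
  return (windowSeconds - (age % windowSeconds)) * 1000;
}

// Fetch the JSON, then wait until the cached copy is due to be replaced,
// so the next request should see new data.
async function fetchThenWait(url) {
  const response = await fetch(url);
  const data = await response.json();
  await sleep(msUntilFresh(response.headers.get("age")));
  return data;
}
```

With an Age of 29 the script waits about 1 second; with an Age of 0 it waits the full 30 seconds.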

The thing is, Surfshark blocked my account because they don’t allow web scraping. I was kind of greedy, I guess, because I had 24 containers running, each one connected to a different location and rotating between servers all day.

I thought about using proxies, but most of the proxy companies block the domain that I’m fetching the data from, so I’m going to try another VPN.

I’m using CyberGhost since they let you download the OpenVPN config files, so I can connect to the VPN programmatically, but I don’t want to have my account blocked, so I need a way to somehow mimic an actual person using the VPN.

Does anyone know how I could fake traffic to make it look like it isn’t a bot using the VPN?
Any advice on how to not get blocked by the VPN is greatly appreciated 🙏🏻.

You’re probably blowing through some rate-limiter. If you don’t want to be blocked for acting like a bot, you could stop egregiously behaving like a bot 🤷‍♀️

Most VPN services are using advanced techniques to detect scraping so you don’t get their entire VPN subnet blacklisted by popular sites. You are screwing other VPN users and blatantly breaking their TOS with super aggressive scrapers, so obviously they are going to detect that and ban you.

Can you recommend any resources with up-to-date advice on web scraping, such as how to make a scraper behave less aggressively?

It highly depends on the site you are targeting.

In general: the more you act like a real human with a real browser the better.

  • take care of all the right headers (have a good look at all the requests in the browser dev tools)
  • don’t fire too many requests at once (there is no exact science: start low and increase until you hit a wall; 0.5-1 requests/sec is usually a good starting point)

These two will get you quite far.
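
Something like this covers both points. It is only a sketch, assuming Node 18+ (global `fetch`) and a JSON endpoint. The header values are examples that you should replace with whatever your own browser actually sends, and `randomDelayMs` / `fetchAllSlowly` are hypothetical helper names:

```javascript
// Example headers only: copy the real ones from your browser's dev tools.
const BROWSER_HEADERS = {
  "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Accept": "application/json, text/plain, */*",
  "Accept-Language": "en-US,en;q=0.9",
};

const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// A random delay between 1 and 2 seconds works out to roughly 0.5-1 requests/sec.
function randomDelayMs(minMs = 1000, maxMs = 2000) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Fetch the URLs one at a time, never in parallel, pausing after each request.
async function fetchAllSlowly(urls, { minMs = 1000, maxMs = 2000 } = {}) {
  const results = [];
  for (const url of urls) {
    const response = await fetch(url, { headers: BROWSER_HEADERS });
    results.push(await response.json());
    await pause(randomDelayMs(minMs, maxMs));
  }
  return results;
}
```

Start with the slow defaults and only lower `minMs` / `maxMs` if the site keeps answering normally.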