
The gold standard is residential IP. It is not cheap but its effectiveness is indisputable.


Back when we had to scrape airline websites to get the deals they withheld for themselves, residential IP was indeed the way. Once they cottoned on to it and blocked it, you'd simply cycle the ADSL modem, get a new IP, and off you'd go again.

Now the best part... one division (big team) of our company worked for the (national carrier) airline, and one division of our company worked for the resellers (we had a single grad allocated to web scraping). The airline threw ridiculous dollars at trying to stop it, and we just used a caffeine-fuelled nerd to keep it running. It wasn't all fun though; they'd often release their new anti-scraping stuff on a Friday afternoon. They were less than impressed when they learnt who the 'enemy' was. Good times!


Once you get to selenium it's usually over, just had to emulate a couple of heavy users with real browsers and voila.


>Once you get to selenium it's usually over, just had to emulate a couple of heavy users with real browsers and voila.

Can you say more about this? What do you mean by "Once you get to selenium it's usually over", and how do you manage cold starts in Selenium and emulate heavy usage?

Say your program starts right now; I assume you don't go through an "adding heavy usage" warm-up phase before getting down to business, correct?


Selenium and other tools in that class essentially just build an API on top of standard consumer browser engines. There are some differences that are difficult to completely hide, but it's about as close to real as it gets, and it can be very difficult if not impossible to tell that it's an automation framework rather than a standard web browser.

Travel information is also one of those services where it’s not weird for a significant number of their users to use it quite heavily, making behavioral detection more difficult.


By default Selenium exposes a few things in JS that are pretty trivial to detect, so you need to disable/hide that for starters. I don't know how easy or hard that is, but stock Selenium is a poor way to get around anti-bot stuff.


It's fairly trivial. Some stuff needs a modified browser executable or some JavaScript magic, but it won't take you longer than 2-5 hours to bypass most of the heuristics; disabling the window.navigator.webdriver flag and getting a residential IP is on its own enough to get single-click captchas every 1-2 tries, for example. To be fair, I haven't looked into it since 2019, but I doubt that it's gotten much harder.
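A minimal sketch of both sides of the navigator.webdriver check mentioned above; the function names are illustrative, and this is not any specific site's detection code:

```javascript
// Detection side: stock Selenium/ChromeDriver sets navigator.webdriver
// to true, so a page script can check the flag directly.
function looksLikeWebdriver(nav) {
  return nav.webdriver === true;
}

// Evasion side: automation code injects an override before any page
// script runs (e.g. via the Chrome DevTools Protocol method
// Page.addScriptToEvaluateOnNewDocument), so the check sees undefined.
function hideWebdriverFlag(nav) {
  Object.defineProperty(nav, "webdriver", { get: () => undefined });
  return nav;
}

// In a real page these would operate on the global `navigator` object.
console.log(looksLikeWebdriver({ webdriver: true }));                    // true
console.log(looksLikeWebdriver(hideWebdriverFlag({ webdriver: true }))); // false
```

Real-world checks look at more than this one flag (plugins, languages, canvas rendering, and so on), which is why the "modified browser executable" route exists for the harder cases.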


Yeah, but this was before Ajax when you had to hand roll XML RPC calls (i.e. long long ago).


What do you mean by deals withheld for themselves?


Most flights are available through airline booking systems such as Sabre. However, airlines might have flights available only on their own website at (sometimes massively) reduced cost, which need to be booked through that site. So the web scraping became two parts: one provided the data to our search engine to present to the customers of our customers (travel agents); the second part booked via the airline's website with the details provided by our customer's customer.


A residential IP would help for IP based detection. As the Readme mentions, there's also Javascript based detection. If, for example, your browser has navigator.webdriver set incorrectly, then you can still get blocked even on a residential IP.


The point is that both can be required. You can have the most sophisticated user-emulating browser, but if all you have to run it on are low-quality IPs that have been blocked or that are often used for abuse, you won't get far. And you can have residential IPs, but if you're just wrapping curl, you might also find you're blocked.

Together, there's little to distinguish you from a regular user. The reason residential IPs get heavy emphasis, though, is that they're the one part that costs a lot of money: if you need enough of them, you need a proxy service, and you transfer a lot of data. Entry-level pricing is over $15/GB for high-quality services.
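To put that bandwidth pricing in perspective, a back-of-envelope calculation at the ~$15/GB entry-level figure above (the page volumes are illustrative):

```javascript
// Rough monthly cost of residential proxy bandwidth. All inputs are
// illustrative; only the ~$15/GB default comes from the comment above.
function monthlyProxyCostUsd(pagesPerDay, avgPageMb, usdPerGb = 15) {
  const gbPerMonth = (pagesPerDay * 30 * avgPageMb) / 1024;
  return Math.round(gbPerMonth * usdPerGb * 100) / 100; // round to cents
}

// e.g. 10,000 pages/day at 2 MB per page:
console.log(monthlyProxyCostUsd(10000, 2)); // 8789.06
```

At even modest scraping volumes the proxy bill dwarfs the compute bill, which is why this is the part people try hardest to economize on.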


Not anymore. Now it’s mobile IP addresses.


This! Mobile IPs are far more lucrative. Many services will drop captchas and other anti-bot stuff for consumer mobile IPs. I recall Plaid at some point would run their bank scraping through mobile IPs.

This sketchy company lets mobile app developers monetize their user base by letting other people pay $$ to route requests through random people’s mobile IPs: https://brightdata.com/


Brightdata was formerly known as Luminati, which is owned by the same company as Hola VPN.

Similarly, NordVPN owns Oxylabs (who mostly hack routers and cameras and sell those as residential IPs).


But they have a white paper on how ethical their proxies are! https://oxylabs.io/Oxylabs_Residential_Proxy_Acquisition_Han...


I never checked their business practices. But I use Luminati's rotating residential proxies for a lot of my scraping work.

Mostly to avoid hitting 'ddos protections' or other security bullshit that doesn't really make sense against one daily request or so.


This is really bad. Imagine if someone plants one of these proxies inside an app: how are users even going to know? I think every OS should come with a firewall, so if an app tries to make a connection it prompts with Accept | Accept Forever | Deny | Deny Forever.

I think these companies used to go after extension developers; now it seems they've found a new way to implant malware in apps, where it's not easy to detect.


This is EXACTLY how mobile proxy companies like Luminati and OxyLabs acquire their IP address pool. They pay devs to embed a lib inside their app.


They hack IP cams and routers too…


One of the reasons for this is that the vast majority of the time, mobile LTE data users are behind CGNAT for IPv4. You can't block one IP without possibly blocking hundreds of innocent users sharing the same exit point.

As a scraper operator on a mobile data connection, all you need is a new user agent and browser fingerprint; there's no easy way for a scraper-blocker operator to tell that you're not a totally new person.
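A sketch of why this works from the blocker's side: keying on IP alone over-blocks everyone behind the same CGNAT exit, so the blocker has to fold in client-side signals, and those are exactly what a scraper can refresh. (Field names and values here are illustrative; 100.64.0.0/10 is the CGNAT shared address space.)

```javascript
// A blocker that keys only on req.ip would ban every customer behind
// this CGNAT exit, so it has to include client-controlled signals.
function clientKey(req) {
  return [req.ip, req.userAgent, req.fingerprint].join("|");
}

const regularUser = { ip: "100.64.0.1", userAgent: "UA-1", fingerprint: "f1" };
const scraper     = { ip: "100.64.0.1", userAgent: "UA-2", fingerprint: "f2" };

// Same exit IP, but a fresh user agent + fingerprint yields a brand-new key.
console.log(clientKey(regularUser) !== clientKey(scraper)); // true
```

Anything the client controls can be rotated, so the blocker is stuck choosing between over-blocking shared IPs and under-blocking rotating scrapers.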


This is the reason why most services push an app instead of the browser. With an app, they can use lots of things, like a phone fingerprint derived from various sources.


Apple provides a framework, or some sort of unique ID.


Note that Apple explicitly tries to prevent apps from generating any sort of overall device fingerprint besides the ad-tracking identifier, which now requires user consent in iOS 14. You can still generate an app-scoped device ID though. (Not sure if these persist across re-installs of the same app or not.)


There are services that detect residential IPs being used for scraping nowadays. Plus there are other ways of detecting scraping: browser fingerprinting, aggressive rate-limiting, CAPTCHAs, etc.


Captcha-solving services are a thing. They can be as crude as something that takes a screenshot and sends the image to a click-farm worker getting paid $300 a month sitting in a cubicle in Bangladesh.

There are various captcha-solving services where you pay in bulk per captcha and submit the data via an API.
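The bulk-API flow usually amounts to submit-then-poll. Here is a hedged sketch where `transport` stands in for a vendor's HTTP endpoints; no real service's API is being shown:

```javascript
// Submit the captcha image, then poll until a human (or solver) answers.
// `transport.submit` and `transport.result` are illustrative stand-ins
// for a vendor's submit/result HTTP endpoints.
function solveCaptcha(transport, imageBytes, maxPolls = 10) {
  const taskId = transport.submit(imageBytes);
  for (let i = 0; i < maxPolls; i++) {
    const answer = transport.result(taskId); // typically null until solved
    if (answer !== null) return answer;
  }
  throw new Error("captcha not solved in time");
}
```

In practice each poll would sleep a few seconds between requests, since a human solver takes on the order of 10-30 seconds per captcha.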


Yup, such click farms exist. But driving up the costs and/or technical implementation effort for bots/scrapers can be part of your anti-bot strategy.


1000 captchas usually cost around a dollar or two. Honestly captchas never stopped me from anything.


It's always just a question of detecting these things and coding them in.

It's like writing a game bot with Java's Robot class and pixel detection. It may be inefficient, and it may take longer to build than a network-level solution, but I have yet to be detected anywhere.


Unless you use a residential IP proxy network.



