
The gold standard is residential IP. It is not cheap but its effectiveness is indisputable.


Back when we had to scrape airline websites to get the deals they withheld for themselves, residential IP was indeed the way. Once they cottoned on to it and blocked it, you'd simply cycle the ADSL modem, get a new IP, and off you'd go again.

Now the best part... one division (big team) of our company worked for the (national carrier) airline, and one division of our company worked for the resellers (we had a single grad allocated to web scraping). The airline threw ridiculous dollars at trying to stop it, and we just used a caffeine-fuelled nerd to keep it running. It wasn't all fun though; they'd often release their new anti-scraping stuff on a Friday afternoon. They were less than impressed when they learnt who the 'enemy' was. Good times!


Once you get to selenium it's usually over, just had to emulate a couple of heavy users with real browsers and voila.


>Once you get to selenium it's usually over, just had to emulate a couple of heavy users with real browsers and voila.

Can you say more about this? What do you mean by "Once you get to selenium it's usually over", and how do you manage cold starts in Selenium and emulate heavy usage?

Say your program starts right now; I assume you don't go through an "adding heavy usage" warm-up phase before getting down to business, correct?


Selenium and other tools in that class essentially just build an API on top of standard consumer browser engines. There are some differences that are difficult to completely hide, but it's about as close to real as it gets, and it can be very difficult if not impossible to tell that it's an automation framework rather than a standard web browser.

Travel information is also one of those services where it’s not weird for a significant number of their users to use it quite heavily, making behavioral detection more difficult.


By default Selenium exposes a few things in JS that are pretty trivial to detect, so you need to disable/hide that for starters. I don't know how easy or hard that is, but stock Selenium is a poor way to get around anti-bot stuff.


It's fairly trivial. Some stuff needs a modified browser executable or some JavaScript magic, but it won't take you longer than 2-5 hours to bypass most of the heuristics; disabling the window.navigator.webdriver flag and getting a residential IP is on its own enough to get single-click captchas every 1-2 tries, for example. To be fair, I haven't looked into it since 2019, but I doubt that it's gotten much harder.
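A minimal sketch of both sides of the navigator.webdriver check mentioned above; the function names are illustrative, and this is not any specific site's detection code:

```javascript
// Detection side: stock Selenium/ChromeDriver sets navigator.webdriver
// to true, so a page script can check the flag directly.
function looksLikeWebdriver(nav) {
  return nav.webdriver === true;
}

// Evasion side: automation code injects an override before any page
// script runs (e.g. via the Chrome DevTools Protocol method
// Page.addScriptToEvaluateOnNewDocument), so the check sees undefined.
function hideWebdriverFlag(nav) {
  Object.defineProperty(nav, "webdriver", { get: () => undefined });
  return nav;
}

// In a real page these would operate on the global `navigator` object.
console.log(looksLikeWebdriver({ webdriver: true }));                    // true
console.log(looksLikeWebdriver(hideWebdriverFlag({ webdriver: true }))); // false
```

Real-world checks look at more than this one flag (plugins, languages, canvas rendering, and so on), which is why the "modified browser executable" route exists for the harder cases.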


Yeah, but this was before Ajax when you had to hand roll XML RPC calls (i.e. long long ago).


What do you mean by deals withheld for themselves?


Most flights are available through airline booking systems such as Sabre. However, airlines might have flights available only on their own website at (sometimes massively) reduced cost, which need to be booked through that site. So the web scraping became two parts: one provided the data to our search engine to present to the customers of our customers (travel agents); the second part booked via the airline's website with the details provided by our customer's customer.


A residential IP would help for IP based detection. As the Readme mentions, there's also Javascript based detection. If, for example, your browser has navigator.webdriver set incorrectly, then you can still get blocked even on a residential IP.


The point is that both can be required. You can have the most sophisticated user-emulating browser, but if all you have to run it on are low-quality IPs that have been blocked or that are often used for abuse, you won't get far. And you can have residential IPs, but if you're just wrapping curl, you might also find you're blocked.

Together, there's little to distinguish you from a regular user. The reason residential IPs get heavy emphasis, though, is that they're the one part that costs a lot of money: if you need enough of them, you need a proxy service, and you transfer a lot of data. Entry-level pricing is over $15/GB for high-quality services.
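To put that bandwidth pricing in perspective, a back-of-envelope calculation at the ~$15/GB entry-level figure above (the page volumes are illustrative):

```javascript
// Rough monthly cost of residential proxy bandwidth. All inputs are
// illustrative; only the ~$15/GB default comes from the comment above.
function monthlyProxyCostUsd(pagesPerDay, avgPageMb, usdPerGb = 15) {
  const gbPerMonth = (pagesPerDay * 30 * avgPageMb) / 1024;
  return Math.round(gbPerMonth * usdPerGb * 100) / 100; // round to cents
}

// e.g. 10,000 pages/day at 2 MB per page:
console.log(monthlyProxyCostUsd(10000, 2)); // 8789.06
```

At even modest scraping volumes the proxy bill dwarfs the compute bill, which is why this is the part people try hardest to economize on.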


Not anymore. Now it’s mobile IP addresses.


This! Mobile IPs are far more lucrative. Many services will drop captchas and other anti-bot stuff for consumer mobile IPs. I recall Plaid at some point would run their bank scraping through mobile IPs.

This sketchy company lets mobile app developers monetize their user base by letting other people pay $$ to route requests through random people’s mobile IPs: https://brightdata.com/


Brightdata was formerly known as Luminati, which is owned by the same company as Hola VPN.

Similarly, NordVPN owns Oxylabs (who mostly hack routers and cameras and sell those as residential IPs).


But they have a white paper on how ethical their proxies are! https://oxylabs.io/Oxylabs_Residential_Proxy_Acquisition_Han...


I never checked their business practices. But I use Luminati's rotating residential proxies for a lot of my scraping work.

Mostly to avoid hitting 'ddos protections' or other security bullshit that doesn't really make sense against one daily request or so.


This is really bad. Imagine if someone plants one of these proxies inside an app: how are users even going to know? I think every OS should come with a firewall, so if an app tries to make a connection it prompts with Accept | Accept Forever | Deny | Deny Forever.

I think these companies used to go after extension developers; now it seems they've found a new way to implant malware in apps, where it's not easy to detect.


This is EXACTLY how mobile proxy companies like Luminati and OxyLabs acquire their IP address pool. They pay devs to embed a lib inside their app.


They hack IP cams and routers too…


One of the reasons for this is that the vast majority of the time, mobile LTE data users are behind CGNAT for IPv4. You can't block one IP without possibly blocking hundreds of innocent users sharing the same exit point.

As a scraper operator on a mobile data connection, all you need is a new user agent and browser fingerprint; there's no easy way for a scraper-blocker operator to tell that you're not a totally new person.
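A sketch of why this works from the blocker's side: keying on IP alone over-blocks everyone behind the same CGNAT exit, so the blocker has to fold in client-side signals, and those are exactly what a scraper can refresh. (Field names and values here are illustrative; 100.64.0.0/10 is the CGNAT shared address space.)

```javascript
// A blocker that keys only on req.ip would ban every customer behind
// this CGNAT exit, so it has to include client-controlled signals.
function clientKey(req) {
  return [req.ip, req.userAgent, req.fingerprint].join("|");
}

const regularUser = { ip: "100.64.0.1", userAgent: "UA-1", fingerprint: "f1" };
const scraper     = { ip: "100.64.0.1", userAgent: "UA-2", fingerprint: "f2" };

// Same exit IP, but a fresh user agent + fingerprint yields a brand-new key.
console.log(clientKey(regularUser) !== clientKey(scraper)); // true
```

Anything the client controls can be rotated, so the blocker is stuck choosing between over-blocking shared IPs and under-blocking rotating scrapers.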


This is the reason why most services push an app instead of the browser. With an app, they can use lots of things, like a phone fingerprint derived from various sources.


Apple provides a framework, or some sort of unique ID.


Note that Apple explicitly tries to prevent apps from generating any sort of overall device fingerprint besides the ad-tracking identifier, which now requires user consent in iOS 14. You can still generate an app-scoped device ID though. (Not sure if these persist across re-installs of the same app or not.)


There are services that detect residential IPs being used for scraping nowadays. Plus there are other ways of detecting scraping: browser fingerprinting, aggressive rate-limiting, CAPTCHAs, etc.


Captcha-solving services are a thing. They can be as crude as something that takes a screenshot and sends the image to a click-farm worker getting paid $300 a month sitting in a cubicle in Bangladesh.

There are various captcha-solving services where you pay in bulk per captcha and submit the data via an API.
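The bulk-API flow usually amounts to submit-then-poll. Here is a hedged sketch where `transport` stands in for a vendor's HTTP endpoints; no real service's API is being shown:

```javascript
// Submit the captcha image, then poll until a human (or solver) answers.
// `transport.submit` and `transport.result` are illustrative stand-ins
// for a vendor's submit/result HTTP endpoints.
function solveCaptcha(transport, imageBytes, maxPolls = 10) {
  const taskId = transport.submit(imageBytes);
  for (let i = 0; i < maxPolls; i++) {
    const answer = transport.result(taskId); // typically null until solved
    if (answer !== null) return answer;
  }
  throw new Error("captcha not solved in time");
}
```

In practice each poll would sleep a few seconds between requests, since a human solver takes on the order of 10-30 seconds per captcha.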


Yup, such click farms exist. But driving up the costs and/or technical implementation effort for bots/scrapers can be part of your anti-bot strategy.


1000 captchas usually cost around a dollar or two. Honestly captchas never stopped me from anything.


It's always just a question of detecting these things and coding them in.

It's like writing a game bot with Java's Robot class and pixel detection. It may be inefficient, and it may take longer to build than a network-level solution, but I have yet to be detected anywhere.


Unless you use a residential IP proxy network.



