> I need to make a general remark to people who are evaluating and/or planning to introduce anti-bot software on their websites. Anti-bot software is nonsense. It's snake oil sold to people without technical knowledge for heavy bucks.
If this guy got to experience how systemically bad the credential stuffing problem is, he'd probably take down the whole repository.
None of these anti-bot providers give a shit about invading your privacy, tracking your every movement, or whatever other power fantasy can be imagined. Nobody pays those vendors $10m/year to frustrate web crawler enthusiasts, they do it to stop credential stuffing.
I wish they'd limit it to just stopping credential stuffing.
Here's my scenario: my electricity provider publishes the month's electricity rates on the first of the month, and I want to scrape these so that I can update the prices in Home Assistant. This is a very simple task, and it's something that Home Assistant can do with a little configuration. Unfortunately, this worked exactly once; after that it started serving up some JavaScript to check my browser.
The information I'm trying to get is public and can be accessed without any kind of authentication. I'm willing to bet that they enabled the anti-bot feature on their load balancer for the entire site instead of doing the extra work to enable it only for electricitycompany.com/myaccount/ (where you do have to log in).
I also asked the company if they'd be willing/able to push the power rates out via the smart meters so that my interface box (Eagle-200) could pick it up, they said they have no plans to do so.
The next step is to scrape the website of the provincial power regulator, which shows the power rates for each provider. Of course, the regulator's site has different issues (rounding, in particular), so I haven't dug any further to see whether I can make use of it.
All of this effort to get public information in an automated fashion.
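For what it's worth, the scrape itself is about as simple as it gets; a minimal sketch of the pre-JavaScript-challenge version (the URL and CSS selector are placeholders, not the provider's real site):

```python
# Hypothetical sketch: fetch the public rates page and pull out the posted rate.
# URL and selector are made up; adapt to the real page structure.
import requests
from bs4 import BeautifulSoup

RATES_URL = "https://www.electricitycompany.example/rates"

def fetch_monthly_rate() -> float:
    resp = requests.get(RATES_URL, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Assumes the rate appears in an element like <span class="rate-kwh">10.730</span>
    cell = soup.select_one(".rate-kwh")
    return float(cell.get_text(strip=True))

if __name__ == "__main__":
    print(f"Current rate: {fetch_monthly_rate()} cents/kWh")
```

Once the site starts answering with a JavaScript challenge instead of the page, this kind of two-line fetch stops working, which is exactly the frustration above.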
At a minimum any scraper that doesn't execute JS needs to impersonate a screen reader user agent. Locking out disabled people has to be many levels of illegal in most countries.
As a disabled person myself, I will go as far as to suggest that websites should allow unfettered bot access for the disabled, i.e. anything a user without malicious intent is allowed to do on a platform should also be allowed to be done by a bot on behalf of a disabled person, because accessibility and equity as they stand are a joke.
Social media platforms have made physical appearance the first-class citizen of the reputation economy. I'm not even talking about those platforms which outright bury content from the disabled as a policy; I'm talking about those platforms whose algorithms favor selfies and videos over text/URLs, thereby putting those with accessibility issues at a severe disadvantage.
"Why would you use such platforms?" one might say; "do something which has nothing to do with the reputation economy," they might add. Well, have you looked at LinkedIn lately? LinkedIn has become synonymous with professional job searching, and a 30-second video intro is the very first thing on a profile, not the skills the platform was built around when it launched. One must be naive to claim that physical appearance in that video or profile picture doesn't affect job prospects (several studies have shown otherwise).
It's not just physical appearance: the act of creating videos or posting photos is itself hard for a time-constrained person[1], so I think it's reasonable to ask the platforms to allow bots to post deep-fake videos of the user doing the silly things these platforms expect from an average user.
Blocking for not supporting JS isn't illegal, nor is it a violation of the US ADA. You can add requirements for disabled people to access your services as long as they're reasonable, and the prevalence of screen readers that work with JS enabled likely makes requiring JS a reasonable request. It'd be like saying "you can't deny someone using an IE8 screen reader by only offering TLS 1.3".
> Unfortunately, the days of reliable non-JavaScript capable scraping are over.
Not really. In a lot of cases websites use JavaScript to call some API along with an on-the-fly generated token to prevent abuse.
As long as that token isn't a CAPTCHA, you can reverse engineer the site to scrape without JavaScript, and that is so much faster than browser-based scraping.
I agree with this. This is what I see on a lot of sites I scrape. Reverse engineering the JS to figure out how the fuck the token was generated is a bitch though.
So then you use a headless browser to render the JS, which is even hackier, but it's totally worth the cost of one more full webpage request to get the token so you can go back to plain requests.
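Roughly what that hybrid approach looks like in practice; a sketch using Selenium with headless Firefox to pick up the cookies/token, then plain requests for everything else (the URLs are illustrative, not any particular site):

```python
# Render the page once in a headless browser to pick up the anti-bot cookies,
# then reuse them with plain requests until they expire.
import requests
from selenium import webdriver

PAGE_URL = "https://example.com/search"   # page whose JS generates the token/cookies
API_URL = "https://example.com/api/data"  # the JSON endpoint you actually want

opts = webdriver.FirefoxOptions()
opts.add_argument("-headless")
driver = webdriver.Firefox(options=opts)
try:
    driver.get(PAGE_URL)          # runs the JS that sets the cookies
    cookies = driver.get_cookies()
finally:
    driver.quit()

session = requests.Session()
for c in cookies:
    session.cookies.set(c["name"], c["value"])

# From here on it's fast, plain HTTP until the token expires,
# at which point you spin the browser up again.
print(session.get(API_URL, timeout=30).json())
```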
I don't think the token is the only thing that comes into play here. If the company wants, they can use various other techniques like browser fingerprinting, TLS fingerprinting, and a lot more.
It's just a cat-and-mouse game. In a few years I think hardware attestation etc. will come into play, which can mitigate the bot issue somewhat.
I'm not sure what rate you are trying to get, but the electric market in the US has 5-minute settlement periods. So for your region you would need to grab the price for each period and average them to get a power rate. Take that rate, add transmission fees, taxes, and the various other fees your provider tacks on, then multiply by usage. In Texas you can go directly to the ERCOT site and get these prices and not worry about countermeasures. I'm not sure where you are, but there is likely a similar wholesale site that you can access.
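For reference, the arithmetic being described is roughly this (all numbers below are made up for illustration):

```python
# Hypothetical numbers: average the 5-minute settlement prices, then layer on
# the provider's pass-through fees and multiply by usage.
settlement_prices_mwh = [28.50, 31.20, 25.75, 40.10]   # $/MWh, one per 5-minute period

energy_per_kwh = (sum(settlement_prices_mwh) / len(settlement_prices_mwh)) / 1000  # $/kWh
transmission_per_kwh = 0.035    # placeholder
taxes_and_fees_per_kwh = 0.012  # placeholder
usage_kwh = 900                 # placeholder monthly usage

estimated_charge = (energy_per_kwh + transmission_per_kwh + taxes_and_fees_per_kwh) * usage_kwh
print(f"Estimated energy charge: ${estimated_charge:.2f}")
```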
I had never heard of this, but it looks like a reasonable option. My go-to for this type of thing would be Python+Selenium+Firefox, but only due to familiarity with those.
Playwright is easy to get started with. They even have tools that allow you to record your browser actions and convert them into code ( https://playwright.dev/ ).
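A minimal Playwright sketch, similar in shape to what `playwright codegen` records (the URL and selector are placeholders):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.firefox.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.electricitycompany.example/rates")  # placeholder URL
    rate_text = page.inner_text(".rate-kwh")                   # placeholder selector
    print("Current rate:", rate_text)
    browser.close()
```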
Out of curiosity, how is it that you have electricity rates that change every month? Are you buying power through a third-party organization? The vast majority of places I've seen have a fixed tariff for residential use that changes no more often than every 12-24 months.
In some cases it is because the consumer has opted for variable rates, essentially making a bet that, net-net, variable rates will be less expensive than fixed rates, or that they would be able to shift usage away from spikes. See: https://www.texastribune.org/2021/02/22/texas-pauses-electri...
Feb 2021 FT: "Bills mount in Texas power market after freeze sends prices soaring: Financial casualties emerge as grid operator Ercot requires billions in payments"
In Alberta, the electricity system has been deregulated, so you can buy from numerous providers. The Utilities Consumer Advocate shows 187[0] different electricity plans available in my city. My current plan and provider change rates monthly, but some providers allow you to sign up for 3-year or 5-year fixed-rate plans.
Thanks to deregulation, the electricity rate isn't the only thing you pay for though. There is also a Transmission Charge, Distribution Charge, and Local Access Fee. These are all per-kWh charges and change very rarely.
In October, my electricity rate is $0.10730/kWh, but my total cost is actually $0.16346/kWh plus the per-day charge ($0.202/day). Tomorrow the November rate will be published.
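To make the gap concrete, here is the October math with an assumed usage figure (only the usage and billing-period length are made up; the rates are the ones above):

```python
energy_rate = 0.10730   # $/kWh, the published October energy rate
all_in_rate = 0.16346   # $/kWh, including transmission, distribution, local access
daily_charge = 0.202    # $/day
usage_kwh = 600         # assumed monthly usage
days = 31

print(f"Energy-only portion: ${energy_rate * usage_kwh:.2f}")
print(f"Estimated total:     ${all_in_rate * usage_kwh + daily_charge * days:.2f}")
```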
This also depends on the country. Where I live (Europe) the rate now changes by the minute, or thereabouts. That was made possible after everybody had to change to wireless meters. Sometimes you'll get a warning in advance - a newspaper may write "If you live here or here, don't do your cooking at this particular hour". Some providers still have fixed-rate options, some apparently don't. What I dislike the most is that they're trying to force us to run our washing during the night, something the insurance company and the fire department warn intensely against. And I don't want to be asleep if a fire starts (which happens here and there through the year). But that's what the pricing scheme tries to enforce.
I’m curious to see the stats they are relying on and the communications materials the fire department and insurance company are using on this topic.
It seems to me from a life-safety angle that their energy would likely be far better spent on recommending smoke alarms, CO meters, and periodic cleaning of dryer vents than on recommendations against sleeping with washing/drying machines running.
They do all of that as well, of course. The problem is that when it does happen (and statistically, it will, somewhere, at some point) there's a chance you don't hear the alarm (very common - just this morning there was a newspaper story about someone who was saved by the neighbours; they didn't wake up right away even with all the alarms blaring. What the alarms did, though, was alert the fire department, as per their setup).
In short - if there's a fire it's much better that you're awake and up already.
Octopus Energy in the UK has a tariff that charges half-hourly rates. They also offer an API to interrogate current pricing and usage and encourage their customers - including domestic users - to take advantage of it:
https://developer.octopus.energy/docs/api/
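As a rough sketch of how much simpler that makes things compared to scraping, listing their products is a couple of lines with plain requests. The endpoint path and response fields below are from memory of their public docs (linked above), so treat them as approximate and check the docs:

```python
import requests

BASE = "https://api.octopus.energy/v1"

# Product listing is, as I recall, available without authentication.
products = requests.get(f"{BASE}/products/", timeout=30).json()
for product in products.get("results", [])[:5]:
    print(product.get("code"), "-", product.get("display_name"))
```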
Bots aren't just trying credential stuffing. They are:
- committing clickfraud to game ad and referral revenue systems
- posting fake or spam reviews and comments
- generating fake behavioral signals to bypass CAPTCHAs so they can create accounts on other sites and post spam comments
- validating stolen credit card details
- screwing with your metrics collection if you can't identify them as bots
All of that is enough reason for sites to use bot detection and blocking technology. The fact that the same tech also has some utility against accidental or malicious traffic-based DoS is also a bonus.
To be clear, "validating" is an industry euphemism for stealing, just for a different purpose. How do you validate the card is live? Run a real transaction through it and mark it based on the result. But what do you run for this real transaction? Well, whatever you want. Typically it'll be something to avoid suspicion as much as possible, but the thief gets to pick what they test it with, so why not pick something that they'll personally benefit from? There are so many online games with purchasable currency these days, it's hard to choose.
Not exactly. Some people are in the business of gathering and selling valid credit cards.
They won't cash out on them or buy items. Instead, they'll collect cards from a source (skimming, hacking, whatever), validate them by adding them to a website that does an authorization (those $1 checks that never get committed). They can then sell them wholesale for a premium compared to non-verified cards.
Oh, for sure. It's definitely not either/or. I've worked in fraud prevention software in the past and our clients would definitely see both.
I've been away from that world for a while but remember that more serious operations will separate the cashing out part (either money or goods) from their acquiring / validating operation because the former carries more risk.
There's also an interesting episode of the darknet diaries podcast (https://darknetdiaries.com/episode/85/) about card cloning which I found interesting.
Doing tiny transactions to validate cards is a thing. I've seen this happen, it's a known problem, and that at other times it's larger transactions does not make it go away.
> How do you validate the card is live? Run a real transaction through it and mark it based on the result.
Not necessarily. Some online services, particularly shops that do home delivery, may give their customers the possibility of adding a card to their wallet and perform a verification as part of the process. As a result, it becomes possible to validate a stolen card number without performing an actual transaction.
My bank called as they had red-flagged a suspicious charge on my credit card. The sum was $0. The bank rep told me that this often flies under the radar and doesn't show up in some records, and indeed it did not show up in the transaction history I could see online in my banking details. But yeah, the point was exactly the same: the fraudsters testing whether the charge goes through and the card is alive/valid.
Did you verify that it really was the bank calling and not a scammer? I get calls and texts from “credit card fraud departments”, “banks”, “service warranty departments”, “Amazon billing”, “Microsoft security”, “Social Security Administration”, “IRS”, and others frequently; 95% are scammers.
The most amusing to me are the ones from “Microsoft” to alert me that they have detected malware on my computer.
If you're trying to fight financial crime it's important not to oversimplify behavior seen in the wild. Like the other commenter noted, there is a very clear difference in behaviour between those validating cards and those using cards in order to steal.
All very valid reasons (except, to a degree, the "screwing with metrics" one). There are a lot of websites which do not really face any of the aforementioned issues, simply because they do not sell you anything, are not an ad network, do not run referral programs, do not even have user-generated content, etc. And even for the sites that do, it's usually a rather small part of the "surface" that needs such protection, e.g. the actual API call to complete a checkout or post a comment.
And yet if you dare browse the web with TOR or a VPN or sometimes just happen to be on a small ISP[0][1] then you're being punished immediately. You solve your cloudflare-supplied captcha because you may be a bot (you're not, and the dangerous bots will not be defeated by this anyway, but some humans will be), and then you get an error from the website itself because it runs a secondary bot detection thing. And you weren't even anywhere near anything "dangerous" like a checkout page.
[0] My parents use a regional small ISP (but locally very popular) that serves around 50K customers. My parents also use a regional bank (a Volksbank, and those are members of a national association that provides all kinds of services). Suddenly that bank would not even let them see the bank's front page. After some back and forth on the phone support line it turned out the bank had recently deployed some "advanced" bot detection, one that had a whitelist of residential-ISP-associated AS/IP ranges, and of course whoever compiled and maintained that list had forgotten to include that small local ISP. For that regional bank it meant they had just shut out a very significant number of their customers (and potential customers just trying to look up what the bank offers), as there was very likely a huge overlap between people using that regional ISP and that regional bank (both are regional, after all). It also was something they couldn't fix themselves, as the "online banking" stuff was not in-house but was run by the national association (which probably used some bot-detection-as-a-service provider). It took the bank (or rather, the national association) a few weeks to fix. Mind you, for the last few years that bank has been heavily marketing a cheaper "online only" account: only online banking, online support, and access to the self-serve ATMs and banking terminals, but no face-to-face or even ear-to-ear human interaction. Try contacting "online support" about "the website outright refuses me" when the "online support" is only available on that website. Kudos if you're smart enough to switch to their mobile app, as your phone uses a different ISP, unless you forget to turn off wifi. That's the advice my parents got from the phone support (they still have an account type that is not online-only).
[1] When I recently visited my parents, from their wifi [same ISP as in 0] I couldn't open the website of a bakery to look up if and when they would be open on a Sunday. Some error message about "this website is not available in your network" (English text, for a German bakery... suspicious :P). I could open it via my mobile, though. I could open it from my regular ISP when back home (in another city) again. Mind you, that website is purely informational and has no "interactive" features, let alone lets you buy anything. It's just static text and some pictures.
It doesn't help that VPN providers in the consumer space pay small ISPs to front their traffic so that Netflix et al. don't drop it for coming from a datacenter, while the VPN providers still get to keep up the false pretenses and get valued at billions of dollars.
Smaller ISPs are also more likely to have issues with CPE getting compromised and routers running botnets within the comfort of your home. There are services which invite people to sell their residential bandwidth in return for money, this can potentially have a disproportionate impact at smaller sample sizes.
Networks can declare themselves to be ISPs. You could check if your ISP shows up as an ISP in peeringDB.
Yeah, but that's not what happens with this small German ISP my parents use, which was started because some people and local businesses got really disgruntled at the pricing policy and lack of true broadband from Deutsche Telekom. The thing is essentially run like a non-profit: the little profit they make is meant to be reinvested into the company, not to make some investors happy, and it is majority owned by the city-owned municipal utilities company, with the rest of the ownership, I believe, in the hands of some local businesses (some of which are quite large) who needed broadband but couldn't get it, or only at astronomical prices; they are their own customers and thus not very much inclined to fuck themselves.
They are reportedly very proactive when it comes to CPE security as well, up to giving customers a proactive phone call when they see somebody is using equipment with known vulnerabilities (customers are allowed to operate their own equipment as long as it is deemed compatible; most will use remotely managed equipment, though, I believe; my dad used to use his own DSL router and once got such a call, if I remember correctly. He has since switched over to their fiber and managed equipment).
Their AS is indeed identified as an "ISP" in peeringDB.
While I would be extremely surprised if the company was doing shady things, you've surely got a point that a small ISP like that could suffer more reputational damage from a few customers being up to shady things, including subletting the line. I am pretty sure that is against the ToS, but enforcement is a problem, of course. Especially detecting such traffic without violating German privacy laws is probably a difficult task, but it's not impossible.
Yeah, I used to work for one of the major anti-bot vendors. Customers weren't clueless. Nobody buys these solutions because they're so much fun; it's a cost center and they monitor their ROI quite closely. Credit card chargebacks, impact on infrastructure, extra incurred cost from underlying APIs (like in the airline industry in particular), etc. are all reasons why bot mitigation is a better option than nothing for a lot of companies, even if it's not 100% effective.
You very much missed the false positive rate! I'm fed up of being classed as a bot just because I browse with uMatrix, a Linux user agent, and a ton of ad filtering and anonymisation tech. I had to try to log in to my bank about ten times today because their js-crap website didn't like me (grumble why does it even need to ask for my desktop's accelerometer data via js...)
Stuff like this is a pain beyond pain. I really hope that the clients you mention know that they piss off a proportion of their users with every move they take.
> I really hope that the clients you mention know that they piss off a proportion of their users with every move they take.
With all due respect, if the tech can make a large impact on the problems mentioned above, I'm sure it's an easy decision for the big companies to take decimating bot activity over the tiny minority of users who proactively decide to disable JavaScript.
You could always go into your local bank branch instead of accessing it over the Internet. Your desktop's accelerometer data helps add to your computer's 'run by a human' score. Normally I'd take more issue with whatever possible privacy issue there is, but my bank is where I keep my money, so I'm really okay with them trying hard to keep bots out of my account.
The physical presence of banks is going away. Where I live you can't do any kind of monetary transaction in the local branch offices of any of the banks anymore. You can a) apply for a loan (and even that may go away soon), and b) identify yourself and get a physical token used for accessing the bank via the net. You can't withdraw money, you can't pay bills, you can't exchange currency. I haven't been inside my bank for many years, there's just nothing I can do there. The last time I visited the bank was with my wife (an immigrant), with her documents, to get her into the system.
Different customers have different attitudes towards this. Some of them are _very_ focused on conversion and will disable anything which causes additional user friction. For others, the economic damage of bots is just so painful that it makes economic sense for them to add friction for a few percent of users.
I'm a linux user myself, so I know for a fact that neither my previous employer, nor other bot vendors, will block linux user agents in particular. Customers generally don't mind a universal requirement for JS execution, so that's just a fact of life. We generally did try to avoid blocking privacy focused browsers, though. We certainly monitored false positive rates and knew pretty well how we affected users.
Cloudflare is pretty guilty of this if you use some more exotic approaches to request info. How often have I seen their captcha that is intended for bots...
>I'm fed up of being classed as a bot just because I browse with uMatrix, a Linux user agent, and a ton of ad filtering and anonymisation tech.
Have you tried not using these things? Anonymity is exactly what bots want. They want to be able to post a spam message every single second and be impossible to ban since they are anonymous. The internet can't function if people are allowed to be anonymous.
Okay let's go back to before I was born when people still used IRC and let's say you hated someone else's IRC server. You can just use a program to flood their server with garbage messages. In order to try and stop this spam they first try and deanonymize where this traffic is coming from. This can be done by looking at the IP that these bots are coming from. Now they can gline you and the flood ends. Now let's say the internet didn't leak your IP deanonymizing you. What are they to do? They essentially are forced to lock down the server and whitelist it. They can not allow anonymous users to join or else risk just being flooded.
Stopping abuse has always been a game of trying to deanonymize users in order to try and ban the harmful ones.
With many of these big anti-bot services like Google ReCaptcha, it's not even specialized anonymity tools that can cause shadow banning, just unusual user-agents.
All of these have independently caused me to get into endless ReCaptcha loops: firefox on android, smartphone with unusual screen resolution, clean browser profile with VPN.
It's so common that I now default to using duckduckgo, which never blocks me. I doubt DDG has a lower DDoS/Resources ratio than Google. Some companies are just lazier and less principled than others.
This is not quite up there with "won't someone think about the children!!!!", but still, it's sad.
Fortunately, almost all of the websites I visit with my anonymized browser aren't places where I wish to post a message. Unfortunately, I can easily run into site-wide defenses when the actual problem is spam posting.
To kinda tweak this since people do tend to like their anonymity, "Do you have to be anonymous to all parties, all the time?"
The parent poster trusts his bank, and his bank would trust him once it knows he's not a fraudster, so maybe it's in everyone's interest to just allow the JavaScript for that one site.
Not to mention a lot of these bots are after scamming the company’s own customers. Breaking into accounts to commit fraudulent activity, to reach out and “recruit” people into whatever scam they are trying to run.
Nobody wants to spend time trying to stop these bots. It is, however, a very necessary thing to do.
I've noticed most sites won't let you search business fares efficiently, so I made my own tool for Google Flights, which only worked for about 6 months until they added a bunch of changes that made it near impossible to scrape.
Yeah, there's a central service that all flight search is connected to, regardless of airline. The airlines are charged per search to that API, so they monitor their "look to book" ratio very closely. That ratio remains quite stable in the absence of bots, but skyrockets with any bot activity. Hence, they know from that metric how big of a bot problem they have and how much money they are losing. Major flight search software vendors have dedicated teams for this.
In fact the airlines are charged per booking, but if and only if the look-to-book ratio stays within reasonable bounds. If it rockets up, they're on the hook for the penalties.
Saying that anti-bot software is nonsense is like saying that door locks are snake oil too. We've all seen the Lockpicking Lawyer on YouTube opening any lock out there with ease, so how come we haven't all been robbed yet?
Well, because protection is not a binary thing - either 100% safe or 100% not working - instead it's a trade-off between the skill/effort/time needed to break in and the reward you get for it.
To stop the majority of attacks you don't have to be absolutely unbreakable, you just need to make it hard enough for the majority of attackers that it doesn't pay off for them compared to the value of the data you're protecting. And that's where anti-bot software has its place: it slows down spiders and global attacks, forcing custom-tailored scraping that has to be constantly fine-tuned, plus infrastructure to hide your IPs, and that makes the operation way more expensive and harder to run continuously...
Back when we had to scrape airline websites to get the deals they withheld for themselves, a residential IP was indeed the way. Once they cottoned on to it and blocked it, you'd simply cycle the ADSL modem, get a new IP, and off you'd go again.
Now the best part... one division (big team) of our company worked for the (national carrier) airline, one division of our company worked for the resellers (we had a single grad allocated to web scraping). The airline threw ridiculous dollars at trying to stop it, and we just used a caffeine-fueled nerd to keep it running. It wasn't all fun though; they'd often release their new anti-scraping stuff on a Friday afternoon. They were less than impressed when they learnt who the 'enemy' was. Good times!
>Once you get to selenium it's usually over, just had to emulate a couple of heavy users with real browsers and voila.
Can you say more about this? What do you mean by "Once you get to selenium it's usually over", and how do you manage cold starts in Selenium and emulating heavy usage?
Say your program starts right now - I assume you don't go through "adding heavy usage" as a warm-up before getting down to business, correct?
Selenium and other tools in that class essentially just build an API on top of standard consumer browser engines. There are some differences that are difficult to completely hide, but it's about as close to real as it gets and can be very difficult, if not impossible, to tell it's an automation framework rather than a standard web browser.
Travel information is also one of those services where it’s not weird for a significant number of their users to use it quite heavily, making behavioral detection more difficult.
By default Selenium exposes a few things in JS that are pretty trivial to detect, so you need to disable/hide that for starters. I don't know how easy or hard that is, but stock Selenium is a poor way to get around anti-bot stuff.
It's fairly trivial. Some stuff needs a modified browser executable or some JavaScript magic, but it won't take you longer than 2-5 hours to bypass most of the heuristics; disabling the window.navigator.webdriver flag and getting a residential IP is on its own enough to get single-click captchas every 1-2 tries, for example. To be fair I haven't looked into it since 2019, but I doubt that it's gotten much harder.
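For illustration, the kind of tweaks being described look roughly like this with Selenium and Chrome; this only papers over the most obvious heuristics, and the flags/CDP call below are the commonly cited ones rather than any guaranteed bypass:

```python
from selenium import webdriver

opts = webdriver.ChromeOptions()
opts.add_argument("--disable-blink-features=AutomationControlled")
opts.add_experimental_option("excludeSwitches", ["enable-automation"])

driver = webdriver.Chrome(options=opts)
# Overwrite navigator.webdriver before any page script runs.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"},
)
driver.get("https://example.com")
```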
Most flights are available through airline booking systems such as Sabre. However, airlines might have flights available only on their own website at (sometimes massively) reduced cost, which need to be booked through that site. So the web scraping had two parts: one to provide the data to our search engine to present to our customers (travel agents), and a second in which we would then book via the airline's website with the details provided by our customer's customer.
A residential IP would help for IP based detection. As the Readme mentions, there's also Javascript based detection. If, for example, your browser has navigator.webdriver set incorrectly, then you can still get blocked even on a residential IP.
The point is that both can be required. You can have the most sophisticated user-emulating browser, but if all you have to run it on are low-quality IPs that have been blocked or that are often used for abuse, you won't get far. You can have residential IPs, and if you're just wrapping curl, you might also find you're blocked.
Together, there's little to distinguish you from a regular user. The reason residential IPs are given such heavy importance, though, is that they're the one part that costs a lot of money: if you need enough of them, you need a proxy service, and you transfer a lot of data. Entry-level pricing is over $15/GB for high-quality services.
This! Mobile IPs are far more lucrative. Many services will drop captchas and other anti-bot stuff for consumer mobile IPs. I recall Plaid at some point would run their bank scraping through mobile IPs.
This sketchy company lets mobile app developers monetize user base by letting other people pay $$ to route requests through random people’s mobile IPs: https://brightdata.com/
This is really bad. Imagine if someone plants one of these proxies inside an app - how are users even going to know? I think every OS should come with a firewall, so if an app tries to make a connection it prompts with Accept | Accept Forever | Deny | Deny Forever.
I think these companies used to go after extension developers; now it seems they have found a new way to implant malware in apps, which is not easy to detect.
One of the reasons for this is that the vast majority of the time, mobile LTE data users are behind CGNAT for IPv4. You can't block one IP without possibly blocking hundreds of innocent users behind the same exit point.
As a scraper operator on a mobile data connection all you need is a new useragent and browser fingerprint, there's no easy way for a scraper-blocker-operator to tell that you're not a totally new person.
This is the reason why most services push an app instead of the browser. With an app they can use a lot of things, like a phone fingerprint derived from various sources.
Note that Apple explicitly tries to prevent apps from generating any sort of overall device fingerprint besides the ad tracking identifier one which now requires user consent in iOS 14. You can still generate an app-scoped device ID though. (Not sure if these persist across re-installs of the same app or not)
There are services that detect residential IPs being used for scraping nowadays. Plus there are other ways of detecting scraping: browser fingerprinting, aggressive rate-limiting and CAPTCHAs etc.
Captcha solving services are a thing, can be as crude as something that takes a screenshot, sends an image to a click farm worker getting paid $300 a month sitting in a cubicle in Bangladesh.
There's various captcha solving services where you pay in bulk per captcha and submit data via an api.
Yup, such click farms exist. But driving up the costs and/or technical implementation efforts for bots/scrapers can be a part of your anti-bot strategy.
It's always just a question of detecting these things and coding workarounds in.
It's like writing a game bot with Java's Robot class and pixel detection. It may be inefficient and may take longer to build than a network-level solution, but I have yet to be detected anywhere.
I've always thought credential stuffing and most password hacking attempts could be defeated by simply logging into randomly generated dummy accounts if the password is wrong. Just make it so that the same username / password combo takes you to the same random info. Real users should notice things were wrong immediately but bots would have no way to tell unless they already knew some of the real information.
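A minimal sketch of that idea: derive the decoy account deterministically from the wrong credentials, so the same wrong combo always lands on the same fake data (everything here is illustrative):

```python
import hashlib
import hmac

SERVER_SECRET = b"rotate-me"  # placeholder; keep out of source control

def decoy_account(username: str, password: str) -> dict:
    seed = hmac.new(SERVER_SECRET, f"{username}:{password}".encode(), hashlib.sha256).hexdigest()
    return {
        "display_name": f"user_{seed[:8]}",
        "balance_cents": int(seed[8:12], 16) % 10_000,  # stable but meaningless
        "is_decoy": True,                               # never expose this flag to the client
    }

# The same wrong combination always yields the same decoy:
assert decoy_account("alice", "hunter2") == decoy_account("alice", "hunter2")
```

A real user who fat-fingers their password would notice immediately that nothing looks right, while a bot that only checks "did the login succeed" would not.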
> Anti-bot software is nonsense. It's snake oil sold to people without technical knowledge for heavy bucks.
I disagree. Obviously there is no way to 100% stop scraping, but for a rather small amount of money you can implement some measures that make it harder. Services like https://focsec.com/ offer ways to detect web scrapers using proxies/VPNs (one of the most common techniques) for little money.
> Nobody pays those vendors $10m/year to frustrate web crawler enthusiasts, they do it to stop credential stuffing.
Keep in mind that they may be legally or contractually forced to do this. Think of Netflix who are investing heavily into their Anti-VPN capabilities, most likely because they have contracts with content publishers & studios that force them to do so.
If users using weak/reused passwords is your problem, just don't let users choose a password (generate it for them), or don't use passwords at all (send link by e-mail that adds a cookie), or use oauth login.
Link-only login is the most underused security option, even more so for low-profile sites that need a minimal user account but do not really need full-on security.
I.e. what Facebook does if you don't log in for long enough. Two days ago I got a pair of messages to the same address with links to completely bypass login, and verbiage about how they'd seen I was having trouble logging in, followed by an SMS with the same to a phone number they're not supposed to be using. It looks a lot like phishing, but it comes out of Facebook's servers and they've done it to me before.
I'm not the poster you're replying to, but: Facebook asks for your phone number for security/account-recovery reasons, but then turns around and uses it to market to you.
Basically. Most websites already make you login with an email and verify you have access to that email and use the email as a password recovery mechanism. May as well just use the email itself as the login.
Of course if everyone did this, then all of your logins would have the same password (your email login).
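For reference, a minimal sketch of the magic-link flow being discussed; storage, e-mail sending, and framework wiring are left out, and all names and URLs are illustrative:

```python
import secrets
import time

LINK_TTL = 15 * 60   # seconds a link stays valid
pending = {}         # token -> (email, expires_at); use a real store in production

def start_login(email: str) -> str:
    token = secrets.token_urlsafe(32)
    pending[token] = (email, time.time() + LINK_TTL)
    return f"https://example.com/login/confirm?token={token}"  # e-mail this link to the user

def confirm_login(token: str):
    email, expires_at = pending.pop(token, (None, 0.0))
    if email is None or time.time() > expires_at:
        return None   # invalid, reused, or expired link
    return email      # success: set a session cookie for this user
```

Since there is no password, there is nothing to stuff; the account is only as weak as the mailbox behind it, which is the point made above.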
> None of these anti-bot providers give a shit about invading your privacy, tracking your every movements, or whatever other power fantasy that can be imagined.
There is a vast amount of profit available in doing just that (see e.g. GOOG and FB market cap). Even companies that truly have no intention of exploiting data collected as a side effect of whatever product line they run nearly always end up going for that profit line eventually, because that extra income is too much of a temptation for a company to keep passing up on moral grounds in the long term.
The credential stuffing wiki page didn't exist the last time I thought about invalid traffic so I'm pretty out of date.
How is there not an equilibrium here that cuts off credential stuffers? I'd naively imagine the residential IP providers have some measure of bad actors they themselves use to determine if a client is worth it, and that someone getting all your IPs blacklisted would get dropped pretty quickly.
In reality residential US ISPs don't really care if their users are getting a sub-par experience since they're often the only fast/fiber provider in the area of their customers, meaning customers have no way to switch. Plus, when a website doesn't work, unless the page itself calls out the ISP (which they never do), customers will think it's an issue with the website and won't possibly attribute blame to their ISP until they're deep in forum threads with people telling them "it's probably your ISP not doing anything about bad customers" - the amount of users going so far to learn this information, then accepting it, is extremely low.
On a site I used to run, there was no content that needed protection, so it was not much of a pain except that there would be a lot of bot-filled contact forms. Slowly the problem became severe enough that bandwidth fees started becoming an issue. Finally I had to put Cloudflare in front to reduce bandwidth usage. It worked, but the side effect was that some valid users may now get blocked.
I know how bad this issue is, and I wouldn't take down this repository. Anti-bot software does not work, anyone who pays 10m per year to have it simply has too much money.
If your password db is so broken that it's useful to create a term to abstract attacks ("credential stuffing"), then the right answer is to actually fix that security (eg pick users passwords for them, or completely replace with email auth), rather than thinking you're raising the bar by requiring attackers to come from a residential IP.
2FA should be a requirement on everything now. And if your site can't for some reason or you don't want to deal with it, then limit your site to external login providers only.
2FA, especially app based, has been proven to work really really well.
It does not.
There are myriad ways of extracting the TOTP seed from these apps... Or you just reverse engineer the setup/confirmation process and then you can generate/trigger your own tokens from your automation workflow.
2FA is a good security feature, but it does not help against web scraping. Credential stuffing and other third-party attacks? Yes, it _can_ help. But it does not always help. There's a phishing group that has seemingly specialised in getting people to click the green confirm button in their Duo app... ¯\_(ツ)_/¯
Check https://github.com/revalo/duo-bypass for a python script that can be used to automate Duo tokens... Has some code from me. There are similar scripts for all the other well known OTP Apps...
Having malware installed on every users phone is so many orders of magnitude harder than downloading the latest db dump and testing the email/password on every other site.
At the bare minimum, TFA stops most attacks. That's a whole lot better than the current situation.
There are different methods of 2FA, like scanning encrypted barcodes, that demonstrate intent.
It seems that the Duo core app is a variant of HOTP?
What's the name of the phishing group and any details on them?
There was a Defcon or Black Hat video where they would constantly send a push approval to the mobile which was not PIN protected and most people would click on it. Don't remember which OTP generator it was.
How do you propose to implement two-factor authentication, on something like the public facing homepage of an airline ticket price search website, where if you make people "sign in with google" or whatever, a sizeable proportion won't do it and will just go to the competition?
That's great till you're in a foreign country and your phone suddenly decides to die, leaving you stranded and unable to access bank accounts or prove your identity (happened to me).
2FA isn't limited to one device, or specific 2FA mobile apps. For example I use oathtool for most 2FA things; you just need to store the secret (often in the form of a QR code, but many services will also offer a text version, and if not you can decode the QR).
100% reliance on a phone which is easily lost, broken, stolen, etc. without backup is really bad IMO. My bank (Revolut) only had a mobile app, and no way to contact them outside of it (I tried...) I need to switch banks.
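To make the point about keeping the secret concrete: TOTP codes are just a function of the shared secret and the clock, so any implementation can generate the same codes as the phone app. A sketch with the pyotp library (the secret below is a placeholder):

```python
import pyotp

secret = "JBSWY3DPEHPK3PXP"   # the base32 string encoded in the enrollment QR code
totp = pyotp.TOTP(secret)
print(totp.now())             # same 6-digit code your phone app shows right now
```

Keep a copy of that base32 string in your password manager or printed in a drawer, and a dead phone stops being a lockout.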
Revolut now has a web app [1], which still tries to get you to log in via the mobile app, but this is not necessary. So long as you know your passcode and have alternative access to your email, you can log in and do most of the things you can do via the app. You do have to wait 10 seconds for the privilege though (there is a timer before you can confirm that you do not have access to the mobile app and fall back to email).
That sounds like a bad planning problem in which you shouldn't have left yourself vulnerable to loss of electronic services. Not a tech issue that justifies intrusive spyware.
I am always amazed when otherwise intelligent people assert without data that the marginal cost of serving web traffic to scrapers/bots is zero. It is kind of like people who say "Why don't they put more fuel in the rocket so it can get all the way into orbit with just one stage?"
It sounds great but it is a completely ignorant thing to say.
When I worked in e-commerce as a SRE, bots were doing two things:
- trying to disrupt business processes (eg: false referral listings, gift card scams, etc)
- trying to disrupt systems
I'm sure there are folks who use bots and scrapers for home automation, but these users generate marginal traffic in comparison. The real cost, aside from the damage when bots succeed at the above, is the bandwidth and hardware overhead. Bots are usually coded with retry mechanisms and ways to change connection criteria on subsequent retries.
There's also a fair amount of scraping for things like...
- Reselling aggregated data
- Competitive pricing and inventory data
- "Sniping", like with auctions, event tickets, or things like airline check-in processes that are first-come, first-serve
- Weird SEO stuff, where people scrape content in the hope that it isn't indexed yet and they can beat you to it.
- And, sort of in the space you mentioned, searching for existing vulnerabilities by various signatures, or trying to brute-force guess things like passwords.
> I'm sure there are folks who use bots and scrapers for home automation
I know this is off-topic, but I'm really curious. How does scraping the web help with home automation? Maybe downloading weather data could help, but crawling the web? I think I'm missing something about home automation.
I don't think anyone mentioned crawling (in my mind, "crawling" refers to a long-running process over several pages or sites and "scraping" could be a single fetch of a single URI)
I'm sure there are others; just off the top of my head:
* Electricity prices (as OP mentioned). Especially for people with solar panels or multiple options for heating.
* Watching for availability/prices of products or new homes one might be on the lookout for. Notifications at price drops/availability
* Public transport: next bus/trains from closest station, delays and interruptions
* IMDB/tvdb/etc for monitored shows and movies. Common with sonarr.
Some people have smart mirrors or any other kind of 'news display', so it may be useful for them to scrape the data they may think relevant (this may be weather data, stocks, or even the new Nintendo Switch availability at their nearby retailer).
For whatever reason, in years of running a SaaS this only happened maybe 3 times, and never with my online shop. I guess using Stripe and a few basic security settings mostly keeps them away.
Nothing of course. But follow that string a little further.
So you do some thing which once a day scrapes a site and pulls off some data that you use in your thing. Maybe you talk about it to friends, or you have this thing as one of your github repositories. Some of your friends download the repository and also start using your thing. They talk to people about how cool your thing is, or what it does and the nice convenience of automating something that you used to have to do manually.
There are 86,400 seconds in a 24 hour period, probably folks won't change your code at all at first, and as it diffuses into the community some webmaster starts seeing this weird spike of queries that happen once a day at some time. Different addresses but always the same kind of request.
It's not a problem when it's like a 10 or 20 qps burst, but when it starts getting up to 100-200, or worse 1000-2000, it causes the system to spin up additional instances that it isn't going to need after the burst and waste money. So the webmaster starts denying those requests with a 404.
Now sometimes your code works and sometimes it doesn't, but you don't know yet that the webmaster is fighting you. Maybe eventually you start randomly varying the request time, or the people who have copied your thing are in more varied time zones, so it starts getting spread over the day.
Now the webmaster is seeing bursts of traffic nearly every hour on the hour, and that is weird, so a more aggressive mitigation strategy is enacted.
People using your thing complain that it keeps breaking, so you look into it and realize that the site is trying to block your requests. Perhaps you don't understand why this is, or perhaps you do and don't care; either way you come up with some strategies that avoid the block (maybe you rotate the user-agent or something).
Now the query traffic is spiking again and the webmaster is getting complaints that this 'bot traffic' is resulting in useless AWS fees because it isn't part of the revenue traffic and it is forcing the service to add more resources for their customers.
Not all scrapers are malicious, but my experience is that it is rare that a non-malicious scraper application isn't talked about and shared (amongst people who have a similar itch that the thing is scratching), and because it's all open source it spreads around.
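For what it's worth, a "polite" once-a-day scraper can sidestep most of that story by jittering its schedule and using conditional requests so unchanged pages cost the server almost nothing; a rough sketch (URL and contact address are placeholders):

```python
import random
import time
import requests

URL = "https://example.com/rates"
HEADERS = {"User-Agent": "rates-watcher/1.0 (contact: you@example.com)"}
etag = None

while True:
    headers = dict(HEADERS)
    if etag:
        headers["If-None-Match"] = etag   # lets the server answer 304 if nothing changed
    resp = requests.get(URL, headers=headers, timeout=30)
    if resp.status_code == 200:
        etag = resp.headers.get("ETag")
        print(f"Fetched {len(resp.text)} bytes")   # parsing would go here
    time.sleep(24 * 3600 + random.uniform(-1800, 1800))   # ~daily, with +/- 30 min jitter
```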
> it is rare that a non-malicious scraper application isn't talked about and shared (amongst people who have a similar itch that the thing is scratching)
That's exactly my case though. I have a few scraper scripts that I've never published. So what if it's rare? Do I deserve to be treated like a botnet just because it's inconvenient for some webmaster or company to do otherwise? That's not fair at all.
What I really enjoy about this thread is all of the completely different perspectives. Lots of people doing anti-abuse research bemoaning that this stuff exists, and lots of people working around what is, from their perspective, ham-handed anti-abuse tech blocking legitimate, useful automation, trading tips on how to do it better. I guess we don't see much of the other sides of those: people doing actual black-hat work probably don't post about it on public forums, and most of the over-broad anti-abuse is probably a side effect of taking some anti-abuse tech and blindly applying it to the whole site just because that's simpler; often no tech people are really involved at all.
When these companies endeavor to stop abuse, they trample all over our freedoms. Suddenly we can't have non-browser user agents anymore. Suddenly we can't root our smartphones anymore. They want nothing to do with us unless it's 100% on their terms with us completely under their control.
There seem to be wildly different perspectives on "bad actors mean we can't have nice things" -- one group says that this is a fact of life, and the other says that this is an affront to freedom. A non-tech example is I've had guys on Tinder get legit angry at me for insisting that our first few dates have to be in public places where we drive separate -- "oh so you think I'm some creepy stalker?" And I am totally empathetic to their hurt, because I'm sure they know they're good, but I don't, and there's no way for me to tell in advance. Malicious actors don't exactly announce themselves and actively try to hide their intent.
The solution to the automation problem is to do what most companies do and have registered API integrations.
> A non-tech example is I've had guys on Tinder get legit angry at me for insisting that our first few dates have to be in public places where we drive separate
That's totally understandable though. I insist on meeting in public as well... It's the real world, safety is most important.
The thing with websites is they already allow me to make thousands of requests from my browser. What harm does it do if I make a bunch of requests from a script? I don't see it.
> The solution to the automation problem is to do what most companies do and have registered API integrations.
Yeah, those are pretty great. I always use those whenever possible. Many of the sites I use lack those though. Some have APIs so badly designed that scraping their web site actually results in fewer requests and less overhead.
Agreed, I meant the original in a "this is why we can't have nice things" sense.
I generally appreciate registered API integrations, but the trouble is, the sites that are most problematic for benign automation usually don't have enough demand or revenue to justify well-maintained APIs.
I tend to think the solution is more to somehow make the market prefer more decentralized solutions, preferably federated. Not having one big target for bad actors means much less effort applied to attacking any one of the targets.
Dating might be a bit off topic, but I can see both sides as well. Women have genuine risks, and are very justified in taking precautions. But for the majority of decent guys, it can be tiresome to be constantly treated like you're an evil violent stalker. Maybe it needs a similar solution - a movement to local connections where people can have reputations that you can trust.
Yeah, but you're not entitled to use their servers. If your use of their servers is something they don't like, their freedom is to blackhole your packets
If someone is signalling to you that they do not want your bot on their site, then maybe respect that? Trying to circumvent it is, besides being legally questionable, a serious pain in the ass for the site owner, and it makes websites more prone to attempt to block bots in general.
Also, in my experience, most websites that block your bot, block your bot because your bot is too aggressive, or because you are fetching some resource that is expensive that bots in general refuse to lay off. Bots with seconds between the requests rarely get blocked even by CDNs.
Google can access any site without being blocked. They dominate the search space and give little incentive for site owners to allow other bots. I'd say bypassing these measures is fair game while there is a monopoly in search space. We don't want a web that only Google can access.
By the way great work on Marginalia search engine, I love it.
I've honestly not had much problem at all crawling the web as an indie search engine operator. If you want to get past CloudFlare you can register your bot fingerprint with them.
A small number of sites have blocked my crawler, but that's almost always been my own fault; it happened in a few instances where the crawler was misbehaving and actually fetching too aggressively (or repeatedly). In every case, just sending an email to the site explaining what happened and humbly asking for a second chance has been enough to be allowed back in.
Most website owners don't seem to mind small search engines at all, what they don't want is scrapers that aggressively scrape their entire site 10 times a day, ignoring robots.txt, and being a general nuisance.
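The baseline courtesy is cheap to implement, too: honour robots.txt and keep a per-site delay between requests. A sketch using the standard library's robot parser (bot name and URLs are placeholders):

```python
import time
import urllib.robotparser
import requests

USER_AGENT = "my-small-search-bot/0.1 (+https://example.com/bot)"

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

delay = rp.crawl_delay(USER_AGENT) or 5   # seconds; default to something gentle

for url in ["https://example.com/", "https://example.com/about"]:
    if not rp.can_fetch(USER_AGENT, url):
        continue   # the site asked crawlers to stay out of this path; respect it
    requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    time.sleep(delay)
```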
There's also the alignment of interests between you and the site operators, even for a small search engine. Cooperation improves outcomes for both sides. Whereas the more aggressive scrapers almost certainly have interests that conflict with the site operators, regardless of the costs to serve the traffic itself.
>Legitimate uses of scraping include price comparison
"Legitimate uses" is what the site operator says it is, nothing more nothing less. There are no laws that says you can scrape a site and circumvent their protection against doing so.
There are laws against unauthorized computer access.
This is a scenario where you have a server explicitly saying "Stop! You are not permitted to access this computer!", and yet you persist in circumventing that by hiding your identity and accessing it anyway. Those are some murky waters.
For those that are interested in the specifics, Jamie Williams wrote a piece for the EFF[0] in the wake of hiQ vs Linkedin which dealt with this exact question.
It depends on who the server operator is. If it's your server, yeah, anyone I don't want to be there should go away. If it's your enemy's server, the argument that they're sending that page to the rest of the Internet turns out to be a decent one.
The server says nothing of the kind. The response that was previously positive is now broken, and it happens to be fixed if you access it from a different IP.
Maybe we need a status code that means ‘lay off all the requests made from this entire system’?
> Although the HTTP standard specifies "unauthorized", semantically this response means "unauthenticated". That is, the client must authenticate itself to get the requested response.
So it would seem that it actually doesn't positively imply that you're NOT authorized.
Which kind of makes sense; machines can't detect legality of things, just that certain procedural niceties haven't been observed.
> The client does not have access rights to the content; that is, it is unauthorized, so the server is refusing to give the requested resource.
Machines don't have any legal responsibility, bot-operators do. Which is why respecting these things is sort of important. At any rate, 40x does not mean "try again with a different user agent and another IP"
403 is per request, not requester. I get random 403s when just browsing some websites. Does that mean I should close the browser and not hit refresh for fear of breaking some wire fraud unauthorized access law?
If you go by the semantics of the 403 code, absolutely, that's exactly what it means.
In practice there's of course nuance: anyone will occasionally type in the wrong password on a log-in screen, maybe try again, and then realize it was the wrong log-in prompt. That's mostly fine.
That's different from deliberately trying to circumvent a measure like this. If you are doing the stuff in the link, you are absolutely crossing a line and you know it.
There's a large difference between "I got a 403 so I hit F5 once" and "I got a 403 so I used a residential proxy and spoofed my user-agent".
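In code, the well-behaved end of that spectrum looks something like this: treat 403 as a stop signal and honour 429/Retry-After, rather than rotating identities (a sketch, with made-up defaults):

```python
import time
import requests

def polite_get(url, max_attempts=3):
    for _ in range(max_attempts):
        resp = requests.get(url, timeout=30)
        if resp.status_code == 403:
            return None   # access refused; don't try to sneak around it
        if resp.status_code == 429:
            time.sleep(int(resp.headers.get("Retry-After", 60)))   # back off as asked
            continue
        return resp
    return None
```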
> On the contrary, there are no laws that say you can't scrape a site.
You are both wrong: copyright law both says you can't (in some cases for some uses) and that you can (under implicit license, fair use, and other rules) in others.
In that case, the data compilation itself would be protected, not the individual data points. If I used a scraper to copy everything verbatim, then yes, it would be a violation.
I found Ryanair one of the more friendly ones to scrape, albeit for my own personal project. When you query for flights, they make a GET request with a JSON object in response, complete with flight times and prices.
I haven't tried to scrape Ryanair, you could be right that it's trivial. It's the legal side that has a long and interesting history. Personally I wouldn't scrape them unless it was when working for a company that can afford lawyers.
> If someone is signalling to you that they do not want your bot on their site, then maybe respect that?
Maybe respect user freedom? If I can access the data using my browser, why can't I access it using my script?
Why is Google the only one who can do it? Must they have yet another monopoly?
> Bots with seconds between the requests rarely get blocked even by CDNs.
I've had scripts that made 1 request per day get blocked for no reason. Not to mention the endless cloudflare javascript bullshit they made me support for it to even work.
> Why is Google the only one who can do it? Must they have yet another monopoly?
I run a search engine and do my own crawling, and this does not correspond to my view of reality at all.
I have had almost no problems with getting blocked. If I have gotten blocked, it's usually been my own fault and I've been able to get unblocked by sending them an email explaining I run a search engine and asking for forgiveness because my bot wasn't behaving well.
The bots that do get blocked are in most cases bots that misbehave: they ignore robots.txt, fetch the same resources repeatedly, or crawl at insanely aggressive rates.
There are a few rare exceptions, but the whole "why does Google get a free pass when I don't?"-angle just doesn't hold water at all.
> I've had scripts that made 1 request per day get blocked for no reason. Not to mention the endless cloudflare javascript bullshit they made me support for it to even work.
Google doesn't repeat requests every day. I don't repeat requests every day. That's a weird thing to do, and it's well within a site owner's prerogative to block that nonsense.
> Google doesn't repeat requests every day. I don't repeat requests every day.
I scrape one site whose content changes every second. How is it "nonsense" to make one request every 24 hours? I make hundreds, thousands when I browse their site normally using my browser.
We've got people in this very thread talking about bots making hundreds of requests per second. How is one request every 24 hours harming anyone? People told me to make one request per hour to avoid hammering their servers, I decided to wait a day instead. It boggles my mind that this generous interval could possibly be considered abuse. How long should the interval be then? A month? A year? Infinitely long so the scraper never makes requests?
I built a search engine in school. I was lazy and stupid with the scraper and ended up writing a bug that caused it to loop on certain sites. That led to the entire Duke law school site being DoS'd during their class sign-up period. Sorry for ruining scraping for everyone, but this is why websites don't want people scraping them.
But you have no freedom when contacting my servers. I can send you 403's just because I don't like your face. There is zero entitlement that gives you any access to my servers in any way I don't permit. If I say no automated access, then on what grounds do you have to do it anyway?
> Why is Google the only one who can do it?
Because site operators explicitly allow them automated access. If you want the same treatment you have to ask for it.
> I can send you 403's just because I don't like your face.
Sure. At least then you're being honest. If you hate me, it doesn't matter what user agent I use to access your site. Browsers, scripts, they are all me.
Why do you think it’s dishonest to send 403s to bots but not browsers? Method of access matters — you the human might have access to your safety deposit box but the bank is still allowed to make rules about your access — like you have to come during business hours and you can’t send someone on your behalf.
I totally can though. If I sign a document saying another person can do such and such on my behalf, that person can totally do that. Yes, even at the bank. No different from a user agent, really.
> Why is Google the only one who can do it? Must they have yet another monopoly?
Google can do it because most website operators want them to index their site. Plus, it is trivial to tell google to stop. That goes for all search engines.
That doesn't change anything. The site owner decides what you can and cannot do. If a badly made API meant you could do anything you wanted, then anything could be done to the sites running it. That is not how this works. Outside normal usage you need permission.
I really don't believe the statement "The site owner decides what you can and cannot do." What is the basis for this? It does not seem to apply to anything in the real world.
In most cases you have very limited ability to decide what other people cannot do, and other people have an almost unlimited choice of things they can do. I have never heard anything as broad as what you said. What you said is like a person standing on the street in a T-shirt that says "do not look at me more than twice" and claiming it is legally binding on the whole world.
> I really don't believe the statement "The site owner decides what you can and cannot do." What is the basis for this? It does not seem to apply to anything in the real world.
It kind of does though; if I own a store and say that only people with hats can enter, then I'm free to do so. Silly? Yes. Legal? Also yes.
There are some circumstances where it's not legal, mostly centred around discrimination. Details on this differ per jurisdiction, but generally speaking you have a right to refuse customers.
To me it seems that sending an HTTP request is somewhere in between looking (legal) and entering (illegal if not permitted).
More important, though, is that the web as a system should make positive interactions easy and negative ones difficult. We have already found a set of constraints that achieves this for interactions on public city streets, but it's not obvious that the same rules (which we have internalized) have the same effect in a different medium of communication.
If I can access the information in Chrome, I should be able to access the information with my own client, and the site operator shouldn't have any say in that.
> Also, in my experience, most websites that block your bot, block your bot because your bot is too aggressive, or because you are fetching some resource that is expensive that bots in general refuse to lay off. Bots with seconds between the requests rarely get blocked even by CDNs.
Tell that to any Cloudflare site on security level High or "I'm Under Attack!" year round.
A plain DoS (as opposed to a DDoS) is usually not a botnet but a single attacker, or a few, requesting a resource-heavy page over and over, or using a technique like Slowloris. If you are using WordPress I am sure you will find some signs in your logs.
The solution, other than with DDoS, is to make the app better: caching, access control, rate limiting, etc.
I'm a bot, you're a bot, we are all bots, so what does this even mean.
If someone is trying to discriminate between one set of visitors that they welcome and another that they deny service to, by definition they are the ones creating an adversarial environment. They are going to have to work to get the terms they want. Asking the players on the opposite side of the table to comply with their imagined rules out of altruism is a non-starter.
Technically it should be illegal to scrape websites without the consent of the server owner because it would be a violation of their property rights. That being said, I think that in reality it would be pretty difficult to get courts to agree with you and then to enforce it.
I'm a lead engineer on the search team of a publicly traded company whose bread and butter is this domain. I was curious about this list; candidly, it misses the mark. The tech mentioned in this blog is what you might get if you hired a competent consultant to build out a service without having domain knowledge. In my experience, what's being used on the bleeding edge is two steps ahead of this.
I have a considerable amount of experience in the industry.
Some of these so-called "advanced" techniques:
* We use our own mobile emulation software (similar to BlueStacks). Turns out, mobile helps with a lot of things (see below).
* We use mobile IPs only. Mobile LTE data users are behind CGNAT for IPv4. You can't block one IP without possibly blocking hundreds of innocent users behind the same exit point.
* All you need is a new user agent and browser fingerprint; combined with emulation + mobile IPs, there's really no easy way for companies to block this.
* With the advent and ease of virtualization, we avoid using any headless browsers. Seriously, if you can, never use headless. This should be close to rule number one for anyone looking to operate any kind of scrapers. All of our scrapers are run in isolated virtual instances with full mobile browsers.
* We can easily reset our device identifier, device carrier, simulated SIM information, and, especially important, the Google advertising ID that is set per device; the list goes on. The key here is #1, our mobile emulation software.
* Our automation scripts are a combination of human-recorded actions which we then perfected and can run in loops (for some of our data).
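The emulation stack described above isn't something you can reproduce in a few lines, but the proxy and user-agent side of it can be sketched. Here is a rough sketch in Python, assuming a hypothetical rotating mobile proxy; the proxy URL and user-agent strings are placeholders, not the commenter's actual setup:

    # Sketch: route requests through a (hypothetical) rotating mobile proxy
    # and cycle realistic mobile user agents per session.
    import random
    import requests

    MOBILE_PROXY = "http://user:pass@mobile-proxy.example:8000"   # placeholder
    MOBILE_UAS = [
        "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Mobile Safari/537.36",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile Safari/604.1",
    ]

    def fetch(url: str) -> requests.Response:
        session = requests.Session()                  # fresh session per fetch
        session.headers["User-Agent"] = random.choice(MOBILE_UAS)
        session.proxies = {"http": MOBILE_PROXY, "https": MOBILE_PROXY}
        return session.get(url, timeout=30)

    print(fetch("https://example.com/").status_code)

This obviously omits the fingerprinting and device-identity pieces; it only shows where the CGNAT exit IP and the mobile user agent enter the picture.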
This list is pretty interesting. If you don't mind me asking, what do you work on that requires such sophisticated stuff?
Also, does this work only for browsers or also for mobile apps? I have always assumed that it is theoretically possible to get data from browsers (the most extreme resort being saving the browser page, or screenshot + computer vision), but that it can be impossible to get data from apps (especially iOS). Are my assumptions correct?
Also can you explain mobile IPs more? If they are such a big vulnerability, why is there no potential solution to them?
Mobile IPs will be a problem until the entire Internet is IPv6. The issue is that there are not enough IPv4 addresses for everyone to get their own IP every time their phone connects to the Internet. So the mobile networks use one IP for many handsets. Block the IP, block dozens of different (innocent) people.
Once we're all on IPv6 we can go back to blocking IPs. But then IPv6 creates its own problems.
Doesn't the same idea apply for normal IPs (non-mobile IPs)? Our computers also don't get their separate IPv4 address. We use NATs. So how can someone block our computer specifically without blocking dozens of innocent people using the same public IPv4 address?
Basically what's different about mobile and laptop IPs?
CGNAT is done at a ridiculously large scale in mobile networks.
Your home or local coffee shop might have 10-20 users max behind a single IP. A mobile network, on the other hand, might put most of a small-to-medium city behind 3 or 4 IPs.
Do you have any factual corrections? Your post reminds me of those "I'm getting a kick out of these replies" copypasta--declaring someone wrong and claiming authoritative knowledge, but without actually correcting any of the errors of fact.
Not the OP but when I was running Operations at Blekko (a search engine) I spent part of my time dealing with scrapers.
When treated like a puzzle it can be really interesting. So I thought I'd share a few tidbits.
1) We did a simple 'speed' test: how many queries per second were coming from an IP, with an auto-ban when the limit was exceeded. We started at 100 qps and watched as the traffic moved down to 99.5 qps. We pushed it to 10 qps and watched the traffic follow it down. Even at 3 qps you would get traffic at a bit over 2 qps trying to limbo in under the limit.
2) At that time, lots of people who hijacked browsers with toolbars sold scraping as a service to third parties. Their toolbar would check in to see if it should do a query, and it would launch the query and return the results without the user even knowing. One company, 80legs, was pretty up front about their "service"; SEO types would use it to scrape Google results to see how their SEO campaigns were doing.
3) The majority of the traffic had criminal intent, looking for metadata on web pages to indicate they were running an unpatched version of some store software or had SQL injection bugs. These would often come from PCs that had been compromised for other purposes, or "zombie" PCs. We could rapidly map out these networks when we got 100 queries from 100 different IPs looking for "joomla version x.y",p=1 through "joomla version x.y",p=100. We briefly played around with sending them official-looking SERPs where all the links went to fbi.gov through an obfuscator.
One of the more effective strategies was to field a "black hole" server: basically an HTTP server that answered like you had gotten hold of it, but then never sent any data. With some simple kernel mods these TCP connections were silently removed on our end so they took no resources, and the client would wait basically forever. We ack'd all keep-alive packets with "Yup, we're here." so they just kept waiting and waiting.
It really was a never-ending game. We mass-banned an entire Ukrainian ISP because out of billions of queries not a single one was legitimate.
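The kernel-mod version above isn't reproducible in a few lines, but the tarpit idea itself is simple. Here is a toy userspace sketch in Python that accepts connections and never replies, so each client waits until its own timeout (a real deployment would drop the connection state in the kernel rather than hold sockets open like this):

    # Toy "black hole" tarpit: accept TCP connections and never send a byte.
    import socket

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 8080))
    srv.listen(128)

    held = []                      # keep references so the sockets stay open
    while True:
        conn, addr = srv.accept()
        print("tarpitting", addr)
        held.append(conn)          # never read, never write, never close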
> One of the more effective strategies was to field a "black hole" server: basically an HTTP server that answered like you had gotten hold of it, but then never sent any data. With some simple kernel mods these TCP connections were silently removed on our end so they took no resources, and the client would wait basically forever. We ack'd all keep-alive packets with "Yup, we're here." so they just kept waiting and waiting.
Mailinator would do a similar thing with their custom email server hardware. Since they didn't really use sockets in the traditional sense, they were happy to give slooow replies and never disconnect "bad" connections.
Well, if you missed it: back in the day, places like Download.com or SourceForge would have a little pre-checked link to "add <name> search toolbar to my browser!" which, after you downloaded your thing, would leave you with a browser that always redirected your search requests to whatever company paid to send you there.
Lots of people did it, even Blekko (although we stopped after we figured out it was just a scam), and the way it worked is toolbar company X would approach your internet site and say "we can send you a lot of traffic, just use this toolbar, we'll even pay you every time one gets installed."
Anyway, the toolbar would hook itself into the address bar and the search-selection hook in the browser and redirect every search to wherever it was told to. The nefarious part is that these toolbars were often shipped to the company as binaries, not source, so you didn't really know everything they were doing. Once we figured out they were scumbags we also found out that they did some really scumbaggish things.
We stopped using them but lots of people did and finally the browser makers re-wrote the browsers to make what they did either impossible or easy for the user to revert and those guys rolled up their shops and went on to become some other type of scam.
Abuse is the kind of problem area where anyone seriously working on either side will be unlikely to go into the details. It's the same whether it's blocking scraping, stopping spam, preventing account takeovers, or detecting payment fraud.
In this specific case, the people wanting to detect bots want to avoid having their signals burned, the scrapers don't want the defenders to know which signals they're able to cloak since it will spur new signals development. So what gets disclosed publicly is just the really simple stuff.
It's kind of sad. There's a huge pipeline for getting people up to speed on security engineering, since there's a lot of incentive for everyone in the ecosystem to share information as publicly as possible except for relatively short responsible disclosure windows.
In contrast, the only way to learn abuse engineering is to happen to work in an organization with an abuse problem (and live with the frustration of an abnormally long ramp up period), or to go black hat. And likewise it's quite hard for the good guys to actually learn from each other, since they're spread across so many companies and it's thus hard for them to exchange information on what works and what doesn't.
Sorry about that, we're unable to discuss our projects with adjacent teams within the company. What you're saying is a valid frustration, the motivation behind the original comment was to put a thermometer on the repo.
Adjacent teams? It makes sense not to discuss it with totally unrelated teams, but a large Internet company wants to have its cake and eat it too: scrape all of the world's websites while simultaneously denying others from scraping its results. Limiting communication between adjacent teams (i.e. the scraper team and the bot-blocker team) really seems like it would be a hindrance rather than a help.
There’s one technique that can be very useful in some circumstances that isn’t mentioned. Put simply, some sites try to block all bots except for those from the major search engines. They don’t want their content scraped, but they want the traffic that comes from search. In those cases, it’s often possible to scrape the search engines instead using specialized queries designed to get the content you want into the blurb for each search result.
This kind of indirect scraping can be useful for getting almost all the information you want from sites like LinkedIn that do aggressive scraping detection.
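As a rough sketch of the idea in Python (the search endpoint and the snippet selector are placeholders that would need adapting to whichever engine you query):

    # Sketch of indirect scraping: ask a search engine a site:-scoped query
    # and read the result blurbs instead of the target site itself.
    import requests
    from bs4 import BeautifulSoup

    def snippet_search(query: str) -> list[str]:
        resp = requests.get(
            "https://search.example/results",        # placeholder endpoint
            params={"q": query},
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=30,
        )
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # ".result-snippet" is a placeholder selector for the blurb element.
        return [el.get_text(" ", strip=True)
                for el in soup.select(".result-snippet")]

    for blurb in snippet_search('site:linkedin.com/in "Jane Doe" "Acme Corp"'):
        print(blurb)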
I ended up having to put my search engine behind a CDN and CAPTCHA every new IP because I got something like 30,000 search requests like this _per_hour_ from a botnet. They seem to have backed off now but they were really quite persistent for a while.
Almost all sites use the "host" command to see if the IP belongs to Googlebot (this is the way recommended by Google). And you can easily find many IPs which return googlebot when you "host" them. So you can use any of these IP addresses to spoof Googlebot.
To find more IPs, make your own website, and wait until GoogleBot eventually shows up :)
It's very easy to install Chrome on a Linux box and launch it with a whitelisted extension. You can run Xorg using the dummy driver and get a full Chrome instance (i.e. not headless). You can even enable the DevTools API programmatically. I don't see how this would be detectable, and it's probably a lot safer than downloading a random browser package from an unknown developer.
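A minimal sketch of that setup in Python, using Xvfb as the virtual display (the Xorg dummy driver works the same way); the display number, port, and paths are assumptions:

    # Run a full (non-headless) Chrome under a virtual X display and expose
    # the DevTools protocol locally. Paths, display number and port are
    # assumptions, not a specific recommended configuration.
    import os
    import subprocess
    import time

    DISPLAY = ":99"
    subprocess.Popen(["Xvfb", DISPLAY, "-screen", "0", "1920x1080x24"])
    time.sleep(1)                                  # let the display come up

    subprocess.Popen(
        [
            "google-chrome",
            "--remote-debugging-port=9222",        # DevTools API endpoint
            "--user-data-dir=/tmp/chrome-profile", # throwaway profile
            "--load-extension=/path/to/extension", # optional whitelisted extension
            "about:blank",
        ],
        env=dict(os.environ, DISPLAY=DISPLAY),
    )
    # Any DevTools client (CDP, Selenium, Playwright over CDP, ...) can now
    # attach to http://127.0.0.1:9222 and drive a real, non-headless browser.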
Hmm maybe I will if I have time. We've been using this technique for user-initiated scraping. The only issue we've run in to is we get rate-limited by IP sometimes. Changing the IP has solved the problem each time.
If I am correct in assuming the parent is talking about Puppeteer, there is a plugin[1] that claims to evade most of the methods used to detect headless browsers. I have used it recently for just that purpose, and I can say that it worked with minimal setup and configuration for my use case, but depending on the detection mechanisms you're evading, YMMV.
The creator of that plugin does mention it is very much a cat and mouse game, just like most of the “scraping industry”
Google "residential proxies for sale" if you want to see the weird shady grey market for proxies when you need your traffic to come from things like cablemodem operator ASNs' DHCP pools
Wonder what fraction of that traffic is from p0wned IoT refrigerators, smoke detectors or WiFi enabled light bulbs… probably more than anybody cares to admit…
Some amount, probably, but some of these "residential IP" providers just buy IP blocks from small residential or business ISPs. (Or even less shadily, buy IP transit from them and get assigned a block of IPs that the databases think are normal residential/business users.)
The majority of them want to do something completely benign like see a BBC show in the US, or watch an American football show in the UK, and the one defining feature of capitalism is its many contradictory faces.
One corporation wants to arbitrarily limit its customer base, and the other corporation wants to arbitrarily limit what data its customer base can see. In between the two is a space big enough to drive a Mack truck or a lorry through, depending on where you're from...
One might justify this until the one corporation merges with the other and then you have a situation where the same corporation wants to do two different things to the same pool of users.
When I say tricked, I mean they have no idea that unknown 3rd parties' grey market traffic is being run through their personal home internet connection. They believe they are just using a VPN to watch BBC.
Yes, but my point is they've not been duped as much by the "free VPN" provider as much as they've been duped by the people who created the market for the "free VPN" provider.
I found out while sorting through business contributors to a non-profit once that a pretty big market for mobile relays is the "free VPN" offered on the app stores to high school kids looking to circumvent the outgoing blocks on the school's wifi.
In that case the school's administration could easily purge the "free VPN" of local users by removing the wifi restrictions. Instead, they serve more traffic to more nefarious places to maintain an illusion of control not for the kids in the school, but for themselves.
All of this is basically Dr. Strangelove but with spyware rather than nuclear bombs.
Having scraped millions of webpages, I find dynamic CSS selectors a bigger time sink than most anti-scraping tech encountered so far (if your goal is to extract structured data).
Can your scraper be used to scrape images? I need to scrape some books from a paywalled site and they are presented one page at a time. The JS code is too complex for me to bother figuring out how it creates the unique tokens it applies to every image it displays in order to prevent a very simple scrape.
2 of my social media accounts have fallen victim to bot detection, despite not using scripts. There are other websites for which I have used scripts, and sometimes ran into CAPTCHA restrictions, but was able to adjust the rate to stay within limits.
CouchSurfing blocked me after I manually searched for the number of active hosts in each country (191 searches), and posted the results on Facebook. Basically I questioned their claim that they have 15 million users - although that may be their total number of registered accounts, the real number of users is about 350k. They didn't like that I said that (on Facebook) so they banned my CouchSurfing account. They refused to give a reason, but it was a month after gathering the data, so I know that it was retaliation for publication.
LinkedIn blocked me 10 days ago, and I'm still trying to appeal to get my account back.
A colleague was leaving, and his manager asked me to ask people around the company to sign his leaving card. Rather than go to 197 people directly, I intentionally wanted to target those who could also help with the software language translation project (my actual work). So I read the list of names, cut it down to 70 "international" people, and started searching for their names on Google. Then I clicked on the first result, usually LinkedIn or Facebook.
The data was useful, and I was able to find willing volunteers for Malay, Russian, and Brazilian Portuguese!
After finding the languages from 55 colleagues over 2 hours, LinkedIn asked for an identity verification: upload a photo of my passport. No problem, I uploaded it. I also sent them a full explanation of what I was doing, why, how it was useful, and a proof of my Google search history.
But rather than reactivate my account, LinkedIn have permanently banned me, and will not explain why.
"We appreciate the time and effort behind your response to us. However, LinkedIn has reviewed your request to appeal the restriction placed on your account and will be maintaining our original decision. This means that access to the account will remain restricted.
We are not at liberty to share any details around investigations, or interpret the terms of service for you."
So when the CAPTCHA says "Are you a robot?", I'm really not sure. Like Pinocchio, "I'm a real boy!"
CouchSurfing is just shit, full stop. I love the concept and hosted many people, but the way the company has been run over the last few years is beyond atrocious. It's like AirBnB sent over some people to intentionally run it into the ground or something.
LinkedIn has to deal with a lot of scummy recruiters and scammers; I don't blame them for being very strict.
I knew there was a reason why I used client certificates and alternate ports.
Why is it so difficult to just respect robots.txt? Maybe there's an idea for a browser plugin that determines whether a site lets you easily scrape its data; if not, the website gets blocked and its traffic drops. I know this is a naive idea...
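For what it's worth, honouring robots.txt from a script takes only a few lines with Python's standard library; a minimal sketch:

    # Check robots.txt before fetching, using only the standard library.
    from urllib.robotparser import RobotFileParser

    UA = "my-hobby-scraper"                    # whatever identifies your bot

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/rates"
    if rp.can_fetch(UA, url):
        delay = rp.crawl_delay(UA) or 1        # fall back to a polite default
        print(f"allowed; wait {delay}s between requests")
    else:
        print("disallowed by robots.txt; don't fetch it")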
Never underestimate the scraping technique of last resort: paying people on Mechanical Turk or equivalent to browse to the site and get the data you want
Are there any court cases that provide precedent regarding the legality of web scraping?
I'm currently looking for ways to get real estate listings in a particular area, and apparently the only real solution is to scrape the few big online listing sites.
I think it only applies to systems that aren't available to the general public, which in this case was the GCIC. Anything that is available to the public, even if it requires some sort of registration, would I think be legal to scrape. YMMV though.
I was involved in a scraping-related case, though in my situation we were scraping public domain data/facts/public domain media. Email me if you'd like additional info. :)
More related to the submission content -- at the time we used rotating proxies, both in-house & external (ProxyMesh - still exists & only good things to say about it); they allowed us to "pin" multiple requests to an IP or to fetch a new IP, etc...
It’s most likely for tracking clicks. Better to just search for the company names instead of clicking on the links in case they lead to unexpected places.
It always amazes me how people believe they have a right to retrieve data from a website. The HTTP protocol calls it a request for a reason: you are asking for data. The server is allowed to say no, for any reason it likes, even a reason you don't agree with.
This whole field of scraping and anti-bot technology is an arms race: one side gets better at something, the other side gets better at countering it. An arms race benefits no one but the arms dealers.
If we translate this behavior into the real world, it ends up looking like https://xkcd.com/1499
Because often that data is only available through scraping.
Nobody wants to scrape, it's messy and fickle and a general pain in the backside. But sometimes the data you need exists only in that form.
If you run a website and you have a problem with scrapers, then make all that data available through an API and say what acceptable rate limits are. If cost is an issue, then charge a proportionate fee, my time writing a scraper is worth much more than paying a few dollars for an API.
If you just say "No" to everything then you lose all control over the process and the only outcome will be such an arms race.
God. This. The number of times I've spent 2 days of my very expensive time coding a scraper to get data I'll use once, when I would have paid a few dollars just to download it in a text file.
For the row "Long-lived sessions after sign-in", the author mentions that this solution is for social media automation, i.e. you build a tool to automate social media accounts to manage ads more efficiently.
I am curious what the author means by automating social media accounts to manage ads more efficiently.
Trying to stop credential stuffing by blocking bots will not work, and can often severely impact people depending on assistive technologies.
I think a better solution is to implement 2FA/MFA (even weak 2FA/MFA like SMS or email will block the mass attacks; for people worried about targeted attacks, let them use a hardware token or a software token app) or SSO (e.g. sign in with Google/Microsoft/Facebook/LinkedIn/Twitter, who can generally do a better job securing accounts than some random website). SSO is also a lot less hassle in the long term than 2FA/MFA for most users (major caveat: public-use computers, but that's a tough problem to solve securely no matter what).
Better account security is, well, better, regardless of the bot/credential stuffing/etc problem.
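The SMS and email variants depend on a provider, but a software-token (TOTP) second factor is small to add. A sketch using the pyotp library, with the account names as placeholders:

    # Software-token (TOTP) second factor with pyotp.
    import pyotp

    # Enrollment: generate and persist a per-user secret, then show the user a
    # provisioning URI (usually rendered as a QR code for their authenticator app).
    secret = pyotp.random_base32()
    uri = pyotp.TOTP(secret).provisioning_uri(name="user@example.com",
                                              issuer_name="ExampleSite")
    print(uri)

    # Login: after the password checks out, verify the submitted 6-digit code.
    def second_factor_ok(stored_secret: str, submitted_code: str) -> bool:
        # valid_window=1 tolerates one 30-second step of clock drift
        return pyotp.TOTP(stored_secret).verify(submitted_code, valid_window=1)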
A lot of web scraping is annoying because often there's *an explicit API built for the scraper's needs*. Instead of looking for an API, many think to use web scraping first. This in turn puts load and complexity on the user-facing web app, which must now tell scrapers from real users.
Using the API almost always has more "strings attached". Like you have to register and get an API token or something. Or even pay. If you want people to use your API, don't make it less convenient than scraping the page.
But if there's an API, then the overall load is the same, no?
Or to put it another way: naively, having api.example.com and realpeople.example.com separated out into separate sandboxes seems reasonable, but due to the aforementioned problem, it's not. But then it also turns out to be the wrong axis for this anyway, and you need your monitoring to work for you.
No, the load isn't the same because the web page might be a multi-megabyte monster piece of badly-coded HTML that returns only say 10 out of 1,000,000 results and needs to be paged through, where the API might return all the million results in a nice JSON chunk.
I am running a no-code web automation and data extraction tool called https://automatio.co. In my experience, most of the time when using quality residential proxies you will be fine. But that comes at a cost, since they are way more expensive than data center proxies.
But for some websites, even residential IPs don't let you pass.
I noticed there is something like a premium reCAPTCHA service, which just works differently than the standard one and doesn't let you pass. It's mostly shown with a Cloudflare anti-bot page.
By the way, is it possible to stop Googlebot from scraping without maintaining a list of IP addresses? Google doesn't publish these, and it's not good to run reverse DNS as it slows down legitimate clients.
I know you can put a meta tag, but bot still has to make a request to read it.
I would like to completely cut off Google from scraping.
You can buy databases of who owns which IP blocks.
If you really care but don't want to spend the money, just block the subnet each time you see a Googlebot request. "whois w.x.y.z" returns an entire CIDR, and it seems unlikely to me that Google is scraping from a bunch of disconnected /24s.
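You don't actually need an IP list: Google's documented verification is a reverse DNS lookup followed by a forward lookup to confirm the hostname maps back to the same IP. A minimal sketch in Python; since this is slow, you'd typically run it offline against access logs or cache the verdict per IP rather than doing it on live requests:

    # Verify a claimed Googlebot IP: reverse DNS must end in googlebot.com or
    # google.com, and the forward lookup of that hostname must include the IP.
    import socket

    def is_googlebot(ip: str) -> bool:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)          # reverse (PTR) lookup
        except socket.herror:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            forward = {ai[4][0] for ai in socket.getaddrinfo(hostname, None)}
        except socket.gaierror:
            return False
        return ip in forward                                    # forward confirmation

    print(is_googlebot("66.249.66.1"))   # e.g. an IP from your logs claiming to be Googlebot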
I've tried skirting DataDome, but generally you can just get around it by rotating IPs. Apparently there is a way to de-obfuscate their apps (apps that use DataDome services) to retrieve DataDome cookies, but I haven't been bothered to check it out yet.
In a previous venture my team successfully circumvented bot detection for a price comparison project simply by using apify.com. Wasn't that expensive, either. We were drilling sites with 500k+ hits per day for months.