How to Crawl the Web Politely with Scrapy (scrapinghub.com)
139 points by stummjr on Aug 25, 2016 | 41 comments


In the past we built and operated Greece’s largest search engine (Trinity), and we would crawl/refresh all Greek pages fairly regularly.

If memory serves, the frequency was computed for clusters of pages from the same site, and it depended on how often they changed (news-site front pages were different on practically every visit, whereas users’ homepages rarely were) and on how resilient the sites were to aggressive indexing: if they’d fail or time out, or if downloading a page took longer than expected based on site-wide aggregated metrics, we’d reduce the frequency, and so on.
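A toy sketch of that adaptive refresh idea (all names and constants here are hypothetical, not Trinity’s actual code):

```python
# Illustrative adaptive refresh policy: revisit changing pages more often,
# back off when the site struggles. Bounds and multipliers are made up.
def adjust_interval(interval: float, page_changed: bool, fetch_failed: bool,
                    min_interval: float = 3600,
                    max_interval: float = 86400 * 30) -> float:
    """Return the next refresh interval in seconds."""
    if fetch_failed:
        interval *= 2      # the site is struggling -- slow down
    elif page_changed:
        interval *= 0.5    # content churns -- revisit sooner
    else:
        interval *= 1.5    # stable page -- revisit less often
    return min(max(interval, min_interval), max_interval)
```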

The crawlers all drained multiple queues, but URLs from the same site always ended up on the same queue (assigned via consistent hashing of the hostname), so a single crawler process was responsible for throttling requests and respecting robots.txt rules for any given site, with no need for cross-crawler state synchronisation.
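That assignment scheme can be sketched like this (a plain hash-mod stands in for the consistent hashing the parent describes, and all names are hypothetical):

```python
import hashlib
from urllib.parse import urlparse

NUM_QUEUES = 8  # e.g. one queue per crawler process (illustrative)

def queue_for_url(url: str) -> int:
    """Map a URL to a queue by hashing its hostname, so every URL from
    one site lands on the same crawler and throttling stays local."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % NUM_QUEUES
```

Because the mapping depends only on the hostname, no two crawlers ever hit the same site concurrently, which is what makes per-site throttling possible without shared state.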

In practice this worked quite well. Also, this was before Google and its PageRank and before social networks (we’d probably also have factored page popularity into the frequency computation, via PageRank-like metrics and social ‘signals’, among other variables).


In the current web, sites like Amazon are so large that you'll need many crawlers. On the plus side, it appears that almost all large sites don't have rate limits.


Crawl-delay is not part of the standard robots.txt protocol, and according to Wikipedia, bots interpret the value differently. That may be why many websites don’t even bother defining rate limits in robots.txt.


I was referring to an actual rate limit, not crawl-delay. For example, YouTube is pretty strict about rate limits:

http://www.bing.com/search?q=%22We+have+been+receiving+a+lar...

I agree that crawl-delay is rare, and often it's set too long so that it's impossible to fully crawl a site -- as if the webmaster set it up 10 years ago and never updated it as their site got faster and bigger.


Hi Mark, out of curiosity, which search engine is that?


It was called Trinity -- it was initially developed for Pathfinder.gr, and soon thereafter was the search provider for in.gr, and was also accessible at trinity.gr for some time.


In my experience, the best way to crawl politely is to never use an asynchronous crawler. The vast majority of small to medium sites out there have absolutely no protection against an aggressive crawler. If you make 50 to 100 requests per second, chances are you’re DDoSing the shit out of most sites.

As for robots.txt, the problem is that most sites don’t even have one, especially e-commerce sites. They also don’t have a sitemap.xml, for when you don’t want to hit every URL just to discover the structure of the site. Being polite often takes considerable effort.


Scrapy is asynchronous, but it provides many settings you can use to avoid DDoSing a website, such as limiting the number of simultaneous requests per domain or per IP address.
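For example, a minimal settings.py fragment (the setting names are real Scrapy settings; the values are just illustrative):

```python
# settings.py -- politeness-related Scrapy settings
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # cap simultaneous requests per domain
CONCURRENT_REQUESTS_PER_IP = 2      # if non-zero, a per-IP cap is used instead
DOWNLOAD_DELAY = 1.0                # seconds between requests to the same site
ROBOTSTXT_OBEY = True               # respect robots.txt rules
```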

And yes, crawling politely requires a bit of effort from both ends: the crawler and the website.


I agree, at the end of the day being polite or not is on the developer and not the tool itself...


Search engine crawlers use adaptive politeness: start being very polite, and ramp up parallel fetches if the site responds quickly and has a lot of pages.


That's kind of what Scrapy's AutoThrottle extension does.
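For reference, enabling it looks like this (the setting names are Scrapy's; the values are illustrative):

```python
# settings.py -- Scrapy's AutoThrottle extension: start politely, then
# adapt the delay to the site's observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60.0          # back off hard when the site is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote
```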


You can rate-limit asynchronous crawlers too.
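A minimal asyncio sketch of what that can look like (a hypothetical per-host limiter, not any particular framework’s implementation):

```python
import asyncio
import time

class PerHostLimiter:
    """Allow at most one request per host every `delay` seconds."""

    def __init__(self, delay: float):
        self.delay = delay
        self.locks = {}  # host -> asyncio.Lock
        self.last = {}   # host -> monotonic time of last request

    async def wait(self, host: str) -> None:
        # The per-host lock serializes waiters for the same host.
        lock = self.locks.setdefault(host, asyncio.Lock())
        async with lock:
            elapsed = time.monotonic() - self.last.get(host, 0.0)
            if elapsed < self.delay:
                await asyncio.sleep(self.delay - elapsed)
            self.last[host] = time.monotonic()
```

Each fetch coroutine calls `await limiter.wait(host)` before issuing its request; concurrency across hosts stays fully asynchronous while any single host is throttled.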


See also Tuesday's HN discussion on the ethics of data scraping (https://news.ycombinator.com/item?id=12345952), in which Hacker News is completely split on whether data scraping is ethical even if the Terms of Service explicitly forbids it.


I wouldn't say completely split. I think most of HN considers the current state of scraping law to be complete and utter hogwash. Many of the expected consumer rights don't apply online because of the way the law considers normal communication with a server on the internet an excursion onto private property.

We need a modern law addressing these issues instead of the pre-Internet CFAA. Malicious actors should still be punished, and it may be reasonable to still allow a provision for the civil liability (not criminal) of large-scale accidental DoS from poorly-implemented scrapers, but users should be free to choose their own browsing devices -- even if those browsing devices are highly optimized to extract only the specific pieces of data that the user cares about.

This law should also clarify that normal communication over HTTP cannot be punished unless the plaintiff can demonstrate real and serious interruption to their services, that local RAM copies that are never externally transmitted cannot be considered infringing in themselves, that hosting a site on the internet grants an implied copyright license to read and access its content with any HTTP-capable client, and that browsewrap/clickwrap contracts are unenforceable unless the user undertakes a significant relationship with the company, among other things.


Are you trying to imply that's a ridiculous position? I don't see it as one.


I'll say it's a ridiculous position. How can it possibly be ethical? The owner of the server and the content has specifically told you to stop sending packets at it.

I honestly don't know how to construct an argument for this because it's so obvious to me.


It's ethical because it's a public internet. The same way you can't use the force of law to stop a homeless guy from asking you for change as you walk along a public street, you can't [shouldn't be able to] use force of law to stop a client from asking your server for data as it sits connected to a public network.

It's not unethical for a beggar to continue asking for change. It's up to the passersby to choose whether or not they'll honor his request, but he is free to make it as long as he doesn't get out of control. Many people see the client-server relationship that exists online similarly. As the beggar can't receive anything that the giver doesn't willingly give, neither can the client receive anything the server doesn't willingly give.

It wouldn't make any sense if a guy could give the beggar change and then sue him and say that he shouldn't have gotten change because he actually wanted to use it for his lunch. The judge would say, "Well, why did you give it away? You can't just change your mind and then sue someone over it." This is also what judges should ask servers who dispense information to clients and then try to take it back.

tl;dr there's no harm in asking for data, even after someone has told you no, as long as you do so reasonably.


Strictly speaking there are a lot of places where panhandling is not legal.


Is it unethical to record a TV show and skip through the ads when you watch it if the network would prefer you didn't? If you agree that it is not unethical, then merely the "content owner's" (keep in mind many things people want to scrape are factual information) saying so is not sufficient to make it ethically impermissible to scrape.


In the context of this submission, it's definitely not polite.


It's certainly more considerate if you limit your scraping to rates similar to a human user instead of just going as fast as possible.


Reading the previous thread again, I suppose that many of those against scraping didn't realize they've already lost: with Ghost, Phantom, and now headless Chrome, you're going to have a hard time detecting a well-built scraper.

Instead of fighting against scrapers that don't want to harm you, maybe it's about time to invest in your robots.txt and cooperate.

You could say that scraping your website is FORBIDDEN, but come on: if Airbnb can rent houses, I can scrape your site.


>Reading the previous thread again, I suppose that many of those against scraping didn't realize they've already lost: with Ghost, Phantom, and now headless Chrome, you're going to have a hard time detecting a well-built scraper.

Unfortunately, if you're scraping some data that only has one authoritative data source, they'll know you're scraping them even if they can't distinguish your individual requests from the general traffic.

This is what happened to my company. It didn't stop them from pretending that we were setting their servers on fire, even though they had no way to know whether we were or not since they couldn't distinguish our traffic from that generated by other browsers.

We were scraping only factual data in which the company cannot hold a copyright interest. Nonetheless, under Ticketmaster v. RMG, just holding a copy of a page in RAM long enough to parse it constitutes infringement (you have to prove fair use, as Google supposedly did in Perfect 10 v. Google, to avoid this).

The difference between yourself and Google/airbnb is that the latter have a lot of money and are trendy technology companies, and you don't and aren't (yet).

The lesson is: become really big before someone sues you, and the judiciary will be on your side.


"Unfortunately, if you're scraping some data that only has one authoritative data source, they'll know you're scraping them even if they can't distinguish your individual requests from the general traffic."

How would they know you're scraping them?

Surely the capability of any given website admin to detect a particular scraper would depend on many factors such as whether they're even looking for scrapers or are technologically capable of doing so, how many/which IPs the scraping is originating from, and how cleverly the scraper goes about their scraping, no?

It's a bit of a cat and mouse game, wouldn't you say?


They know you're scraping them because their site is the only source of the data you're scraping. The most common example here is airlines. Airlines that haven't agreed to be included in fare aggregators often have their booking information scraped. Even if your traffic blends in, they know that you're reading out fare data from them, because where else would you get it from? This is especially true if you follow it up with a link to buy the specific fare at the airline's site. The only plausible way to have that is to read it off of their site (and, even if you can use a template based on their URL structure, I think there would probably be a case to be made that URLs qualify for copyright and trademark protection).

As for the game of cat and mouse, it lasts until they call in their lawyers. Then it's a game of "quit now or get destroyed".

But yes, if you can scrape the data without ever tipping off the company you're scraping, you can probably continue indefinitely, but you have to consider whether you can plausibly argue that you're getting that data from someplace else. If they sued you on the suspicion that you're scraping them, they'll probably subpoena the code to confirm that (or similar -- IANAL), and then proceed to try to make a case on things other than CFAA violations.


Oh, you're talking about them inferring that you must have scraped them because you used or published data that only they had.

Not every scraper has publishing or using data in a detectable way as their motive.

For instance, I sometimes scrape a website to make an archive of it for my own personal use. I never publish the results or use them in any way that the website owners would ever know about. So the only way they could know that they were scraped is if I left some kind of scraping signature while scraping (such as scraping from a single IP and doing it quickly enough to pop on their radar, or perhaps too regularly -- i.e. without random waits between requests, etc.).

What you're talking about is probably mostly a concern to people/companies who are somehow making money from scraping data on other people's websites.


It depends on your definition of harm. When your product is what's published on the websites and you regularly find ripoffs of said website publishing your ripped off content, maybe you'd feel differently about it.


Not sure what that has to do with scraping. A desktop browser can be used to copy and paste chunks of content and plagiarize a site. We have reasonable copyright protections to protect authors against that. What we need to discard are the unreasonable laws regarding network access. It's not all or nothing.


Yeah, but that's not just because of web scraping. Plagiarism has been an issue for centuries.


fair enough but I don't think that's the main purpose. There are many many cases where you would want to scrape something and often people would probably be encouraged in doing so in a "polite" way if websites didn't make it hard.


Yes, or if they just provided a csv with all the data most people wanted to scrape anyway with a plain English explanation about how it can be used.


That argument only holds up if you believe in intellectual property. Many of us here do not.


I worked on a research project to develop a web-scale "google" for scientific data, and we found very interesting things in robots.txt files, from "don't crawl us" to "crawl 1 page every other day" or, even better, "don't crawl unless you're Google".

Another thing we noticed is that Google's crawler is kind of aggressive; I guess they are in a position to do it.

Our paper in case someone is interested: Optimizing Apache Nutch for domain specific crawling at large scale (http://ieeexplore.ieee.org/document/7363976/?arnumber=736397...)
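Directives like those can be checked with Python's standard-library robots.txt parser (the robots.txt below is made up; a Crawl-delay of 172800 seconds amounts to one page every other day):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt in the spirit of the ones described above.
robots = """\
User-agent: *
Crawl-delay: 172800
Disallow: /private/

User-agent: Googlebot
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots.splitlines())

rp.can_fetch("MyResearchBot", "http://example.org/private/x")  # False
rp.crawl_delay("MyResearchBot")  # 172800 -- one page every other day
rp.can_fetch("Googlebot", "http://example.org/private/x")      # True
```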


This is why I think Google's position as the #1 search engine will never go away. Many sites will tell your bot to go away if you're not Google. They don't care if you're building a search engine that will compete with Google.


At blekko, we did not find this issue to be a significant one... almost everyone who banned our crawler was a crappy over-SEOed website.


https://www.linkedin.com/robots.txt

https://yelp.com/robots.txt

There goes all Linkedin + Yelp content from your index.


What about https://www.facebook.com/robots.txt

..and medium-sized/small sites are even worse.

There's some irony in Facebook being a core part of all the NSA surveillance programs while its terms of service include "Automated Data Collection Terms": https://www.facebook.com/apps/site_scraping_tos_terms.php


If you surf LinkedIn logged out, you'll see that there isn't very much information available anyway. And there's no money in people search.

Yelp was very responsive when blekko wrote them; as you can see ScoutJet has the same access as googlebot.


The current protocols promote data exchange, and since websites are primarily designed to be consumed, there is really no way to stop automated requests. Even companies like Distil Networks[1] that parse in-flight requests have trouble stopping any sufficiently motivated outfit.

I think data should be disseminated, and free info exchange is great. If possible, devs should respect website owners as much as possible; although in my experience, people seem more willing to rip off large "faceless" sites than mom-and-pops, both because that is where the valuable data is and because it seems more justifiable, even if morally gray.

Regardless, the thing I find most interesting is that Google is most often criticized for selling user data and selling out their users' privacy. However, it is oft not mentioned that Googlebot and the army of Chrome browsers are not only permitted but encouraged to crawl all sites except a scant few that have achieved escape velocity. Sites that wish to protect their data must disallow and forcibly stop most crawlers except Google, otherwise they will be unranked. This creates an odd dichotomy: not only does Google retain massive leverage, but any other search engine or aggregator faces more hurdles with fewer resources to compete.

[1] They protect Crunchbase and many media companies.


If you're worried about your web scraper being a pain in the ass to administrators, they probably need to rethink the way they have their website set up.


An alternative to Scrapinghub: PhantomJsCloud.com

It's a bit more "raw" than Scrapinghub but full featured and cheap.

Disclaimer: I'm the author!



