Hacker News

A few points I can confirm from talking with Jud about this over the last few weeks:

- Billions is the correct magnitude. It's the result of Gnip's aggregation of multiple Twitter-firehose-scale feeds over a year. They are massive users of S3 storage.

- I'm unsure why he's using an XL instance in this case, but I know he's been experimenting heavily with different configurations to improve performance.

- How do you get sustained 700 deletes a second? I'm not being facetious, I see much slower performance using commercial and open-source interfaces to S3 like s3cmd or Bucket Explorer. I'd love to find some faster approaches.



Use non-blocking I/O, like PyCurl's CurlMulti interface (or any other curl multi bindings). You'd have to generate the DELETE command URLs from some S3 library; the rest is up to curl (calling those URLs).

It's not a trivial task, but a reasonably experienced Python (or maybe even C) programmer could do it easily (I'd say in a couple of days). And I can confirm that even on EC2 small instances, 700 HTTP requests a second is achievable (not sure about S3 API rate limiting).
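A minimal sketch of what that could look like, assuming pycurl is installed. The bucket/key naming, the `MAX_IN_FLIGHT`-style cap, and `build_delete_url` are all illustrative assumptions, and a real run would also need signed S3 auth headers (produced by whatever S3 library you use):

```python
def build_delete_url(bucket, key):
    """Build the REST-style URL for deleting one object.

    Hypothetical helper; real S3 requests also need signed auth headers,
    which your S3 library would generate.
    """
    return "https://%s.s3.amazonaws.com/%s" % (bucket, key)

def delete_all(urls, max_in_flight=150):
    """Drive many concurrent HTTP DELETEs through one CurlMulti handle."""
    import pycurl  # non-stdlib: pip install pycurl

    multi = pycurl.CurlMulti()
    queue = list(urls)
    in_flight = 0
    while queue or in_flight:
        # Top up the multi handle with fresh DELETE requests.
        while queue and in_flight < max_in_flight:
            c = pycurl.Curl()
            c.setopt(pycurl.URL, queue.pop())
            c.setopt(pycurl.CUSTOMREQUEST, "DELETE")
            multi.add_handle(c)
            in_flight += 1
        # Pump all transfers without blocking on any single socket.
        while multi.perform()[0] == pycurl.E_CALL_MULTI_PERFORM:
            pass
        multi.select(1.0)
        # Reap completed handles, successful or failed.
        _, ok, failed = multi.info_read()
        for c in ok:
            multi.remove_handle(c)
            c.close()
            in_flight -= 1
        for c, errno, errmsg in failed:
            multi.remove_handle(c)
            c.close()
            in_flight -= 1
```

The key idea is that one process multiplexes hundreds of in-flight requests, so throughput is bounded by the network and S3, not by per-request latency.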


Non-blocking I/O isn't a magic wand; context switching is virtually free in a language without a GIL:

http://paultyma.blogspot.com/2008/03/writing-java-multithrea...


The solution I used leveraged non-blocking I/O in a roundabout way: 150 separate processes (quick and dirty) per instance, no threads. I let the OS manage the I/O in that regard. My thinking was that if I had "a lot" of processes doing I/O, I'd get the same effect. Again: quick and dirty.
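The process-per-worker approach above can be sketched with the stdlib. This is a hypothetical skeleton, not the commenter's code: `delete_one` is a stand-in for a real blocking S3 delete call, and the worker count of 150 mirrors the figure given in the comment:

```python
# Many blocking workers instead of non-blocking I/O: split the keys across
# N processes and let the OS scheduler overlap their in-flight requests.
from multiprocessing import Pool

def delete_one(key):
    # Placeholder: a real worker would sign and send one blocking
    # HTTP DELETE for this key via an S3 library.
    return key

def delete_many(keys, workers=150):
    # Each worker blocks on its own request; with enough workers the
    # aggregate delete rate approaches what non-blocking I/O achieves.
    with Pool(processes=workers) as pool:
        return pool.map(delete_one, keys, chunksize=100)
```

The trade-off versus the CurlMulti approach is memory and process-spawn overhead in exchange for much simpler code.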



