Maybe. Running your own infrastructure at large scale is actually really hard and therefore surprisingly expensive in risk and talent cost. If you look at how big the internal infra teams are at companies that host their own infrastructure, they're often individually the size of growth stage companies.
The problem is that provisioning, reliability, and security are by themselves really tough problems. If those issues aren't in your company's core competencies, it's not necessarily efficient to invest in building out all of that.
I look at it as the question: can you get the same set of agility/reliability/security guarantees for your narrower set of use cases by paying for your own hardware and engineering? I won't even begin to pretend I have any answers there, but I think that's the calculus.
> If those issues aren't in your company's core competencies, it's not necessarily efficient to invest in building out all of that.
Maybe that's just the story cloud providers tell you.
Until you try, do you really know if it's all that complicated? People have been running datacenters for a long time, and not all of them work for Amazon.
But there may also be a beneficial side effect to having gearheads around, and maybe giving that up is the real cost of going cloud.
Having done a bunch of bare metal, I can tell you the calculus isn't really that hard. Bare metal will save you money.
Operating bare metal at scale requires talent that doesn't exist, not necessarily at an engineering level, but at all levels.
As an example, I worked at a place that had a large bare metal deployment, i.e. >1MW worth of compute. It was woefully inefficient and costly to operate. The product that they offered required network QOS and compute with real time capabilities, neither of which was available from any cloud provider at the time.
One of our executives (formerly a leader in the DC ops org at AWS) left the company and was replaced by an executive from another well-known Silicon Valley org, who then insisted we should migrate everything to the cloud.
I showed him the relatively easy math that efficiently utilized bare metal was way less costly and that the aforementioned QOS and RT requirements would be a deal breaker anyway. He failed to fully grok this and remained insistent. When I quit, he seemed surprised. After the fact, I discovered that they'd made a deal with IBM to move everything into their cloud. A year later it was an utter failure and they abandoned the project.
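For anyone curious what that "relatively easy math" looks like, here's a toy version in Python. Every number below is a made-up placeholder, not a real figure from that deployment, and the whole comparison hinges on the fleet actually being efficiently utilized:

    # Hypothetical back-of-envelope comparison -- all numbers are placeholders.
    servers = 200                    # steady-state fleet size (assumed)
    server_capex = 12_000            # $/server, amortized over 4 years
    colo_per_server_year = 2_500     # $/server/year for power, space, network
    ops_engineers = 4                # extra headcount to run the fleet (assumed)
    cost_per_engineer = 250_000      # $/year, fully loaded

    bare_metal_yearly = (
        servers * server_capex / 4         # amortized hardware
        + servers * colo_per_server_year   # colocation, power, network
        + ops_engineers * cost_per_engineer
    )

    # Comparable on-demand cloud instances (assumed $/hour for a similar box)
    cloud_hourly = 1.50
    cloud_yearly = servers * cloud_hourly * 24 * 365

    print(f"bare metal: ${bare_metal_yearly:,.0f}/yr")   # ~$2.1M with these placeholders
    print(f"cloud:      ${cloud_yearly:,.0f}/yr")        # ~$2.6M with these placeholders

Swap in reserved-instance pricing or a fleet that sits idle half the time and the conclusion flips, which is exactly why the utilization and QOS/RT requirements mattered in our case.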
There are lots of folks in the valley whose resumes suggest they should be capable of understanding these kinds of things, but who simply don't. Lacking that understanding leads to poor decision-making, which leads to failure, which leads to risk-aversion, which leads to everyone believing that it must be cheaper in the cloud.
Or so goes the old adage, "nobody ever got fired for buying IBM."
Feel free to ignore this, but would you shoot me an email? I'm working on a project that uses bare metal and would like to pick your brain.
EDIT: To whoever downvoted this, the commenter hasn't listed an email address, or I would have reached out directly. This is an honest attempt at communication that doesn't require someone to break anonymity.
From a Lyft engineering perspective, I'd rather focus on questions like: how do I make sense of, process, and extract value from this ton of data? How do I improve the customer experience? Rather than: how do I save money in the data center, how do I keep my data-center stack updated, and so on.
You can't just eliminate that 10%. Even if going to fully bare-metal lowers costs it takes a lot of time and manpower to make that transition. When that investment can be made in other areas that have much more impact it really doesn't make sense.
Bare metal works when your workload is well-defined and understood. Then you can actually put reasonable estimates for what you need and hire/purchase infra accordingly.
The balance here is tricky. Based on public data, it seems that Netflix has ~$16B in revenue against $300m/yr cloud spend. 2% seems much more reasonable to me.
I feel like a drive toward efficiency is a worthwhile endeavor for a startup in terms of establishing a competitive advantage.
The description here seems to be mostly about compute and storage. What about the boatload of services offered with AWS? Plugging and playing with services maintained by AWS makes it easier for companies to focus on their product logic. The major expense is actually engineering.
My forecast on RDS is that it’s a dead-end product – all the future hotness is going to be in Aurora Serverless. Multi-region active/active Postgres with totally usage-based pricing and totally elastic performance is going to be a game-changer.
But if you need bleeding-edge Postgres performance, you hire a DBA, and they probably build something on EC2 or bare metal.
———
As I understand it, RabbitMQ is probably a better point of comparison for SNS/SQS, and Kinesis is the Kafka peer.
Regardless, the reason you don’t “just” run Kafka is: you don’t have a team that knows how to tune, deploy, and operate a production Kafka cluster. I learned enough about SNS and SQS to get it running in an afternoon, and I really haven’t needed to think about it since. Kafka (or RabbitMQ, or ActiveMQ, etc.) needs instrumentation and monitoring and patching and quorums and capacity planning, and at some scale those are worthwhile, but that scale is MUCH larger than what most Kafka clusters are actually serving.
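For what it’s worth, “an afternoon” isn’t an exaggeration. A minimal producer/consumer with boto3 is roughly the sketch below (the queue name is made up, and in production you’d still add a dead-letter queue and a sensible visibility timeout):

    import boto3

    # Minimal SQS producer/consumer sketch -- queue name is hypothetical.
    sqs = boto3.client("sqs", region_name="us-east-1")
    queue_url = sqs.create_queue(QueueName="example-events")["QueueUrl"]

    # Producer: enqueue a message.
    sqs.send_message(QueueUrl=queue_url, MessageBody='{"event": "ride_requested"}')

    # Consumer: long-poll, process, then delete.
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling
    )
    for msg in resp.get("Messages", []):
        print(msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

Durability, replication, and patching are Amazon’s problem; the Kafka equivalent of that snippet comes with brokers, ZooKeeper, partitions, and retention policies you now own.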
———
The theme here is: if you have a business requirement for 90th percentile specialized performance, great! Hire domain specialists who can make your systems run at that tier! But for everyone else in the world, when you can get usage-based pricing, elastic resources, and automatic durability and patching... why would you go to the trouble of learning how to deploy and manage a service?
I've been on both sides and it's not as simple as "bare metal saves you money." It really depends on the company and the type of applications being hosted and where the business is growing (or not growing). An established company with an established workload, especially if it's simple, will probably do better on bare metal, but cloud is popular in the Valley because ideas are still being developed and iterated on heavily where you don't want to be stuck on multi-year hardware leases that might not sync with what that future business looks like. Lyft is probably in that in-between stage, but I still think it's an enormous undertaking to become an infrastructure company over simply being able to hire full-stack developers, which is a lot easier and cheaper. Right now their time is better spent elsewhere.
I remember how hard it was to hire senior operations people. There are not many of them, and there are not many of them at the level of being able to deliver something amazing. The ubiquity of the cloud has only made these kind of experts less common.
Every place I've worked that did bare metal was always drowning in maintenance instead of working on the next big thing. And no big surprise, our internal infrastructure was nowhere near as high quality or capable as AWS. And most of our developers had experience working directly with cloud providers, without ops people, so we were delivering them a worse experience and slowing them down, and we required more ops people to help them and maintain it and keep everything online.
Also, a move to IBM's cloud isn't the greatest example. I had hundreds of bare metal servers in an IBM-owned datacenter and their cloud offering was consistently behind AWS/GCP; if anyone recommended IBM cloud to me I would have laughed at them. It seemed to me that IBM was trying to up-sell on the "cloud" buzz word without actually delivering anything except higher prices, just like how they're now trying to ride the buzz of the blockchain.
Dropbox is a good example of a company that took quite a while to move to their own platform, away from AWS (and they still have 10% of their stuff in AWS to this day). Dropbox is basically a storage infrastructure company, unlike Lyft, but it still took them years to invest in the development (and migration) of that custom platform to replace AWS, an investment that not many companies are going to want to gamble on, especially if their primary business is not storage.
And I think it's telling that Dropbox started on AWS, grew the business on AWS, and moved to a custom platform once their business model was perfected and they wanted to cut costs prior to going public. If Dropbox had started on bare metal from day one, would they have been able to pull it off?
IBM cloud is a joke. That's why I put it in there. The aforementioned executive was clearly not thinking.
There's nothing you've written that I disagree with. It's easy to do the math that shows where bare metal saves money inclusive of the labor costs. For some reason most everyone seems to fail at it. I could expound on why, but this:
>I remember how hard it was to hire senior operations people. There are not many of them, and there are not many of them at the level of being able to deliver something amazing. The ubiquity of the cloud has only made these kind of experts less common.
Those folks just don't exist. Building infra is more than just buying infra. It takes actual development, which is why I think so many fail at it.
Your anecdote about Dropbox is telling. They adopted cloud, and more importantly cloud methodologies and then went back to bare metal. There are others that have done the same. I recall a talk at an Openstack conference given by Verizon in which they described their approach. Developers begin in AWS, utilize a cloud-based approach, and then when cost concerns become an issue, they aim to offer similar services in-house on bare-metal.
>Or so goes the old adage, "nobody ever got fired for buying IBM."
This is true, but it's never really hit me before, even though I've already been operating on the assumption that trusting the cloud is less risky than trusting my own skills.
I think the point is, Lyft doesn't want to be in the business of running enterprise cloud infrastructure.
They want to make money brokering rides.
Taking on their own cloud infrastructure -- in theory -- could economically make sense. But that's just an extra layer of risk and complexity they'd rather forego to focus on their core business.
After all, their core business is already losing $930M on $2B in revenue. Their cash flow doesn't put them in a good position to make large up-front investments in data centers.
So, yeah, like a broke renter in an expensive city. In theory, it might be better to buy a house, but you don't have the down payment, and maybe you should be focused on increasing your earning power rather than saving money anyway...
Yes, I've worked with a few of those datacenters. A few examples:
- Recently had to purchase new servers; because of signed contracts, the only servers we were allowed to purchase and put in the datacenter were four years old and technically EOL.
- Firewall changes, AD changes, provisioning a VM, etc. are 48 hour turnaround. Purchasing new hardware requires 4-6 weeks.
- Had an intermittent issue with their edge firewall, it'd slow certain connections to a crawl and eventually they'd timeout. Took six months to fix it, for the first three months they told us it wasn't their fault (turning off their deep packet inspection ended up fixing it).
I still remember when we opened the first ticket about it, and the reply was "no other customers are experiencing problems" and it was closed.
That's just a few examples of how painful it can be. To give you the other side of the coin, having worked with an enterprise contract in AWS, we were having an intermittent issue with DNS resolution failing for a few seconds every few days. They put an engineer on it full time till they found the problem (we had misconfigured it), and it didn't cost us anything more than the enterprise support. I was actually shocked they'd invest that much in such a vague issue.
Yes, AWS is expensive, but you're getting world-class engineering proven at scale, and access to some very smart/motivated people to support it (and they have access to the teams who built it when they can't solve it). I don't think I'd ever choose a managed datacenter over AWS/GCP/Azure/etc. Either do it in-house where there's accountability, or use cloud providers who have proven their competency.
To be clear, I'm talking about VPC/EC2/etc. I can't really discuss a lot of their higher level and newer managed services; they either weren't as good, or I haven't tried them. But the bedrock these clouds are built on is solid, and that's worth paying good money for.
I've done that too. It's obviously easier than running an entire datacenter, but you still need to manage all the underlying services that you deliver to your development team. For example, running things like this on your own:
- storage clusters
- database clusters
- compute clusters
They are often very easy to set up, but when things go wrong, they go very wrong. And welcome to a stressful environment, because if you can't figure it out and your people can't, well, your business just sits and burns while you try.
Even when AWS has a system-wide outage, it's nice to know that I don't have to be dealing with those underlying problems anymore and I know they have the best people working on them.
I cannot put into words, after operating MySQL clusters on my own and playing back transactions after failures, how nice it is to use AWS RDS and how it's just been zero problems. Zero. I sleep through automatic updates of our database system with RDS. I would have never done that on our own system.
And in most places, even "managed" leased hardware, you still will need to purchase/lease and run your own hardware firewalls and ddos mitigation. The datacenter might offer that protection "built-in" but you'll soon find the limitations of that offering when you face a substantial attack.
Seems like your experience is a case of ”ad hoc” systems management, and I know what you’re saying is all too common in the enterprise world.
Having spent my entire working life automating infrastructure of all kinds, I know you can achieve an enormous increase in efficiency rather easily with a few well-placed automated processes.
I’ve always been baffled by the fact that at any given larger company there are hundreds of employees trying to supply the business with tools to automate business processes — the IT dept.
Yet, they are completely incapable of using these very same tools to automate their own ”business”.
And the resistance I’ve been met with at different places through the years when trying to implement the simplest of automation is massive.
I used to laugh at the ”cloud” because, back then, at 25 years of age, sitting at a medium-size company with boatloads of cash, I assumed everyone was doing it the way we were: automating all the things.
Now, many years later, I’ve obviously realized that many places simply don’t have the right culture and mindset, as it’s not their “core business”.
I believe however that this is changing, and changing quickly. In many ways thanks to the “cloud”.
Having worked in the EC2 org, the people handling the customer service tickets were usually the on-call (when nothing was on fire) or the daytime SREs. It could be hit-or-miss depending on who is on call and picks up your ticket: some people were unempathetic and would reply to those tickets with one-word answers, or you'd get lucky with someone who would dig deep.
No. That is my experience too. If your company’s core value prop isn’t running and managing a datacenter, your datacenter is gonna suck when compared to folks who do it for a living.
I work for one of said teams, doing datacenter cyber-security. You could say "trying" is my day job.
Since we're internal and we manage a lot of capacity, we do often provision and roll our own equivalents of things that cloud providers will sell you, rather than just buying a cloud solution. It's often ambiguous whether it was a good use of time/money. If it weren't for the economies of scale that kick in at the sheer size of this operation, it would definitely not be worth it.
I'm actually curious how AWS is able to scale support so well with what seems like reasonable quality. I had an issue (my fault in the end) and got excellent support - and they didn't tell us to take a hike when it turned out not to be AWS's fault.
Conversely, with GCP four years ago, we had some support issues and I didn't come away impressed - I'm convinced even internally GCP isn't well documented or something.
I think it comes down to Amazon vs. Google's company cultures around customer support, which informs attention and resource budgeting. GCP isn't the only Google product where people hate its customer support, and AWS isn't the only Amazon product where people love its customer support.
Amazon takes customer support very far - with one exception - bogus products on the marketplace seem like a big miss.
But what I paid for and got on AWS support is so far out of whack that there is NO way they made money on my account for that whole year. And the person was actually competent, which was a shock. So many "technical support" folks seem like idiots.
Comcast, for example: I'd purchased my modem, and they started charging a rental fee - I had to call these bozos every month to reverse the charge, a total waste of time. I finally cancelled - I just couldn't take it, and each rep either lied to me or didn't have a clue, like condescendingly telling me I had to pay for the modem.
Not really; at that scale it eventually becomes a lot cheaper to run your own thing. Lyft might still be too small, but if they go international and grow 10×, then AWS seems like a choice to reconsider.