Infra is hard to get right, to change, and to standardize.
That's why people use cloud providers, but they are very expensive.
With code, you can solve problems using cheap fleets of heterogeneous VPSes from different providers.
And when you have a small budget, that makes a 10x difference in hosting cost. Not to mention you don't need to hire a super-duper 300k <cloud brand> expert to manage your system, a regular dev will do.
Most products don't need unlimited scaling, zero downtime and so on. They can even crash in prod once in a while, especially in the early years.
You can always migrate from code to infra later when you need to scale up and standardize. The opposite is way harder, if you even reach that point at all.
E.g., on a video streaming platform with a decent audience (800k users a day), we had to transcode user-uploaded videos, then ensure we had several copies of them load balanced across more or fewer servers depending on each video's popularity. Today you would put some web-scale key-value store in charge of that, maybe a cloud virtual FS or at least some CDN. At minimum, you'd spawn Docker instances, Kubernetes if you feel fancy. We solved the problem with Redis, nginx and the cheapest naked VPSes possible. The site ran on only 7 servers, including encoding, streaming, database and the Django app itself, and was super fast.
To this day it is still maintained by one single dev working part time. It's 10 years old and still uses Python 2.7, jQuery and a hellish handcrafted CSS mess.
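The popularity-based replication in a setup like this can boil down to a small function over per-video view counters (the kind of thing you would keep in Redis). A minimal sketch, with made-up thresholds, not the actual production code:

```python
import bisect

def replica_count(views_per_day, min_replicas=2, max_replicas=6):
    """Map a video's daily view count to a number of streaming-server
    replicas. The thresholds are illustrative, not real ones."""
    thresholds = [1_000, 10_000, 100_000, 1_000_000]
    extra = bisect.bisect_right(thresholds, views_per_day)
    return min(min_replicas + extra, max_replicas)
```

A cron job comparing the desired count against where copies currently live, then rsyncing to (or deleting from) nginx hosts, covers the rest.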
So sure, infra is the clean solution. But Pareto has something to say about it.
That's in fact how I read the author. He advocates using already-written, well-known software like Traefik, Linkerd, Envoy and so on, and not reinventing their functionality in your code.
Just so I'm not just nitpicking, k0s and k3s are both good solutions to hosting kubernetes on bare metal or cheap VPSs. Hetzner in particular is starting to get most of the features you typically lack on VPS providers in that price range (software-defined networks, load balancers, storage) packaged in a kubernetes cloud-controller.
That is by design. k3s installs a bunch of things that assume you’re on bare metal, k0s knows you’ll probably bring your own ingress and cloud controller. It is just kubeadm, but in one binary.
Sure, but one single dev working part time costs somewhere between half of €80k and $200k per year. Running this on AWS managed services would probably cost you no more than €15k, and even that is an exaggerated estimate.
Infrastructure is so often arbitrary, opaque configuration hell, where no amount of sense can be made of the sprawling orchestration just by looking at it. It's also third-party and out of your full control. Code, by contrast, can often be a lot more transparent and readable, and it is yours to own, faults and all. If you can code, and solving the problem is within scope and capability, I think you should solve it in code.
Then again, I would think that, as my only experience with infrastructure is when it fails and I have to fix it.
I feel this. If you ARE going to solve your problem with infrastructure, then get ready to descend into the madness of infrastructure-as-code. One way or the other, you are going to end up having to codify it.
Exactly. It is basically delegating the problem from one place to another thing that is quite opaque and hard to debug.
But again, it depends on whom you ask. An infrastructure engineer or ops person is more comfortable with infrastructure scaling your app, whereas a developer is more comfortable with the application scaling itself.
I'm sitting here trying to make sense of how nginx parses regexes and I feel this in my bones. Any critical function buried in infrastructure is going to be inherently so much harder to debug and test in a predictable way. I lean towards putting as much as possible into application code and keeping infra as simple as possible.
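For what it's worth, this is part of why I lean that way: the same matching logic, written as an ordinary function over a route table (a hypothetical one here), is trivially unit-testable in a way an nginx location block never is. A sketch:

```python
import re

# Hypothetical route table: compiled once, matched in order, much like
# nginx location blocks, except you can unit-test it in isolation.
ROUTES = [
    (re.compile(r"^/api/v(\d+)/users/(\d+)$"), "user_detail"),
    (re.compile(r"^/static/.+\.(css|js|png)$"), "static_file"),
    (re.compile(r"^/"), "fallback"),
]

def route(path):
    """Return (handler_name, captured_groups) for the first match."""
    for pattern, handler in ROUTES:
        m = pattern.match(path)
        if m:
            return handler, m.groups()
    return None, ()
```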
I agree. I think k8s virtualizing ingress, networking, monitoring, restarting, logging is brilliant; I just wish I could write code to make it happen instead of reams of YAML.
Remember that Kubernetes doesn't actually use YAML anywhere internally; it just accepts it as a convenience, and you can just as well call the API directly or use one of the client libraries.
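For example (a sketch, not tied to any particular client library): the Deployment you would normally write as YAML is just a nested structure you can build in ordinary code and hand to a client library or POST to the API server.

```python
def deployment_manifest(name, image, replicas=3):
    """Build the same apps/v1 Deployment object you would otherwise
    write as YAML. Nothing here assumes a running cluster."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }
```

Being plain code, the repeated labels, naming conventions and defaults live in one function instead of being copy-pasted across YAML files.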
If you are using your roads to do gps pathfinding, then you are indeed doing it wrong.
Do not do in infrastructure what can be done in code. You can't version control your infrastructure, but you can version control your code.
And with cloud instances and VMs providing abstractions that don't map 1:1 to the hardware they're running on, all your infrastructure becomes code to create reproducible deployments, or serverless execution.
You don't build your roads immediately before you start driving, and destroy them after you're done, and build only as many miles of road as you need each time you go for a ride. They're not a great analogy to what we do with computers.
Say you need to do some processing, for example creating a summary report over several gigs of really big JSON docs every few seconds. This is a problem that can really be sped up by using more cores.
Which option will you choose:
1.) Install, configure and maintain a stream-processing tool/framework and add dependencies to your code? Oh, and don't forget to add service discovery, special filesystems, and to install 7 different runtimes and all the dependencies.
2.) Run Kubernetes + Kafka and several Python or node.js microservices (each with many instances, because node.js is practically single-threaded): one to chop up the data and put it on a queue, another that reads it from the queue and actually processes it, placing the results on another queue, and a third reading from there and responding to the UI.
3.) Create a function that does all the processing per document. Run it using pmap (parallel map, starts off threads in the background) or reducers in Clojure or your favourite framework. It uses all the cores on your machine.
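A sketch of option 3 in Python rather than Clojure, using a process pool to use all the cores; the `amount` field and the shape of the report are made up for illustration:

```python
import json
from multiprocessing import get_context

def summarize(raw):
    """Per-document work: parse one JSON doc and pull out what we
    aggregate (the `amount` field is a made-up example)."""
    doc = json.loads(raw)
    return doc.get("amount", 0)

def summary_report(raw_docs, workers=4):
    """Fan the per-document work out across cores, then reduce.
    Using the 'fork' start method keeps the sketch self-contained
    on Unix; pick the method appropriate for your platform."""
    with get_context("fork").Pool(workers) as pool:
        amounts = pool.map(summarize, raw_docs)
    return {"total": sum(amounts), "count": len(amounts)}
```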
I know I'll choose the more Boring technology in most cases, as most cases are pretty boring and don't need all those advanced services, orchestrators and architectures.
+1 ... You do (3) until someone hires a "data engineer" straight out of a bootcamp who tells everyone above you that the team is not following "modern practices" and is using archaic approaches.
Then you get pulled into regular meetings to explain the timeline for migrating to a proper structure.
Now, there will come a time when you'll need to automate this. If/when that time comes, and if you already have decent cloud competency, it is far easier to tie everything together using S3, SNS, SQS, Lambda (or other equivalents) than going the Kubernetes/Kafka route.
I think there's another side to this coin: very small software teams moving quickly in markets where truly senior devs are scarce. It makes far more engineering and business sense to invest in cloud infrastructure than to build and control all of those systems in-house. I can hire a middle-of-the-road developer and trust they've (at the very least) heard of the AWS or GCP tools/services we're using, but if I wrote my own systems in Clojure/Elixir/whatever (even though that's what I'd prefer to do), there's nowhere near the likelihood that a new engineer will know what to do, and it'll take months to train them up to even a basic level of competency. You can make all sorts of "it's better in the long run" arguments, but those don't help when the C-suite says "yeah sure, maybe that's the best approach, but we need to get this done right now." That's where clicking a few buttons to spin up a load balancer in front of a handful of serverless handlers becomes rather nice.
I don't disagree, but if you are fully invested in some cloud vendor, there are maybe easier and cheaper options than basically rolling your own mapreduce.
In AWS for instance, you are one API call away from running a Spark job in Glue or EMR, that effectively does all that for you. Plus you get logging, monitoring, concurrency limiting, retries.
Or, even easier, you could just dump the big json docs in S3 and use Athena to directly query them with SQL.
Until you come to look for your next job, and suddenly find that your experience with Boring technology is worth nothing, because everyone is running their should-be-a-monolith, serves-maybe-10-customers-an-hour application as a set of microservices in Kubernetes.
Depends if you're developing to be a developer or developing to solve business problems. If you're doing the latter, your clients hire you to get their problems solved and couldn't care less about the tech behind it (but will be pleased to be paying $100/month for a server instead of $1000/month for various AWS services).
For many companies (1) and (2) are the boring choices, as the infra is already in place, well-supported, and there are existing patterns to follow for how it's already being used by other teams in your company.
Right, if you have the resources and the ability, then writing bespoke tailored solutions will yield better results. You only get problems when your custom solution isn't as good as a hardened solution, or you can't spend the time to make it so.
Be wary of doing in infrastructure what could be done with a little bit of code.
Maintaining a dozen bits of pieces written consistently in one programming language may be cheaper than maintaining complicated infrastructure with various components each requiring separate competencies.
As with everything, tradeoffs are usually involved. Be suspicious of people claiming otherwise.
100 times this. There's a massive organizational cost to having N "languages" that a developer needs to know to be able to solve a problem end to end. Unless you have a very large team and redundancy within every specialty, having N heavyweight tools is going to destroy you on a regular basis while you wait for the new person to spin up on them.
Pick 5 tools and use them religiously. They're rarely the optimal tool for anything, but unless you're operating at large scale, you'll always get better results sooner.
When people complain that nowadays new developers can't get anything right, I think we forget that when those experienced people started their careers, myself included, you just had to know one or two languages and a couple of simple APIs, and that would be enough to code almost everything you were expected to do.
Nowadays I seem to need 20 different complex technologies to be able to do even a simple service.
I can do that because I had the luxury of learning them slowly over the years as they were taking over the market.
But what about those new guys? They can't learn everything well all at the same time. So they must do shoddy work at least somewhere, the whole system is stacked against them.
Or they need to Ctrl+c/Ctrl+v most of their solution without understanding where it came from or what it does exactly.
And they compromise on understanding the important parts, like computer architecture, OS internals or networking protocols, because they just don't have the bandwidth left for those after learning all the frameworks, libraries, DSLs and tools needed to see something running.
I think it is the responsibility of the senior people on a project to manage the cognitive load of the rest of the team and keep it at a reasonable level. Sometimes a new tool looks shiny and fun. But is the added benefit worth the disruption to everybody? I think of the ability to learn and keep things in memory as a kind of budget.
I want to copy-paste this whole comment onto my mirror so I see it every morning.
Yes, managing the team's cognitive load is one of the primary jobs of a senior technical leader. I feel lucky that I've been able to pick up as many tools as I have, but it was generally only two or three at a time, and the more I look around, the more great tools I see that I still don't know how to use. That doesn't mean they aren't great, but I know that even with my decades of experience, I wouldn't be able to build something with those tools without a big upfront investment in learning them.
There is a point of critical mass you can reach with vertical integration where you realize that you are almost entirely independent of others and can sidestep all of the pitfalls that virtually anyone else would have to suffer through. You are in charge of your own kingdom and no one else but your immediate customers can tell you otherwise. This is why Apple is making their own silicon. Total control over the entire product value chain.
Anyone who is trying to sell you on spreading your product value proposition across some cloud product portfolio is either incompetent, or simply deprecating the engineering of your product for a profit motive. In very few cases does it make engineering sense to move product complexity from the most manageable domain (codebase with line-by-line scrutiny) to some bloated web interface with no documentation. Certainly, there are declarative configuration techniques for infrastructure, but I think we can all agree these abstractions leak like a sieve and are prone to frequent breaking changes.
Once you experience the degree of control that you get with all in-house development, you will never ever want to give it up. We don't have total vertical integration today, but it is still really fucking nice compared to what I typically see being bandied about on HN every day. Our foundation is bare-metal hosting, SQLite & .NET Core (on Windows, but we could move to Linux with minimal pain). It is really hard to get more independent than this in software without rolling your own OS/DB/compiler. Certainly, we could have selected a more "independent" language & framework, but productivity/security/stability is also a huge consideration for our customers.
We have some other 3rd parties we work with, but at no point do we consider our core product value to exist in their spaces. It is just a 2-way integration with clean separation of duties between businesses.
As with everything in software architecture: everything is a compromise.
At large scale, and for applications running on end-user devices, I try to keep as much complexity as possible in the software, because you don't pay for its execution.
But again, that's a compromise that depends on scale and infrastructure costs.
If the app drains too much battery, the #1 thing smartphone users care about, then expect a dismal rating and uninstall ratio.
As with anything in software, the more code there is, the more potential for bugs. Things should always be engineered to optimize for that metric (fewer bugs). Of course, if it is cheaper/safer/more secure to do something at the client, then its benefits must be weighed against the complexity it may introduce.
I really doubt battery usage is the #1 thing users care about... I'd imagine users care about intuitiveness, usability and value way before worrying that an app kills their battery 3% faster than other apps...
Get a phone that's a few years old, with a battery whose capacity is severely reduced. Use it as your only phone for a month or two. See how you feel about battery-draining apps then.
Doing load balancing at the client level, for example, is not something that's going to significantly increase power usage, but it has the opportunity to reduce infrastructure cost at the expense of complexity. That's the sort of compromise I'm talking about.
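Concretely, client-side load balancing can be as small as a few lines; the tradeoff is that the strategy is then baked into every shipped client. A sketch (server names are placeholders):

```python
import itertools

class RoundRobinBalancer:
    """Client-side selection: rotate through servers, skipping ones
    this client has marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()
        self._cycle = itertools.cycle(self.servers)

    def pick(self):
        # Try each server at most once per call before giving up.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server not in self.down:
                return server
        raise RuntimeError("no healthy servers")

    def mark_down(self, server):
        self.down.add(server)
```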
Gotcha. For load balancing in particular, one is better off doing it server-side. There are many reasons, one among them being the ability to change load-balancing strategies in the face of unforeseen circumstances.
The biggest source of config-based outages comes from infra deployments. You can easily roll back application code, use feature flags etc. to control the blast radius. But with infra deployments, it gets very tricky very quickly.
I think what the author is trying to say, albeit doing a poor job of explaining it, is that we could leverage well-matured infra-level tools like Envoy, Consul etc. instead of implementing their features at the application layer, with which I completely agree. You don't have to rewrite a consensus protocol or traffic routing at the application layer; leverage the existing infra-level tools instead. But you still cannot get away without understanding how these tools function at their core.
That is not always the case. Your code will end up in a container setup, or in libraries like Kubernetes or, say, Salt packages.
It may be good, or it may have features that are unsuitable; you have to remember that adapting it can be harder than writing redundancy handling from scratch on a good VM.
Especially if you have special requirements for data consistency that a heavy database could otherwise handle.
A particular trap is adopting this heavy service infrastructure and then trying to scale 100x, which not even Docker (the lightest option) can pull off without a lot of hardware, and nanokernels have problems too. Then you get additional infrastructure problems which require even more infrastructure and staff, which you cannot handle, so you outsource, which then gets done partly unsuitably or too expensively, and you no longer own your stuff.
IMO the article needs more concrete examples. Give some specific examples of when a specific infrastructure product was the better choice over writing code. Then give some examples of when a specific infrastructure product was a bad choice.
This depends on how much access and control you have. If I am building something on my own, or with a small team where I can configure everything, or at least see everything, I totally agree with this. But in a large corp, as a software developer on an application team, I do not have access to any of the firewall configuration, the database configuration, the networking configuration, the operating system configuration, etc., and getting someone from the right team to look at things and allocate time requires paperwork. So I will just do it in software. It's unfortunate, but if it saves me a month of paperwork and hassle, I would rather write a function in software in an afternoon, even if it's suboptimal.
As an operations person scaling up large-scale distributed infrastructure, with tools like Kafka, RabbitMQ & OpenSearch being mission-critical, it's becoming increasingly clear to me that the consumer applications themselves need to manage the failover/maintenance process of this infrastructure for things to run smoothly and for people to stay sane.
As someone also with a fair bit of experience writing Erlang, there's a temptation to not have any of these external dependencies and have everything in software anyway. Erlang can handle it.
For Elixir, it's similar. Yes, you can use Phoenix.PubSub with Redis as your infrastructure-based MQ, but that ADDS complexity over using Erlang's native pg/pg2.
The title doesn't mean what many responses here seem to think it means. Most of the infrastructure recommended in the article is not "infra" in the sense of cloud-hosted serverless Lambdas, etc., but rather best-in-breed software services or frameworks like Envoy and Consul (which themselves require dedicated physical or cloud infra to run on). So it's still arguing "buy vs build", but it isn't about cloud or serverless architecture.
Just because you can bring in some infrastructure level tool to ensure that “X always happens” does not mean it is necessarily a good idea. If you have a small team, small budget, tight timeline, etc. it could serve as a major distraction when “just doing it in code” would be good enough.
In practice, more often than not “X always happens” doesn't hold, there's always exceptions. Next thing you know, you're bringing your business logic into your infrastructure.
I've seen this notion play out at my own company: use AWS Lambda for everything and don't put too much of anything in one place. This is to avoid creating a monolith. Cross-cutting concerns should be put into a lambda layer or (more rarely) a shared library.
The issue I see is that most of the problems they're trying to solve this way could be solved with good application architecture.
Kubernetes and Erlang are really solving different problems. Kubernetes makes sure my OS process is running; Erlang lets me compose my application from a set of fault-tolerant, supervised actors in a single OS process. Somewhat similar problems, but at totally different levels of abstraction. Erlang out of the box isn't especially great at building distributed applications and benefits from leader services, message queues, RPC mesh frameworks etc. nearly as much as any other application; it's just that you can solve pretty small-scale, simple problems without reaching for those in some cases. José Valim (creator of Elixir) wrote about this in a great blog article a few years ago: http://blog.plataformatec.com.br/2019/10/kubernetes-and-the-...
Whether to use infrastructure depends on the complexity of what you need and how many steps removed it is from the core value you're delivering. Infrastructure has configuration costs, an opaque debugging and hardening process, and maintenance costs too. The more complex the problem this one component solves, the better the value proposition of using a third-party solution as infrastructure.
However, if the component is key to your value-add, the less likely such solutions will fit your needs exactly, and so it might quickly become a bottleneck. You might then find yourself rolling your own solution anyway. Just use good judgment.
This is why there are so many different database types now, e.g. key-value, time series, etc. Relational databases aren't perfect for every problem; they're just good at most problems. But you should probably stick with a relational database unless you know for sure you're better off without one.
I think the article oversells the guarantees infrastructure can give you, especially around this:
> We can make sure that we always have X instances of a given service or process running in our infrastructure.
Kubernetes will make sure to try to start new instances if existing ones fail. It cannot make sure that you always have those X instances running by any means!
This may sound nitpicky - and of course, any code-level solution to redundancy or failure issues won't work if you don't have anywhere available to run on - but it's important to understand both your requirements and what specific guarantees your infrastructure can make. There are cases where "let it fail fast, Kubernetes will restart it" may not be your best choice.
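To make that concrete: the in-code counterpart of "Kubernetes will restart it" is a supervision loop around the call itself, which absorbs transient failures without a full process restart (a sketch; it obviously can't help once the whole process is gone, which is where infra-level restarts still matter):

```python
import time

def supervise(task, max_restarts=3, base_delay=0.01):
    """Run a callable, restarting it with exponential backoff when it
    raises. Returns the task's result, or re-raises the last error
    once the restart budget is spent."""
    for attempt in range(max_restarts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_restarts:
                raise
            time.sleep(base_delay * (2 ** attempt))
```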
The situation where you can choose between one or the other without a huge change in complexity is not common.
I'd say that all of the examples in the article are bad. Rate limiting, circuit breaking and system configuration always happen in the infrastructure. If you do them in code, it will be redundant and not solve the entire problem (and will create many more). The opposite applies to call retries: they always happen in code, and if you do them in the infrastructure, they won't solve the entire problem (and will create many more).
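To illustrate the redundancy point with rate limiting: an in-code limiter like the token bucket below only protects a single process, so every extra instance silently raises the effective limit, which is exactly why it tends to end up in the infrastructure layer. A minimal sketch (the injectable clock is just there to make it testable):

```python
import time

class TokenBucket:
    """In-process rate limiter: refill `rate` tokens per second up to
    `capacity`; each allowed request spends one token. Note it only
    limits this one process, not the fleet."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill according to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```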
I disagree, for the simple reason that infrastructure is nearly impossible to test (particularly when testing in local development), and that infrastructure is inherently non-portable.
If you are doing it right, infra is code. The only distinction in terms of this article is that the "orchestration" is not innately tied to the design of an application: loosely coupled and composable software using (sometimes) standard interfaces.
> Pick the right tool for the job
This is such an abused phrase. How about pick the right methodology for the use case? Tooling quickly becomes a cargo cult, be it languages, frameworks, or "infra tools". The deficiencies in said "tools" can often force bad designs or patterns that people apologize for rather than fix. You don't have to pick one tool for the job if you can pick a method that allows an array of tools used in a standard way.
For example, S3 bucket changes. Everyone first thinks "Terraform". But there are certain S3 changes that are either difficult or impossible with Terraform. What if it's easier to allow a developer IAM role to make specific changes to specific objects and leverage said role outside Terraform? You can craft your Terraform to ignore those changes, and use any AWS SDK to make ad-hoc object changes. Maybe later you find that not using Terraform at all for that bucket, and instead just managing roles that can manipulate that bucket, works better for a larger array of use cases.
During my time as an SRE-SE I began to piece together that infrastructure tends to be born from small software patterns. Looking at messaging systems, key/value stores, and even computational orchestrators, much of their high-level mechanics comes from how you would produce such a solution in a monolithic repo. Obviously there's a lot of layered abstraction to make the solution generic enough, and there's code to make the system fault-tolerant inside a distributed system, which inevitably requires failure tolerance.
I don't think I agree with this article though. Infrastructure solutions are good if you require robustness or the same solution is needed by many components that can share the infrastructure.
Robustness in this case would be comparing a layer 7 load balancer to a thread that load-balances workers. You could seemingly start up many threads on the same machine, or even segregate them by virtual machine (for example, Tomcat does this), but if failure of a single host is a concern (or a catastrophic enough concern), then dedicating the load balancer to its own machine is desirable. My point being: if you step back and look at what you're actually trying to solve for, it will tell you where the infrastructure should go, and it has the added benefit of informing the rest of your surrounding ecosystem.
Conceptually, infrastructure is just "application" code that's grown into a highly cohesive, lowly coupled standalone application.
If you need a distributed hashmap or distributed lock, (or really any distributed dedicated purpose code), something premade is a good option, especially if that's not your company's specialty.
These articles are always a bit interesting in that they seem to ignore that the debate is really "3rd party/outsource" vs "insource". EC2, S3, etc. are all just internal "Amazon applications" to support horizontal scaling that Amazon realized there was a huge market for. Somewhere there is code for these that someone is writing, maintaining, and operating, just like any other "business app". It's especially apparent at large companies for shared services like email, authentication, storage, and compute.
To the end user, anything across the internet is "infrastructure" for whatever app they're using
I'm curious: I would expect that this approach would give rise to more coding errors (because the consequences of coding errors are smaller). Is that the case?
If you have a "cattle" setup, how often do your server instances die and get recreated? Is this something you measure and try to minimise? Do you work out why any particular instance died?
It seems like one of the easiest to explain to non-technical end users. "Yeah, you see, we've measured the amount of time it takes a piece of information to move over the longest wire. Then we've calculated the length of your wire and determined the difference. We added that to this piece of code written in (proceed to lose most non-technical users)..."
Every exchange that I know of that provides collocation has contracted standard length cable routes as part of the offering. So that’s not the gimmick.
The gimmick that IEX brought to the table is that they are delaying orders a fixed amount. They own the matching engine so can trivially add that fixed amount in code.
But that’s not nearly as fun to show off as a big box of spooled fiber.
It's at least a design pattern to consider. Main tradeoffs are cost, expressiveness, and fine-grained testing/control. But leveraging infra does provide me with much joy.
Wow. So much hate for the article in the comments.
Years ago, I would have said the same. But living in the AWS world for a while, I now see that infrastructure-as-code is so damn powerful.
The reason I posted this link was because I think I've come around and embraced this next evolution in the industry (though it's taken a while for me), and I thought others here would have agreed with me. I guess not!
I've had fun throwing this rule out. Specifically, spinning up and down resources in Fargate in response to events in my application has allowed me to build some architectures that really weren't possible 10 or even 5 years ago.
Kafka is not an enterprise service bus. It doesn't do transformations. It doesn't do other things such as throttling, replays, etc. either; that is all handled by the client.
Kafka would count as what Martin Fowler calls a "dumb pipe". That is to say, Kafka isn't aware of its clients; it just pipes them together.
note, the term "message bus" is similar, but not quite the same as "service bus"