I think you're 100% right that the tasks Airflow handles are currently being unbundled by tools in the modern data stack, but that doesn't erase the need for tools like Airflow. Sure, you can now write less code to load your data, transform it, and send it out to other tools. But as the unbundling occurs, the end result is more fragmentation and fragility in how teams manage their data.
Data teams I talk to can't turn to any single location to see every touchpoint their data goes through. They're relying on each tool's independent scheduling system and hoping that everything runs at the right time without errors. If something breaks, bad data gets deployed and it becomes a mad scramble to verify which tool caused the error and which reports/dashboards/ML models/etc. were impacted downstream.
While these unbundled tools can get you 90% of the way to your desired end goal, you'll inevitably face a situation where your use case or SaaS tool is unsupported. Every time I've faced this, the team ultimately ends up writing and managing their own custom scripts to fill the gap. Now you have your unbundled tool + your custom script. Why not just manage all of the tools and your scripts from a single source in the first place?
While unbundling is the reality, this new era of data technology will still need data orchestration tools that serve as a centralized view into your data workflows, whether that's Airflow or any of the new players in the space.
(Disclosure: I'm a co-founder of https://www.shipyardapp.com/, building better data orchestration for modern data teams)
No amount of tooling will make data transformation a painless process; all you end up doing is burying the business logic under so many layers of abstraction that it becomes impossible for anyone to understand.
Isn't the main selling point of Airflow the bundling in the first place? Why would you want many different specialized tools to manage scheduled tasks?
1) Specialized tools reduce the amount of engineering overhead. As a business, I primarily care about time to value. If I can use specialized SaaS to get my data centralized, clean, and synced across my tools in a week, why would I want to spend months building all of these processes from scratch?
Sure, I lose control, visibility, and more... but I was able to deliver value 3 months ahead of schedule.
2) Existing tools like Airflow are highly technical to get started with. You can't just focus on building out scripted solutions. You have to set up and manage the infrastructure. You have to sift through the tool's documentation to understand how to effectively build DAGs. You have to mix platform logic into your business logic to make sure your code will run on Airflow.
Because the demand for data professionals is high and the supply is low, the technology ends up trying to offset the need for those highly technical skills in your organization.
I get what you're saying, but trying to make sure your code will run on Airflow is the wrong way of thinking about it IMHO. You should be trying to get Airflow to make sure your code runs (could be in Airflow, could be anywhere else).
A lot of the stuff we do with Airflow is basically just sending commands and looking at the result (and handling any errors). This part is generic enough that you usually only need to implement it once for whatever platform your code is running on.
The tricky bit is when your DAG crosses platforms, but that's always a problem. If anything, it's easier to solve when the tool scheduling tasks isn't part of the platform (note, however, that Airflow is not a tool for solving dataflow, though some glue code in Python does often work wonders).
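The "send a command, check the result, handle errors" step described above can be sketched as a small wrapper you implement once per platform. This is a minimal illustration in plain Python, not Airflow's API; the helper name `run_and_check` is made up for the example:

```python
import subprocess

def run_and_check(cmd, timeout=3600):
    """Send a command to the target platform, wait for the result,
    and surface any error. Hypothetical helper; the idea is that
    this wrapper is written once and reused by every task."""
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout
    )
    if result.returncode != 0:
        # Fail loudly so the orchestrator marks the task as failed
        # and can retry or alert, instead of letting bad data flow on.
        raise RuntimeError(
            f"command {cmd!r} failed ({result.returncode}): "
            f"{result.stderr.strip()}"
        )
    return result.stdout

# Usage: each task's business logic stays a plain command;
# the generic platform logic lives in one place.
output = run_and_check(["echo", "load complete"])
```

The same shape works whether the command is a local script, a dbt run, or an API call to a SaaS tool; only the transport inside the wrapper changes.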
Exactly my thoughts as well.
I have one point where I can see if all the remote services that I am using are operating correctly. I don't need to connect to various other apps to figure this out.