How to Reduce Data Platform Spend in 2023
2023 has been a hard year to run a data platform. Across the board, the collective theme of the year has been cutting costs. A few years ago it felt like a lot of companies ran their data platform like the old saying: “throw enough mud at the wall, and some of it will stick.” Unfortunately, data costs can get out of hand quickly, and this year’s tumultuous market has made a lot of companies scrutinize their data spend and wonder if they can actually prove enough ROI to justify their data platform costs. Although data is essential to almost all modern businesses, a lot of companies overspend, especially when the tech sector is in a bull market and budgets are loose. That is not the case this year for most companies, so in this article I will cover five ways you can cut costs in your data platform:
1.) Clearly Define ‘Business Critical’ Stakeholder Needs — The first thing to do when you need to cut costs in a data platform is to meet with your stakeholders and clearly define their needs: explicitly what they need, why they need it, and the details around delivery. If the data platform team does not clearly understand what stakeholders need, it cannot deliver the data in a cost-effective manner, because it is making assumptions about the minimum requirements. And if the team does not understand why the data is needed, it cannot efficiently prioritize the requests that matter most to the overall objectives of the company. Data platform teams should work with their stakeholder groups to build stakeholder agreements around their data needs. Some things that I like to include in these agreements are:
- What data is needed?
- Who needs access to the data? Which people/systems?
- When do they need it? Refreshed at a specific time or event?
- Where does the data need to be stored? What format?
- Why are we doing this? Does it directly relate to the company objectives and priorities? Can we track the ROI?
- How will it be accomplished? Who will do the initial work? Who will support future maintenance and features?
- Information on data governance and quality standards: uptime, acceptable error rates, availability, deprecation timelines, etc. (one way to enforce a clause like this is sketched after the list)
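Once an agreement is written down, the measurable parts of it can be turned into automated checks. Here is a minimal sketch, assuming a dbt project running on Snowflake; the model name `orders`, the column `loaded_at`, and the 24-hour window are all illustrative. A delivery clause like “refreshed at least daily” becomes a singular dbt test that fails whenever the data goes stale:

```sql
-- tests/orders_refreshed_daily.sql
-- A singular dbt test backing the "when do they need it?" clause of a
-- stakeholder agreement. dbt treats any returned rows as a failure, so
-- this test fails whenever the newest load is more than 24 hours old.
select
    max(loaded_at) as last_load
from {{ ref('orders') }}
having max(loaded_at) < dateadd('hour', -24, current_timestamp)
```

dbt also ships built-in source freshness checks that cover the same ground; either way, encoding agreements as tests gives you hard evidence about which standards are actually being met when it is time to renegotiate them.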
Although this may seem tedious, going through this exercise with stakeholders on a regular basis will help the data platform team understand stakeholder usage and ultimately decide what counts as business-critical data. From there, it is the job of the data platform team to decide what data to expose to users and how to do so cost effectively.
Priorities among business users can change fast, and after going through this exercise companies sometimes find that they are overpaying for data services that make no real impact on the business. They may also find that some requests for new work do not truly align with the company objectives, and implementing them brings no ROI. After finishing this exercise, a data platform team should fully understand the needs of their stakeholders and how those needs relate to the objectives of the company. They should then be able to use that information to design and implement a system that meets the minimum stakeholder requirements while staying within budget. Without fully understanding what is needed and why, it’s impossible for the team to deliver the lowest-cost option.
2.) Prioritize Stakeholder Deliverables — If you give a mouse a cookie… they will ask for five new data sources added to their data model. Just kidding, but there is some truth there. It is important to work with your stakeholders and make sure everyone understands that the amount of work you can deliver is limited by the budget the company gives you.
Stakeholders usually want more data than they truly need to get their job done, while data platform teams usually want to deliver the minimum amount of data they can in order to keep headcount and operating costs down. That disconnect means someone has to decide what is truly business critical and what is just ‘nice to have’. This is the job of the data platform team; they must decipher what is actually needed and what is not. As crucial as it is to fulfill stakeholder requests, it is even more important to say ‘no’ to the requests that you cannot implement or believe are unnecessary. In most cases, if a data team delivered everything stakeholders ever asked for, it would need a huge team with an unlimited budget.
What is actually delivered to stakeholders should be determined by the needs of the company. All analysts and engineers can relate to this scenario: you receive a request, and after researching it further you learn that it is not related to the objectives or mission of the company; the stakeholder was just experimenting or exploring something vaguely related to their job. In situations like this, it is critical that data teams determine why they are completing a request and not just take the stakeholder’s word for it. Since the team is limited by the budget it is given, it must make sure it is prioritizing the work that is most important to the company.
3.) Track Expenses Like a Hawk — It is more important now than ever to watch your costs closely. Most companies are looking to cut costs this year while large cloud providers are simultaneously raising prices. In the past year we have seen big price hikes from managed data tool providers like Google and dbt Labs. With prices rising, it is crucial to have good cost management practices in place and to make sure you are not also absorbing unexpected or unnoticed cost anomalies.
One thing is for sure: the only way you will notice cost anomalies in your data platform is if you are keeping a close eye on it. When cost and usage spikes go unnoticed, the bill can be very expensive when it comes due. Unfortunately, they can also be hard to notice; data platforms are complex and usually do not ship with any cost or usage alerting out of the box. Even once costs are tracked, it can be hard to isolate the root cause of an anomaly and articulate it to the rest of the team. Luckily, there are tools for tracking and alerting on cost and usage spikes that can notify your team automatically. If you are using one of the large cloud providers, you can use built-in tools like Google Cloud’s Operations Suite, Amazon CloudWatch, or Microsoft Cost Management. If those do not fit your data platform infrastructure, you can also use third-party tools like CloudZero, Harness, or Datadog. Whichever tool you choose, make sure you are tracking cost and usage spikes and pushing alerts automatically to the team members who can fix them.
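To make that concrete, here is a minimal sketch of a spike-detection query, assuming a Snowflake deployment (other warehouses expose similar metering views). It flags any day in the last month where a warehouse burned more than twice its trailing average in credits; the 2x threshold is an arbitrary starting point that you should tune to your own usage patterns:

```sql
-- Flag days where a warehouse used more than 2x its trailing average credits.
-- Uses Snowflake's account usage metering view, which can lag a few hours.
with daily as (
    select
        warehouse_name,
        date_trunc('day', start_time) as usage_date,
        sum(credits_used) as credits
    from snowflake.account_usage.warehouse_metering_history
    where start_time >= dateadd('day', -31, current_date)
    group by 1, 2
)
select
    warehouse_name,
    usage_date,
    credits,
    avg(credits) over (
        partition by warehouse_name
        order by usage_date
        rows between 30 preceding and 1 preceding
    ) as trailing_avg_credits
from daily
qualify credits > 2 * trailing_avg_credits
order by usage_date desc;
```

Scheduling a query like this and pushing its results to a chat channel or pager gets you most of the way to automatic alerting; Snowflake’s built-in resource monitors can complement it by notifying you, or suspending a warehouse, when a credit quota is hit.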
4.) Audit Old Code and Clean Up Stakeholder Trash — Unless you work at a brand new startup with all new employees and greenfield work, there are probably some old data assets that can be cleaned up or deleted altogether. Most data platforms carry a good bit of bloat, and development usually outpaces deprecation by a large margin. After years of this, it can be hard to tell what is still needed and what is stale code that is wasting money simply because everyone is afraid to delete it. To put it bluntly, prior years’ trash is usually still running in most data platforms. Data platform teams must monitor usage and clean up this old trash, otherwise it will keep leaking money and adding latency.
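As a starting point for that audit, here is a minimal sketch of a staleness query, again assuming Snowflake; the ACCESS_HISTORY view it relies on requires Enterprise Edition or higher, and the 90-day window is arbitrary. It lists every table that has not been read in the last 90 days, which makes a good list of deprecation candidates to review with the owning teams before anything is dropped:

```sql
-- List tables with no reads in the last 90 days: deprecation candidates.
with last_read as (
    select
        f.value:"objectName"::string as full_table_name,
        max(ah.query_start_time)     as last_accessed
    from snowflake.account_usage.access_history ah,
         lateral flatten(input => ah.base_objects_accessed) f
    where f.value:"objectDomain"::string = 'Table'
    group by 1
)
select
    t.table_catalog || '.' || t.table_schema || '.' || t.table_name as full_table_name,
    lr.last_accessed
from snowflake.account_usage.tables t
left join last_read lr
    on lr.full_table_name = t.table_catalog || '.' || t.table_schema || '.' || t.table_name
where t.deleted is null
  and (lr.last_accessed is null
       or lr.last_accessed < dateadd('day', -90, current_timestamp));
```

The same idea extends to dashboards and pipelines: most BI and orchestration tools expose last-viewed or last-run metadata that can feed an equivalent report.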
Beyond that, it is also essential to periodically clean up after your stakeholders. Remember that stakeholders are not responsible for the health of the system, and they will rarely, if ever, clean out old assets on their own. They won’t delete old dashboards and schedules, ask for an old table to be dropped, or request that a data pipeline be decommissioned. These are cost-containment decisions that the data platform team needs to make after defining stakeholder needs and reviewing usage. Leaving these things running without periodic clean-ups is like throwing money out the window.
5.) Use Open Source and Inexpensive Infrastructure When Possible — While there is usually great benefit to using expensive, proprietary data tools, there is also a significant downside: the cost. These tools are amazing; they empower the masses to perform intense modern analytics at the snap of a finger with very little setup, but they come with a price tag that cannot be ignored. While most companies cannot get away from expensive tools completely, it is important to assess when you actually need them and when something more cost-effective will do.
Here is an example: let’s say I am working with a company that uses Snowflake as its main data warehouse. Snowflake is expensive, but it is necessary for the company’s production environment because of the use case; there are many users running lots of concurrent queries, all expecting high availability and low latency. A lesser data warehouse simply would not work for production, but that does not mean the entire data platform needs to be built on Snowflake. There are still opportunities to use a less expensive database like PostgreSQL for development. For example, I can cut costs by running all of my dbt tests and development work against this database, and since dbt handles the DDL, all of the modeling code can stay the same between systems. Development environments generally do not have the same query volume or availability requirements as production, so they do not need to be built on Snowflake. Using PostgreSQL is a big savings: you can run a managed instance for a low flat monthly cost instead of paying for usage-based compute, which can amount to savings of hundreds to thousands of dollars a month depending on your usage.
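To make the split concrete, here is a minimal sketch of what this looks like in a dbt project; the model and source names are hypothetical. The model contains nothing warehouse-specific, because dbt generates the DDL for whichever target it runs against; the PostgreSQL dev connection and the Snowflake prod connection live side by side in profiles.yml:

```sql
-- models/staging/stg_orders.sql
-- An ordinary dbt model: plain SQL plus Jinja, with no warehouse-specific DDL.
-- dbt wraps this query in the CREATE statement appropriate to the target, so
-- `dbt build --target dev` runs it on PostgreSQL and
-- `dbt build --target prod` runs the identical file on Snowflake.
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('shop', 'raw_orders') }}
where status != 'cancelled'
```

The one caveat is dialect drift: a model that leans on Snowflake-only functions will not compile on PostgreSQL, so keeping models ANSI-leaning, or reaching for dbt’s cross-database macros, is part of what makes the savings possible.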
The important piece here is not simply to use PostgreSQL instead of Snowflake because it is less expensive; it is to let the use case drive the solution you choose. Don’t spend money just to say you are following ‘best practices’ without considering alternatives. In this case, the production environment needs an expensive data warehouse because the use case demands it, but the development environment does not have the same requirements, so it can use a less expensive option. We chose each option based on what the use case called for, and let the needs of the company drive the decision.