As a data practitioner, I’ve realized there is one thing often overlooked by analysts – infrastructure. I can say that I have been fortunate (or blessed) to work with organizations that have managed infrastructure for me. On the other hand, I have also taken it upon myself to manage my own data for educational and formational purposes. As it pertains to my data, mine does not even touch what you would consider a gargantuan amount. It’s not even close to one million rows, which is pretty much the threshold for what’s regarded as a small amount. Last I checked, I had one table with just over 300,000 rows.
In the grand scheme of things, this volume of data isn’t hard to manage or process. However, when you have so many rows and so little time and resources, how do you manage it? In this post, I will share a little about some of the options and avenues that I’ve seen organizations take, what may be out there, and what some of the best approaches are (as well as the things to watch out for). So, let’s get started!
Before I begin with anything concrete, I must say that no single solution is right for everyone. In my experience performing analysis for organizations with millions of rows of data, I have had to understand the many unique nuances of processing large, diverse datasets. Working with databases and data systems that span from sales transactions to workforce and telephony, I’ve come across many different and intricate details that organizations need to consider as they map out their journey to the cloud. More often than not, what is be right for one organization may be entirely wrong for another. Be sure to consider your specific case and application before making a decision. For that reason, I will try to speak in broad terms so that, whether you and your organization are at the beginning of your data journey or you and your organization are nearer to the end, there will be helpful information for you.
Quick and simple
In the field of “cheap and dirty,” there are several options and considerations. One option is to use flat files – text databases that store data in plain text format – and combine them in whatever tool you’re going to use to process them, whether that’s Power Query, Power BI, Excel, Tableau, or any other. This works for ranges of data, but usually, “small” quantities of data work best here. A major pro of this approach is that it allows organizations to host data on inexpensive cloud storage and provide access to end-users of this data.
As I stated initially, I’ve worked with organizations that manage this for me, and I’ve used solutions like SharePoint, OneDrive, and the Microsoft 365 and Power platforms. The flat-file method works well for specific applications but is certainly not a one-size-fits-all solution. If the destination is a tool that doesn’t readily combine files (such as Excel, Power BI, or a custom-built tool/script), then flat-file storage won’t work. Also, if you have large amounts of data (more than 10GB or many millions of rows), this solution can quickly run out of steam and become problematic for your application. Remember, Excel files have a million-row limit if you’re using this file type over CSV or JSON, so that’s an important limitation to consider if you’re dealing with large amounts of data. If you’re just getting started and want to build out some options, this might be the way to go.
Heading to the cloud?
If you are thinking of heading to the cloud and you have a substantial amount of data to store, there are a few things to ponder before taking the plunge. A significant consideration is the cloud provider itself because that will dictate how much you’ll end up paying for storage. Rates and pricing structures vary widely between providers may very well be the biggest deciding factor. However, when you layer in a robust, well-defined data strategy, you can find that sweet spot that works perfectly for your needs. That sweet spot should be a balance between the cost of the provider and the cost of time to administer and maintain the data. The provider costs should also be compared with the potential resource costs of an alternative on-prem infrastructure along with the qualified labour needed to maintain this infrastructure. The provider cost should be balanced against user demand and scalability options, where many cloud providers offer solutions for short-term high-demand bursts and long-term scalability.
Future-proofing your cloud solution
Data growth and usage
It is important to understand the long- and short-term needs. Consider how much data you store today. How much incremental data do you store on a monthly, weekly, and annual basis? Additionally, think about how that data is going to grow over time. Are you storing a million rows of data that are growing 5% annually? These factors are essential because a storage solution that works today may not work tomorrow. Think about what the rest of the organization is doing now and what they anticipate doing in the future in terms of processing data. Suppose your organization consumes and analyzes data in Excel but wants to move to Power BI, consider solutions that work for both or are better suited for PBI. If your organization wants to move to Tableau or in-house solutions, get on the platforms that work best with these tools.
Infrastructure, maintenance, and usage costs
Figuring out your decision to move to the cloud involves more than just the budget. Be sure to consider all of the implications and weigh out ALL the costs:
- Capital investments in infrastructure, software licensing, upgrades
- Labour investments for patching, DB maintenance, import and export, and data integrity
- Usage costs include month-to-month storage fees, ingress and egress fees, and possibly row traversal, as some providers charge based on queries that traverse many rows
Consumer demand considerations
In any data storage exercise, there are essentially two usage considerations: application needs and user needs. Although one might think that they are the same, the reality says otherwise. If your data requirements revolve around an application (think of any e-commerce platform like Amazon that creates user accounts for individuals, stores personal information, and provides additional user-specific services), the data needs are very different from the end-user needs. A user might need access to many years of historical information, whereas an application may only need 90 days’ worth. The application might need precise, detailed information, and security might be a big concern. Users, on the other hand, can work with less granular information.
I would be remiss if I didn’t offer solutions. Here are some general guidelines that you can use to develop a healthy strategy to manage your data and potentially save on costs.
Keep your most-used data in the most accessible places.
Data can be stored in aggregated tables to reduce the volume and the amount of storage required. Recent data can be stored in the cloud, while archival data can reside in cheaper, local storage solutions.
Segment and partition. One organization I was working with (a national retailer) was exploring cloud data storage, and they encountered issues with “row traversal” (meaning, analyst queries were spanning millions of rows), which made queries expensive. One solution was to implement partitioning so that queries would be less expensive.
Leverage existing resources.
This same organization was rolling out Power BI throughout the organization. Since most users already had a Power BI Pro license, they found that they could leverage this to store a reasonably large and comprehensive data set, with all the appropriate security applied to it, in the Power BI Service (the cloud). Doing so allowed analysts to connect to this robust data set which was updated with “incremental refresh.” Since the organization already paid for Power BI, leveraging the existing expenditure helped to reduce cost and bring the right data to the right people at the right time.
Making the foray into cloud-based data storage can be a daunting task. If done right, this endeavor can be the catalyst that unlocks your organization’s full potential. Taking the time to analyze current data usage, needs, and trends can help make the right decision for everyone involved.
If you’re looking for more guidance on how you can manage your data in the cloud more effectively, reach out anytime. I’d be happy to discuss it with you.