We picked up this seemingly obvious tip from the software development world, but it applies to data analysis quite nicely.
In software development, there is a precise set of rules that programmers follow to make sure that any changes they make while they build are easily trackable and reversible.
In web development in particular, careful thought is put into organizing the data that underpins the various pages on a site (e.g. if you wanted to show a table on a webpage with earnings results over the past 10 quarters, that data has to come from somewhere).
Developers make sure to consolidate all the data they need in one place BEFORE delivering it to your screen, rather than pulling it from multiple sources and stitching it together as it's being fed to the website.
How does this relate to data analysis, you ask?
Most of the time when you're asked to analyze a business question, you'll only need 1-3 data sources (tables) to do so. However, as you become a more experienced analyst, some of your analyses will require multiple data sources, and that's when it gets hairy.
At first, it seems easier to just put tables together as your analysis calls for it. This is a more organic approach to problem solving, where you end up consolidating and formatting the data WHILE you analyze it.
The problem with this method is that it only works for a small number of tables, because 1-3 tables is easy to keep track of in your head. If there's duplication in the datasets, or any other problem, it's usually pretty easy to locate the culprit when you're choosing between 2 or 3 possible suspects.
This type of troubleshooting becomes significantly more difficult when you start using 4 or more tables in your analyses. Now if you make a technical mistake, you may have a harder time finding it again to correct it. Worse still, you may not catch the error at all.
The solution, then, is to do what programmers do with data: consolidate ALL the data you think you might need into one table BEFORE you conduct your analysis.
Here’s a contrived example: Let’s say I need to measure a client’s revenue as a function of their purchase history, age, and overall estimated net worth. Now let’s say each of those pieces of information is in a different table.
I start by joining the 4 base tables into one new table with 5 columns: client_id, purchase_history, client_age, net_worth, and revenue for all clients.
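In pandas terms, that consolidation step is just a chain of joins. Here's a minimal sketch; the table contents and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical base tables -- the values here are invented for the example.
purchases = pd.DataFrame({"client_id": [1, 2, 3], "purchase_history": [12, 4, 9]})
ages      = pd.DataFrame({"client_id": [1, 2, 3], "client_age": [45, 33, 52]})
net_worth = pd.DataFrame({"client_id": [1, 2, 3], "net_worth": [250_000, 90_000, 410_000]})
revenue   = pd.DataFrame({"client_id": [1, 2, 3], "revenue": [1_200, 300, 2_100]})

# Consolidate everything into ONE base table before any analysis happens.
base = (
    purchases
    .merge(ages, on="client_id")
    .merge(net_worth, on="client_id")
    .merge(revenue, on="client_id")
)
# base now holds the 5 columns: client_id, purchase_history,
# client_age, net_worth, revenue
```

From here on, every piece of analysis reads from `base`; the four source tables never need to be touched again.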
Now let's add some complexity: our boss comes back and says, "We only want to look at women over 40 for this analysis." Now I have a choice. The first option is to create an entirely new dataset from the same 4 base tables, but one that now only includes women over 40. This is the wrong thing to do.
The second (and right) option is to place the "women over 40" filter on the existing consolidated dataset, rather than on the base table that has client_age in it.
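Concretely, option 2 looks like this. The sketch assumes the consolidated table also carries a client_gender column (not one of the original five; added here purely so the filter has something to work with):

```python
import pandas as pd

# Hypothetical consolidated base table -- built once, never modified.
base = pd.DataFrame({
    "client_id": [1, 2, 3],
    "client_age": [45, 33, 52],
    "client_gender": ["F", "F", "M"],  # assumed column, for illustration only
    "revenue": [1_200, 300, 2_100],
})

# Option 2: filter the consolidated table; the base data stays untouched.
women_over_40 = base[(base["client_gender"] == "F") & (base["client_age"] > 40)]
```

The filter lives in the analysis layer, so changing it later costs one line, not a rebuild of the whole dataset.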
But AH, these two things sound the same. How are these two options different?
I’ll tell you, dear reader:
By choosing option 2, you effectively separate the data collection process (consolidating all the information you need into one table) from the analysis (filtering based on parameter values, drawing conclusions, etc.)
Why separate them? Because they’re different:
The data collection should be done first, and once it's done it should never change; that's how you show that the data behind your analysis is reliable.
The analysis, on the other hand, should be very flexible, because different people are inevitably going to want to see the data sliced different ways.
Structuring your projects this way means that when your boss inevitably changes their mind and decides they want "women over 50" instead, you don't have to go all the way back to the base tables to change the filtering parameters.
Better yet, what if your boss, after seeing both datasets, wants to compare women aged 40-49 and 50-59? Now you have to consolidate your "women over 40" table and your "women over 50" table into one big table, which is far easier if you already have the one big table you built at the beginning.
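With one base table, that comparison is just two more filters. A sketch, again with invented data and an assumed client_gender column:

```python
import pandas as pd

# Hypothetical base table built once at the start of the project.
base = pd.DataFrame({
    "client_id": [1, 2, 3, 4],
    "client_age": [45, 52, 58, 33],
    "client_gender": ["F", "F", "F", "F"],  # assumed column, for illustration
    "revenue": [1_200, 2_100, 900, 400],
})

women = base[base["client_gender"] == "F"]

# Both requested slices come straight from the same base table --
# no re-joining, no gluing two filtered datasets back together.
band_40s = women[women["client_age"].between(40, 49)]
band_50s = women[women["client_age"].between(50, 59)]
```

Each new slice the boss asks for is one line against `base`, instead of a fresh trip back to the source tables.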
Fast forward: you just presented your findings, and now your director is asking you to add millennials to the mix, and you gotta go alllll the way back to the f*cking beginning to generate a new dataset and stick it together with your two existing ones, which are already stuck together (poorly), and you're gonna mess it up if you're not careful and AGHHHHH!!!
All that wasted time and effort when you could have just built a solid base table in the beginning.
The moral of this article is: consolidate all of the data you need in one place before starting your analysis.
Business rules and needs change all the time, and the best analysts are the ones flexible enough to change with them.
If you’ve ever seen an analyst do work so quickly that it seems impossible, I can guarantee you that this is how they did it.