Data tools like Tableau or Alteryx have made data analysis much more accessible to the average user than they ever have been. Non-technical users can now, with only a few hours of training, start building data pipelines, data sets and dashboards that have required complex coding and data skills in the past . It is still hugely beneficial to know basic development concepts and techniques to make sure whatever you built doesn’t fall over with the next data refresh. You can read part 1 and part 3 here.
Building one-off reports or dashboards is comparatively easy. You know exactly what data you are dealing with and if there are any unexpected surprises, you can fix them right there without a problem. This approach doesn’t translate into anything that needs to be refreshed on a regular basis.
You can make some assumptions but more likely than not, something will change in the future and you will need to go in and fix it. But there are ways to develop any data flow or dashboard to avoid many of the common problems and spare yourself and your users the effort in dealing with broken dashboards and unreliable data.
Hard or soft-coding?
For everything we develop, we have to make a few assumptions, otherwise we don’t even need to start. The problem is that you need to decide which things you can safely assume to be true and for which things you want to develop a fall-back, in case anything changes.
In nearly all situations the problem is some form of hard-coding of values, which breaks if those values change. In its most blatant form it looks something like this:
SELECT *
FROM table
WHERE productID != "1234"
But the reality is that all of us hard-code things all the time, what differs between “good” hard-coding and “bad” hard-coding is the way we deal with exceptions and how we justify setting something in stone for its lifetime!
Here are some things to look out for that broadly fall into this category and that I always look out for when developing something:
Filters
1. Use include/exclude filters consciously
Most tools provide the option for filters to be based on an “Include” or “Exclude logic”; in a static dashboard you’ll likely chose the option based on the number of fields you need to select: If you want to keep 27 of 30 values, you probably just exclude the three ones you don’t need. If you want to keep only 3, you rather include those three values explicitly. In a dynamic dashboard though there might be new values added over time. Do you want those values to show up or be filtered? If you specifically include those three values, anything new will be excluded, if you exclude everything else, all new values will be included.
2. Filter on the most appropriate level
There are often hierarchies in data, at which point you can technically chose at which level you want to filter. Let’s say you work for a European company that expands to Oceania, opening their first location in Australia. The regional director would like to see all her data for the region, so you filter the data by “Australia”. This will, without a doubt give you the correct data. A short while later, your company also opens a site in New Zealand and all of a sudden the dashboard is not accurate any longer because “New Zealand” is not “Australia”. You would have been better off filtering on the region “Oceania” than on just a single country name, assuming that new countries would be setup with the correct region. The same is valid for products and their categories, employees and their departments, etc.
3. Use dynamic dates
Just based on my own behaviour, I wouldn’t be surprised if this is the most common mistake. You draft a dashboard and try a few things and because you will need to test, you set a date filter to 2024; this makes it easy to switch it to 2023 when trying to check something and back to the current year. I do it all the time. Before you know it you build something everybody is happy with and it’s promoted to production. Everything works fine until January 2025 when the dashboard is empty and your leadership asks you about why the dashboard is broken. Make sure that you dynamically set your dates (typically based on todays date), ie. show everything from the 365 days before TODAY(), show everything with the same year as the year of TODAY(), etc.
Calculations
4. Use ELSE statements
Depending on your tool or programming language of choice, conditional statements (IF and CASE) might or might not require an ELSE clause. Even if it is not required, you may be better off adding one anyway. If you leave it out, you might at some point get values that you didn’t anticipate, in the best case there will be an ugly NULL() on your dashboard, in the worst case they are just ignored until they accumulate to an amount that is big enough that it can’t be ignored any more; at that point something might have already gotten wrong. Just put something simple like “ELSE ‘undefined!'” or similar in there. This will guarantee that the values show up and hopefully raises questions early why there is something in the data that is “undefined”. Depending on the particular situation, there could be an exception to the rule, when it is actually beneficial to have NULL as a result (ie. you definitely don’t want values to appear). Like all tips in here, if that’s done as a conscious choice it’s a perfectly valid approach.
5. Use “Other” in groups
Vary related to the point before, Tableau offers the concept of groups, which really is just a way to define CASE statements through the UI. It has a checkbox for “Include Other” which will group all values that haven’t been explicitly assigned to a certain group to the “Other” group. This option is there for a reason and like the ELSE statement will make sure that nothing falls through the cracks until it’s too late and that those 200 new products are not automatically added to your neatly designed 3-bar bar-chart.
6. Select columns dynamically
If you work with a workflow tool like Alteryx to move data from A to B and transform it, are you able to select certain columns dynamically? ie. All numerical columns or all text columns or all columns with “ABC” in the column header. It might not be a super common use case but for example when parsing excel sheets with a column for each new month this makes it easy to make sure all needed columns are always included in the dataset. This ensures that if certain things change, you don’t need to go in and adjust the workflow every single time.
7. Union tables dynamically
Tableau allows you to union different files within the tool. That’s great but often when you have to union X number of files, it is likely that there will be X+1 in the near future. Make sure you don’t only union the files you have already, instead use the dynamic union feature to union all files in a folder or all files with a certain prefix, to be sure you don’t leave out data in the future.
8. Consider automated testing and error handling
If you worked with data pipelines before, you likely saw a situation where the data was loaded successfully, just for the end user to complain that a dashboard says you shipped millions of tons of product when it should be thousands of tons. The data itself was correct but the source suddenly changed their reporting from tons to kilograms, which means all your dashboards will be out by a few orders of magnitudes. Because the data was correct, no red flags were raised. Some tools might have smarts integrated already to pick up these differences, if they don’t, can you build in tests that make sure the pipeline fails with an error when certain columns are missing, you have values that are outside of your usual order of magnitude, etc. You could also implement something like this in Tableau, either as a warning on a workbook, by creating a calculation that displays some warning text when values are not within a range or with a separate workbook that monitors different data sources for outliers. In both cases, doing it in Tableau however only highlights a problem, it doesn’t prevent it.
Visuals
9. Make sure your legends adapt
Colours can make and break dashboards! You probably were in situations where your colour legend was skewed by an outlier which meant most of the values where the same shade and only the outlier had a different shade. Though technically correct, this is probably not what you had in mind when creating the chart. This can happen easily and if you don’t account for this, next time somebody places an exceptionally large order, your business users will just see that one order and have no idea what is going on with all others. Maybe you can colour by percentiles rather than absolute values. Or you hardcode and upper limit (suddenly hardcoding can be ok!). There is no perfect solution but at least think about it to see if there is a way to address this.
10. Make sure your axis’ adapt
The same is true for chart axis. Tableau recently released a feature that allows for dynamic axis ranges. So if you do need defined axis ranges rather than the automatic sizing, consider this feature to make sure all data is displayed and you stay within your defined requirements.
Team changes
11. The big red bus
Imagine you (or somebody in your team) was hit by a big red bus and won’t show up at work tomorrow. Would everybody else in the team be comfortable to take over their work without a significant downtime? Are you sure you’d understand that complex logic you implemented yesterday when it needs to be troubleshooted in 6 months? It’s probably the most boring part but documentation IS useful. There are different ways of approaching it and you decide what works best, whether to use a word document on sharepoint, a dedicated tool or relying on in-tool features like comments and descriptions. The important part is that it is kept up-to-date and everybody knows where to find it.
Summary
You can never be sure that you capture all cases, even if you applied everything listed in this article, there are still more, unexpected ways how things can break. I am also not advocating to always do all of these things. Sometimes it is just not worth it to spend the additional time because it’s a temporary solution (famous last words!) or you can be completely certain that no changes will happen. Or, frankly, it doesn’t matter if something breaks because it’s a nice-to-have thing that isn’t crucial.
What I do advocate for though is to be aware of these things and – either consciously or unconsciously – consider them in your development process.
If you keep them in mind and consciously decide against implementing them, that’s fine, but at least you considered the pro’s and con’s and took an informed decision.
If you have any questions, something is unclear or you disagree with something I said here, get in touch and let me know! Special thanks to Steve Wood and Heather Cox who provided feedback and pointed out some errors and omissions!
Pingback: Think like a hacker pt. 3 - Break things! - Curvediscussion
Pingback: Think like a hacker pt. 1 - Never do anything more than twice! - Curvediscussion