Someone I know was moaning recently about a lot of tedious electronic form filling they had to do for work. It was something that happened once a year, but it was much more lengthy and tedious this year than before. It struck me that this was a sharply focused example of when user experience (UX) and data quality collide.
I realised that I haven’t talked about these two together so far, so this article is a fairly random collection of thoughts about the relationship between UX and data quality.
I’ve already gone into the meaning of UX a few times, so I won’t repeat that here. Data quality could (like UX) have lots of definitions, but I’ll go with one inspired by the RST definition of a bug: poor data quality means that you have data that’s less valuable to you than it might otherwise be.
This covers things like:
- You have less data than you’d like (in terms of rows, filled-in i.e. non-null columns, or both);
- Some or all of it is unusable;
- If you process the data then the output will be less valuable than you’d like, or it will take a lot of effort to get the data into a state where processing it will produce valuable output.
Drilling down one more level:
- Data might be unusable because e.g. a field that should hold the name of a day of the week instead contains the word hippopotamus;
- Data might produce bad outputs if e.g. it contains duplicate rows, or a field that should hold the name of a day of the week contains “ Monday ” (i.e. there’s space around the value you expect), which might prevent a simple comparison from working.
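To make those two examples concrete, here is a minimal sketch in Python of the kind of clean-up they imply. The names (`VALID_DAYS`, `clean_day`, `clean_rows`) and the sample rows are made up for illustration, and it assumes that stripping whitespace and ignoring case is acceptable for your data.

```python
# Hypothetical set of acceptable values for the day-of-week field.
VALID_DAYS = {"monday", "tuesday", "wednesday", "thursday",
              "friday", "saturday", "sunday"}

def clean_day(raw):
    """Return a normalised day name, or None if the value is unusable."""
    candidate = raw.strip().lower()  # " Monday " becomes "monday"
    return candidate.capitalize() if candidate in VALID_DAYS else None

def clean_rows(rows):
    """Normalise (person, day) rows, dropping unusable rows and duplicates."""
    seen = set()
    cleaned = []
    for person, raw_day in rows:
        day = clean_day(raw_day)
        if day is None:
            continue  # unusable value, e.g. "hippopotamus"
        key = (person, day)
        if key in seen:
            continue  # duplicate once the whitespace has gone
        seen.add(key)
        cleaned.append(key)
    return cleaned

# Made-up sample data.
rows = [("ann", " Monday "), ("ann", "Monday"),
        ("bob", "hippopotamus"), ("cat", "friday")]
print(clean_rows(rows))
```

Note that the two rows for "ann" only show up as duplicates after the whitespace is stripped, which is why the order of the clean-up steps matters.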
An odd couple
UX and data quality aren’t often brought together, and they’re quite different kinds of thing. Exaggerating the differences slightly (but not so much as to lose the point):
- UX is holistic, qualitative, people-centred, and subjective.
- Data quality is focussed, quantitative, data-centred, and objective.
I nonetheless think there’s interesting stuff to think about when you consider them together.
The example I gave above highlights a common difference in motivation about collecting and processing data:
- The person supplying the data is at best neutral about doing so. It’s likely that they’re actively annoyed or frustrated by the process. They have no interest in the quantity or quality of the data, and just want this data thing to go away.
- The person collecting and processing the data needs it for some reason, e.g. their job. Their life will be easier or otherwise better if there’s more data and it’s of higher quality.
The data you’re collecting might be the centre of your world, and so you want as much of it as possible, and you want it to be as neat and tidy and shiny as possible, but it’s probably not the centre of anyone else’s world. Or, to put it another way, when was the last time you jumped out of bed enthusiastic that you’ll be filling in someone else’s form later today? Particularly a form that was sprung on you at an inconvenient time, such as when you’re in the middle of doing something?
This mismatch of motivation might be helped (in favour of the analyst) by their being able to wield a big stick. By that I mean that it might be a legal requirement that the source of the data hands it over. But it might be that there’s no requirement on the quality or quantity, so the bare minimum is all that you get. I’ll now discuss alternative approaches that might help with this mismatch in motivation.
Applying UX to improve data quality – no interface
In UX, often the best user interface is no user interface. If a user isn’t bothered by you to make a decision or perform an action, they will probably be happier. (Assuming that they still get the outcome they want.)
The best way to gather data can be to not make your users do any more work. Is there data already available? Is it available but dropped before it can be stored? If so, consider changing code such that it stores the extra data rather than asking the user for it. Note that you need to worry about consent – data acquired from the user for purpose A can’t always be used for different purpose B. Have you registered with your local data protection authority that you’re holding and processing data for purpose B? Have you told the user that when they give you their data you’ll be using it for both purposes?
Applying UX to improve data quality – better interfaces
I don’t think I’m alone in my reaction to websites that have a cookie opt-out screen that lists what seems like 200 separate vendors, and there’s a toggle for each one. To me this screams of passive aggression. The website creator wants you to not opt out, but they’re forced to give you the option, so while technically still possible it’s so hard that in practice people won’t.
I have two reactions to this kind of opt-out. If I’m feeling bothered then I make a point of spending the two or so minutes (that seems like a lifetime) to track down each toggle and opt out. Take that, passive aggressive site! If I’m not bothered then I simply go elsewhere.
This passive aggressive opt out is, to my mind, a UX dark pattern. Like many UX dark patterns, while superficially attractive, it can backfire on you. In my case either you lose my traffic and / or custom completely, or you lose the ability to record information about me. In trying to record more information, you end up recording less. A sniff test I use as to whether something’s a dark pattern or not is: would I want this done to me or someone I care about? Or, putting it another way, the Golden Rule should be part of every UI design brief.
Less extreme interfaces than this can still have data quality problems. Free text fields are pretty much the enemy of data quality. The amount of variation they allow makes the data too hard to work with reliably. It’s much better if you can make it easy for the user to give you data that you can process easily.
Often this means picking from a list. However, sometimes there are problems with lists. Either the list gets too long, so users give up or don’t find what they want, or it’s impossible to come up with every possible option, so a user might not be able to find what they want in the list.
If the list gets too long, you could have a text box that has auto-complete (like a Google search box). This lets you create a short list on demand, and then the user can pick from a list again. If it’s impossible to come up with every option, then a text box can be given as a last resort after a list of the most common options. The ideal length of this list is something that depends on circumstance – there’s a trade-off between too short (the text box is used a lot) and too long (the users struggle to find what they want).
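As a rough sketch of the auto-complete idea, the function below turns a long list into a short one based on what the user has typed so far. The function name `suggest`, the `limit` of 5 and the list of countries are all assumptions for illustration; a real search box would probably do fuzzier matching than a simple prefix test.

```python
def suggest(options, typed, limit=5):
    """Return up to `limit` options that start with the typed text, ignoring case."""
    prefix = typed.strip().lower()
    if not prefix:
        return []  # nothing typed yet, so no short list to show
    matches = [option for option in options if option.lower().startswith(prefix)]
    return sorted(matches)[:limit]

# Made-up list of options.
COUNTRIES = ["France", "Finland", "Fiji", "Germany", "Ghana"]
print(suggest(COUNTRIES, "fi"))
```

If `suggest` returns an empty list, that is the point at which a free text box as a last resort earns its keep.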
On the subject of lists, you might have a form that collects personal information, which includes a list to record the person’s gender. How many options are on this list? You might feel comfortable describing yourself as female or male, but will all your users? This is an extreme, but nonetheless valid, example. Will your UI make your users frustrated, sad or angry, so that they, for instance, skip a question in your form altogether? Note that this isn’t simply a UI issue – having more than two genders, for instance, has implications for how you store and process the data too.
Using UX when working with data
I realise that I have gone on about UX and data quality on this blog before, but not calling it by those terms. Life is messy, which often means that data is messy. This will set upper limits on what analysis you can base on that messy data. You need to be careful not to set (or allow) the expectations of the users of some analysis too high. You also need to think about what you can offer the user despite the data being poor quality.
If you have data about people, for some people you might have age, gender and salary, while for others you might have just age and salary. Imagine you want to do two analyses on this data:
- Average salary by age and gender
- Average salary by age
You now have at least two options. The first is that you reject the people who lack gender information, so that you can use the remaining data for all analyses. The second is that you use the people with age and gender information for the first analysis, and everyone for the second.
There are a couple of advantages of the first approach. The first is that you don’t need to worry about which analyses use which data – you’ve made sure that all (remaining) data can go everywhere. The second is that because the same data is used for every analysis, you can compare and otherwise cross-reference different analyses. For instance, you can sum the number of people in the (57, female), (57, male) etc. buckets from the first analysis, and be confident that this will match the number of people in the (57) bucket of the second. This means you can combine the two analyses to do percentages and other similar calculations.
The advantage of the second approach is that you can make each analysis based on as good data as possible. If all the 17 year olds in your data have no gender information, the first approach means there will be no 17 year olds in either analysis, even though they could be in the second analysis.
I’m not saying one is better than the other – you need to understand your users to make that decision.
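The two approaches can be sketched in a few lines of Python. The people, salaries and the helper `average_salary` are invented for illustration, and missing gender is assumed to be stored as `None`.

```python
from collections import defaultdict

def average_salary(people, keys):
    """Group people by the given fields; return {group: (average salary, head count)}."""
    totals = defaultdict(lambda: [0, 0])  # group -> [salary sum, head count]
    for person in people:
        group = tuple(person[key] for key in keys)
        totals[group][0] += person["salary"]
        totals[group][1] += 1
    return {group: (total / count, count) for group, (total, count) in totals.items()}

# Made-up data: the 17 year old has no gender information.
people = [
    {"age": 57, "gender": "female", "salary": 40000},
    {"age": 57, "gender": "male", "salary": 30000},
    {"age": 17, "gender": None, "salary": 10000},
]
complete = [p for p in people if p["gender"] is not None]

# First approach: the same filtered data feeds both analyses.
by_age_gender = average_salary(complete, ["age", "gender"])
by_age_strict = average_salary(complete, ["age"])

# Second approach: the age-only analysis uses everyone.
by_age_all = average_salary(people, ["age"])

print(by_age_strict)  # no 17 year olds here
print(by_age_all)     # the 17 year old is included here
```

With the first approach the head counts in `by_age_gender` add up to the head counts in `by_age_strict`. With the second approach they might not add up to those in `by_age_all`, which is the price of including more people.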
There’s a related point to do with slicing and dicing the data too far. Imagine that you have nice clean data and you start off giving e.g. average salary of a group of people. You then let the user drill down a little to get more detail. So you display the average salary of the same people, but grouped by age bracket (e.g. 30-40 years old etc.). You have more detail – more numbers – but based on the same set of data. So each number is based on less data.
If you continue this process, e.g. giving the average salary by age bracket, and within those categories by where they live, and within those by gender etc., some of the numbers might be based on very little data. Eventually you’re likely to hit a limit – either the numbers won’t be reliable or useful because the effect of individual outliers is too great, or you hit a limit due to privacy and so can’t display the number at all. (In a previous job we could display analysis on groups of 8 or more or 22 or more people – below those limits that part of the analysis was blank.)
This problem is compounded by the fact that the data is unlikely to fit evenly into categories. By that I mean you might have e.g. lots of 30-40 year olds and very few 70-80 year olds. This means some parts of the analysis will hit the too-small limit before others. What do you do? Is it better to adopt a strong-as-the-weakest-link approach, and stop the user drilling down when any of the numbers will be missing or wrong? Or should you let the user drill down wherever it’s safe to do so? If you go with the second approach, how do you avoid the user having a nasty surprise when they look for data and it’s missing (even though they looked for data in a similar place elsewhere, e.g. in a different age bracket, and found it)? There’s no simple right or wrong answer, as it depends on your circumstances.
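Here is a small sketch of the two policies, assuming a minimum group size of 8 (one of the limits mentioned above, used here purely as an example). The function names and the numbers in `by_age_bracket` are invented.

```python
MIN_GROUP_SIZE = 8  # assumed threshold, for illustration only

def suppress_small(groups, minimum=MIN_GROUP_SIZE):
    """Drill down wherever it is safe: blank out (None) any group that is too small."""
    return {name: (average if count >= minimum else None)
            for name, (average, count) in groups.items()}

def can_drill_down(groups, minimum=MIN_GROUP_SIZE):
    """Weakest-link policy: allow the drill-down only if every group is big enough."""
    return all(count >= minimum for _, count in groups.values())

# Made-up figures: {age bracket: (average salary, head count)}.
by_age_bracket = {"30-40": (41000.0, 120), "70-80": (28000.0, 3)}

print(suppress_small(by_age_bracket))
print(can_drill_down(by_age_bracket))
```

Returning `None` for a suppressed group, rather than leaving the group out, at least gives the interface something to explain to the user, which might soften the nasty surprise.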
UX and data quality can help each other. As I’ve described above, UX can improve data quality. Data quality can improve UX too – for instance, when you are able to link together all the bits of information I have already given you, you can offer me a better experience. This could be things like not asking me for information I’ve already given you, or not sending me offers on bacon when I have already told you I’m vegan.