For programmers like me, it can be a bit of a wrench when you get more into data work, particularly data engineering and data science. You’re used to data being around (in the background) and so think everything will be OK. This wasn’t the case for me, so here are some mental models (glorified metaphors) that I found helpful in growing my brain to fit the new requirements. None of them is perfect – I would welcome better ones in the comments – but I hope that they are useful nonetheless.
Data work as arable farming
As a programmer, I got used to seeing code as the valuable thing. Data was there in a support role, but worrying about code was what I was mostly paid to do. Shifting to data work such as data engineering forced me to adjust these priorities. Code is still important, but now it’s the thing in the support role.
It’s like an arable farmer producing crops. They have all kinds of tools (ploughs, combine harvesters, irrigation systems etc.) but they are in the service of the crops. They won’t be relieved to have a new shiny tractor as much as they would be to have a good harvest that fetches a good price. Just as farmers use tools to care for and tend the crops, I was using tools to care for and tend the data.
Similarly, timescales are often different. A farmer worries about the land, about its fertility, drainage, pH etc., and how these are nurtured over time. Trees planted this year and then tended until they reach maturity might be an effective windbreak only in several years’ time. This long view of time might extend to passing on a farm from one generation to the next.
Likewise, data can hang around for a long time – years in some cases (which can be a good and/or bad thing, depending on the circumstances). This is certainly longer than the run time of much code. This, coupled with how much more unwieldy data can be than code, can prompt different questions, and different answers to the same questions.
For instance, ‘throwing it away and starting again’ is possible with code and data, but often at different costs. An Azure Function that effectively does a cold boot every time (if you ask it to) means you’re less likely to hit problems caused by memory leaks silting things up over time. Redeploying code or rolling it back to a previous version can be painful, but recreating a database from a series of backups possibly in different places and formats can be a different world of pain.
Part of this comparison of data and arable farming is thinking about where the valuable stuff lives. You might immediately think ‘crops = field’ and similarly ‘data = reasonably normalised tables in a relational database’. These are both good choices for many circumstances.
But what if you’re growing rice or cranberries and want the fields to be flooded for some or all of the time? Or you’re growing mushrooms that need a stable environment much more than they need lots of light? Then you’d grow your crops in paddy fields or cranberry bogs, or indoors in mushroom sheds. Similarly, depending on your requirements you might want to store your data in a document database in a series of big lumps of JSON, or a set of interlocked fact and dimension tables in a data warehouse and so on.
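As a toy illustration of those two storage shapes – using a hypothetical order record and made-up field names, with SQLite standing in for a data warehouse – the same data could live as one big lump of JSON or as normalised fact and dimension tables:

```python
import json
import sqlite3

# One "big lump of JSON", as you might store it in a document database
order_doc = {
    "order_id": 1,
    "customer": {"id": 42, "name": "Ada"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}
doc = json.dumps(order_doc)

# The same data normalised into fact and dimension tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_order_item (order_id INTEGER, customer_id INTEGER,
                                  sku TEXT, qty INTEGER);
""")
conn.execute("INSERT INTO dim_customer VALUES (42, 'Ada')")
conn.executemany(
    "INSERT INTO fact_order_item VALUES (?, ?, ?, ?)",
    [(1, 42, "A1", 2), (1, 42, "B7", 1)],
)

# Both shapes can answer the same question; which is better
# depends on your workload, just as the right field depends on the crop
total_qty = conn.execute(
    "SELECT SUM(qty) FROM fact_order_item WHERE order_id = 1"
).fetchone()[0]  # -> 3
```

Neither shape is the “right” one: the document keeps everything for one order together, while the tables make it cheap to aggregate across many orders.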
Data engineering vs. data science
I’ve already written about mental models just for data engineering, so I suggest you go to that article for more details. This section will be about the comparison of data engineering and data science.
There is a common theme that I hope will become obvious, but please don’t think that I think that one is better, more important or harder than the other. It really depends on the circumstances as to which tool or tools you need. Also, as I say at the end, the line between them is blurry if you look carefully enough.
As with most of what I write about, I don’t know much about hairdressing other than as a customer, so take all this with a pinch of salt. It strikes me that a hairdresser’s artistry and creativity is expressed more through cutting the hair than through washing or brushing it. That is, the sink, the brush and the comb could be seen as less important than the scissors and clippers.
However, cutting hair often goes wrong or is harder work than it needs to be if it’s not free from existing styling products and tangles. Similarly, a brush or comb (plus possibly clips) can break the cutting problem down into more manageable chunks. The bulk of the hair can be moved away by the brush or comb, so that the scissors can concentrate on only one small part. As you may have guessed, the analogy I’m drawing is that data engineering is the washing and brushing, and data science is the cutting.
Bulldozer vs. F1 racing car
A Formula 1 racing car is an amazing piece of engineering, created and maintained by very skilled people. It pushes the limits of speed, acceleration, grip, braking and many other things. However, it assumes that it will run on relatively smooth and flat tarmac, rather than a muddy meadow or a field strewn with boulders.
A bulldozer is part of the equipment that could turn a muddy meadow or boulder-strewn field into a race track. It can’t go nearly as fast as an F1 car, and in many other ways it falls short, but it is rugged. It can move things other than the driver, and it can tackle uneven and sloping ground. Again, I assume you can see the analogy: the F1 car is data science and the bulldozer is data engineering.
One reaction to the comparison, and to these analogies, is to say that data engineering goes where the mess is. It gets rid of the mess so that data science can do its thing.
I think that this has some truth, but might be a bit misleading. I think it’s more accurate, if more awkward, to say that data engineering tackles accidental or incidental mess (or complexity), and data science tackles essential mess. (Here I’m using the terms that Fred Brooks used about software engineering. Essential means to do with the end user problem being solved, and accidental or incidental means to do with how you solve the problem.)
I’ll take recognition of text from an image as an example. The essential mess is to do with the variation of where the text is in the image (its absolute position and/or its position relative to other bits of text), how zoomed in or out it is, how rotated it is, the typeface or style of handwriting used and so on. These are all hard, and the kind of thing you might tackle using data science.
There is another set of mess that can also be hard. It doesn’t have to exist as part of a text recognition problem, but its existence makes the problem even harder. This is stuff like the image arriving in a variety of formats such as PDF, PNG and JPG, and whether or not the numbers used to represent the image are scaled to, say, the range 0-1, so that each pixel starts off appearing equally important and the model can learn their relative importance over time. This is all data engineering.
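That scaling step can be sketched in a few lines – a minimal example using NumPy and a tiny made-up 8-bit greyscale image, rather than a real decoded PDF, PNG or JPG:

```python
import numpy as np

# A fake 8-bit greyscale image (values 0-255); real inputs might arrive
# as PNG, JPG or rendered PDF pages, decoded into arrays like this one
image = np.array([[0, 128],
                  [64, 255]], dtype=np.uint8)

# Scale to the range 0-1 so every pixel starts off on an equal footing,
# leaving the model to learn which parts of the image actually matter
scaled = image.astype(np.float32) / 255.0
```

None of this touches the essential problem of recognising text; it just clears the incidental mess out of the way, which is exactly the data engineering role described above.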
I hope that this has given you a little insight into the differences and similarities I feel there are between the code-centric view of many programmers and the data-centric view of people like data engineers, and between data engineering and data science.
In both of these comparisons, I don’t think there is better or worse, just better or worse for a particular situation. Knowing what a tool is good and poor at can help you pick the right tool for a job.