This situation has come up an awful lot during my career, and it’s something that was never really talked about much in college or university. Instructors drilled things like third normal form into us – normalizing the database to avoid duplicate data (among other reasons). One thing they never prepared us for was stuffing key/value data into a single database column. One of the easiest ways to do this is to convert the data to JSON, then store it as a JSON string inside a text column in the database.
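As a minimal sketch of what I mean – the table and column names here are invented purely for illustration, and I’m using SQLite only because it’s handy:

```python
import json
import sqlite3

# A throwaway in-memory database; "vehicles" and "attributes" are made-up names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicles (id INTEGER PRIMARY KEY, attributes TEXT)")

# Arbitrary key/value data with no fixed schema.
attrs = {"tires": 4, "engine_displacement_cc": 1998, "seating_capacity": 5}

# Serialize to a JSON string and stuff it into the plain text column.
conn.execute("INSERT INTO vehicles (attributes) VALUES (?)", (json.dumps(attrs),))

# The database only sees an opaque string; deserializing happens in app code.
row = conn.execute("SELECT attributes FROM vehicles WHERE id = 1").fetchone()
restored = json.loads(row[0])
```

The database itself knows nothing about the structure – everything meaningful happens at `json.dumps`/`json.loads` time, which is both the appeal and the trap.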
Sometimes requirements change quickly. I get it. In theory, there’s not a lot of difference between theory and practice, but in practice, there is. Sometimes business needs aren’t fully known at the time of development (the horrors!). Sometimes you won’t know the exact format of the data you’ll be getting from a 3rd party service, but you know you’ve got to store it. Or maybe there’s just so much variability in whatever it is you’re dealing with that trying to come up with a properly normalized schema isn’t practical. These are the sorts of things that happen in the real world. And given that in the real world (or at least the business world) time is money, decisions need to be made.
So what do I mean by stuffing key/value data into a single database column?
Suppose you have a database model that stores data about a vehicle. It could have any number of attributes (e.g.: number of tires, engine displacement, kW rating of the engine/motor, seating capacity). There could be a near-infinite number of attributes for a vehicle (e.g.: how long does each cycle of the indicator light last? How bright are the taillights in lumens?). If you tried to model this in traditional third normal form, you’d end up with a table with hundreds of columns, which isn’t ideal. Depending on the different types of vehicles being stored, the table could also end up holding very sparse data. This is where key/value stores come in handy.
In an ideal situation, you’d use some sort of key/value store database system to store this kind of data. But often, it needs to be mixed in with an existing database that is already in use. Maybe you don’t have the resources to get people trained on a new system, or maybe there’s an issue of time (how long would it take to get someone trained on how to properly/securely configure a new key/value store, and how long would it take to train developers to use it?). There’s a lot of reasons why stuffing the data into an existing relational database might be the best call at that moment. But what sort of considerations need to be made?
- Will the keys or values need to be queried on? If yes, you’ll want to handle things very carefully. Systems like PostgreSQL that have built-in data types for this (hstore or JSONB) allow querying within those fields. Otherwise you’ll be stuck doing substring matches on text fields, which is going to be slower. If the data is never going to be queried on, you can probably get away with stuffing the JSON string inside a regular text column, and that’d be fine.
- Is the schema known beforehand? If it is, maybe there’s a better way than stuffing everything into a key/value field. This especially holds true if the schema is unlikely to change.
- Am I abusing a JSON field for the sake of avoiding database migrations? If you answer “yes” to this, you’re probably doing it for the wrong reason. Repent, and do some proper normalization.
- Is the schema likely to drastically change in the future? (If so, I feel for you!) If the structure of the data is expected to change frequently, stuffing it into some sort of key/value store isn’t the worst idea. But remember that you will likely have to handle those different formats in code (unless all you are doing is storing/displaying whatever the data is).
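To make the first point above concrete, here’s a rough sketch of JSON-aware querying versus substring matching. I’m using SQLite’s `json_extract()` as a stand-in (assuming a build with the JSON1 functions, which modern ones include); PostgreSQL’s JSONB operators play the same role:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicles (id INTEGER PRIMARY KEY, attributes TEXT)")
rows = [
    (json.dumps({"type": "car", "seating_capacity": 5}),),
    (json.dumps({"type": "motorcycle", "seating_capacity": 2}),),
]
conn.executemany("INSERT INTO vehicles (attributes) VALUES (?)", rows)

# With JSON-aware functions, we can filter on a key directly.
roomy = conn.execute(
    "SELECT id FROM vehicles"
    " WHERE json_extract(attributes, '$.seating_capacity') >= 4"
).fetchall()

# The fallback on a plain text column: a substring match. It can't use a
# normal index, and it silently breaks if the stored JSON's whitespace or
# key order ever changes.
clumsy = conn.execute(
    'SELECT id FROM vehicles WHERE attributes LIKE \'%"type": "car"%\''
).fetchall()
```

The `LIKE` version happens to work here only because every row was serialized the same way by the same code, which is exactly the kind of assumption that rots.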
I’ve seen my share of JSON strings stuffed inside database fields. Sometimes it was the right call to make, and at other times… not so much.
If you do have to do it, just… do it wisely.