When should I stuff key/value data into a single database column?

This situation has come up an awful lot during my career, and it’s something that was never really talked about much in college or university.  In college and in university, instructors drilled into us things like third normal form – normalizing the database to avoid duplicate data (among other reasons).  One thing they never prepared us for was stuffing key/value data into a single database column.  One of the easiest ways to do this is to convert the data to JSON, then store the data as a JSON string inside a text column in the database.

Sometimes requirements change quickly.  I get it.  In theory, there’s not a lot of difference between theory and practice, but in practice, there is.  Sometimes business needs aren’t fully known at the time of development (the horrors!).  Sometimes you won’t know the exact format of data you are going to be getting from a 3rd party service, but you know you’ve got to store it.  Or maybe there’s just too much variability in whatever it is you are dealing with that trying to come up with a properly normalized schema isn’t practical.  These are the sorts of things that happen in the real world.  And given that in the real world (or at least the business world) time is money, decisions need to be made.

So what do I mean by stuffing key/value data into a single database column?

Suppose you have a database model that stores data about a vehicle.  It could have any number of attributes (e.g.: number of tires, engine displacement, kW rating of the engine/motor, seating capacity).  There could be a near-infinite number of attributes about a vehicle (e.g.: how long does each cycle of the indicator light last?  How bright are the taillights in lumens?).  If you tried to model this with a traditional 3rd normal form database, you’d end up with a table with hundreds of columns, which isn’t ideal.  The table itself could end up storing very sparse data, depending on the different types of vehicles being stored.  This is when key/value stores come in handy.
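As a sketch of the idea (the attribute names and values here are made up for illustration), the long tail of per-vehicle attributes can be collapsed into a single JSON string instead of hundreds of mostly-empty columns:

```python
import json

# Keep the handful of common fields as real columns, and collapse the
# sparse, per-vehicle-type data into a single JSON blob.
vehicle = {
    "make": "Example Motors",   # hypothetical values for illustration
    "model": "Roadster",
    "attributes": {             # the variable, sparse portion
        "tire_count": 4,
        "engine_displacement_cc": 1998,
        "seating_capacity": 5,
    },
}

# Serialize the variable portion to a string, ready for a plain text column.
attributes_column = json.dumps(vehicle["attributes"])

# Reading it back is just the reverse.
restored = json.loads(attributes_column)
```

No schema change is needed when a new vehicle type shows up with attributes nobody anticipated – which is exactly the appeal (and the danger).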

In an ideal situation, you’d use some sort of key/value store database system to store this kind of data.  But often, it needs to be mixed in with an existing database that is already in use.  Maybe you don’t have the resources to get people trained on a new system, or maybe there’s an issue of time (how long would it take to get someone trained on how to properly/securely configure a new key/value store, and how long would it take to train developers to use it?).  There’s a lot of reasons why stuffing the data into an existing relational database might be the best call at that moment.  But what sort of considerations need to be made?

  1. Will the keys or values need to be queried on?  If yes, you’ll want to handle things very carefully.  Systems like PostgreSQL that have a built-in data type (hstore or JSONB) allow querying within fields.  Otherwise you’ll be stuck trying to do substring matches on text fields, which is going to be slower.  If the data is never going to be queried on, you can probably get away with stuffing the JSON string inside a regular text column, and that’d be fine.
  2. Is the schema known beforehand?  If it is, maybe there’s a better way than stuffing everything into a key/value field.  This especially holds true if the schema is unlikely to change.
  3. Am I abusing a JSON field for the sake of avoiding database migrations?  If you answer “yes” to this, you’re probably doing it for the wrong reason.  Repent, and do some proper normalization.
  4. Is the schema likely to drastically change in the future?  (If so, I feel for you!)  If the structure of the data is expected to change frequently, stuffing it into some sort of key/value store isn’t the worst idea.  But remember that you will likely have to deal with handling those different formats in code (unless all you are doing is storing/displaying whatever the data is).
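To make the first consideration concrete, here’s a minimal sketch using SQLite (standing in for whatever relational database you already have) that stuffs a JSON string into a regular text column.  Note what the trade-off looks like: without a JSON-aware column type, “querying” the data means pulling the whole string back out and parsing it in code.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicles (id INTEGER PRIMARY KEY, attributes TEXT)")

# Store the key/value data as a plain JSON string in a text column.
attrs = {"tire_count": 4, "seating_capacity": 5}
conn.execute("INSERT INTO vehicles (attributes) VALUES (?)", (json.dumps(attrs),))

# Without a JSON-aware type, "querying" means fetching and parsing in code.
row = conn.execute("SELECT attributes FROM vehicles WHERE id = 1").fetchone()
stored = json.loads(row[0])
```

Fine for display-only data; painful the moment someone asks for “all vehicles with more than four tires”.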

I’ve seen my share of JSON strings stuffed inside database fields.  Sometimes it was the right call to make, and other times… not so much.

If you do have to do it, just… do it wisely.

Case Sensitivity Can Be a Sensitive Issue

There are two kinds of software developers: those that have been bitten by a case-sensitive bug and those that haven’t yet been bitten by a case-sensitive bug.

It seems like almost every piece of software eventually runs into a case sensitivity issue.

One example I saw somewhat recently was comparing the domain portion of an email address.  The original comparison wasn’t taking case into account.  As a result, a small piece of functionality was failing to work as expected.

Say, for example, users signing up for a website.  If you record the emails in a case sensitive way, you’re pretty much guaranteed to have some user complain that they can’t log in because they are entering “John.Doe@example.com” instead of “john.doe@example.com” (or maybe it was “john.doe@Example.com”?).  In the case of an email address, it might not be that big of a deal to lower-case everything and store it like that, but you are sure to run across other cases where you can’t just lower-case everything.

Take, for example, Python dictionaries.  Are you sure that the key name you were after was “UserID”?  Or maybe it was “UserId”?  Or was it “userId”?  Maybe you managed to avoid it by always using snake case, and it’s always “user_id”?
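A tiny illustration of both pitfalls (the key names here are invented): a dictionary lookup is exactly case-sensitive, while user-supplied strings like email addresses usually want to be normalized before comparison.  (Strictly speaking, the local part of an email address is allowed to be case-sensitive, but in practice lower-casing is the common call.)

```python
# Dictionary keys are exactly case-sensitive: "UserID" and "userId" differ.
record = {"UserID": 42}
assert record.get("userId") is None   # not the key you were after
assert record.get("UserID") == 42

# For user-supplied strings, normalize before comparing.
def same_email(a: str, b: str) -> bool:
    # casefold() is a more aggressive lower() that also handles non-ASCII
    return a.strip().casefold() == b.strip().casefold()

assert same_email("John.Doe@example.com", "john.doe@EXAMPLE.com")
```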

At least with whatever you are doing, do it consistently.

It seems like an elementary problem, but it’s guaranteed to bite you at some point.

Good, Better, Best

When I was a kid, we only had two TV channels.  This meant that the selection of TV shows available on a Saturday afternoon was severely limited.  Instead of the cartoons that were on in the morning, the afternoon shows were usually about sports, fishing, or home improvement.  This was in the days before decent internet speeds and streaming video, and I can recall on more than one occasion as a bored youngster watching a home improvement show.  The show in question was called “Home Check” by Shell Busey.  At least my parents can’t fault me for learning something from it!

On the show, Mr. Busey would often show three comparable tools or ways of doing something.  One way would be a “good” way of doing something.  It would do the job.  Then there would be a “better” option, that had some sort of visible improvement over the “good” way.  Then there was the “best” way – again, showing improvement over the previous two options.  For some reason, this has always stuck with me, and it still applies to software development.

The “good” way of doing something gets the job done.  It meets the requirements, and can be done within a reasonable amount of time.  But there may be some drawbacks to doing something this way.  Maybe it makes future maintenance more difficult, or maybe it’s a less flexible approach, meaning it will need to be reworked if future changes are needed.  But it still gets the job done.

The “better” way of doing something still gets the job done, but maybe incorporates a few good/best practices.  Maybe it makes maintenance easier, and will save you (or a future developer) some work.  But maybe the solution here takes a bit more time to implement, or involves more infrastructure or more changes.

The “best” way of doing something has some sort of advantage over the “good” and “better” ways.  Maybe it’s the easiest to maintain, the fastest, or the most flexible.  But chances are, there’s going to be some sort of drawback.  Maybe it takes even more time to implement.  Or maybe it ends up being overkill for a system that won’t see future updates.  Or maybe it requires extensive code changes.

Sometimes the “good” solution is good enough.  Sometimes something better is needed.  That’s half the battle with software development – knowing when to take which approach.  When you are doing something you’ve never done before, it’s hard to say if you are doing it the right way.

Keep Your Data Together

When working with data, there are few things that will haunt you more than inconsistent data.  For example:

  • A database column or model value that could be either an empty string or a null value.
  • A dictionary which may or may not contain a given key.
  • A piece of JSON where a value may be either a string or a number.
  • Data being stored in multiple different places, and no clear idea on which one is the correct value.

In my career, I’ve seen plenty of the first three.  Those usually aren’t too bad to deal with.  It’s the last one that is… ugly.
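The first three kinds of inconsistency can usually be papered over with a small normalization step at the boundary of your system.  A sketch (the field name is made up):

```python
from typing import Optional

def normalize_user_id(raw: dict) -> Optional[int]:
    # Handles all three at once: missing key, null/empty string,
    # and string-vs-number values.
    value = raw.get("user_id")          # key may be absent entirely
    if value is None or value == "":    # null and empty string collapse
        return None
    return int(value)                   # accept both 42 and "42"
```

Do the cleanup once at the edge, and the rest of the code never has to care which variant it was handed.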

With the one particular example I’m thinking of, we had at least 3 different places that we could pull a given value from.  And the problem is, it was completely inconsistent on which one was the “correct” place to pull it from.  There was the original (old) model.  There was a newer model which was supposed to replace the “old” model (but was never fully utilized).  Then there was another microservice that was supposed to have the value (which may or may not have been filled out).  To add to the confusion, some of the data was partially backfilled from one place to another.  Some places in code would pull it from location A, then B, then C, while other places would pull it from B, then A, then C.  There were no clear requirements on which one should be the most trusted, and which one should be the least trusted.

The moral of the story here is to keep your data together.  Ideally keep it in one place, and if you can’t manage that, at least have some sort of consistent ordering in pulling it from other sources.  Document that order, and ensure that everyone understands it.  It’ll save you a lot of grief in the long run.
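If the data really does have to live in multiple places, at least make the lookup order explicit in one shared spot, rather than re-deciding it at every call site.  A sketch (the source names are hypothetical):

```python
def get_value(key, old_model, new_model, service_data):
    """Single, documented precedence: new model > old model > microservice.

    Every call site goes through this one function, so the trust order
    is decided exactly once.
    """
    for source in (new_model, old_model, service_data):
        value = source.get(key)
        if value is not None:
            return value
    return None
```

It doesn’t fix the underlying mess, but it does stop location A-then-B-then-C and B-then-A-then-C from coexisting in the same codebase.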

The Stinkholder

You probably know what a stakeholder is (someone who has a vested interest in the software you are developing).  And you probably know what something stinky is.  But you may have never heard of a stinkholder.

This was a phrase coined by a former coworker of mine.  It was more of a verbal mix-up, but the name stuck.  So what is a stinkholder?  It’s a stakeholder that stinks.  (Or, as all the “cool” or “lit” kids would be calling it these days – a stankholder).

Chances are, unless you are developing software for yourself, there are other people who are going to have a say in what you are building.  This especially holds true if they are the ones paying for the software to be built.  If you are developing software to be consumed by different groups, you may end up in a situation where one stakeholder demands things which may not make sense for the other stakeholders.  Or they ask for things that are unreasonable or asinine.  Rather than focusing on the core of the application, and things that are critical to its operation, they want things done their way, and they demand it.  Picture the girl from Charlie and the Chocolate Factory – “I want it! I need it! Give it to me now!”

One example I heard was a demand to change the color scheme of an app.  Sometimes color changes are justified.  In this case, the proposed color scheme didn’t fit with the corporate branding, and didn’t match anything else being done.  And it was only a single person requesting the change to the background of a loading screen.  And that wasn’t the only similar request they had.  Although this person wasn’t specifically labelled as a stinkholder, the name very well could have been applied.

Stinkholders can be dangerous.  If you have too many of them, you end up spending precious development time building things of no value, and not getting critical work done.  A gaggle of stinkholders can poison a project.

Beware of stinkholders.

Weird Stuff is Going to Happen (In Software)

In the world of software development, weird stuff is going to happen.  What do I mean by weird stuff?

  • Performing a minor version update to a 3rd party library breaks your app in a horrible way.
  • A user somehow makes a really strange web request, which generates an unhandled exception in a way you didn’t think was possible.
  • Something glitches out and you get a bug report for something you have no idea how to replicate.
  • The fix that you push that should have fixed an issue isn’t working 100% of the time, despite you being sure it should.
  • Telling a user to refresh/reload/reboot fixes an issue, which is then never seen again.

When you think about it, computers are slightly terrifying things.  There are layers upon layers of things that can go wrong.  Everything from floating point precision errors in hardware to someone getting mad and removing a library, breaking other major libraries that depend on it.  Unless you are closely involved with the hardware, and are programming specifically for it, it’s almost impossible to avoid the lasagna that is current software development.

Because there are so many things outside of our control, what can we do?

  1. Understand that odd things are going to happen – it’s pretty much guaranteed.  Understand that it’s probably not prudent to sink days into tracking down a one-off issue (unless you work in anything related to healthcare, finance, or security!).  Chances are, your job has deadlines, and chasing rabbits isn’t going to help you meet those deadlines, unless you really suspect there is an underlying issue.
  2. Monitor for odd issues.  Without capturing things like unhandled exceptions, you may never know that you are even having issues.  There’s already a plethora of libraries out there for almost every platform imaginable to track this kind of thing, so be hesitant about writing your own.
  3. Do what you can about the things that are under your control.  There are going to be things that happen that you have no control over.  Cloudflare having issues?  Amazon AWS going down?  That’s probably beyond your control, and you don’t need to sweat it.  Your app throwing an unhandled exception because of poor handling of a user request?  That’s something you can do something about.
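For point 2, error-tracking services and libraries already exist for most platforms, but even the zero-dependency version is better than flying blind.  A sketch using Python’s sys.excepthook:

```python
import logging
import sys

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("crash-reporter")

def report_unhandled(exc_type, exc_value, exc_traceback):
    # Record the crash somewhere you'll actually see it, then fall
    # through to the default handler so the traceback still prints.
    log.error("Unhandled exception: %s: %s", exc_type.__name__, exc_value)
    sys.__excepthook__(exc_type, exc_value, exc_traceback)

# Any exception that escapes to the top of the program now gets logged.
sys.excepthook = report_unhandled
```

In real life you’d ship the log line to a monitoring service; the point is that the weird one-off failures get recorded at all.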

When you can do something about it, do something about it. But accept the fact that weird stuff is going to happen.

Paper Towels and Performance

I recall hearing a (potentially fictitious) story something like the following:

A company wanted to reduce costs.  They decided that they could do this by using paper towels in their washrooms that were smaller, in hopes that people would use less of them when drying their hands.  After they introduced the smaller paper towels, they noted that total paper towel usage actually went up, rather than down.  When people were drying their hands with the older (larger) paper towels, they only felt the need to grab a single one.  But with the newer, smaller sized paper towels, they felt they needed more than one, and would grab two or three to do the same job.  As a result, total paper towel usage went up, instead of down.

There’s a few interesting takeaways here:

  1. Without having a way of measuring performance, it’s impossible to know the effects of a change.  If the company in question didn’t know how many paper towels were being used in the first place, it would have been difficult to tell if the outcome was positive or negative.  There needs to be some way of recording values in order to make a point of comparison. Maybe in the above story it was as simple as asking the cleaning staff how many cases of paper towels they used in a month. In other cases, more exact measurements may be needed.
  2. Sometimes there are unexpected side-effects to actions.  If the company in question knew that reducing the size of the paper towels would have led to an increase in total usage, they wouldn’t have tried to reduce costs in that way.
  3. Not all experiments are successful.  There’s still things that can be learned.  Sometimes what we learn is “Don’t do that” or “That didn’t work”.  But without having done the experiment, it would be speculation on what the outcome would be.  Decisions based on speculation are not ideal.
  4. Sometimes the best changes aren’t small ones.  How cheap does a company need to be to reduce the size of paper towels?  Surely there are better ways to reduce costs than nickel and diming areas like paper towels.  Maybe it would make sense to install electric hand driers?  That sounds like an experiment to me.
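The software equivalent of counting paper towel cases is recording a baseline before you change anything.  Even a crude measurement beats speculation.  A sketch using time.perf_counter (the workload is a stand-in):

```python
import time

def measure(fn, repeats=5):
    # Take the best of several runs to reduce noise from the environment.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Record the baseline BEFORE making the change, then compare after.
baseline = measure(lambda: sum(range(10_000)))
```

Without the “before” number, you can’t tell whether your optimization was the smaller paper towel.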

Refactoritus – a potentially deadly condition for developers (and projects)

Software developers are not immune to disease.  For example, the disease of refactoritus – the desire to constantly re-write code.  If it takes hold, it can ruin a developer or a project.

Why are software developers always wanting to re-write code?  There’s a few reasons, including (but not limited to):

  • “I don’t understand it, therefore I want to rewrite it.”  This is probably the most common cause of Refactoritus that I’ve seen.  Just because one developer doesn’t understand why something was written a certain way, it shouldn’t mean it needs to be rewritten.  This is more a failure on documentation and clear code comments.  The context of why something was written the way it was has more value than the “what”, “who”, or “when”.  That’s what source control (and things like Git Blame) are for.
  • “This code wasn’t written by us, and we don’t like that”.  There is a time and a place to use code written by someone else.  For example, 3rd party libraries that do something better and faster (and with more testing) than you could have done.  Or code written by contractors that you are forced to integrate.  It happens.  But just because it’s “not invented here” doesn’t mean you should throw it out.
  • “I learned a new design pattern, and it looks like it could work here!”.  Also known as “When you’ve got a shiny new hammer, everything looks like a nail”.  There are times and places to refactor things because there’s a better way of doing something.  But there also needs to be a business driver for it.  Maybe rewriting it will reduce the amount of bugs.  Or reduce the amount of time spent maintaining it.  But if there’s no tangible outcome other than “we used this new thing because it is cool”… it’s probably not a good enough reason for rewriting something.
  • It’s honestly bad code, and should be rewritten to meet a higher standard (in testability, reusability, or made clearer).  This is a symptom of other failures within the organization (or team).  Bad code shouldn’t pass code review and shouldn’t be merged in, in the first place.  That’s why we have code reviews.  On small projects, this may happen – especially when the primary developer who started the project wasn’t a software developer.  I’ve seen some code written by engineers… and it can get pretty gnarly.  Again, there should be some reason for doing the work.  If it’s going to reduce bugs, maintenance, testing time, or further development time, that can be a valid reason.  If anything, add tickets to your backlog for further refining it when you’ve got some downtime.
  • The code is no longer applicable, and no longer makes sense (or the business needs around it have changed).  Businesses change over time, and code sometimes needs to be adjusted to reflect that.  Large sections of dead code can be a symptom of this.  There’s usually not a lot of reason for keeping dead code around when tools like Git (or proper documentation) can bring it back if needed.  For many languages there are static code analysis tools that can help identify code that will never be called.  Dead code is more code that needs to be maintained, and can make tracking down other issues more time consuming.  Sometimes taking a bit of time to clean things up is appropriate.
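As a toy version of what those static analysis tools do (real ones are far more thorough, handling methods, dynamic dispatch, cross-module references, and so on), Python’s ast module can flag functions that are defined in a module but never called by name:

```python
import ast

def find_uncalled_functions(source: str) -> set:
    """Return names of functions defined but never called by name.

    A deliberately naive heuristic, for illustration only: it misses
    functions passed around as values, methods, and dynamic calls.
    """
    tree = ast.parse(source)
    defined = {node.name for node in ast.walk(tree)
               if isinstance(node, ast.FunctionDef)}
    called = {node.func.id for node in ast.walk(tree)
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}
    return defined - called
```

Point it at a module and the leftovers are candidates for deletion – with Git there to resurrect them if you’re wrong.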

So what happens if you spot a case of Refactoritus?

  • Identify it (call it out).  This doesn’t need to be (and shouldn’t be) done in an accusatory way.
  • Identify the reason why.  If it’s a valid reason, then put a cap on it.  Limit refactoring time to X% of your total development time.
  • Make changes to workflows (or the organization) to prevent it from happening in the future.  Things like better code reviews, or properly allocating time for cleaning things up (that do genuinely need cleaning up) as part of the actual development work.

Rewriting code for the heck of it can be dangerous.  It can become a serious time sink (and source of contention between developers).  Unless kept in check, it can add time to deadlines, and potentially even sink a project.  Remember, if you spot refactoritus, call it out and do something about it before it grows!

Ruby on Rails: When a Single Character Makes a Huge Difference

I think we’ve all heard of cases where a developer makes some subtle mistake which ends up wreaking havoc on either a system or some poor, unsuspecting user.  Maybe it’s something as simple as a developer using a single equal sign instead of two equal signs in a C++ comparison.  In theory, good languages and frameworks try to prevent this sort of thing from happening.  But it’s not possible to catch every single case.  As someone once said, “Do what I mean, not what I say!”.  Until we develop artificial intelligence systems that can read our minds, we’re stuck trying to type out our intentions with the meaty little sausages attached to our hands.

The particular bug I saw recently was in a Ruby on Rails page.  It was subtle.  Very subtle.  Compare this:

<% render_some_important_thing %>

With this:

<%= render_some_important_thing %>

Spot the difference?  Probably, because you were told there was going to be one.  Now stuff those two lines into a much larger file, and give them identical syntax highlighting, and it becomes a lot less apparent where the bug is.  So what is the difference?

The first line will evaluate the expression… but won’t actually add anything to the DOM (i.e.: the web page).  The second will evaluate the expression AND add the content to the DOM.  If something isn’t in the DOM, it doesn’t get rendered to the page, and the end-user doesn’t see it.  Hence, missing equal sign = potentially unhappy user.

Most other web frameworks I’ve seen aren’t as subtle as this.  Not to say that they don’t have issues either, but this… yeah.  Subtle.  And a little saddening that it’s so easy to make a mistake like that.

That’s “wingardium leviosa”.

Parking Town: A Stupid Name for a Really Good Meeting

As software developers, we’ve got a chip on our shoulders about meetings.  “Ugh, _more_ meetings?” is a common response.  Or “Oh, thank goodness that meeting is over so I can go back to being productive.”  It’s difficult to think that meetings are productive when your output is gauged on things produced outside of them.  But it doesn’t need to be that way, and it shouldn’t be that way!

As we used to say at a place I worked at: “Days of development could have saved you hours of planning!”

The usual SCRUM meetings consist of things like sprint planning, daily stand-up, sprint review, and a sprint retrospective.  So where does “Parking Town” come in?  What is it? And what’s with the stupid name?

First, the name. The name comes from a combination of two things: Parking Lot and Taco Town.  The “parking lot” is the sort of conversation that happens when people are on the way to their cars after leaving the building.  It’s friendly but frank, and as long as it needs to be.  Some of the best discussions I’ve ever had have been either in a parking lot or a hallway.  They weren’t formal, but brought on discussion that made a huge difference. The formal SCRUM process doesn’t really seem to account for that sort of spontaneity.

What about the “Taco Town” part of it?  It was a Saturday Night Live skit for a fictitious restaurant where tacos are served in a progressively larger serving size, ending with a grocery bag filled with salsa.  It’s really got nothing to do with the meeting name, but someone accidentally called it by that, and it stuck.  So instead of “Parking Lot”, we had “Parking Town”.

So what is it? It’s a post-stand-up discussion where we, as a team, would discuss things that weren’t urgent, but could have benefited from some additional technical discussion.  People were free to leave after stand-up, but most of the time they stayed. Our stand-up would usually take ~15 minutes, but we’d have half an hour booked in the calendar. If anyone had any technical things they wanted to bring up (questions or comments), we’d use that extra 15 minutes for Parking Town. If no one had anything to discuss, we’d all go back to whatever work we were doing.

Anyway – Parking Town.  A stupid name for a really good meeting.