’tis better to prevent bad data…
’tis better to have loved and lost than never to have loved at all.
– Lord Tennyson
Or in programming terms:
’tis better to prevent bad data from entering the system than to have to clean it up after.
Allow me to share a recent example.
At my job, I work on a team that interacts with a lot of services. Some of these services are under our control, and others aren’t. One, in particular, is a legacy service that we are working to replace. This particular service contains a set of data about machines. We currently have a script within our main system that calls this other system, fetches data from it, and attempts to insert it into the main database.
One of the problems with the old system is that there’s a lot of user-entered data that isn’t clean. For example, let’s say there’s a database field for the manufacturer of the machines. Users could have entered values such as:
Chevrolet
chevrolet
Chevorlet <- note the bad spelling
chevy
Chevy
To a computer doing string comparisons*, all of these values are different. As a result, when we pull data into the new system, we pull in the same bad data.
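To make that concrete, here's a quick Python sketch using the example values from the list above. The variable names are just mine for illustration. Exact comparison sees five different manufacturers, and even ignoring case only takes care of part of the problem:

```python
raw_values = ["Chevrolet", "chevrolet", "Chevorlet", "chevy", "Chevy"]

# Exact string comparison: every variant counts as its own manufacturer.
distinct_exact = set(raw_values)
print(len(distinct_exact))  # 5

# Case-insensitive comparison collapses only the case differences.
# The misspelling and the nickname are still treated as separate values.
distinct_folded = {value.casefold() for value in raw_values}
print(sorted(distinct_folded))  # ['chevorlet', 'chevrolet', 'chevy']
```

So lowercasing alone gets us from five values down to three, but the computer still has no idea that all three mean the same thing.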
Thankfully, the revisions we are making to the main system will make this kind of user-entered data a lot harder to enter, but unfortunately we aren’t there yet. As a result, we now have bad data in the main system. What really should have happened in a case like this is that any data coming into the main system was cleaned before being entered. The new system will eventually force a user to select the manufacturer from a drop-down list (not type it into a text field). This fixes things going forward, but it still doesn’t clean up the older bad data.
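For what it's worth, here is a rough sketch of what cleaning at the door could look like in Python. This is not our actual import script. The alias table, the function names, and the record shape (a dict with a "manufacturer" key) are all assumptions I made up for the example. The idea is simply that every incoming value either maps to one known-good name or gets set aside for a human to look at, and never goes into the database as-is.

```python
# Hypothetical lookup table: every known variant maps to one canonical name.
# In a real system this would be kept in sync with the drop-down list.
MANUFACTURER_ALIASES = {
    "chevrolet": "Chevrolet",
    "chevorlet": "Chevrolet",  # known misspelling
    "chevy": "Chevrolet",      # nickname
}


def clean_manufacturer(raw: str) -> str:
    """Return the canonical manufacturer name, or raise if it is unknown."""
    key = raw.strip().casefold()
    try:
        return MANUFACTURER_ALIASES[key]
    except KeyError:
        # Refuse to guess: unknown values are rejected instead of stored.
        raise ValueError(f"unrecognized manufacturer: {raw!r}") from None


def clean_records(records):
    """Split legacy records into cleaned rows and rows needing review."""
    cleaned, rejected = [], []
    for record in records:
        try:
            name = clean_manufacturer(record["manufacturer"])
        except ValueError:
            rejected.append(record)
        else:
            # Copy the record so the legacy data is left untouched.
            cleaned.append({**record, "manufacturer": name})
    return cleaned, rejected


legacy = [
    {"id": 1, "manufacturer": " chevy "},
    {"id": 2, "manufacturer": "Chevorlet"},
    {"id": 3, "manufacturer": "Frod"},  # not in the table, so it is rejected
]
cleaned, rejected = clean_records(legacy)
```

Only the rows in cleaned would be inserted into the main database. The rows in rejected would go to a report so someone can either fix them or add a new alias.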
Like I said, ’tis better to prevent bad data…
*Avoid doing string comparisons as much as possible – especially on user-entered data! You are just asking for pain if your code contains a lot of string comparisons.