’tis better to prevent bad data…

’tis better to have loved and lost than never to have loved at all.
– Lord Tennyson

Or in programming terms:

’tis better to prevent bad data from entering the system than to have to clean it up after.

Allow me to share a recent example.

At my job, I work on a team that interacts with a lot of services.  Some of these services are under our control, and others aren’t.  One, in particular, is a legacy service that we are working to replace.  This particular service contains a set of data about machines.  We currently have a script within our main system that calls this other system, fetches data from it, and attempts to insert it into the main database.

One of the problems with the old system is that there’s a lot of user-entered data that isn’t clean.  For example, let’s say there’s a database field for the manufacturer of the machines.  Users could have entered values such as:

Chevrolet
chevrolet
Chevorlet <- note the bad spelling
chevy
Chevy

To a computer doing string comparisons*, all of these values are different.  As a result, when we pull data into the new system, we end up pulling in this same bad data.

Thankfully, the revisions we are making to the main system will make this kind of user-entered data much more difficult to enter, but unfortunately we aren’t there yet.  As a result, we now have bad data in the main system.  What really should have happened in a case like this is that any data coming into the main system was cleaned before being entered.  The new system will eventually force a user to select the manufacturer from a drop-down list (rather than type it into a text field).  That fixes things going forward, but it still doesn’t clean up the older bad data.
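To make that concrete, here’s a minimal sketch of the kind of cleanup step that could run on data coming from the legacy system before it gets inserted.  (This is Python, and the mapping and function names are made up for illustration – they aren’t from the real script.)

    # Hypothetical cleanup step: map user-entered manufacturer strings
    # to a single canonical value before inserting into the main system.
    CANONICAL_MANUFACTURERS = {
        "chevrolet": "Chevrolet",
        "chevorlet": "Chevrolet",  # the misspelling from above
        "chevy": "Chevrolet",
    }

    def clean_manufacturer(raw):
        """Return the canonical manufacturer name for a user-entered string."""
        key = raw.strip().lower()
        # Fall back to the trimmed original so unknown values can be
        # flagged for review instead of silently lost.
        return CANONICAL_MANUFACTURERS.get(key, raw.strip())

    # clean_manufacturer("  Chevy ") -> "Chevrolet"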

Like I said, ’tis better to prevent bad data…

*Avoid doing string comparisons as much as possible – especially on user-entered data!  You are just asking for pain if your code contains a lot of string comparisons.

Pre-calculate and cache vs. slow calculations

Let’s set the scenario: you are sitting on a ton of data, and the owners of that data want to generate a report based on it.  Running the report takes a while (say, several minutes).  Do you:

1. Pre-calculate the report at a given interval (e.g.: nightly), and store the pre-calculated results so that they can be displayed quickly?
2. Make the user wait several minutes while the report is generated?

In cases I’ve seen like this, it’s a bit of a mixed bag.  No one likes waiting several minutes for a report to run (especially impatient users), but from a development standpoint pre-calculating and storing results has its own set of drawbacks.  What happens if we have new data being entered into the system each hour, and users want an up-to-date report?  What happens if the nightly process that generates the report fails?

I don’t think there’s always a cut-and-dried answer here.  But lately my experience has me leaning toward option #2.  Yes, it’s an inconvenience to the user, but it may save a lot of development effort later on.  It’s a bit like how fixing a bug during development can save 10x the amount of time later on, compared to letting that bug make its way into production.  Rather than spending that development time dealing with caching (which isn’t always as simple as it seems*), it might be better to spend that time optimizing the system and organizing your data in a better way to increase performance.  I’m not really a fan of duplicated data, especially when there’s a potentially better option.

*There are two difficult problems in computer science:
1. Naming things.
2. Cache invalidation.
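To see why, here’s a toy sketch of the pre-calculate-and-cache option.  (Python; the function and the one-hour staleness window are assumptions for illustration, not anything from a real system.)

    import time

    def run_slow_report():
        """Stand-in for the report calculation that takes several minutes."""
        return {"rows": []}

    CACHE_TTL_SECONDS = 60 * 60  # assume data arriving hourly means hourly staleness
    _cache = {"result": None, "generated_at": 0.0}

    def get_report(force_refresh=False):
        """Return the cached report, regenerating it when it has gone stale."""
        age = time.time() - _cache["generated_at"]
        if force_refresh or _cache["result"] is None or age > CACHE_TTL_SECONDS:
            _cache["result"] = run_slow_report()
            _cache["generated_at"] = time.time()
        return _cache["result"]

Even this toy version runs straight into the questions above: what counts as “stale” when new data arrives every hour, and what happens when the regeneration step fails overnight?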

“Should” and “is”: expectation vs. reality

In the past 24 hours, I’ve heard several people use the word “should”.  As in “Feature A should be performing action B”.  Or “Everything should be working fine”.  In both cases, what “should” have been happening wasn’t what was actually happening.

My advice?  Avoid saying “should”.  Check your facts and back up your claims, or don’t make a strong claim.  Especially in software development.

Flexibility vs. Strictness in Data

There’s almost always a set of trade-offs in software development.  There are trade-offs between performance and memory consumption, trade-offs in making it fast, cheap, and good, and trade-offs between using different technologies.  I don’t often hear about trade-offs between structured vs. unstructured data.

My software development path has been interesting.  When I first started programming, I was doing graphics programming in C++ (and DirectX).  I then learned Java, VB.Net, C#, and have now found myself using Python.  Somewhere along there I picked up SQL (including a bit of both Transact SQL and PL-SQL).  More recently, I’ve dabbled in using other tools like Redis and ElasticSearch.  Along the way, it’s been interesting to see the different methodologies when it comes to storing data.

In the strict camp, database models have a set number of fields, with specific types and rigid relationships between data.  Either values are present, or they are null.  But you can always assume that there will be a certain amount of structure.  This is great, until a schema change comes along.  Schema changes are generally fine, so long as you know what the future of your data should look like.

In the flexible camp, you’ve got no guarantee that a particular key will exist, but you have the flexibility to add it if it isn’t there.  As business requirements change, so can your data – quite easily.  Need to add some new attributes to a database model?  No problem.  Not sure what the attribute names are going to be?  Also not a problem – just wing it!  This is great for rapid development, but then it comes back to haunt you when you have to continually check for the presence of something before you try to use it.
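As a rough illustration of the difference, here’s what the two camps tend to look like in Python (the Machine fields are invented for the example):

    from dataclasses import dataclass
    from typing import Optional

    # Strict camp: a fixed set of fields with known types.  A schema change
    # means updating this model (and the stored data) in one place, but
    # every reader can rely on the fields existing.
    @dataclass
    class Machine:
        serial_number: str
        manufacturer: str
        year: Optional[int] = None

    # Flexible camp: just a dict.  Adding a new attribute is trivial, but
    # every reader has to guard against missing keys.
    machine = {"serial_number": "A123", "manufacturer": "Chevrolet"}
    year = machine.get("year")  # may or may not be there
    if year is not None:
        print("Built in %d" % year)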

I generally like the approach of starting strict, then loosening restrictions as necessary.

If you don’t have clearly defined requirements, including clearly defined requirements of the data you need to store, that’s a pretty good indication that you shouldn’t be starting to code yet.  To solve a problem, you first have to understand the problem.  Although flexible data allows you to discover how to solve the problem, it’s often a good idea to revisit how you are storing your data once you’ve figured out the solution.

Software optimizations, in a nutshell

I think pretty much all software optimizations can be summed up like this:

Don’t do more work than you need to.

Whether that means choosing a better algorithm, reducing memory allocations, or minimizing instruction counts, that’s really what it comes down to.

By choosing a better algorithm, you’re doing less work.  By reducing memory allocations, you’re moving things around less in memory, and are doing less work.  By minimizing instruction counts, you’re doing less work.

Let me give an example.  Earlier this week, a co-worker needed some way to check values in our production data.  This was previously possible through the UI, so there was a manual process of switching between two websites to check for a particular condition.  Because of some recent UI changes, this was no longer possible, so they needed another way of doing it.  I was able to put together a small Python script that did this.  The script made some HTTP calls against our production site.

The first version of the script worked, but it was not efficient.  It made one call to fetch a list of items, which it then iterated over.  For each of those items, it made another HTTP call to get some additional values.  It worked, but it was slow – especially in cases where the first call returned a large number of items.  Supposing the secondary calls took at least one second each, if the first call returned 1000 items, you’d be waiting over 16 minutes for it to complete.  Not cool.  Yes, it worked, but it was way too slow to be useful.  Following the mantra of “Make it.  Make it work.  Make it work right.  Make it work right, fast.”, the second revision of the script made just two HTTP calls total.  Those calls were still fairly large, but the overall script was much faster: the run time on 1000 items dropped from over 16 minutes to under 10 seconds.
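The shape of that change looks roughly like this.  (Python with the requests library; the endpoints and field names are invented stand-ins, not our real production API.)

    import requests

    BASE_URL = "https://example.com/api"  # hypothetical endpoints

    # First version: one extra HTTP call per item -- the classic "N+1" pattern.
    # At roughly a second per call, 1000 items is well over 16 minutes.
    def fetch_details_slow():
        items = requests.get(BASE_URL + "/items").json()
        return [requests.get(BASE_URL + "/items/%s" % item["id"]).json()
                for item in items]

    # Second version: two HTTP calls total, joined in memory.
    def fetch_details_fast():
        items = requests.get(BASE_URL + "/items").json()
        details = requests.get(BASE_URL + "/item-details").json()
        # Still scans the details list once per item -- see the next paragraph.
        return [(item, next((d for d in details if d["id"] == item["id"]), None))
                for item in items]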

I could have taken things further: rather than iterating over a list of 1000 items (several times), I could have swapped the list for a hash table.  But at that point it was already fast enough (and I had a large number of other tasks on my plate).  We had already taken a manual process from 10-ish minutes a day down to ~30 seconds.  (A user still had to go and check the CSV file generated by the script.)  It would certainly have been possible to cut that sub-10-second run time down to just a second or two, but the law of diminishing returns kicks in, and at a certain point it wasn’t worth optimizing further.
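For the record, that hash-table swap would only have been a few lines, along these lines (using the same invented field names as the sketch above):

    def join_items(items, details):
        """Join two lists of dicts on "id" with one pass over each list."""
        details_by_id = {d["id"]: d for d in details}  # build the index once
        return [(item, details_by_id.get(item["id"])) for item in items]

    # join_items([{"id": 1}], [{"id": 1, "status": "ok"}])
    # -> [({"id": 1}, {"id": 1, "status": "ok"})]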

I find the amount of complexity we add to systems staggering: we add layers upon layers of abstraction.  We add frameworks that call other frameworks that call microservices.  Some of this does indeed help make a developer’s life easier.  But at the same time, it also adds more work.  I’d be curious to know how many large-scale systems could be sped up by simply doing less work.

The moral of the story here is:  Don’t do more work than you need to.

Working from a queue?

There’s an idea that keeps floating around my mind, but I’m not quite sure how to describe it: What if everyone was to work out of a queue?

We’ve all got task lists.  In the particular office environment that I work in, I can get a request in any number of forms: an email, a JIRA ticket, a Slack message, or someone actually coming up to me in person to ask me something.  What if all of these got funneled into a priority queue?  The reasons for doing this are:

  1. People making urgent requests would be able to see the impact of their request in relation to other pending tasks.
  2. The important items would float to the top more quickly.
  3. It’d be easier to concentrate on getting things done if I could turn off notifications in one spot.

Imagine if all email requests, Slack messages, JIRA tickets, and anything else (except in-person communication) got pushed into a queue.  As a user, you would have the ability to prioritize which items come first.  For example, my priorities would look something like:

  1. Outlook Calendar notifications
  2. Direct Slack messages
  3. Slack messages in specific project channels
  4. Outlook emails
  5. Github pull request/comment notifications
  6. Less important Slack channels
  7. Emails sent from particular people
  8. RSS feeds
  9. Twitter, etc.

It would be possible to see the queue length, as well as “peek” at any given thing in the queue, but the general idea is that you are always working off the one end.  Messages are shoved into one end and processed out the other.  The priority allows me to set the order in which I want to see things.  For example, if someone sends me a direct Slack message at the same time I have a meeting reminder, the meeting reminder takes precedence and shows up at the top.  Once I’ve dismissed it, then I’d see the next thing in the queue – the direct message.
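At its core, this is just a priority queue.  Here’s a toy sketch in Python (the sources and priority numbers are pulled from my list above; everything else is invented):

    import heapq
    import itertools

    # Lower number = handled first.  Unknown sources sink to the bottom.
    PRIORITIES = {"calendar": 1, "direct_slack": 2, "project_slack": 3, "email": 4}

    _tie_breaker = itertools.count()  # keeps equal priorities in FIFO order
    queue = []

    def push(source, message):
        heapq.heappush(queue, (PRIORITIES.get(source, 99), next(_tie_breaker), message))

    def pop_next():
        """Take the next item off the working end of the queue."""
        return heapq.heappop(queue)[2] if queue else None

    push("direct_slack", "Quick question about the deploy")
    push("calendar", "Stand-up in 5 minutes")
    print(pop_next())  # the meeting reminder comes out first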

Users would be able to see the position of their requests in other users’ queues.  Privacy settings would allow you to make some queue items public (so anyone can see them) or private (so no one can see the contents of the queue messages).  (By default, queued items would be private, except perhaps the source – Outlook, Slack, Github, etc.)

Users would also be able to put their own items into their own queue.  For example, I could put an item like “Send a Happy Father’s Day message to dad” into my queue.  Users would also have the ability to push things back into the queue, to a given depth.  For example, maybe I’m dealing with a few really important tasks right now, so I push a reminder 10 or 15 items deep into the queue.  It’d be similar to snoozing an alarm clock or meeting notification.

There’s also the idea of allowing a user to push their item to a higher priority, and in doing so, anyone else whose requests are bumped by it would be able to see it.  That way, if there are several people waiting in the request line, and someone comes along with a “high priority” request and bumps everyone else down, those other people can then go contend with the person making the high-priority request.  It’s not up to the person handling the requests to settle the fight over whose request is highest priority – the people waiting in line get to decide that, leaving the person handling requests alone so they can actually get some work done.

I imagine that writing such a system would be pretty complicated – you’d need hooks into the various systems, and it would require everyone in a given office to be on board with it.  That being said, I think the idea sounds entertaining, if anything, at least for an experiment.

Pubescent Software

A co-worker of mine described the current state of a certain software company (paraphrased):

It’s no longer a start-up.  It’s bigger than that.  It’s not a mature company either.  It’s somewhere in between.  It’s pubescent.  It’s gangly, awkward, and clumsy.

I’d say that’s a pretty good description of the phase between start-up and mature company.  It’s no longer small, young, and agile, but it isn’t yet old and fixed in its ways.  The clumsiness comes from not quite knowing how to operate.  It can’t continue to operate the way it did (in “start-up mode”) because things no longer fit.  But it also hasn’t yet nailed down the efficient business processes that come with experience.