Troubleshooting: Some common, but often unchecked, assumptions to investigate

Are you tearing your hair out trying to find the root cause of a bug? Have you been troubleshooting the same problem for days and feeling like you’re getting nowhere? Minimize your IDE, put the keyboard down, and ask yourself a simple, but hard question…

What do you know for sure is true, and what are you merely assuming is true?

It sounds silly. Ridiculous even. “I’m an engineer!” you tell me, “I have a solid grasp on reality! I know how my code works and why!” But stay with me, fellow human… I’ll give an example of my own tenuous grasp on truth in the face of a rather simple bug, and see if you can relate.

We were checking some git metadata using a script with some regex to find the git SHA in our CI pipeline. GitLab was insisting the commit we wanted to reference couldn’t be found and failed the pipeline. Couldn’t be found? I was looking straight at it, it was right there, what do you mean it couldn’t be found!? In addition, this operation had been running without issue for several weeks since it was written, and hadn’t been changed.

So I asked myself the question above and wrote down my answers…

What did I know to be true?

  • The pipeline is pulling the code correctly.
  • The commit is there.
  • The code hasn’t changed.
  • My script works.

What was I merely assuming?

  • Can’t think of anything. Those things are obviously true!!

Check for Proof

But… how did I know FOR SURE each of those was true? I checked for proof of each thing I knew and wrote those down, too.

  • The pipeline is pulling the code correctly.
    • The pull happened without an error.
    • The logs are showing the correct branch.
    • The script is successfully running several steps before the commit not found problem that would have failed if the code wasn’t there or the repo wasn’t intact.
  • The commit is there.
    • The logs are showing the correct commit.
  • The code hasn’t changed.
    • The diff on the script file and everything related shows no changes since it went into production.
  • My script works.

… um, well, I had myself there… it actually does NOT work, at least this time. I guess I was just assuming that. I’ll cross this off… but now I realize I know some other things ARE true:

  • Something outside of the code is different.
  • The script cannot find the git SHA.

This led to a new item I could add to the list of assumptions. If we know the commit is there, but the script isn’t finding the git SHA…

  • There is something different about this SHA vs. the ones the script worked on.

Now we’re getting somewhere…. It helps to write these known and assumed truths down. Our brains like to slide around the thing they want to be true, and sneak out through the back door before anyone can tell them otherwise. Having a list in clear black and white locks the escape route, and makes it more likely we’ll contend with the fact we don’t know everything our brains would like us to think we do. It also helps us connect the dots.

Your job while troubleshooting is not to write code.

Your job while troubleshooting is to find as many things as you can that you merely assume are true, and prove or disprove them until you find and fix the bug. Yes, you can open your IDE again at this point if you need to.
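If a notebook page isn’t your thing, the known-versus-assumed list can live in a few lines of Python instead. This is only a sketch: the Claim class and the unproven helper are names made up for illustration, and the evidence strings are shortened from my list above. The idea is simply that any claim with no evidence written next to it is still an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    """One thing I believe is true, plus whatever proof I have written down."""
    statement: str
    evidence: List[str] = field(default_factory=list)


def unproven(claims: List[Claim]) -> List[str]:
    """Return the statements that have no recorded evidence yet."""
    return [claim.statement for claim in claims if not claim.evidence]


# The list from the pipeline bug, with evidence shortened for the example.
claims = [
    Claim("The pipeline is pulling the code correctly.", [
        "The pull happened without an error.",
        "The logs are showing the correct branch.",
        "Earlier script steps ran that need the code to be there.",
    ]),
    Claim("The commit is there.", ["The logs are showing the correct commit."]),
    Claim("The code hasn't changed.", ["The diff shows no changes since production."]),
    Claim("My script works."),  # no proof recorded, so still an assumption
]

for statement in unproven(claims):
    print("ASSUMPTION:", statement)
```

Run against my list, the only statement it flags is “My script works.”, which is exactly the one I had to cross off.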

Back to my bug…

Looking for proof that there is something different about the SHA (we were only looking at the short SHA), I found:

  • This SHA: 3589374
  • The past five working SHAs: a543b33, c5539fb, bedd987, 1c2bc08, 7997d6b

They’re all the same length, some start with numbers and some with letters…. oh… there are no letters in the current SHA. I scrolled through the rest of the SHAs and sure enough, not a single one until this one was completely devoid of letters. Proven fact, this one is different!

New assumption:

  • The lack of letters is the reason the commit cannot be found.

To test that, I looked at my regex: r'(\d+\w+)'

Which… will only work if there are numbers AND letters in the SHA. I researched a “proper” SHA-finding regex and tested it against the broken SHA, a SHA with only letters, and some of the past working ones. Then I replaced my broken regex and made a PR for the rather embarrassingly simple fix with the updated one: r'\b[0-9a-fA-F]{5,40}\b'
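The check I ran on the updated regex can be sketched like this. The find_sha helper and the "commit <sha>" input line are assumptions made up for the example (the real script and its input aren’t shown here), and "abcdefa" is just an invented letters-only SHA. The other SHAs are the ones listed above.

```python
import re

# The updated pattern from the fix: 5 to 40 hex characters as a whole word.
SHA_PATTERN = re.compile(r'\b[0-9a-fA-F]{5,40}\b')


def find_sha(text):
    """Return the first SHA-looking token in text, or None if there isn't one."""
    match = SHA_PATTERN.search(text)
    return match.group(0) if match else None


# The SHA that broke the pipeline, the five that worked, and a letters-only one.
samples = ["3589374", "a543b33", "c5539fb", "bedd987", "1c2bc08", "7997d6b", "abcdefa"]

for sha in samples:
    line = f"commit {sha}"  # hypothetical input format
    found = find_sha(line)
    print(f"{sha}: {'ok' if found == sha else 'MISSED'}")
```

One thing worth keeping in mind: this pattern will happily match any word of five or more hex characters, so something like “decade” or a long number in the surrounding text would also count as a SHA. If your input has that kind of noise, the pattern probably needs more context around it.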

Fear of No Known Truths

This was a tiny example. It took just a few minutes to go from “what the…” to submitting a PR, but sometimes you’ll find you have a dozen assumptions. Sometimes you’ll fear there are no known truths! Sometimes every shred of proof for an assumption brings up several more assumptions, like a dastardly digital hydra, and the whole thing can take days to work through. This is normal and you’re not doing it wrong, I promise. We’ve all been there. Anyone who says they haven’t is lying.

In addition to keeping your own brain in line, your list of known and assumed truths can assist you in getting help from others. If you’re new, the answers will often be quick, easy, and make your coworkers seem like geniuses. (They might be, but more likely they just had to ask that same question before… maybe just last week?) But sometimes, you’ll have to take a long list of questions to every single person on your team, the team next door, your boss’ boss, and maybe the admin’s pet cat before you find out if your assumptions are true or not. Sometimes those questions will get you looks of terror as other brains start reaching for the knob on that same mental escape hatch you’re avoiding.

But in any case, if you come asking questions armed with a list of exactly what you know and why, you’re more likely to get better answers, faster, and look smart as heck while doing it.

As a bonus, here are some common, but often unchecked, assumptions to investigate:

  • There is only one problem causing the bug
  • No one else is seeing the bug or they would’ve said something
  • I am using the same versions of all of the tools as everyone else on my team
  • The bug is in the code that the error message came from
  • The vendor’s code works
  • The customer knew exactly what they wanted, clearly asked for a specific solution, and someone confirmed we are building what they asked for.
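Some of these can be checked in seconds. Take the tool versions one: here is a rough sketch that prints what is actually installed, so you can compare notes with a teammate instead of assuming. The tool list and the tool_version and version_mismatches helpers are hypothetical, so swap in whatever your project depends on.

```python
import shutil
import subprocess


def tool_version(command):
    """Return the first line a tool prints for its version flag, or a marker if it is missing."""
    if shutil.which(command[0]) is None:
        return "not installed"
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "unknown"


def version_mismatches(mine, theirs):
    """Map each tool whose version differs to (my version, teammate's version)."""
    tools = sorted(set(mine) | set(theirs))
    return {t: (mine.get(t), theirs.get(t)) for t in tools if mine.get(t) != theirs.get(t)}


if __name__ == "__main__":
    # Hypothetical tool list for illustration.
    commands = {"git": ["git", "--version"], "python": ["python3", "--version"]}
    for name, command in commands.items():
        print(f"{name}: {tool_version(command)}")
```

Have a teammate run the same thing and feed both results to version_mismatches. An empty result turns the assumption into a known truth, and anything else is a new lead.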

Testing Assumptions

If it feels weird to scribble out notes to yourself, or you’re finding yourself writing the same items on your list every time you troubleshoot a bug, it may be time to learn how to write unit tests.

With tests, when things break in the future, many of your potential assumptions will already be known and proven truths. They’ll continue to be tested every time you change the code with no further effort on your part, but that’s a topic for another post.
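As a small taste, here is roughly what the “my script works” assumption from the SHA bug could look like as unit tests. It is a sketch, not the actual test suite: find_sha is a hypothetical helper wrapped around the updated regex, and "abcdefa" is an invented letters-only SHA.

```python
import re
import unittest

SHA_PATTERN = re.compile(r'\b[0-9a-fA-F]{5,40}\b')


def find_sha(text):
    """Return the first SHA-looking token in text, or None if there isn't one."""
    match = SHA_PATTERN.search(text)
    return match.group(0) if match else None


class FindShaTests(unittest.TestCase):
    def test_digits_only_sha(self):
        # The SHA that broke the pipeline.
        self.assertEqual(find_sha("3589374"), "3589374")

    def test_letters_only_sha(self):
        self.assertEqual(find_sha("abcdefa"), "abcdefa")

    def test_mixed_shas(self):
        # The five SHAs that had been working all along.
        for sha in ["a543b33", "c5539fb", "bedd987", "1c2bc08", "7997d6b"]:
            with self.subTest(sha=sha):
                self.assertEqual(find_sha(sha), sha)

    def test_no_sha_present(self):
        self.assertIsNone(find_sha("nothing to see"))
```

Saved in a file, these run with python -m unittest. Once they exist, “my script works on an all-digit SHA” stops being something to wonder about during the next outage.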

Happy troubleshooting. Or at least happier!
