Have you ever run into a technical resource (programmer or engineer) who outright confessed to not being very good at troubleshooting? I suspect not. However, unlike Lake Wobegon’s children, we are not all above average, and it is a natural human tendency to overestimate one’s capabilities.
Due diligence up front minimizes problems down the road. We believe that ample design and testing reduce the need to focus on troubleshooting, but neither life nor systems nor machines are always predictable. I’m not sure how many of us teach troubleshooting best practices.
One of the cardinal rules in troubleshooting is to divide the problem, if at all possible. I would submit that being systematic is also a cornerstone. I have seen very clever, very smart engineers bypass a pragmatic approach to troubleshooting by going directly to the suspected problem area in an attempt to resolve it quickly. I have also seen those individuals dig a deeper hole from which they needed to extricate themselves by backtracking.
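To make "divide the problem" concrete, here is a minimal sketch of halving a signal chain until the first failing stage is found. It assumes a single fault in a linear chain, so that every stage upstream of the fault checks good and every stage from the fault onward checks bad. The stage names and the `is_good` check are hypothetical stand-ins for whatever measurement you would actually take at each point.

```python
from typing import Callable, Optional, Sequence


def first_bad_stage(
    stages: Sequence[str], is_good: Callable[[str], bool]
) -> Optional[str]:
    """Return the first stage whose check fails, or None if every stage passes.

    Assumes one fault in a linear chain: stages upstream of the fault check
    good, and the faulty stage and everything downstream check bad.
    """
    lo, hi = 0, len(stages)  # the first bad stage, if any, is in stages[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(stages[mid]):
            lo = mid + 1  # fault must be downstream of mid
        else:
            hi = mid  # fault is at mid or upstream of it
    return stages[lo] if lo < len(stages) else None


# Hypothetical chain with a simulated fault at the scaling block.
chain = ["sensor", "transmitter", "input card", "scaling block",
         "PID loop", "output card", "valve"]
fault_index = chain.index("scaling block")
checked = []  # records which stages were actually measured


def simulated_check(stage: str) -> bool:
    checked.append(stage)
    return chain.index(stage) < fault_index


print(first_bad_stage(chain, simulated_check))  # scaling block
```

In this example three measurements are enough to isolate the fault among seven stages. If the single-fault assumption does not hold, the halving can mislead you, which is one more reason to log each check and its result.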
The more complex the system (including size, which introduces a complexity all its own), the more diligent you have to be when troubleshooting. Begin by developing a documented test record. Determine what you want to test and what you hope to be able to learn from that test. What is the anticipated result of the test? If you have a team, come to an agreement on your next step. Describe the test, log the anticipated result and also log the actual result (columns and rows work well for this).
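The test record described above can be as simple as a spreadsheet. As one possible sketch, the following keeps each test as a row with columns for the description, the purpose, the anticipated result, the actual result and the evaluation, and writes the log out as CSV. The field names and the sample entry (including the pump tag) are illustrative assumptions, not a prescribed format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class TestEntry:
    step: int
    test: str              # what was done
    purpose: str           # what the team hopes to learn
    anticipated: str       # agreed on before the test is run
    actual: str = ""       # filled in after the test is run
    evaluation: str = ""   # what the team concluded from the result


def to_csv(entries: List[TestEntry]) -> str:
    """Render the log as CSV text, one row per test, header row first."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(TestEntry)],
        lineterminator="\n",
    )
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buf.getvalue()


# Hypothetical entry: the anticipated result is logged first, the rest after.
log = [TestEntry(1, "Force run command to pump P-101",
                 "Confirm output wiring", "Motor starts")]
log[0].actual = "Motor did not start"
log[0].evaluation = "Output path suspect; check the starter next"
print(to_csv(log))
```

Writing the anticipated result before running the test is the point of the structure: it forces the team to agree on what the test is supposed to show.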
Observe the actual result and evaluate. You don’t always have to get the anticipated result, but it is important to evaluate what you did see as a result of the test. Even if you make it worse instead of better, you at least know that you are near the nerve center of the problem. I once had a very smart colleague who was as good at troubleshooting as any I’ve seen before or since. We were starting up a combined-cycle power station, and I remember him saying, “I don’t care what I think I just saw. The laws of physics are still the same!” The lesson here is to take the time to determine whether what you think you saw actually makes sense. Can you prove or disprove that observation by some other route, or perhaps duplicate the result if that would not be dangerous?
Look for obvious issues such as a typo, a duplicated address being overwritten later in the program, or perhaps an order-of-operations issue in the code. Could it be timing related, a result of using timing relationships to advance logic rather than event relationships?
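A duplicated address is the kind of issue a quick script can surface. The sketch below assumes you can export the program's write instructions as (rung number, address) pairs; that export step, and the addresses shown, are hypothetical. It simply reports every address that is written from more than one rung, so you can check whether a later write is overwriting an earlier one.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_writes(
    writes: Iterable[Tuple[int, str]]
) -> Dict[str, List[int]]:
    """Map each address written from more than one rung to the rungs writing it."""
    rungs_by_address: Dict[str, List[int]] = defaultdict(list)
    for rung, address in writes:
        rungs_by_address[address].append(rung)
    return {
        address: rungs
        for address, rungs in rungs_by_address.items()
        if len(rungs) > 1
    }


# Hypothetical export: (rung number, address written by that rung).
writes = [(3, "O:2/0"), (7, "N7:10"), (12, "O:2/0")]
print(find_duplicate_writes(writes))  # {'O:2/0': [3, 12]}
```

A duplicate write is not always a bug, but each one the script reports deserves a deliberate look.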
The document trail may seem like a formality that introduces unnecessary delays, but undocumented changes that are forgotten and left behind can add hours or even days to the diagnosis. If physical or software jumpers are installed, be sure to document them so their removal can be assured. If there is a team, having it work from such a formal document also brings people together in a synergistic way, rather than having multiple contributors act as a loose collection of heroes, which can actually complicate the matter.
Another important rule is to implement only one change at a time. When you already have something that isn’t behaving as planned, you don’t want to introduce multiple variables before testing because then you won’t truly know how to interpret the results. Be systematic: make sure you understand the impact of each change before taking the next step.
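For settings that can be tried offline or in simulation, the one-change-at-a-time rule can be sketched as follows. Each candidate change is applied on its own to a fresh copy of the same baseline, so any difference in the test result can be attributed to that single change. The parameter names, the changes and the pass/fail check are all hypothetical placeholders.

```python
from typing import Callable, Dict

Config = Dict[str, float]


def evaluate_single_changes(
    baseline: Config,
    changes: Dict[str, Callable[[Config], Config]],
    run_test: Callable[[Config], bool],
) -> Dict[str, bool]:
    """Test each change alone against the same baseline and record the result."""
    results: Dict[str, bool] = {}
    for name, apply_change in changes.items():
        candidate = apply_change(dict(baseline))  # copy: baseline stays untouched
        results[name] = run_test(candidate)
    return results


# Hypothetical baseline and candidate changes.
baseline = {"gain": 2.0, "filter_ms": 50.0}
changes = {
    "lower gain": lambda cfg: {**cfg, "gain": 1.0},
    "longer filter": lambda cfg: {**cfg, "filter_ms": 200.0},
}


def loop_is_stable(cfg: Config) -> bool:
    # Stand-in for a real test; here only the gain matters.
    return cfg["gain"] <= 1.0


print(evaluate_single_changes(baseline, changes, loop_is_stable))
# {'lower gain': True, 'longer filter': False}
```

Had both changes been applied together, the test would have passed without telling you which change was responsible.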
When troubleshooting, move with urgency, but don’t rush!
Ray Bachelor is chairman of the board at Bachelor Controls Inc., a certified member of the Control System Integrators Association (CSIA). For more information about Bachelor Controls, visit its profile on the Industrial Automation Exchange.