Sunday, October 29, 2017

The scientific method of troubleshooting

This article offers brief guidance for IT administrators, system engineers, and anyone else interested in a systematic, established approach to troubleshooting problems.
  1. Define the problem

    To define the problem, ask the following questions.
    1. What is the expected behavior?
    2. What is the current (actual) behavior?
    3. What are the criteria for success?
    4. When did the problem start, or when was it first noticed?
    5. What is the impact of the issue? Which related services and users are affected?
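
One way to keep the answers to these questions together is a small record that flags whatever is still unanswered. The sketch below is only an illustration: the `ProblemStatement` class, its field names, and the sample incident are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProblemStatement:
    """Hypothetical record of the five questions; field names are illustrative."""
    expected: str            # 1. expected behavior
    actual: str              # 2. current (actual) behavior
    success_criteria: str    # 3. criteria for success
    first_seen: str          # 4. when the problem started or was noticed
    impact: List[str] = field(default_factory=list)  # 5. affected services/users

    def missing(self) -> List[str]:
        """Return the names of the questions that are still unanswered."""
        answers = {
            "expected": self.expected,
            "actual": self.actual,
            "success_criteria": self.success_criteria,
            "first_seen": self.first_seen,
            "impact": self.impact,
        }
        return [name for name, value in answers.items() if not value]


# Made-up example incident: the time frame has not been established yet.
problem = ProblemStatement(
    expected="Users can log in to the intranet portal",
    actual="Login page returns HTTP 500",
    success_criteria="Login succeeds for a test account",
    first_seen="",
    impact=["portal users"],
)
print(problem.missing())  # prints ['first_seen']
```

An empty result from `missing()` suggests the problem is defined well enough to move on to the research step.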

  2. Do your research

    1. Know your environment.
    2. Collect the necessary background information.
    3. Refer to existing documentation.
    4. Review the change logs.
    5. Discuss the issue with others to gather multiple opinions.
    6. Check the knowledge base (KB) to see whether it is a known issue.
    7. Is it possible to reproduce the issue?
    8. Are there any associated dependencies?
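
Reviewing change logs is usually easier when the list is narrowed to changes made shortly before the problem started. The following is a minimal sketch of that idea; the `changes_before` helper, the 24-hour default window, and the sample change log entries are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def changes_before(changes, problem_start, window=timedelta(hours=24)):
    """Return (timestamp, description) changes made within `window`
    before the problem started, newest first."""
    recent = [
        change for change in changes
        if problem_start - window <= change[0] <= problem_start
    ]
    return sorted(recent, key=lambda change: change[0], reverse=True)


# Hypothetical change log entries.
change_log = [
    (datetime(2017, 10, 27, 9, 0), "Patched database server"),
    (datetime(2017, 10, 28, 22, 30), "Updated firewall rules"),
    (datetime(2017, 10, 29, 1, 15), "Rotated TLS certificate"),
]
problem_start = datetime(2017, 10, 29, 2, 0)

suspects = changes_before(change_log, problem_start)
for when, description in suspects:
    print(when.isoformat(), description)
```

The most recent changes come first because they are often, though not always, the most likely suspects.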

  3. Establish a hypothesis

    Based on the evidence collected in the previous step, state the most likely cause, then design an experiment or test strategy to validate that hypothesis.

  4. Experiment

    1. Isolate the problem using a divide-and-conquer approach.
    2. Limit the number of variables that change in each test.
    3. Follow a hierarchy: start with whatever is most likely to cause the problem.
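
Divide and conquer can be sketched as a binary search over an ordered list of changes. The sketch assumes the system worked before the first change, is broken after the last one, and stays broken once the faulty change is applied. The `first_bad` function, the `check` callback, and the sample data are hypothetical.

```python
from typing import Callable, Sequence


def first_bad(changes: Sequence[str], is_good_after: Callable[[int], bool]) -> int:
    """Return the index of the first change that breaks the system.

    `is_good_after(i)` must report whether the system works with
    changes 0..i applied.
    """
    lo, hi = 0, len(changes) - 1  # the first bad index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good_after(mid):
            lo = mid + 1  # still good at mid, so the culprit comes later
        else:
            hi = mid      # already broken at mid, so the culprit is mid or earlier
    return lo


# Hypothetical scenario: eight changes, and the one at index 5 breaks things.
changes = [f"change-{i}" for i in range(8)]
tested = []


def check(i: int) -> bool:
    tested.append(i)  # record each experiment that was run
    return i < 5


culprit = first_bad(changes, check)
print(changes[culprit], "found after", len(tested), "tests")
# prints: change-5 found after 3 tests
```

Each test changes a single variable (how many changes are applied), and halving the range locates the culprit in three tests instead of up to eight.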

  5. Gather data

    Check the current status by reviewing logs, error messages, and so on.
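
As a rough illustration of gathering data from logs, the sketch below counts how often each error message appears. It assumes a made-up log format of `DATE TIME LEVEL message`; real logs will differ, so the pattern, the `summarize_errors` helper, and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed (hypothetical) line format: "2017-10-29 02:01:07 ERROR some message"
LINE = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def summarize_errors(lines):
    """Count ERROR and CRITICAL messages, most frequent first."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and match.group("level") in {"ERROR", "CRITICAL"}:
            counts[match.group("msg")] += 1
    return counts.most_common()


# Made-up sample log lines.
sample_log = [
    "2017-10-29 02:00:59 INFO Health check ok",
    "2017-10-29 02:01:07 ERROR Connection refused: db01",
    "2017-10-29 02:01:09 ERROR Disk quota exceeded",
    "2017-10-29 02:01:12 ERROR Connection refused: db01",
]
summary = summarize_errors(sample_log)
for message, count in summary:
    print(count, message)
```

Comparing such a summary before and after an experiment gives a simple view of whether the test changed anything.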

  6. Analyze results

    1. Verify whether the problem is resolved.
    2. Consolidate the lessons learned from the troubleshooting effort.

  7. Document the problem and the solution

    1. Make sure you document the problem and the solution.
    2. Update any related documentation, if necessary.
    3. Blog it.
And finally, if you have resolved the issue, take a moment to embrace the success. Cheers!
