Gathering Meaningful Troubleshooting Data with PowerShell
Through Technology produced a script to perform over 100 tests of client connectivity, configuration and performance from a client's managed Win10 devices. This is now allowing an enterprise customer and their managed service providers to understand incidents, reduce resolution times across 1,000 sites and reduce the impact of infrastructure outages to >30,000 staff.
One of our enterprise customers has several different IT suppliers responsible for different parts of their infrastructure (a disaggregated multi-source model), meaning that there are different suppliers for End User Computing, LAN, WAN and other services.
They also have more than 30,000 users split across around 1,000 sites and no user experience monitoring tools in place. In the case of Priority 1 or Major incidents, gathering information on exactly what was wrong and which problems users were experiencing was very difficult, increasing incident resolution times.
When they had a significant infrastructure incident, information gathered from users to enable troubleshooting was :
Subjective - with language such as "Access to [my internet hosted app] is very slow" "only some of us are affected" or "nothing is working"
Incomplete - in some cases, users had to be contacted several times to be asked different questions or have their machines remote controlled.
Slow to gather - relying on telephone calls, skype chats, remote control and conversation to gather information.
Second hand - having reached 2nd/3rd line support via either 1st line support, individual onsite staff or incident managers key pieces of information were often missing or lost in translation.
Due to - and in some cases exacerbating - these issues, the client's suppliers were:
Deciding which information they wanted from users during each individual incident;
Often having to go back to several times to request further information from users; and
Struggling to make progress given the subjectivity, response time and (in)completeness of information from the end-users.
Having joined incident management calls for the client, Tony Hawk from our team identified these issues and noted that other suppliers were not addressing the problem, mostly waiting for user experience tooling that was still a long way off. Through Technology could improve these issues quickly and simply by scripting a series of tests on end-user devices and reporting the results centrally, providing a volume of consistent, objective, plain-English troubleshooting information quickly and with minimal effort or user interruption.
The script was installed on all the client's Windows 10 devices with a simple shortcut on the user's desktop and instructions to 1st line support to ask users to click it if a significant incident had to be escalated to a 2nd/3rd line resolver group for further investigation.
Once run, the script….
Performed over 100 individual tests of client connectivity, configuration and network performance;
Copied the results of testing to Microsoft Teams accessible by incident management and support staff; and,
Displayed the results of tests in plain English to the user, meaning they could talk them through with support staff if their machine had no network connectivity.
We provided deployment instructions, comprehensive documentation and held awareness sessions for incident managers, ensuring they understood its benefits and use.
Instead of hearing vague and subjective feedback such as "My App is slow", forcing them to contact individual users to find out why. Resolver groups can now receive a volume of detailed factual results through ServiceNow or Microsoft Teams, including connectivity status, latency to crucial services, proxy GET request response times, and many other key metrics from multiple users at different locations.
While not replacing the planned monitoring tools, the script cuts the time taken to gather comprehensive information from over 1 hour per-user to a couple of minutes in total. With more information provided consistently and quickly, incident resolution drop significantly.
The script was first used on a single PC just before we deployed it across the estate, when the client had an internet access issue inconsistently affecting around 4,000 staff across their estate. We ran the script once, correctly identifying a single proxy appliance that was out of service, misreporting its status to a management console and causing the issue. Working with the supplier, we resolved the incident within 10 minutes, when it had been under investigation and impacting the client's staff for many hours.
Find the script and use it yourself
You can find a redacted version of the script in our Powershell Projects Repo on GitHub. Obviously this is a redacted version so if you want to use it in your organisation you will need to add your hostnames or IP Addresses, Microsoft teams and other information as per the commented code. If you build upon it or use it in your infrastructure, we'd love to hear how you get on at firstname.lastname@example.org
[Support Information Script is licensed under the MIT OSI (Open Source Initiative) License to enable its re-use and sharing].