Improving your IT service delivery and operations with ChatOps

Introduction

IT organisations are under pressure to reduce idea-to-product cycle times while improving the service availability of the diverse range of systems and technologies under their charge. In response, IT leaders are seeking to leverage technologies and delivery models such as cloud, infrastructure as code, Continuous Delivery (CD), Big Data and IT Process Automation. This intersection of contemporary concepts is giving rise to new IT delivery patterns that are affecting traditional IT service management (ITSM).

From my recent experience working with Australian enterprise-sized IT organisations, it is my view that ITSM & Operations teams are not keeping pace with their peers in application development to meet current business demands. ITSM and Operations cannot remain effective and efficient if the teams continue to work in disjointed practices that are underpinned by manually executed processes and activities. To remedy this situation, ITSM & Operations teams need not only to develop the capabilities required to help their application development peers build and support an end-to-end continuous delivery pipeline, but also to move towards a more collaborative engagement model. One suggested engagement model is ChatOps.

What is ChatOps

Definitions

ChatOps, a term credited originally to Jesse Newland of GitHub, “is all about conversation-driven development. By bringing your tools into your conversations and using a chat bot modified to work with key plugins and scripts, teams can automate tasks and collaborate, working better, cheaper and faster.” (Sigler, 2014). Sigler (2014) also explains ChatOps as being one or more chat rooms where “team members type commands that the chat bot is configured to execute through custom scripts and plugins. These can range from code deployments to security event responses to team member notifications. The entire team collaborates in real-time as commands are executed.”

Chat-based conversations are not new in IT; however, ChatOps adds the extra dimension of leveraging bots to automate routine tasks, including integration with other systems and data sources to provide information and execute activities within the chat session. Fryman (2015) supports this by stating that “ChatOps is bringing in the work you are already doing, in line with the conversations you are already having”.

ChatOps and DevOps

When ChatOps is mentioned, it is often referenced as a practice of DevOps, which is not surprising as there is considerable overlap between the two. ChatOps is an enabler for DevOps because it consolidates DevOps activities into one forum. ChatOps is therefore seen as an accelerator and useful change agent for DevOps adoption. Further to this, ChatOps supports the CALMS framework for DevOps in the following ways (Chuparkoff, 2015; Fryman, 2015; Hand, 2016; Nasello, 2017; Newland, 2013):

Culture: ChatOps promotes transparency amongst all IT teams by presenting their activities in an open online forum, both in real time and through stored records of past ChatOps transactions.
Automation: ChatOps endorses the use of bots and scripts to securely automate the low value, repetitive tasks typically performed by specialised technicians.
Lean: ChatOps instruments a series of Lean principles (Liker, 2004) including:
a) Kaizen (we improve our business operations continuously, always driving for innovation and evolution),
b) Genchi Genbutsu (go to the real place where the work is done to find the facts needed to make correct decisions),
c) Teamwork (we stimulate personal and professional growth, share the opportunities of development and maximize individual and team performance),
d) Creating continuous process flow to bring problems to the surface, and
e) Using visual control so no problems are hidden.
Regan (2016) supports this by stating that the transparency tightens the feedback loop, improves information sharing, and enhances team collaboration, as well as team culture and cross-training.
Measurement: ChatOps logs provide detailed evidence of activities including continuous delivery and incident resolution. Further to this, bots can be requested to import information and measurements from other sources including service performance, availability and configuration item volumes and provide the results into the chat room for the visibility of all personnel.
Sharing: Technicians engaged in ChatOps can share experiences, commands, activities and learning, both in real time and through the review of historical ChatOps logs.

Considering all of the above observations of ChatOps and its benefits, it appears that ChatOps is the next step in the evolution of IT Service Transition and Service Operations.

Common tools

Fundamentally, the tools required to provide ChatOps consist of four major components: a chat service, integrations, automation services and bots (see Figure 1 below).

Chat Service: Group chat provides the persistent online chat conversation between the various IT teams and can be accessed through mobile devices. While most organisations employ a chat service for ad hoc exchanges between individuals, in this scenario the chat service should support one or more persistent conversations, support offline storage of past conversations and support the use of the other three major components. Some proprietary offerings include Campfire, Slack, HipChat, Flowdock and MS Teams. Open source offerings include Zulip, Jappix, Mattermost and Rocket.Chat.

Integrations: Most chat services support integration with other services and tools, especially to support key IT activities such as continuous delivery and service monitoring. These integrations can permit one-way (read only) or two-way (create, read, update, delete) actions. While most integrations are simple to turn on and off, more complex integrations may require a bot to provide an abstraction layer and hide the complexities of using these integration services. For continuous delivery, popular integrations include GitHub, Bamboo, Jenkins, Docker, Puppet, Chef, ElectricFlow and Ansible. For service monitoring, popular integrations include Nagios, New Relic, Pingdom, StatusPage, PagerDuty and VictorOps. For graphing IT asset performance, Graphite is popular, and for ticketing systems, JIRA, Zendesk, Freshdesk, Trello and Desk.com provide supported integrations.

Automation services: Similar to integrations, automation services can combine multiple actions including triggers, process sequences, transitions, conditions, and data transformation. Some also provide visibility of execution state and results. Automation services can be considered extensions of bots and their integrations, and they reach products outside of the traditional IT operations domain such as Google (Drive, Hangouts, Gmail), Dropbox, Spotify and Fitbit (to name just a few). Automation service offerings include StackStorm, Zapier and IFTTT (If This Then That).
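
To make the trigger, condition and action idea concrete, here is a minimal sketch in Python. It does not use the API of StackStorm, Zapier or IFTTT; the Rule class, the run_rules function, the event fields and the disk usage rule are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """One automation rule: a trigger name, a condition and an action."""
    trigger: str
    condition: Callable[[dict], bool]
    action: Callable[[dict], str]


def run_rules(rules, event):
    """Return the result of every rule whose trigger and condition match."""
    results = []
    for rule in rules:
        if rule.trigger == event["type"] and rule.condition(event):
            results.append(rule.action(event))
    return results


# Hypothetical rule: when disk usage passes 90%, build a chat message.
disk_rule = Rule(
    trigger="disk_usage",
    condition=lambda e: e["percent"] > 90,
    action=lambda e: f"post to #alerts: {e['host']} disk at {e['percent']}%",
)

print(run_rules([disk_rule], {"type": "disk_usage", "host": "web01", "percent": 95}))
print(run_rules([disk_rule], {"type": "disk_usage", "host": "web02", "percent": 40}))
```

In a real automation service the action would call an integration rather than return a string; returning text keeps the sketch runnable on its own.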

Bots: Bots are applications or scripts that can automate numerous tasks and activities. They provide an abstraction layer to the various integrations and can action simple or complex workflows that combine one or more activities to complete a process. A simple activity may be to answer easy questions like “Who is on-call” or “Are my servers operational”. More complex activities include automating the packaging and deployment of a new feature and, if it fails testing, automatically rolling back the deployment. Leveraging bots to manage repetitive tasks frees up specialised engineers to focus on other high value activities. Some popular open source bots include Hubot, Err and Lita. Bots can also be developed in-house and then open sourced, like Dropbox’s Securitybot. Soon after a security alert is fired, a Dropbox employee receives a message from Securitybot asking them to confirm whether they performed a potentially malicious action. Their response is then stored and later sent to the security team (Bertsch, 2017).
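
As an illustration of the simple activities described above, the following Python sketch shows how a bot might map chat commands to handlers. It is not based on Hubot, Err or Lita; the "!" command prefix, the function names and the roster and server data are assumptions made up for this example.

```python
# Hypothetical data; a real bot would fetch these through its integrations.
ON_CALL = {"database": "alice", "network": "bob"}
SERVER_STATUS = {"web01": "up", "web02": "down"}


def who_is_on_call(args):
    """Answer 'Who is on-call?' for the team named in the first argument."""
    team = args[0] if args else ""
    person = ON_CALL.get(team)
    if person is None:
        return f"no roster found for '{team}'"
    return f"{person} is on call for {team}"


def server_status(args):
    """Answer 'Are my servers operational?' by listing any server not up."""
    down = sorted(name for name, state in SERVER_STATUS.items() if state != "up")
    return "all servers operational" if not down else "down: " + ", ".join(down)


COMMANDS = {"oncall": who_is_on_call, "status": server_status}


def handle_message(text):
    """Return the bot's reply, or None when the message is not a bot command."""
    if not text.startswith("!"):
        return None  # ordinary conversation, the bot stays silent
    parts = text[1:].split()
    if not parts or parts[0] not in COMMANDS:
        return "unknown command; try: " + ", ".join(sorted(COMMANDS))
    return COMMANDS[parts[0]](parts[1:])


print(handle_message("!oncall database"))
print(handle_message("!status"))
```

Keeping each handler small and registering it in one table is one way to keep the bot easy to extend as new questions come up in the chat room.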

Use cases

By combining these four major components, ChatOps offers opportunities to greatly enhance a series of use cases in IT service delivery and operations. Some of these include:

Event & Incident Management: Events and alerts can be channelled into the persistent chat room for the attention of engineers, with or without the intervention of a bot. The bot can be designed to automatically execute a pre-configured series of resolution and recovery actions should the event meet specified criteria. Specialised operations engineers can develop bots and scripts to automate the tasks they perform so that others (e.g. developers) can complete those tasks safely and without manual intervention. If the event or incident requires human attention, a bot can page the appropriate on-call person (by integrating with the on-call roster). Should there be no response from the on-call engineer, the bot can be configured to automatically escalate to the next level of management. If an engineer is diagnosing an incident, bots can be asked to answer initial investigative questions, for example the health of a configuration item or the number of servers behind a load balancer. Incidents can be raised automatically or on request, and then assigned to the appropriate personnel via the pre-established integration with the ITSM ticketing system. Bots can review related knowledge base articles and suggest (or execute) specific standard operating procedures to resolve the incident. If required, major incident management procedures can be triggered via commands in the chat room rather than by phone, reducing investigation time. Throughout the incident, the activities and conversations undertaken by the various parties (including the bots) are visible and logged in the chat room. This allows other engineers to see and assist in the incident, and the chat thread can be stored for post-incident analysis. Conducting this activity in one chat room breaks down silos and encourages collaboration regardless of the organisational structure of the IT teams or their geographical location.
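
The paging and escalation behaviour described above can be sketched as a small simulation. This is an illustration only: the escalate function, the five minute timeout and the set of people who acknowledge are assumptions, and a real bot would call a paging integration and wait for a reply instead.

```python
def escalate(chain, acknowledged_by, timeout_minutes=5):
    """Page each person in order until one acknowledges.

    chain            ordered escalation roster, on-call engineer first
    acknowledged_by  set of people who would respond to a page (simulated)
    Returns (people_paged, responder_or_None, minutes_spent_waiting).
    """
    paged = []
    for person in chain:
        paged.append(person)
        if person in acknowledged_by:
            # Everyone paged before the responder used up one full timeout.
            return paged, person, (len(paged) - 1) * timeout_minutes
    # Nobody answered: every page timed out.
    return paged, None, len(paged) * timeout_minutes


roster = ["on-call engineer", "team lead", "operations manager"]
print(escalate(roster, acknowledged_by={"team lead"}))
print(escalate(roster, acknowledged_by=set()))
```

Because the bot posts each page into the chat room, the whole team can see who was contacted and when, which supports the post-incident analysis mentioned above.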

Problem Management: The continuous recording of both events (e.g. new changes released into production, some potentially resulting in new incidents) and conversations between engineers provides a wealth of information (data with context) to assist in proactive trend analysis of incidents.
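
As a rough illustration of this kind of trend analysis, the Python sketch below counts incidents that were logged shortly after a deployment to the same service. The log entries, the service names and the one hour window are invented sample data, and a real analysis would read them from the stored chat history.

```python
from collections import Counter
from datetime import datetime, timedelta

# Made-up sample of a chat log: (timestamp, event kind, service).
LOG = [
    ("2017-05-01 10:00", "deploy", "payments"),
    ("2017-05-01 10:20", "incident", "payments"),
    ("2017-05-02 09:00", "incident", "search"),
    ("2017-05-03 14:00", "deploy", "payments"),
    ("2017-05-03 14:45", "incident", "payments"),
]


def incidents_after_deploys(log, window=timedelta(hours=1)):
    """Count, per service, incidents raised within `window` after a deploy."""
    parsed = [(datetime.strptime(ts, "%Y-%m-%d %H:%M"), kind, svc)
              for ts, kind, svc in log]
    deploys = [(when, svc) for when, kind, svc in parsed if kind == "deploy"]
    counts = Counter()
    for when, kind, svc in parsed:
        if kind != "incident":
            continue
        # An incident is linked to a deploy of the same service if it
        # happened at or after the deploy and inside the time window.
        if any(d_svc == svc and timedelta(0) <= when - d_when <= window
               for d_when, d_svc in deploys):
            counts[svc] += 1
    return counts


print(incidents_after_deploys(LOG))
```

A service that keeps appearing in this count is a candidate for a problem record, since its incidents correlate with its releases.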

Change & Release Management: ChatOps is used extensively by organisations employing DevOps or Continuous Delivery (CD) due to its ability to integrate with numerous popular CD tools and therefore automate a seamless, repeatable and safe deployment pipeline. Bots can be configured not only to report on the health of build branches but also to execute all stages of the CD process, freeing up engineers to focus on higher priorities. Security for preventing unauthorised changes from being released can be provided by two methods: a) bots can initiate personal multi-factor authentication with engineers who trigger the build process, and/or b) upon request for a new build, bots can integrate with an on-call or escalation roster and page another (typically more senior) engineer to approve the request for change. Should the new build induce an incident, the bot can initiate a roll-back/roll-forward as required. The employment of bots to support CD is one of the most popular use cases in the IT industry today.
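
Method b) above, where a second engineer must approve a build, could be sketched as follows. The DeployRequest class, the APPROVERS set and the user names are hypothetical, and the sketch leaves out multi-factor authentication and the actual deployment call.

```python
from dataclasses import dataclass, field


@dataclass
class DeployRequest:
    """A request to release a build, plus the approvals it has gathered."""
    build: str
    requested_by: str
    approvals: set = field(default_factory=set)


# Hypothetical list of senior engineers allowed to approve changes.
APPROVERS = {"carol", "dave"}


def approve(request, user):
    """Record an approval; reject self-approval and non-approvers."""
    if user == request.requested_by:
        return "rejected: requesters cannot approve their own build"
    if user not in APPROVERS:
        return f"rejected: {user} is not an approver"
    request.approvals.add(user)
    return f"approved by {user}"


def can_deploy(request):
    """A build may be released once at least one valid approval exists."""
    return len(request.approvals) >= 1


request = DeployRequest(build="build-42", requested_by="carol")
print(approve(request, "carol"), can_deploy(request))
print(approve(request, "dave"), can_deploy(request))
```

Because both the request and the approval happen as chat messages, the approval trail is recorded in the same log as the deployment itself.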

Knowledge Management: In its most basic form, ChatOps improves collaboration and communication by providing visibility of the organisation’s live operations, regardless of organisational size. By employing integrations, service delivery and operations information that is normally hidden in various segregated systems (e.g. ITSM/ticketing systems, infrastructure provisioning tools, monitoring systems, organisational intranets and social media) becomes visible in one place. This visibility accelerates new employee induction training to the point where leading organisations enable new developers to deploy code on their first day of work, which provides immediate gratification to new employees. Further to this, existing personnel can leverage ChatOps to learn new skills and techniques or request assistance by engaging with other ChatOps participants. Bots can enhance knowledge management by being configured by specialised engineers to answer basic questions from other staff on their behalf. This also reduces the opportunities for the specialised engineers to be distracted and avoids the hidden overhead of context switching.

The use cases above have been restricted to IT service delivery; however, as ChatOps continues to mature within an organisation, it is possible that this capability can be extended to other areas of the business including customer contact/service desks, marketing teams and business operations.

What benefits can ChatOps provide

Based on the above use cases, you may have already derived some of the benefits, including the extensive sharing of information, shorter feedback loops for engineers and the automation of tasks, which allows engineers to focus on high-value activities. Hewlett Packard Enterprise’s DevOps Survey in 2016 found that 'informal' DevOps leaders (those without a formal implementation) reported that collaboration (e.g. ChatOps) greatly enhanced their service transition and support (Perez, 2016).
Hand (2016, p5) groups the benefits of ChatOps into two streams, as listed below.

Social: Increased collaboration; increased sharing of domain knowledge; increased visibility and awareness; enhanced learning; improved empathy.
Technical: Increased automation; increased speed of actions and executed commands; improved security and safety; automatic logging of conversations and actions; synchronous communication; reduction in email.

Further to this, Chuparkoff (2015) found that ChatOps supports spontaneous collaboration and new patterns of working between teams that have not previously worked together. ChatOps transcends organisational structures and geographies, and can bring internal teams and potentially external teams (e.g. suppliers) into conversation-driven IT delivery and support. Newland (2013) stated that if the chat service is securely available on mobile devices, engineers can continue to deliver new features and resolve incidents remotely, and therefore organisations can better support flexible working arrangements for their staff. Provided that teams continuously review and improve their ChatOps (including maintenance of their bots), ChatOps has been described as “being like a wiki that never goes out of date” (Wallgren, 2017). Governance, Risk and Security Management are also supported, as ChatOps logs provide a wealth of information including what IT service delivery and support actions were taken, by whom, what conversations were held and decisions made, and which IT assets were involved.

What are the high level implementation steps and considerations

Implementing ChatOps will differ from organisation to organisation for a variety of reasons; however, here are five key considerations for an implementation plan:

Start small and iterate often: Better practice is to design an implementation approach where you start small, keep it simple and improve iteratively. Employ regular retrospectives with chat room attendees to identify and action improvements (Ansel, 2015). Implement the small wins and then extend the capability, but be mindful to manage the cultural change that will occur.
Implement a chat service: Seek to keep your ChatOps implementation tool-agnostic so you are not restricted in the improvements you want to implement. ChatOps can become distracting, so start by showing only critical alerts to avoid alert fatigue. Establish different chat rooms for specific purposes such as Alerts, Deployments, Operations and Support/Help. Other room suggestions include a chat room for team stand-ups and another chat room for fun stuff or letting team members vent. The number of ChatOps rooms should be small (e.g. 6-10 rooms). Establish a process to log the conversations into a data repository that allows integration and easy searching of past transactions.
Develop a communication and integration framework: Similar to communications in critical services like air traffic control, military and emergency services, you should develop a standardised set of communication protocols for all ChatOps participants including bots. Consider the language conventions and structure in your organisation, and think about the language and acronyms that your organisation likes to use (Hand, 2016). Establish clear guidance on what information should be shared to reduce noise and keep it simple. Integrations should be planned and implemented in an iterative manner, and it may not be necessary to integrate with every system. Start with a small number of highly desired services/data sources (e.g. event and incident management tools, ITSM/ticketing system), then seek engineer feedback to identify and implement the next integration source. As your ChatOps matures, expand your connections to include service providers and other third parties, underpinned by reliable, secure integrations. Remember, integrations require ongoing maintenance, so ensure that your maintenance capabilities can grow (or become more efficient) to effectively support your suite of integrations.
Build your Bots: At their core, bots are applications/scripts, so your considerations for implementation will be the same as those for developing a suite of new applications. Bots should be built with the first feature of being able to explain what they do. Help documentation should be the first consideration and not the last. Keep the bots lightweight, easy to maintain and designed with a single purpose or a narrow scope. Design bots with security in mind (Fell, 2017); in particular, consider who can query a bot for information as opposed to who can command a bot to execute an activity. For the bot, consider what level of permissions the bot should possess to be deemed useful and successful in its role. If you are planning to engage with external bots, consult with your IT security team first. Let those engineers who regularly engage with the bot have a say in what it can do. Give your bot character and a sense of humour, as this will make the bot more engaging and work much more fun (Nasello, 2017). Seek to reuse your bot code as much as possible.
Train and support new ChatOps participants: ChatOps is an alternative way of working and combines the real time communication and interactions between many people and bots. To be successful and avoid confusion and noise, new participants should receive training and coaching on this new way of working. Employing ChatOps effectively should see a reduction in other collaboration mediums such as email and phone conferences and therefore IT teams may need assistance in adjusting.
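
Two of the bot considerations above, help as the first feature and separating who can query from who can command, can be sketched in a few lines of Python. The command table, role names and users below are hypothetical and only illustrate the idea; a real bot would take roles from your identity or directory service.

```python
# Hypothetical command table: each command documents itself and declares
# whether it only reads information ("query") or changes something ("command").
BOT_COMMANDS = {
    "status": {"help": "show service health", "kind": "query"},
    "restart": {"help": "restart a service", "kind": "command"},
}

# Hypothetical roles and the kinds of command each role may use.
ROLES = {"alice": "operator", "bob": "viewer"}
ALLOWED_KINDS = {"viewer": {"query"}, "operator": {"query", "command"}}


def help_text():
    """Build the help message from the command table so it never goes stale."""
    return "\n".join(
        f"{name}: {spec['help']} ({spec['kind']})"
        for name, spec in sorted(BOT_COMMANDS.items())
    )


def authorise(user, command):
    """Return True only if the user's role may run this kind of command."""
    spec = BOT_COMMANDS.get(command)
    if spec is None:
        return False  # unknown commands are never authorised
    role = ROLES.get(user)
    return spec["kind"] in ALLOWED_KINDS.get(role, set())


print(help_text())
print(authorise("bob", "status"), authorise("bob", "restart"))
print(authorise("alice", "restart"))
```

Generating the help text from the same table that drives authorisation means a new command cannot be added without also documenting it.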

In summary, ITSM and Operations can remain effective and efficient if they evolve their practices in alignment with their peers in application development. ChatOps offers a new collaborative engagement model that facilitates the rapid exchange of knowledge and experience needed to assist in this uplift while at the same time, supporting better, faster and more secure methods of releasing and supporting new features into production.

References

Ansel, M. (2015). Don't Let ChatOps Become ChatOops. Retrieved May 11, 2017 from https://youtu.be/18PNEwt18P0
Bertsch, A. (2017). Meet Securitybot: Open Sourcing Automated Security at Scale. Retrieved June 3, 2017 from https://blogs.dropbox.com/tech/2017/02/meet-securitybot-open-sourcing-automated-security-at-scale/
Chuparkoff, D. (2015). ChatOps FTW! Tips and Best Practices. Retrieved May 9, 2017 from https://youtu.be/nmeVtTYxH2Q
Fell, S. (2017). Are You Talking To Me? ChatOps and the Rise of Conversation-Driven DevOps. Retrieved May 10, 2017 from https://electric-cloud.com/blog/2017/03/talking-chatops-rise-conversation-driven-devops/
Fryman, J. (2015). ChatOps: Technology and Philosophy. Retrieved May 15, 2017 from https://www.youtube.com/watch?v=IhzxnY7FIvg
Hand, J. (2016). ChatOps: Managing Operations in Group Chat. O’Reilly Media, Inc. ISBN 978-1-491-96230-5
Hand, J. (2016). ChatOps "Infrastructure As Conversation" - OSCON 2016. Retrieved May 11, 2017 from https://youtu.be/zHHySxYf-3s
Liker, J. (2004). The Toyota Way. McGraw-Hill. ISBN 0-07-139231-9
Nasello, S. (2017). ChatOps as Change Agent - DevOpsDays Rockies. Retrieved May 9, 2017 from https://www.youtube.com/watch?v=BOcVV4DXnSg
Newland, J. (2013). ChatOps at GitHub. Retrieved June 3, 2017 from https://youtu.be/NST3u-GjjFw
Perez, D. (2016). Doubling down on ChatOps in the Enterprise - DOES16 San Francisco. Retrieved May 19, 2017 from https://youtu.be/9HDKOHUZwzk
Regan, S. (2016). What is ChatOps? A guide to its evolution, adoption, and significance. Retrieved May 14, 2017 from https://www.atlassian.com/blog/software-teams/what-is-chatops-adoption-guide
Sigler, E. (2014). So, What is ChatOps? And How do I Get Started? Retrieved May 15, 2017 from https://www.pagerduty.com/blog/what-is-chatops/
Wallgren, A. (2017). Episode 65: ChatOps and DevOps - Electric Cloud - Continuous Discussions (#c9d9) Podcast. Retrieved May 21, 2017 from https://electric-cloud.com/blog/2017/03/continuous-discussions-c9d9-podcast-episode-65-chatops-devops/
