GARAGE_METHOD.txt

Garage Methodology


Cloud is more than just a set of technologies. You can’t separate using cloud technology from the organizational, cultural, and process transformation necessary to make use of the technologies to drive your desired business outcomes.

As you modernize your existing applications, move to the cloud, and start to develop new cloud-native applications, it can be challenging to know where to start. What areas can you improve upon and what new ways of working do you want to adopt? To meet that challenge, you need a prescriptive approach, based on industry best practices and experience in cloud and transformation, that guides you through the changes needed to take best advantage of the cloud, modernize your existing applications, and develop new cloud-native apps. The Garage Methodology provides that approach.

The method is rooted in well-recognized and established practices and provides guidance showing the steps you need to take and the tools you need to use to achieve specific outcomes. It provides a holistic point of view that is distinctive in the cloud marketplace.

These practices are easy to consume, complete, prescriptive, and open and extensible. The practices are grouped into seven categories that represent the complete transformation lifecycle for cloud with enhancements made to support wider applicability and scale.

The result is a method that is proven, aligned with industry best practices, and scalable to any size of engagement.


The method in action


Most journeys to the cloud conclude in a slightly different place than originally envisioned. The prescriptive approach defined in the Garage Methodology guides you on that journey.

Culture. Transform your organization by combining business, technology, and process innovations that help you create teams that quickly learn from market experiences.

Discover. Dig deep into your problem domain, align everyone on common goals and business outcomes, and identify potential problems and bottlenecks.

Envision. Incrementally deliver awesome apps by using Enterprise Design Thinking and related design practices to establish a repeatable approach to rapidly deliver innovative user experiences. Gain an understanding of your users’ needs and frustrations, then define the minimum viable product (MVP) needed for your users to have a delightful experience. An MVP helps you to understand the issues, resolve the risks, and begin to experience the benefits that can be derived from the journey to the cloud. MVPs can be delivered for any entry point: from building a small cloud-native application to show something really quickly, to migrating an application to the cloud, to showing how you can implement Site Reliability Engineering.

Develop. Using test-driven development and pair programming, produce high-quality code that you can confidently deliver to production. Accelerate time-to-market by using continuous integration, continuous delivery, and automation to deliver in a fully tested production environment. Adopt a build-to-manage approach that enables your developers to instrument your application and provide manageability aspects from the beginning.

Reason. Build a solid information architecture that enables you to turn data into knowledge. Develop analytic models using machine learning approaches. Integrate AI into your solutions and into the execution of the method practices.

Operate. Ensure operational excellence with continuous application monitoring, high availability, and fast recovery practices that expedite problem identification and resolution.

Learn. Continuously experiment by testing hypotheses, using clear measurements to inform decisions, and driving your findings into the backlog so that you can pivot to react to customer feedback and requirements.

The practices that you rely on during the delivery of your first MVP can be expanded and used in the next MVP for the same project and across your enterprise. Likewise, a simplified migration of a single application uses the same tooling and processes, such as application affinity analysis, that will scale to thousands of VMs. The key is that everyone shares the same vision, the same roadmap and workflows, and the same prescriptive guidance through the practices.


Transform and innovate with speed

Culture is key to the success of an agile transformation. Your culture must support small, colocated teams that are autonomous and that can make decisions that are based on efficiency and knowledge. Your squads have a diverse set of skills that support your transformation and enable the team to pivot in response to market feedback.

Why change?

Does your team still use the waterfall method of development, completing one or two major releases each year? Or maybe you adopted agile principles and removed a few barriers but still have silos between development and the operations teams that run the software in production.

The shift to continuous delivery might seem daunting, but it's possible, especially if you focus on shifting the culture and mindset in your team, and ultimately, your organization.

One key to culture change is adopting the startup mindset. To move fast, startups eliminate the barriers between innovative ideas and production. They aren’t afraid of redefining everything from development practices to operations, from testing to production, and from tools to management. A radical transformation requires a culture that is focused on innovation.

Cultural evolution enables key business advantages. When you shift to small, empowered teams with the right organizational roles (including product owner, design, and DevOps), team members can gain a deep understanding of the customers who use their products. Each team can deliver new and improved features that the customer wants and change course based on feedback. Pivoting with speed isn't possible when you deliver quarterly or annually.

What changes?

Change is challenging. Overhauling the culture of a team and then expanding that culture shift to an enterprise starts with team organization.


Organize the teams

It used to be common for teams to be hierarchical and spend much of their time creating status charts and communicating their status through the organization. In recent years, innovative companies defined new ways of organizing software development teams.

With the introduction of autonomous, colocated squads, the way that status is collected and communicated changed. Teams have the power to do their jobs and determine the best way to get their work done. As a result, roadblocks and wasteful work are eliminated.

It's ideal to have the team be colocated. Colocation streamlines communication and encourages the team to build a sense of camaraderie. When colocation isn't possible, the team must include all members in the daily standup meeting and in playbacks.

Another key to building a successful team is diversity. A team needs members with various skills, including designers, developers, and testers. A team also needs members who vary in style. You need people who jump in and try things first, people who think things all the way through before they start, and people who think differently than everyone else.

In agile development, a common set of organizational roles exists. Each role has specific responsibilities. Everyone on the team has some level of autonomy, and it's best to create an environment where extensive meetings and consensus building aren't required.

When a team is changing its culture, it takes time to adjust to the different roles and their boundaries. Team members adopt roles and adjust responsibilities as needed.

Understand the fail-fast philosophy

Teams must recognize that obstacles are often in the way of getting the right idea. The key is failing fast. Experiment with an idea to see whether it's what the customer wants. Implement only what is necessary to prove the idea, get customer feedback, and learn whether the idea is successful. Even if the idea fails, the team learns from the experience.

Adopting the fail-fast mentality doesn't mean a lack of fun. People who work in a fun environment enjoy their work more and produce better results. A little laughter goes a long way!

Adopt the methodology

Transformation to continuous delivery starts with understanding agile principles:

    Satisfying the customer through early and continuous delivery

    Delivering new function on a short timescale that the team agrees to

    Fostering communications across the team through daily standup meetings and playbacks

    Developing at a sustainable pace

    Reflecting, at regular intervals, on how the team can improve

Depending on how your team wants to organize and work, you can combine agile principles with aspects of Lean development. For example, your team might deliver time-boxed iterations, as prescribed by an agile approach, or it might work in a Kanban approach. The important thing is that the team decides what works best.

Expand the new culture to the entire enterprise

When you're transforming at the enterprise level, establish a center of competency (COC) to educate teams about new practices and tools and to address any problems. A COC is an independent body that consists of representatives from all areas that are affected by the changes. The COC develops common solutions and acquires new skills that are then adopted throughout the enterprise.

Boot camps are a common way to train teams in an organization that is transforming. In a boot camp, a team can acquire deep skills in new technologies and processes in a short period.

Communicate efficiently

Start each day with a daily standup meeting to ensure that the team knows what work is underway and whether any blocking issues exist.

While the daily standup meeting aligns everyone once a day, team members need to stay in constant communication. To enable constant communication, use a collaboration tool, such as Slack. Ideally, a collaboration tool has these capabilities:

    Allows both communication in real time and the option to leave a message if someone is away

    Stores a history of messages between one or more team members

    Allows for the creation of groups so that you can share information across an entire team

Throughout the cultural transformation, make sure that your team stays excited about the change and views it in a positive way. 


Build a DevOps culture and squads


The Garage Methodology, including DevOps, is a cultural movement; it's all about people. An organization might adopt the most efficient processes or automated tools possible, but they're useless without the people who run the processes and use the tools. Therefore, building a culture is at the core of adopting the Methodology.

Build culture

A DevOps culture is characterized by collaboration across roles, focus on business instead of departmental objectives, trust, and value placed on learning through experimentation. Building a culture isn't like adopting a process or a tool. It requires the social engineering of a squad of people, each with unique predispositions, experiences, and biases. This diversity can make culture-building challenging.


Improve velocity and quality by adopting a DevOps culture.


Building a DevOps culture requires the leaders of the organization to work with their squads to create an environment and culture of collaboration and sharing. Leaders must remove any self-imposed barriers to cooperation. Typically, operations professionals are rewarded for uptime and stability, and developers are rewarded for delivering new features. As a result, those groups are set against each other. For example, operations professionals know that the best protection for production is to accept no changes, and developers are encouraged to provide new functions quickly, sometimes at the expense of quality. In a DevOps culture, squads have a shared responsibility to deliver new capabilities quickly and safely.

The leaders of the organization must further encourage collaboration by improving visibility. Establishing a common set of collaboration tools is essential. Build trust and collaboration by giving all stakeholders visibility into a project's goals through agile planning discussions and playback meetings.

Sometimes, building a DevOps culture requires people to change. People who are unwilling to change, that is, to adopt the DevOps culture, might need to be reassigned.


Build a squad


The most important thing to remember when you build a squad is that a squad in the most basic sense is a group of people who work together to accomplish a common goal. In a DevOps squad, the goal is usually the delivery of a product or a microservice that is part of a product.

What makes a squad successful? Think about the squads that you've worked with. Several might stand out as favorites. Successful squads demonstrate a few key behaviors.

First, successful squads communicate well. They don't always agree, but everyone contributes. Each person respects the ideas and opinions of other squad members, and when the squad gets together, squad members can freely discuss ideas without regard to hierarchy.

Remember that squads are made of people. Pretending that life doesn't impact work isn't realistic. Often, you spend more time with your co-workers than with family and friends. Be sure to consider how your squad members feel on a personal level. Although the squad has a goal to accomplish, squad members shouldn't feel like cogs in a machine. On a successful squad, squad members can discuss both good and bad topics and maybe even have fun and develop friendships.

Finally, having a common goal is key to the success of a squad. Every member of the squad must understand that goal. To understand your squad's common goal, use Enterprise Design Thinking to define a minimum viable product (MVP), the infrastructure to build the MVP, and the related rank-ordered backlog of stories. By planning work from the backlog and participating in playbacks, everyone stays in sync as the squad moves toward its goal.


Squad characteristics


When you build a squad, several more characteristics are critical to its success as it strives to adopt DevOps in the Method:

    Diversity
    Autonomy
    Colocation
    Productivity
    Transparency
    Blameless root-cause analysis
    Peer recognition
    Fun


Diversity

The following content is based on "Building diverse teams" by Adrian Cho.

Diversity is an essential characteristic of a healthy, resilient, high-performance squad. People tend to hire and promote others who think like they do; it's natural to want to be around people who are like you. However, to successfully innovate at scale, squads must be able to run at full speed and pivot without tripping. Without diversity, squads might sink into groupthink and are less likely to pivot when they should. When you build a squad, consider three aspects of diversity: diversity of skills, diversity of style, and diversity of thinking.

A common way to think about diversity of skills is to think of multi-disciplined squads that bring together designers, developers, testers, operations, and so forth. Increasingly, though, many organizations are now building squads of full-stack developers where each person must be multi-disciplined.

Achieving diversity of mindset and culture is harder than achieving diversity of skill. In any squad activity, some people tend to move quickly. Others are cautious, waiting for others to move first. If too many people in a squad rush forward, the squad will take unnecessary risks. If most of the squad tends to wait, the squad won't be competitive enough. A well-balanced squad must have a mix of both styles.

When it comes to getting things done, some people are good at starting tasks and others are good at finishing them. Without strong starters, a squad is slow to build momentum. Without strong finishers, the squad might never achieve its goals. A mix of both styles ensures a high degree of collective productivity.

When someone on your squad thinks differently than everyone else, it can be easy to reject that person's opinion. However, think of it this way: the squad is a box. If thinking outside the box powers innovation, embracing the outliers can be the best path to success.

While diversity is important, the size of a squad must remain small. Ideally, a squad's size follows the two-pizza rule and is no larger than 10 people.


Autonomy

The following content is based on "Autonomous, colocated squads" by Scott Will.

What does it mean to be an autonomous squad? It doesn't mean that a squad can do whatever it wants to. Squads should always be working on items that are in alignment with the overall goals of the project. For example, when the project needs the squad to complete a microservice so that the offering can go live, the squad doesn't get to say, "Let's build another version of Solitaire instead." However, the squad does get to determine the way that they will complete the work on the microservice.

In autonomous squads, "autonomous" means that the squads are responsible for figuring out how best to do the work that needs to be done. Autonomous squads get to make these kinds of decisions for themselves:

    Should we adopt pair-programming, or use formal code reviews?

    Should two people pair together for an entire week, or should we switch pair assignments daily?

    Should Sally work on the GUI, or should we let Steve, the new guy on the squad, work on the GUI for this story?

    Should we all plan to show up at 8:30 AM so we can begin our day together, or should people come in whenever they want to?


Colocation

Most of the time, it's preferable to have a face-to-face conversation with a subject-matter expert rather than read an article. When squad members are colocated, they spend less time writing and reading emails, sitting on long calls trying to convey ideas, and worrying about network outages that keep squad members from checking in code from remote locations. In colocated squads, time-zone problems don't exist. Colocation improves both communication and efficiency.

Colocated squads might be together in a shared, open space. Although having squad members with their own offices in the same building is better than having people scattered all over the place, the goal is to have everyone in the same room so that the squad can build the synergy to improve productivity and morale.

Overcome the obstacles to colocation

Unfortunately, colocation is not always easy to accomplish. The following problems are common obstacles that squads face.

Problem: The squad already has members who are remote. If remote squad members are forced to move to another city, several might quit.

Suggestion: Typically, squad members are remote from each other because of skills issues: the right mix of skills is not available at one location. To address this problem, the long-term goal is to create the right skills mix at each location. Obviously, this goal takes time and involves folks learning new skills. In the short term, the remote employees with the needed skills can train employees who are local. Don't forget: the remote employees will also likely need to learn new skills so that they can become part of a squad that is local to them.

Problem: There is not enough open space to bring squad members together in one place.

Suggestion: Squads have solved this problem in different ways. One squad took over a conference room. Another squad went to one of their other buildings that had a modular floor plan and several unoccupied offices and formed their own squad area. Another squad pursued funding to rent space at a nearby office building that was looking for tenants.

Problem: One of the squad members lives in a city with no option to join another squad; that member is the only employee who lives there. The squad member doesn't want to move. The member is also in a time zone that is 5 hours behind the rest of the squad.

Suggestion: In this case, the best solution depends on the squad. To reach a decision, the squad might consider these suggestions:

    Try to find times to work with the remote member so that no one is inconvenienced all the time. For example, if the squad has an 8:00 AM call, move it to noon. Attending a call during your lunch hour is inconvenient, but it's better than forcing the remote squad member to attend a call when it is 3:00 AM in that member's time zone.

    Experiment with remote pair programming.

    Depending on the budget, the remote employee can travel to the location where the rest of the squad is on a semi-frequent basis.
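The scheduling suggestion above can be checked with a few lines of code. The following is a minimal sketch, assuming the remote member is 5 hours behind the squad and that fixed UTC offsets are good enough (they ignore daylight saving time). The names SQUAD_TZ, REMOTE_TZ, remote_local_time, and is_reasonable, the calendar date, and the 7:00 to 19:00 "reasonable hours" window are all hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets, with the remote member 5 hours behind the squad.
SQUAD_TZ = timezone(timedelta(hours=0))
REMOTE_TZ = timezone(timedelta(hours=-5))


def remote_local_time(squad_hour: int, squad_minute: int = 0) -> str:
    """Return the remote member's wall-clock time for a squad-local meeting time."""
    # The date is arbitrary; only the time of day matters with fixed offsets.
    meeting = datetime(2024, 1, 15, squad_hour, squad_minute, tzinfo=SQUAD_TZ)
    return meeting.astimezone(REMOTE_TZ).strftime("%H:%M")


def is_reasonable(squad_hour: int, earliest: int = 7, latest: int = 19) -> bool:
    """True if the meeting starts within [earliest, latest) for both locations."""
    remote_hour = int(remote_local_time(squad_hour)[:2])
    return earliest <= squad_hour < latest and earliest <= remote_hour < latest


if __name__ == "__main__":
    for hour in (8, 12):
        print(f"squad {hour:02d}:00 -> remote {remote_local_time(hour)}, "
              f"reasonable for both: {is_reasonable(hour)}")
```

With these assumptions, the 8:00 AM call lands at 03:00 for the remote member, while the noon call lands at 07:00, which matches the example in the list above.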

If squads are motivated to gain the benefits that colocation offers, they will come up with the ideas to make it happen.

Productivity

The following content is based on "Minimizing distractions" by Jan Acosta.

All too often, people reach the end of the day and realize that they haven't accomplished even a small percentage of what they set out to do. Why? Email, interruptions from coworkers, meetings, or a barrage of conference calls are often to blame. Minimizing distractions is a key cultural principle and one that many squads struggle with.

Minimizing distractions focuses on looking at how squads spend their time and then empowering the squad members and their managers to act to reduce the noise, thus allowing them to focus on the critical tasks at hand. The benefits are enormous. Squads report higher job satisfaction because they can focus on doing what they love to do. Productivity is higher because developers can get much more done than they would if they were dealing with constant interruptions. The quality of the code is higher as well, because interruptions that derail a developer's train of thought lead to coding errors and omissions.

How do you minimize distractions? Consider these tips as a starting point:

    Decline meetings that are not essential.

    Limit meeting attendance to the smallest number of participants possible.

    Block time on your calendar to do uninterrupted work. Start with a two- or four-hour block each day. No email, no calls, no getting pulled into discussions; this time is reserved so that you can focus on the tasks you must accomplish that day.

    Schedule meetings for the minimum amount of time needed. If you schedule a call for an hour, then you're more likely to spend the hour talking. Challenge yourself and others to see whether you can cover the same material in 30 minutes, or even better, 15 minutes.

    Designate an "interrupt pair" for the squad each day. This pair is responsible for handling questions from other squads, attending meetings, and handling any emergencies that might arise. Then, the rest of the pairs on the squad are free to focus on their tasks.

    Seek support from your management. Managers want their squads to be as effective, efficient, and happy as possible. If you're struggling with interruptions, or things that take you away from your tasks, lean on your manager for help with blocking out the noise so you can have more time without interruptions.
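One way to designate the daily "interrupt pair" fairly is a simple rotation. The following is a minimal sketch, not a prescribed part of the method: the function name interrupt_pair_for_day and the member names are hypothetical, and the rotation walks through the squad list two members at a time so that the duty is spread evenly.

```python
def interrupt_pair_for_day(members: list[str], day: int) -> tuple[str, str]:
    """Pick the interrupt pair for a given day index (0, 1, 2, ...).

    The rotation advances two positions per day and wraps around, so every
    member takes a turn before anyone repeats more often than the others.
    """
    if len(members) < 2:
        raise ValueError("need at least two squad members to form a pair")
    first = (2 * day) % len(members)
    second = (first + 1) % len(members)
    return members[first], members[second]


if __name__ == "__main__":
    squad = ["Ana", "Ben", "Chi", "Dev", "Eli"]  # hypothetical squad
    for day in range(4):
        print(day, interrupt_pair_for_day(squad, day))
```

A squad might prefer to pick pairs by hand, for example to match the pair's skills to the questions that it expects that day; the rotation is only a default.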

Transparency

Transparency in the workplace means operating in a way that makes it easy for everyone in your organization to see what's going on. Transparency creates trust across virtual squads and dissolves the boundaries between them. Make sure that your squads operate transparently in these areas:

    Code: Everyone in your extended squad must have access to source code. When you define the scope of your organization, consider secure coding practices.
    Backlog: Everyone in your extended squad must have access to functional and nonfunctional requirements and their prioritization. By providing details on your decision-making process, you can get the support that you need from the members of your extended squad.
    Metrics: Your extended squads must have access to availability and metrics data. Depending on what you're working on, you might also need to ensure that the consumers of your services can access that data.
    Incident investigation and problem management: Document information about incidents and lessons learned. Make that information available so that your squad and others can benefit from your experience.

Establishing trust through transparency is key to successful teamwork.

Blameless root-cause analysis

Things go wrong. People make mistakes. You need an environment where squad members can share lessons learned to prevent others from making the same mistakes. To create that environment, address any disincentives, such as fear of punishment or reprimands.

Peer recognition

The following content is based on "Peer recognition" by Donna Fortune and Carlton Mason.

People love recognition; it's part of human nature. Recognition builds self-confidence and is an intrinsic workplace motivator.

Different people prefer being recognized in different ways: some people are not fond of public recognition while others are. However, almost everyone appreciates peer recognition, and it often has more impact than recognition from leaders or managers. Whether it is for creating outstanding code, fixing a complex bug, or completing an otherwise undesirable challenge, knowing that your peers recognize and appreciate your work feels good.

Peer recognition is a symptom of a healthy squad. Healthy squads are typically far more engaged, collaborative, innovative, and productive.

Establishing a culture of meritocracy, where employees are recognized for their achievements, dissolves the formal hierarchy of an organization. Meritocracies thrive on trust, transparency, and consistency. The silos and political barriers that prevent the sharing of information, balancing workloads, and responding to inevitable failures are broken down. Peer recognition promotes teaming, where once it was "us vs. them." Talent and leadership are recognized within the squad, and the people who are recognized naturally assume greater responsibility within the squad and for the organization's success.

This emergent leadership is prevalent in many open source communities where the engineers who are actively engaged in contributing to and improving the project can gain committer status. The value of your contributions, the mastery of your craft, and your technical credibility, prowess, and knowledge are more important than a title or positional authority.

DevOps cultures thrive instead of stalling, even when faced with disruptions. Fostering peer recognition is an excellent way to enhance a squad's culture.

Fun

When employees have fun in the workplace, they enjoy their work and produce better results. Managers in DevOps environments strive to create an atmosphere that is challenging, creative, and fun for employees and for themselves. For more information about how to create a great work environment, see Fun in the workplace.

Create a social contract

Many squads use a social contract to document their decisions about how to behave and interact. A social contract is a squad-designed agreement for an aspirational set of values, behaviors, and social norms. Think of it as a vision for working on an incredibly safe and powerful squad.

The social contract identifies dysfunctional behaviors and addresses them quickly to mitigate long-term damage. Anyone on the squad can and must enforce the contract by identifying deviations, as people invariably forget agreements over time.

Follow these tips to create a successful social contract for your squad:

    Gather all squad members to create the contract. Use a facilitator to ensure that all perspectives are heard.

    Make sure that the facilitator asks many questions to encourage conversation: What do we value? What's important? What would make this squad powerful? What can we count on from one another? Think about negative experiences you had on projects and identify ways to avoid those problems in the future.

    Allow the participants to voice their thoughts by using sticky notes, either literally or virtually. Give participants 15 to 20 minutes to record their thoughts.

    Group similar ideas into an affinity-type map.

    Prioritize the top 5 to 10 groups and agree on a group label. Those labels become the elements of your social contract.
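If the sticky notes are collected virtually, the grouping and prioritizing steps above can be sketched in code. This is only an illustration under stated assumptions: the function name top_groups is hypothetical, each note is assumed to have already been given a group label by the squad, and ranking groups by the number of notes is just one possible stand-in for however the squad chooses to prioritize (for example, by voting).

```python
from collections import defaultdict


def top_groups(notes: list[tuple[str, str]], limit: int = 10) -> list[tuple[str, list[str]]]:
    """Group (label, note) pairs by label and return the largest groups first.

    Ties are broken alphabetically by label so that the result is stable.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for label, text in notes:
        groups[label].append(text)
    ranked = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
    return ranked[:limit]


if __name__ == "__main__":
    # Hypothetical sticky notes, already labeled during the affinity exercise.
    notes = [
        ("Trust", "Assume good intent"),
        ("Respect", "Listen before responding"),
        ("Trust", "Say when you are blocked"),
        ("Fun", "Celebrate small wins"),
        ("Respect", "Be on time for standup"),
        ("Trust", "Keep commitments visible"),
    ]
    for label, items in top_groups(notes, limit=2):
        print(label, len(items))
```

The labels of the returned groups would then become the candidate elements of the social contract.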

Squad organization

Each autonomous squad must fit into a larger organization. Spotify defined common terminology that is used in the industry to describe squads and combinations of squads.

Tribes

While autonomous squads must be able to do their work their way, they also fit in a larger organization. Typically, delivering an enterprise-grade application requires work from multiple squads, each of which is responsible for a microservice. When those microservices are combined, the entire product is created.

Spotify uses the word tribe to describe a set of squads, and people from disciplines such as Marketing and Finance, that are aligned around the goal of delivering a product or service.

Guilds

Autonomous squads consist of diverse squad members who have a wide variety of skills. Sometimes, it's important for people who share a common skill to discuss ideas and solve problems within their specialty. Guilds gather people from multiple squads around a common discipline.

For example, each squad has one or two people who are familiar with the continuous delivery tools that are used to build, deploy, and manage the product that the squad is delivering. A Continuous Delivery guild brings together people who do that job from each squad. The guild drives best practices in continuous delivery and acts as a forum where people who are struggling with a problem can find answers from fellow guild members.

Squad leadership

In a self-managing, cross-functional squad, everyone is a leader of some sort at some point. What does it mean to serve your squad as a leader? How do you know whether you're a good leader? In an autonomous squad of 10, each person has plenty of leadership opportunities.

    Product ownership: Each squad must have one person who is defined as the product owner. This person is responsible for understanding the product that is being delivered. The product owner must ensure that work is represented in the rank-ordered backlog and must set the priorities so that the squad knows what it needs to deliver.

    Technical leadership: On a 10-person squad that is responsible for delivering a product, each person has a unique set of skills that he or she uses to reach the common goal. Leadership in this respect is not reporting status, but being the best at a skill and using that skill to help the squad succeed.

    Coordination and status reporting: The goal of each autonomous squad is to spend as little time on coordination and status-reporting as possible. However, those tasks still must be done. Squads can strive to minimize the effort on those tasks by using playbacks to convey status to management and by using the rank-ordered backlog to surface plans for upcoming work.
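The rank-ordered backlog mentioned in the roles above can be pictured as a simple ordered list. The following is a minimal sketch under stated assumptions, not a description of any particular tracking tool: the Story and Backlog classes, the estimate field, and the rule that planning stops at the first story that does not fit are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Story:
    title: str
    estimate: int  # hypothetical effort estimate, for example story points


@dataclass
class Backlog:
    # Index 0 holds the highest-ranked story.
    stories: list[Story] = field(default_factory=list)

    def add(self, story: Story, rank: Optional[int] = None) -> None:
        """Insert a story at the given rank, or at the bottom if no rank is given."""
        if rank is None:
            self.stories.append(story)
        else:
            self.stories.insert(rank, story)

    def plan(self, capacity: int) -> list[Story]:
        """Take stories strictly from the top until the next one no longer fits."""
        planned = []
        for story in self.stories:
            if story.estimate > capacity:
                break  # respect rank order: do not skip ahead to smaller stories
            planned.append(story)
            capacity -= story.estimate
        return planned


if __name__ == "__main__":
    backlog = Backlog()
    backlog.add(Story("login", 3))
    backlog.add(Story("search", 5))
    backlog.add(Story("fix outage", 2), rank=0)  # the product owner re-ranks
    print([s.title for s in backlog.plan(capacity=6)])
```

Because the order itself carries the priorities, sharing the list is enough to surface upcoming work without a separate status report.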

Dynamic leadership

The following content is based on "Self-organized teams" by Adrian Cho.

The concept of self-organized squads might be new in the business enterprise, but many examples can be found in other domains. Unlike the permanent leadership and well-defined hierarchy of a symphony orchestra, a group of jazz musicians might start with one member picking the tunes and counting off the tempo, but then the role of leader moves freely throughout the group, in real time and typically with no explicit communication. How do they do this while avoiding situations where people fight for leadership, or even worse, where no one leads?

In jazz, anyone can take the initiative to explore new possibilities that can lead to moments of wonderful creativity. The risk of failure exists, but the same dynamic leadership means that someone is always there to help preserve the stability. The musicians are adept at practicing leadership on demand because they are equally comfortable leading, following, and switching fluidly between the two. This mindset requires a willingness to let others lead.

Software development squads must be willing to work beyond simple static organizational structures. In a modern dynamic organization, it is common for virtual squads to form for one purpose and disband after they accomplish that purpose. Guilds that exist across multiple squads bring together people of like interest or expertise. People often work simultaneously in many squads, in multiple guilds, and in matrix reporting structures.

Static structures are often ill-equipped to respond to the constant change, chaos, and confusion of the new business world. Where squads are trying to design, build, and operate many microservices instead of a single monolithic application, they must similarly organize into decentralized, independent, loosely coupled squads. Otherwise, as Conway's Law suggests, a monolithic organization is constrained to create monolithic software.

This delicate balance between individual and group performance is the difference between a group that works in synergy, performing as more than the sum of its individuals, and one that is just a group of high-performance individuals. People with the squad-first mindset understand that their individual contribution is vital to the squad's success. They also know that without the rest of the squad, they alone cannot achieve the same success.

In software development, certain indicators of stability must be prized above all else. These indicators include the health of the current build based on the main stream of code, the health of running services with zero downtime as the target, and the health of each squad.

Developers must put these things first to ensure that even as they work independently, explore possibilities to innovate, and push boundaries of personal productivity, the squad and its most prized assets are never compromised. These indicators of stability must be quantified, treated as actionable metrics, and shared widely throughout the squad. In many cases, the squad can use tools and services to monitor and report such metrics as build failures, code complexity, code quality, uptime, incidents in production, and more.
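
As a rough sketch of how two of these indicators might be quantified, the following Python snippet computes an uptime percentage and a build failure rate. The function names and the sample numbers are assumptions made for illustration only; a real squad would pull these values from its own build and monitoring tools.

```python
def uptime_percent(period_minutes, downtime_minutes):
    """Share of the period during which the service was available."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def build_failure_rate(build_results):
    """Fraction of builds that failed; results are 'pass' or 'fail' strings."""
    if not build_results:
        return 0.0
    failures = sum(1 for result in build_results if result == "fail")
    return failures / len(build_results)


# Hypothetical month: 30 days of service with 43.2 minutes of downtime.
print(round(uptime_percent(30 * 24 * 60, 43.2), 2))

# Hypothetical build history on the main stream of code.
print(build_failure_rate(["pass", "fail", "pass", "pass"]))  # 0.25
```

Publishing numbers like these on a shared dashboard is one way to make the indicators visible to the whole squad.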


Summary

Your most important goal in building a squad is to ensure that the squad can collaborate without adding a layer of bureaucracy—a development that would defeat the purpose of adopting a new culture.


Build effective squads

To meet the growing need for high-quality customer experiences and rapid business concept introduction, development organizations must change. Adopting the Garage Method for Cloud can help an organization make the transformation. The Method uses Enterprise Design Thinking, Lean Startup, and agile DevOps concepts to enable continuous design, delivery, and validation of new function. A good way to begin to change is to organize your teams into colocated, autonomous squads before you start to build code.

Many types of squads exist. In a large-scale cloud development project, you can organize into many separate squads or create squads with combined responsibilities.

A squad is a small independent team

A squad is a small, independent team made up of these roles:

    A squad leader, who acts as an anchor developer and agile coach for the squad

    3 or 4 development pairs who practice pair programming

Each squad has an associated designer, an associated product owner, and can have an associated application architect. These associated roles can come from specialized support squads. An example of a specialized support squad might be the content squad that handles overall design and UX creation for your squads.

Although it's ideal to embed designers and UX content creators directly into a squad, not every organization has enough of those skills to do so. You can centralize that work, which is often related to the work of several squads anyway, into one squad.

Squads that are responsible for developing application functions might be called build squads.

A squad is responsible for chunks of function

Squads implement epics, which are groups of related user stories that describe higher levels of system functions. A single squad can implement one or more epics within a chunk, but an epic is the smallest element of implementation responsibility for a squad. The user stories that define that epic are added to a rank-ordered backlog of user stories. The organization might choose to use Kanban or other tracking methods to manage and maintain that backlog, but the backlog must be kept up-to-date constantly and reprioritized daily.

One way to size user stories is to make them implementable by a single development pair within a single day. Because the team’s implementation speed can change over time, you might need to adjust the sizing. Stories can be broken up or combined as necessary and added to the backlog.

Squads work by using best practices

A squad can benefit from implementing pair programming. When pairs write all the code, it undergoes continuous code review, making it possible to reduce or eliminate formal code reviews. Rotating programming pairs daily spreads the knowledge of individual system elements across the entire squad. The code is continually read, revisited, and revised as new user stories are implemented. This practice has the added benefit of reducing the dependence on any single person on the squad. Using pair rotation with test-driven development makes it possible for any squad member to participate in a pair with confidence.

Pair programming can be a key advantage of the squad model for large-scale organizations. As a squad gains experience and maturity, you can divide it into two squads. A senior squad member becomes the squad leader of the new squad, which is made up of some original squad members plus some newly trained developers. The new squad begins work on a new chunk. The original squad also adds new developers to gain experience with the squad’s code as they work on epics within the original chunk or in a related chunk. In this way, a team can grow quickly.

This combination of practices, plus practices such as daily stand-up meetings, speaks to the need for squads to be autonomous, to completely own an epic or chunk from end to end, and to be colocated. It is difficult to separate pairs across locations. While the overall project team can have squads in different locations, the individual members of each squad should be colocated.

The role of testing in the squad model

In the squad model, the use of automated testing, test-driven development, and pair programming means that you do not need a large, dedicated testing staff embedded in the teams. The skills of the testing staff are needed, but the people can take on different roles. Instead of acting only in a test role, some testers become developers and others use their deep domain knowledge as product owners.

Teams should follow the practice of test-driven development. Test-driven development requires that you write the test before you write any code, using the concept of tests as specifications. This practice ensures that any member of a squad can understand the code. If developers can read a suite of functional tests, they can understand how a particular code element is implemented. The test suite that is developed through this practice must encompass all of the major forms of testing that are required: functional testing, user interface testing, and performance testing. Fully embracing automated testing can dramatically improve the quality of code and can markedly reduce the time spent running manual tests.
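
As a minimal sketch of the test-first idea, assuming Python and its built-in unittest module, the two tests below would be written before the function that they exercise. The function name rank_backlog and the story fields are hypothetical and are chosen only for illustration.

```python
import unittest


def rank_backlog(stories):
    """Return stories ordered by rank, lowest rank number first.

    Hypothetical helper, written only after the tests below existed.
    """
    return sorted(stories, key=lambda story: story["rank"])


class RankBacklogTest(unittest.TestCase):
    """The tests act as the specification for rank_backlog."""

    def test_orders_by_rank(self):
        stories = [{"title": "B", "rank": 2}, {"title": "A", "rank": 1}]
        ranked = rank_backlog(stories)
        self.assertEqual([s["title"] for s in ranked], ["A", "B"])

    def test_empty_backlog_stays_empty(self):
        self.assertEqual(rank_backlog([]), [])
```

If the snippet is saved in a file, running python -m unittest against that file executes both tests. A developer who reads only the test class can see what the function is expected to do without opening its implementation.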

A need still exists for a smaller, more specialized testing squad to conduct types of testing that require specialized skills, such as cross-device mobile testing and end-to-end performance testing.

Summary

Adopting the squad model and adopting the principles of Garage Method for Cloud can help your organization stay competitive in today's fast-paced environment.


 Follow agile principles

Culture is the starting point for innovation and agile transformation. The Garage Method for Cloud has embraced and evolved a number of core principles advocated in the Agile Manifesto. Agile methodology started with a focus on software development as a way of working that keeps everyone focused on quickly producing working software that meets customer needs.

The "Agile Manifesto" dates to 2001, when software development practices were nothing like they are today. Back then, it was typical to spend a year planning and writing specifications and another year writing and testing code. By the time any software shipped, it was already 2 years behind what customers were looking for.

According to the manifesto, an agile culture values individuals, interactions, working software, collaboration with customers, and response to change. The principles of the Agile Manifesto can guide teams to define, design, develop, and deliver innovative solutions across the entire lifecycle, roles, and disciplines.

Your team or organization might expand on those principles depending on its experience with other practices and methodologies, like Enterprise Design Thinking and Lean Startup, but the manifesto and principles have proven themselves for more than 14 years.

Get started

To adopt agile principles, you need an executive sponsor who actively supports the transformation and who will build a team of people who are passionate about these principles, or the variation that your team decides to adopt.

It's easy to say "these are good principles that we want to embrace." However, many of the principles are harder to adopt than they seem to be. Some of the challenges for this type of cultural transformation include:

    Tearing down walls and building collaborative spaces: Even outside the usual corporate buildings, this idea can be tough to sell. However, more enterprise companies are recognizing the need for collaborative spaces.

    Building a diverse team: Diversity can take many forms, but most people tend to want to hire people like themselves. The more "alike people" you have on a team, the less appealing that team is to people who aren't like them. Whenever possible, seed your team with diversity in the leadership and with leaders who actively value and pursue diversity.

    Keeping a sustainable pace: Agile teams work intensely and collaboratively, even more so if pair programming is being used. Meetings are primarily on demand and as needed; they have focused purposes and are held face-to-face. Social media and personal distractions are minimal because you're intensely collaborating with other people and impacting them if you are not present. When you work at that level of intensity, working long hours is not sustainable. Yet in many corporate cultures, if you're working hard or contributing highly, you're working long hours. That is one of the more difficult cultural norms to change.

Alternatives and related practices

Many agile methodologies have taken the Agile Manifesto and developed their own interpretations of it. Other disciplines have similar sets of principles that apply to agile methods across the entire product lifecycle.


 Fail fast and learn fast

At the heart of every startup and every new product is that "one great idea" that someone thinks the market needs. But not every great idea is that great. Sometimes an idea needs a few changes to become great. Other times, it's hopeless. In the early stages, it's hard to distinguish the good ideas from the not-so-good ones.

Every team's objective is to deliver high-quality function—that great idea that customers love—and to do it fast. One technique to meet this objective is to quickly change course when something isn't working instead of continuing down the wrong path. This technique is known as failing fast.

When you develop applications, you have a limited amount of time, people, and money to get an idea right. The longer it takes you to realize that an idea isn't a winner, the more resources you waste. Conversely, the faster you can verify that an idea is great, the faster you can get more investment to make the idea a reality.

The key to failing fast is to develop enough of your idea to determine whether it's useful to customers. You can have customers validate the function with as little investment as possible, reducing business risk. If a customer doesn't like the new function, you can find out before you invest more time or resources into developing the function and move on to the next idea.

Failing fast requires a culture where the team has the freedom to fail but can learn something from each failure that helps the team succeed faster the next time.

Imagine yourself in a 10-week beginner's pottery course. Half of the students spend the entire 10 weeks working on one clay pot. For those students, the overall grade for the course is based on the quality of that one pot. The other half of the students make as many pots as possible; their grades are based on the quality of pots that they create. At the end of the 10 weeks, an independent expert is brought in to see which pots are the best.

Which half of the class will likely have a higher percentage of great pots? Experiment after experiment suggests that the "trial and error" group will have more great pots. Because the second group of students must try different ideas and techniques, they can more quickly find ideas that work and then use those ideas to produce amazing results.

This idea is at the heart of Enterprise Design Thinking. In the words of David Kelley, "Enlightened trial and error beats the planning of the lone genius."


How to fail fast

You can fail fast—and learn fast—in many ways.

    The costliest failure that you can make is investing in a project that your customers don't want. The first step in failing fast is to prove that your intended customers want and need what you're planning to create. Work with your stakeholders and validate your idea.

    During the workshop phase of a project where the team and stakeholders define the application features and minimum viable product (MVP), mandate that participants explore "crazy" ideas to see whether elements of those ideas lead to good ones. In this way, teams can quickly separate the good ideas from the others.

    In a hypothesis-driven development process, the entire team—design, development, and business—identifies the riskiest assumptions that underpin an idea. Then, the team builds experiments to test them. If the idea is based on flawed assumptions, it fails, and the team can pivot and avoid wasting its limited resources. When the team tests the riskiest parts of the project at the beginning, the cost, risk, and design of the project become more predictable.

    When you work with a new project that requires a large infrastructure and hardware investment, such as an Internet of Things (IoT) project, you must validate how the platform and devices affect your design. Implement as little of your infrastructure as possible to validate an assumption and then implement a bit more to validate the next assumption. If a significant part of your IoT platform isn't built before you encounter a failure, you'll have a better idea about where the failure is happening. 

    From an application-delivery point of view, development teams that use DevOps can get immediate feedback to find out whether code or a deployment is flawed. In this way, a team can move to higher-quality code and a more stable deployment environment.

As your team embraces the notion of failing fast, it gets used to experimenting and recovering from unexpected circumstances as part of the development cycle. Everyone becomes confident that discovering the unknown and reporting on it is respected and treated as a contribution to the project's success. What counts is a great solution that is achieved by making failure and correction part of the team's culture.


 Manage your work by using a Kanban

Many development teams use the Kanban method to manage their work. At its core, kanban is a visual representation of work in progress. A kanban board can be as simple as sticky notes and lanes on a whiteboard, or your team can adopt web-based tools that everyone on the team can access. Online kanban boards are especially useful if your team is not colocated.

Kanban has two essential principles: the workflow is visible to the entire team and the work in progress is limited to the capacity of the people who are doing the work. You can't keep adding work to the queue and expect it to get done—or done well.

By following the Kanban method, teams can deliver features quickly, reduce the inefficiencies of multitasking, and identify bottlenecks and dependencies.

The basic kanban workflow is as follows:

    New: Work that is new to the team and has not yet been added to the backlog.

    Ranked backlog: Work that is waiting to be started, ranked in order of importance and size.

    Work in progress: Work that is in progress by a team member.

    Under review: Another member of the team checks the work.

    Done: If review goes well, the work is complete and ready for production.

    Repeat: A team member takes the next item from the backlog.
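
The two principles and the workflow above can be sketched as a toy data structure. This is only an illustration, assuming Python; the class name KanbanBoard, its methods, and the sample stories are hypothetical and do not correspond to any particular kanban tool.

```python
from dataclasses import dataclass, field


@dataclass
class KanbanBoard:
    """Toy kanban board; lane names follow the workflow listed above."""

    wip_limit: int  # assumed capacity of the people doing the work
    lanes: dict = field(default_factory=lambda: {
        "New": [],
        "Ranked backlog": [],
        "Work in progress": [],
        "Under review": [],
        "Done": [],
    })

    def add(self, item, lane="New"):
        self.lanes[lane].append(item)

    def move(self, item, source, target):
        """Move an item between lanes, refusing to exceed the WIP limit."""
        if target == "Work in progress" and len(self.lanes[target]) >= self.wip_limit:
            raise RuntimeError("WIP limit reached; finish something first")
        self.lanes[source].remove(item)
        self.lanes[target].append(item)

    def pull_next(self):
        """Take the top-ranked backlog item into work in progress."""
        if not self.lanes["Ranked backlog"]:
            return None
        item = self.lanes["Ranked backlog"][0]
        self.move(item, "Ranked backlog", "Work in progress")
        return item


board = KanbanBoard(wip_limit=2)
for story in ["login page", "search", "checkout"]:
    board.add(story, lane="Ranked backlog")
board.pull_next()
board.pull_next()
# A third pull_next() would raise RuntimeError because the WIP limit is 2.
```

Rejecting the move, rather than silently queuing it, mirrors the point that work cannot keep being added beyond the capacity of the team.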


Types of kanban boards

The format of your kanban board might vary depending on the group you're working with. For example, a management team might use a board to track broad ideas or epic stories. A development team might use its board to track the backlog and work that is in progress and completed.

Three types of kanban walls and headers are common:

    Portfolio walls: For management-level tracking of whole projects or epics across a larger organization
        Headers: Idea, In Discovery, In Delivery

    Program walls: For project-management-level tracking of a project across several teams
        Headers: Backlog, In progress, Under review, Done

    Team walls: For team-level tracking of work
        Headers: Backlog, In progress, Under review, Done

Create a kanban board

Keeping the basic workflow in mind, gather your team, grab a few sticky notes and a whiteboard, and create a kanban board:

    Draw columns and rows by using tape or a marker on a whiteboard. Alternatively, stick paper or plastic to a window or empty wall.

    Put headings on the columns that correspond to your workflow. The headings depend on the work that you're tracking. You might use headings like Backlog, Doing, and Done, or Idea, In Discovery, and In Delivery.

    Write a short description of each piece of work on cards or sticky notes.

    Use magnets, pins, or adhesive to arrange the cards or sticky notes on the board to show the status of the workflow.

You can also use tools such as ZenHub or Trello to create a web-based kanban board for your team. 


Use your board to keep everyone on your team in sync and moving forward on their work.

Kanban for the enterprise

If you're orchestrating work across dozens of teams, you need more than a whiteboard and sticky notes. To meet the needs of a complex organization, you must scale up more than the tool. You must adjust the Kanban method, too.

For a large organization, you need a complete kanban system: a kanban of kanbans. You must apply the kanban process at the highest level first so that hundreds of tasks aren't haphazardly assigned to the teams. If the organization applies and coordinates the kanban processes at each level, the teams can understand the priorities and the decision-makers have a clear view of progress and capacity.


 Run daily stand-up meetings

Each day, your team must discuss how things are going and whether any issues are blocking progress. This discussion is accomplished through the daily stand-up meeting.

The value of stand-up meetings

Conducting daily stand-up meetings provides value in many ways. The daily discussion:

    Improves team building and collaboration

    Exposes blockers that affect team members

    Facilitates knowledge sharing across the team

    Gives team members a few minutes away from their work

How to run an effective stand-up meeting

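Before the meeting starts, team members must be ready to commit to a daily goal and answer three questions:

    What did I complete since the last stand-up meeting?

    What will I complete before the next stand-up meeting?

    What is blocking my progress?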


When answering the daily stand-up questions, each team member's answers must align with the tasks that are assigned to them in the kanban board or your other work-item tracking system.

Answers are given in past or future tense. For example, "I completed this" or "I will complete this." Using the present progressive tense, such as "I am working with XYZ," is considered poor form. The idea is to report commitments to the team, not current status, which is already represented in the kanban board.

Answers must cover only one day or the time that has passed since the last stand-up meeting if several days have passed. If a task is too big to be contained in a day, report what part of it will be or was completed. For example, instead of "I am working on task X," say "Tomorrow, I will complete the unit tests for work item X, which is the coding task for the multi-threading story."

Here are some examples of good and bad responses to the daily stand-up questions.


Work with your team to stay on track and answer the questions so everyone is in sync and blockers can be addressed quickly to keep everyone moving forward on their work.

Effective meetings also follow these practices:

    Keep your daily stand-up meetings concise: no longer than 15 to 20 minutes.

    A good update takes fewer than 90 seconds per team member. Questions from others or the squad lead are asked and addressed after all other team members participate.

    Discussion about blockers and questions about what is being done happen outside of the meeting. To discuss those matters, schedule additional time at the end of the meeting or hold another meeting to discuss issues that are more involved. The key is that any person who is not involved in the issue can leave the meeting and return to their work.
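
The time limits above imply a practical ceiling on how many people can give updates in one meeting. The following Python arithmetic is a small sketch of that relationship; the function names are hypothetical, and the 90-second slot is treated as an upper bound for each person.

```python
def max_participants(meeting_minutes, seconds_per_person=90):
    """Largest team whose updates fit in the meeting at the given pace."""
    return (meeting_minutes * 60) // seconds_per_person


def meeting_minutes_needed(team_size, seconds_per_person=90):
    """Minutes required if every member uses the full time slot."""
    return team_size * seconds_per_person / 60


print(max_participants(15))        # 10
print(max_participants(20))        # 13
print(meeting_minutes_needed(10))  # 15.0
```

Under these assumptions, a 10-person squad fills a 15-minute stand-up only if everyone uses the full 90 seconds, which leaves room for follow-up questions at the end when updates are shorter.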

Use the daily stand-up meeting to ensure that there is always open and honest communication across your team. In doing so, your team will become stronger over time.


Discover


 Define your problem domain and align on goals

You can't solve a problem without first understanding it. The Discover practices help your team dig deep into your problem domain, align everyone on common goals, and identify potential problems and bottlenecks. They also help you self-assess and understand how prepared you are to tackle issues.

Why change?

In the past, you likely had key deliveries on a quarterly, semi-annual, or even an annual schedule. You built a business case that was based on upgrading your applications in a new release. New features had a long lead time. If you didn't deliver exactly what your customers wanted, they had to wait until the next scheduled release to see their new function requests implemented.

As you shift your culture and way of working to embrace the cloud, continuous integration and delivery, and related practices, your organization adapts to the new way of working. It's important to align your business operations with your new way of delivering application features. You must ensure that you define clear objectives to guide what is being delivered, when, and by whom. You must also ensure that you have a clear understanding of the total cost of ownership for your cloud platform.

What changes?

To become aligned, your team needs to dig deep into your problem domain, establish common goals, and identify potential problems and bottlenecks. By using the Discover practices, you can complete a self-assessment to understand how prepared you are to tackle the issues that you're trying to solve.


Understand your problems and align on common goals.


Define business objectives

Define business objectives with measurable outcomes and an identified business sponsor. As you define your objectives, you need to identify any perceived risks and potential rewards outside the stated qualitative or quantitative measures. While you no longer create a huge business case that is based on knowing the full function of a release, you can react fast to the changes in the market and the functional and nonfunctional requirements that your customers need.

Identify bottlenecks

As you define your business objectives, it's important to review your DevOps process. Work with the team to create value stream maps to identify bottlenecks and waste in your process flow and identify new practices to adopt.

Waste and bottlenecks include these examples:

    Developing functions that aren't required by your stakeholders

    Manually deploying your infrastructure

    Unnecessary tasks, such as repeatedly communicating status

Identify changes that are critical to resolving the bottlenecks and waste. Determine how you can improve your processes to become more efficient. At the same time, you must ensure that your whole organization agrees on a solution. By doing so, you can be certain that by fixing a bottleneck in one place, you're not creating one in another.

Define organizational roles

DevOps at scale calls for several specific roles, each of which requires unique skills and domain knowledge. Build the team that helps you meet your business objectives. In the most basic sense, a team is a group of people who work together to accomplish a common goal. For a DevOps team, the goal is usually the delivery of a product or a microservice that is part of a product. To achieve success, make sure that the key roles are filled.

After your team is established, team members must have a common understanding that the goal is to minimize distractions and deliver a product that delights customers. Distractions include lengthy meetings, generating status reports, and doing any other activities that distract the team from delivering its work.

Be sure to reward the entire team based on business outcomes instead of creating conflicting individual team incentives. When people know what their common goal is and how their progress toward that goal is measured, fewer conflicts arise from teams or practitioners that have their own priorities.

Communicate efficiently

As you define the roles on your team, ensure that all participants, including the product owner, development, operations, and management, clearly understand the business goals that the team must meet. The goals must be measurable and reflected throughout the method, from defining the MVP to delivering the application into production to running experiments that validate that what you delivered meets the business goals. 


Define your business objectives


In many ways, a business objective is like a hill. A hill communicates an intent for a project through statements or diagrams that can provide both direction and purpose. You use a hill to frame a problem in terms of a user outcome, but the hill doesn't describe or prescribe any type of implementation.

In a hill, the statement of purpose identifies a user to be rewarded and describes the end state that you want to achieve. The user and the end state are the "who" and the "what." Finally, a hill includes a statement of differentiation. The differentiation is either a qualitative or quantitative measure of "wow." Therefore, a hill is a statement of purpose that is oriented around who, what, and wow.

An entry point to dialogue

Although they're alike, a business objective serves a different purpose than a hill. A business objective presents an anchor point for beginning a dialogue or holding a discussion. The business objective might represent a hypothesis or postulate an area of need. You craft a business objective as a single atomic-level business-oriented statement. In this case, atomic means a fine-grained level of purpose within a business context, not a technical context.

While you can decompose a hill into a series of sub-hills, you craft a business objective at its atomic level. Within the lifecycle process, a vetted business objective can be recast into a series of hills or sub-hills when you need to define a technical project.

While a business objective doesn't need to specify or be oriented toward a specific user or community of interest, the business objective must have a sponsor.

The focus on "why"

A business objective must center on a measurable outcome, or the "why." In the hill terminology, the "why" is the "wow." If the outcome is oriented toward the near-term, construct the measure in quantitative terms. If the outcome is oriented toward long-term needs, construct the measure in qualitative terms.


An example of a quantitative measure is a gain of USD 1 million. An upward trend is an example of a qualitative measure.

Identify your business objective during a brainstorming session. In later discussions, you can evaluate the business objective and replace or refine it as needed.
A path to broader understanding


You might state a business objective as a need to evolve a capability, replace a capability, or establish a new capability. In your discussion to evaluate the why-centered business objective, you must understand why the measurable outcome must be met.

This discussion is a natural means to bring in other necessary areas of interest:

    Who benefits? Who participates? Who is impacted?
    When is something triggered? When does something participate in the business cycle?
    Where are all the places of business activity? Where are the customers? Where does the computer processing occur?
    How is something to be achieved? How should a process work?

When you understand the "why," you gain an in-depth, holistic view of the business purpose. That view helps you understand what is known and what can be known: the data or information, the metadata, and the data provenance that you need to support the business.


Work toward stability and viability

Some of the data or information to support a business objective can be directly generated during the normal course of business. Other data might need to be generated or even acquired. Potentially, some information might remain unattainable.

Part of the evaluation of a business objective is to identify any perceived risks and potential rewards outside the stated qualitative or quantitative measures.

In any computer application, the aspects of data management represent a challenge. By understanding a need and how it correlates to a dependency on data and information, you can yield a hill or an MVP that is more stable because the underlying data schemas are more stable over time. This means that the efforts to build a data store, a database, or a data platform might naturally follow a different scope than the application code counterpart.

For some, it's all about the data

By starting with a business objective, you can uncover the knowledge that your organization must build and cultivate. That knowledge is the data or information, depending on your preferred viewpoint.

In the modern enterprise, data is critical. Most organizations require computer systems, and all that an organization knows exists within its databases. An organization can't know anything more than what it stores. To avoid data management becoming burdensome and expensive to maintain, take a long-term view of data management. That view begins with discussions on each atomic business objective.


Identify bottlenecks by using value stream mapping


Many of you are no doubt familiar with Gene Kim's book, The Phoenix Project. In that book, Gene mentions Eliyahu Goldratt, who created the Theory of Constraints. One of Goldratt's observations is that an improvement made anywhere other than at the bottleneck does not improve the overall throughput of a process. An improvement made before the bottleneck can even make matters worse, because it piles up more work in front of the bottleneck.

When you think about that, it makes sense. However, do you know where your bottlenecks are? You might be familiar with a few bottlenecks that you're regularly confronted with; for example, it might take 5 hours for a build to finish. But is that bottleneck the biggest one in the entire flow of getting value to your customers? Even if it takes 5 hours for a build to complete, does that matter if it takes a full week to deploy the code into production?

A value stream map is a tool for evaluating processes to identify bottlenecks, waste, and improvement opportunities. Identifying the biggest bottlenecks in a process stream is where value stream mapping shines.


Evaluate your processes using value stream mapping.


The benefits of value stream mapping

Identifying bottlenecks is the most readily apparent benefit of value stream mapping. It helps the entire organization see where the biggest bottleneck is and how it negatively affects getting value to customers. To use the previous example, if you're aware only of the 5-hour build time, you might be inclined to focus effort on improving that alone. If you don't understand the more significant bottleneck, the week required for deployment, improving the build time will likely make the downstream bottleneck even worse. However, if the bigger bottleneck is understood by everyone in the organization through the use of a value stream map, it is clear that the organization must focus on reducing that delay to the benefit of everyone, especially customers.
How to get started with value stream mapping

The steps to create and use a value stream map are straightforward:

    Identify the steps of the entire process. Keep it simple by identifying the most typical scenario through the process.

    Collect data: record the average time to complete each step and the average wait time between each step.

    Create a simple flow diagram that shows the "value add" times for each step and the "wait times" between each step. For an example, see the following image.

    Identify the biggest delay in the flow and determine why the delay is as long as it is. Then, create a plan to reduce the delay and act on it.
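
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool. The step names and durations are invented for the example, and `Step`, `biggest_delay`, and `flow_efficiency` are hypothetical names. It assumes that each step records its average value-add time and the average wait time that precedes it.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    value_add_hours: float    # average time spent actively working on the step
    wait_hours_before: float  # average time the work waits before the step starts


# Hypothetical data for illustration only.
steps = [
    Step("Write code", 16, 0),
    Step("Build", 5, 2),
    Step("Test", 8, 24),
    Step("Deploy to production", 4, 168),  # one week of waiting
]


def biggest_delay(steps):
    """Return the step that is preceded by the longest wait."""
    return max(steps, key=lambda s: s.wait_hours_before)


def flow_efficiency(steps):
    """Share of the total lead time that is value-add work (0.0 to 1.0)."""
    value_add = sum(s.value_add_hours for s in steps)
    waiting = sum(s.wait_hours_before for s in steps)
    return value_add / (value_add + waiting)


worst = biggest_delay(steps)
print(f"Biggest delay: {worst.wait_hours_before} h before '{worst.name}'")
print(f"Flow efficiency: {flow_efficiency(steps):.0%}")
```

With data like this, the wait before deployment dwarfs the build time, which is the kind of finding that the flow diagram is meant to make visible to everyone.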


In the example value stream map, you can see that the biggest delay is the week between when a defect is added to the queue and work starts on the defect. By having this value stream map available, the organization can ask questions and come up with solutions:

    Is the delay due to skills issues?

    Is the delay due to not having enough people available to fix defects? For example, is our backlog too large?

    If the backlog is large, do quality issues need to be addressed?

    Is there too much emphasis on creating capabilities at the expense of fixing defects?

For example, if the problem is due to not having enough people to handle the defect backlog and create capabilities, the organizational leadership could repurpose an open headcount requisition (req) from sales, support, or administration and give it to development to solve the long delay in getting defect fixes to customers.

If an open headcount req is not available, the leadership can agree that slowing down the delivery of new features is needed so that the team can drive down the backlog of defects. Without a value stream map that shows the need for a change, a shared understanding and agreement on the solution is less likely to happen.

In sum, value stream maps are a simple and effective way to identify bottlenecks in an end-to-end process flow. They also help everyone see where changes are most critical so that the whole organization can agree on a solution.


 Roles in a DevOps organization

A common misconception about agile development is that everyone can do everything. In fact, the opposite is true. To run a disciplined agile process, well-defined roles and responsibilities are essential. 

Although the names of these roles might differ from what you call them, you'll likely find that their descriptions match roles that you're familiar with. A key aspect of being agile is that people are empowered to make decisions without extensive meetings and consensus building. This aspect can be a significant challenge in some enterprise organizations. Project cadence with playbacks to all interested parties is key to making the whole organization more comfortable with decisions.
Core roles in an agile squad

In the most basic sense, a team is a group of people who work together to accomplish a common goal. For a DevOps team, the goal is usually the delivery of a product or a microservice that is part of a product. To achieve success, make sure that the key roles are filled.
Product manager (product owner)

The role of the product manager is critical. The product manager ensures that the team creates an engaging product and delivers business value by meeting the needs of the markets. Product managers must understand company strategy, market needs, the competitive landscape, and which customer needs are unfulfilled. 

The product manager has these responsibilities:

    Owns the scope of the project, works with sponsor users, identifies personas, defines the minimum viable product (MVP), and defines hypotheses

    Collaborates with designers on the overall user experience, attends playbacks with sponsor users and stakeholders, and collaborates through user testing

    Defines, writes, and ranks the user stories that direct the design and development work

    Accepts user stories; that is, decides when a user story has delivered the MVP

    Decides when to go to production and owns or collaborates on the "Go To Market" plan

Larger projects might have more than one product manager. Ideally, the project is divided into components or services that individual squads can work on, and each squad has a product manager who provides direction.

The product managers must understand and coordinate requirements from other components of an overall solution. Ultimately, one voice must direct the design and development work of a squad.
Sponsor

The sponsor is typically an executive who has the vision and owns the overall delivery and success of the project. Ideally, the squad has playbacks and brings any issues to the sponsor each week. Part of the sponsor's job is to ensure that the squad has everything it needs to succeed and to support it in its use of agile methods. 
User experience (UX) design lead

The UX design lead is responsible for all aspects of the project's user experience. The goal of every project is to create experiences that delight users. Reaching that goal requires a strong and experienced design leader who can oversee all aspects of the design, from the initial workshop through visual design and user testing.

The UX design lead has these responsibilities:

    Leads design thinking practices, such as persona definition, empathy mapping, scenarios mapping, ideation, and MVP definition.

    Creates great user experience concepts and produces wireframe sketches to communicate the experience with the broader team.

    Drives collaboration with the extended design team, development team, and product management team to maximize innovation.

    Collaborates with visual designers to deliver high-fidelity designs.

    Ensures a consistent user experience across all facets of an offering, working with UX leaders in other squads when necessary.

    Plans and runs user testing to ensure that real-world feedback is injected in all phases of the project. In some projects, a dedicated user researcher assumes this responsibility.

Larger projects might have a UX designer or dedicated user researcher who works with the UX design lead. In those cases, the UX design lead is responsible for all aspects of the experience, but the UX designer might create a few of the design artifacts.
Visual designer

The visual designer converts the user experience concepts into detailed designs that emotionally connect with users.

The visual designer has these responsibilities:

    Uses his or her understanding of color, fonts, and visual hierarchy to convert conceptual designs into detailed designs that developers can build

    Creates all the necessary visual artifacts, including images, illustrations, logos, and icons

    Ensures a consistent application of corporate styles and branding, or where appropriate, suggests deviations from styles and branding

Sponsor user

A product that doesn't satisfy a real user need is a failed design. The best way to ensure that your project meets users' needs is to involve users in the process. 

A sponsor user is a real user—someone from outside your organization—who will use the product that is being built. A sponsor user must be carefully selected so that he or she accurately represents the needs of the widest possible user base. In some cases, you might want more than one sponsor user. 

The sponsor user has these responsibilities:

    Represents the needs of the user throughout the process

    Participates in all phases of the project, from the initial design workshop through design, development, and user testing

    Ensures that all decisions reflect the needs of the user

In some cases, the sponsor user might be the person who demonstrates the design or prototypes to other stakeholders. What better way to show the value of a design than to have a real user show it and explain why it matters?
Developer

Developers build code by using core agile practices such as "keep it simple," test-driven development (TDD), continuous integration, polyglot programming, and microservice design. Agile development requires more collaboration than is required in the waterfall model.

As part of a modern, agile DevOps team, developers must be adept at rapidly learning and using new technologies. In a DevOps approach, the developer role is merged with quality assurance (QA), release management, and operations.

A developer has these responsibilities:

    Collaborates with the designers and product managers to ensure that the code that is developed meets their vision

    Designs the solution to meet functional and nonfunctional requirements

    Writes automated tests, ideally before writing code

    Writes code

    Develops delivery pipelines and automated deployment scripts

    Configures services, such as databases and monitoring

    Fixes problems from the development phase through the production phase, which requires being on call for production support

In order for developers to have such broad roles, they rely on cloud technologies to free them from many—and sometimes even all—infrastructure tasks. In an ideal DevOps organization, the developers take on operations fully and are on "pager duty" for production problems. In many companies, including virtually all enterprises, an operations team provides the first response to production issues. However, you can still have a DevOps culture if the second call is to a developer to fix the problem. For more information about the operations roles that developers take on, see Cloud Service Management and Operations roles.

Teams might need developers with specialized skills, such as analytics or mobile. In those cases, use pair programming to spread the skills throughout the team.




Anchor developer (technical team lead)

An anchor developer is an experienced developer who provides leadership on architecture and design choices, such as which UI framework to use on a project. Even with the use of agile tracking tools, sponsors and stakeholders often want a direct report on progress, issues, and key technical decisions. The anchor developer is that technical focal point who also does development work.
Agile coach

The agile coach leads the organization in agile methods. The coach must have the knowledge and experience to recommend changes to various practices in response to unique circumstances within the organization. The coach identifies problems and misapplications of the agile principles and suggests corrections and continuous improvements.

One approach that works well is for the agile coach to be part of the squad that has delivery responsibilities. In that way, the coach can provide guidance about the method and practices as part of the day-to-day work. The agile coach usually has experience with agile projects and is passionate about the process and practices. Alternatively, an agile coach can work across several squads and mentor them on the process and practices. 

Depending on your squad structure and project size, the day-to-day responsibilities of facilitating daily standup meetings, planning, and playbacks might be the responsibility of either the agile coach or the anchor developer.
Cloud Service Management and Operations roles

When you move to the cloud, the resulting culture change requires modifications to the structure and roles of your project teams. Some DevOps team members can play more than one role, and groups might be merged to create a cohesive, diverse squad. As you form the Ops side of your squad, consider the addition of several new roles.

Service management introduces processes that teams must implement to manage the operational aspects of their applications.


Incident management roles

Incident management restores service as quickly as possible by using a first-responder team that is equipped with automation and well-defined runbooks. The incident management team members define the incident process and build a tool chain that implements ChatOps across the organization. Incident management roles include the first responder, incident commander, subject matter expert, and site reliability engineer.

First responder

The first responder solves problems by using runbooks and working with subject matter experts (SMEs). This role has these responsibilities:

    Receives alerts through collaboration tools

    Researches to determine the nature of the problem

    Evaluates and adjusts the urgency and priority of the problem, if needed

    Contacts and communicates with the incident commander when a major incident occurs

    Reviews known issues to determine whether the problem is a known issue

    Tries to resolve the issue by using the prescribed runbooks, collaborating with SMEs, or both

    Gains concurrence from the customer when the incident is resolved

Incident commander

The incident commander manages the investigation, communication, and resolution of major incidents. This role has these responsibilities:

    Receives incident information and collaborates with SMEs to restore services as fast as possible

    Updates key stakeholders with status and expected resolution times

    Seeks senior leadership support and endorsement, if needed

    Interfaces and works with vendors to isolate problems and drive resolution

Keeping stakeholders up to date with the status of an incident is a key responsibility of the incident commander.

Subject matter expert (SME)

SMEs apply the deep technical skills that are needed to resolve application issues. Their skills support either specialized application expertise or a specific technical field or domain, such as database administration. An SME has these responsibilities:

    Investigates a problem by using monitoring tools to get more details

    Inspects logs

    Tests and verifies issues

    Recommends fixes if instruction is missing for the first responder, or fixes the problem

    Proposes changes if they're needed and requests change management

    Implements change

    Provides data for the site reliability engineer (SRE) review

SMEs use their skills to resolve problems as quickly as possible.

Site reliability engineer (SRE)

An SRE takes operational responsibility for supporting applications and services that run on a global scale in the cloud by using a highly automated service management approach. The SRE pays particular attention to removing toil, which is repetitive manual labor that doesn't add real value to a project.

An SRE spends approximately 50% of his or her working time on engineering project improvements. Fundamentally, the role is a combination of software engineering and operations design.
Problem management roles

Problem management aims to resolve the root cause of an incident to minimize its adverse impact and prevent recurrence. The problem owner and problem analyst ensure that problems are fixed and not repeated.

Problem owner

The problem owner oversees the handling of a problem and is responsible for bringing it to closure. As needed, this role enlists the help of analysts and specialists. The problem owner is essentially the same role as in traditional IT. However, the tools that the role uses to identify and solve a problem and ultimately provide a root cause analysis are different. Typically, the problem owner has a strong personal interest in the resolution.

Problem analyst

The problem analyst discovers incident trends, identifies problems, and determines the root cause of a problem. Problem analysts are SMEs in one or more areas. This role is radically different from its traditional IT counterpart because business analytics, runbooks, and cognitive techniques play a major role. However, human supervision and creative thinking are still important in this role. Typically, an SRE takes this role.
Change management roles

The purpose of change management is to successfully introduce changes to an IT system or environment. The roles that are associated with change management are change owner, change manager, and change implementer.

Change owner

The change owner is the person who requests the change. The change owner has these responsibilities:

    Raises the change within the change management tool

    Creates the business case that justifies the change

    Sets the priority

    Determines the urgency

Input from the change owner is used to rank the change against all the other work that the DevOps squad does.

Change manager

The change manager completes the preliminary assessment of a change record to ensure that it contains the correct information. That information includes an accurate description, the correct classification, and the correct change type: standard, normal, or emergency.

Before the change is implemented, the change manager verifies that all the authorizations are obtained. After implementation, this role reviews changes to ensure accuracy and quality.
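
Part of the preliminary assessment can be automated. The following sketch checks only that a change record is complete and that its change type is one of the three types named above. Judging whether a description is accurate still requires a person. The `ChangeRecord` fields and the `preliminary_assessment` function are hypothetical and do not correspond to any particular change management tool.

```python
from dataclasses import dataclass

# The three change types named in the text.
VALID_CHANGE_TYPES = {"standard", "normal", "emergency"}


@dataclass
class ChangeRecord:
    description: str
    classification: str
    change_type: str


def preliminary_assessment(record):
    """Return a list of problems found; an empty list means the record passes."""
    problems = []
    if not record.description.strip():
        problems.append("description is missing")
    if not record.classification.strip():
        problems.append("classification is missing")
    if record.change_type not in VALID_CHANGE_TYPES:
        problems.append(f"unknown change type: {record.change_type!r}")
    return problems


# Hypothetical records for illustration only.
print(preliminary_assessment(ChangeRecord("Rotate TLS certificate", "security", "standard")))
print(preliminary_assessment(ChangeRecord("", "security", "urgent")))
```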

Change implementer

This role implements changes. Typically, the change implementer is an SRE or associated SME.
Other possible roles

Your project might need additional roles such as architect, project manager, and user researcher.


Architect

In projects that have experienced developers and a strong anchor developer, those roles provide "just-in-time" architecture. The use of "just-in-time" architecture is typical for greenfield projects, which are not constrained by a preexisting architecture. However, if enough complexity exists, you need an architect who is separate from the developers so that development work isn't impacted. You might need an architect if a project is integrating with existing systems by using a wide range of services, security, or movement of data, or if you're coordinating with many other technical teams. 

The architect works closely with the anchor developers across squads. This role also works with other architects to ensure architectural consistency across an offering portfolio. The architect creates only the architectural designs, diagrams, and documents that are actively used by the squads to guide their development. As in all other roles, the focus must be on communication and collaboration that is effective for the squads.
Project manager

Tracking within the squads' day-to-day work is done through user stories and tracking software, but often, dependencies exist on groups outside the squad.

Project managers do a wide variety of tasks:

    Procure software

    Coordinate dependencies on systems that expose APIs

    Report summaries beyond the tracking software

    Manage issues

    Coordinate integration with third parties

If the dependencies are minimal, the product manager, anchor developer, or both might handle the project management tasks. However, if those roles spend too much time on project management tasks, a project manager is likely needed. Ideally, the project manager has experience with agile teams and understands agile process and tracking.
User researcher

The user researcher validates all the aspects of a design with real users. Sometimes the UX leader fulfills this role. However, where more extensive user research is part of the project, include a user research expert as part of the team.

The user researcher has these responsibilities:

    Completes the initial research of users and their world to build personas, empathy maps, and scenario maps

    Validates the problem statements and MVP with real users

    Plans and conducts usability tests throughout the project to get real-world feedback on all ideas

Automated separation of duties

After you define your organizational roles, set up automated separation of duties (SoD) to enforce them. SoD means ensuring that no single person can introduce fraudulent or malicious code or data without detection. To set up SoD, follow these tips:

    Define SoD-related roles and access rights

    Use tools like UrbanCode Deploy, which provide role-based security and logging

    Use tools to automatically track not only what the change was but also when it was made and who made it

    Clearly document who can and cannot have access to production, and use access control to enforce your policy

    Add roles to the automated tool

    Use automated scripts to supplement tools; for example, generate a separation of duties matrix
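
As an illustration of the last tip, the following sketch generates a simple separation of duties matrix and flags roles that hold conflicting duties. It is a minimal example under stated assumptions: the roles, duties, and conflict policy are invented, and `sod_matrix`, `sod_violations`, and `format_matrix` are hypothetical helper names, not part of any deployment tool.

```python
# Hypothetical roles and duties; replace them with the ones your organization defines.
ROLE_DUTIES = {
    "developer": {"write_code", "approve_change"},
    "change_manager": {"approve_change"},
    "sre": {"deploy_to_production"},
}

# Pairs of duties that one role should not hold together (an assumed policy).
CONFLICTING_DUTIES = [
    ("write_code", "approve_change"),
    ("write_code", "deploy_to_production"),
]


def sod_matrix(role_duties):
    """Build a role-by-duty matrix: matrix[role][duty] is True if the role holds the duty."""
    all_duties = sorted({d for duties in role_duties.values() for d in duties})
    return {
        role: {duty: duty in duties for duty in all_duties}
        for role, duties in role_duties.items()
    }


def sod_violations(role_duties, conflicts):
    """List (role, duty_a, duty_b) for every conflicting pair that a single role holds."""
    return [
        (role, a, b)
        for role, duties in sorted(role_duties.items())
        for a, b in conflicts
        if a in duties and b in duties
    ]


def format_matrix(matrix):
    """Render the matrix as comma-separated text, one row per role."""
    duties = list(next(iter(matrix.values())))
    lines = ["role," + ",".join(duties)]
    for role, row in matrix.items():
        lines.append(role + "," + ",".join("X" if row[d] else "-" for d in duties))
    return "\n".join(lines)


print(format_matrix(sod_matrix(ROLE_DUTIES)))
print(sod_violations(ROLE_DUTIES, CONFLICTING_DUTIES))
```

In this invented data, the developer role both writes code and approves changes, so the script reports it as a violation to review.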


 Cloud workload assessment

Cloud adoption is a popular topic in organizations that seek competitive advantage through capability upgrades and technology. In that context, one of the most important questions is "Where do I start?" With cloud workload assessment, your organization can explore the current environment and prioritize focus areas to enable and accelerate cloud adoption.

Cloud workload assessment fits into the spectrum of activities in the cloud adoption and transformation journey


In the diagram, cloud workload assessment is aligned with steps 1 and 2. It's also linked to other activities, such as affinity analysis and what-if analysis. Those terms are often thought of as having overlapping concepts. The delineation for workload assessment is its focus on understanding the current technical landscape, technical debt, environment challenges, and technical- and business-driven dependencies.
Key practices and lessons learned

Based on decades of experience, key subprocesses have been distilled and captured to assess workloads in a three-step approach.


1. Understand your current portfolio

In the first step, you focus on your organization's current portfolio to understand its technical landscape, challenges, and dependencies. Criteria can vary from one organization to another, but the criteria must cover the following attributes:
Architecture

Understand your application architecture in the context of its multiple components. This type of focus might result in you categorizing applications as monolithic with technical debt to be resolved. Levels of distribution and standardization might also drive the need to decompose and group workloads into manageable chunks of effort for cloud adoption.

Other factors that are critical to surface are stateful applications, applications that run on traditional systems, and components that need to be replaced as applications are rearchitected to move to cloud.

On the other side of the spectrum, some workloads require less effort, such as applications that can be moved as-is. Typically, those workloads are standardized, stateless, use a microservices architecture, or follow the twelve-factor app methodology.

Architecture considerations must encompass the full lifecycle of applications, from development through production support. Be sure to harvest and assess all of a workload's environments and support structures as you consider dispositions of the production instance.
Security and compliance

Some applications have attributes that align with regulatory requirements such as PCI or HIPAA, or other externally driven requirements. Your organization might also have its own compliance parameters that are driven by business objectives. It's critical to surface all of these compliance requirements as you analyze the current landscape and prepare to vet cloud target options. You might need to establish plans to satisfy each compliance requirement that is associated with each workload.
Data

Data is an important element to consider. What type of data do you need: object storage, block storage, or real-time storage? Will your data be hosted on-premises? Will it be hosted inside country borders or outside country borders? How will you access the data? What speed do you need?
Technology

Technology choices that were made over time are a key factor in workload assessment. For example, an application that uses web containers to encapsulate business logic with Enterprise Java™ requires less effort to transform into a cloud solution than a monolithic application in COBOL. These types of attributes can guide your analysis and prioritization. Analyzing these attributes can also help you realize that you need to use a different solution if the level of effort isn't justified in the context of business alignment.
Dependencies

Upstream and downstream dependencies play a key role in workload assessment. The notion of a stand-alone workload with no impact to surrounding workloads or business capability is rare. Almost all workloads are either closely coupled or loosely coupled with key technical and business functions within an enterprise. Understand the impact of changing, dispositioning, or replacing each workload to avoid unsuccessful cloud adoption attempts.
Scalability

Scalability is inherent to applications that are designed to run in cloud environments. Scalability requires a level of abstraction between particular system calls, particular IP addresses, application state, and underlying platforms. You might need to adjust some applications to be positioned to scale based on usage and peak periods and to take advantage of capabilities such as cloud bursting.
2. Analyze pain and understand gain

It's important to understand the value of moving to cloud and understand the "pain," or effort, that is associated with each workload. Your organization must establish a quantifiable measurement to gauge the pain or gain. This measurement depends on individual workload characteristics and includes factors such as platform type, unique hardware considerations, stability, and business criticality. These characteristics are further informed by the organization's attributes and objectives. Your organization might establish these measurements for pain and gain:

    Sample gains:
        Faster time to market
        Reduced time to deploy a new build
        Increased frequency of updates or releases
        Increased user satisfaction

    Sample pains:
        Unsupported software or operating systems
        Complex code that requires rework
        Level of risk

3. Create your grid

After workload attributes are collected and can be measured, prioritize the workloads by plotting them on a gain-versus-pain grid. This activity helps you decide which workloads can be grouped into waves of effort. A gain-versus-pain grid also informs the sequence of those waves as your plan evolves to include further data analysis and stakeholder input.

In this example grid, the workload is plotted based on the assessed attributes, the gain that is associated with cloud disposition, or both. Typically, the upper-right quadrant reflects the workloads to target first, as they represent the most value with minimal effort. 
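
The grid can also be approximated in code before you draw it. The following sketch assumes that each workload has already been given a gain score and a pain score on a 0 to 10 scale. The scale, the midpoint of 5, the quadrant labels, and the sample portfolio are all assumptions made for this illustration and are not taken from the example grid.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    gain: float  # assessed value of moving to cloud, 0 (none) to 10 (high)
    pain: float  # assessed effort and risk, 0 (none) to 10 (high)


def quadrant(w, midpoint=5.0):
    """Place a workload in one of the four quadrants of the grid."""
    high_gain = w.gain >= midpoint
    low_pain = w.pain < midpoint
    if high_gain and low_pain:
        return "target first"
    if high_gain:
        return "plan carefully"
    if low_pain:
        return "quick but low value"
    return "defer or replace"


def waves(workloads):
    """Group workload names by quadrant, ordered by pain minus gain (lowest first)."""
    groups = {}
    for w in sorted(workloads, key=lambda w: w.pain - w.gain):
        groups.setdefault(quadrant(w), []).append(w.name)
    return groups


# Hypothetical portfolio for illustration only.
portfolio = [
    Workload("Stateless web storefront", gain=8, pain=2),
    Workload("COBOL billing monolith", gain=7, pain=9),
    Workload("Internal wiki", gain=3, pain=1),
    Workload("Legacy reporting tool", gain=2, pain=8),
]

for label, names in waves(portfolio).items():
    print(f"{label}: {', '.join(names)}")
```

Workloads that land in the high-gain, low-pain group are the natural candidates for the first wave. The other groups need a closer look at whether the effort is justified.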


In summary, cloud workload assessments represent an important initial step in your cloud adoption journey. Focus on due diligence by using proven processes and tools, as your effectiveness in this step has an enduring impact on the success of your journey.


 Total cost of ownership for cloud

Even if you're familiar with total cost of ownership (TCO) models, ensuring a true comparison in the cloud can be challenging. Organizations are taking a holistic perspective to include elements such as talent, capability, speed to market, and cost flexibility. Some comparisons consider metrics that are associated with the number of applications or users, but the encompassing peripheral view is also key.

TCO is the metric that organizations use to quantify and measure cloud adoption success. This perspective helps you understand the return on investment so that you can prioritize and focus. By capturing the key cost considerations, you gain a workable model that you can use to make informed decisions.
Elements of cost

In the past, you likely used TCO to determine costs for on-premises hardware and software. The factors were fairly straightforward to identify and included these costs:

    Organization and talent
    Capacity
    Communications
    Consulting
    Software
    Administration

In the cloud, these same factors apply, but many variations exist, and determining the cost can be difficult because many services are provided by third parties. In addition, cloud can be more costly than on-premises resources, especially if you're not using cloud capabilities such as automation and utilization.




Cost considerations

For an accurate TCO, you must catalog your current costs and compare them with the costs that you expect in the cloud.
Physical environment

When you compare hardware costs, consider server factors, such as hardware type, processing power, disk space, and server utilization.

Virtualization is a great first step in making environments more easily available to users and is a key element in cloud provisioning. If you’re already heavily virtualized, you’ll have an easier time using cloud. However, you might also want to enhance virtualization through automation or containerization.

Images can reduce the installation effort by providing reusable starting points. However, if those patterns don't exist, your startup cost might be higher.

Consider how you might use containers and whether you can optimize density for cost. Also, consider the network impacts, such as latency and network speed. Depending on the cloud location, you might need extra infrastructure to support your applications.

Determine the level of support that you need and the level of support that the cloud service provider can offer. Cloud environments offer many options, from nothing more than access to the environment to a fully managed service with patches and application-level support. It’s important to delineate the roles and responsibilities for the services that you need when you determine the cost impact.
Application level

Many cloud providers offer licenses for common software, or you might have different licensing options depending on the virtualized environments. The key is to fully capture the intended software use and model the licensing that you need.

A catalog of images or containers can make installation and deployment much easier. This approach requires more upfront effort and investment.

Review the architecture of each application to identify possible issues with migration to cloud. In some cases, migration is as easy as lifting and shifting the application to another device. In other cases, an application might not run due to its architecture and might need to be refactored, rewritten, or replaced. Establish appropriate patterns for encapsulating costs for each disposition type.

Consider whether applications can be replaced by new capabilities such as “Born on the cloud” apps or SaaS.
Process

Review service management to ensure that you have the coverage and support that you expect and that you can account for the applicable cost. Because self-serve plays such a large role in cloud deployments and adoption, be sure to compare your current approach with what you need to do in the cloud.

Automation is key in a successful cloud adoption journey. Whether for virtualization or scripting, you must review your current automation maturity and identify areas that you might need to expand, reflecting any applicable costs.

Security and data privacy concerns are heightened when you move to the cloud. The cloud often provides coverage, but it's important to assess what’s available, determine whether it's adequate, and then identify costs to provide any extra coverage.
People

As with any new technology that you deploy to your organization, you must factor in skill building and enablement for your enterprise. The enablement varies based on the role, but make sure that you have appropriate enablement to provide a complete TCO view and ensure successful adoption. You can also address enablement in the context of new hires or reskilling.

Capture the specific roles and responsibilities that are associated with the new environment. Cloud offers various models, from only the hardware to a fully managed service. To accurately factor in the cost, it’s critical to capture what your team does.

The impact on your organization’s culture can be hard to quantify. Cloud has an affinity toward more agile processes, which might be a learning curve or cultural shift for your organization. When you capture TCO, it’s wise to reflect any cultural changes.

You might consider and implement an organizational upgrade with a focus on cloud roles. You can also coalesce cloud roles and combine cloud functions into a cloud-aligned structure.
Key practices

Although TCO is unique to each organization, a few practices are common:

    Use TCO calculators to establish a baseline. Many analysts and vendors provide calculators to help determine costs. These calculators can be a great starting point, but to account for all the costs, remember the considerations that you made about the physical environment, application, process, and people.

    Keep in mind the different cloud deployment and support models, noting which models might be the best for you, as they affect the cost. Most TCO calculators focus on technological attributes even though much of the consideration for cloud includes people, culture, and process.

    It can become overwhelming to look at all the data to compile a true comparison of your costs. Instead, focus on the aspects that are most critical to your organization. For example, talent might need an upgrade in the context of skills, or organizations might need to adjust structure to support a hybrid cloud model. These factors are key for cloud adoption success and consideration.

    Capture the value of savings. Although TCO focuses on costs, it’s important to capture the corresponding value that is associated with the cost, whether in the form of cost savings, time-to-market improvements, or flexibility to explore different markets. Even if you don’t compile a full business case, a solid investment case that captures the costs and high-level value can help sell the effort and measure its progress.

    Automate where possible. You’re probably already investing in automation for your current environments, but it’s critical to automate with the cloud. You can reap the benefits from self-serve access and achieve cost and time savings from consistent and repeatable automations.

    Cloud requires a different approach. Cloud can be more costly than an on-premises environment if you're not using the right approach for cloud. Cloud provides cost savings through automation, utilization, and standardization. If you're not embracing these updated approaches, you’ll struggle with finding the savings that you expect. Be aware of how your proposed approach affects cost.

    Embrace agile practices. Many organizations use them. With cloud, embracing agile practices is an opportunity to refocus your organization's culture to be willing to fail fast and pivot, resulting in shorter cycle times and quicker startup. Consider your current level of agility and what you might need to do as you build out the cost.
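
As a rough illustration of the first practice, the following Python sketch totals cost estimates across the four consideration areas (physical environment, application, process, and people) over a planning horizon. The `CostItem` structure, the item names, and every figure are hypothetical placeholders, not benchmarks. Substitute your own estimates or the output of a TCO calculator.

```python
from dataclasses import dataclass


@dataclass
class CostItem:
    area: str        # "physical", "application", "process", or "people"
    name: str
    one_time: float  # up-front cost
    annual: float    # recurring cost per year


def total_cost(items, years):
    """Sum one-time costs plus recurring costs over the planning horizon."""
    return sum(item.one_time + item.annual * years for item in items)


def cost_by_area(items, years):
    """Break the total down by consideration area."""
    totals = {}
    for item in items:
        cost = item.one_time + item.annual * years
        totals[item.area] = totals.get(item.area, 0.0) + cost
    return totals


# Hypothetical figures for illustration only.
items = [
    CostItem("physical", "compute and storage", 0.0, 120_000.0),
    CostItem("application", "refactoring for cloud", 80_000.0, 0.0),
    CostItem("process", "automation tooling", 20_000.0, 10_000.0),
    CostItem("people", "training and enablement", 30_000.0, 5_000.0),
]

years = 3
tco = total_cost(items, years)
breakdown = cost_by_area(items, years)
print(f"{years}-year TCO: {tco:,.0f}")
for area, cost in breakdown.items():
    print(f"  {area}: {cost:,.0f}")
```

Keeping the people and process areas as explicit line items makes it harder to overlook them, which is the gap that technology-focused calculators tend to leave.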


 Pay back technical debt

If you've ever put off a project with the intent to finish it later, you have incurred technical debt. Like financial debt, technical debt is serious and can escalate unless you're aware of it and have a plan to manage it.
How development teams accumulate technical debt

A development team incurs technical debt when it defers software development activity so that it can complete other activities. The team typically intends to finish the postponed activity, but that intention doesn't always translate into action. The postponed activity might be as basic as documenting the code, or it might be as involved as changing a piece of code to make it performant or updating an outdated design document.

To understand the consequences of incurring technical debt, compare it to financial debt. When consumers obtain credit cards, they usually intend to pay off the cards every month. As time passes, many people find themselves in a position where they're spending more than they can afford, and they eventually reach their credit limit. As a result, they incur interest fees that require them to pay more than they would have if they paid off the debt monthly, as they originally intended.

For a team that is starting a new project, the ideal situation is to not incur technical debt at all. A team can avoid technical debt by following a "pay as you go" model and ensuring that work is truly done when it is marked complete.


Five steps to reduce technical debt

However, many teams find themselves overwhelmed with technical debt. If you are in this situation, you must find a way to reduce that debt while you continue to develop new features and move your project forward. The following practices parallel the steps that credit counselors use to help people reduce financial debt.

    Assess what you owe.

    Assessing your situation is the first and most critical step. After your team realizes that it must pay back its technical debt—which is not an easy realization, and one that must be made with the business—the team must determine what debts are owed. The best way to determine debts is to review the design and code from top to bottom.

    Start by looking at your design documentation:
        Is it up to date?
        Does it capture the most important points of your design?

    Next, examine the areas of the code that have caused the most trouble:
        What parts are the hardest to modify?
        What parts have the highest error rates?
        What parts are most important to your business?

    Answering these questions can help you rank the activities that you must do to improve the situation. But your system is composed of more than code and documentation, and those other elements can incur technical debt, too. Consider operational debt:
        Are you running on back-level versions of platform software, such as Tomcat, Spring, or WebSphere Liberty? While the proper use of Cloud Foundry buildpacks can ameliorate this concern to an extent, you must ensure that all your applications, not only the ones that you most commonly change, are kept up to date.
        Is your operations team manually doing tasks that should be automated, wasting time and money?
        Do you have monitoring in place so that you can discover problems before they bring your site down, rather than wasting time in post-mortem analysis?

    Bring in an outside expert to help you evaluate your project status and get an assessment that is based purely on the technical merits of your solution.

    The result of the assessment must include a description of what must change, ranked by priority, plus recommendations to make the changes and cost estimates for each change. With these facts, you can start to negotiate with the business owners about which debts to pay back and in what order.

    Stop incurring new debt.

    While step 1 is the most important, step 2 often leads to the most difficult discussions for the business. The hardest part of this process is learning to change your organization’s culture so that you don’t accumulate more debt than you can service. Think of the financial debt analogy. Changing your spending habits so that you use credit to buy only what you need and not what you want is a hard lesson to face and a harder behavior to learn.

    To address the issues that you identified in step 1, you must take time to implement the changes and put aside new development. Whatever approach you take to implement the changes must be negotiated with the business to maintain an acceptable balance of technical debt reduction and new development. These strategies have worked for many groups:

        Include debt reduction activities as user stories that you're implementing as part of an agile development method. Work with the business to make sure that some of the activities that you identified in step 1 are included in each development cycle. The tricky part is balancing debt reduction activities and new development activities that touch the same area of code.

        Suppose you’re developing an e-commerce site and you find that most of your problems are in the checkout component. You might want to defer new development in only that area while you repay technical debt in that section, refactoring code, updating documentation, and so on. Adding new promotions or changing your product pages are valid new development activities while checkout changes proceed.

        Employ a beta strategy. If your infrastructure is built to sustain two websites—one main site and another beta site—you can continue new development on the beta site while refactoring the main code stream. This approach has the advantage of not slowing any new development, but it can incur the cost of more difficult integration when changes to the beta site must be reintegrated with the main site. 

    Choose your debt reduction strategy.

    As with financial debt reduction strategies, consider two ways of prioritizing your technical debt "repayments":

        The "highest interest rate first" approach: In the financial world, following this approach means that you repay the credit cards with the highest rate first because they cost you the most. In the technical world, this approach means starting with the tasks that have the biggest impact. With those tasks out of the way, you clear the path for other changes that provide the biggest payback in terms of performance, maintainability, and so on.

        The "lowest balance first" approach: In financial terms, this approach means paying off the credit cards with the lowest balance first, which can result in an immediate sense of accomplishment that you did something of substance. In technical debt terms, this approach means that you complete the smallest fixes first. This approach can be useful either if you must convince the business or management of the value of paying back technical debt, or if your company culture is results-oriented, so showing immediate progress is important for sustaining the funding for larger efforts.

    Depending on the project and the available resources, you might be able to implement a mixture of these strategies by addressing several high impact issues and several small fixes to show progress.

    Stick with the plan!

    Make debt reduction part of your ongoing activities for the long term. Many companies become disillusioned with buzzwords like refactoring because they expect short-term results from a process that is a long-term investment. While your team might continue to incur new debt, make sure that you can repay it rather than allow it to accumulate.

    Track and reassess your progress.

    You must be able to report on your progress in your debt repayment activities. Capture metrics that justify the time that is spent on those activities to management and the business.

    If you refactor code to improve performance, be sure to gather the right statistics to show how refactoring improves the user experience. Another important metric that demonstrates value to the business is the rate at which you can add new features to a simplified code base.
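
The two prioritization strategies in step 3, and the mixture of them, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the `DebtItem` fields, the 1-10 impact score, the effort estimates, and the backlog entries are all hypothetical, and your assessment from step 1 might rank items on different criteria.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    name: str
    impact: int  # hypothetical 1-10 score: how much the debt costs the team today
    effort: int  # hypothetical estimate, in person-days, to repay the debt


def highest_interest_first(items):
    """Repay the items with the biggest impact first."""
    return sorted(items, key=lambda d: d.impact, reverse=True)


def lowest_balance_first(items):
    """Repay the smallest fixes first to show early progress."""
    return sorted(items, key=lambda d: d.effort)


def mixed_plan(items, n_high, n_small):
    """Take a few high-impact items, then add small fixes not already chosen."""
    plan = highest_interest_first(items)[:n_high]
    for item in lowest_balance_first(items):
        if len(plan) >= n_high + n_small:
            break
        if item not in plan:
            plan.append(item)
    return plan


# Hypothetical backlog produced by the assessment in step 1.
backlog = [
    DebtItem("refactor checkout", impact=9, effort=20),
    DebtItem("update design document", impact=3, effort=2),
    DebtItem("upgrade platform version", impact=7, effort=8),
    DebtItem("automate deployment", impact=8, effort=10),
    DebtItem("add code comments", impact=2, effort=1),
]

for item in mixed_plan(backlog, n_high=1, n_small=2):
    print(item.name)
```

With this backlog, the mixed plan starts with the checkout refactoring and then adds the two smallest fixes, which gives the business one substantial improvement and two quick signs of progress.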

These steps can't resolve all your technical debt, but they can help you determine what to do, what value your changes bring to the development process, and how well they address the problem.
Summary

If your team has technical debt, get started today to reduce it. When you persevere in this process, you can make your development artifacts easier to maintain and your development process less stressful.


 Translate a business problem to an AI and data science solution

When a data scientist interviews a line-of-business expert about a business problem, the data scientist listens for key words and phrases. The data scientist breaks the problem into a process flow that always includes an understanding of the business problem, an understanding of the data that is required, and the types of artificial intelligence (AI) and data science techniques that can solve the problem. Together, this information drives an iterative set of thought experiments, modeling techniques, and evaluation against the business goals.

The focus must stay on the business. Bringing in technology too soon can guide the solution to a technology, and the actual business problem might be forgotten or not fully answered.

AI and data science require a level of precision that is important to capture up front:

    Describe the problem to be solved

    Specify all the business questions as precisely as possible

    Determine any other business requirements, such as not losing a customer while increasing cross-sell opportunities

    Specify the expected benefits in business terms, such as reducing churn among high-value customers by 10%

Determine data analysis goals

After the business goals are clear, you need to translate them into data analysis goals and activities. For example, if the business objective is to reduce churn, you might set these analysis goals:

    Identify high-value customers based on recent purchase data.

    Build a model by using available customer data to predict the likelihood of churn for each customer.

    Rank customers based on churn propensity and customer value.

A key question to answer in follow-on activities is whether the data from the customer contains the correct information to answer the business problem. It's also important to consider how you might act on the results of this analysis to support the business goals. How do you consume and deploy the results of the analytics, and what action do you take within the business?

The business problem often requires post-processing the outputs of the AI and data science models. For example, the list of customers who might churn needs to be ordered to determine a next-best action. A customer with a low lifetime value and a high churn score might not be worth spending time and money to retain. Customers with a high lifetime value might take priority over lower-value customers with higher churn scores.
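
The post-processing step can be sketched as follows. The example assumes that a model has already produced a churn score between 0 and 1 for each customer and that a lifetime value is available. Multiplying the two to get a "value at risk" is only one possible weighting, chosen here for illustration. The field names, the threshold parameter, and the customer records are hypothetical.

```python
def rank_for_retention(customers, min_lifetime_value=0.0):
    """Order customers by value at risk (churn score x lifetime value).

    Customers below min_lifetime_value are dropped, on the assumption
    that they are not worth the cost of a retention action.
    """
    eligible = [c for c in customers if c["lifetime_value"] >= min_lifetime_value]
    return sorted(
        eligible,
        key=lambda c: c["churn_score"] * c["lifetime_value"],
        reverse=True,
    )


# Hypothetical model output joined with customer value.
customers = [
    {"id": "A", "churn_score": 0.90, "lifetime_value": 200.0},
    {"id": "B", "churn_score": 0.40, "lifetime_value": 5_000.0},
    {"id": "C", "churn_score": 0.70, "lifetime_value": 1_000.0},
]

ranked = rank_for_retention(customers)
print([c["id"] for c in ranked])  # ['B', 'C', 'A']
```

Customer A has the highest churn score but ranks last because little value is at risk, which matches the prioritization that is described above.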
Data analysis success criteria

To keep the data analysis on track, define success in technical terms:

    Describe the methods for model assessment, such as accuracy and performance.

    Define benchmarks for evaluating success, being sure to provide specific numbers.

    Define subjective measurements as best you can and determine the arbiter of success.
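
As one way to make the first two criteria concrete, the following Python sketch computes a few common assessment metrics for a binary churn model and compares them with numeric benchmarks. The choice of accuracy, precision, and recall, the benchmark values, and the sample labels are assumptions for illustration. Your project might call for different metrics and targets.

```python
def evaluate(actual, predicted):
    """Compute accuracy, precision, and recall for binary labels (1 = churned)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def meets_benchmarks(metrics, benchmarks):
    """Report, per metric, whether the measured value reaches its benchmark."""
    return {name: metrics[name] >= target for name, target in benchmarks.items()}


# Hypothetical held-out labels and model predictions.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

# Hypothetical benchmarks agreed on with the business before modeling starts.
benchmarks = {"accuracy": 0.70, "recall": 0.80}

metrics = evaluate(actual, predicted)
print(metrics)
print(meets_benchmarks(metrics, benchmarks))
```

Agreeing on the benchmark numbers before modeling begins keeps the later evaluation from turning into a debate about what "good enough" means.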

Patterns of analytic business problems

At a broad level, solving business problems with data and AI requires a combination of business analytic patterns. Although they're often indicated in a lifecycle, the actual application of these patterns is more iterative in support of the overall business objective.


When you forecast and budget based on past activity, you answer the question "What is our plan?" This stage is forward thinking and often focuses on budgeting and related milestones to create more accurate plans, budgets, and forecasts that are based on past activity.

To understand past activity, you provide descriptive answers. Answer the question, “What happened?” Examples include the exploration of financial results to find root causes, directed “what-if” analysis, year-over-year, results-vs-plan, and forecasting by using numeric input and time series.

Next, you predict what actions might happen in a sequence or what events happen at the same time. Examples include churn, financial forecasts, next best action, and predicting a device failure. Predicting a future event is a supervised data science activity, as it requires a baseline of historical data to affect outcomes.

As you segment data and detect anomalies, you predict a future event based on related data elements. This exercise is unsupervised, as the patterns that are exposed in support of the business objective might not have been known before the exercise. Examples include creating market segments, detecting fraud, and identifying entities in images.

To discover insights in content, you use content analytics, deep learning, and other techniques for text, images (faces), audio (voice printing), and video and other unstructured content. You can use techniques such as feature extraction, sentiment analysis, and deep learning to extract meaning from unstructured content in support of the business objectives.

While the ability to interact in a natural language is a possible target deployment of AI and data science models, the actual interaction is another focus area of AI and data science development. Natural language processing (NLP) analyzes unstructured text data. Many prebuilt NLP models allow the analysis of sentiment, emotion, attitude, and personality. Some prebuilt models can also consume ontologies and taxonomies to extract conceptual terms and entities from text. Often, custom NLP models are required because the business use case requires specific concepts to be modeled, and the related data is often written in a domain- or industry-specific vocabulary.

To determine optimal quantity, price, resource allocation, or best action, you use decision optimization techniques that use forecasts of future scenarios. An example in next-best-action and call-center applications is ordering lists by business priority with weights from predictive analytics.
Start with an example

A bank wants to optimize its loan underwriting procedures. Currently, it applies filters on loan applications that automatically reject the riskier ones. However, the bank is still approving too many applications that run into repayment issues.

The bank collects a large amount of data for each approved loan: 146 fields. These fields can be split into a few distinct groups:

    Loan demographics, such as the amount, the term, the interest rate, and the reason for loan.

    Applicant demographics, such as age, salary, employment length, and home ownership.

    Numerous risk factors, such as the number of public records, credit card delinquency, and bankruptcy. Roughly 70 sparsely populated risk fields cover these risk factors.

The goal is to create a predictive model to identify loans that might be bad loans. However, in the raw data, no one field indicates whether the loan is good or bad.

Rachel, a data scientist, sits down with the customer to understand the customer's business problem. Her first goal is to articulate the overall business problem: “Can I detect attributes about a person or a type of loan that can be used to flag a risky loan that needs to be processed by my loan underwriting team?”

Rachel can then expand the business problem to the next level of detail: what needs to be done to answer the problem. She needs to be able to do these tasks:

    Detect what a risky loan is or detect which loans ran into problems

    Segment loans into different categories by using information such as the purpose for the loan, the loan amount, and the term of the loan

    Sort customers into different groups based on their demographic data

    Discover patterns that cross all three types
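
Because no single field marks a loan as good or bad, the first task usually means deriving a label from several fields. The following Python sketch shows one way that might look. The field names (`charged_off`, `days_past_due`, `bankruptcies`), the 90-day threshold, and the sample records are hypothetical and are not taken from the bank's data. The real definition of a bad loan must come from the business.

```python
def is_bad_loan(loan, max_days_past_due=90):
    """Derive a good/bad label from several fields (all names hypothetical).

    Risk fields are sparsely populated, so a missing or empty value is
    treated as "no evidence of a problem" rather than as an error.
    """
    charged_off = bool(loan.get("charged_off") or False)
    days_past_due = loan.get("days_past_due") or 0
    bankruptcies = loan.get("bankruptcies") or 0
    return charged_off or days_past_due > max_days_past_due or bankruptcies > 0


# Hypothetical records for illustration only.
loans = [
    {"id": 1, "days_past_due": 0},
    {"id": 2, "days_past_due": 120},
    {"id": 3, "charged_off": True},
    {"id": 4, "bankruptcies": None, "days_past_due": 10},
]

labels = {loan["id"]: is_bad_loan(loan) for loan in loans}
print(labels)  # {1: False, 2: True, 3: True, 4: False}
```

After a label like this exists, the segmentation and pattern-discovery tasks have a target to work against.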

Understanding the business problem is fundamental to AI and data science solutions. To break complex needs into manageable and repeatable methods to solve the problem, you need a clear understanding of the problem, the metrics to baseline and validate, and the patterns to solve the problem.


Envision


 Incrementally deliver awesome solutions

The Envision practices provide your development teams with a repeatable approach to rapidly deliver innovative user experiences. To deliver a minimum viable product (MVP), teams need to understand the user and identify the highest priority items.
Why change?

Too frequently, development teams focus on delivering features and functions, not on delivering innovative experiences. When the development team's primary objective is staying on schedule, the customer's experience suffers. User experience issues that are identified in the market might take months or even years to resolve.

Teams must change how they think about their customers' needs and how they design, develop, and deliver outstanding experiences that provide customer value. When a team envisions the future delivery of applications, it must consider the platform where it will deliver the application to production.

To guide development teams through this transition, Enterprise Design Thinking was created. It provides development teams with a repeatable approach to rapidly deliver innovative user experiences. Enterprise Design Thinking focuses on delivering a minimum viable product (MVP) each time that a new feature is introduced. By delivering MVPs, a team can validate new user experiences with speed. After an MVP is validated, the next feature becomes the next MVP that is considered for design. The Envision practices show how to incorporate Enterprise Design Thinking with DevOps and the full software lifecycle of the Garage Method for Cloud.
What changes?

The Envision practices ensure that the target audience of users and customers remains at the center of all activities. All the other practices benefit from having a deep understanding of the target audience, and each practice can contribute new knowledge about the target audience.


Embrace Enterprise Design Thinking and deliver MVPs.


Understand the target audience

The team starts by developing a thorough understanding of its target audience. The goal is to understand the audience's motivations, frustrations, and goals. To gain that understanding, the team uses personas, empathy maps, and as-is scenario maps. Those techniques build a knowledge base that the team returns to and expands as it builds, delivers, and learns more about its customers. The effort to understand the target audience better and to develop real empathy with it never stops.
Understand challenges and envision the future user experience

Based on what it learned about the target audience, the team identifies critical challenges and potential responses by using techniques such as ideation, storyboards, and to-be scenario maps. This information helps the team envision the user experience that it wants the target audience to have. The team's goal is to identify experiences that the target audience finds compelling.
Define the vision and direction for the development effort

The team defines an MVP and testable hypotheses (user experiments) that drive the development effort. The team's goal is to build something that is small, compelling, and that can be measured to test its impact on customers.
Move from design to development

After the team defines the MVP and hypotheses, it creates user stories in a rank-ordered backlog that the development team uses to determine which work to complete first. Wireframe sketches and prototypes are effective ways to take the user stories and start to envision the actual implementation.
Evaluate and select your tools and cloud platform

While the designers and development team create the application that meets the needs of the customer, the operations team evaluates and selects the tools and the operating environment for the application. The operations team determines the benefits and drawbacks of tools and different cloud solutions and chooses where the application runs in production.
Communicate efficiently

Ultimately, the Envision practices of the Method never end. They are bound to all the other practices. As each iteration is delivered to the cloud, the team learns more about the target audience and can identify new compelling experiences. More MVPs and related hypotheses are created, which are then delivered and validated.


 Define your service portfolio

Organizations are increasingly adopting an outcome-focused mentality to serve both internal and external customers. This mentality is manifest in the form of physical or digital products and services. It allows for greater focus on creating minimum viable products (MVPs) that achieve business goals and improve overall satisfaction. This is especially true for information technology (IT) organizations that are seeing external vendors become the preferred advisors for internal users.

To accomplish this outcome-focused mentality, service portfolio management is emerging as an integrated approach to merge the teams, business and IT goals, and related channels that support a discrete capability.

By using roles such as offering or product managers, and tools and techniques such as Enterprise Design Thinking, agile team constructs, and marketplaces with catalogs, your organization can create competitive differentiation. 



Key practices and lessons learned

The shift to cloud introduces new characteristics, such as consumption-based billing and self service. These characteristics are changing the way users request, use, and maintain capabilities. Users are requesting consumer-shopping-like experiences where they can use catalogs and shopping carts to request new products and services.

This shift provides an opportunity for enterprise IT teams to remain relevant to the business and prove value to users by implementing digital storefronts and catalogs. Through those storefronts and catalogs, they offer capabilities that users can request and provision as needed. The capabilities might be directly provided through a cloud service model or through third parties with value-added capabilities such as monitoring or compliance to organizational standards through a cloud broker model. Ultimately, the capabilities are informed by the service portfolio.

A key step in defining a service portfolio is understanding what your organization does well. To understand what your organization does well, ask these questions:

    What are our core services?
    Which areas will we focus on or specialize in?
    How will our core services map to our strategic goals?
    What capabilities are most popular and how are they being used?
    What are the customer wants and needs?

After you answer those questions, you can start to identify focus areas that you can map to your currently offered and upcoming capabilities. To pinpoint more focus areas, use maturity models and user input through Enterprise Design Thinking. After you identify all the possible capabilities, prioritize them by value and demand to develop a holistic portfolio.

For organizations that conduct enterprise architecture planning, these new activities might seem similar, but a key lesson is that your capabilities might need to be augmented with a view of the holistic business and user perspective. This added view helps you avoid the common challenge where an organization releases a service only to see limited usage by the target audience.

By introducing multi-disciplinary roles, service portfolios are more likely to meet business and IT objectives. Such roles include portfolio managers, who are responsible for the overall business and technical strategy of all physical or digital products and services, and offering or product managers, who are in charge of specific, individual capabilities.
Industry trends

While service portfolios from a cloud perspective are still relatively new, specific trends exist that can help improve the chances of success.

One key enabler of this activity is the use of a center of competency (COC). A COC is a centralized, independent third party in an organization that assesses the value of capabilities and target opportunities. A COC can help your organization to pilot several activities:

    Hosting dedicated roles for service portfolio management, offering management, product management, or business liaisons that work with third parties to provide capabilities in the catalog.
    Expanding the catalog to create a marketplace that attaches financial metrics and revenue targets to services by using cloud characteristics such as chargeback or showback. This practice can break down the actual usage and cost of the services that are run within an organization for informational or billing purposes.
    Providing practices around team structures to facilitate business outcomes through a product or service focused view. These practices include agile team concepts such as tribes or squads that concentrate on individual capabilities.
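
To show what a showback breakdown might look like, the following Python sketch totals the cost of services per team from usage records. The record layout, the service names, and the unit rates are hypothetical. In practice, the usage data would come from your cloud provider's billing or metering reports.

```python
from collections import defaultdict


def showback(usage_records, rates):
    """Total the cost per team from usage records and per-unit rates."""
    totals = defaultdict(float)
    for record in usage_records:
        totals[record["team"]] += record["quantity"] * rates[record["service"]]
    return dict(totals)


# Hypothetical per-unit rates and metered usage.
rates = {"vm_hours": 0.10, "storage_gb": 0.02}
usage = [
    {"team": "payments", "service": "vm_hours", "quantity": 1000},
    {"team": "payments", "service": "storage_gb", "quantity": 500},
    {"team": "catalog", "service": "vm_hours", "quantity": 200},
]

for team, cost in showback(usage, rates).items():
    print(f"{team}: {cost:.2f}")
```

The same totals serve both models: showback reports them for information, and chargeback bills them to each team.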

Another trend is an increased focus on the user as part of the enterprise IT experience through techniques such as Enterprise Design Thinking, which is a user-centered way of addressing problems. Enterprise Design Thinking starts with involving users in design sessions to capture their needs rather than relying on decision makers or developers to make assumptions about what is needed. Next, you use surveys, workshops, sponsor users that are representative voices of the target audience, and agile solution development to realize iterative capabilities as part of an integrated service portfolio.

Finally, organizations are using techniques such as segmentation to shape and support service portfolios. Segmentation facilitates planning by classifying capabilities according to whether they're used to maintain the business or used to disrupt and offer new revenue or growth opportunities.

By understanding this information, you can better accomplish investment planning and optimization, as the metrics of success can vary based on the type of service classification. That is, disruptive services might have metrics that are focused on overall growth while foundational services are measured by their efficiency. A detailed example is Gartner's Run, Grow, Transform phasing.


 Start with Enterprise Design Thinking

Maybe you think you have a great idea for a new product. Or, maybe you already created a product and your competitors just released an update that scares you. Either way, you need a proven process for innovating and delivering fast. You need Enterprise Design Thinking.
Enterprise Design Thinking combines traditional techniques with new core practices

Enterprise Design Thinking starts by bringing together a series of traditional design techniques, such as personas, empathy maps, as-is scenarios, design ideation, to-be scenarios, wireframe sketches, hypothesis-driven design, and minimum viable product (MVP) definition. To these traditional design approaches, Enterprise Design Thinking adds three core practices: hills, playbacks, and sponsor users.
Hills

Enterprise Design Thinking created the notion of hills to provide a new business language for alignment around user-centric market outcomes, not feature requests. This new business language is rooted in user needs and desires. Each hill is expressed as an aspirational end state for users that is motivated by market understanding. Hills define the mission and scope of a release and serve to focus the design and development work on desired, measurable outcomes. For each project, define no more than three major release hill objectives plus a technical foundation objective.
Playbacks

As your effort moves forward, you'll want to obtain lots of feedback. You need playbacks.

Playbacks align your team, stakeholders, and users around the user value that you plan to deliver, rather than project line items. All design and development work is iterative. To scale in an iterative world, Enterprise Design Thinking formalizes these sessions into iconic playback milestones that align everyone around a set of high-value scenarios that show the value of your offering.

Early playbacks align the team and ensure that it understands how to achieve a hill's specific user objectives. In later playbacks, the development team demonstrates its progress on delivering high-value, end-to-end scenarios.
Sponsor users

Sponsor users, a special component of Enterprise Design Thinking, are people who are selected from your real or intended user group. By working with sponsor users, you can better design experiences for real target users, rather than imagined needs. If at all possible, engage sponsor users when you create your personas, and continue to include them throughout the entire design and development process.

As you engage sponsor users on a regular basis throughout the release cycle, your relationship deepens, and their feedback provides direct insight into the specialized needs of their business domains. Collaboration between sponsor users and your team ensures that your product is valuable, effortless, and enjoyable.
Important components of Enterprise Design Thinking

Enterprise Design Thinking involves several components.
Personas

Start by getting to know the person or people that you intend to help with your product. Collect information and answer a wide array of questions about them. Who are they? What are their personal demographics? What are their normal tasks? What motivates them? What problems do they face? What frustrates them?

You can gather this information from many sources, including surveys, forums, direct observation, and interviews. Then, take all of the information and organize it to describe one or more specific individuals, or personas, who represent your target audience. As you work toward your solution, return to the personas to ensure that what you are building is going to excite them and make them say "Wow."


Empathy maps

After you define one or more personas, get to know them at a deeper level. Capture what they think, what they feel, what they say, and what they do. By doing so, you'll begin to develop empathy for this person. You'll use an empathy map to identify their major pain points.
As-is scenario maps

Next, take an in-depth look at your personas' primary task scenarios. In an as-is scenario map, document the steps that they take, and as you do, document what they think, what they feel, and what they do along the way.

During this phase, be sure to capture all of the issues and problems that your personas face in their current environment. Capturing issues can be difficult because you might need to candidly discuss the flaws in your current offering. Don't be afraid to be honest. The more honest you are, the more likely you are to identify the most critical pain points. Ultimately, you develop greater empathy for your personas and gain a deeper understanding of the problems that they face as they try to achieve their goals.
Design ideation and prioritization

After you create a persona, an empathy map, and an as-is scenario map, you'll understand your target audience and the problems that it faces. You'll also probably have a few ideas about how to solve their problems and excite them. During design ideation, brainstorm and generate as many ideas as possible. Initially, don't worry about what is feasible. Generate as many ideas as possible, regardless of whether you know how to implement them. Then, organize those ideas into clusters and decide which clusters have the greatest promise.
To-be scenario maps

At this point, your goal is to create a scenario map. This scenario map, called a to-be scenario map, describes the future state that the adoption of your best ideas leads to. Capture what personas think, do, and feel during this future set of activities. Be sure to capture the "wow" aspect in this new scenario flow. The key question is "Will the person feel compelled to purchase a product that achieves this outcome? Why? Why not?"
Wireframe sketches

To get a better sense of the to-be outcome, it is sometimes useful to create a set of low-fidelity wireframe sketches with various alternatives. These wireframe sketches are not intended to represent a final design. That comes later. At this point, try to sketch potential experiences and their flows. Create a wide array of alternatives, knowing that you might throw away most of them. You can show those alternative sketches to various stakeholders and to actual members of your target audience to get feedback.
Hypothesis-driven design

A key aspect of Enterprise Design Thinking is to create a set of testable and measurable hypotheses about what you design and deliver. The hypotheses are generally in this form: "If we provide persona A, with the ability to achieve outcome B, we'll then be able to measure the impact via metrics X, Y, and Z." These testable hypotheses help you determine whether you created the compelling product that you hoped to create.
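
To keep hypotheses in a consistent, testable form, a team might capture the three parts of the template (persona, outcome, metrics) as data and generate the sentence from them. The following Python sketch is one possible way to do that; the Hypothesis class and the join_names helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


def join_names(names: Sequence[str]) -> str:
    """Join names as 'X', 'X and Y', or 'X, Y, and Z'."""
    names = list(names)
    if len(names) == 1:
        return names[0]
    if len(names) == 2:
        return f"{names[0]} and {names[1]}"
    return ", ".join(names[:-1]) + f", and {names[-1]}"


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical holder for the three parts of the hypothesis template."""
    persona: str
    outcome: str
    metrics: Tuple[str, ...]

    def statement(self) -> str:
        # A hypothesis without a metric cannot be tested, so reject it early.
        if not self.metrics:
            raise ValueError("a testable hypothesis needs at least one metric")
        return (
            f"If we provide {self.persona} with the ability to {self.outcome}, "
            f"we'll then be able to measure the impact via "
            f"{join_names(self.metrics)}."
        )
```

Requiring at least one metric is a design choice of this sketch: it makes the "measurable" part of the template impossible to skip.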
Minimum viable product (MVP) definition

After you have a set of hypotheses, you can define an MVP. An MVP is the smallest thing that can be built and delivered quickly to test one of your hypotheses and help you learn and evaluate your effort. In Enterprise Design Thinking, MVPs are closely aligned with a set of hills. Teams often define their MVP statements and their hills in parallel.

Enterprise Design Thinking for the greater good

Five Technology Design Principles to Combat Domestic Abuse
Remote design

In an ideal world, Enterprise Design Thinking is done with co-located teams with everyone “in the room”. Often teams are geographically distributed. Getting everyone together can be both expensive and difficult due to schedule and travel considerations. You can successfully implement the Enterprise Design Thinking practices with remote teams if you follow some suggestions and make use of tools that facilitate remote team collaboration.
Evaluate

Consider what really needs to be done in a formal workshop. Determine if you can conduct research ahead of time and validate it in a virtual session rather than creating from scratch in a workshop.
Engage

Engage all participants. Ensure that everyone on the team has the opportunity to contribute their thoughts at the end of each exercise.
Manage your meetings

Plan shorter meetings. Shift from a format of two full days to 2- to 3-hour sessions over the course of a week. Make sure that you provide the opportunity for the team to take a break in the middle of the session.
Plan your exercises

Build in extra time for each exercise done remotely. Exercises take 1.5 to 2 times longer when done remotely rather than in person. Set limits for your exercises and stick to them.
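
The 1.5 to 2 times rule of thumb is easy to apply when you build an agenda. The following Python sketch computes the range of time to reserve for a remote exercise from its in-person duration; the function name and the constant names are assumptions made for this example.

```python
from typing import Tuple

# Multipliers taken from the guidance above (1.5 to 2 times longer).
REMOTE_FACTOR_LOW = 1.5
REMOTE_FACTOR_HIGH = 2.0


def remote_duration_range(in_person_minutes: float) -> Tuple[float, float]:
    """Return (low, high) minutes to budget for a remote exercise."""
    if in_person_minutes <= 0:
        raise ValueError("duration must be positive")
    return (
        in_person_minutes * REMOTE_FACTOR_LOW,
        in_person_minutes * REMOTE_FACTOR_HIGH,
    )
```

For example, an exercise that takes 20 minutes in person would be budgeted at 30 to 40 minutes remotely. Planning to the upper bound leaves room to set a limit and still stick to it.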
Choose tools

Use a tool, such as MURAL, to conduct your exercises and capture all of the input from across the team. When using MURAL, be sure to prepare your boards ahead of time so you do not lose time building templates for the exercise when the team could be collaborating.
MURAL templates for Enterprise Design Thinking exercises

Use the following MURAL templates as you conduct Enterprise Design Thinking exercises.

    Hopes and Fears
    Stakeholder Map
    Team Canvas
    Scenario Map
    Prioritization
    Opportunity Canvas
    Research Plan
    Persona Grid
    Empathy Map Canvas

    Need Statements
    User Interview
    Business Model Canvas
    Storyboards
    Cognitive Walk Through
    Feedback Grid
    Experience Based Roadmap
    Retrospective

The benefits of Enterprise Design Thinking

Enterprise Design Thinking takes the best industry-recognized design methods, adds three core practices (hills, sponsor users, and playbacks), and applies knowledge from real development with real users at Garage locations.

By using Enterprise Design Thinking, you can generate ideas faster; design, evaluate, and test them faster; and develop code faster. Most importantly, you can deliver value to your customers faster.


 Define personas

Great design is anchored in a deep understanding of your user. After all, you are creating something for someone. That "someone" is a person, and the better you understand that person, the more engaging the design will be and the better it will solve the person's problem.

You've probably heard members of your development team argue about the characteristics of your product's target audience. One developer is sure that "everyone" will love a certain new feature. Another developer thinks that the users are "pretty much just like me." And still another developer thinks the real target user is incredibly unique, and nothing like "the masses." Suddenly, everyone on your team is an expert in human behavior, but no one can agree. This situation can consume, and even immobilize, a development team. When this happens, it's time to pause and align around a persona.
What is a persona?

A persona is an archetype of a user that helps designers and developers empathize by understanding their users' business and personal contexts. By basing personas on user research, teams can avoid the pitfalls of designing for anecdotal, "fake," or extreme users.

Don't confuse personas with roles: if a hill or an MVP specifies focus on a salesperson, seek out the full range of salespeople. For example, they might vary in skill levels or needs.


PERSONA CARD EXAMPLE:

Meet: Ivy
Role: Enterprise Architect at ABC TECH

Also, Enterprise Ops: "How do I ensure resiliency and implement geographic failover? How can I connect Bluemix with my existing enterprise monitoring and logging solution? Show me best practices and make sure I understand them quickly!"

"I have apps running in my datacenter and stringent security constraints. How does Bluemix fit in? Explain how that hybrid cloud thing is supposed to work. Spare me the details and show me architectural diagrams that I can understand."

ANALYSIS:

Ivy is: Defining the strategy to move her company to the cloud.

Ensuring that the operation of the app is smooth.

----------------------

Her current frustrations are:

Worried and frustrated about understanding infrastructure security, performance, reliability, how dev is implementing designs, and how to debug and recover.


----------------------

Her primary motivation is:

Find the most cost effective way to deliver a reliable, continuous, secure solution.

Hitting industry specific requirements and compliance.


-----------------------

Ivy needs:

Easily understand services, the difference between options, and how to connect the pieces to represent a solution that is cost effective to management.

A reliable and easy way to identify, monitor, and fix problems.

To be able to deploy resilient, secure, and highly available solutions.

-----------------------


Define your personas

To develop this understanding, you can use research techniques that range from broad surveys and analysis of user activity to deep contextual inquiry, or on-site observation. Recognize that any time you take users out of their work context, you might miss key observations about their workflow, such as workarounds and cheat sheets. Bring two people on site: one to ask questions and another to document notes and visual media. Ask objective interview questions.

From the information that you gather through research, you can gain insights about the persona and their real-life circumstances. These insights can direct your team's energy toward creating meaningful solutions for real problems. The best way to do this is to find your target users and spend time with them.


 Understand users through empathy maps

Empathy maps are "quick and dirty" personas. Generally, empathy maps are low-fidelity works in progress that capture and articulate the facets of a representative user as currently understood and viewed by a team. The facets are thinks, feels, says, and does.


The value of empathy maps

As your team identifies what they know about the user and places this information on a chart, you gain a more holistic view of your user's world and the problem or opportunity space. By having a more holistic view, you gain insights that add layers of context about the relationships between the users and their experiences. A more holistic view can also reveal the ways in which your user most naturally engages with what your team designs and builds. In other words, your designs should reach out to the user. Empathy maps can help you do that.
Get started

Use empathy maps when your team needs to think systematically about users and all of their attributes and dimensions, beyond just their job roles. At the beginning of a release, you can use empathy maps to capture the current state of knowledge and uncover assumptions that can be tested and validated through research. You can also use empathy maps to capture and organize data that is gathered during user research. Empathy maps provide a structure for talking about users.


 Solve problems through ideation

When you have a clear idea of the problem that you want to solve, you can employ one of several design-thinking techniques to envision solutions.
Ideation

Ideation is the process of generating big ideas. Enterprise Design Thinking explains big ideas by contrasting them with features:

    Big idea: Algorithms to predict the future from the past
    Feature: Charts with lines that show prediction

Moving to big ideas takes your mind out of the problem space and into the realm of solutions. This realm is where you innovate and create revolutionary, rather than evolutionary, designs.

In the ideation process, have each member of the team put a minimum of five big ideas on the wall. Of each person's five big ideas, at least one must break the laws of physics. This last rule forces each member of the team to step out of their current thought process. It's a nod to the famous quotation attributed to Albert Einstein, "We can’t solve problems by using the same kind of thinking we used when we created them."


Storyboards

At their heart, storyboards tell human-centric stories. A storyboard might explain a new technology, a change in process, or how two personas work together. Regardless, the story is from the user's standpoint.

The story is presented through a series of 6 - 8 "frames" with pictures, thought bubbles, and annotations. If you're working on the flow for a new user interface, you can use low-fidelity wireframe sketches for the frames in your storyboard.


Think about a comic strip. A whole story is told in 6 frames, often with fewer than 25 words. The number of frames in the story is limited; this helps the storyteller process through the natural arc, from beginning to middle to end, without diving into page details or technology. This practice can help your team get the idea out without getting lost in the details. The details are unnecessary at this point. When you find the right solution for the user, the team will find a way to make it happen.

Give each person a piece of paper that is divided into 6 or 8 frames to sketch in. Consider using sticky notes; they help the team converge on common ideas and move thoughts around. Independently, each person on the team sketches his or her vision of the user's preferred journey, imagining that the identified problem has been solved. What does your persona's task now look like? How has the task changed? How do you imagine their feelings would have changed?

After the team members create their storyboards, they share them, taking no more than 1 minute to tell their stories.


To-be scenario mapping

To-be scenario mapping can be used either independently or as a follow-up to storyboarding. When you create to-be scenarios for your personas, ask yourself what the persona's ideal journey might be. Perhaps the task can be eliminated altogether. Or perhaps steps can be improved, combined, or removed.

You can couple to-be scenarios with brainstorming, or you can create to-be scenarios before or after an ideation session. Any time you hit a block in identifying a great moment in your to-be scenario, break out and ideate on potential solutions. Then, fold those new ideas into the scenario and see how your persona's experience changes.


 Understand users through scenario mapping

After you understand your persona and have a reasonable amount of empathy for them, it's time to take a closer look at the problem or opportunity space that you want to impact. You probably have an idea of the area that you want to target. For example, you want to move to mobile so that your salesperson persona, Kathy, can access pertinent client information more quickly while on the road. But what is the real opportunity?
Problem identification: Scenario mapping

Porting the entire set of tools that you think Kathy wants to a mobile app isn't cost effective and might incur too much unnecessary risk. Too many questions are unasked and unanswered. Instead, clearly identify the problem or opportunity space within the context of your persona. This approach can help you understand where to focus your energy so that you can create the greatest impact for your persona.

Scenario mapping helps you see the experiences that your personas are having as they traverse through their activities. The relationship between task and experience provides a tremendous opportunity for insight. Your team can immediately see which feelings a persona is having. In the case of an as-is scenario, your team can see what is generating those feelings or experiences. In the case of a to-be scenario, your team can ensure that whatever they design and build generates positive, delightful experiences.


Define scenarios by using scenario maps

A scenario is a workflow for one or more personas. Scenarios are minimally captured in written text form, but can also be conveyed orally, as a storyboard, or as a video. Like any good story, a scenario begins with a motivation and involves a series of steps toward an intended—or unintended—outcome. Design thinking suggests using scenario maps to rapidly prototype scenarios. From those scenarios, you can identify pain points of users and opportunities for design.

As-is scenarios capture the workflow as it is today. To-be scenarios, which are covered in another article, are a way to envision how your persona will experience the workflow after it is redesigned.

Scenario maps use a relatively simple formula. You capture information in rows, using a different color of sticky note for each of the following rows:

    The steps that your persona takes to complete a task

    The specific activities in which your persona engages to complete the step

    What your persona is thinking during this step

    What your persona is saying during this step

    What your persona is feeling during this step

In addition to the steps that your persona completes, your team can see the feelings that these steps generate. Identify the negative feelings and see where they are coming from. Is the absence of a specific functionality causing extra work for the persona? Are unreasonable demands being placed on the persona, such as expecting him or her to remember too much information from one step to the next? Where can your team have the greatest impact on improving your user's experience?
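
If the team transcribes a scenario map from sticky notes into a shared document, the five rows map naturally onto a small data structure, and the negative feelings can be pulled out for discussion. The following Python sketch shows one possible shape. The ScenarioStep class, its field names, and the is_negative flag (the team's own judgment about a feeling) are assumptions made for this example, and the sample data about Kathy is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScenarioStep:
    """One column of a scenario map; each field corresponds to one row."""
    step: str              # the step that the persona takes
    activities: List[str]  # specific activities to complete the step
    thinking: str          # what the persona is thinking
    saying: str            # what the persona is saying
    feeling: str           # what the persona is feeling
    is_negative: bool      # team's judgment: is the feeling negative?


def pain_points(scenario: List[ScenarioStep]) -> List[ScenarioStep]:
    """Return the steps that generate negative feelings, in workflow order."""
    return [s for s in scenario if s.is_negative]


# Invented sample data for illustration only.
as_is = [
    ScenarioStep("Prepare for visit", ["Open laptop", "Search client records"],
                 "Where is the latest order?", "Give me a minute.",
                 "Rushed", True),
    ScenarioStep("Meet client", ["Present offer"],
                 "This is going well.", "Here is our proposal.",
                 "Confident", False),
]

for point in pain_points(as_is):
    print(f"{point.step}: {point.feeling}")
```

Keeping the steps in workflow order makes it easier to see where in the persona's journey the negative feelings cluster.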

An as-is scenario is a great way to surface the problems that you intend to solve with your project. You might start a project with a fairly clear sense of the problem, or you might be looking to make the biggest impact that you can with limited resources. Taking the time to map your persona's scenario and add his or her experience to that scenario can both validate what you know and uncover problems that you didn't consider.


Create a minimum viable product (MVP)

Minimum viable product (MVP) can mean different things depending on where a company is in its transformation lifecycle and which practices it follows. No matter the nuance, the intent of MVP is the same: reduce risk while you increase return on investment by building only what is necessary. "What is necessary" means the smallest possible version of a product that can be used to run a meaningful experiment to test key hypotheses and determine whether to continue investment.

For example, a bank wants to engage Alice, a millennial user. After conducting user research and creating an empathy map and an as-is scenario, the bank understands that Alice isn't in the habit of saving money. It wants to help her start saving. The bank believes that if Alice is happy and increasing her wealth, she'll use more of the bank's products and recommend the bank to her community. The bank identified its assumptions and created a hypothesis: "If we provide Alice, a millennial, with a fun application that gives her savings options that she can run at the touch of a button, Alice will increase her savings. We will see that she internalized the value of incremental savings through her continued use of the application."

An MVP is built with minimal investment and features, but is focused on a viable solution to an opportunity. It's the foundation upon which to iterate to deliver measurable business outcomes.

Each MVP is aligned to a business initiative, where you use a hypothesis-driven approach to achieve incremental business value. An MVP might have specific features that are used to gain insight on high-risk assumptions. It might not address enough key user needs for a scaled product launch.

MVPs are used for both learning and for delivering rapid business outcomes. Many MVPs are production pilots to get the highest fidelity feedback. You must learn before you're ready to go to scaled production. The product owner works with the squad to evolve the MVP to have enough features and meet enough nonfunctional requirements to satisfy the target business objectives. The product owner also decides whether an MVP release is ready for scaled production, or whether the squad should shift to another business initiative.
How to form the MVP

Before you write an MVP, identify your assumptions and write hypotheses to test them. That information can help you form the MVP. What exactly is the smallest thing that the bank can build to test its hypothesis? "A fun application that gives her savings options that she can run at the touch of a button."

What else might the bank need to know to test this hypothesis but reduce risk by not writing extraneous code? Alice is a millennial, so she will likely use her smartphone. She is not currently saving, so it's unlikely that she will download a native application to begin taking actions that she doesn't take now. Add that information: "A fun, responsive web application that gives her savings options that she can run at the touch of a button by using her smartphone."

Notice that the additions are in response to what the bank knows about Alice, not the result of business expectations or technical limitations. By staying focused on Alice, the bank can get the purest tests, which provide more reliable data.

Imagine that Alice is in Canada, where applications are often in both English and French. Code the initial MVP in one language, and then code it in another language after the first round of feedback is received. This approach saves time and effort by waiting until the essentials are validated. The MVP now states: "We will build Alice a fun, responsive web application in English that gives her savings options that she can run at the touch of a button by using her smartphone."

The bank can ensure that its application for Alice achieves the results that the bank expects: to see Alice increase her savings and continue to use the application. The data from tests will either support that the business and design are going in the right direction or help the bank learn and shift. When Alice is happy and the application is providing her with value, she will continue using it.


 Share ideas by using wireframes

When you work in software development, you inevitably will see a wireframe sketch of one kind or another. Wireframe sketches communicate the most basic aspects of the user interface and the most important task flows through your product. You can use wireframe sketches to quickly gather feedback from members of your target audience and learn more about what matters to them.
How to create wireframe sketches

No matter how much experience you have with wireframe sketches, you should always follow a few basic guidelines to ensure that you get the most from your efforts. In most cases, start with low-fidelity wireframe sketches. Low-fidelity sketches are powerful because you can use them to iterate quickly, learn, and evolve. In the simplest case, you draw your sketches using paper and pencil along with large sticky pages and sticky notes. You can draw a few rough sketches, using a simplified set of interface elements, show them to potential users, and get their feedback. If one set of sketches isn't quite right, toss it aside and start over. 


In contrast, if you show a high-fidelity prototype to users, they often focus on esoteric aspects of your design, such as the quality of your graphics, or your color schemes, fonts, and typography. They stop thinking about the task flow and focus on aesthetics. Users also tend to think that high-fidelity designs are already complete. People start to make assumptions about what can change in the design and what cannot. They start to censor themselves, and the quality of their feedback can diminish. There is a point in the process where that input is valuable, but early, while you are still learning about users' primary goals and tasks, those details can get in the way.

Early in the design process, create many alternative designs. Let your target audience compare the alternatives. Let them draw their own versions. The process of creating sketches must be quick, easy, and even disposable. Creating more alternatives can lead to greater creativity and a broader discussion of priorities and approaches. You'll learn much more about your target audience.
Wireframe sketches for mobile

Creating wireframe sketches for mobile applications isn't too different from creating sketches for other devices. However, when you create wireframe sketches for mobile, a few considerations are critical.

First, think about the interaction model. When you interact with a traditional application that runs on a laptop, you rely on the keyboard, a mouse pointer, and perhaps a touch pad. In traditional applications, you don't touch the UI with your hands. In the mobile context, you tap, thumb scroll, and swipe to interact with the application. With that in mind, you want to make sure that some of your wireframe sketches are created at actual scale. You might start with large sketches, but at some point you need to sketch at actual scale to understand what users will experience when they actually touch the screen and your UI controls. Low-fidelity wireframe sketches and a limited set of UI widgets work well.


Another critical consideration is scroll depth. Deep scrolling, which is considered to be a design flaw in traditional applications, is a valuable design tactic in mobile applications. As some designers like to say, "thumb scrolling is free" in mobile applications. One technique is to sketch a set of deep UI alternatives on a large sketch pad and then create a stencil of a mobile phone where the screen is "cut out." You can slide the mobile phone stencil up and down to simulate deep scrolling and use that to get feedback from users.

You can also use tools to simulate deep scrolling, and some of them do provide value. However, many of those tools generate code for you. If you move to the tools too quickly, you might limit your creative process because you won't want to break an evolving prototype.
Summary

Regardless of the application type, low-fidelity wireframe sketches are almost always the best place to start. You're more likely to get feedback that reflects your target audience's task goals. And with that feedback, you can validate the aspects of your product that will make them say, "Wow!"


 Write user stories

In the Garage Method for Cloud and many other agile methods, one of the key tools to communicate between the product owner (the customer) and the development team is the user story. Martin Fowler and Kent Beck define a user story as "...a chunk of functionality (some people use the word feature) that is of value to the customer... The shorter the story the better. The story represents a concept and not a detailed specification. A user story is nothing more than an agreement that the customer and the developers will talk together about a feature." 1 (Author's emphasis.)
How to write a user story

User stories act as the common language between all participants in the development process: product owners, architects, designers, and developers must share a common understanding of stories. You can achieve this common language by focusing on the value that each story provides to users. This idea of user value can be understood across disciplines and is key in prioritization, design, and implementation decisions.

A user story envisions and describes a future in which a small increment of user value is delivered. Approach writing stories by asking three questions:

    Who is the user that benefits? You can use a persona name to concisely identify the user.

    What can the user do that they couldn't do before? Maybe they can see some important information or take an action that helps them to achieve something.

    How does this change benefit the user? Sometimes the benefit is obvious. If it isn't, state it explicitly.

After you have the answers to those three questions, write them down in a sentence or two.

For example, Adam sees a summary with the status of each service, which allows him to identify any outages at a glance.

In this example, Adam is a site administrator. His name is in many stories, with each one using the team's shared knowledge of the persona.

In the Garage Method for Cloud, this concise format is preferred over the "As a <role>, I want to <capability> so that I can <benefit>" template, which is popular elsewhere. The Garage style is more straightforward and direct, which increases the signal-to-noise ratio. This style makes it easier to quickly extract meaning from a list of stories (a backlog) and harder to disguise bad stories with boilerplate.
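
A team that keeps its backlog as text could flag stories that still use the role template with a simple heuristic. The following Python sketch is an assumption-laden illustration, not a Garage tool: the function name is hypothetical, and the check looks only at whether a story opens with "As a" or "As an", so it can misfire on sentences such as "As a result, ...".

```python
import re

# Heuristic: the role template opens with "As a ..." or "As an ...".
ROLE_TEMPLATE = re.compile(r"^\s*as an?\s", re.IGNORECASE)


def uses_role_template(story: str) -> bool:
    """Return True when the story text opens with the 'As a <role>' boilerplate."""
    return ROLE_TEMPLATE.match(story) is not None


backlog = [
    "Adam sees a summary with the status of each service, "
    "which allows him to identify any outages at a glance.",
    "As a mechanic, I want to see a list of parts suitable for my vehicle "
    "so that I don't have to look at irrelevant parts.",
]

# Collect the stories that a reviewer might want to rewrite.
flagged = [story for story in backlog if uses_role_template(story)]
```

A flagged story is only a candidate for rewriting; whether it states clear user value is still a judgment for the team.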

In addition to a concise statement of user value, stories must have acceptance criteria and a definition of done:

    Acceptance criteria are the requirements that must be met for a user story to be completed.

    The definition of done is a set of criteria that is common across related user stories. The criteria must be met to close the stories.

    For example, every story must have automated test cases that validate the functional and nonfunctional requirements. Acceptance criteria and a definition of done provide a clear view to the whole team of which conditions must be met to declare that the user story is done.
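
The relationship between a story, its own acceptance criteria, and the shared definition of done can be sketched as a small check. In the following Python example, the UserStory class, its fields, and the two helper functions are hypothetical names chosen for illustration; the sketch assumes that the team records each condition as a short text label and marks it as satisfied when it is met.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class UserStory:
    """Hypothetical backlog item with story-specific acceptance criteria."""
    statement: str
    acceptance_criteria: List[str]
    satisfied: Set[str] = field(default_factory=set)  # conditions met so far


def unmet_conditions(story: UserStory, definition_of_done: List[str]) -> List[str]:
    """List every condition, story-specific or shared, that is not yet met."""
    required = list(story.acceptance_criteria) + list(definition_of_done)
    return [c for c in required if c not in story.satisfied]


def is_done(story: UserStory, definition_of_done: List[str]) -> bool:
    """A story is done only when nothing on either list remains unmet."""
    return not unmet_conditions(story, definition_of_done)


# The definition of done is shared across related stories.
DEFINITION_OF_DONE = ["automated tests validate functional and nonfunctional requirements"]

story = UserStory(
    statement="Adam sees a summary with the status of each service, "
              "which allows him to identify any outages at a glance.",
    acceptance_criteria=["summary lists every service", "outages are highlighted"],
)
story.satisfied.update(["summary lists every service", "outages are highlighted"])
```

At this point the story meets its own acceptance criteria but is not yet done, because the shared condition about automated tests is still unmet.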


Examples of well-written user stories

To learn how to help your team write good user stories, review the following examples of poorly written user stories and see how they can be improved.

Doug is a web developer. He needs to know the areas of his website that aren't being accessed so that he can improve the website.

    Bad story: As a web developer, Doug needs a report on website usage.

    This story is too general. A developer wouldn't know what to implement. The story doesn't indicate why Doug needs the report. Because that information is excluded, the development team might implement functionality that doesn't meet Doug's requirement.

    Good story: Doug views a report displaying usage statistics about each page of his website. He can see which pages are not visited frequently and make improvements.

Carly is a customer who uses Doug's website. She wants to provide text feedback and a rating to the website development team when she encounters a problem or likes a page.

    Bad story: Carly can report problems and rate pages on the website.

    This story is too general, contains two different functions to implement, and might be too large to complete in one iteration. The team can break this type of requirement into a few stories that can be written in parallel.

    Good story 1: Carly can rate each web page using a 1 to 5 star system so that she can indicate which pages she likes and which she thinks need improvement.

    Good story 2: Carly provides text feedback to the development team when she encounters a problem with the website. Her feedback is delivered by email.

Mike is a mechanic. He needs to use a website to look up parts for the vehicles that he is working on.

    Bad story: As a mechanic, I want to see a list of parts suitable for my vehicle so that I don't have to look at irrelevant parts.

    The not-great story has a poor signal-to-noise ratio. The actual information is hidden behind the “as a” construct that has no meaning, and the “so that” construct is redundant. In the better version, the persona Mike is called out directly. The “can” phrase is more story-like and describes what Mike needs and why.

    Good story: Mike the mechanic can see a list of parts suitable for his vehicle.

Mary is an integration developer who needs to use documentation to do her job.

    Bad story: As a developer, I need to write documentation.

    The boilerplate might look like a user story, but it isn't. The user shouldn't be the person who implements the story. The feature that is described in the story must be a delightful capability, not a burdensome obligation. In the better story, Mary benefits from the feature for the reason that is stated in the story.

    Good story: Mary, the integration developer, can create new flows by referencing documentation.

Although a good user story is expressed concisely, it also needs enough detail for a developer to act on it.

    Bad story: Make a dashboard.

    This story gives the developer no idea of what to implement or how. The better story includes a concise description of the user, the function that he needs, and why he needs it.

    Good story: Pat, the platform specialist, can track and monitor stats of the plan to check that a flow is performing its function.

Sometimes a user feature is triggered by something that happens in the system rather than by a specific user action. Capture this type of work as a task, not a user story.

    Task: When an order is received, Frank sees a "sparkler" on his dashboard.

    This task reflects that when an order event is received by the system, Frank is notified with a sparkler on his dashboard. The task mentions that the event has a visual indicator, but doesn't specify the design for what is shown.

Each good example includes all of the facets that make up a complete user story: a persona, a capability, and the resulting benefit.

Manage your user stories

After you learn to write good user stories, you need a place to track and manage them. Store user stories in a feature tracking system such as GitHub Issues. By using a tracking system, your team can rank the stories and track them through to completion.

No tool can replace human-to-human communication. Don't allow the tool to turn communication into a human-to-system exchange only. People must talk to each other; the stories are placeholders for deeper communication.

As user stories rise in priority in the backlog, the team must communicate with the stakeholders—both architects and customers who want the user story—to ensure that when the story is delivered, it satisfies the stakeholder requirements.

Your team can store the results of stakeholder communications directly in the tracking system. That way, all the information that is needed to design, implement, test, and deliver the user story is in one place.

The repository where you manage user stories must be readily accessible to the entire team. To track user stories, use tools such as kanban boards and dashboards. By doing so, you always know where each story is, first in the ranked backlog and then on the path through development and testing to deployment.

Test your user stories

Each user story must be testable. According to Kent Beck, tests must be "isolated and automatic." 2 Beck expanded on his idea of automatic testing when he introduced the process of doing test-driven development with the JUnit framework. 3 Throughout Beck's examples, the tests are always what testers call functional tests.

Developers and product owners often stop after they implement functional tests, as though functional tests are the only tests that must be done. As a result, they test no more and no less than what is defined in the phrasing of the user stories. That mindset isn't the original intent of what a user story represents or what agile testing is supposed to cover.

Next, follow a practical example of how to develop user stories for a minimum viable product (MVP) and see how both functional and nonfunctional facets are handled throughout the process.

Define user stories for an MVP: A practical example

The Garage Method for Cloud follows a process that begins by using Enterprise Design Thinking to produce an MVP statement. An MVP statement is the absolute bare minimum in a delightful experience that your target persona accepts to accomplish a goal. After the MVP is defined, the inception process begins. In the inception process, the MVP statement is expanded into user stories.

The MVP must explicitly state a measurable goal. Often, the MVP statement includes all these facets.


For example, consider a recent MVP statement from the airline industry:

Polly the Passenger should be able to rebook her canceled flight on her phone within one minute without having to speak to a human at the gate.

Notice that this MVP statement has all the mentioned facets, including a measurable goal and targets on the actions. In the inception process when this MVP statement is turned into user stories, stories might be as follows:

-------------------------------

MVP: Polly the Passenger should be able to rebook her canceled flight on her phone within one minute, without having to speak to a human at the gate.

- Polly can view up to 20 available flights for rebooking.
- Polly selects one of the available flights and views the available seats on that flight.
- Polly can view available flights in under 2 seconds.
- Polly can view the available seats on a selected flight in under 2 seconds.

------------------------------

Notice the progression in the examples. The action in the MVP (rebook a canceled flight) is expanded into a set of user stories that each have either a functional or a nonfunctional (performance) facet. User stories must be as small as possible, which is why the performance tests are broken out. Each user story has one or more tests, such as a test for whether you can display 20 flights, or an automated performance test to ensure that the display is shown within 2 seconds. These tests ensure that after a functional or nonfunctional story is done, it stays done. The tests that are delivered alongside the story act as a guarantee. For example, even if the page already loads within 2 seconds, marking the performance story complete means adding tests to detect future regressions.
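
As a rough illustration of tests that keep a story done, the following sketch checks the functional story (at most 20 flights) and the performance story (under 2 seconds) in one pass. The function `fetch_available_flights` is a hypothetical stand-in for the real flight search, and the flight identifiers are invented; only the two limits come from the stories above.

```python
import time

MAX_FLIGHTS = 20    # from the story "view up to 20 available flights"
MAX_SECONDS = 2.0   # from the story "view available flights in under 2 seconds"


def fetch_available_flights():
    """Hypothetical stand-in for the real flight-search call."""
    return [f"FLIGHT-{n}" for n in range(1, 16)]


def check_flight_list_stories(fetch=fetch_available_flights):
    """Return (functional_ok, performance_ok) for the two flight-list stories."""
    start = time.perf_counter()
    flights = fetch()
    elapsed = time.perf_counter() - start
    functional_ok = len(flights) <= MAX_FLIGHTS
    performance_ok = elapsed < MAX_SECONDS
    return functional_ok, performance_ok


print(check_flight_list_stories())
```

Running a check like this on every build is one way to detect a regression after the story is closed. A real performance test would call the deployed service and would probably take several measurements rather than one.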

What user stories are not

Now that you know what user stories are, how to write them, and how squads manage them, make sure that you know what isn't included as part of a user story. User stories are not a design specification for the function. The initial version of the user story focuses on what is needed, by whom, why, and a measurable result. The story doesn't focus on how to implement it. The implementation is determined when the story is under development. Then, the developer can enhance the story with design information and decisions that were made as part of the interactions with stakeholders and the development process.

User stories don't define when to implement the function. After a user story is written, it's added to a product backlog and ranked against all of the other known user stories.

Go deeper: User stories must cover more than what the product does

A whole set of user stories is often forgotten in the inception process. Those user stories are the ones that are in between the obvious actions. Consider internal stakeholders when you write stories because stakeholders are often critical to measuring the business hypothesis. For example, a team might have stories for Bob the business owner, who needs a dashboard to track sales or to coordinate product dispatch.

User stories around the nonfunctional facets are equally important. These stories capture what is required to delight the user, such as performance and security. As shown, performance requirements can be expressed as response-time targets and validated by the product owner.

Expressing security requirements as a user story requires more domain knowledge. Something like "As a shopper, the site is secure" is a poor user story because it doesn't have a clear definition of done. What would a product owner do to test it? Instead, discussions about security might lead to the development of new personas, such as hackers or black hats.

For example, no user wants to log in, but users care about whether other people can see their data. You might express this requirement as "Hank the hacker cannot see Polly's flight details." In practice, this story is implemented as a login page. The POscript (acceptance steps) for this story is a set of actions that show how Polly continues to see her flight details by logging in. A second set shows how Hank can't log in with an incorrect password. For a more security-critical application, you might have a number of deeper, but still concrete, security stories. For example, you might write "Blake the black hat cannot use an SQL injection attack on any of the APIs" or "Hank the hacker cannot hack the site by using the OWASP (Open Web Application Security Project) Top 10 vulnerabilities." These stories are good because they're measurable.
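
The following sketch suggests how such security stories might become automated checks. It is an illustration under stated assumptions, not the method's own implementation: the table layout, the password, the flight label, and the helper names are all hypothetical, and the unsalted hash is a simplification.

```python
import hashlib
import sqlite3


def _digest(password):
    # Simplified for the sketch; a real system would use a salted, slow hash.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def make_demo_db():
    """Build an in-memory database that holds one passenger (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE passengers (name TEXT, pw_hash TEXT, flight TEXT)")
    conn.execute(
        "INSERT INTO passengers VALUES (?, ?, ?)",
        ("polly", _digest("correct-horse"), "GM123 to Boston"),
    )
    return conn


def flight_details(conn, name, password):
    """Return the flight for a valid login, otherwise None."""
    # Bound parameters keep attacker-supplied text out of the SQL statement.
    row = conn.execute(
        "SELECT flight FROM passengers WHERE name = ? AND pw_hash = ?",
        (name, _digest(password)),
    ).fetchone()
    return row[0] if row else None


conn = make_demo_db()
# Polly continues to see her flight details by logging in.
assert flight_details(conn, "polly", "correct-horse") == "GM123 to Boston"
# Hank can't log in with an incorrect password.
assert flight_details(conn, "polly", "guess") is None
# Blake's injection attempt is treated as data, not as SQL.
assert flight_details(conn, "polly' OR '1'='1' --", "x") is None
```

Each assertion maps to one concrete, measurable statement about what a persona can or cannot do, which is what makes the story testable.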

Work items for nonfunctional facets

Some nonfunctional requirements shouldn't be expressed as user stories but should still be in the same ranked backlog as the user stories. For example, in the flight-booking MVP, the measurable goal was to be able to rebook without having to talk to a human. If the expectation is that no human interaction is required 100% of the time, continuous availability is required. That requirement is expensive to implement.

To offset cost, you can scale back that availability requirement to a percentage of the time, such as 99%, which allows for a cheaper solution that is short of continuous availability. Meeting an availability target is important, but it's not a user story. A product owner can't validate a story such as "Polly should be able to access the mobile application and avoid talking to a human during no less than 99.9% of the year. Outages should last no longer than 30 minutes at a time." Although the story has numeric success criteria, it describes events that happen (or don't happen) in the future. A product owner can't validate it by using normal "given/when/then" criteria unless they watched the site for a year and timed outages. Instead, the team can convert the requirements into actions that help to achieve them and track the actions in the backlog as tasks. For example, activities to support a high availability requirement might include "Perform chaos testing to ensure system resilience" or "Mirror infrastructure to a second region for HA/DR." Nonfunctional work items emerge as part of discussions between the developers, the squad lead, and the product owner.
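
To make the availability percentages concrete, this short calculation converts a target into the downtime that it permits per year. It assumes a 365-day year; the two targets are the ones mentioned above.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Hours per year that a service may be down at a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours per year")
```

A 99% target allows roughly 87.6 hours of downtime per year, and a 99.9% target allows roughly 8.76 hours. Confirming either figure by observation takes a full year, which is why the target is tracked through tasks instead of being validated as a story.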


-------------------------------

Scale, Manageable, Security, Usable, Accessible, Performance, Resilience, Globalized

- Polly can view available flights in under two seconds.
- Mirror infrastructure to a second region for HA/DR.
- Perform chaos testing to ensure system resilience.
- Hank the hacker cannot see Polly's flight details.
- Blake the black hat cannot use an SQL injection attack on any of the APIs.

-------------------------------

Stories can be satisfied by implementing new product features or by changing the nonfunctional characteristics of an application. Use tasks only for plumbing work or nonfunctional requirements. Nonfunctional requirements are either verifiable or non-verifiable. A requirement is non-verifiable when direct measurement is either too difficult or requires more time than is practical before sign-off. All stories must be verifiable.


Tasks that track availability work are only one type of nonfunctional work item. Others include operations such as maintenance, logging, and alerting. Shifting cloud service management and operations left and building in observability both need work from the development team. This work is important and must be tracked, but it isn't visible to a user. Another type of work item is plumbing (infrastructure) to support new capability.

Work items for plumbing

In the Garage Method for Cloud, the ideal user story takes about a day to complete. In practice, some stories are larger. You might need substantial invisible plumbing to make them work, or they might have complex dependencies. For example, setting up the ledger infrastructure and chain code in a blockchain project might take several days, but that effort isn't directly visible to a user or even a product owner.

To limit work-in-progress and also avoid multiday user stories, pull out substantial plumbing or dependency work from these user stories. This plumbing work can be tracked on the backlog as a task and the product owner doesn't track it. Normally, this split into tasks happens closer to when work starts on the user story.

The following table summarizes the different types of work items that can appear in a backlog:


Work item                          User story      Task            Defect
Has a persona                      Yes             No              Maybe
Who creates it                     Product owner   Delivery team   Product owner, delivery team, or users
Who prioritizes it                 Product owner   Delivery team   Product owner
Who accepts it                     Product owner   No one          Originator
Measurable (directly verifiable)   Yes             Maybe           Yes
Creates capability                 Yes             No              No
Must have points                   Yes             No              No


Who owns the work items that cover nonfunctional facets?

After you write your nonfunctional work items (user stories or tasks), make sure that they're not ranked so far down the backlog that they're never addressed. Because product owners often think in functional terms, they must be educated in nonfunctional thinking and persuaded that nonfunctional work items are as important to the business as functional user stories. Consider the airline example. If you implemented the ability to reschedule a canceled flight from a cell phone and that function was available only 50% of the time, your customers would be unhappy.

In the Garage Method for Cloud, an ongoing negotiation occurs between the squad lead and the product owner to ensure that the backlog doesn't swing too far in one direction. A key responsibility of the squad lead is to make sure that both the nonfunctional and functional facets of the application are addressed. Another role that can help to rank user stories is the architect. Some agile methods maintain two different owners: an architecture owner and a product owner. Your team might or might not choose to go that far. Make sure that all three roles have a seat at the table in ranking discussions.

Implications for squads: Testing

Planning Extreme Programming makes this key point:

    Eventually the customer will have to specify acceptance tests whose execution will determine whether the user stories have been successfully implemented.

Performance testing is an important part of developing a system in the Garage Method for Cloud. Individual functional user stories often have performance constraints that must be tested as part of the unit testing process, especially when the team is using a microservices architecture. In a microservices architecture, small sets of user stories often map directly to specific microservices. However, in addition to those user-story-specific tests, the team must also develop two types of acceptance tests: functional acceptance tests and nonfunctional acceptance tests.

In the airline MVP statement, one agreement between the team and the product owner was that the entire rebooking process must occur within a minute. The total performance budget of all the different steps of the rebooking process must fit within that minute. The application must pass the functional acceptance tests that are defined by the product owners and pass an end-to-end performance test that demonstrates that the entire process can occur within 1 minute.
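
A performance budget of this kind can be checked with simple arithmetic, as in the following sketch. The one-minute total comes from the MVP statement; the step names and per-step numbers are assumptions made up for illustration.

```python
TOTAL_BUDGET_SECONDS = 60  # "within one minute" from the MVP statement

# Hypothetical per-step budgets; the step names and numbers are assumptions.
step_budgets = {
    "open cancellation notice": 5,
    "view available flights": 2,
    "choose a flight": 20,
    "view available seats": 2,
    "choose a seat and confirm": 25,
}


def remaining_budget(budgets, total=TOTAL_BUDGET_SECONDS):
    """Return the seconds left after all step budgets are subtracted."""
    return total - sum(budgets.values())


print(remaining_budget(step_budgets))
```

A negative result means that the individual step targets can't add up to the end-to-end goal, so at least one story needs a tighter target before the end-to-end performance test can pass.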

The availability work item must have a set of tests defined. In availability testing, you shut down all or part of the system (destructive testing). Then, you either restore it or switch to inactive regions to ensure that the process can be done within the time allotted. Similarly, you can conduct chaos testing by using a framework like Chaos Monkey to ensure that the system meets the requirements that are defined by the availability tasks even when components unexpectedly fail.
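
The idea behind this kind of test can be shown with a toy simulation. The sketch below does not use Chaos Monkey or any real infrastructure; the `Service` class and the replica names are hypothetical, and the failure is simulated by removing a replica from a set.

```python
import random


class Service:
    """Toy service that stays available while at least one replica is up."""

    def __init__(self, replicas):
        self.up = set(replicas)

    def terminate(self, replica):
        self.up.discard(replica)

    def is_available(self):
        return len(self.up) > 0


def chaos_round(service, rng):
    """Terminate one randomly chosen running replica and report availability."""
    victim = rng.choice(sorted(service.up))
    service.terminate(victim)
    return victim, service.is_available()


rng = random.Random(42)  # fixed seed so that the sketch is repeatable
service = Service(["region-a-1", "region-a-2", "region-b-1"])
victim, still_up = chaos_round(service, rng)
print(victim, still_up)
```

A real availability test would also time the recovery or the switch between regions and compare it with the time allotted in the availability task.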


 Inception

The purpose of an inception is to take the output of an Enterprise Design Thinking workshop and turn it into something that can be built. To do so, you break the minimum viable product (MVP) into individual user stories. The inception aligns the broader design, architecture, and development teams on project goals and scope. The output is updated goals, an exploration of technical project risks, and a ranked backlog of user stories. These stories might cover only a week's worth of the project instead of the whole project. The aim is to give the delivery teams enough information to start to build, rather than produce a perfect plan for the whole project.

A whiteboard is an excellent tool for doing an inception, along with sticky notes and index cards. A facilitator leads the workshop, and the whole team contributes.


Who attends an inception?

An inception is a multidisciplinary activity that involves the whole delivery team. These people are in the room:

    All the developers on the project
    All the designers on the project
    The project architect
    The product owner

Agenda

Start by establishing a shared understanding, being sure to include the people who weren't in the Enterprise Design Thinking workshop. Review the hypotheses and MVP statement from the workshop.

Goals and nongoals

After everyone understands the MVP statement, go over your goals and nongoals, or future goals, from the Enterprise Design Thinking workshop to validate that things haven't moved. Use the goals and nongoals as a starting point to generate more technical, specific goals and future goals. The goals still might be high level, but they might also include things such as technology choices and metagoals, such as “knowledge sharing with a client developer”.

This activity can be a guided exercise, or participants can take 5 minutes to write their ideas on sticky notes. You're aiming to build an MVP, so it's important for the set of goals to be concise. Moving goals into the "future goals" category is a good thing; you're keeping a tight focus on what you need to validate the business hypothesis. You can always move goals back into scope if needed.

The list of goals isn't final. You can revisit it to add, remove, and change goals throughout the workshop.

Risks

Next, revisit the risks from the Enterprise Design Thinking workshop. As with the goals, the aim is to go deeper and capture anything that you missed, especially technical risks or project-specific risks. For example, you might include these risks:

    "Sal has a lot of holidays coming up"
    "The back-end connector might not be written in time"
    "The product owner is at a conference for a week"
    "The data might be too big for the lite tier"

Cluster and categorize the risks, and discuss and capture possible mitigations.

Roles and personas

After you revisit the risks, start to think about users, or personas. Usually, you have multiple personas. For example, you might need to collect metrics for a business owner so that the business owner can validate the business hypothesis. Each persona must have a name, and stories are written from that persona's perspective. If your project has multiple personas, use alliterative names, like Carol the Customer or Mike the Manager, as a memory aid.

Common roles:

    Customer
    Admin
    Manager
    Hacker (This role helps you define security requirements. Talk about things the hacker can't do with your product.)

User stories

The largest part of the inception is breaking the MVP into user stories, ranking those stories into your backlog, and assigning points to them (estimation). Before you start to break the MVP into small stories, it's useful to break the MVP into bigger user journeys, or activities. After you capture the activities on a whiteboard, your team can switch to index cards or an electronic equivalent to record, point, and prioritize the stories.


How many stories to write?

Ideally an inception takes half a day. To manage the workshop length, the Garage sometimes creates only a single week's worth of stories. This practice is consistent with the Garage's lean approach. Don't spend effort on work that yields value only in the future because that effort might be wasted if things change.
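
As a simple illustration of cutting a ranked backlog down to about a week of work, the sketch below takes stories in rank order until a point capacity is reached. The story titles come from the airline example; the point values and the capacity of 8 are assumptions.

```python
def first_iteration(ranked_backlog, capacity_points):
    """Take stories in rank order until the point capacity would be exceeded."""
    selected, used = [], 0
    for title, points in ranked_backlog:
        if used + points > capacity_points:
            break  # keep rank order; don't skip ahead to smaller stories
        selected.append(title)
        used += points
    return selected, used


# Titles are from the airline example; points and capacity are assumptions.
backlog = [
    ("Polly can view up to 20 available flights for rebooking.", 3),
    ("Polly selects one of the available flights and views the available seats on that flight.", 3),
    ("Polly can view available flights in under 2 seconds.", 2),
    ("Polly can view the available seats on a selected flight in under 2 seconds.", 2),
]
stories, points = first_iteration(backlog, capacity_points=8)
print(len(stories), points)
```

Stopping at the first story that doesn't fit is a deliberate choice in this sketch: it respects the ranking instead of filling the week with lower-ranked work.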

Usually, your team has a vision for work beyond the first week that you don't want to risk losing. You can capture those longer-term work items as large stories or activities and then break them into smaller stories in a later iteration planning meeting.

The backlog that is produced by an inception isn't the final backlog; it's a starting point. The product owner can add new stories at any time and can reprioritize the stories. Iteration planning meetings and playbacks are a good opportunity to add and estimate new stories and adjust the backlog.

Logistics and practicalities

An inception is a good opportunity to finalize logistics. Your team can ensure that the necessary infrastructure, such as Cloud accounts, is in place. You can also agree on working hours and timing for stand-up meetings, iteration planning meetings, playbacks, and retrospectives.


 Affinity analysis

Not all clouds are equal. The various vendors and deployment models, such as infrastructure as a service (IaaS), platform as a service (PaaS), capabilities as a service (CaaS), public, and private, each have highlights and lowlights. When your organization decides to adopt cloud, carefully consider these options against your requirements to ensure that you get the outcome that you expect.

For example, if your organization wants higher levels of control over its cloud infrastructure, a private on-premises cloud might be more suitable than a public cloud. These decisions are cloud service decisions. Because they cause organizations to become technically and operationally dependent on the services that they choose, it's important to make decisions that best support long-term adoption and achieve your business and technical goals.

Typically, cloud service decision-makers aren't the actual users, and the actual users aren't responsible for integrating and managing the cloud services. To address that gap and align the most appropriate cloud service or services to support your objectives, use cloud affinity analysis.


With cloud affinity analysis, you can see the technical merits of various options alongside considerations that are unique to an organization, such as its teams' skills, investments, or relationships. Cloud affinity analysis encompasses the organization and includes several technical and business aspects:

    Line-of-business requirements
    Workload characteristics
    Organizational relationships
    Information technology (IT) culture and enablement
    Ecosystem

For example, Buy versus Build reflects your organization’s business considerations, while Skills Reuse might impact your IT culture and enablement. After your organization completes its cloud affinity analysis, use it as an input to next steps, such as architectural design and deployment planning.

Key practices and lessons learned

The key to selecting the right cloud services is a diverse set of affinity and decision criteria that covers a spectrum of requirements and considers many perspectives. Consider adopting these practices:

    Identify the benefits and drawbacks of a cloud solution by relying on a complete set of decision criteria that includes both business and technical objectives. Examples include functional, nonfunctional, cost, reliability, and performance-based criteria.
    Define use cases that allow for a structured comparison across cloud solution candidates.
    Present each cloud solution candidate’s benefits and drawbacks in a clear, accessible format and use tools to automate discovery and analysis.

Industry trends

Many organizations that are moving to the cloud base their decisions on limiting assumptions, such as "public cloud always provides cost savings," or on a single deployment model like containers for all workloads. The technical and financial constraints can pose challenges to cloud adoption if your organization doesn't assess the underlying application architectures, consider how developers and operators participate, evaluate new automation capabilities, and elevate users as key stakeholders. The results can include project delays, budget overruns, below-par performance and business value, application instability, and the creation or propagation of technical debt.

To avoid these shortcomings, expand your cloud affinity analysis to develop an advanced decision framework that is customized to your organization. Make sure that the framework contains financial and technical calculators and the workload requirements and guidelines that help decision makers to align decisions with business objectives and maintain organization-wide momentum.

Finally, treat the business and the users as first-class citizens in the decision-making process. By doing so, you ensure that the right decisions are made throughout the cloud adoption journey. The right decisions result in the timely adoption of critical business functions, less wasted effort, and less rework.

Take the first step

To build a decision-making framework for cloud, your organization needs to reflect on considerations that are important to it:

    What are the inhibitors and drivers of a cloud solution?
        The answer to this question can identify strong inhibitors for procurement, such as the sensitivity of data or the complexity of integration with traditional systems.

    Which cloud services align to my objectives in the context of business and technical goals?
        The answer to this question might surface the appropriate cloud models based on compliance with requirements, guiding principles, and feasibility.

    What criteria can help me establish the pool of cloud service options?
        The answer to this question involves identifying cloud solutions by excluding candidates that violate guiding principles or don't meet requirements.

    Which cloud solution is the best option for my organization?
        The answer to this question is driven by a fact-based affinity analysis and includes a clear understanding of implications and tradeoffs of selecting one solution over another.

After you answer those questions, you can get started on your framework to select the right cloud services for your organization.


 What-if analysis

As you adopt cloud, the path for transformation has a considerable impact on your business. Cost- and investment-related impacts, skill requirements for employees, new platform and service requirements, and impacts on customers and business partners are a few examples. In most cases, no single clear path exists, as each scenario has its own benefits and challenges.

To better understand the dynamics of a scenario and the impacts of levers and dependencies in your choices, use what-if analysis. What-if analysis, which is also called sensitivity analysis, is a tool for determining effects on outcomes in a mathematical model by changing the inputs to the model in multiple scenarios.
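
A minimal sketch of the technique follows: a toy cost model is evaluated under several scenarios, each of which changes one input relative to a baseline. The model, the input names, and every number are assumptions chosen for illustration; a real model would reflect your own levers and dependencies.

```python
def three_year_cost(monthly_fee, migration_cost, automation_percent):
    """Toy cost model: more automation lowers the monthly operations effort."""
    ops_cost_per_month = 10_000 * (100 - automation_percent) // 100
    return migration_cost + 36 * (monthly_fee + ops_cost_per_month)


# Illustrative baseline inputs (assumptions).
baseline = {"monthly_fee": 8_000, "migration_cost": 50_000, "automation_percent": 50}

# Change one input at a time and compare each outcome with the baseline.
scenarios = {
    "baseline": {},
    "vendor raises fee": {"monthly_fee": 10_000},
    "more automation": {"automation_percent": 80},
}
results = {
    name: three_year_cost(**{**baseline, **changes})
    for name, changes in scenarios.items()
}
for name, cost in results.items():
    print(f"{name}: {cost - results['baseline']:+,}")
```

Comparing each scenario with the baseline shows how sensitive the outcome is to a single input, which is the core of what-if analysis.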

In the context of your cloud adoption and transformation journey, you apply what-if analysis in steps 2 and 3:


--------------------------------

1- Wave Planning
Establishes the waves of workloads to transform, with constraints and dependencies, based on the cloud disposition.


2- Readiness Assessment
- Assessment for general purposes:
Run the checklist to understand the general cloud MVP technical requirements that the country needs for the next workload transformation waves.

- Assessment for specific readiness:
Run the checklist to identify the specific bundle of MVP technical prerequisites. Those depend on disposition types (SaaS, PaaS, IaaS).

3- Target Cloud preparation
Incrementally implement the MVP capabilities and building blocks.

4- Transformation Waves run
A wave of transformation is launched within the workload cloud transformation roadmap.


--------------------------------


Build a model

The result of a workload assessment is a grid where you plot workloads according to the value that they bring to the business and the effort that is required to move them to the cloud. The workloads that deliver the highest value and require the least effort to move are the first candidates for adoption.
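
The grid can be approximated in a few lines, as in the following sketch. It assumes that value and effort are scored from 1 to 10 and that 5 is the midpoint of each axis; the workload names and scores are invented.

```python
def first_candidates(workloads, value_threshold=5, effort_threshold=5):
    """Return workloads in the high-value, low-effort quadrant of the grid."""
    picks = [
        w for w in workloads
        if w["value"] > value_threshold and w["effort"] < effort_threshold
    ]
    # Within the quadrant, prefer higher value, then lower effort.
    return sorted(picks, key=lambda w: (-w["value"], w["effort"]))


# Workload names and 1-10 scores are illustrative assumptions.
workloads = [
    {"name": "customer portal", "value": 9, "effort": 3},
    {"name": "batch reporting", "value": 4, "effort": 2},
    {"name": "core ledger", "value": 9, "effort": 9},
    {"name": "mobile API", "value": 7, "effort": 4},
]
names = [w["name"] for w in first_candidates(workloads)]
print(names)
```

With these made-up scores, the customer portal and the mobile API land in the high-value, low-effort quadrant and are the first candidates for adoption.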

Before you move the workloads, it's important to consider several key attributes, such as impact on the development, architecture, operations, and total cost of ownership. By using what-if analysis to evaluate the options that affect those attributes, you can envision the future state. What-if analysis is also helpful when you face dynamic market shifts, which can impact your organization's IT and business strategy and investment-related decisions.

To get an idea of how to build the model for your analysis, see the following examples of key attributes from which dependencies and levers are defined. Levers are the input components that you can adjust. Dependencies impact the outcomes according to the levers that you set.

What-if analysis examples

    What if I want to move a workload to the cloud? Which of my workloads fit my target cloud?
        Considerations: Where should I host my workload: private, public, or hybrid?
        Dependencies: Data sovereignty, latency requirements, high availability, and recoverability
        Levers: Level of deployment automation, elasticity of the workload, and availability of supporting managed services versus building and running your own.

    What if I move or refactor my workload to the cloud? What impact should I consider?
        Considerations: Manage changes in ways of working, DevOps, and SRE processes
        Dependencies: Business lead time frames, vendor dependency, and service agreements
        Levers: Speed in delivery, incident response time, skills coverage, and level of automation

    What if I need to maintain organizational governance and compliance standards? Which migration or refactoring complexities do I need to consider?
        Considerations: Minimize the impact on production workloads that are identified for migration to cloud, and manage integration with your system of record, system of engagement, and security and network constraints.
        Dependencies: Certifications and compliance and regulatory impacts
        Levers: The extent of creating and exposing APIs and the extent of industrializing the core

    What if I move workloads to my target cloud? What are the operational cost differences to consider?
        Considerations: CapEx versus OpEx. What is the return on investment and the total cost of ownership?
        Dependencies: Vendor and license agreements and the company's own financial models
        Levers: The extent of subscriptions versus procurement and licenses

    What if I follow an incremental approach versus a radical or disruptive approach?
        Considerations: How might this impact my organizational transformation? Would the incremental approach slow the organization? Would the radical or disruptive approach allow for a faster transformation to the cloud?
        Dependencies: Business commitments and service level agreements with customers
        Levers: The number of transformation activities, partnerships with accelerators, skills, and education activities
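
The relationship between levers and dependencies can be sketched as a small model. The following is a minimal illustration, not a prescribed tool: the class names, the cost formula, and every number are hypothetical assumptions that are chosen only to show how changing a lever changes a modeled outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Levers:
    """Inputs that the organization can adjust (hypothetical)."""
    automation_level: float  # 0.0 (fully manual) to 1.0 (fully automated)
    managed_services: bool   # True = use managed services, False = run your own


@dataclass(frozen=True)
class Dependencies:
    """Constraints that shape the outcome for a set of levers (hypothetical)."""
    monthly_hosting_cost: float  # baseline hosting cost
    monthly_ops_hours: float     # operations effort with no automation
    hourly_ops_rate: float       # cost of one hour of operations work
    managed_service_fee: float   # extra monthly fee for managed services


def monthly_cost(levers: Levers, deps: Dependencies) -> float:
    """Estimate the monthly operating cost for one scenario."""
    # Assumption: automation reduces manual operations hours proportionally.
    ops_hours = deps.monthly_ops_hours * (1.0 - levers.automation_level)
    fee = 0.0
    if levers.managed_services:
        # Assumption: managed services halve the remaining effort but add a fee.
        ops_hours *= 0.5
        fee = deps.managed_service_fee
    return deps.monthly_hosting_cost + ops_hours * deps.hourly_ops_rate + fee


def compare(scenarios: dict[str, Levers], deps: Dependencies) -> list[tuple[str, float]]:
    """Return the scenarios sorted from lowest to highest estimated cost."""
    costs = [(name, monthly_cost(levers, deps)) for name, levers in scenarios.items()]
    return sorted(costs, key=lambda item: item[1])


# Illustrative values only; replace them with figures from your own organization.
deps = Dependencies(monthly_hosting_cost=1000.0, monthly_ops_hours=100.0,
                    hourly_ops_rate=50.0, managed_service_fee=800.0)
scenarios = {
    "manual, self-managed": Levers(automation_level=0.0, managed_services=False),
    "automated, self-managed": Levers(automation_level=0.5, managed_services=False),
    "automated, managed": Levers(automation_level=0.5, managed_services=True),
}
for name, cost in compare(scenarios, deps):
    print(f"{name}: {cost:.2f}")
```

Rerunning the comparison after you update the dependency values, for example when a vendor changes its pricing, shows whether the preferred scenario still holds.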

Benefits

By running various scenarios and comparing them, your organization can mitigate potential risks by understanding the impact of its decisions against capital resources. Use this information as input for building a business case to share with stakeholders and gain support across the organization for cloud adoption and the prioritization of investments and next steps.

A what-if analysis also provides measurable metrics or key performance indicators (KPIs) that are inferred from the model. You can use the metrics to measure the progress of your transformation. The outcomes that you model, such as the percentage of skilled employees in containerization as training is conducted, become your KPIs. You can track them throughout the transformation by updating the input variables that you gather. This activity allows for reflection and the optimization of your projects as needed to ensure that your organization meets its goals. For example, if market shifts impact the costs that a vendor is charging, a what-if analysis can identify alternatives in deployment models or hosting options. 


 Evaluate your tools

When your organization considers new tools and technologies, it's important to remember that the tools and technologies themselves don’t lead to value or outcomes for the people that they support. Evaluate tools based on the outcomes and experiences that you envision. As part of your evaluation, assess multiple perspectives, including users, business and IT goals, and organizational culture. You can aggregate all the perspectives into a decision framework that allows an objective tool evaluation that achieves your goals.
Key practices and lessons learned

One example is ChatOps, which combines messaging clients with bots to automate actions that are related to application operations. Because many traditional chat applications don't support ChatOps, you might need to consider new collaborative messaging tools. At the same time, it's important for an organization to understand the value that new tools and technologies provide. Without that understanding, your organization might experience unexpected outcomes, such as low levels of usage and negative return on investment (ROI).


Ensure that you adopt tools for long-term use and ROI


Users might not value the technologies the same way that decision makers do, so consider a comprehensive set of business and technical criteria that is specific to your organization's goals. Your organization might use these points to evaluate tools:

    Skills that allow users to use their tools of choice. Using popular tools can make adoption easier because a reasonable amount of preexisting training content is publicly available in video tutorials and support forums.

    Skills that are available in the market to support a tool. The partners, adopters, and wider community around a tool play a significant role in the success of its adoption.

    Team collaboration and culture, whether teams are colocated or operate remotely. Consider tools that are built to support team collaboration, which often include features like file sharing or a team calendar.

    The cost that is associated with acquiring and operating a tool. This factor becomes important when you consider the rollout of the tool to a wider group of users within an organization.

To simplify this process of evaluation, develop a decision-making framework and evaluation scorecard that align with the priorities, policies, and goals of the organization. You can use the framework and scorecard in objective assessments of use cases, functional and nonfunctional requirements, and user feedback to ensure that you adopt tools for long-term use and ROI.
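
One possible shape for such a scorecard is a weighted average of ratings per criterion. The following sketch assumes a 1 to 5 rating scale; the criteria weights, the ratings, and the names "Tool A" and "Tool B" are hypothetical placeholders rather than recommendations.

```python
def weighted_score(weights: dict[str, float], ratings: dict[str, float]) -> float:
    """Return the weighted average rating for one tool.

    weights: importance of each criterion (any positive scale).
    ratings: the tool's rating per criterion, for example 1 (poor) to 5 (excellent).
    """
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


# Hypothetical weights for the evaluation points that are listed above.
weights = {"user preference": 3, "market skills": 2, "collaboration": 2, "cost": 3}

# Hypothetical ratings for two placeholder tools.
candidates = {
    "Tool A": {"user preference": 5, "market skills": 4, "collaboration": 3, "cost": 2},
    "Tool B": {"user preference": 3, "market skills": 3, "collaboration": 4, "cost": 5},
}

# Rank the tools from the highest to the lowest weighted score.
ranked = sorted(candidates, key=lambda t: weighted_score(weights, candidates[t]), reverse=True)
for tool in ranked:
    print(tool, round(weighted_score(weights, candidates[tool]), 2))
```

Agreeing on the weights before anyone rates a tool keeps the assessment objective, because the priorities are fixed independently of any favorite candidate.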
Industry trends

As the adoption of cloud computing continues to increase, organizations are giving more focus to selecting tools that enable agility and flexibility. This adoption leads to specific trends, with emphasis on user preferences, extensibility, and automation.

A key consideration is whether a tool is one that users love. These tools often have intuitive user interfaces (UIs) and support individual customization. Organizations are placing greater emphasis on the input of the people who use the tools as part of their daily jobs. The goal is to involve users in the evaluation process and consider the most common workflows to support. This approach considers the users' unique challenges, including how they think and feel about a particular use case. The outcome is a tool decision that teams can readily accept. One way to understand this perspective is to recruit sponsor users to act as representative voices of users.

Open-source tools are popular in technical communities because of the range of options that they provide, including the extensibility to accommodate changes without affecting capabilities. These options resonate with organizations as they look to extend and customize open source tools to match a specific way of doing work. Some open source tools are designed to support certain trending notions in the industry, which makes them even more compelling. For example, Tuleap is an open source tool that manages agile software development projects.

Accelerating automation is an area of increasing importance to support greater levels of team efficiency and productivity. For instance, teams that adopt a test-driven development approach often use automated testing tools to reduce the cost of repeatedly running tests. Apache JMeter is an example of an open source performance test automation tool. The use of toolchains takes automation to the next level by allowing teams to automate significant parts of their processes and customize each stage to suit their work.


Just enough architecture


The biggest technical risk for most projects is the product's architecture. The architecture is where the biggest technical questions and unknowns live. Some products that reuse familiar architectures, such as online stores, use well-known architectural frameworks such as Node.js and Express. A mature architectural framework can eliminate a lot of technical risk.

However, the larger and more novel a product is, the more important it is to define architecturally significant parts of the system and understand risks, tradeoffs, and solutions.

The amount of effort that is dedicated to describing your architecture is partly influenced by the cost of failure. A slightly late delivery of an online store might have few consequences. At the other end of the scale, the late delivery of something complex like an Internet of Things (IoT) product has a high cost of failure.

Simpler applications are low-risk propositions because you can address failures by quickly fixing bugs and redeploying. Complex systems and platforms often have many risk areas—including deployment, legacy, and latency—that you must consider and communicate before you make promises to the customer.

Understanding and defining an architecture includes, but isn't limited to, these activities:

    Apply Domain-Driven Design to a microservices architecture.

    Identify and resolve architectural mechanisms that fulfill nonfunctional requirements, such as security and distribution.

    Identify sensitive areas, such as network bandwidth or latency issues that might need to be accommodated by the platform.

    Understand new technologies and platforms, and plan to adopt them.

    Describe the overall structure and behavior of the architecture. This information is often described in architectural views and viewpoints by using a modeling language.

The value of standing on an architecture

Defining an architecture early, even though it inevitably changes during development, helps your team understand the impact of business and technical decisions. The architecture is a clear reference of what's important in the system and the constraints on developing the platform. The architecture provides a vocabulary and reference that the team can use to develop a common understanding of how the platform works and the consequences of the team's decisions.

By focusing on the architecture early, the team can retire technical risk early in development. This practice reduces uncertainty and increases the chances that the team understands the outcome of changing the platform.

Defining architectural mechanisms identifies and resolves risks and provides repeatable patterns to address common needs across the platform, such as authentication.

The old saying is true: "You always have an architecture". But if you don't document it, you don't know what it is. Building on an unknown foundation is risky and inevitably leads to unpredictable impacts. By defining an architecture early and consistently updating it, you can communicate and think about the architecture that you have rather than the architecture that you imagine.
Start to define the architecture now

Review requirements and identify architecturally significant aspects of the platform. "Architecturally significant" aspects are areas that pose real-world issues, performance or security questions, critical architectural mechanisms, or parts of the platform that other components depend on.

Consider unknown technology or established devices as architecturally significant because other parts of a platform often must conform to constraints that are imposed by new or old components. Brittle parts of the platform—areas where making a change has a high risk of introducing regression issues—are architecturally significant because they have a high-risk factor. Brittle parts of the platform might also place a higher burden on other components, so keep the modifications of brittle components to a minimum.

Every architecture has architectural mechanisms that must be defined, designed, and implemented in a predictable and repeatable way. Those mechanisms are often nonfunctional requirements of the system:

    Security (authentication)

    Distribution

    Deployment

    Fault-tolerance

Your team might not need to produce a large amount of formal architectural documentation. Consider documenting your architecture by using the C4 model for software architecture.

Finally, your team can follow these general guidelines to ensure that you're creating just enough architecture for your product:

    Understand the critical areas of your system

    Identify the architectural mechanisms that you need

    List the technical risks and their mitigation strategies

    Communicate and document your architectural decisions to current and future members of your project

Just enough IoT architecture: An example

As you start a project, think through any special architectural considerations that might come into play because of the unique characteristics of the application or platform that it is being built on. For example, IoT platforms are large and unique. Even if portions of your IoT platform use existing architectural frameworks, understanding and defining the platform as a whole is complicated. IoT platforms tend to be novel, complex, and mission critical. A late or problematic deployment of a new release can significantly impact the customer's goodwill.

IoT platforms and other complex systems need to include MVPs, experimentation, and refactoring, but the physical devices and other considerations can limit the agility of the project and require you to make a few key architectural decisions early. As you develop your architecture, you need to take additional considerations into account because of the IoT platform:

    Identify sensitive areas, such as network bandwidth or latency issues that might need to be accommodated by other parts of the IoT platform

    Deal with the impact of external circumstances, such as weather conditions or access issues that can affect deployment

Because of the nature of IoT platforms, give special consideration to nonfunctional requirements as you develop the architecture.

As you can see, you need to tailor your architectural considerations and definition of "just enough architecture" based on the type of application that you're building and the technology platform that you're using to build it. 


Develop


 Create innovative solutions fast

DevOps development practices help your team collaborate and produce high-quality code that you can confidently deliver to production.
Why change?

In the past, teams delivered code in releases that spanned months and years. As teams increase the rate at which they deliver new capabilities to customers, they must change how they build, integrate, and deliver applications to ensure high quality. Not long ago, to install software, you used CDs to copy files and install and configure a product. Build processes built and packaged the product, but usually, you had to complete a long checklist of manual installation tasks. Processes weren't always repeatable, and because the results of development builds and production builds often differed, integration issues occurred. Teams spent too much time debugging build failures and obtaining conflicting test results in different environments.

What you deliver and how you deliver it makes all the difference. With the evolution of the cloud, the goal is to develop new capabilities in small batches and then continuously deliver high-quality code to production. Teams must continuously integrate new code and corresponding test automation as it is developed. Continuously integrating code and supporting quality through automated testing and pair programming ensures an excellent customer experience.

To deliver new features to production continuously, the team needs tools that support a repeatable process to deliver the same code to development, test, and production environments. Teams can use infrastructure-as-code techniques to automatically provision identical development, test, and production environments. The Develop practices are about understanding and using tools to ensure a repeatable, automated delivery process that results in a fully tested production application.
What changes?

At the start of a project, an initial MVP is created. This MVP provides the minimum amount of coding effort to validate that the application meets the needs of the customers that it is being developed for.

The Garage Method for Cloud is cyclical. As the team codes and validates the initial MVP, the team also works with stakeholders and the design team to form the MVP for the next feature to add to the application.

As the project continues, several MVPs for new and improved features might be under development at the same time. For example, while the stakeholders formulate the next MVP (Envision), the developers code another MVP (Develop). At the same time, the operations team monitors the deployed MVPs (Operate) and the whole team learns about the deployed MVPs through customer feedback and analytics (Learn).


Produce and deliver high-quality code by using DevOps practices.


The DevOps team develops the application code and test cases that demonstrate an MVP. After an MVP is validated, the code usually needs to be refactored. MVPs are developed fast, and as such, some of the more complex application constructs, such as microservices and the Circuit Breaker pattern, aren't used in initial MVP development.

However, if your team doesn't use microservices, start to learn about them. Microservices are the heart of the architecture that results in continuously delivered applications. More specifically, microservices are small, independent services that communicate by using HTTP and messages. When they're integrated with each other, they form a large application. Microservices have these characteristics:

    Self-contained

    Easy to evolve and maintain by a small team

    Large enough to justify their existence and the boilerplates, runtimes, and pipeline that are necessary to handle their daily functions

One way to develop resilient microservices is to use the Circuit Breaker pattern. The pattern enables the whole system to function even if an outage occurs in a single microservice or a small set of microservices.
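
The idea behind the Circuit Breaker pattern can be sketched in a few lines. This is a simplified, single-threaded illustration and not a production implementation; the class name, the thresholds, and the `flaky` function are hypothetical. After a number of consecutive failures, the breaker "opens" and rejects calls immediately so that the caller can fall back instead of waiting on a failing service. After a timeout, it lets one trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None              # None means that the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call skipped")
            # The timeout elapsed: allow one trial call (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on too many failures, or reopen if the trial call failed.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # A successful call closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def flaky():
    """Hypothetical stand-in for a call to a failing microservice."""
    raise ConnectionError("service unavailable")


breaker = CircuitBreaker(max_failures=2, reset_timeout=60.0)
for _ in range(3):
    try:
        breaker.call(flaky)
    except ConnectionError:
        print("call failed")
    except CircuitOpenError:
        print("fallback response")
```

In this run, the first two calls fail and open the circuit, and the third call is rejected without contacting the service, which is where the caller returns a fallback.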

Teams that work on traditional applications also must decide how they deliver. If the application is valuable and you expect to use and enhance it for years to come, you might refactor the application to adopt the microservices architecture. If the application works as expected and you expect changes to be minimal, you might add APIs that can be called by the microservices in another application if necessary.
Test-driven development and pair programming to continuously integrate code

Teams that continuously integrate and continuously deliver must adopt agile, lean, and extreme principles that are combined with test-driven development (TDD) to deliver quality software.

The goal of TDD is to enable faster innovation and continuous integration while you ensure quality by running automated tests as part of the development work. TDD can also be used with pair programming. In that scenario, the pair follows TDD to create automated test cases and develop code for those test cases.


When you continuously deliver high-quality code, it's important to work in small batches. Another common practice in continuous integration is refactoring, because when you add new function, you typically need to change other function.

Store your code in a source-code management system that supports continuous integration. To construct the code, you might use a web-based integrated development environment (IDE). Ensure that each small batch includes items that must be part of all code batches, such as continuous globalization and integrated security.

Make automated testing a main focus. For example, be sure to test performance, globalization, and accessibility, among other areas. A story or batch isn't ready to deliver until all testing is completed.
Track and plan your work

By using Enterprise Design Thinking, the team defines the MVP and develops a rank-ordered backlog of user stories that must be completed to achieve the MVP. Stories in the backlog include setting up the development environment and building the functioning code that is needed to create the MVP. Each story is written as simply as possible to ensure that only the function that is required for the MVP is included.

The rank-ordered backlog contains the list of stories that must be implemented to complete the MVP. As development proceeds and user stories are completed, the next-highest-ranked stories are worked on. The team can choose whether to use formal iteration planning, which entails defining an iteration's worth of work ahead of time, or take a kanban approach to work through the backlog. No matter which approach the team takes, it must identify risks early and track them along with the stories to complete the MVP.

One of the decisions that the team must make is what tools to use to develop the MVP. A toolchain is a set of tools that enable the rapid development, testing, and deployment of an application that is scalable, resilient, and made for the cloud. Choose a toolchain that suits the way that your team works and the type of application that you're developing.
Set up a delivery pipeline

To achieve continuous delivery in a consistent and reliable way, the team creates and configures a delivery pipeline to divide the software delivery process into stages. The delivery pipeline is set up so that the code progresses through each stage automatically with minimal human intervention. The goal is for the final stage of the pipeline to produce code that is ready for production.

By implementing a delivery pipeline that includes robust automated testing, the team can deliver continuously into production. To deliver code, the team must understand several things about a delivery pipeline:

    What it is

    How to configure it to use the tools that the team selected

    How to define and configure the stages that the code must pass through to be ready for production

As you start a new project, create a "hello world" application that you can use to set up the stages of the delivery pipeline. These stages apply to most projects:

    Build: During the build stage, you build and package the code. You can run unit tests in this stage or run them in a stage that is created for unit testing only.

    Staging: The packaged code is deployed on a staging or test environment where regression testing and the complete suite of automated tests are run, including performance tests and globalization tests.

    Production: A stage is created to deploy the code into the production environment. Blue-green deployments are used to avoid downtime when a new version of the code is deployed.

When you're setting up the pipeline, define stages that make sense for the project. Typically, early stages in the pipeline run quickly, while later stages that handle complex suites of test cases need more time to run. You can configure stages so that the pipeline fails if any stage fails.
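
The fail-fast behavior of staged delivery can be illustrated with a short sketch. It is not tied to any pipeline product: the `run_pipeline` function and the stage steps are hypothetical, and each step is a plain function that returns True on success.

```python
from typing import Callable, Optional

# A stage is a name plus a step that returns True when the stage succeeds.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> tuple[list[str], Optional[str]]:
    """Run the stages in order and stop at the first failure.

    Returns the names of the stages that passed and the name of the
    failed stage, or None if every stage passed.
    """
    completed = []
    for name, step in stages:
        print(f"running stage: {name}")
        if not step():
            # Later stages never run, so failing code cannot reach production.
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical steps: in a real pipeline, each one would call build or test tools.
stages = [
    ("build", lambda: True),
    ("staging", lambda: False),   # simulate a failing regression test
    ("production", lambda: True),
]
completed, failed = run_pipeline(stages)
print("passed:", completed, "failed:", failed)
```

Ordering the quick stages first means that most failures are reported within minutes, before the slower test suites start.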


To avoid downtime during automated deployment, require a manual step to push the code into production. Limit the number of team members who can push production code.

Throughout the process of setting up the stages, your team must select and integrate tools into the delivery pipeline. To easily integrate tools, try open toolchains on Cloud, which include Delivery Pipeline and other commonly used tools.

After you define the stages and their flow, you can create code and test cases and then build and test them by using the pipeline.
Communicate efficiently

Programming in pairs encourages team members to communicate to produce higher-quality code. The pairs work on stories in the rank-ordered backlog, which enables everyone on the team to clearly understand the upcoming work and the priorities that are defined by the stakeholders.

Notifications that are built into the delivery pipeline notify the team when issues occur, such as build failures, test case failures, and deployment failures. The team can then immediately address the problems so that the code can continue to progress from development to production.


 Test-driven development

Test-driven development (TDD), which is rooted in extreme programming, is all about satisfying your team that the code works as expected for a behavior or use case. Instead of aiming for the optimum solution in the first pass, the code and tests are iteratively built together one use case at a time. Development teams use TDD as part of many coding disciplines to ensure test coverage, improve code quality, set the groundwork for their delivery pipeline, and support continuous delivery.
What is TDD?

The basic concept of TDD is that all production code is written in response to a test case. Robert C. Martin, who is known as Uncle Bob, describes these Three Laws of TDD:

    You are not allowed to write any production code unless it is to make a failing unit test pass.
    You are not allowed to write any more of a unit test than is sufficient to fail; and compilation failures are failures.
    You are not allowed to write any more production code than is sufficient to pass the one failing unit test.

Javier Saldana expresses TDD in this two-rule version:

    Write only enough of a unit test to fail.
    Write only enough production code to make the failing unit test pass.


Why use TDD?

TDD provides several benefits:

    It can enable faster innovation and continuous delivery because the code is robust.

    It makes your code flexible and extensible. The code can be refactored or moved with minimal risk of breaking code.

    The tests themselves are tested. A key characteristic of a test is that it can fail, and the development team verifies that each new test fails.

    The code that is produced is, by design, easy to test.

    The requirements are implemented with little to no wasted effort because only the function that is needed is written.

Red/Green/Refactor cycle

A TDD cycle incorporates Uncle Bob's three TDD rules and adds a refactoring step after the tests pass.


TDD practitioners refer to this cycle as the Red/Green/Refactor cycle. In the article TDD - What it is and what it is not, Andrea Koutifaris describes this cycle:

    Red phase: You write an automated test for a behavior that you're about to implement. Based on the user requirement, you decide how you can write a test that uses a piece of code as if it were implemented. This is a good opportunity to think about the externals of the code, without getting distracted by actually filling in the implementation. What should the interface look like? What behaviors should a caller of that interface expect? Is the interface clean and consumable?

    Green phase: You write production code, but only enough production code to pass the test. You don't write algorithms, and you don't think about performance. You can duplicate code and even violate best practices. By addressing the simplest tasks, your code is less prone to errors and you avoid winding up with a mix of code: some tested (your minimalist functions) and some untested (other parts that are needed later).

    Refactor phase: You change the code so that it becomes better. At a minimum, you remove code duplication. Removing duplication generally leads to abstraction. Your specific code becomes more general. The unit tests provide a safety net that supports the refactoring, because they make sure that the behavior stays the same and nothing breaks. In general, tests should not need to be changed during the refactor phase.
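
As a small illustration of the cycle, the following sketch shows where a feature might end up after a few Red/Green/Refactor passes. The `total_price` function and its behaviors are hypothetical and are chosen only for this example. The tests use Python's standard `unittest` module.

```python
import unittest


# Result of the Green and Refactor phases: the simplest code that passes
# the tests below, tidied after the tests passed.
def total_price(prices, discount=0.0):
    """Sum the prices and apply a fractional discount (0.0 to 1.0)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0.0 and 1.0")
    return sum(prices) * (1.0 - discount)


class TotalPriceTest(unittest.TestCase):
    # Red phase: in TDD, each test is written first and is seen to fail
    # before the production code for that behavior exists.
    def test_empty_cart_costs_nothing(self):
        self.assertEqual(total_price([]), 0)

    def test_sums_item_prices(self):
        self.assertEqual(total_price([2.0, 3.0]), 5.0)

    def test_applies_discount(self):
        self.assertAlmostEqual(total_price([10.0, 10.0], discount=0.25), 15.0)

    def test_rejects_invalid_discount(self):
        with self.assertRaises(ValueError):
            total_price([1.0], discount=1.5)
```

If the code is saved in a file, the tests can be run with `python -m unittest` followed by the file name. Each test corresponds to one pass through the cycle: add the test, watch it fail, write only enough code to make it pass, and then refactor while the earlier tests keep passing.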

Unit testing as part of TDD

If your team is hesitant to skip traditional unit tests, remember: TDD drives the code development, and every line of code has an associated test case, so unit testing is integrated into the practice. Unit testing is repeatedly done on the code until each unit functions per the requirements, eliminating the need for you to write more unit test cases.

Teams have found that the built-in unit testing produces better code. One team recently worked on a project where a small portion of the team used TDD while the rest wrote unit tests after the code. When the code was complete, the developers who wrote unit tests were surprised to see that the TDD coders were done and had more solid code.

Unlike unit testing that focuses only on testing the functions, classes, and procedures, TDD drives the complete development of the application. Therefore, you can also write functional and acceptance tests first.

To gain the full benefits of unit testing and TDD, automate the tests by using automated unit test tools. Automating your tests is essential for continuous integration and is the first step in creating an automated continuous delivery pipeline.
Get started

You can get started with TDD by following these steps:

    Think about the behaviors that your implementation requires. Select a behavior to implement.

    Write a test that validates the behavior. The test case must fail.

    Add only enough code to make the new test case and all previous test cases pass.

    Refactor the code to eliminate duplicate code, if necessary.

    Select the next requirement and repeat the preceding steps.

Key practices for TDD

As teams implemented TDD over time, several key practices emerged:

    Obtain buy-in for TDD from project leadership.

    Ensure that the development team understands TDD. Writing the test can sometimes require more effort than writing the code.

    Run all tests that are developed as part of your development pipeline. A failing test must stop the pipeline.

    Measure and monitor the value that is gained by implementing TDD.


 Program in pairs

Pair programming is a practice in which two developers work together on one task, with one physical machine, in the same development environment. Even before pair programming had a name, programmers commonly worked in this manner because programming lends itself to challenging problems that are often best solved by two people working together.
How pair programming works

In the most common pair programming style, driver-navigator, each developer has his or her own monitor, mouse, and keyboard. Each monitor is a mirror of the other. In this way, both developers can sit comfortably, have a full view of the screen, and take or relinquish control as needed. When two developers are working remotely, they can use a tool for remote screen sharing such as the screen sharing feature in Slack.

The developers alternate the roles of driver and navigator. The driver writes the code while the navigator observes and thinks ahead, asking questions and offering immediate suggestions for improvement. The navigator is essentially doing a real-time code review as the code is being written.

Pairing styles

In addition to the driver-navigator style, several other styles of pair programming can be more effective for a particular combination of developers. Choose one pairing style that works best for your team, or use a combination of styles during the course of development.
Driving school

In the driving school style, also called backseat navigator or strong-style, the navigator takes over more of the tactical responsibilities much like a backseat driver in a car. The driver sits with hands on the keyboard but the navigator gives instructions such as "create a method" or "create a new file".
Tour guide

In the tour guide style, the driver acts similar to a tour guide on a bus. He or she does the strategic and tactical thinking, and the typing. The other pair member is the "tourist" who passively listens. The tour guide model is useful when one person has much more context than the other, but it's not an ideal long-term arrangement. Even with a significant skills gap between pair programmers, learning by doing is more effective than learning by watching. You can also face a risk that the "watcher" gets bored and loses focus if the tour goes on for too long.
Ping-pong

Ping-pong programming works well when combined with test-driven development. In the ping-pong style, one member of the pair writes the test, and then the other member of the pair writes the code to make the test pass. That pair member then writes the next test before passing control back to the other. The pair continues this way throughout the development. The ping-pong style works well for all skill levels because of the frequent switching.
Pomodoro

The Pomodoro pairing style is similar to ping-pong pairing, but prescribes set time intervals for each session. A typical Pomodoro-style pairing session lasts 25 minutes followed by a 5-minute break. The driver and navigator then switch positions. After four 25-minute sessions, both programmers take a longer 20-minute break. Forced breaks and regular position switching help ensure that both programmers are always productive, focused, and refreshed when a session begins.
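
The timing that is described above can be laid out as a simple schedule. This sketch assumes that the longer break replaces the short break after every fourth session; the function name and the placeholder developer names "A" and "B" are hypothetical.

```python
def pomodoro_schedule(sessions, pair=("A", "B"), work=25, short_break=5, long_break=20):
    """Return (start_minute, activity) entries for a Pomodoro pairing block."""
    schedule = []
    minute = 0
    driver, navigator = pair
    for n in range(1, sessions + 1):
        schedule.append((minute, f"session {n}: {driver} drives, {navigator} navigates"))
        minute += work
        if n == sessions:
            break  # no break is scheduled after the final session
        # Assumption: every fourth session is followed by the longer break.
        pause = long_break if n % 4 == 0 else short_break
        schedule.append((minute, f"{pause}-minute break"))
        minute += pause
        # The pair switches positions before the next session.
        driver, navigator = navigator, driver
    return schedule


for start, activity in pomodoro_schedule(5):
    print(f"{start:>3} min  {activity}")
```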
Tag-team

In tag-team pairing, the driver and navigator switch positions but don't follow set time intervals or rules for when the switching can occur. Basically, the switch can occur when the driver is tired, when the navigator wants a chance to take the lead, or for any other reason given by either programmer.
The benefits of pair programming

Pair programming can improve overall productivity and software quality through the process of collaboration. You can find many resources about pair programming: what it is, how to do it, and how it can benefit your organization. You can also find research that empirically shows the positive effect of pair programming on productivity, quality, and even team morale.

The practice of pair programming yields many benefits:

    Higher quality code as a result of real-time review

    Better designed solutions through shared collaboration

    Better distribution of knowledge of the code within your team so that no single developer is the only one who understands a particular portion of the code

    Greater job satisfaction for the developers who enjoy the high productivity and social aspect of pairing

    Faster delivery because solutions to challenging problems are found more quickly

    Consistent coding practices through collaboration

    Greater focus on the code and programming task without distractions

How to adopt pair programming

Organizational culture plays a big role in determining the success of pair programming. In some organizations, developers practice pair programming all day, every day, in an open space that everyone shares. However, some organizations can't adopt pair programming in that way. You can still see the benefits of this practice without adopting a full-scale approach. As with any change in a development process, a one-size-fits-all answer doesn't exist. As you work to transform your culture and practices, be mindful of the context that you are starting from and how open the organization is to making changes.
Remote pairing

While pair programming is best accomplished when pairs are co-located, there are many great resources for remote pairing. For more information, see Remote pair programming.
Barriers to pair programming

As you adopt pair programming, you might encounter one or more barriers related to the time required to adjust to the practice or the differing workstyles of a pair.
Time

When you start pair programming, you make an initial time investment. Developers can take longer to start producing code. Pair programming is an investment; it saves time in the long run because you can save on training time and bug fixing.
Personalities and skill levels

Pairing is a skill. To be most effective, pairs should have compatible working styles. For example, if one person isn't getting enough time at the keyboard, set a timer and switch roles at regular intervals. Take regular breaks. Going for a walk is often more effective than trying to push through a tough problem. The common practice of pairing a senior person with a junior person is often the most effective pairing and is the fastest way to teach junior developers.
Tips to get started

    Make sure that you are set up to pair properly. Developers have tried to pair on a laptop with a single monitor and single input devices. After that experience, they wanted to abandon pair programming altogether. When the same developers were given dual displays and dual input setups, their attitude toward pair programming changed completely: they loved it. Each pair should use a single computer that is connected to two displays, two keyboards, and two mice.

    Create dedicated pairing stations. Every pairing station should include the same software, and only the software that is needed for development, for example, no e-mail or social media apps. Also consider setting up pairing stations that don't have a specific owner and aren't cluttered with family photos or other personal items. These modifications make it easier for pairs to change pairing partners and work at any of the pairing stations.


    Start with something simple. Pick a defect or task in your backlog that is fairly self-contained and not in a complicated area of the product. Make sure that the work item doesn't have many external dependencies so that the developers can feel empowered to work directly towards a solution and deliver it. When people are trying something new and you want to evaluate its success, avoid external influences that might make it feel more challenging.

    Don't impose pair programming on unwilling developers while you are still experimenting. If you ask developers rather than force them, you'll be surprised at how willing some of them will be to try it. Take advantage of that enthusiasm and get them to try the practice out first. If people are willing, they are more likely to see value than if they enter the experiment with a negative bias. After you have some success with that core group, use your newly experienced pair programmers to pair with and educate other team members. Developers who are resistant to the idea are more likely to see value if they work with someone who has had success with it.

    Co-location and compatibility are often arguments used against pairing. However, pairs can use many methods to learn to work with one another and teams can use many tools to help facilitate the pairing such as the screen sharing feature in Slack.

    Keep the schedule simple. Avoid imposing strict hours or time frames. Instead, encourage people to pair on a specific work item that is containable in a reasonable period of time. After you have some success with small work items, you can start working on multi-day stories and swapping partners in the pairs as needed.

    Encourage the developers to alternate roles frequently. Pair programming is a mutual activity and won't be as effective if only one developer is always the driver. However, the pair that is doing the work makes the final decision about how often to alternate roles.

    Minimize distractions and interruptions. When pairing, developers should not be constantly checking their phones, answering calls, checking email, or chatting with other people. Managers and other leaders must be sensitive to this and not interrupt a pair when they are working.

    Have retrospectives with the team members who participated in the experiment. Find out what worked well, what didn't, and what changes you can make to improve.

Every team has its own culture and practices. Pair programming is one of the key practices in a DevOps culture.


 Contract driven testing

Have you ever traveled to a country where people don't speak your native language? When two people don't speak the same language, it's hard for them to accomplish anything together.

The same is true of microservices. If two microservices are to work together, they must "speak" the same language. The language that is shared between two microservices is defined in an API. One microservice, the consumer, makes a request to a second microservice, the provider, via the API. This event is called an interaction. For the interaction to succeed, the provider must understand the request from the consumer and return a response in a form that the consumer expects.


Consider an example of an airline reservation system where you're the passenger. You make a flight reservation, and the user interface calls the Reservation microservice. You also want to select seats, so the Reservation service makes a request to the Seat Assignment microservice to provide a list of the available seats. You select a seat and the Reservation service again calls the Seat Assignment microservice to assign the seat to you.

The API-driven interactions define the terms of a contract, or pact, between the microservices. Automated tests of the consumer can be used to generate the contract. After the contract interactions are defined, automated tests can be written to validate both the consumer and provider by using the stored contracts.


In the airline reservation example, two interactions are defined:

    Request available seats, which returns a list of seats that aren't assigned for your flight

    Reserve a seat, which associates a selected seat with your reservation

When you can't test the real interaction between two microservices, you must test it in a simulated way. In the airline example, you can't assign people to seats on real flights.

To solve this problem, developers use a simulated version of the provider service in a test environment and run a set of automated test cases to test the consumer. The simulated service receives a request and sends a minimal expected response that can be processed by the consumer. The simulated provider must process every interaction that the consumer can request.

To complete the testing of the contract, you must test the provider by using a simulated version of the consumer. Requests are sent to the provider, who sends a response. The provider response is compared to the expected response. This process is called consumer-driven contract testing.


When all the tests of the contract run successfully, you can be sure that your microservices speak the same language.
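
The two halves of consumer-driven contract testing can be sketched in plain Python, without any contract-testing library. Everything named here (the CONTRACT structure, the flight number GA100, the seat labels, and the helper functions) is a hypothetical stand-in for the airline example, not the API of a real tool.

```python
# Hypothetical contract: each interaction pairs a request the consumer may
# send with the minimal response that the consumer relies on.
CONTRACT = [
    {
        "description": "request available seats",
        "request": {"action": "list_seats", "flight": "GA100"},
        "response": {"status": "ok", "seats": ["12A", "12B"]},
    },
    {
        "description": "reserve a seat",
        "request": {"action": "reserve_seat", "flight": "GA100", "seat": "12A"},
        "response": {"status": "ok", "seat": "12A"},
    },
]


def make_mock_provider(contract):
    """Build a simulated provider that answers only the contracted requests."""
    def mock_provider(request):
        for interaction in contract:
            if interaction["request"] == request:
                return interaction["response"]
        raise ValueError(f"request not covered by contract: {request}")
    return mock_provider


def choose_first_seat(provider, flight):
    """Consumer logic under test: list the seats, then reserve the first one."""
    seats = provider({"action": "list_seats", "flight": flight})["seats"]
    reply = provider({"action": "reserve_seat", "flight": flight, "seat": seats[0]})
    return reply["seat"]


def real_provider(request):
    """Stand-in for the Seat Assignment service, running on test data."""
    free = {"GA100": ["12A", "12B"]}
    if request["action"] == "list_seats":
        return {"status": "ok", "seats": list(free[request["flight"]])}
    if request["action"] == "reserve_seat":
        return {"status": "ok", "seat": request["seat"]}
    return {"status": "error"}


def verify_provider(provider, contract):
    """Replay each contracted request; return descriptions of mismatches."""
    return [
        interaction["description"]
        for interaction in contract
        if provider(interaction["request"]) != interaction["response"]
    ]


# Consumer side: run the consumer against the simulated provider.
assert choose_first_seat(make_mock_provider(CONTRACT), "GA100") == "12A"
# Provider side: replay the same contract against the provider.
assert verify_provider(real_provider, CONTRACT) == []
```

Because both checks read the same CONTRACT, a change on either side that breaks the shared language shows up as a failing assertion on that side.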
The Pact tool and contract-driven testing

The Pact tool provides a way to define the contract between the consumer and the provider in code, along with the tools to test the contract.

Pact works as follows:

    The consumer developer writes a test that defines the interaction between the consumer and the provider. The test includes the state that the consumer must be in and the expected result from the provider. This test is the contract.

    The Pact tool creates a simulated provider that responds based on the information in the contract. This mock provider is used to test the consumer.

    The same contract information is used to simulate the consumer and ensure that the provider returns the correct results.

As the consumer and provider microservices evolve, the contract must be kept up to date so that the microservices continue to speak the same language.


 Automate tests for continuous delivery

In the context of continuous delivery, test automation is a requirement for success. In order to have unattended automation from code commit to production, squads need to deliver several levels of automated tests to ensure the quality of what is being delivered, as well as to quickly understand the state of the software.


Automate continuous testing to enable continuous delivery.


Benefits of automated testing

An obvious benefit of automating testing, in contrast with manual testing, is that testing can happen quickly, repeatably, and on demand. It becomes a simple matter to verify that the software continues to run as it has before. In addition, using the practices of test-driven development (TDD) and behavior-driven development (BDD) to create test automation has been shown to improve coding quality and design. In short, test automation has the following advantages:

    Reduces time to delivery

    Ensures higher quality

    Supports continuous delivery

    Provides confidence in rapidly changing software

    Enables programmers to run automated tests to ensure their code commits are stable

Get started with test automation

In a DevOps continuous delivery environment, the first principle is that no code is delivered without automated tests. But what automated tests, exactly?

Determining where to invest in test automation requires a strategy. Consider the test automation pyramid:


GUI tests (top of the pyramid)
API tests (middle of the pyramid)
Unit tests (base of the pyramid)


Here, the largest numbers of tests are unit and API tests. Test-driven development ensures that the squad creates unit tests and has a robust framework that makes them easy to write, deliver, and run.
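
To make the base of the pyramid concrete, here is a minimal unit test sketch that uses Python's standard unittest module. The function available_seats and its test data are hypothetical, borrowed from the airline example for illustration only.

```python
import unittest


def available_seats(all_seats, assigned):
    """Return the seats that are not yet assigned, in their original order."""
    taken = set(assigned)
    return [seat for seat in all_seats if seat not in taken]


class AvailableSeatsTest(unittest.TestCase):
    def test_assigned_seats_are_excluded(self):
        self.assertEqual(
            available_seats(["12A", "12B", "12C"], ["12B"]), ["12A", "12C"]
        )

    def test_full_flight_has_no_seats(self):
        self.assertEqual(available_seats(["12A"], ["12A"]), [])


# Run the suite explicitly so the example also works inside another script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AvailableSeatsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests at this level need no deployed environment, which is one reason they are cheap enough to form the largest layer.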

Adopting behavior-driven development creates a robust, maintainable test automation framework for customer acceptance tests using the API or Service layer. In fact, the combination of developers implementing BDD scenarios, in conjunction with their code delivery, tends to ensure the testability of the API or Service layer. This helps teams achieve the desired structure of the pyramid. Typically, these tests are run in a deployed test environment and include integration tests, which are sometimes called "system" tests.

Finally, there is GUI test automation. This is typically the hardest to write and maintain. As a best practice, if the GUI tests can simply verify that everything is "hooked up," meaning that values entered through the UI are passed correctly to the APIs that were robustly tested independently, then this layer can indeed be even smaller than represented in the pyramid above. The smaller the top portion of the pyramid, the better.

Across all layers of testing, it is important to consider how the tests will run automatically. For unit tests, many industry-standard frameworks run with the continuous integration build. For API or service and GUI tests, setting up the production-like test environment is automated with the same deployment automation that is used for delivering to production. These test environments require deploying test tools, test scripts, and possibly test data, into the production-like test environments to allow the tests to run unattended. When a test automation framework is implemented, introducing dependencies increases the complexity of automatically running the tests. Avoid introducing dependencies, if possible.

Test automation can be built into a continuous delivery pipeline by implementing stages for the different types of tests that are required. The pipeline can run tests in one or more stages, and can be configured to stop in any stage where a test case fails. This process ensures that broken code never makes it to production.

The pipeline can include stages that run many types of tests in an automated manner—unit, API, GUI, security, scalability, performance, globalization— to ensure that the application is production ready.
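
The stop-on-failure behavior of a staged pipeline can be sketched in a few lines of Python. The stage names and the lambda checks below are placeholders chosen for this example; in a real pipeline each check would invoke a test suite or a deployment step.

```python
def run_pipeline(stages):
    """Run (name, check) stages in order and stop at the first failure.

    Returns a tuple of (passed, names_of_stages_that_ran).
    """
    ran = []
    for name, check in stages:
        ran.append(name)
        if not check():
            return False, ran
    return True, ran


# Hypothetical stages for illustration.
stages = [
    ("unit", lambda: True),
    ("api", lambda: False),  # simulated failing API test
    ("gui", lambda: True),
    ("deploy", lambda: True),
]

passed, ran = run_pipeline(stages)
print(passed, ran)  # prints: False ['unit', 'api']
```

Because the API stage fails, the GUI and deploy stages never run, so the broken change does not reach production.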


 Automate continuous integration

The effort required to integrate a system increases exponentially with time. By integrating the system more frequently, integration issues are identified earlier, when they are easier to fix, and the overall integration effort is reduced. The result is a higher quality product and more predictable delivery schedules.


Integrate every change to minimize merge conflicts.


Activities

Continuous integration (CI) is implemented in three activities. First, changes are delivered and accepted by team members throughout the development day. Second, developers deliver their changes and perform personal builds and unit tests before making the changes available to the team. Finally, change sets from all developers are integrated in a team workspace and then built and unit tested frequently. This process happens at least daily, but ideally it happens any time a new change set is available.

The first activity ensures that any technical debt from conflicting changes is resolved as the changes occur. The second activity identifies integration issues early so that they can be corrected while the change is still fresh in the developer's mind. The third activity ensures that individual developer changes that are introduced to the team have a minimum level of validation through the build and unit testing, and that the changes are made to a configuration that is known to be good and tested before the new code is available.

The ultimate goal of CI is to integrate and test the system on every change to minimize the time between injecting a defect and correcting it.
Benefits of CI

CI provides the following benefits:


    Improved feedback. CI shows constant and demonstrable progress.

    Improved error detection. CI can help you detect and address errors early, often minutes after they've been injected into the product. Effective CI requires automated unit testing with appropriate code coverage.

    Improved collaboration. CI enables team members to work together safely. They know that they can make a change to their code, integrate the system, and determine quickly whether or not their change conflicts with others.

    Improved system integration. By integrating continuously throughout your project, you know that you can actually build the system, thereby mitigating integration surprises at the end of the lifecycle.

    Reduced number of parallel changes that need to be merged and tested.

    Reduced number of errors found during system testing. All conflicts are resolved before making new change sets available, and the resolution is done by the person who is in the best position to resolve them.

    Reduced technical risk. You always have an up-to-date system to test against.

    Reduced management risk. By continuously integrating your system, you know exactly how much functionality you have built to date, thereby improving your ability to predict when and if you are actually going to be able to deliver the necessary functionality.

Get started with CI

If the team is new to CI, it is best to start small and then incrementally add practices. For example, start with a simple daily integration build and incrementally add tests and automated inspections, such as code coverage, to the build process. As the team begins to adopt the practices, increase the build frequency. The following practices provide guidance in adopting CI.
Developer practices

As part of a CI approach, developers make changes available frequently. For CI to be effective, code changes need to be small, complete, cohesive, and available for integration. Keep change sets small so that they can be completed and tested in a relatively short time span.

Don't introduce errors. Test your changes by using a private build and unit testing before making your changes available.

Fix broken builds immediately. When a problem is identified, fix it as soon as possible, while it is still fresh in your mind. If the problem cannot be quickly resolved, back out the changes instead of completing them.
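
One way to picture these practices is a small Python sketch in which a change set is accepted only if the build and tests pass, and is backed out otherwise. The integrate function, the file names, and the build_and_test placeholder are assumptions made for this example and do not describe any particular CI server.

```python
def integrate(baseline, change_set, build_and_test):
    """Apply a change set to the known-good baseline, or back it out.

    `baseline` and `change_set` map file names to content. The candidate
    replaces the baseline only if `build_and_test` accepts it.
    """
    candidate = {**baseline, **change_set}
    if build_and_test(candidate):
        return candidate, True
    return baseline, False


def build_and_test(files):
    # Placeholder for a real build plus unit tests (an assumption for the demo).
    return all("BROKEN" not in content for content in files.values())


baseline = {"seats.py": "v1"}
baseline, first_ok = integrate(baseline, {"seats.py": "v2"}, build_and_test)
baseline, second_ok = integrate(baseline, {"fares.py": "BROKEN"}, build_and_test)
print(first_ok, second_ok, baseline)
```

The second change set is backed out, so the team workspace stays on the last configuration that is known to be good.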
Integration practices

A build is more than a compilation. A build consists of compilation, testing, inspection, and deployment. Provide feedback as quickly and as often as possible.

Automate the build process so that it is fast and repeatable. In this way, issues are identified and conveyed to the appropriate person for resolution as quickly as possible.

Test with build. Include automated tests with the build process and provide results immediately to the team.
Automation

To increase automation, commit all of your application assets to the code management (CM) repository so they are controlled and available to the rest of the team. The assets include source code, data definition language source, API definitions, and test scripts.

Integrate and automate build, deploy, testing, and promotion. Do this for both developer tests and integration tests. Tests must be repeatable and fast.

Automate feedback from the process to the originator, whether this is the entire team or a developer. Process and resolve feedback, avoiding excess formality, as a part of the backlog process.

Commit your build scripts to the CM repository so that they are controlled and available to the rest of the team. Both for private builds and integration builds, use automated builds. Builds must be repeatable and fast.

Invest in a CI server. The goal of CI is to integrate, build, and test the software in a clean environment any time that there is a change to the implementation. Although a dedicated CI server is not essential, it greatly reduces the overhead that is required to integrate continuously and provides the required reporting.
Common pitfalls

As you start to implement CI, remember these potential issues.
A build process that doesn't identify problems

A build is more than a simple compilation or its dynamic language variations. Sound testing and inspection practices, both developer testing and integration testing, must be adopted to ensure the right amount of coverage.
Integration builds that take too long to complete

The build process must balance coverage with speed. You don't have to run every system-level acceptance test to meet the intent of CI. Staged builds will provide a useful means to organize testing to get the right balance between coverage and speed.
Build server measures that are ignored

Most build servers provide dashboards to measure build results. These results can also be delivered directly to the individual users. Review them to identify trends in applications, components, and architecture that provide an opportunity for improvement.
Change sets that are too large

Developers must develop the discipline and skills to organize their work into small, cohesive change sets. This practice simplifies testing, debugging, and reporting. It also ensures that changes are made available frequently enough to meet the intention of CI.
Defects that are committed to the code management repository

Ensure that developers are performing adequate testing before they make change sets available.


 Design software iteratively

"Prediction is difficult—especially about the future"

This proverb, quoted by the Danish physicist Niels Bohr, best explains iterative design.

Despite the fact that people know the limitations of prediction, they still try to understand all possible requirements of a software project at the beginning of the project. During a marathon set of design sessions, those requirements are written as a specification, resulting in an enormous design document that describes exactly how a system works for all time and in all situations.

Needless to say, that approach hasn't worked well. According to the 2014 version of the yearly Standish Group CHAOS report on software projects, only 16.4% of all software projects are completed on time and under budget.[1] This failure is often attributed to an incomplete understanding of what should be built or a change in requirements while the system is being built. In other words, it's hard to predict exactly what users want, especially when what they want can change over time.

A good solution to this problem is iterative design. Iterative design has been around for over 20 years; in 1986, Barry Boehm discussed building software in a different way in his paper "A spiral model of software development and enhancement." His suggestion was that you evolve your system from a series of ever-more-capable prototypes, versus the traditional end-to-end approach, or waterfall method.[2] In the early days of object-oriented development, Rebecca Wirfs-Brock advocated this type of evolutionary approach to design, as did Grady Booch.[3] In spite of the best intentions, the idea never really caught on. In the Internet boom of the 1990s and early 2000s, all of the other aspects of object-oriented design became popular. However, the adoption of iterative design has been slow. Even today, most projects still start with massive requirements statements and "big-bang" design efforts, sometimes called "Big Design Up Front." Along with other agile approaches, iterative design is the best opportunity to finally break free of this devotion to "write only" documents.
Get started with iterative design

To get started with the practice of iterative design and avoid the worst pitfalls, consider a few simple rules.
Design only what you know

This principle is at the basis of many other agile practices, such as test-driven development (TDD). If you limit your design scope to only the user story that you're working on right now, you're less likely to get stuck in a pile of abstractions. Don't ever let a design fill more than a single sheet of paper. If you find yourself having to draw something more complicated than that, you probably are either trying to be too clever or you need to rethink the split between your user stories.
Don't be afraid to draw a few diagrams

One of the first things you must recognize is that the act of design is, while tied to coding, different than coding. This truth sometimes gets lost in the discussion of agile methods. The key is to not do too much design, but only as much as is appropriate for the current user story you're working on.

How do you design? Many people, including perhaps a majority of developers, are visual thinkers, and thinking about abstract concepts in terms of pictures is usually helpful. Drawing diagrams has a long and successful history in programming—and for a good reason. Diagrams help you think about a problem spatially instead of textually, which for many people is what they need to find solutions to difficult problems.

When you design, don't worry about what kind of pictures you use or how many of them you draw. Use whatever notation or diagramming technique helps you understand how to solve the problem at hand. You might use a simple class diagram, a drawing of a data structure, or a simple flow diagram to help you understand a complex logical chain. Your diagrams don't have to conform to the rules of UML, E-R modeling, or flowcharting. As long as a diagram helps you solve the problem and explain it to the people on your team, you're fine. Don't be afraid if you want to draw a diagram or two before you code or while you're coding—just be sure that it's a part of the process of understanding and solving a problem and not an end in itself.
Look for your abstractions from the bottom up, not the top down

When the book Design Patterns: Elements of Reusable Object-Oriented Software was first published, many people who were familiar with design patterns immediately saw their importance and utility.[4] They all responded with the same reaction: "I've seen this!" Where many developers went wrong was in approaching those concepts—and other abstractions, such as class hierarchies or even REST URL designs—from the wrong direction. You don't start with an abstraction that might fit your problem and then fold, spindle, and mutilate your problem to fit.

Instead, as you encounter new problems, use your experience and the experience of others captured in approaches like Design Patterns, to find good abstractions that help you solve your problems. A great resource is the book Refactoring to Patterns, which gives real examples of how you can start with real, common problems and then refactor your code to solve those problems, eventually ending up with code that implements common design patterns as a side effect.[5]
Don't fall in love with any abstractions

Don't be afraid to abandon your abstractions. One of the worst traps a developer can fall into is to cling to a particular way of solving a problem long after new requirements have rendered that particular approach impractical. For instance, consider an SQL database such as DB2. If the problem you're solving involves answering arbitrary questions about multiple types of information, DB2 is a great solution. If your problem is figuring out which of your customers are female and live in Albuquerque, NM, today and figuring out which customers have an income of less than $25,000 and are under 24 tomorrow, you're probably not going to find a better solution than a relational database.

If you started with Oracle and find that it no longer solves your problem, don't be constrained by that choice. Experiment and realize that refactoring your code is not the end of the world.

Adrian Cockcroft tells a great story about just that kind of switch, which Netflix made from Oracle to Cassandra.[6] The Etsy story of abandoning its "Sprouter" abstraction is another lesson in how sometimes giving up something cherished is the right way forward.[7]
Related practices

To avoid becoming tied to any particular abstraction, do test-driven development (TDD). If you develop good tests for your entire code base, as you change any part of it—as long as the tests continue to pass—you know that you have not introduced any unexpected behaviors. That knowledge gives you the freedom to try several approaches and use the one that works the best.
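
The following Python sketch shows how behavior-level tests leave you free to replace one design with another. Both store classes and the check_store helper are hypothetical examples written for this illustration.

```python
class ListCustomerStore:
    """First design: keep customers in a list and scan it on every query."""

    def __init__(self):
        self._customers = []

    def add(self, name, city):
        self._customers.append((name, city))

    def in_city(self, city):
        return sorted(n for n, c in self._customers if c == city)


class IndexedCustomerStore:
    """Later design: index customers by city; same public behavior."""

    def __init__(self):
        self._by_city = {}

    def add(self, name, city):
        self._by_city.setdefault(city, []).append(name)

    def in_city(self, city):
        return sorted(self._by_city.get(city, []))


def check_store(store_class):
    """The checks depend only on behavior, so they apply to either design."""
    store = store_class()
    store.add("Ana", "Albuquerque")
    store.add("Bo", "Austin")
    store.add("Cy", "Albuquerque")
    assert store.in_city("Albuquerque") == ["Ana", "Cy"]
    assert store.in_city("Denver") == []
    return True


# The same checks pass before and after the abstraction is replaced.
print(check_store(ListCustomerStore), check_store(IndexedCustomerStore))
```

Because the checks never look at how the customers are stored, they keep passing when the internal abstraction is abandoned for a different one.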


 Develop code in small batches

The idea that working in small batches is far more efficient than working in large batches is nothing new. The efficiency of working in small batches is one reason why adopting agile development has benefited the software industry. In the days of waterfall development, months of coding were followed by months of testing, and only near the end of the testing phase did anything really work.
A helpful analogy

To understand the importance of working in small batches, think about installing electrical wiring in a house. If you work in large batches, you might wire the entire house and then turn on the electricity for the first time. Imagine that after you turn on the electricity, you discover a problem. How hard will it be to figure out where the problem is? Did you accidentally cross two wires in an outlet? Did you run the wire from the junction box to the master bedroom when you meant to run it to the other bedroom? Who knows? To find and fix the problem, you need a lot of time for investigation and, likely, reverse engineering.

In contrast, consider how you might wire the house in small batches. You wire the dining room, turn on the electricity, and verify that everything works fine. You turn off the electricity and wire the kitchen. When you turn on the electricity this time, you discover a problem. Because everything was working after you wired the dining room, you know that the problem must be in the kitchen. In this case, troubleshooting is easy.
The benefits of developing software in small batches

Think about small batches in software development. Your team writes and tests a small batch of code, such as a single user story, which provides an end-to-end capability that is valuable to a customer. Everything is working, and you can demo the capability to the customer and get feedback quickly.

You write the next small batch of code, and as you're doing your performance testing, you notice that the performance degraded significantly since the work began on the new user story. How hard will it be to find the problem that is causing the performance issue? It shouldn't be hard at all because you can isolate the investigation to only the new code that was written in the past day or two.

This scenario is far easier and less risky than trying to find the source of a performance problem by looking at possibly hundreds of thousands of lines of code that were written months ago.
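
If a measurement is recorded after each small batch, finding the batch that introduced a regression becomes a simple comparison. The sketch below assumes a hypothetical history of response times per user story and an arbitrary 20% tolerance; both are illustrative choices.

```python
def first_regression(measurements, tolerance=1.2):
    """Return the first batch whose metric grew by more than `tolerance`
    times the previous batch's value, or None if no batch regressed.

    `measurements` is an ordered list of (batch_name, response_time_ms).
    """
    for (_, before), (name, after) in zip(measurements, measurements[1:]):
        if after > before * tolerance:
            return name
    return None


# Hypothetical response times that were recorded after each user story.
history = [("story-1", 100), ("story-2", 105), ("story-3", 240), ("story-4", 245)]
print(first_regression(history))  # prints: story-3
```

With one measurement per batch, the investigation narrows to the code written for a single story.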
Get started

When teams start to work in small batches, their most common stumbling block is determining how to break their work into small batches that each encompass a small piece of end-to-end functionality. This process can be difficult for teams that worked in a waterfall way, or even in a component-based way, where one layer of a project is coded at a time, then the GUI is finally coded, and only then can some end-to-end functionality be exercised. 


 Manage work in a rank-ordered backlog

Software development teams often list their features by priority: high, medium, and low. The biggest part of that list is usually the high-priority features. Teams are also under constant pressure to do it all—now. When these elements combine, teams usually respond by working on as many high-priority features as possible to show progress on everything.
The problem with prioritizing

Unfortunately, working on many high-priority features at the same time can lead to significant problems. One problem is inefficiency. Because the team is trying to show progress on everything, it bounces among several unfinished features, wasting time that it might otherwise be using to focus on and complete one requirement.

When a team is working on too much at once, flexibility is virtually eliminated. If a new, unanticipated item needs to be worked on, the team will likely respond in one of several ways. First, the team might start working on the new item in addition to the others, thus slowing everything down. Alternatively, the team might start working exclusively on the new item, while the other, partially completed work gets in the way and team members lose their familiarity with it. Or, the team might be mandated to work extensive overtime, which typically leads to taking shortcuts that cause problems later.

Beyond the issue of inflexibility, problem determination can be a nightmare. If a performance problem occurs, how can the team determine which of their in-progress features might be causing it? The team might not even be able to start performance testing when lots of incomplete work is underway.

Perhaps the most significant problem with working on several high-priority items at once is that it delays the opportunity to obtain feedback from customers. For more details, keep reading.
The value of a rank-ordered backlog

Given the problems with working on many things at the same time, what's the solution? It's simple: work on and complete one thing at a time. The phrase "stop starting; start finishing" is popular in agile circles for a good reason.

How do teams "start finishing"? Again, the solution is simple: the team needs a rank-ordered backlog of work. The term rank-ordered is preferred because it implies numbering that is clear to follow. In contrast, the phrase prioritized backlog is less helpful because it implies priority. Several items might be high priority, but only one can be first when you have a rank-ordered list.

By adopting the practice of rank-ordering the work, the team has no question about what to work on: the item at the top of the list. When the work on that item is done, the team moves to the next item on the list. That item might be an unanticipated item that was recently added. However, that addition doesn't cause a problem because the team finished the previous item and has the flexibility to start working on something else, no matter how new or old it is. The following illustrates an example of a rank-ordered backlog represented as a column in a Kanban board. User story 7 is the next story to be started when one of the in-progress stories is completed.


One of the biggest benefits of working on and completing one thing at a time is that when something is completed, it can be shown to customers. If customers like what they see, the team can confirm that it is on the right path. If customers don't like what they see, the team can address the customers' concerns and recommendations because no other work is in progress.
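
As a rough illustration of the idea, a rank-ordered backlog can be modeled as a plain ordered list in which position is the rank. The following sketch is not taken from any particular tool; the Backlog class and its add and start_next methods are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Backlog:
    """A rank-ordered backlog: an item's position in the list is its rank."""
    items: list[str] = field(default_factory=list)

    def add(self, item: str, rank: Optional[int] = None) -> None:
        """Insert an item at a 1-based rank; append to the bottom by default."""
        if rank is None:
            self.items.append(item)
        else:
            self.items.insert(max(rank, 1) - 1, item)

    def start_next(self) -> Optional[str]:
        """Remove and return the top-ranked item, or None if nothing is left."""
        return self.items.pop(0) if self.items else None


backlog = Backlog(["User story 7", "User story 8", "User story 9"])
first = backlog.start_next()        # the team starts the top-ranked item
backlog.add("Urgent fix", rank=1)   # an unanticipated item is ranked first
second = backlog.start_next()       # picked up only after the first item is done
```

Because only one item can occupy each position, the new item does not interrupt work in progress; it simply becomes the next thing that the team starts.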
Technical debt and the backlog

Backlogs aren't just for user stories. Teams must monitor and manage application technical debt as part of their backlog on an ongoing basis. Application technical debt is the future cost of not making improvements to your application code, which results in higher maintenance expenses, increased labor, limited functionality, and reduced quality. Application teams can either invest in paying down the debt or continue to pay interest on the accumulated debt.

You can use static analysis tools, such as CAST and SonarQube, to identify your technical debt. If you embed the static analysis into your pipeline, you can flag critical violations as defects and potentially stop the build until the violations are fixed. For all remaining issues, create work items for technical debt improvement ("chores" or "tasks") and place them in your backlog for ranking.
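
The triage that is described above might look like the following sketch. It does not use the CAST or SonarQube APIs; the Finding record, its severity levels, and the triage function are assumptions made for illustration, and a real pipeline would read findings from the report of whichever tool you use.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    rule: str
    severity: str  # hypothetical levels: "critical", "major", "minor"


def triage(findings: list[Finding]) -> tuple[bool, list[Finding], list[Finding]]:
    """Split findings into build-blocking defects and backlog chores.

    Returns (build_passes, defects, chores).
    """
    defects = [f for f in findings if f.severity == "critical"]
    chores = [f for f in findings if f.severity != "critical"]
    return (not defects, defects, chores)


# Hypothetical findings from one analysis run.
findings = [
    Finding("sql-injection", "critical"),
    Finding("duplicated-block", "major"),
    Finding("unused-import", "minor"),
]
build_passes, defects, chores = triage(findings)
```

In this example the build is stopped because of the one critical finding, and the two remaining findings become chores to be ranked in the backlog.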
Non-functional requirements, defects, and plumbing

Beyond technical debt, several other categories of work should be tracked in the same backlog. Some tools support different kinds of work items; in tools that don't, you can use tags or labels to distinguish them.

    Stories track new feature work with a direct user impact.
    Non-functional requirements which aren't directly visible to the user, such as observability or security, should be tracked as tasks.
    It may be helpful to break large user stories down into a plumbing piece and a functional piece. For example, it might be necessary to set up a blockchain infrastructure or write some automation to make data available.
    Defects track problems with accepted user stories. As with non-functional requirements, some defects will be a higher priority than new feature work, but others might be low-priority annoyances.


How to get started

You might be wondering how to rank the items in your backlog. A good practice is to base ranking on these factors: customer and business value, project risk, and necessary improvements to the team's environment and processes.

The product owner is ultimately responsible for the rank-ordering of items. However, the team must provide input because the work that needs to be done isn't always new feature work. For example, if the team's build environment is unstable and the team often wastes time investigating and fixing build failures, the highest-ranked item on the team's backlog should be fixing the build environment. After the build environment is fixed, the team can work on their next items at a faster pace. Similarly, if a new story requires a database with a complex setup, that infrastructure work should be tracked separately as a task. The task clearly needs to be ranked above the dependent story, so the development team should set its position. In general, product owners should decide the positions of defects and stories, and the development team should rank tasks.

In sum, when teams work on and complete one thing at a time, their efficiency and productivity improve. The key is having a rank-ordered backlog. When work is ranked, the team knows what to work on, and the impact of any new work is immediately apparent.


 Get started with architectures

A reference architecture documents the best practices to integrate IT services, products, and tools to build a solution. Reference architectures are based on customer use cases and open industry standards. The architectures in the Garage Method for Cloud include Cloud and other cloud-based implementations that can jump-start application development with features such as AI (Watson), data and analytics, edge, and Internet of Things. The solutions show how to extend, build, and deploy and manage code samples by using suggested services, toolchains, and tools.

Reference architectures fall into several categories in the Cloud architecture model.


Example reference architectures

The Data, Analytics and AI architecture explains the notion of the AI ladder, which describes how data must be collected and organized for use in an application. The organized data can be analyzed and infused with AI capabilities that drive new or improved business capabilities. This architecture is deep and has many levels of constituent architectures that provide greater detail for specific parts of the architecture.


The Application modernization architecture describes how to create a modernization strategy. The strategy includes analyzing your current application estate, prioritizing your modernization goals, and building a roadmap to guide the work that it takes to realize your strategy.


The DevOps architecture describes the practices and tools that are needed to implement DevOps practices in the Method, including agile principles, continuous delivery, and operations automation. The DevOps architecture is complex. To help you better understand it, different aspects of the high-level architecture are shown on tabs. For example, one aspect of the DevOps architecture focuses on integration.


The Public cloud infrastructure architecture illustrates the Cloud platform, which can be used to support scalable, secure, and resilient workloads. The infrastructure services include networks, compute, storage, security, and management.


How to use a reference architecture

The architectures answer some common questions:

    Where do I start?
    I have all of these services, but how do they fit together?
    Do I want to use infrastructure, such as VMs, or services, such as a data store?
    Now that I've proven an approach, how do I scale up and move into production?
    How does my service talk to that other service?
    How is security handled?
    How can I build, deploy, and manage an application?

Each architecture provides an overview and a reference architecture description and diagram, and most architectures also include constituent architectures, solutions, and code patterns.
Overview

The overview describes the key architectural decisions that you need to consider and introduces the technology that is used to implement the architecture. Each overview also contains a conceptual, or level 100, diagram that shows how the parts of the architecture fit together. The architecture overviews also have a level 200 diagram that describes in greater detail the technologies that you need to understand before you implement the architecture. Some overviews also include a link to a field guide for the architecture.
Reference architecture

A reference architecture diagram is the interactive roadmap to your solution. Numbered gray circles indicate the flow of activities within the diagram. You can click each component to view a description and product implementation choices.
Constituent architectures

Constituent architectures are complete stand-alone reference architectures. Each constituent architecture explores a specific part of its parent architecture.
Solutions

Solutions use tabbed diagrams to visualize how to solve your business problem. Each architecture type can have a different type of solution. Solutions typically have a Get the code link that takes you to a GitHub repository for an implementation of the solution.
Code patterns

Code patterns help you solve problems that are relevant to the architecture. They provide a few simple code samples to help you get started.
Resources

Resources include practices, videos, blog posts, and other artifacts that can help you understand and learn more about the architecture. Explore the field guides, practices, courses, and blog postings to leverage the full potential of the architecture center.
Summary

Architecture is the foundation for cloud solutions that are built for the enterprise and scale for production deployments. The architectures and implementations in the Method take many moving parts and show how they fit together, both from a development perspective and a business perspective. The website is an entry point into myriad services, solutions, and infrastructure. It presents how those features are related and can be used to solve problems and address practical high-level business needs.


 Build to manage

You've been there before: development throws its new code "over the wall" and your operations team has to figure out how to deploy, monitor, and manage it. Traditionally, your development team was measured on how fast it updated features and released them into production. Your operations team was measured on availability, which resulted in resistance to change. It's easy to see that these goals are diametrically opposed. In the traditional world, your team had time to build knowledge. Applications were infrequently updated, and after they were deployed, the lifetime of the application spanned years.

As you adopt practices that increase velocity and the speed of change, operations can become a bottleneck, leading to long release times or increased operational risk. To address this problem, create DevOps teams with a broad set of skills and common goals. All the team members are empowered to use their unique skills to drive the team towards overall success. The knowledge of your skilled operations team members helps your developers create more robust software.

Because continuous deployment is key in delivering cloud-based applications, the Ops part of your DevOps team has much less time to build and apply knowledge to prepare for each deployment. To address this reality, you need a different approach to operational management: build to manage.


As development and operations come closer together, new practices arise to ease operations for cloud-based applications.


Build to manage specifies a set of practices that developers can adopt to instrument the application and provide manageability aspects as part of the release. When you implement a build-to-manage approach, consider these practices:

    Health check APIs
    Log format and catalog
    Deployment correlation
    Distributed tracing
    Topology information
    Event format and catalog
    Test cases and scripts
    Monitoring configuration
    Runbooks
    First Failure Data Capture

By adopting those practices, your organization achieves a more mature operational level and faster velocity. Your DevOps team comes closer together as it works toward the common goal of quickly releasing robust functions that meet the required functional, availability, performance, and security objectives.
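
To make the first practice in the list more concrete, here is a minimal sketch of the logic behind a health check API. It is framework-neutral and only an illustration: the check_health function, the check names, and the response shape are assumptions, and the status codes follow the common convention of 200 for healthy and 503 for unavailable.

```python
from typing import Callable


def check_health(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Run each dependency check and build an HTTP-style status code and body."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "up" if check() else "down"
        except Exception:
            # A check that raises is reported as down instead of crashing the endpoint.
            results[name] = "down"
    healthy = all(state == "up" for state in results.values())
    status_code = 200 if healthy else 503
    return status_code, {"status": "up" if healthy else "down", "checks": results}


# Hypothetical checks; real ones would ping a database, a message queue, and so on.
status_code, body = check_health({
    "database": lambda: True,
    "message_queue": lambda: False,
})
```

A web framework handler would return the status code and serialize the body as JSON so that monitoring tools can poll the endpoint.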


 Continuous delivery

Continuous delivery (CD) is a practice by which you build and deploy your software so that it can be released into production at any time. One of the hallmarks of computer science is the shortening of various cycle times in the development and operations process. In the early days of computers when programs were entered in binary with switches and toggles, entering a program was a time-consuming and error-prone effort. Later, editing was instantaneous, but compile times measured in hours were common for large systems. Within a few years, modern compilers and languages such as Java™ and Ruby had made that a thing of the past, as your code was compiled as quickly as you could save the source file.
The move from continuous integration to continuous delivery

The move away from compilation waits merely shifted the focus of waiting. For many years after that, it was normal for developers to code in isolation on their own aspects of the system while an automated or semi-automated build system ran each night to integrate all of that work. Developers lived in fear of being "the one who broke the build," as a build failure couldn't be resolved until the next night. Many teams had "trophies," such as cardboard cutouts of obnoxious movie characters or funny hats, that were awarded to the people who were responsible for build failure.

That approach began to change with the introduction of continuous integration (CI) tools and practices. When you integrated your code more frequently, the possibility of having a misunderstanding that might lead to a build-breaking problem became less common. In addition, the consequences of breaking a build with a faulty automated test became less severe. Again, the focus shifted. Many teams have implemented CI tools but still do system releases on a quarterly or twice-yearly basis. They still live with the pain of multiple code branches: the code release that they sent to users 5 months ago is still being patched as bugs are fixed, the new code base for the next release is drifting farther away from it, and the possibility of missing new bugs in the new release increases daily.
Frequent small releases: Releases become boring

Why is this true? Why do enterprises and commercial software companies put themselves through the pain and anxiety of the "big bang" release? Probably the biggest reason is inertia. Operations teams carefully defined their operations environments and tweaked them in just the right way to ensure that they are secure, perform well, and are reliable. But as a result, they live in fear of change because change might wreck all of the work that went into their carefully constructed environments.

In commercial software, Sales and Marketing teams are used to the twice-per-year training seminars, which they plan their years around. In enterprise development shops, the calendar revolves around pre-planned code freezes, planned vacations to respect those code freezes, and the various audits and checks that are the usual cause of the freezes. What if you turned all of that calendar planning on its head? What if instead of two or four large, disruptive releases, you had much more frequent and smaller releases? You would see several advantages:

    If you change less with each release, the release can break fewer things—that makes the release more predictable and probably easier to roll back.

    If you release more frequently, you vastly reduce the time between concept and rollout—in infrequent releases, the market forces that a feature was designed to address often change by the time it is released.

    You save time, anxiety, and money by having fewer meetings to plan the big-bang releases, less complexity to manage at the time of the releases, and less time spent testing and verifying each release.

The benefits are huge: your team can be more productive, less stressed, and more focused on feature delivery rather than dealing with big, unknown potential changes. In fact, you can go so far as to say that when you do releases often enough, they become predictable and even boring. However, to take advantage of these benefits, you have to embrace a few principles of CD.


Principles of continuous delivery

    Every change must be releasable: That goes almost without saying, but it hides a deep set of practices that influence the way your development and operations teams interact and join together. If every change is releasable, it has to be entirely self-contained. That includes things like user documentation, operations runbooks, and information about exactly what changed and how for audits and traceability. No one gets to procrastinate.

    Code branches must be short-lived: A practice from CI that applies, especially when you augment CI with CD, is the notion of short-lived code branches. If you branch your code from the main trunk, that branch must live for only a short period of time before it is merged back into the trunk in preparation for the next release. If your releases are weekly or daily, the amount of time that a developer or team can spend working in a branch is limited greatly.

    Deliver through an automated pipeline: The real trick to achieving CD is the use of an automated delivery pipeline. A well-constructed delivery pipeline can ensure that all of your code releases are moved into your test and production environments in a repeatable, orderly fashion.

    Automate almost everything: Just as the secret to CD is in assembling a reliable delivery pipeline, the key to building a good delivery pipeline is to automate nearly everything in your development process. Automate not only builds and code deployments, but even the process of constructing new development, test, and production environments. If you get to the point of treating infrastructure as code, you can treat infrastructure changes as one more type of code release that makes its way through the delivery pipeline.

    Aim for zero downtime: To ensure the availability of an application during frequent updates, teams can implement blue-green deployments. In a blue-green deployment, when a new function is pushed to production, it is deployed to an instance that isn't the actual running instance. After the new application instance is validated, the public URL is mapped to the new instance of the application.
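
The blue-green switch in the last principle can be sketched as follows. This is a simplified model, not a deployment tool: the Router class, the instance names, and the validate callback are hypothetical, and a real setup would remap a route or load balancer instead of swapping two attributes.

```python
from typing import Callable


class Router:
    """Maps the public URL to exactly one live instance at a time."""

    def __init__(self, live: str, idle: str) -> None:
        self.live = live
        self.idle = idle
        self.versions: dict[str, str] = {}

    def deploy(self, version: str, validate: Callable[[str], bool]) -> bool:
        """Deploy to the idle instance; switch traffic only if validation passes."""
        target = self.idle
        self.versions[target] = version
        if not validate(target):
            return False  # the live instance keeps serving the old version
        self.live, self.idle = target, self.live
        return True


router = Router(live="blue", idle="green")
router.versions["blue"] = "1.0"
switched = router.deploy("1.1", validate=lambda instance: True)
```

After the switch, the previous live instance becomes the idle one, so it is available as the target for the next release or as a fallback.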


 Build and deploy by using a delivery pipeline

Continuous delivery requires that code changes constantly flow from development all the way through to production. To continuously deliver in a consistent and reliable way, a team must break down the software delivery process into delivery stages and automate the movement of the code through the stages to create a delivery pipeline.

A delivery pipeline is so-named because it allows code to flow through a consistent, automated sequence of stages where each stage in the sequence tests the code from a different perspective. Each successive stage becomes more production-like in its testing and provides more confidence in the code as it progresses through the pipeline.

While each stage is either building or testing the code, the stage must have the necessary automation to not only run the test but also to provision, deploy, set up, and configure the testing and staging environments. The code should progress through each stage automatically. The goal is to strive for unattended automation that eliminates or minimizes human intervention.


The benefits of an automated software delivery pipeline

An automated software delivery pipeline brings great value to teams:

    By providing automation, a pipeline removes the need for expensive and error-prone manual tasks.

    New team members can get started and become productive faster because they don't need to learn a complex development and test environment.

    Teams can detect any code that is not fit for delivery and then reject the code and provide feedback as early as possible.

    A pipeline provides visibility into and confidence in the code as it progresses through successive stages where the testing becomes more like production.

Get started

If you are fortunate enough to be working on a "greenfield" project, it is a wise investment to create and automate your delivery pipeline before you write much feature code. Create a simple "hello world" application in the programming language that the project will use and focus on getting the pipeline stages identified, implemented, and automated from end to end. By taking this approach, you can create feature code much faster and with confidence because you can build and test each bit of code automatically.

If you are adding an automated pipeline to an existing project, where you start depends on the current level of automation that is already in place and, more importantly, where the biggest bottleneck might be in the delivery flow that automation can improve.

Creating a continuous delivery pipeline is a significant development effort and investment that directly impacts the productivity of the entire team and should be treated as "production software" itself.
Choose tools for the delivery pipeline

Creating even a simple delivery pipeline involves multiple automation tools and frameworks. Given the number of tools and the sophistication of automating them together into a reliable and fast delivery pipeline, it is common to have a dedicated team that creates and maintains the pipeline.

Most automated delivery pipelines include at least tools in these categories:

    Source-code management: Tools include Git and Subversion.

    Build: Tools include Ant, Make, Maven, and Gradle.

    Continuous integration (CI) server: Tools include Jenkins and Travis-CI.

    Configuration management: Tools include Ansible, SaltStack, Chef, and Puppet.

    Deployment and provisioning: Tools include UrbanCode Deploy, Bamboo, and Chef.

    Testing frameworks: Tools include xUnit, Behave, and Selenium. Testing frameworks tend to be programming-language specific.

Most pipelines also include an artifact repository where the output of the build stage—including binaries and install packages—is stored. Various stages of the pipeline either get or put items here.
Automated testing

Automated testing is one of the major features of an automated continuous delivery pipeline. Your testing strategy should be mapped to discrete stages and should include both functional and non-functional testing.
An orchestration framework for continuous delivery

A final aspect of building a delivery pipeline is developing the orchestration framework that ties all of the tools together. For example, Cloud Continuous Delivery is a framework that includes toolchains, which are sets of integrated tools, such as Delivery Pipeline and GitHub.

If you're considering using an open-source tool or framework, use one that is supported by a vibrant and active community with many users. Such tools often support several environments, integrate with other tools, have a wide selection of plug-ins or extensions, and have many online sources for help or bug fixes. In addition, a larger pool of people will know how to use the tool. You might combine several open-source tools, such as Gradle and Jenkins, to form the basis of your orchestration framework.

Creating a robust delivery pipeline by using an orchestration framework can be a significant investment. It requires the same care, discipline, and quality as any other software development effort. Unreliable, flaky, or poorly done automation can waste more time than it saves. Teams must be able to trust the test automation and tools to avoid wasting time debugging test- or infrastructure-automation errors or false positives.

Depending on the size and number of projects, it is common to have a team that is responsible to create, own, and maintain the continuous delivery automation. As the orchestration framework becomes more sophisticated and more projects use it, maintaining automation can become a full-time job. This automation is often just as sophisticated as the software projects that it helps to deliver. The code needs to be maintained, optimized, and kept current so that it continues to add value and accelerate delivery.

Remember that tools and automation are a means to an end. Newer, better tools and techniques emerge every day. Avoid investing in a shiny new tool just because it is new; it might not provide value depending on how it affects the other automation in your pipeline. At the same time, continuing to use tools or automation that can no longer satisfy the needs of the team is not productive.
Stages

Because every pipeline is different, stages don't have strict rules. However, several common stages are applicable to most projects: build, staging, and production.
Build

In the build stage, the software is built, packaged, and archived. Unit tests are run. The input for the build stage is typically a source-code repository. The output of this stage is an artifact that is stored in an artifact repository. The build stage is often configured to trigger on changes in the source-code repository.
Staging

In this stage, the build artifact is installed or deployed into a staging environment that is a clone of the production environment. Automated tests are run to verify the new version. Functional tests or integration tests are run on new capabilities. Regression tests are run to ensure that the new version does not break any capabilities. Finally, performance tests are run.
Production

In the production stage, the software is installed or deployed into the production environment. Additional tests are run to ensure that the new version is working as expected. To avoid downtime during deployment, techniques called "blue-green" or "red-black" deployments are often used.
Flow between stages

In simple pipelines, the stages are run sequentially. An error in one stage delays the next stage. In complex pipelines, multiple instances of some stages can exist. For example, you might have one production stage for each region or data center. The production stages are then run in parallel. For auditing purposes and to avoid unintended deployments, production runs are often guarded by a manual approval step.

Early stages should be very simple and run very fast to provide rapid feedback. Later stages run progressively more complex tests in a progressively more production-like environment.

Later stages take longer to run and require more dependencies to run in a production-like environment. Each successive stage that passes increases confidence that the code is ready for production.
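
The sequential flow in a simple pipeline can be sketched as follows. This is only an illustration of the control flow, under the assumption that each stage is a function that returns True on success; the run_pipeline function and the approve_production callback are hypothetical names, and real pipelines are usually defined in the configuration format of the CI server.

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage], approve_production: Callable[[], bool]) -> list[str]:
    """Run stages in order; stop at the first failure or a withheld approval."""
    log = []
    for name, run in stages:
        if name == "production" and not approve_production():
            log.append("production: waiting for approval")
            break
        passed = run()
        log.append(f"{name}: {'passed' if passed else 'failed'}")
        if not passed:
            break  # later stages never receive a change that failed earlier
    return log


stages = [
    ("build", lambda: True),       # build, package, and run unit tests
    ("staging", lambda: False),    # functional, regression, and performance tests
    ("production", lambda: True),  # deploy and run post-deployment checks
]
log = run_pipeline(stages, approve_production=lambda: True)
```

In this run, the staging failure stops the flow, so the production stage is never reached.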


 Prepare your data for AI and data science

After you complete the up-front tasks of translating a business problem into an AI and data science solution, and understanding the data needs in support of your business problem, it’s time to prepare the data. You need to prepare the data in a format that can be used for model development, measurement, and training a machine learning model.

Most AI and data science models require data to be combined and denormalized into one large analytical record before data mining, feature selection, model development and optimization, and training can occur.


Data preparation involves these tasks:

    Select a sample subset of data

    Filter on rows that target particular customers or products that help answer the data analysis and business goals. Also, filter on attributes that are relevant to data analysis and business goals. Some data science patterns, such as fraud detection, anti-money laundering (AML), and log analysis, might require full-volume, unsampled data.

    Merge data sets or records

    To join data sets, you need a common key. Aggregate records, and merge based on group-by-like operations.

    Derive new attributes

    When you merge data sets, especially when you have 1 to many relationships, it can be useful to derive new attributes. For example, if you have a customer data set and a customer purchases data set, you might condense the purchases to a new derived attribute with mean or total spending.

    Format and sort the data for modeling

    Sequence and temporal algorithms might need data to be presorted into a particular order. Categorical data fields might need to be converted from textual categories to numerical ones.

    Remove or replace blank or missing values

    Exclude rows where the missing attribute is key in making the decision. Fill in missing attributes with 0 or estimated values where the rest of the row adds value to the analysis.

    If your data is sparse or you have many attributes, consider reducing the number of features. Principal component analysis can show you which attributes have the biggest impact on the data. Linear regression can show you that two attributes are correlated, so you need to use only one of those attributes and remove the other.

    Normalize numeric fields to use the same range. Features with large values compared to features with small values can be given more weight by various algorithms. Normalization eliminates the unit of measurement by rescaling data, often to a value in the range 0 - 1.

    Replace or correct data and measurement errors
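
As an illustration of the merge and derive tasks in the list above, the following sketch joins a customer data set with a purchases data set on a common key and condenses the purchases into total and mean spending. The sample rows and field names are invented for this example, and the code uses only the Python standard library; a data frame library would typically do the same work in a join followed by a group-by.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data: one row per customer, many rows per purchase.
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "business"},
]
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 100.0},
]

# Group the "many" side of the relationship by the common key.
amounts_by_customer = defaultdict(list)
for purchase in purchases:
    amounts_by_customer[purchase["customer_id"]].append(purchase["amount"])

# Merge on customer_id and condense the purchases into derived attributes.
analytical_records = []
for customer in customers:
    amounts = amounts_by_customer.get(customer["customer_id"], [])
    analytical_records.append({
        **customer,
        "total_spending": sum(amounts),
        "mean_spending": mean(amounts) if amounts else 0.0,
    })
```

Each customer ends up as one denormalized record, which is the shape that most modeling steps expect.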

After you populate the data model, the next step is to ensure that the data fulfills the requirements of completeness, exactness, and relevance.

Conduct univariate statistical analysis to visually inspect the maximum, minimum, mean, and standard deviation values for each variable to detect implausible distributions, such as a negative age.
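
A minimal sketch of such a univariate summary, assuming a single numeric column held in a Python list, is shown next. The summarize function and the sample ages are hypothetical; the standard deviation here is the sample standard deviation from the standard library.

```python
from statistics import mean, stdev


def summarize(values: list[float]) -> dict[str, float]:
    """Return the minimum, maximum, mean, and sample standard deviation."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "stdev": stdev(values),
    }


# Hypothetical age column that contains one implausible value.
ages = [34, 41, -3, 29, 52]
summary = summarize(ages)
implausible = summary["min"] < 0  # a negative age signals a data error
```

Running the same summary for every variable gives a quick table to scan for values that fall outside a plausible range.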

Handle outliers and missing values, which can produce biased results. The mitigation of outliers and the transformation of missing data into meaningful information can improve data quality.

Variables can be superfluous when they present the same or similar information as other variables. You can find dependent or highly correlated variables by conducting statistical tests, such as bivariate statistics and linear or polynomial regression. Reduce dependent variables by selecting one variable to represent the others or by composing a new variable from all correlated variables through factor or component analysis. Variable reduction improves model performance.
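
One simple bivariate check is the Pearson correlation coefficient between each pair of variables. The sketch below is an illustration only: the redundant_pairs function, the 0.95 threshold, and the sample columns are assumptions, and the right threshold depends on your data and model.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns: dict[str, list[float]], threshold: float = 0.95) -> list[tuple[str, str]]:
    """List pairs of variables whose absolute correlation exceeds the threshold."""
    names = list(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# Hypothetical columns: height is recorded twice in different units.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [30.0, 52.0, 27.0, 44.0],
}
pairs = redundant_pairs(columns)
```

Only the two height columns are flagged, so one of them can be dropped without losing information.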
Handle missing data

Missing data occurs when no data value is stored for a particular observation or field within the data. Missing data is a common occurrence and can affect the conclusions that you draw from the data.

To handle missing data, follow these common approaches:

    Calculate a value to replace the missing value. The simplest approach is to calculate the mean for the field in question and replace the missing value with the mean value. You can also apply more complex statistical methods, which consider other known values for the data record in question.

    Delete or remove the record from the data set. This approach is often adopted when the target field for a predictive model has missing data.
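
Both approaches can be sketched in a few lines. The example assumes that records are dictionaries and that a missing value is represented by None; the impute_mean and drop_missing functions and the sample fields are hypothetical.

```python
from statistics import mean


def impute_mean(records: list[dict], field: str) -> list[dict]:
    """Replace missing (None) values of a field with the mean of the known values."""
    known = [r[field] for r in records if r[field] is not None]
    fill = mean(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]


def drop_missing(records: list[dict], field: str) -> list[dict]:
    """Remove records where the field, for example the model target, is missing."""
    return [r for r in records if r[field] is not None]


# Hypothetical records; None marks a missing value.
records = [
    {"income": 40.0, "churned": True},
    {"income": None, "churned": False},
    {"income": 60.0, "churned": None},
    {"income": 80.0, "churned": True},
]
prepared = impute_mean(drop_missing(records, "churned"), "income")
```

Here the record without a target value is removed first, and the remaining missing income is then replaced with the mean of the incomes that are still in the data set.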

Data normalization

Data normalization is the process of adjusting values that are measured on different scales to a notionally common scale. For example, an analyst scores a piece of information on a scale of 1 - 5, where 1 = not useful and 5 = very useful. A machine scores the same piece of information, but as a percentage. The usefulness of the information is scored by two systems that use different scales. To compare the analyst versus the machine, the values must be normalized to a common scale.

One approach is to convert the percentage scores to the same scale that the analyst used:

0 - 20 = 1

21 - 40 = 2

41 - 60 = 3

61 - 80 = 4

81 - 100 = 5
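
The mapping above can be expressed as a small Python function. This is a sketch: the function name percent_to_scale is hypothetical, and it assumes that a fractional score between two buckets (for example, 20.5) belongs to the higher bucket, which the table does not specify.

```python
import math

def percent_to_scale(percent):
    """Map a 0 - 100 percentage score onto the analyst's 1 - 5 scale."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Each bucket spans 20 points; a score of 0 is folded into the lowest bucket.
    return max(1, math.ceil(percent / 20))

machine_scores = [0, 20, 21, 55, 80, 81, 100]
normalized = [percent_to_scale(p) for p in machine_scores]
print(normalized)  # [1, 1, 2, 3, 4, 5, 5]
```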


Categorical data preparation

Categorical or discrete variables represent a fixed number of possible values, rather than a continuous number. Two types of categorical variables exist:

    Nominal: Categories are labeled without any order of precedence, such as London, Paris, and Berlin

    Ordinal: Categories are labeled where an order of precedence exists, such as low, medium, and high

Although some data mining algorithms can handle categorical data as it is, many require it to be converted into a numerical representation. That conversion can be done through many approaches, which are often referred to as encoding techniques.

In the label encoding approach, you assign each unique category value a set numeric value. This approach is easy to implement and is good for ordinal categorical data, but can be misleading for nominal data. The challenge with label encoding is that some algorithms might see a value as a relative measure of magnitude:
Category value   	Encoding
Low	 	1
Medium	 	2
High	 	3


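For nominal data, such as the following cities, some algorithms might treat an encoding of 3 as three times the magnitude of an encoding of 1, even though no such relationship exists: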
Category value   	Encoding
London	 	1
Paris	 	2
Berlin	 	3
Rome	 	4


Another approach is one hot encoding. This approach converts each category value into a new column and assigns a 1 or a 0 (true or false). This approach overcomes the weighting issue of label encoding, but it can add many dimensions to your data.

Note: High dimensionality can cause model complexity. A data set with more dimensions requires more parameters for the model to understand, which means the model needs more rows to reliably learn the parameters. If the number of rows in a data set is fixed, the addition of dimensions without adding more information to learn from can reduce model accuracy.
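
To make the difference between label encoding and one hot encoding concrete, here is a minimal Python sketch that uses the same category values as the tables above. The function names label_encode and one_hot_encode are hypothetical, and the column order of the one hot result (alphabetical) is an arbitrary choice for this example.

```python
def label_encode(values, order):
    """Assign each category its 1-based position in `order`."""
    codes = {category: i for i, category in enumerate(order, start=1)}
    return [codes[v] for v in values]

def one_hot_encode(values):
    """Create one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Ordinal data: the numeric order carries real meaning.
risk = ["Low", "High", "Medium"]
risk_codes = label_encode(risk, order=["Low", "Medium", "High"])

# Nominal data: one column per city avoids implying that Rome > London.
cities = ["London", "Paris", "Berlin", "Rome", "Paris"]
columns, one_hot = one_hot_encode(cities)

print(risk_codes)  # [1, 3, 2]
print(columns)     # ['Berlin', 'London', 'Paris', 'Rome']
for row in one_hot:
    print(row)
```

Four distinct cities produce four columns, which shows how quickly one hot encoding adds dimensions as the number of categories grows.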

A third approach is custom binary encoding. This approach is a hybrid of label and one hot encoding, where the number of categories is reduced into buckets and then one hot encoding is applied.

For example, if the category is world cities, you might group the cities into geographical regions or size in population. Then, you convert those abstracted labels into one hot encoding.

This approach tries to minimize the “curse of dimensionality” problem. Natural categorical groupings might be hard to determine or might collapse the data down too much.
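
The following Python sketch illustrates custom binary encoding for the world cities example. The city-to-region mapping and the function name encode_by_region are hypothetical; in practice, you choose buckets that make sense for your own business problem.

```python
# Hypothetical grouping of cities into geographical regions.
CITY_TO_REGION = {
    "London": "Europe",
    "Paris": "Europe",
    "Berlin": "Europe",
    "Tokyo": "Asia",
    "Seoul": "Asia",
    "Lima": "Americas",
}

def encode_by_region(cities):
    """Bucket each city into its region, then one hot encode the region."""
    regions = sorted(set(CITY_TO_REGION.values()))
    rows = []
    for city in cities:
        region = CITY_TO_REGION[city]
        rows.append([1 if region == r else 0 for r in regions])
    return regions, rows

region_columns, encoded = encode_by_region(["Paris", "Tokyo", "Lima", "Berlin"])
print(region_columns)  # ['Americas', 'Asia', 'Europe']
print(encoded)         # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Six cities collapse into three columns. The trade-off is visible in the output: Paris and Berlin become indistinguishable after encoding.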

Many other approaches use the mean and standard deviation of the dependent variable to give a distribution across the one hot encoding. For example, each extra dimension is given a value in the range 0 - 1. Popular approaches include backward difference coding and polynomial coding.
Start with an example

Rachel defined several thought experiments as the conclusion of her data understanding task. Each experiment targets a part of the data and how it might be able to help answer the overall business goal. She is starting to work on thought experiment 1: Can the risk factors be analyzed in some way to determine what a good or bad loan is?

The data understanding task identified that a large amount of data was missing, especially in the risk factor fields. Rachel decides that some of the fields are easy to fix. For example, these fields are all negative records on a person’s credit history:

    delinq_2yrs: The number of times that the applicant was delinquent in the past two years

    inq_last_6mths: Credit inquiries in the last 6 months

    pub_rec_bankruptcies: Public recorded bankruptcies

    pub_rec: The number of negative entries in the public record

She assumes that data is missing from those fields because no records of those items exist for the borrower, so the field wasn't completed. She replaces each missing value with 0.

However, that approach doesn't work on other fields, such as mths_since_last_delinq. Rachel assumes that the values are missing because the borrower was never delinquent. The question is how to handle this case. She can’t replace the missing data with a number because by doing so she is suggesting that the borrower was delinquent.

Rachel decides to create a derived field: delinquency_normalized_frequency. She assumes that the recency of the delinquency is important and gives more weight to borrowers who were recently delinquent. After she plots the distribution of the mths_since_last_delinq field, she categorizes it as follows:

0 = isNull

1 = 49+ months

2 = 13 – 48 months

3 = 7 – 12 months

4 = 3 – 6 months

5 = 0 – 2 months

Note: Because some algorithms use the size of the value as a factor in analysis, Rachel separates the isNull and the most recent delinquency values as far away from each other as possible.
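
Rachel's categorization can be sketched as a small Python function. The function name mirrors the derived field name from the text. Representing a missing value as None and assuming that month counts are never negative are choices made for this illustration.

```python
def delinquency_normalized_frequency(mths_since_last_delinq):
    """Derive the 0 - 5 category from mths_since_last_delinq."""
    months = mths_since_last_delinq
    if months is None:
        return 0  # isNull: assumed to mean the borrower was never delinquent
    if months >= 49:
        return 1
    if months >= 13:
        return 2
    if months >= 7:
        return 3
    if months >= 3:
        return 4
    return 5  # 0 - 2 months: the most recent delinquency

sample = [None, 60, 48, 13, 7, 3, 2, 0]
derived = [delinquency_normalized_frequency(m) for m in sample]
print(derived)  # [0, 1, 2, 2, 3, 4, 5, 5]
```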

Rachel notices that the revol_util field has 90 missing values, but the revol_bal field has no missing values. Can she draw conclusions from the revol_bal field that help resolve the missing values from the revol_util field?

From the data understanding exercise, she learned that the revolving balance is where the borrower pays only part of the balance on their credit card each month. The revolving utilization is the amount of credit that the borrower is using relative to all available revolving credit. By inspecting the revol_bal field for the 90 missing revol_util values, she discovers that all but two of them are zero. This finding suggests that where there is a zero revol_bal, the borrower has a zero revol_util.
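
One way to apply that finding is to fill a missing revol_util value with zero only when the corresponding revol_bal is zero, and to leave the remaining cases for separate review. The following Python sketch assumes that records are dictionaries and that missing values are None. The sample loans and the function name fill_revol_util are hypothetical.

```python
def fill_revol_util(loans):
    """Set a missing revol_util to 0.0 where revol_bal is zero."""
    filled = []
    for loan in loans:
        loan = dict(loan)  # copy so that the input records are not modified
        if loan["revol_util"] is None and loan["revol_bal"] == 0:
            loan["revol_util"] = 0.0
        filled.append(loan)
    return filled

# Hypothetical sample records.
loans = [
    {"revol_bal": 0, "revol_util": None},
    {"revol_bal": 1500, "revol_util": 32.5},
    {"revol_bal": 820, "revol_util": None},  # nonzero balance: left unresolved
]

filled = fill_revol_util(loans)
still_missing = [loan for loan in filled if loan["revol_util"] is None]
print(filled)
print("Records that still need review:", len(still_missing))
```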

Rachel completed all the data preparation tasks that she identified as part of the thought experiment definition. She is ready to train and deploy a model. To continue to follow Rachel's story, see Select and develop an AI and data science model.

Just as Rachel documented the business understanding and data understanding, it's a good practice to record any data preparation tasks in a data preparation report. That report provides a core set of documentation that you can use later to integrate and deploy your AI and data science solution within your data environment.


Reason


 Integrate AI into your applications and processes

The Reason practices help your team to integrate AI into solutions and into the execution of the method practices. You develop analytic models by using machine learning approaches and train those models on your data so that you can arrive at the right decisions faster. You validate and audit the model to ensure correctness and non-bias, and train the model with both good examples and anti-patterns.
Why change?

Although well-defined business goals are important, you can't meet them without understanding your data. You need a way to tap what's inside of both structured and unstructured data so that you can achieve tangible business insights faster. AI solutions can help you to monitor trends, find patterns, and inform your decisions.
What changes?

Adopting AI solutions requires a multidisciplinary team that includes data scientists, business analysts, and developers who work together to ensure that data models are interpreted and optimized in the context of the business needs. The project sponsors and other team members need to understand the capabilities and limitations of AI and machine learning. Far from being a 'magic bullet' that can automatically answer any business question, AI is a science with a defined process and methodology to follow.

AI solutions require the application of a mix of disciplines, including AI, machine learning, data science, and statistical analysis. The data models that are created require a rigorous, iterative training process to ensure that the answers you receive are accurate and applicable to your target business problem.

In the early stages of AI and data science projects, teams can apply the Discover and Think practices to understand business and data needs, perform data analysis, and prepare the data. This data understanding stage is critical to "getting it right", and it sets the stage for using Reason practices:
Use models to analyze and unlock data

You select a data model and then, through trial and error, train and tune the model in an iterative process. Certain data models map to particular business problems, such as the use of clustering and classification techniques to address defect analysis. However, you might need to use multiple models to solve a business problem.
Train and evaluate your model

The next step in the process is to train the model. You select data sets, tune the model, and then evaluate whether the model provides insights that help to answer the business problem. The data scientist and business analyst always work together to interpret and optimize AI and data science models in context of the business needs.


 Put AI and data science to work in your organization

In the context of the DIKW (data, information, knowledge, wisdom) pyramid, the goal of an AI and data science project is to turn data into actionable knowledge that you can use to answer a business question and transform the way that your business operates and interacts with trust and transparency.


The actual process from inception to deployment can be broken into steps and cycles. But before you begin working with the data, you need to start with your business objectives and desired outcomes. To achieve success in meeting those objectives, you need to do these tasks:

    Assemble the right team

    Translate the business problem into an AI and data science context

    Define measures and goals


Typically, you have too much data to analyze all at once, so your first task is to sample, prepare, and cleanse a subset of data. Data is often in different sources and in different formats, so you need to combine and prepare it so that it can be consumed. The prepared data is analyzed by using a suite of techniques and AI and data science models that turn the data into information. This raw information can rarely be used directly by the business. Business subject matter experts interpret it in the context of the business use case. This entire process is cyclic and iterative, frequently uncovering surprises and insights from the data analysis.

For AI and data science projects, this information processing cycle can be decomposed into an iterative workflow with well-defined tasks and measures at each step. By design, data is in the center of this workflow. Without the correct data, the business use case can't be answered.


The tasks, although distinct, are highly connected, and the cycles are frequently bidirectional, especially in the early tasks. For example, if no data is available to answer the business goals, either the business goals need to be reset or more data needs to be found.

The early stages of business understanding, data understanding, and data preparation are key to delivering an AI and data science project. Get these tasks right, and the rest of the process (modeling, evaluation, and deployment) is more likely to be successful. If you try to speed through them or skip them, the follow-on tasks will lack focus and you're more likely to fail.

In the business understanding stage, you translate your business problem into an AI and data science solution, assemble the team in support of a data-driven project, and establish the success criteria and metrics for your project.

In the data understanding stage, you understand your data needs to support an AI and data science solution and run thought experiments by using hypothesis-driven analysis. Next, you prepare your data for AI and data science.

In the modeling stage, you need the correct understanding of how AI and data science work. In that stage, you also select and develop AI and data science models.

Finally, you enhance and optimize your AI and data science models, and deploy and monitor them.

This pie chart provides a rough guide to the amount of time to spend on each task. Business understanding, data understanding, and data preparation take 80% of the overall project time.


Embrace agile

With the level of effort that is required in these early stages of building an AI and data science solution, it can be difficult for business sponsors to see immediate progress.

Many opportunities exist to break those larger areas of work into smaller activities of an agile delivery model, measuring against the business objectives to ensure that you're on the right track. For example, the series of thought experiments that you complete in the data understanding stage become a contribution to the backlog and prioritization for agile delivery for AI and data science projects.


Embracing a continuous delivery model that is aligned to your business objectives, desired outcomes, and measures of success motivates the team and shows progress to the business sponsors and stakeholders. This model is the foundation of a data-driven organization, and of transforming your business by infusing data, AI, and data science.


 Look behind the curtain of AI

If you spend enough time in the data science space, you're likely to encounter myths and untruths about what artificial intelligence (AI) and in particular, machine learning, can achieve.

The problem begins when the word supervised is detached from the front of the words machine learning. It might seem like a small omission, but the impact is massive. A customer is told that their new “all singing all dancing” system uses machine learning to understand the important information in their large document corpus and can answer questions with concise, on-topic responses. However, by dropping the word supervised, what the customer doesn't realize is that the machine must first undergo 3 - 12 months of intensive training.


The next problem is that many people don’t understand the science and the mathematics behind AI and data science. Data science is a highly specialized field, so it isn't surprising that many people don't understand it. The problem is that people seem to either understand it or they don't. And when people don't understand data science, they don’t challenge unfounded statements as they might normally do. Instead, they believe that anything is possible.

A further problem is that the terms AI, data science, and machine learning are overused and are starting to lose their meaning. Worse still, they're misused or misunderstood. People use them to push a particular agenda, such as an academic’s research topic or a salesman’s product.

It's time to dispel this confusion. Start by looking at a few high-level definitions:

Artificial intelligence (AI)

A field of computer science in which machines are programmed to mimic cognitive functions that humans associate with other human minds, such as learning and problem solving.

Data science

The use of scientific methods, processes, and algorithms to extract knowledge and insights from structured and unstructured data.

Machine learning

Algorithms and statistical models that computer systems use to effectively complete a specific task without explicit instructions.

Statistical analysis

A branch of mathematics that works with data collection, organization, and analysis.

If you take these high-level definitions and plot them onto something like a Venn diagram, some of this confusion is clarified. 


In the center, where computer science and mathematics overlap, the distinction between the terms is blurry. The terms aren't mutually exclusive. For example, AI is defined as computer science, but it uses statistical methods as part of its process.

None of these areas can replace another. Each area has its own purpose, and the areas often complement each other.

There is no magic behind AI and machine learning. It's a science with a defined process and methodology to follow like any other science. Rarely will a single model answer your business problem. Usually, a combination of techniques is required.

Many advanced data science libraries and tools are available that accelerate the job of the data scientist. The data scientist's skill is to understand how to use, configure, and optimize each technique to turn the available data into actionable knowledge.


 Select and develop an AI and data science model

You defined business goals and spent hours digging through data. You even did a bit of "data wrangling" to handle missing values and convert the job experience categorical data field into something numerical. Now, what are you going to do with all that data, and how can what you do with it add value to the business? It's time to create a model.

Fortunately, deciding which model to use isn't as daunting as it sounds. Typically, the type of model that you need is obvious. For example, Market Basket Analysis in retail uses an associations technique that was developed for that purpose.

The main techniques fall into one of these categories:

    Feature reduction: This technique is a data preparation task. A common challenge in data science is the "curse of dimensionality", where data sets have too many columns (also known as dimensions or features) of data. Feature reduction removes dimensions that have little or no impact on the data set.

    Association or sequence analysis: These techniques look for data that depends on other data. In the classic Market Basket Analysis use case, associations can be extracted, such as "If the person buys beer, the person is more likely to also buy diapers than someone who does not buy beer". Association modeling isn't limited to Market Basket Analysis. It's also widely used in temporal analysis (if a particular event happens, another kind of event is likely to happen) and social network analysis (this person works for the same organization as another person, so they're more likely to know each other compared to someone who doesn't work there).

    Clustering: This technique separates the data set into significant groups or buckets. Cluster algorithms attempt to divide the data into distinct groups by minimizing the distance between data points within a cluster and maximizing the distance between clusters. This is easy to visualize in a two-dimensional space, but nearly impossible to do in a highly dimensional space. An example clustering use case is customer profiling.

    Classification: This technique uses known labeled data to learn the characteristics of a business-defined category. The output is a model of some format that applies the patterns that it learned to categorize unlabeled data. A popular classification model is a decision tree. A decision tree builds a hierarchical tree of decision boundaries. A data record is pushed through the tree and is labeled when it reaches a leaf at the end of a branch. For example, the top of the tree might split based on gender, so the first decision boundary might be "is gender = male, if yes, go down the left branch, if no, go down the right branch". The next decision boundary is then applied on another feature, such as income. The features that make up each decision boundary and the values to branch on are "learned" by the decision tree during the model training process.

    Prediction: This technique is like classification, but is often a binary classification of Yes or No.

    Natural language processing (NLP): This technique performs grammatical and linguistic analysis on unstructured text to pull out important concepts or entities in the text. The concepts or entities that are extracted can be seen as structured metadata that can be used by any of the techniques. For example, you can create an NLP model to pull out vehicle details, damage, and severity of injuries for the various parties that are involved in a road traffic report. This kind of information can then be used as input into a predictive model to calculate the expected insurance claim range for the purposes of insurance underwriting.

Typically, the challenge isn't which technique to use, but how to train and optimize the technique. All the techniques require some form of parameter selection, which requires experience of how the techniques work and what the various parameters do. Modeling is an iterative process where a number of experiments are conducted to find the optimal model, parameter, and feature selection.

Some people say that machine learning techniques must be better than statistical ones, but that assertion isn't necessarily true. Both statistical and machine learning models can provide good results. It depends on the data and the business problem that you're trying to solve. You can't know which model is the best until you try it. A large part of model development, optimization, and tuning is experimentation through trial and error.

The following diagram attempts to map "typical" business use cases in the first row to the "standard" modeling techniques in the second row. Below each technique are the more commonly used statistical and machine learning models. Business use cases can be answered by multiple techniques. For example, you can use association, clustering, classification, and predictive techniques to help answer a fraud detection use case. Often, you need to deploy multiple models to answer the business problem.

The diagram also differentiates between whether the model is supervised or unsupervised. Supervised means that the model needs labeled data with a known result or classification. Supervised techniques are often easier to relate back to the business problem because they use data that was labeled with a business context in mind. However, labeled data is not always readily available, and creating it can require a lot of work. Unsupervised techniques are appropriate where labeled data isn't required and the model decides what is significant. Because the model isn't trained with the business context, the challenge is that these patterns can be difficult to interpret, and you can't guarantee that they directly relate to the business problem.


Start with an example

Rachel just completed the data preparation task. She entered all the missing values, derived new fields, normalized a few of the more complex data points, and converted the text fields into numeric representations. She's ready to select and train a few models. Her overall goal is to train a predictive model to see whether she can determine good and bad loans. However, the data is still a bit raw: too many risk fields are sparsely populated or contain zero values. Rachel thinks that she is going to need to aggregate those fields to reduce them.

One option is to try to group the different types of risk into buckets based on business logic. However, there are too many fields, and Rachel isn't sure which ones to logically group. Instead, she decides to explore the data with a cluster algorithm to see whether a natural statistical separation exists. Clustering techniques are unsupervised, where the technique decides what is significant and you can't guarantee that the results map to the business problem.

Rachel isn't sure which clustering technique might work best with her data and decides to experiment with two of the better-known ones. She selects K-Means clustering because she knows that it works well on large data sets that contain many features. K-Means is also known to scale well, especially compared to the more computationally demanding hierarchical clustering technique. One drawback of K-Means clustering is trying to determine what the value of “K” is, or what the optimal number of clusters is. Because she knows about that weakness, Rachel selects another technique, the Kohonen Neural Network. Although Kohonen is computationally more demanding than K-Means, it builds a self-organizing feature map that automatically calculates the optimal number of clusters.
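
To show what K-Means does and why the value of K has to be supplied up front, here is a deliberately simplified Python sketch. It is not the implementation that Rachel uses: the sample points are hypothetical, the starting centroids are simply the first K points, and production libraries use more robust initialization.

```python
import math

def k_means(points, k, max_iterations=100):
    """Cluster 2-D points into k groups; returns (labels, centroids)."""
    # Simplistic initialization: use the first k points as starting centroids.
    centroids = [points[i] for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids

# Hypothetical data with two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids = k_means(points, k=2)
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Because the caller chooses k, a common practice is to run the algorithm for several values of k and compare the results.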

Now that Rachel selected two clustering techniques, her next step is to train them. She can tweak, tune, and optimize each technique by adjusting different input parameters, which are known as hyperparameters.
Summary

No rules exist on which model is best suited to analyze your data. Even if you know the AI and data science technique, you can apply multiple algorithms and models. For example, two cluster techniques can come up with different cluster definitions because they use different rules to decide how to split the data. Although a few basic guidelines about techniques exist, the only way to determine which technique is best for your data is to try it, evaluate it, and compare it to other methods. What works well on one data set might be out-performed by another technique on a different data set.


 Enhance and optimize your AI and data science models

You selected several AI and data science models to use in your thought experiment. Now, it's time to train the models and evaluate whether they answer the hypothesis that you're exploring.

To train a model, you take these steps:

    Select the data. Split the data into different data sets: one to train the model, one to validate the model, and a third set that you keep for further blind testing.

    Tune the model. Models provide several input parameters, called hyperparameters, that a data scientist uses to tune the model.

    Evaluate the results. Does the model provide insights into the data that helps to answer the business problem and prove or disprove the hypothesis?

Model training and optimization is a compute-heavy task. Expect to run the model tens or hundreds of times by using different hyperparameter configurations.
Data set selection

Training a good model is a balancing act between generalization and specialization. The following diagrams illustrate that balancing act in a simplified way. A model is unlikely to ever get every prediction right because data is noisy, complex, and ambiguous.

A model must generalize to handle the variety within data, especially data that it hasn't been trained on. If a model generalizes too much, though, it might underfit the data. The model needs to specialize to learn the complexity of the data. Alternatively, if the model specializes too much, it might overfit the data. Overfitted models learn the intricate local details on the data that they are trained on. When they're presented with new data or out-of-sample data, these local intricacies might not be valid. The ideal is for the model to be a good representation of the data on the whole, and to accept that some data points are outliers that the model will never get right.


To avoid overfitting or underfitting a model, split your data into multiple sets:

    A training data set is used to train the model. Typically, the training set is 60% of the available data.

    A validation data set is used to check a model after it performs well on the training data. Typically, this data set ranges from 10% - 20% of the available data.

    A test data set that contains data that has never been used in the training. Typically, this data set ranges from 5% - 20% of the available data.
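
A three-way split can be sketched in Python as follows. The 60/20/20 proportions are one choice within the ranges above, and the function name split_data, the fixed seed, and the stand-in records are assumptions made for this example.

```python
import random

def split_data(records, train=0.6, validation=0.2, seed=42):
    """Shuffle records and split them into train, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    train_end = round(len(shuffled) * train)
    validation_end = train_end + round(len(shuffled) * validation)
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],  # whatever remains is the blind test set
    )

records = list(range(100))  # stand-in for 100 data records
train_set, validation_set, test_set = split_data(records)
print(len(train_set), len(validation_set), len(test_set))  # 60 20 20
```

Shuffling before the split avoids accidental ordering effects, such as all the oldest records ending up in the training set.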

Hyperparameter tuning

The main way of tuning an AI and data science model is to adjust the model hyperparameters. Hyperparameters are input parameters that are configured before the model starts the learning process. They're called hyperparameters because models also use parameters. However, those parameters are internal to the model and are adjusted by the model during the training process. Many data science libraries use default values for hyperparameters as a best guess. These values might create a reasonable model, but the optimal configuration depends on the data that is being modeled. The only way to work out the optimal configuration is through trial and error.

To better understand hyperparameter tuning, look at the example of a random forest algorithm. The algorithm is a predictive model with these hyperparameters:

    n_estimators: The number of trees in the forest

    max_features: The max number of features that are considered for splitting a node

    max_depth: The max number of levels in each decision tree

    min_samples_split: The min number of data points that are placed in a node before the node is split

    min_samples_leaf: The min number of data points that are allowed in a leaf node

    bootstrap: The method for sampling data points, with or without replacement

To tune the model, you can try these combinations:

bootstrap: [True, False],
max_depth: [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
max_features: ['auto', 'sqrt', 'log2'],
min_samples_leaf: [1, 2, 4, 6, 8, 10, 12],
min_samples_split: [2, 5, 10, 20, 30, 50],
n_estimators: [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]

To tune one random forest model exhaustively, you need to evaluate every combination of those hyperparameter values.
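
To get a feel for the size of that search, the following Python sketch enumerates the grid with the standard library and counts the configurations. It only counts and lists combinations; it doesn't train a model. The variable names are hypothetical, and the values are copied from the lists above.

```python
import itertools

param_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    "max_features": ["auto", "sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4, 6, 8, 10, 12],
    "min_samples_split": [2, 5, 10, 20, 30, 50],
    "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
}

# Every configuration is one pick from each list of candidate values.
names = list(param_grid)
configurations = [
    dict(zip(names, values))
    for values in itertools.product(*param_grid.values())
]

total = len(configurations)
print(total)              # 27720
print(configurations[0])  # the first configuration to evaluate
```

With 27,720 configurations, each of which requires a full training run, the count illustrates why model tuning is a compute-heavy task.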
Model evaluation metrics

You evaluate a model differently depending on whether it is a statistical model or a machine learning model.
Regression models

You can use two standard measures to evaluate regression models:

    Mean squared error takes the difference between the predicted and actual value of the target field for each observation, squares it, and then calculates the mean of those squared errors across all observations.

    R-squared determines how close the data is to the fitted regression line. This measure is the percentage of variation of the target variable that is explained by the linear model.
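
Both measures can be computed directly from the actual and predicted values. The following Python sketch uses hypothetical values and hypothetical function names. It computes R-squared as 1 minus the ratio of the residual sum of squares to the total sum of squares, which is the usual formulation.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Share of the variation in the target that the model explains."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total
    return 1 - ss_res / ss_tot

# Hypothetical target values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.5]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(mse)           # 0.1875
print(round(r2, 4))  # 0.9625
```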

Classification models

A common evaluation technique for classification models is to use a confusion matrix. This confusion matrix is for a binary classifier, although it can be extended to the case that has more than two classes.


The matrix shows that two possible predicted classes exist: Yes and No. The test set contains 165 observations. Out of those 165 cases, the classifier predicted No 55 times and Yes 110 times. Five of the No predictions were false negatives (type 2 errors). Ten of the Yes predictions were false positives (type 1 errors).
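
The four cells of the matrix can be recovered from the numbers in the description above. The following Python sketch derives them and computes the overall accuracy. The variable names are hypothetical.

```python
# Figures taken from the description of the confusion matrix.
total = 165
predicted_yes, predicted_no = 110, 55
false_positives = 10  # predicted Yes, actually No (type 1 errors)
false_negatives = 5   # predicted No, actually Yes (type 2 errors)

# The remaining predictions in each column were correct.
true_positives = predicted_yes - false_positives
true_negatives = predicted_no - false_negatives

# Accuracy is the share of all predictions that were correct.
accuracy = (true_positives + true_negatives) / total

print(true_positives, true_negatives)  # 100 50
print(round(accuracy, 3))              # 0.909
```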
Machine learning

Common evaluation metrics are precision and recall, which can be combined into a single overall score. Precision is the fraction of retrieved instances that are relevant. Recall is the fraction of relevant instances that are retrieved.

The f1 score is the harmonic mean of precision and recall. It is not the same as accuracy, which is the fraction of all predictions that are correct. Precision and recall are often at odds with one another. If you maximize the precision, the recall is likely to decrease. If you maximize the recall, the precision is likely to decrease. The harmonic mean is used because it punishes any disparity between precision and recall.

To better understand precision and recall, suppose that a computer program that recognizes dogs in photographs identifies eight dogs in a picture that contains 12 dogs and a few cats.


Of the eight animals that were identified as dogs, five are dogs (true positives), and the rest are cats (false positives). Seven dogs weren't recognized (false negatives). The program's precision is 5/8, and the program's recall is 5/12. The f1 score is 0.5.
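
The arithmetic for the dog example can be checked with a short Python sketch. The function name precision_recall_f1 is hypothetical. The counts come from the example: 5 true positives, 3 false positives, and 7 false negatives.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and their harmonic mean (the f1 score)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(
    true_positives=5, false_positives=3, false_negatives=7
)
print(precision)         # 0.625, which is 5/8
print(round(recall, 4))  # 0.4167, which is 5/12
print(round(f1, 4))      # 0.5
```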
Conclusion

Through these examples, you can see that data science is a science and an art. Automation can streamline the repetitive tasks that a data scientist does, but during model training and evaluation, the data scientist and business analyst still work together to interpret and optimize AI and data science models in context of the business needs.


Operate


 Ensure operational excellence and choose services, options, and capabilities to run solutions

In the Operate practices, you build a highly available infrastructure that ensures that your application is always available and meets your client's needs. By moving to the cloud, DevOps teams don't need to maintain server hardware or operating systems.

The Operate practices also help you ensure operational excellence by continuously monitoring application status and performance. Shifting operational practices and automating operations tasks reduces operational costs and helps teams meet service-level-agreement (SLA) targets.
Why change?

Maintaining server hardware, operating systems, and software is a significant job. In-house maintenance requires people, time, and capital expense. Development projects can quickly become delayed if resources aren't available at the right time.

With the evolution of the cloud, teams can get an MVP running in production and scale as the market demands. Adding instances, a data center, or even a different geographic region is easy, even for a small DevOps squad.

Teams have many options for running applications. If you're moving an application to the cloud, you might prefer a virtual machine (VM). If your team works with sensitive information, you might need a dedicated private cloud. Teams that are fully cloud based can run on a platform as a service (PaaS) and get started with their favorite runtime. Fully cloud-based teams can leave the maintenance of server hardware, operating systems, middleware, and software to someone else and save on cost and resources.

As the rate of change to your application increases, you need powerful cloud-based management tools and trusted practices. Operations is expensive, but not as expensive as losing customers because your application is unreliable. Automating operations reduces operational costs, helps teams meet SLA targets, and frees teams to focus on higher-value tasks, such as adding customer-requested features to an application. One goal of the Operate practices is to build automation that enables high availability and resiliency while you reduce the cost of the supporting infrastructure and the resources that manage the application in production.
What changes?

While the team develops the code and tests to implement the MVP, the team must choose the infrastructure where the application runs for testing and production. The team also delivers a management and monitoring solution that achieves the SLA for the application that is being developed. To deliver a highly available application, your team must understand the environment that the application runs in and the tools that monitor the environment, notify the operations team when problems occur, and track problems to resolution.


One goal of the Operate practices is to understand how to run the application on your platform of choice. You might run your code in one of these ways:

    Cloud Functions (based on Apache OpenWhisk): An event-driven compute platform that runs application logic in milliseconds in response to events or direct invocations from web or mobile applications

    Instant runtimes (Cloud Foundry): An open source cloud PaaS where developers can build, deploy, run, and scale applications

    Containers: A lightweight, isolated package, similar in purpose to a VM but sharing the host's operating system kernel, with everything that your application needs: code, runtime, and any dependencies

    Virtual Servers (OpenStack): A fully configurable VM that the team can manage and that is an ideal environment for running traditional applications

Both the development and operations teams must fully understand the Operate practices, including the infrastructure and its network and security implications.

When you set up the delivery pipeline, one stage of the pipeline usually deploys to a test environment, and the last stage usually deploys to the production environment. In that context, the development team needs full knowledge of the environment where the application runs. When problems occur, the team must know whether those problems are the result of the underlying infrastructure or in the application itself. Debugging tools, platform dashboards, and services that can automatically configure a running application are all critical tools for both development and operations.
Tune the production environment

Successful applications scale in user and resource requirements over time. When you deploy on the cloud, you can use tools to ensure that your application is optimized as load requirements change.

In the Cloud environment, you can configure the auto-scaling tool to adjust heap size, memory, response time, and the number of application instances based on a policy. As the team better understands the application's usage patterns, the team can schedule configuration changes to reduce resources during periods of low usage and ultimately save time and money.
Control what is seen in production

If your team wants to deploy application features to production but make them available only to specific users, the team can use dark launches and feature toggles. In that scenario, only users who have access to the features can see them.
Build to manage

The first line of defense is to recognize problems that can be handled in an automated way. The development team implements build-to-manage practices that enable them to deliver an application that can be operated efficiently in production. Then, the application is pushed to production in the cloud. The Operate practices enable the push to production to data centers that are used for load balancing and failover.
Manage incidents and monitor continuously

Incident management aims to restore application service as quickly as possible by using a first-responder team that is equipped with automation and well-defined runbooks. To maintain the best possible levels of service quality and availability, the incident management team performs sophisticated monitoring to detect issues before the application is affected. For complex incidents, subject-matter experts collaborate on the investigation and resolution. Stakeholders, such as the application owner, are continuously informed about the status of the incident.

After monitoring is in place, develop policies to ensure that the operations team is notified of incidents as soon as they occur. The policies specify who is notified and when. They also specify how to escalate notifications if problems aren't fixed within a stated period.
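
A notification policy of this kind can be expressed as data. The following is a minimal sketch, assuming a policy that is a list of time thresholds. The role names and the 15-minute and 60-minute thresholds are hypothetical and are not prescribed by the method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the incident has been open
    notify: str         # hypothetical role name


# Hypothetical policy: roles and thresholds are illustrative only.
POLICY = [
    EscalationStep(0, "first-responder"),
    EscalationStep(15, "subject-matter-expert"),
    EscalationStep(60, "application-owner"),
]


def who_to_notify(minutes_open, policy=POLICY):
    """Return every role that should have been notified by now."""
    return [step.notify for step in policy if minutes_open >= step.after_minutes]


print(who_to_notify(20))  # ['first-responder', 'subject-matter-expert']
```

Keeping the policy as data, rather than hard-coding it in the alerting logic, makes it easier to review and change as the team learns how long problems typically take to fix.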
Determine the root cause

Problem management aims to resolve the root causes of incidents to minimize their adverse impact and prevent recurrence. Outages are reviewed through root-cause analysis, incident analysis, dashboards, and collaboration. The team delivers application-specific dashboards and reports that show detailed outage data and metrics to indicate whether the application meets the defined SLA. Usually, the result is more monitoring and automation to prevent outages from happening again.
Prevent and handle recurring problems

In some organizations, the operations team and the development team are the same. Whether operations is done by the development team or a separate team, communication is key. To accelerate the resolution of known issues, development teams often create a runbook. A runbook describes common problems that the application encounters and explains how to identify problems in the log data and how to resolve them. In some cases, the problems can be resolved with automated operations. As team members find issues and resolve problems in the production application, they update the runbooks.

The goal is the same for everyone who is involved: keep the production application available to users all the time. Development must test the resiliency of the application. Chaos testing, also called chaotic testing, involves injecting random failures into the system to improve confidence in the team's ability to recover when failures occur and to reduce the mean time to recovery.

The operations team must understand how to address failures of the production platform, dependent services, and the application. The goal is fast recovery for the application that is in production.

The DevOps team works together to ensure that the application meets the SLA.
Ensure high availability

Application users have high expectations for availability. To create an excellent user experience, the team must ensure that the application is available whenever users need it. The selection of the platform that an application runs on is vital. Whether you run on a bare metal server, a VM, or on Cloud, you need a high-availability strategy that includes the deployment of multiple instances of the application to hardware, VMs, or Cloud data centers that are in diverse locations.

In the Operate practices, a monitoring solution detects incidents across the data centers where the application is deployed. If automated monitoring detects a failure in the data center that hosts the application and no failover location is set up, the users of the application experience an outage. To reduce the outage time, your team can implement a backup instance of the application in another data center with failover automation.

Even with data-center failover in place, problems can still occur. Services that the production application depends on must be monitored to ensure that the team knows when dependent services are experiencing outages. Some service outages result in what appear to be application outages to users. In other cases, only parts of the application's function might be unavailable. Teams can implement the Circuit Breaker pattern to minimize the impact when a subset of the microservices that are required by an application are unavailable.
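
To illustrate the Circuit Breaker pattern, here is a minimal sketch rather than a production implementation. The class name, the thresholds, and the fallback callable are assumptions made for illustration. Production libraries typically add features such as per-error-type handling and metrics.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: open after repeated failures, retry after a pause."""

    def __init__(self, max_failures=3, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency until the reset period has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or reopen) the circuit
            return fallback()
        self.failures = 0  # a success closes the circuit fully
        return result
```

While the circuit is open, callers get the fallback immediately (for example, cached data or a reduced feature set) instead of waiting on a microservice that is known to be unavailable.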
Communicate efficiently

Collaboration tools are key as the team learns about and starts to use the infrastructure and platforms where the applications are deployed. In many cases, you can set up notifications to remind the team about regularly scheduled maintenance or alert the team when a problem exists in the environment.

To facilitate communications when problems occur, your team can implement ChatOps and integrate one or more management and monitoring automation tools. Notifications can be sent to the operations team, development team, and subject-matter experts that must fix problems. They can also be sent to management teams for awareness. With ChatOps, all team members who work on a problem can be continuously updated on progress and see the history of what was done to solve a problem.


Auto scale applications

You developed an amazing cloud application. In fact, it's so successful that you now must watch carefully to ensure that enough resource is allocated on your hosting platform to keep up with demand. No one wants to spend time manually watching usage and adjusting a system configuration based on need. Enter auto scaling.

Application operators use auto scaling to ensure that enough resource is available in the application at peak times and to reduce allocated resource during low usage times. The benefits of auto scaling include customers who are delighted that the application runs well, and savings in cost and electricity. Auto scaling is responsive to actual usage patterns, so it can handle unexpected traffic patterns, such as when a link goes viral.

Based on a policy, auto scaling automatically increases and decreases the resources in your application infrastructure. You can define the policy based on a review of the scaling history. Typically, you can adjust several attributes through the auto-scaling policy:

    Number of application instances

    Heap size

    Memory

    Response time

If you use advanced policies, you can set application characteristics for time periods. For example, if you know that your application is heavily used between the hours of 9 AM and 9 PM, you can set a policy that is specific to that period. You might be wondering, "Why review scaling history or set time-based policies if the system can dynamically adjust to load?"
When not to use auto scaling

Auto scaling is a great way to ensure that you have the resources that you need when you need them. However, limits exist on how much it can, and should, adjust capacity. Increasing the number of instances to cope with sustained heavy load can cause costs to increase. At some point, the graceful degradation of service is preferable to runaway cost. Think about what that point is and put appropriate upper limits in place. Also, set up monitoring and alerting so that the system can ask for help if it is getting near the limits or if a distributed denial-of-service (DDoS) attack is detected.

Auto scaling might not be able to handle sharp spikes. New service instances need a warm-up time before they can handle requests. If the number of services is low, it might take too long to start enough instances to cope with a surge in demand. This issue is called "the cold start problem". For services that run on dedicated hardware, virtual machines, or other systems where the underlying system doesn't have much elasticity, trying to start too many services can cause system stress.

If the system is struggling to cope with high demand, the last thing it needs is to try to start several new services. By doing so, you can create a failure cascade where the service instances are starved of resources and then crash and can't restart.

To avoid this scenario, most auto-scaling services enforce a time window, often called a cooldown period, to prevent haphazard scaling up or down. They adjust the number of instances only if a few minutes have passed since the last scaling event. If you need a large change in capacity, try to anticipate demand to some extent and start instances in a phased way.
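
The interaction between a scaling rule, the time window, and the upper limit can be sketched as a single decision function. This is an illustration under stated assumptions and not the behavior of any particular auto-scaling service. The response-time thresholds, the five-minute window, and the instance limits are hypothetical values.

```python
def desired_instances(current, response_ms, seconds_since_last_scale,
                      min_instances=2, max_instances=10,
                      scale_out_above_ms=500, scale_in_below_ms=100,
                      cooldown_seconds=300):
    """Return the instance count after applying one scaling rule."""
    # Inside the time window, never change the instance count.
    if seconds_since_last_scale < cooldown_seconds:
        return current
    if response_ms > scale_out_above_ms:
        # Phased scale-out: one instance per step, capped at the upper limit.
        return min(current + 1, max_instances)
    if response_ms < scale_in_below_ms:
        return max(current - 1, min_instances)
    return current


print(desired_instances(3, 900, 60))    # 3 (still inside the time window)
print(desired_instances(3, 900, 600))   # 4 (scale out by one)
print(desired_instances(10, 900, 600))  # 10 (upper limit reached)
```

When the upper limit is reached, the function stops adding instances, which is the point where monitoring and alerting should ask for human help.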

Auto scaling doesn't apply to a few types of applications. If your application isn't web based, for example a "Worker" or a background job, auto scaling might not help because response time and throughput metrics won't be available. Likewise, auto scaling in Cloud has a few features that depend on the type of application that you're using.

For example, Liberty for Java™ applications support scaling rules for heap, memory, response time, and throughput, but Node.js applications support scaling rules for only heap, memory, and throughput. For a full list of supported application types and scaling rule support, see the Cloud Docs.
Related practices

Auto scaling complements the microservices approach because in a microservices architecture, you can make scalability decisions for each service, as each service runs in its own container.


Canary testing and feature toggles

Sometimes, you need to push features into production that aren't ready for consumption by all users. Consider these examples:

    For testing purposes: "Will this feature perform reasonably in the production environment?"
    To enable some other feature to be deployed that has a dependency on the new feature: "The UI isn't ready, but you can call the API directly."
    Single stream development: "We always deploy the whole development stream, but turn on features only when they're ready."

To address those examples and limit the impact of potential problems, use canary testing, dark launch, and feature toggles.
Canary testing

A canary release rolls out a change to only a subset of users. Those users test the new function. Canary testing reduces the risk of introducing a defect to all users. When you're satisfied with the results, you can scale up the canary release and roll it out to all users.
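
One common way to choose the subset of users is to hash a stable user identifier into a bucket. The sketch below assumes that each user has a string ID. The function name and the percentage parameter are illustrative. The method itself doesn't prescribe how the subset is chosen.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a stable subset of users in the canary."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < percent


# The same user always gets the same answer for a given percentage.
print(in_canary("user-42", 10) == in_canary("user-42", 10))  # True
```

Because the bucket depends only on the user ID, raising the percentage keeps existing canary users in the canary while more users are added, which matches the idea of scaling up the canary release until it reaches all users.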
Dark launch

Deploying a feature but limiting access to it is commonly called dark launching the feature. You can implement dark launching in any number of ways. The best mechanism depends on the complexity and type of application that you're building.

In a simple web application, the easiest way to dark launch a new feature on a page is to duplicate the page with a different name; for example, ../gitpage.html → ../gitpage2.html. Then, you can update the new page and push both. In this way, only the people who know the new page name can access the new version.

The nice thing about this approach is that the new code is kept separate from the old code. When it's time to make the page available to everyone, you can delete the old page and rename the new one, "taking it out of dark launch mode."

Unfortunately, the dark launch approach has a few downsides. Because the two pages are separate, if both pages are kept "live" for a significant amount of time, any common bugs must be fixed twice. Even worse, because the new page has a new URL, any links from other pages in the app will point at the old page, making it difficult to test flows that cross pages.
Feature toggles

To address these problems, consider a related notion, feature toggles. At its most basic, a feature toggle is just a way to provide extra context to the application, and then based on that context, to change the way that the application behaves. For example, to dark launch a feature in a web app, you can implement both the old behavior and the new feature. Then, you can test a feature toggle on the page to choose which one to show the user. This approach typically makes the code more complex (for old-school C programmers, think #ifdef statements), but it also means that bugs in common code need to be fixed only once. Additionally, the same URL now provides either the old or the new behavior based on the state of the toggle.
Get started

So, how do you implement feature toggles? A single best practice doesn't exist, but you can use a number of different techniques in different situations.

One possibility would be to use a query parameter on the URLs. So, for example, you might use:

    .../manage.html?darklaunch=true

However, if several pages are involved in the flow, the parameter must be passed from page to page.

Another possibility is to test a value that is kept in the browser's local storage. The code can check for something like this example:

    if (localStorage.darkLaunch==="true")

You would turn on this feature flag by going into the browser's debug console and running the following line:

    localStorage.darkLaunch="true"

Because this behavior is controllable directly from the browser, this approach is good for cases where you want to enable the new behavior for particular users, or only while debugging.

Another case is enabling or disabling a particular feature across the whole site, and switching the behavior with no downtime. In this case, you might have a "feature flag service," perhaps something like this example:

    .../home/services/...common.service.IFeatureFlagService

This returns flags for any configurable features, such as:

    {"isNewCoolFeatureEnabled":true}

Even more complex variations are possible, such as having the service return different values based on which user is accessing the application. However, if you push this too far, you end up with pages that you can't debug because it is difficult to understand the issue without knowing the exact context at the time they were invoked. For example, it might be that no two users would necessarily get the same experience.

You also must carefully manage both how you access the service, because it can become a single point of failure, and the service itself, because you might end up with outdated flags.
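
One way to reduce the single-point-of-failure risk is for the application to cache the last flags that it received and to fall back to safe defaults. The following sketch assumes a hypothetical `fetch` callable that returns the flags as a dictionary, such as `{"isNewCoolFeatureEnabled": True}`. The class name and the 60-second refresh interval are also assumptions.

```python
import time


class FeatureFlags:
    """Caches flags from a flag service and falls back to safe defaults."""

    def __init__(self, fetch, defaults, ttl_seconds=60.0, clock=time.monotonic):
        self.fetch = fetch            # hypothetical callable returning a dict of flags
        self.defaults = dict(defaults)
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.cached = {}
        self.fetched_at = None

    def is_enabled(self, name):
        stale = (self.fetched_at is None
                 or self.clock() - self.fetched_at > self.ttl_seconds)
        if stale:
            try:
                self.cached = dict(self.fetch())
                self.fetched_at = self.clock()
            except Exception:
                pass  # service unavailable: keep the last-known flags
        # Unknown flags fall back to the defaults, then to "off".
        return bool(self.cached.get(name, self.defaults.get(name, False)))
```

The refresh interval limits how outdated a flag can become while the service is healthy. This sketch retries the service on every call while it is down. A real client would likely add a backoff.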

Despite these potential issues, dark launching and feature flags are two related, powerful methods to enable new features to be deployed quickly and tested in the production environment.


 Implement high availability for on-premises applications

If your on-premises application fails, you can face significant impacts to your business continuity. To be successful, you must implement a business continuity plan that includes a high availability (HA) and disaster recovery (DR) solution. But how do you select the optimal HADR topology for your solution?
High availability versus disaster recovery

The terms high availability and disaster recovery are often used interchangeably. However, they are two distinct concepts:

    High availability (HA) describes the ability of an application to withstand all planned outages, such as a system upgrade, and unplanned outages, and to provide continuous processing for business-critical applications.

    Disaster recovery (DR) involves a set of policies, tools, and procedures for returning a system, an application, or an entire data center to full operation after a catastrophic interruption. It includes procedures for copying and storing an installed system's essential data in a secure location, and for recovering that data to restore normalcy of operation.

High availability is about avoiding single points of failure and ensuring that the application will continue to process requests. Disaster recovery is about policies and procedures for restoring a system or application to its normal operating condition after the system or application suffered a catastrophic failure or loss of availability of the entire data center.
Develop your HADR solution

To guide the development of a high availability disaster recovery (HADR) solution for your on-premises application, you should consider business challenges, functional requirements, and architecture principles.
Business challenges of your HADR solution

Your HADR solution should address these challenges:

    For business continuity, the application, and the business processes it supports, should remain available and accessible without any interruption, despite man-made or natural disasters. It should serve its intended function seamlessly.

    For continuous availability, a well-designed HA solution maintains an optimal customer experience with quick system response time and real-time execution of transactions.

    The architecture must be capable of processing the additional workload resulting from a spike in business transactions and mitigate the risk for revenue opportunity loss.

    For operational flexibility, you should have a well-designed HA topology with replication of code and data in a secondary site that is separated by sufficient geographical distance. The application can be reconstituted and/or activated at another location, processing the work after an unexpected catastrophic failure at the primary site.

Functional requirements for your HADR solution

You should consider the following functional requirements for your HADR solution:

    Minimize interruptions to the normal operations of the application. If any application component has an availability issue, ensure the smooth and rapid restoration of the application component back to normal operation.

    Restoration of the service of any application component must be completely automated or must be activated by humans with a single action.

    Monitor the availability of each component of the application. Alert in case of service level issues, such as slow response time or no response from any application component. Activate rapid restoration by automation or by a single action performed by a human specialist responsible for the high availability of the application.

Architecture principles affecting your HADR solution

The events that can cause a software application to fail to process user or other system requests can be divided into three categories. Each requires a different technique to mitigate.

    Events that involve the unexpected failure of only one component of the system, such as an operating system process, a physical machine, or the network link connecting members of the system.

    Events that involve simultaneous unexpected failure across many components of the system. These events might be triggered by a natural disaster, human error, or a combination of both.

    Events caused by human error that involve logical corruption of a primary datastore by persisting incorrect or incoherent content into it.

Example: An on-premises B2B order application

In this example, an organization develops an on-premises application to process online orders from its B2B customers. The B2B order application uses multiple components providing specific services, such as user interface, product catalog, order creation, workflow, decision and integration services, and analytics. An ERP application stores the product data, such as price and inventory, and orders. The catalog application manages the unstructured data related to the product, such as images. The online order application uses a NoSQL database for storing its product catalog and a traditional RDBMS for its analytics and ERP back-end application.


Data center topology options for HADR

To achieve high availability, you can select from two deployment topology options—a two data centers architecture and a three data centers architecture. You would set up your B2B order application the same way in a highly available cluster configuration in both the primary and the secondary data centers.
Two data centers topology

You can configure a two data centers topology in either active-standby mode or active-active mode. The simplest configuration is the active-standby topology, where the B2B order application in the secondary data center is in cold standby mode. In the active-active topology, the application and the services it uses are active in both data centers.
Three data centers topology

The configuration for three data centers has two variants, active-active-active and active-active-standby. In the active-active-standby configuration, the application and services are in active mode in the primary and secondary data centers, while the application is in standby mode in the third data center.
Disaster recovery scenarios

When a disaster strikes, the topology and configuration choices you made will determine how your application recovers. You need to understand the costs and benefits associated with each to determine the optimal one for your needs.
Disaster recovery with a two data centers topology

Active-active and active-standby are the two possible configurations for this scenario. In both cases, you must have continuous replication of data between the two data centers.


The active-active configuration provides higher availability, with less human involvement, than the active-standby configuration. Requests are served from both data centers. You should configure the edge services (load balancer) with appropriate timeout and retry logic to automatically route the request to the second data center if a failure occurs in the first data center environment.
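
The retry-and-reroute behavior at the edge can be sketched as follows. This is a simplified illustration and not a load balancer configuration. The function name, the `(name, handler)` pairs, and the single retry per site are assumptions, and each handler is assumed to raise `TimeoutError` or `ConnectionError` when its data center doesn't respond in time.

```python
def route_request(request, data_centers, retries_per_site=1):
    """Try each data center in order; return the first successful response."""
    errors = []
    for name, handler in data_centers:
        # One initial attempt plus the configured number of retries.
        for _attempt in range(1 + retries_per_site):
            try:
                return name, handler(request)
            except (TimeoutError, ConnectionError) as exc:
                errors.append((name, exc))
    raise RuntimeError(f"all data centers failed: {errors}")


def first_site(_request):
    raise TimeoutError("first data center timed out")  # simulated outage


def second_site(request):
    return f"ok:{request}"


print(route_request("order-1", [("dc1", first_site), ("dc2", second_site)]))
# ('dc2', 'ok:order-1')
```

Short timeouts and a small retry count keep the failover fast. Long timeouts would make users wait on a data center that is already failing.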

Benefits of this configuration are reduced recovery time objective (RTO) and recovery point objective (RPO). For the RPO requirement, data synchronization between the two active data centers must be extremely timely to allow seamless request flow.


In the active-standby configuration, requests are served from the active site. In the event of an outage or application failure, preparatory work is performed to make the standby data center ready to serve requests. Switching from the active to the standby data center is a time-consuming operation. Both recovery time objective (RTO) and recovery point objective (RPO) are higher compared to the active-active configuration.

The standby data center can be either a hot or cold standby environment. In the hot standby option, the order application and associated services are deployed to both data centers, but the load balancer directs traffic only to the application in the active data center. The benefit of this configuration is that the hot standby data center is ready to be activated when the active data center experiences a disaster. The DR procedure only requires reconfiguring the load balancer to redirect the traffic to the newly activated data center. The drawback of hot standby is that the second data center is kept active and the application is kept up-to-date, but it is not used to process customer requests. The software license is applicable to both data centers, although only one is actively in use.

In the cold standby option, the order application and associated services are deployed to both data centers, but are not started in the standby data center. If the active data center experiences a disaster, the DR procedure includes starting the application and services, and reconfiguring the load balancer to redirect the traffic. This option is cost-effective in terms of the software license cost and the data center operations cost, including the personnel. However, the application availability might suffer, depending on how quickly the cold standby data center and the order application can be started and activated to process requests.

When the application in the primary data center is restored after the outage, you can modify the edge service DNS to route user requests to the now active application in the primary data center. The application in the secondary data center can be switched back to standby mode.
Disaster recovery with a three data centers topology

In this era of Always On service with zero tolerance for downtime, customers expect every business service to remain accessible around the clock anywhere in the world. A cost-effective strategy for enterprises involves architecting your infrastructure for continuous availability rather than building disaster recovery infrastructures.

A three data centers topology provides greater resiliency and availability than two data centers. It can offer better performance by spreading the load more evenly across the data centers. A variant of this is to deploy two applications in one data center and deploy the third application in the second data center, if the enterprise has only two data centers. Alternatively, you can deploy business logic and presentation layers in the 3-active topology and deploy the data layer in the 2-active topology.

Two possible configurations are considered for this scenario, an active-active-active (3-active) and active-active-standby configuration. In both cases, a continuous replication of data is required between the data centers.


In the active-active-active (3-active) configuration, requests are served by the application running in any of the three active data centers. A case study indicates that 3-active requires only 50% of the compute, memory, and network capacity per cluster, but 2-active requires 100% per cluster. The data layer is where the cost difference stands out. For further details, read Always On: Assess, Design, Implement, and Manage Continuous Availability.
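
One plausible reading of the 50% and 100% figures is that the surviving clusters must carry the full peak load after one data center fails. The source doesn't state the derivation, so treat the following sketch as an assumption. The function name and the single-failure default are illustrative.

```python
from fractions import Fraction


def capacity_per_cluster(active_sites, tolerated_failures=1):
    """Fraction of total peak load each active cluster must be able to carry."""
    survivors = active_sites - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one site must survive")
    return Fraction(1, survivors)


for sites in (2, 3):
    share = capacity_per_cluster(sites)
    print(f"{sites}-active: {float(share):.0%} per cluster")
# 2-active: 100% per cluster
# 3-active: 50% per cluster
```

Under this assumption, the total provisioned capacity is 200% of peak load for 2-active and 150% for 3-active.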


In the active-active-standby configuration, when either of the two active applications in the primary and secondary data centers suffers an outage, the standby application in the third data center is activated. The DR procedure described in the two data centers scenario is followed for restoring normalcy to process customer requests. The standby application in the third data center can be set up in either a hot or a cold standby configuration.
Data replication across data centers

The procedure and technique for the continuous replication of data between the databases in the three data centers should follow the standard, established practices recommended by the vendors and the customer's existing corporate IT standards and procedures.

Make use of database management tools, such as the Db2 HADR feature and Oracle Data Guard, to replicate the database contents to a remote site.

    Replicate the SQL database using vendor-specific data-mirroring technology to mirror analytics data from the primary site to the secondary site.

    Replicate the NoSQL database so that the data is copied from the primary data center site to the secondary data center site.

    Replicate the ERP database using vendor-specific data-mirroring technology so that the order data is mirrored from the primary site to the secondary site.

Select the optimal topology for your HADR solution

How you implement your HADR solution is an important architecture decision that affects the continuous availability of the services provided by your on-premises application. While an active-active-active configuration provides the greatest resiliency, it is the most costly topology. An active-standby configuration is the most cost-effective but can reduce application availability. You should select the topology that best meets the needs of your business continuity and operational flexibility.

See these solution architectures for the topologies discussed in this article:

    On-premises high availability disaster recovery: Active-active-standby topology

    On-premises high availability disaster recovery: Active-active topology

    On-premises high availability disaster recovery: Active-standby topology


The principles of modern service management

Modern service management and operations refers to all the activities that an organization does to plan, design, deliver, operate, and control the applications in an enterprise. It includes the people who do the work, processes that define what work is needed and how it is done, and tools to enable and support these activities. Applications are monitored to ensure availability and performance according to service level agreements (SLAs) or service level objectives (SLOs).

Operations should be as agile as development, with continuous delivery practices and an emphasis on continuous improvement. Service management must transform to support this paradigm shift. The transformation has implications in various areas:

    Organization: Instead of a discrete operations organization that is distinct from the development team, full lifecycle responsibility is provided through small DevOps teams. Another approach is site reliability engineering (SRE), which brings a strong engineering focus to operations. SRE emphasizes automation to scale operations as load increases.

    Process: A key concept of DevOps is the automated and continuous testing, deployment, and release of functions. Service management processes, such as change management processes and the role of the change advisory board, must change to support this notion.

    Tools: Because time is of the essence in restoring a service, incident management tools must provide rapid access to the right information, support automation, and provide instant collaboration with the right subject-matter experts (SMEs). The term ChatOps describes a collaborative way to perform operations. Bot technology integrates service management and DevOps tools into this collaboration.

    Culture: As with any transformation project, you must consider a few cultural aspects. One example is the need for a blameless postmortem culture where the root cause of an incident is revealed and the organization can learn from it.

Slow is the new down

One reason that service management paradigms are shifting is that customers' expectations for services are shifting. Customers demand fast service and the rapid delivery of new products and features. If your mobile app or website is slow and does not perform, your site might as well be down. Your customer will take their business elsewhere.

Benefits of service management

A well-designed service management architecture provides several benefits:

    Maximizes operational effectiveness. Ensures the availability and performance of applications that run on Cloud, given the target SLA of 99.99% availability for applications.

    Increases operational reliability and agility by using event-driven guidance, automation, and notification to prompt the right activity to resolve important issues.

    Improves operational efficiency by using real-time analytics to identify and resolve problems faster. Reduces costs by creating a single view and central consolidation point for events and problem reports from your operations environment.

    Establishes and maintains consistency of the application's performance and its functional and physical attributes with its requirements, design, and operational information.

    Manages and controls operational risks and threats.

Five principles of service management

To effectively manage modern applications and consider the service management and operational facets of their applications, your operations team can follow five principles:

    Operations: Services need management. Operations activities typically include placement of workload based on resource requirements, rollouts and rollbacks, service discovery and load balancing, horizontal scaling, and recovery. Cloud platforms such as Kubernetes help with many of these activities by providing functions for self-healing, dynamic discovery, and automated rollbacks. Operational activities should also include compliance checks (ideally automated, regular, and done in production) and data backup.

    Monitoring: Services also need to be monitored. When you are deciding what to monitor, be guided by the experience of the user of the service. The user might be a human for front-facing services or a system for back-end services. Key metrics typically are availability, performance, response time, latency, and error rate. To ensure that you detect an issue before it causes an outage, prioritize monitoring of the four golden signals: latency, traffic, error rate, and saturation. Traditional metrics such as CPU, memory, and disk space are less relevant in a cloud context.

    Eventing and alerting: What happens if the monitoring solution detects a problem? An alerting system must notify first responders if a problem is detected, either by using email, SMS text message, or an alert in an instant messaging system. A single problem can cause a cascading failure across multiple systems, so the alert system must be able to correlate related events from different sources.

    Collaboration: A first responder is the first person, but probably not the only person, who helps to resolve an incident. In an architecture where many services depend on each other, expertise across multiple areas or systems is likely needed. The term ChatOps describes the process of using an instant-messaging communication platform to collaborate among SMEs and automated tools. Through the ChatOps platform, all interaction is logged in a central place and you can browse through the log to see what actions were taken.

    Root-cause analysis: To prevent an incident from recurring, the root cause must be identified. Follow the 5 Hows approach, which helps to surface the issue that was ultimately responsible for an incident. This investigation must be conducted in a blameless culture; only then are people willing to share their insights and help others to learn from the experience.
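
The monitoring principle above prioritizes the four golden signals. As a minimal sketch, assuming request records are collected over a fixed time window, the signals could be summarized as follows. The Request record, the function names, and the capacity inputs are hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    ok: bool            # False if the request ended in an error


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; pct is an integer 1-100."""
    if not sorted_values:
        return 0.0
    rank = (pct * len(sorted_values) + 99) // 100  # ceiling of pct% of n
    return sorted_values[rank - 1]


def golden_signals(requests, window_seconds, used_capacity, total_capacity):
    """Summarize latency, traffic, error rate, and saturation for one window."""
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": percentile(durations, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests) if requests else 0.0,
        "saturation": used_capacity / total_capacity,
    }
```

An alerting rule could then compare each value with a threshold that is derived from the SLO of the service.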

Observability

Observability is the ability to understand the running system by observing external outputs (that is, without stopping it and taking it apart). Provide observability by instrumenting the application and services. Extend the management and monitoring of containers by using sidecars and a service mesh framework such as Istio. Most importantly, monitor the service as it is experienced by the user.

Service developers should expose a health check API. These health checks should be tested automatically on every deployment.
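
A health check API can be very small. The following sketch uses only the Python standard library; the /health path, the port, and the dependency checks in CHECKS are assumptions for illustration, not a prescribed interface.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_status(checks):
    """Run each dependency check and return (http_status, json_body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception:
            results[name] = "failing"
    healthy = all(state == "ok" for state in results.values())
    body = json.dumps({"status": "up" if healthy else "down", "checks": results})
    return (200 if healthy else 503), body


# Hypothetical dependency checks; replace them with real probes.
CHECKS = {"database": lambda: True, "cache": lambda: True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_status(CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


# To serve the endpoint locally (assumed host and port):
# HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```

Because health_status is a plain function, a deployment pipeline can test it directly without starting the server.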

Shift left and build to manage

Operations isn't just the responsibility of the Ops team. Developers should be adding instrumentation to support observability. They are the ones who know how to create runbooks and how to analyze logs and traces to identify and solve issues. The development team should also be using automation to test and deploy your applications as early in your development cycle as possible. (What should be automated? Everything!)


What is incident management?

Incident management is the practice of restoring a damaged service as quickly as possible by using a first-responder team that is equipped with automation and well-defined runbooks. To maintain the best possible levels of service quality and availability, incident management uses sophisticated monitoring to detect issues before the service is affected. For complex incidents, subject matter experts (SMEs) collaborate on the investigation and resolution. Stakeholders, such as the application owner, are continuously informed about the status of the incident.

Developers do their best to write applications that perform well, are robust, and provide great user experience. But things can go wrong, resulting in the unavailability or slowness of the application. The objective of incident management is to restore the service as quickly as possible when things go wrong.


Cloud often brings higher expectations of availability, performance, and reliability. Enterprises need new approaches to handle incidents, such as redundancy, automation, and collaboration, paired with organizational and cultural changes.


How to build the incident management toolchain

When you look at the Incident management reference architecture and toolchain, you might be overwhelmed with the number of functions. Although the functions are the recommended capabilities of a robust incident management solution, no one has all the functions at the beginning. The architecture is a suggested journey map to build the toolchain.

The core of the solution is monitoring to detect outages, performance saturation, and more. Because you can't afford to have your staff continuously watch consoles, the next critical element is notification to alert the right SME when something is going wrong.

People often need to collaborate with SMEs to isolate the issue and to define a mitigation strategy. Rather than relying on email and telephones, you can use ChatOps, where people collaborate by using instant messaging with each other, and potentially with tools and systems.


In addition to the monitoring and active probing of services and APIs, the monitoring of log files is an important function to add. This monitoring can help to identify issues before the service is affected. It can also expedite the incident identification and resolution phase.

As the load increases and the application landscape becomes more complex, first responders start suffering from too many alerts. They receive alerts that are related to symptoms and causes. Some alerts might not be actionable. Events might not provide sufficient context to act on them quickly, such as a service level agreement (SLA) or impact data. At this point, introduce event management to your toolchain. Event management correlates related events, removes noise to show only actionable alerts, and enriches those events with more context.
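
The three event management functions (correlate, remove noise, enrich) can be illustrated with a small sketch. It assumes that events are dictionaries with hypothetical service, message, and actionable fields, and that events for the same service are related; real event management tools use far richer correlation rules.

```python
from collections import defaultdict


def manage_events(events, context_by_service):
    """Group events by service, drop non-actionable ones, and add context."""
    grouped = defaultdict(list)
    for event in events:
        # Noise removal: skip events that are flagged as not actionable.
        if event.get("actionable", True):
            grouped[event["service"]].append(event)

    alerts = []
    for service, related in grouped.items():
        alerts.append({
            "service": service,
            "count": len(related),  # how many raw events were correlated
            "messages": sorted({e["message"] for e in related}),
            # Enrichment: attach SLA or impact data when it is known.
            "context": context_by_service.get(service, {}),
        })
    return alerts
```

With this shape, a first responder receives one enriched alert per affected service instead of every raw event.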


To respond to issues quickly, you add runbooks and automation. Runbooks can be invoked automatically, either to run diagnostic commands or to try to mitigate the issue. Runbooks can also be manually run by the first responder and incident resolver. To avoid logging in to a system and risking mistyped commands, semi‐automated runbooks provide secure and consistent execution.
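
A semi-automated runbook can be pictured as an ordered list of steps in which diagnostic steps run unattended and mitigation steps wait for confirmation. The following is a sketch under that assumption; the step tuple layout and the confirm callback are hypothetical.

```python
def run_runbook(steps, confirm):
    """Run steps in order; ask for confirmation before mitigation steps.

    Each step is a (name, kind, action) tuple, where kind is "diagnostic"
    or "mitigation" and action is a callable that returns a result.
    """
    journal = []
    for name, kind, action in steps:
        if kind == "mitigation" and not confirm(name):
            journal.append((name, "skipped"))
            continue
        try:
            journal.append((name, action()))
        except Exception as error:
            # Stop at the first failure so that later steps do not run blindly.
            journal.append((name, f"failed: {error}"))
            break
    return journal
```

The returned journal can be attached to the incident ticket so that the actions taken are recorded consistently.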

As you add more tools, you need visibility across the entire landscape. This visibility doesn't replace existing product UIs, but instead complements them and provides a combined view of the environment in persona‐specific dashboards. Ideally, these views also show extra information, such as deployment activities or SLA information.

Finally, incident information is tracked persistently in ticketing tools, which provide a source of truth for SLA calculation. Enterprises especially need to maintain an audit trail for all incidents. The start and end of an incident and any major updates are tracked. Integration across the entire toolchain automates the population of this activity journal. You can use this information to detect trends in the environment and to take the right countermeasures.


Incident management concepts

A sophisticated monitoring infrastructure detects deviation from normal behavior, such as a degradation in response time, and alerts the operations team about incidents. First responders, who are always on call, act to identify the component at fault and to restore the service as quickly as they can. They do so by using automation and runbooks to remove the dependency and risk that are associated with the manual execution of tasks.

While first responders use dashboards that provide visibility into the application and its landscape, they don't stare at consoles waiting for alarms to happen. Instead, they're notified about actionable alerts. These alerts are aggregated by various monitoring systems and are correlated and enriched with relevant information, such as the application name, the impacted user community and stakeholders, and SLA information. These alerts are actionable, ideally with a clear description of the mitigation. By using call rotation and on-call lists, the alarm is sent to the right first responder to take the necessary action.
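
Call rotation can be as simple as a fixed-length shift that cycles through a list. This sketch assumes weekly shifts and a hypothetical list of responder names; real on-call tools also handle overrides, time zones, and escalation.

```python
from datetime import date


def on_call_responder(rotation, rotation_start, today, shift_days=7):
    """Return who is on call today for a rotation that began on rotation_start."""
    elapsed_shifts = (today - rotation_start).days // shift_days
    return rotation[elapsed_shifts % len(rotation)]
```

For example, with a rotation of three people that starts on a Monday, the first person is on call again in the fourth week.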

Alerts that can't quickly be pinpointed to a mitigation require more analysis. SMEs across multiple domains collaborate to isolate the incident and to identify an effective response. Technologies like ChatOps help with this collaboration. DevOps and service management tools are also integrated through bot agents. The incident commander coordinates these tasks and maintains transparent communications to the affected stakeholders.

The objective of incident management is to restore the service. The team doesn't waste time analyzing the root cause of the problem. That analysis is done in the next step: problem management. Incident management approaches include the restart of a microservice, a configuration of the load balancer to ignore the failing instance, or a rollback to the previous version. Typical DevOps principles such as Blue-Green deployment (continuous delivery) ease the implementation of these approaches.


ChatOps and the virtual war room

Presented by: Robert Barron

Solving problems efficiently is the key to a business that performs well and has happy users. But working around a table in a physical war room isn't practical, especially not today. ChatOps is the integration of development tools, operations tools, and processes into a collaboration platform so that teams can communicate and manage the flow of their work. ChatOps supports the creation of a virtual war room, where collaboration is more efficient for people and the bots that help them. This webinar introduces the concepts of ChatOps and virtual war rooms and demonstrates the journey to achieve them.


Benefits of implementing incident management

    Improved availability and performance of applications and services. Incident management supports the need for high availability through proactive monitoring and the rapid restoration of services.

    Manage and control operational risks and threats. Effectively manage change, mitigate new threats from interconnected services and infrastructures, and ensure compliance.


What is problem management?

Problem management is the practice of resolving the root cause of an incident to minimize its adverse impact on the service and to prevent the recurrence of that incident and similar incidents.

The goal of incident management is to restore the service as quickly as possible, and it does so by finding an immediate tactical solution to the incident. After critical incidents occur and are resolved, use problem management to identify and resolve the root cause. If you don't, you implicitly decide that it's acceptable for the same incident to happen again.

Dig deep and truly identify the root cause of the incident. Otherwise, you fix only the symptoms and not the underlying cause, and similar incidents might occur. To learn how to include problem management in your process, review the Problem management reference architecture.


We need to minimize the adverse impact of incidents caused by errors, and to prevent the recurrence of incidents related to these errors. A solution can only be as good as the problem statement, and a blameless root cause analysis is the problem statement for IT operations. A “postmortem” debriefing should be considered first and foremost a learning opportunity, not a fixing one.


Problem management techniques

The 5 Hows is an iterative interrogative technique that you can use to explore the cause-and-effect relationships that underlie a particular problem. The technique determines the root cause of a defect or problem by repeating the question "How?" recursively so that each answer forms the basis of the next question.

The goal of this root cause analysis process is to identify why the incident happened and to put sufficient measures in place so that a similar incident doesn't happen. These measures might target the application itself, the application architecture, or the infrastructure and management environment.

After the root cause analysis, hold a postmortem meeting. The purpose of the postmortem is to find out what happened and to define actions to improve the organization. It also can provide insights into how the team can better respond to future incidents.

A culture of blameless postmortem is critical to allow the organization to learn from past mistakes. Only when people are free from fear of punishment and can openly share their mistakes can others learn from the experience and be prevented from making the same mistakes.

Benefits of implementing problem management

    Reduction in incident volumes: Effective use of problem management techniques can stop an issue from occurring multiple times or prevent the issue from happening in the future.

    Improved overall quality of IT services: Because people expect cloud services to be constantly available, repeated problems can result in a loss of confidence in the reliability of your application.

    Improved organization knowledge and learning: As the organization implements a structured approach to problem management, it can learn from previous mistakes and use that knowledge to prevent failures or outages.


Automate operations

In the past, many teams ran their services in third-party data centers where they rented cages and outfitted them with the most advanced hardware with high, up-front costs. For those teams, a common issue was calculating how the up-front costs translated into monthly expenditures so that they could be compared with other costs, such as operations personnel, software licensing, and bandwidth subscriptions.

As teams started to shift their operations to an infrastructure-as-a-service model, they started to receive monthly bills for the compute, storage, and network utilization. For the first time, teams saw a clear picture of how they were spending their money. Do you know what they found? They found that it's really expensive to run a proper SaaS service!

Get started: Understand the cost of running a service

That discovery was no surprise to the people who actually build and run the platform. But for the decision makers and financiers, the transformation from capital expenditures to operating expenditures for all of operations was shocking. Unlike what most people thought, the largest cost component of a solution was not the use of compute, storage, and network, but the required investment in operations personnel. The cost of infrastructure, even at scale, is negligible compared to just the average monthly overhead of an employee, regardless of the employee's country of residence.

To estimate the cost of a solution, you can use a formula with three factors:

    (1) The number of people who are required to touch (2) some number of systems (3) some number of times.

If you can reduce any of those three factors, you can decrease the cost of the solution and improve profitability.
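
The text names three factors but does not say how they combine. As a rough sketch, assuming that the factors multiply and that each touch has a hypothetical average cost, the effect of reducing one factor can be estimated like this:

```python
def operations_touches(people, systems, touches_per_system):
    """Total manual touches, assuming the three factors multiply."""
    return people * systems * touches_per_system


def operations_cost(people, systems, touches_per_system, cost_per_touch):
    """Estimated cost; cost_per_touch is a hypothetical average."""
    return operations_touches(people, systems, touches_per_system) * cost_per_touch
```

Under this assumption, cutting the touches for each system from three to one reduces the estimate by two thirds, even when the number of systems stays the same.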

If you're required to log in to a server, you failed

Think in terms of operator-to-server ratio, which is another great metric in the cloud industry. The popular brands in cloud, such as Facebook and Google, claim openly to have operator-to-server ratios of something like 1:10,000. At that scale, operators don't have time to log in to individual servers. Instead, they are riding around data centers on scooters, looking for smoking blade servers and replacing them. Everything must be automated, from beginning to end. Servers are cattle, not pets. If the servers misbehave, they are replaced.

What if your platform is nowhere near the 1:10,000 operator-to-server ratio? Perhaps your ratio isn't bad, but you still have a lot to do. The only way to improve is to aim for a high target. Stay in this mindset: "If you are required to log in to a server to do your job, you've failed."

Which tasks do your operators do each day that keep them from higher value work, such as improving the system? Can those tasks be automated, or better yet, be eliminated by improving the system?

The benefits of unattended automation

By using unattended automation to reduce the number of times people need to touch the systems, you can decrease the cost of the solution and improve profitability. Some might argue that you don't need to worry as much about decreasing the number of systems because an increasing number of systems in the solution is usually a sign of growth and health. Real profitability occurs when you can decrease the number of times people are required to touch the systems. This allows operators to focus on higher-value work and it also leads to an improvement in the operator-to-server ratio.

DevOps is about building a highly functioning and skilled organization that is responsible for the full lifecycle of its solution. That responsibility includes the elimination of mundane, highly repetitive tasks that artificially limit the skill growth of operations personnel and prevent them from achieving great things. What can you do so that fewer people are required to touch fewer systems fewer times?


Automate application monitoring

Automated monitoring is essential to every successful DevOps project. Knowing that your application or service is available and functioning within service level agreements (SLAs) is vital. Most applications strive to provide 99.999% availability. That's less than 6 minutes of downtime over the course of a year. Automated monitoring is the best way to ensure that applications are always functioning.
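
The downtime figure follows from simple arithmetic. This sketch converts an availability target into a downtime budget, assuming a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget, in minutes, for an availability target over a period."""
    return period_minutes * (100 - availability_percent) / 100
```

For 99.999% availability, the budget is about 5.26 minutes per year. For 99.99%, it is about 52.6 minutes per year.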

In an effort to guarantee this kind of availability, application owners build or purchase monitoring tools that measure application response time every few minutes from around the globe. Basic availability can be measured through simple URL pings, which prove that a URL can be resolved and can return an expected result within an expected amount of time. More sophisticated monitors might involve authentication and traversing several dialogs: locating a form, entering a user name and password, and then validating the results.

Monitoring can involve agent-based monitors or synthetic monitors. Agent-based monitoring tools require that one or more agents be installed to analyze the details of code, server, user activity, or other data. Synthetic monitoring tools don't require the installation of an agent; instead, they simulate user traffic so that you can determine whether your application or site is performing correctly.
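
A basic synthetic check requests a URL and verifies both the result and the response time. The following sketch uses the Python standard library; the function name, the default thresholds, and the injectable opener parameter (useful for testing without a network) are assumptions for illustration.

```python
import time
import urllib.request


def check_url(url, expected_status=200, max_seconds=2.0,
              opener=urllib.request.urlopen):
    """Request a URL and report whether it answered as expected and in time."""
    started = time.monotonic()
    try:
        with opener(url, timeout=max_seconds) as response:
            status = response.status
    except OSError as error:
        # URLError, HTTPError, and socket timeouts are OSError subclasses.
        return {"url": url, "up": False, "reason": str(error)}
    elapsed = time.monotonic() - started
    up = status == expected_status and elapsed <= max_seconds
    return {"url": url, "up": up, "status": status, "seconds": elapsed}
```

A scheduler could run this check every few minutes from several locations and forward failures to the notification tooling.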

Often, applications are built from multiple microservices, such as an authentication service or an entitlement service. Because all of these services play a role in the aggregate responsiveness of your application, they too should be targets of automated monitoring. Monitoring at both the aggregate and individual levels also helps to isolate problems. The quick identification of a failing component leads to faster resolution.

Practices for implementing automated monitoring

    Monitor your application and the things that it depends on. If you establish monitors for each component of the application and the top-level interfaces, you can identify bottlenecks faster.

    Start early. Learn how to monitor an application, what to monitor, and how often. Use monitors to learn the trends of your application. Understand the dependencies that your application has on other components or services. Study the problems that are detected and those that aren't. Add monitoring where needed to clarify the complexities and fill the gaps. Measure the time it takes to find problems and restore services.

    Integrate automated monitoring with rich notification tooling. Many teams use tools like PagerDuty to notify on-call personnel when a problem occurs. If acknowledgements are not received, escalation policies can notify additional members of the team.

    Collaborate. Automated monitoring eventually detects some form of failure or performance degradation. Today's applications are complex and rely on many services, so resolving issues often takes several subject-matter experts. Use collaboration tools, such as Slack or Google Hangouts, to collectively solve problems.

    Test your monitors. Test that the automated monitoring that you established quickly detects downtime or poor performance. Simulate outages and degraded performance to see how each is reflected by the tools.

    Investigate multiple monitoring tools. Many tools have features that are uniquely useful for one situation or application type. Custom extensions or programming skill level might be factors in determining which tool is the best fit for you.

The benefits of automated monitoring

    Your website can become more reliable and provide a better user experience.

    You can find problems before your users call or open a support ticket. No one likes the inconvenience of a problem. People become frustrated with downtime or poor performance and often go elsewhere. A sign of good automated monitoring is being able to recognize trends that lead to a problem. When you can fix problems before they happen, you've reached monitoring nirvana.

How to get started

Getting started with automated monitoring is simple. Cloud makes it easy by providing a built-in logging mechanism that produces log files for your apps as they run. The logs can show errors, warnings, and informational messages and can also be configured to log custom messages from your apps. To make sense of the logs and the availability of your apps, Cloud offers the Monitoring and Analytics service, which monitors your application's performance and includes features for log analysis. The Monitoring and Analytics service provides several advantages:

    Instant visibility and transparency into your application's performance and health without the need to learn or deploy other tools

    Faster innovation, as you spend less time fixing bugs and addressing performance issues and more time developing new features

    Quick identification of the root cause of an issue, through the use of line-of-code diagnostics

    Faster time to resolution of your application's problems, as you can use embedded analytics to search log and metric data

    Reduced maintenance costs, as you can keep your application running with minimal effort

Outside of Cloud, many tools in this space offer some form of free trial or limited edition to help you get started. Sign up and begin monitoring today. Simple monitoring can be configured in minutes. The number of monitors, the number of locations that they can be run from, and the amount of historical data might be limited. However, don't let those limitations stop you from getting started.


Implement a high availability architecture

In today's global marketplace, websites are expected to be always available. The Garage Method for Cloud website is no different; it has a service level agreement (SLA) goal of being available 99.999% of the time. The website is hosted on the Cloud environment, and code is frequently being delivered to production. As a result, opportunities for errors and downtime abound. In this environment, it is critical to have a strategy to release new code into production with zero downtime.

To meet the SLA goal, the development team took these actions:

    Implement a continuous delivery process by using Cloud Continuous Delivery.
        Implement a "Deploy to Test" stage.
        Implement blue-green deployment.
        Deploy the production website to multiple Cloud data centers.

    Implement automated monitoring and outage notifications.

    Capture and maintain application log information to troubleshoot outages.

    Write and maintain runbooks to troubleshoot operational issues.

    Surface SLA reports that clearly show daily, weekly, and monthly outage data.

Implement a continuous delivery process by using Cloud Continuous Delivery

To create and manage the build and deployment of the website, the team adopted the Delivery Pipeline in Cloud Continuous Delivery.


Implement a "Deploy to Test" stage

To avoid disruptions that can occur when developers deploy directly to production, the team created a delivery pipeline that includes a "Deploy to Test" stage. The purpose of the stage is to isolate the team's developers from the production website. In this stage, the team runs acceptance tests by using Sauce Labs to validate that the website is ready to push to production.

Implement blue-green deployment

The Garage Method for Cloud website is continuously delivered—as often as daily. To ensure that the transition to the upgraded version of the website has zero downtime, the team implemented blue-green deployment. As new function is pushed to production, it is deployed to an instance that isn't the actual running instance. After the new application instance is validated, the public URL is mapped to the new instance of the application.

Blue-green deployment involves these steps:

    If the blue app exists, manually delete it before you restart.

    Push a new version of the blue app.

    Set environment variables for the blue app.

    Create and bind services for the blue app.

    Start the blue app.

    Test the blue app.

    Map traffic to the new version of the blue app by binding it to the public host.

    Delete the temporary route for the blue app that was used for testing.

    Rename the green app to "green app backup." The backup application continues to run so that active sessions are not terminated.

    Rename the blue app to "green app."

The team completes the blue-green deployment steps by using the built-in Cloud Foundry command line interface.
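
The ordering of the steps above can be sketched as an in-memory simulation. This is not the Cloud Foundry CLI; the apps and routes dictionaries, the blue-test host, and the test callback are hypothetical stand-ins for the platform.

```python
def blue_green_deploy(apps, routes, new_version, public_host, test):
    """Simulate the blue-green steps.

    apps maps an app name to its version; routes maps a host to an app name.
    """
    apps.pop("blue", None)               # step 1: delete a leftover blue app
    apps["blue"] = new_version           # steps 2-5: push, configure, bind, start
    routes["blue-test"] = "blue"         # temporary route that is used for testing
    if not test("blue-test"):            # step 6: test the blue app
        # Public traffic was never remapped, so green keeps serving users.
        raise RuntimeError("blue app failed validation")
    routes[public_host] = "blue"         # step 7: map public traffic to blue
    del routes["blue-test"]              # step 8: delete the temporary route
    if "green" in apps:                  # step 9: keep the old version as backup
        apps["green-backup"] = apps.pop("green")
    apps["green"] = apps.pop("blue")     # step 10: blue becomes the new green
    # This model stores routes by app name, so the route follows the rename.
    routes[public_host] = "green"
    return apps, routes
```

The key property is that the public host is remapped only after the new instance passes its tests, so a failed validation leaves the running version untouched.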

Deploy the production website to multiple Cloud data centers

The primary instance of the website runs in the Cloud US South data center. To ensure that the team can handle outages and maintain its SLA, the team created failover sites in the Cloud UK and Sydney data centers. When the team set up the failover sites, the team had to create a space in each data center and create a stage in the pipeline for each of the failover sites.

After a new feature is placed into production in US South, it is then pushed to the London and Sydney sites so that all the sites are consistent.

Requests to cloud/architecture are routed by using Akamai. Akamai polls the US South, UK, and Sydney health check URLs to determine whether they are up and running by looking for an HTTP 200 response. If Akamai detects that the application that is running on the US South data center is not responding, requests are routed to the application that is running on the UK data center.
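
The routing decision can be modeled as picking the first site, in priority order, whose health check returns HTTP 200. This sketch does not use any Akamai API; the site identifiers and the probe callback are hypothetical.

```python
def choose_site(sites_in_priority_order, probe):
    """Return the first site whose health check answers HTTP 200, else None."""
    for site in sites_in_priority_order:
        if probe(site) == 200:
            return site
    return None
```

With the priority order US South, UK, Sydney, requests move to the UK site only when the US South health check stops returning 200.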


Implement automated monitoring and outage notifications

You can choose from among several tools to implement automated monitoring and outage notifications: Cloud Monitoring with Sysdig, Cloud Activity Tracker with LogDNA, PagerDuty, and New Relic. To measure outages on the website, the team uses the New Relic Cloud service. This service continuously monitors the availability of the website by checking it once per minute from nine locations around the world. If an incident is detected, New Relic calls the PagerDuty service, which contacts the operations staff that is responsible for fixing issues with the website. PagerDuty notifies people in three ways:

    It sends a message to a Slack channel that the team uses to monitor outages.
    It pages a preconfigured list of operations and support personnel.
    After a preconfigured time of no acknowledgment, it escalates and notifies operations management of the issue.
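
The notification flow above can be expressed as a small decision function. This is an illustration only and not the PagerDuty API; the target names and the 15-minute escalation delay are assumptions.

```python
def notification_targets(minutes_since_alert, acknowledged,
                         escalation_after_minutes=15):
    """Return who is notified for an alert at a given point in time."""
    targets = ["slack-channel", "on-call-list"]
    # Escalate only when nobody acknowledged the alert within the delay.
    if not acknowledged and minutes_since_alert >= escalation_after_minutes:
        targets.append("operations-management")
    return targets
```

An acknowledged alert never reaches operations management, which keeps escalation for cases where the first line of response is unavailable.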


The Cloud US South, London, and Sydney sites are also monitored individually so that the operations team can be notified if any site is unavailable or needs attention. Redundancy is strongest when all three sites are available.

Write and maintain runbooks to resolve operational issues

Preventing downtime in applications is the best way to ensure high availability. Unfortunately, failures can still occur. Runbooks are explicit procedures for first responders to follow. When you have the appropriate access and know exactly what to do, you can act quickly when failures happen, and minimize the downtime that is associated with an application instance. With tools like PagerDuty, teams can directly link a runbook to an incident.

Capture and maintain application log information to troubleshoot outages

When problems occur on the production website, whether due to infrastructure outages or coding errors, the team that is tasked with getting the website back online must have access to all of the critical error log files and other troubleshooting information to identify and fix the problem. Cloud keeps only a limited amount of log information, which might not be enough if you are troubleshooting an issue that occurred hours ago or if the application produces so many log messages that the relevant messages are overwritten.

To access older logs, teams need a log management service that supports the 'syslog' protocol, such as SumoLogic or Splunk. Application logs should be streamed to that service. For more information about how to register a service with your application on Cloud, see the Cloud Docs.
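
As a minimal sketch of streaming application logs over syslog, the following uses the `SysLogHandler` class from Python's standard library. The function name `build_syslog_logger`, the logger name, and the host and port in the usage note are placeholders. A hosted log service typically has its own endpoint, transport, and authentication requirements, so treat this as a starting point rather than a working integration.

```python
import logging
import logging.handlers


def build_syslog_logger(host: str, port: int) -> logging.Logger:
    """Create a logger that forwards records to a syslog endpoint over UDP."""
    logger = logging.getLogger("website")  # hypothetical logger name
    logger.setLevel(logging.INFO)

    # SysLogHandler uses UDP by default when given a (host, port) address.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(
        logging.Formatter("%(name)s: %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

For example, `build_syslog_logger("127.0.0.1", 5514).error("checkout failed")` sends one record to a collector that listens on that local port.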
Surface SLA reports that show daily, weekly, and monthly outage data

The New Relic service provides reports that the team uses to determine whether it is meeting the 99.999% availability goal. This information provides critical insight into whether more actions are required to maintain availability.
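
To make the 99.999% goal concrete, it helps to convert it into a downtime budget for each reporting period. The helper below is a sketch; the function name `allowed_downtime_seconds` is an assumption, and the month is taken to be 30 days.

```python
def allowed_downtime_seconds(availability_pct: float, period_hours: float) -> float:
    """Return the downtime budget, in seconds, for a period of the given length."""
    unavailable_fraction = 1 - availability_pct / 100
    return period_hours * 3600 * unavailable_fraction


# Budgets for the daily, weekly, and monthly (30-day) reports.
daily = allowed_downtime_seconds(99.999, 24)
weekly = allowed_downtime_seconds(99.999, 24 * 7)
monthly = allowed_downtime_seconds(99.999, 24 * 30)
```

At 99.999%, the budget is roughly 0.86 seconds per day, 6 seconds per week, and 26 seconds per 30-day month, which is why even a short outage shows up clearly in the reports.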

Over time, the website became much more reliable as the team added the failover sites and site monitoring. In particular, implementing the failover sites in the Cloud environment was a critical part of meeting the 99.999% uptime goal. The website can now continue running while regularly scheduled maintenance is done on the Cloud infrastructure.


The Garage Method for Cloud website ecosystem

The following diagram shows the end-to-end ecosystem to develop, deliver, run, and manage the website.


 Operational readiness

When your applications fail and it takes time to determine the root cause and restore service, customers get frustrated. You want to ensure that your customers are delighted. An assessment of your organization's operational readiness answers three questions:

    What needs to change?
    How significant is the change?
    What are the expected benefits?

Your answers to these questions identify the gaps that you need to close.

Follow these three steps to get started:

    Assess where you are. Engage in an operational readiness review to examine all key operational processes and to determine the as-is versus the to-be state.

    Determine where you need to be. Cost and risk tradeoffs are inherent in all processes. Assess each process to determine where you need to be.

    Improve and assess continuously. Identify gaps where processes don't meet minimum requirements. Put plans in place to address the gaps. As your organization matures, repeat the whole process regularly.

As you adopt cloud technologies and move workloads to the cloud, be sure to adapt your guidelines. Examine your readiness from two perspectives.
Operationalize the cloud

As your company adopts a new platform, its processes, roles, and responsibilities must be revisited to determine whether they still apply. The same is true when you adopt a public, dedicated, or private cloud.


Operationalizing the cloud starts with understanding the roles and responsibilities of the cloud consumer and the cloud provider. Consider using a RACI matrix to detail your operational activities and who is responsible, accountable, consulted, and informed.
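
A RACI matrix can be kept as simple structured data and checked automatically. The sketch below is illustrative: the activities, role names, and the `find_gaps` helper are hypothetical, and the check applies a common RACI convention (exactly one accountable role and at least one responsible role per activity) rather than a rule stated in this method.

```python
from typing import Dict, List

# Hypothetical activities and role assignments, for illustration only.
# R = responsible, A = accountable, C = consulted, I = informed.
RACI: Dict[str, Dict[str, str]] = {
    "Patch platform infrastructure": {
        "provider operations": "R",
        "provider service owner": "A",
        "consumer operations": "I",
    },
    "Deploy application releases": {
        "consumer developers": "R",
        "consumer product owner": "A",
        "consumer operations": "C",
    },
}


def find_gaps(matrix: Dict[str, Dict[str, str]]) -> List[str]:
    """Return activities that lack exactly one 'A' or at least one 'R'."""
    gaps = []
    for activity, roles in matrix.items():
        letters = list(roles.values())
        if letters.count("A") != 1 or letters.count("R") < 1:
            gaps.append(activity)
    return gaps
```

Running `find_gaps(RACI)` on the sample data returns an empty list. An activity with no accountable role would be returned by name.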


Operationalize application readiness

As you adopt a microservices approach for your applications, be sure to establish guidelines and processes to keep your services robust and serviceable. For more information, see Operationalizing your application readiness.


Learn


 Continuously experiment to deliver the right solution

Learn how your team works together and how customers use the applications that you deliver by studying analytics data.
Why change?

You designed, built, and delivered your application. It’s running on the cloud and you’re managing it to high levels of quality. But how do you know whether you were successful?

You need to gather metrics and learn from your running application by using techniques such as hypothesis-driven development. You can use that feedback to understand what users find valuable or where they abandon the application.

Analyze metrics across three major categories:

    Usage analytics: Usage analytics metrics indicate how users interact with the application. Are they using the new features in the way that you expected? Are they having a good experience, or are they getting stuck? You can use these metrics to identify features of your application that customers use heavily and features that they don’t, either because the features are difficult or uninteresting. Your team uses this information as input into the definition and prioritization of its work.

    Team analytics: Team analytics metrics provide information about the development team's performance. This type of data helps to identify process improvements that make the development team more efficient and drive higher quality into the application.

    Business analytics: Business analytics metrics show whether the product is meeting the defined business key performance indicators (KPIs) and goals. Is the site attracting new users through social media, search, and business partners? Are you meeting the funnel conversion rate? Are you reaching the revenue targets?

You can incorporate this learning into the next round of capabilities as you start the cycle again.
What changes?

From the time when the team forms the initial MVP, the team uses hypothesis-driven development to ensure that it builds what the customer wants. In hypothesis-driven development, the team forms hypotheses and then proves or disproves them. 


Formulate hypotheses

Hypotheses are written and stored in the backlog as work items for the development team. They are integrated into the backlog and ranked with all the team's work. When a hypothesis is proved or disproved, you might need to write more user stories to address work that must be done as a result of the experiment.

You might be wondering what the difference is between user stories and hypotheses.

    Agile user stories generally follow this format: As a role, I want some function so that benefit.

    A hypothesis follows this format: We believe that function will lead to outcome and this will be proved when measurable condition.

    For example: We believe that adding links to related web pages will lead to users going deeper into the site content and will be proved when we see an increase in the page views for the pages that are linked.
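
If hypotheses are stored as backlog work items, the format above maps naturally onto a small record type. This is a sketch; the class name `Hypothesis` and its field names are assumptions, and the sample values come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    function: str              # what the team plans to deliver
    outcome: str               # the change in behavior that is expected
    measurable_condition: str  # the evidence that proves the hypothesis

    def statement(self) -> str:
        """Render the hypothesis in the standard sentence format."""
        return (
            f"We believe that {self.function} will lead to {self.outcome} "
            f"and this will be proved when {self.measurable_condition}."
        )


related_links = Hypothesis(
    function="adding links to related web pages",
    outcome="users going deeper into the site content",
    measurable_condition="we see an increase in the page views for the pages that are linked",
)
```

Keeping the measurable condition as its own field makes it harder to file a hypothesis that has no way to be proved or disproved.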

Test different versions

Some hypotheses require the team to determine which version of a solution is most appealing. You can use A/B testing to compare how users react to two different versions that provide the same function. At the end of the test period, you can use empirical data to ensure that you pick the implementation that the users like the best.

When a team implements hypothesis-driven development, it must be able to do these tasks:

    Formulate hypotheses that are based on usage data

    Measure the results of the hypotheses according to usage data

Based on the results of the experiments, the team proves, disproves, or abandons the hypotheses. The team must decide how much time to spend gathering data to either prove or disprove a hypothesis. The amount of time that you need to measure results depends on the function that was delivered and differs for every hypothesis.

Generally, the team does as little work as possible to either prove or disprove a hypothesis. Sometimes the result of the experiment creates more work for the team. This work is captured in user stories and is added to the rank-ordered backlog. Other times, the team learns that it was headed in the wrong direction. In that scenario, the team shifts direction, or pivots.
Communicate efficiently

In periodic playbacks, teams share progress with the broader audience of stakeholders, including sponsor users. Sponsor users are customers who are invited to spend time with the development team throughout the process to provide their perspectives. Playbacks also include demos of the most recent user stories that were achieved, the results of hypothesis tests that were completed in the market, and new information that the team learned about the target audience. Playbacks give stakeholders the opportunity to see how much progress the team made in its development effort and provide input and guidance.

Between playbacks, the team views shared analytics dashboards so that everyone can see customer usage data and participate in the formulation of hypotheses.

Make sure that your team periodically gathers for retrospective analysis. A retrospective is a way to reflect on how things are going and identify what is done well and what can be improved. Teams can conduct a retrospective as a meeting or by using a collaboration tool, such as MURAL or Slack. Gather input from everyone on the team and vote to determine the most pressing issues. Then, open work items to make sure that the issues are addressed. The team must be free to change how it works if it thinks something can work better.
Learn continuously to drive business decisions

After it completes the initial MVP and delivers the application to the cloud, your team continues to expand its knowledge about the customers' experience. Based on what the team learns, it enhances the design to add or modify features in the application. Each addition or modification to the application is defined as another MVP so that it can be developed, delivered quickly, and validated by customers. This fundamental part of the process is also known as the Build-Measure-Learn loop.


 Hypothesis driven development

In the turbulent world of modern business, software developers are challenged by many forces, but two in particular stand out:

    Developers must deliver new function to customers in increasingly shorter time frames because the world is moving faster than ever before, powered by new technology and new ways of working.

    Developers must also deliver to a rapidly evolving and increasingly selective customer base that demands a better experience in every way. If you're a developer who builds for public consumption, you contend with competitors who all deliver their finest efforts, and only the best offerings survive. Even if you build applications purely for internal purposes, you are likely finding that people expect more as they sample the delights on mobile and web platforms, both in and outside of the workplace.

People can try different applications and services easily, and most people have been engaged by a highly compelling and delightful experience. When something does not measure up in comparison, they know, and they move on to try the next option.
The evolution of the product development process

No company or industry is safe from disruption, no matter how established it might be. The old way of discovering and documenting user requirements, building a system "to spec," and delivering it months or years later and hoping it's still the best solution is woefully inadequate.

Instead, developers must be prepared to evolve continuously and rapidly, sometimes even far from their original plan. This behavior has become so common that the term pivot is used to describe the act of completely abandoning one plan for another.


Hypotheses instead of requirements

To facilitate this highly evolutionary approach, developers use hypotheses instead of requirements. Requirements are a specification of something that must be built and delivered to customers. Hypotheses are provisional conjectures that must be proven. If you disprove a critical hypothesis, you might need to pivot and create another set of hypotheses. When you use hypotheses, you recognize that the world is complex, constantly changing, and sometimes confusing. Making investments that are based on assumptions and intuition can potentially waste time and resources. To deliver what customers want at the speed that they demand, you must hypothesize and make data-based decisions. Experiment early and often, solicit feedback from customers as to what works for them, and discard any features that provide little benefit or hinder customers.


Two of the most critical, and therefore risky, hypotheses are related to value and growth. If customers won't value your offering, or if you can't scale your business, you will fail.

Traditionally, user requirements are captured as user stories where a persona requests a function to achieve a goal. Sometimes the requirement is embellished to include the benefit of having the function. For example:

    As a role I want function [so that benefit].

In contrast, hypotheses pair a statement that asserts or predicts value with a testable condition that can be measured. The typical form is as follows:

    We believe that function will lead to outcome and this will be proven when measurable condition.

Where possible, the signal that is being measured should be an actionable metric and not a vanity metric, as covered later in this article.
Examples of hypotheses

When you formulate hypotheses, you can associate one or more user stories that are developed to implement the function that is defined in a specific hypothesis. Not every user story requires a hypothesis.
Example 1

Problem statement: Our team created a landing page to encourage web developers to try our new technology. Data shows that visitors are coming to the page, but few try our new technology. Instead, they leave the page or click other links on the page and don't return. We suspect that our start trial link is not obvious enough and that visitors don't notice it.

Hypothesis: Visitors, and web developers in particular, will click the start trial link and then sign up for a trial of our technology if we redesign our site landing page to prominently feature a Start trial button and remove other extraneous content and links. We believe that this redesign will lead to increased trial engagement by web developers, as measured by (a) an increase in clicks of the Start trial button, (b) a reduction in the bounce rate for the landing page, and (c) an increase in registrations for trials.

Related function to implement in one or more user stories: Modify the site landing page based on the new design to prominently feature the Start trial button and remove extraneous content and links.
Example 2

Problem statement: Our team found that its website gets little site traffic from social media sources. Yet, we believe that many of our site's targeted users are active on social media and blog sites.

Hypothesis: Publishing a blog post and a related tweet will increase the number of targeted users who visit the site and try the features that are highlighted in the blog post and tweet. This expected change in user behavior will be measured in the first 2 - 4 weeks after the blog post and tweet are published. The change will be measured by (a) an increase in social media referral traffic to the link that is cited in the blog post and tweet and (b) increased use of the features that are highlighted in the blog post and tweet.

Related user story work: In user stories that deliver a new feature, add a task that requires the writing of a blog post and a tweet to make users aware of the new features.
Explore the possibilities

The act of proving or disproving a hypothesis is always a learning opportunity, no matter what the outcome is. Even if you can't prove the hypothesis, the outcome might provide valuable insights that you can use in another hypothesis.

Adopt a mindset of continuous experimentation, where you explore multiple options. If a baseline function exists and a hypothesis asserts that an improvement to this function will be beneficial, you can conduct a split test, which is also known as an A/B test. In a split test, one set of users sees option A (the current function), and another set of users sees option B (the new function). With the data from the test, you can determine which option is the best.
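
One way to split users between option A and option B is to hash a stable user identifier, so that the same user always sees the same option. The sketch below assumes an even 50/50 split. The function name `assign_variant` and the experiment label are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically assign a user to option "A" or option "B"."""
    # Including the experiment name keeps assignments independent across tests.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()

    # Map the first 8 bytes of the hash onto a bucket from 0 to 99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "A" if bucket < 50 else "B"
```

Because the assignment depends only on the identifier and the experiment name, no assignment table needs to be stored, and changing the threshold changes the share of users who see each option.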

Even if no baseline exists and you are exploring the idea of adding a new function, it is still valuable to build multiple prototypes and compare them in split testing. View the entire exercise as an experiment. In that mindset, you can do the minimal work required to conduct your experiment and gain the insight that you seek. In some cases, you might not even have to write code, as you can use surveys, interviews, or other means to gather the required data.
Manage hypotheses and experiments

If you view all feature development work through the lens of hypotheses, you can have a backlog of hypotheses that is ranked by risk and a kanban view of hypotheses that are in varying states. In this case, a useful kanban view includes at least two states: one where an experiment is being prepared to validate the hypothesis, and another where active validation is underway and data is being measured. When you complete an experiment, you can mark the hypothesis as proven, disproved, or even as abandoned if you decided to end the experiment early based on other findings.
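
The kanban states described above can be modeled as a small state machine. The state names follow the text, but the allowed transitions and the `move` helper are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    BACKLOG = "backlog"
    PREPARING = "preparing experiment"
    VALIDATING = "validating"
    PROVEN = "proven"
    DISPROVED = "disproved"
    ABANDONED = "abandoned"


# Assumed transitions: a hypothesis can be abandoned at any point before it is resolved.
ALLOWED = {
    State.BACKLOG: {State.PREPARING, State.ABANDONED},
    State.PREPARING: {State.VALIDATING, State.ABANDONED},
    State.VALIDATING: {State.PROVEN, State.DISPROVED, State.ABANDONED},
    State.PROVEN: set(),
    State.DISPROVED: set(),
    State.ABANDONED: set(),
}


def move(current: State, target: State) -> State:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

For example, `move(State.VALIDATING, State.PROVEN)` succeeds, while moving a hypothesis straight from the backlog to proven raises an error because no experiment was run.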

By always tackling the highest risk hypotheses first, you can ensure that you reach any needed pivots as quickly as possible and avoid investing in pointless work.
Practice hypothesis-driven development

The development team for the website experiments frequently and practices hypothesis-driven development in several ways:

    Establish metrics

    Establish success and failure criteria

    Determine the usefulness of a feature by testing it on a small subset of users

    Run multiple experiments continuously

    Make fact-based decisions quickly

    Deliver faster so that you can experiment faster

    Establish a mechanism to enable system-wide experimentation, such as Google Analytics or Digital Analytics

    Consider different models of experimentation, such as classical A/B testing and multi-armed bandit
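
To show how a multi-armed bandit differs from a fixed A/B split, the following is a minimal epsilon-greedy sketch, which is one simple bandit strategy among several. The class name `EpsilonGreedy`, the default exploration rate of 0.1, and the reward of 1.0 for a conversion are assumptions.

```python
import random
from typing import Dict, List, Optional


class EpsilonGreedy:
    """Mostly show the best-performing variant, but keep exploring the others."""

    def __init__(self, variants: List[str], epsilon: float = 0.1,
                 seed: Optional[int] = None) -> None:
        self.variants = list(variants)
        self.epsilon = epsilon
        self.shows: Dict[str, int] = {v: 0 for v in self.variants}
        self.rewards: Dict[str, float] = {v: 0.0 for v in self.variants}
        self._rng = random.Random(seed)

    def _mean_reward(self, variant: str) -> float:
        shows = self.shows[variant]
        # Untried variants rank first so that each one is shown at least once.
        return self.rewards[variant] / shows if shows else float("inf")

    def choose(self) -> str:
        """Pick a variant to show to the next user."""
        if self._rng.random() < self.epsilon:
            return self._rng.choice(self.variants)    # explore
        return max(self.variants, key=self._mean_reward)  # exploit

    def record(self, variant: str, reward: float) -> None:
        """Record the outcome, for example 1.0 for a conversion and 0.0 otherwise."""
        self.shows[variant] += 1
        self.rewards[variant] += reward
```

Unlike a classical A/B test, which holds the split fixed until the test ends, this approach shifts traffic toward the better variant while the experiment is still running.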

Use metrics properly

Metrics are essential to making good data-based decisions, and metrics such as key performance indicators (KPIs) are often used to measure or keep score for how well a person or business is doing. When metrics are used effectively and appropriately, they are an essential part of an execution cycle. Unfortunately, metrics can be overused and abused. Management by numbers exclusively can lead to misguided efforts and eventual failure, especially when what's being measured is not important.
Measure what matters

Vanity metrics can be used to present a picture of success and make everyone feel good, even when the real story is different. These measurements are often superficial, easily measured, and sometimes, just as easily manipulated. In a world where so much can be measured, it's easy to become overwhelmed by a mass of useless and irrelevant metrics that show details of a state without any insight as to what led to that state. Consequently, it's unclear how to repeat or avoid that state or improve it.

Actionable metrics are the metrics that really matter. They are the result of specific and repeatable actions, and they can help you make decisions about what to do next in the pursuit of your goals.

Metrics that go deeper than vanity metrics aren't necessarily actionable. For example, revenue is one measure of how your business is really doing, as opposed to the vanity metric of how many people visit your website. However, it's pointless to know that revenue increased if you don't know what caused the increase. In that case, you don't have a way to repeat the success, let alone amplify it. If, however, you measure revenue before and after a specific change that affects a cohort of your users who are isolated from any other change over the same period of time, you have something actionable. And you can use the approach of hypothesis-driven development to experiment and find the best way to add value.

Vanity metrics and actionable metrics aren't exclusive to website behavior. You can measure so-called developer or team productivity in hundreds of ways, but few of these are truly relevant. It doesn't matter if a team is churning out lots of code if that code causes problems and must be rewritten in the next iteration. It doesn't matter if the team is rapidly delivering the solution that everyone agreed to build if it's the wrong solution and no one will buy it.

To get started, work with your team to find the few measurements that everyone believes are the ones that really matter. Then, share them so that everyone can see them before, after, and during any actions the team takes.


Run user experiments to validate your hypotheses

You know your user and their problems and you have a great idea that will change their world. All systems go—time to make your vision a reality.

Not so fast. At this point, all you have is an idea that you think will improve what you think is the user's biggest problem. Maybe the idea is not as world-changing as you think. Maybe what you think is their biggest problem is actually not that important to them.

Jumping into development before you validate your idea is taking a huge risk. If your idea wasn't everything you thought it was, you wasted all of that development time.
Hypothesis testing and prototyping

The best way to reduce the risk of doing the wrong thing is to validate your biggest assumptions early, with small investments, and move forward only when you know you're going in the right direction.

For example, Alice is a client of a bank. The bank wants to provide Alice with a way to help her begin saving. The bank's big idea is that if they provide Alice with a fun and simple way to start saving, she will start to save and then internalize the value of saving and grow her relationship with the bank. How would you validate that this idea is a good one?

This process has five simple steps:

    Identify your big assumptions. Assumptions can be anywhere. Which ones will have the biggest impact if they are wrong?

    Form a hypothesis. If this really is a great idea, what do you expect to observe if you brought the idea to your users?

    Prototype. Figure out what the smallest, cheapest thing you can build is that you can use to experiment to validate your hypothesis. Then, build it.

    Run the experiment. Give the prototype to select users and observe what happens.

    Learn. Did you observe what you hypothesized? If so, great! Move forward. If not, learn and repeat.

Identify your big assumptions

Assumptions can be anywhere. Was everything in your empathy map, persona, and scenario map based on verifiable user research, or did you use your best guess on some things? If your guess was wrong, does that invalidate your idea? Look back at all of your sticky notes and collect the ones that you're not sure about. For each sticky note, ask yourself how confident you are in that idea and what the risk is if it turns out to be wrong. From the example, does Alice want to start saving? Is the reason she's not saving because it's too hard to start?
Form a hypothesis

A hypothesis takes this form: If we provide USER with SOLUTION, we will observe: METRIC.

You already know who your user is, and you have an idea for your solution. The big question is: What measurable thing do you expect to happen if you give that solution to the user? Be specific and make sure that the metric can be tied directly to the solution.

Looking back at Alice and the bank, what sort of metrics would the bank expect if this saving app really was a great idea? Maybe they could measure how many times a week she uses the app. Or they could measure whether it changes Alice's spending habits or if she likes it enough to share it on social media.
Prototype

You can build the solution that you need to run the test in dozens of ways; some are elegant, some are hacked together. At this stage, favor "quick and dirty" over elegant. What's the quickest thing that you can build in order to test the hypothesis? What you build can be as simple as a paper prototype, or it might need to be more technically complete. In the earlier example, does the bank need to build a full-scale app, or can they learn with something smaller? Does the app need to be cross-platform, or iOS only? Is one language good enough? Does it need to be tied into all the other banking systems?
Run the experiment

Find sample users, give them the prototype, and run the experiment. Keep in mind that the results are only as good as the experimental setup. If at all possible, involve people who know how to run user testing. You can start by finding just a handful of Alice's friends and putting the app in their hands. Remember: you're not doing a full-scale rollout of a new app. You're giving the app to a few people for a specific reason.
Learn

Analyze your experimental results and see if they agree with your hypothesis. If they do, consider your assumption to be validated. If they don't, revise your thinking accordingly and try again. How often did Alice use it? Did she share it with her friends? Did it change her spending behavior? Compare the results with your hypothesis, and that will validate or invalidate your assumptions.

You might need to go through this hypothesis testing cycle several times until you validate all of your key assumptions. Each iteration will help you know whether you're on the right track, and thus reduce your risk of building the wrong thing.


A/B test your site

A/B testing is a process by which you compare two versions of a web page to see which is more effective. It's a clear way to get empirical data so that you can determine which approach works better and is more productive.

Consider a web page that offers promotional items that are sold on a retail website. With A/B testing, you provide two versions of a web page: one that offers the current approach and one that offers a new approach. During the test phase, website visitors are randomly routed to each web page while you track which page produces better results. As testing progresses, you can compare the number of promotional items sold.

In another example, consider a web page that offers free registrations for a social media website. Through A/B testing, you can compare two versions of the web page to see which version gets more registrations. A measurable difference in results indicates whether the new approach is more effective than the current approach.
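
To judge whether a difference in registrations is measurable rather than random noise, one common approach is a two-proportion z-test. The sketch below is one way to do that and is not prescribed by this method. The function name `two_proportion_z_test` is an assumption, and the test relies on a normal approximation that is reasonable only when each version has a fairly large number of visitors.

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> Tuple[float, float]:
    """Compare conversion rates of versions A and B.

    Returns the z statistic and the two-sided p-value.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b

    # Pooled rate under the assumption that both versions perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in the data; the test is undefined")

    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A small p-value suggests that the difference between the two versions is unlikely to be due to chance alone. Decide on the threshold before the test starts, not after you see the results.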
Practices for implementing A/B testing

    Define a clear measure of success, such as items sold or time spent on the page.
    Offer two approaches simultaneously. You want comparable results for the same period of time.
    Use the new approach only for new users, not returning ones. Don't frustrate returning users with multiple approaches.
    Use the new approach consistently across the website. If a widget is shown on multiple pages, show it in the same place on every page.
    Measure and monitor the two approaches over time. The data might show trends. The totals might not tell the whole story.
    Run multiple tests. The first alternative might not be the best. In addition, analyzing the results of small changes is clearer than analyzing the results of major changes.

To perform A/B testing, you can use free trials of tools like Maxymiser and Optimizely.
The benefits of A/B testing

By performing A/B testing and applying what you learn from the results, you can make your website more efficient and provide a better user experience. You can use empirical data to ensure that you use the best approaches.


 Run playbacks to gather feedback

Playbacks are meetings that are used to keep stakeholders, clients, and teams in sync. Playbacks occur at different points during a project:

    Persona and scenario definition

    Minimum viable product (MVP) definition, including the user experience

    Development delivery of completed stories

The value of playbacks

Playbacks help ensure that the team is aligned on the personas that will use the service and the scenarios to include to satisfy the MVP.
Playbacks in practice

Effective playbacks incorporate these practices:

    Being concise

    Scheduling enough time to listen to stakeholder feedback

    Focusing on the value of the scenario rather than the technical implementation details


MVP definition

When playbacks are part of the MVP definition, the goal is to align the entire team around the scenarios that are part of the MVP and the user interface to achieve those goals.

A great deal of design work must occur before the playback of the MVP and user experience. The team must address open issues around the market direction, the needs of the target users, and the technologies that development will use to implement the MVP.

During the playback, the team presents the scenarios that reflect the MVP, the storyboards, and the wireframes to get agreement on the MVP. The wireframes are presented with the understanding that they will be refined. The wireframes are revisited in the development playbacks, where they continue to be refined and implemented.
Development playbacks and continuous delivery

Development regularly holds playbacks and demonstrates completed stories, ideally weekly. Feedback from the playback can help to determine whether the MVP value is being achieved or a change is required.

In a project that is being delivered continuously, all types of playbacks occur throughout the life of the project. As developers work on the delivery of a part of the MVP, the next set of MVP function is identified. At any point, a team might be conducting all three types of playbacks.


Sponsor user playbacks

Internal playbacks are valuable, but it's also important to get direct feedback from sponsor users. Throughout the project, the product manager and design lead might want to hold playback meetings with sponsor users who represent the personas that will use the service or product. Through these playbacks, the team can obtain feedback to improve the product and focus on what the customer wants.


Improve your offering through business analytics

Business analytics deals with the user journey, satisfaction, and effectiveness of the application that you provide. This data can provide insight into the following areas:

    The user journey through the app; for example, understanding a person's journey from trying a new service to purchasing it.

    Difficult areas of the user interface; for example, the point when people experience difficulty with the app and decide to not use it anymore.

    Identifying people who might be good candidates for sales to contact.

How to implement business analytics

    Instrument your code to provide data to enable insights. Make sure that the instrumentation includes this information:

        Page-usage data for aggregation and analysis.

        The logs and metrics that you need to understand the application flow and performance. If you use microservices, you must understand the interaction between the capabilities in the microservices.

    Collect the data over time so that analytics can be run against the data. Store the data in a common data store that enables analysis and reporting for the development team and product owners.

    Understand the user flows for your product and how to move customers to the desired end results. For example, you might create a funnel that channels customers from marketing campaigns, to experimentation with the product, to purchase of the product. Keep refining the funnel until you understand the customers' usage patterns and the desired results of application use.

    Review business analytics based on the changes that were made in previous iterations. Such reviews help the team identify areas for improvement and learn how its app is being used.
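
The funnel in the steps above can be measured once page-usage data is aggregated into a count for each stage. The sketch below computes step-to-step conversion rates. The function name `funnel_conversion`, the stage names, and the counts are made up for illustration.

```python
from typing import List, Tuple


def funnel_conversion(stages: List[Tuple[str, int]]) -> List[Tuple[str, float]]:
    """Return the conversion rate into each stage from the stage before it."""
    rates = []
    for (_, prev_count), (name, count) in zip(stages, stages[1:]):
        # Guard against an empty earlier stage to avoid dividing by zero.
        rate = count / prev_count if prev_count else 0.0
        rates.append((name, rate))
    return rates


# Hypothetical counts of customers who reached each stage.
sample_funnel = [
    ("campaign visit", 1000),
    ("trial started", 200),
    ("purchase", 50),
]
sample_rates = funnel_conversion(sample_funnel)
```

With the sample counts, 20% of campaign visitors start a trial and 25% of trial users purchase. Comparing these rates across iterations shows which step of the funnel a change affected.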

How to get started with business analytics

To integrate business analytics into your development process, practice hypothesis-driven development.