Introduction

Situation

Running a software company is extremely challenging. Every company, even a successful one, constantly struggles to balance finding the right product for the right customer, keeping up with changing market demand, bringing in money, maintaining product quality, controlling the pace of company growth, and many other conditions for success. As a result, 99% of them close within the first few years.

What defines a company as successful?

A successful company is one with a positive and continually growing net worth that protects itself as much as possible from undesired closure.

This work is a set of qualities that contribute to company success, based on industry best practices. It is structured around the organization's departments while focusing on simplicity, standards, information accessibility, and monitoring.

  • Human Resources (HR) - responsible for recruiting skilled and motivated staff and maximizing their performance.
  • Marketing - responsible for identifying and profitably satisfying customer demand.
  • Finance - responsible for providing financial insights vital to the company's current well-being and to its ability to make future strategic decisions.
  • Information Technology (IT) - responsible for providing technological infrastructure to support business operations.
  • Research and Development (R&D) - responsible for engineering a versatile, high-quality product according to the Marketing department's requirements.

Task

Improve the rate at which the company's net worth increases and protect the company from undesired closure.

Action

Go over the items in the list and pick the ones you believe are vital for your company. Tick the "Save" checkbox next to those items; you will be able to print them (or save them to PDF) later by clicking the "Print" button in the right sidebar.

If you are unsure what an item means, use your favorite search engine and other means to do proper research. Refer to the Terminology section for a better understanding of the terms used. If an item is out of place, outdated, poorly described, or misspelled, improve it. See Contributing.

Once you have a list of potential action items, take some time for research. Consult the people you work with and plan how to implement the change. Always remember that the goal of your company is to make money, as much and as fast as possible; avoid being distracted by ideas that do not help the company achieve that goal.

Implement the change and measure success.

Results

When the changes are implemented, there should be a noticeable improvement in the company's performance. The rate of making money should increase, while inventory and operational expenses should decrease. This is the most desirable outcome. If it did not happen, try to understand why. Did you measure things properly? What other changes were introduced in parallel with yours?

Share the results with me (leonid@komarovsky.info); I will be glad to hear about your challenges and progress.


Disclaimer

TBD


License

This work is licensed under a CC Attribution-NonCommercial-NoDerivatives 4.0 International License.


Community

Discuss and contribute

This page provides text annotations to enable public discussion and personal note-taking (a great service by hypothes.is). To start a new discussion, simply select any text on the page and click "Annotate." I periodically summarize public annotations and update this document with relevant content.

You can also contact me directly at leonid@komarovsky.info with any questions or ideas you want to discuss personally.

Share

Know someone who might need to use this work? Share the link with them.

Subscribe for latest updates

Donate

Please donate to help improve and maintain this project. Here are some ideas for features that could be added:

  • Personalization - store results, track progress, and receive relevant articles and tips by email
  • Add the ability to perform a full company assessment and provide a detailed summary
  • Add links to external resources with detailed information and best practices for each item
  • Add a short description for each item
  • Add a list of relevant tools for each item


Operations

Operations Department is responsible for defining and controlling cross-company processes focused on increasing the throughput of making money, while simultaneously reducing both inventory and operating expense.

General

  Save
Performance of every group in the organizational hierarchy is measured by a set of Key Performance Indicators (KPIs)
Key Performance Indicators of every group in the organizational hierarchy are aligned with the company's objectives
Values of Key Performance Indicators of every group in the organizational hierarchy are visible to the group members (for example, wall-mounted displays showing KPIs)
Key Performance Indicators of every group in the organizational hierarchy are periodically reevaluated and updated
The number of Key Performance Indicators per group in the organizational hierarchy is between 3 and 7
Both leading and lagging Key Performance Indicators of every group in the organizational hierarchy are monitored
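
The items above suggest a small KPI set per group that mixes leading and lagging indicators. A minimal sketch, with invented indicator names and a hypothetical data structure, of how such a set could be defined and checked in code:

    # Hypothetical sketch: a group's KPI set with both leading and lagging indicators.
    from dataclasses import dataclass

    @dataclass
    class KPI:
        name: str
        kind: str                  # "leading" (predicts outcomes) or "lagging" (measures outcomes)
        target: float
        current: float
        lower_is_better: bool = False

        def on_track(self) -> bool:
            if self.lower_is_better:
                return self.current <= self.target
            return self.current >= self.target

    team_kpis = [
        KPI("deployment_frequency_per_week", "leading", target=5, current=3),
        KPI("escaped_defects_per_release", "lagging", target=2, current=4, lower_is_better=True),
        KPI("customer_churn_rate_pct", "lagging", target=3.0, current=2.5, lower_is_better=True),
    ]

    # Keep the set small enough to stay actionable (3 to 7 indicators per group).
    assert 3 <= len(team_kpis) <= 7

    for kpi in team_kpis:
        status = "OK" if kpi.on_track() else "needs attention"
        print(f"{kpi.name} ({kpi.kind}): {kpi.current} vs target {kpi.target} -> {status}")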

Planning

  Save
Work planning follows a clear, predetermined strategy, aligned with business goals and time/resource constraints
Work is planned with long-term thinking in mind
Work is planned for 2-4 weeks ahead
At the end of the planning process, priorities, objectives, and responsibilities on a personal level are clear
Planned changes are divided into small, incremental changes that can be completed in a week or less
Work backlog is small

Working

  Save
Work progress, required effort, and product quality are monitored on a task level
All work is categorized as "planned", "unplanned", or "abandoned"
The amount of unplanned work is monitored and continually reduced
The amount of abandoned work is monitored and continually reduced
The amount of work in progress (WIP) is monitored and limited
Work processes and procedures are clear to management and employees
Work processes and procedures are documented
Work processes and procedures are constantly reevaluated and improved
Work processes and procedures are actively automated
Work processes and procedures do not hold people back from doing great work

Communication

  Save
There is a centralized Instant Messaging service, used for daily communications throughout the organization
The Instant Messaging service provides search within conversation history
It is possible to create topic-based communication channels in the Instant Messaging service
Important notifications such as monitoring alerts are immediately communicated via Instant Messaging service
It is possible to actively handle incidents using Instant Messaging service (ChatOps)
Critical services support in-service communications, making it possible to discuss service-specific items without leaving the service
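
As an illustration of the item above about communicating important notifications via Instant Messaging: a small sketch that posts a monitoring alert to a topic-based channel through a generic incoming webhook. The URL, channel name, and payload shape are assumptions; real IM services (Slack, Mattermost, and others) each define their own formats.

    # Sketch: forward a monitoring alert to a chat channel via an incoming webhook.
    # The webhook URL and JSON payload layout are placeholders; adapt to your IM service.
    import json
    import urllib.request

    WEBHOOK_URL = "https://chat.example.com/hooks/ops-alerts"  # hypothetical

    def notify_channel(severity: str, source: str, message: str) -> None:
        payload = {
            "channel": "#ops-alerts",
            "text": f"[{severity.upper()}] {source}: {message}",
        }
        request = urllib.request.Request(
            WEBHOOK_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=5) as response:
            response.read()

    # Example usage (would require a real webhook endpoint):
    # notify_channel("critical", "billing-api", "error rate above 5% for 10 minutes")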

Knowledge management

  Save
Knowledge sharing meetings are performed periodically
Technical specifications of every product, such as requirements, architecture, and technologies used, are easily accessible to members from other teams
There is an easy-to-use directory of experts, including their expertise and contact information
There is a centralized Knowledge base
Any knowledge possibly reusable by company employees is stored in the Knowledge base
The Knowledge base is periodically reorganized and is kept up to date
Every team has a person responsible for Knowledge management
There is a dedicated team responsible for Knowledge management
Knowledge management is monitored

Team organization and roles

  Save
Every team consists of 5 to 8 members
Every team consists of members with different roles, making it possible to plan, develop, build, test, deploy, and monitor software and infrastructure changes within the same team
The induction process for a new team member is shorter than two weeks
There are dedicated teams that provide and manage internal, cross-company tools

Monitoring

Organizational performance
  Save
TBD
Work
  Save
Percentage of done out of planned work is monitored
Work progress, effort, and quality are monitored
The ratio of unplanned to planned work is monitored
The amount of work in progress (WIP) is monitored
The amount of abandoned work is monitored
The number of meetings for every employee is measured and analyzed on a weekly basis
"Vanity metrics" such as lines of code produced and functions created are considered counterproductive and are NOT monitored
Competition monitoring such as team leaderboards is considered counterproductive and is NOT used

Human Resources

Human Resources Department is responsible for recruiting skilled and motivated staff and maximizing their performance.

General

  Save
TBD

Monitoring

  Save
Automated surveys collecting actionable information from employees are conducted periodically
Employee morale is monitored
Employee job satisfaction is monitored
Employee motivation is monitored
Employee turnover is monitored

Marketing

Marketing Department is responsible for identifying and profitably satisfying customer demand.

General

  Save
TBD

Finance

Finance Department is responsible for providing financial insights vital to the company's current well-being and to its ability to make future strategic decisions.

General

  Save
TBD

Information Technology

Information Technology Department is responsible for providing technological infrastructure to support business operations.

General

  Save
It is possible to assess cost-effectiveness, usage, resource utilization, performance, and quality per component, application, and module

Monitoring

General

Collecting
  Save
Logs, metrics, data dumps, screenshots, and other potentially important qualitative and quantitative data are collected
Collected data is stored in a centralized system
There are clear retention policies for collected data
The quality of collected data is periodically reviewed and improved
The log format is standardized across the company
Collected data always contains the time, origin and a descriptive message
Metric naming is standardized across the company
Metrics are periodically reviewed for correctness and relevance
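
The items above about a standardized log format that always carries the time, origin, and a descriptive message could look like the following sketch: structured JSON log lines emitted through the standard logging machinery. The field names are illustrative, not a standard defined by this work.

    # Sketch: emit logs as JSON lines so every record carries time, origin, and message.
    import json
    import logging
    from datetime import datetime, timezone

    class JsonFormatter(logging.Formatter):
        def format(self, record: logging.LogRecord) -> str:
            return json.dumps({
                "time": datetime.now(timezone.utc).isoformat(),
                "origin": record.name,      # service or component that produced the record
                "level": record.levelname,
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("billing-api")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("invoice 42 generated in 120 ms")
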
Analyzing
  Save
It is possible to query collected data
It is possible to transform, combine, and perform computations on collected quantitative data
It is possible to perform trend analysis, including trend prediction on collected quantitative data
It is possible to visualize data queries using graphs, diagrams, tables and maps
It is possible to create custom dashboards with visualizations of data queries
It is possible to create scheduled reports based on data queries
Collected data is analyzed and translated into informative monitoring system events
Monitoring system events contain as much contextual information as possible
Events with a contextual relationship are grouped into higher-level events
It is possible to generate monitoring system events manually using easy-to-use web interfaces and APIs
The quality of visualizations, dashboards, and monitoring system events is periodically reviewed and improved
Reacting
  Save
Alerts are created for monitoring system events
Alerts are created only for actionable events
Lower-level alerts are suppressed when a related higher-level alert has already been sent
Incidents are always recorded and analyzed
Incident handling processes are documented
Incident reports always contain issue summary, timeline, root cause, resolution and recovery, and corrective and preventative measures
Alerts contain reference to documentation explaining how to handle the incident
Escalation plans are documented
Alerts unacknowledged within expected time are automatically escalated according to escalation plans
Automated remediation is performed only for issues that the company does not have control over, such as failing hardware in an external datacenter
The quality of alerts and escalation plans is periodically reviewed and improved
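
A rough sketch of the escalation items above: an unacknowledged alert is pushed up a documented escalation plan once its expected acknowledgement window has passed. The plan structure, timeouts, and notify() target are assumptions for illustration; a real setup would call a paging or IM service.

    # Sketch: escalate an alert along a documented plan if it stays unacknowledged too long.
    import time

    ESCALATION_PLAN = [
        {"notify": "on-call engineer", "ack_timeout_s": 300},
        {"notify": "team lead",        "ack_timeout_s": 600},
        {"notify": "duty manager",     "ack_timeout_s": 900},
    ]

    def notify(target: str, alert: dict) -> None:
        print(f"notifying {target}: {alert['summary']}")   # replace with a real paging/IM call

    def handle_alert(alert: dict, is_acknowledged) -> None:
        for step in ESCALATION_PLAN:
            notify(step["notify"], alert)
            deadline = time.time() + step["ack_timeout_s"]
            while time.time() < deadline:
                if is_acknowledged(alert):
                    return
                time.sleep(10)
        print(f"alert {alert['summary']} exhausted the escalation plan")

    # Example usage (commented out because it waits for acknowledgement):
    # handle_alert({"summary": "disk full on db-1"}, is_acknowledged=lambda a: False)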

Monitoring monitoring

  Save
There is an external, independent system that monitors the Monitoring System's health
When the Monitoring System is experiencing issues, management and team members are notified immediately
When the Monitoring System is out of service, standalone monitoring for critical components is working, and results are being recorded locally
When the Monitoring System is out of service, it is possible to perform a failover to an alternative Monitoring System

Communication monitoring

  Save
TBD

Knowledge management monitoring

  Save
Rate of contributions to the centralized Knowledge base is monitored
Rate of knowledge sharing sessions is monitored

Change monitoring

  Save
Lead time for release is monitored
The release rate is monitored
Time to restore service is monitored
Release failure rate is monitored
Drift between the coded and the actual infrastructure state is monitored
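
The change-monitoring items above correspond to widely used delivery metrics (lead time, release rate, time to restore service, release failure rate). A small sketch of computing them from a list of release records; the record fields are hypothetical.

    # Sketch: compute change metrics from release records (field names are illustrative).
    from datetime import datetime, timedelta

    releases = [
        {"committed": datetime(2024, 5, 1, 9), "released": datetime(2024, 5, 2, 15), "failed": False},
        {"committed": datetime(2024, 5, 3, 10), "released": datetime(2024, 5, 3, 18), "failed": True,
         "restored": datetime(2024, 5, 3, 19, 30)},
    ]

    lead_times = [r["released"] - r["committed"] for r in releases]
    avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)

    failure_rate = sum(r["failed"] for r in releases) / len(releases)

    restore_times = [r["restored"] - r["released"] for r in releases if r.get("restored")]
    avg_restore = sum(restore_times, timedelta()) / len(restore_times) if restore_times else None

    print(f"average lead time for release: {avg_lead_time}")
    print(f"release failure rate: {failure_rate:.0%}")
    print(f"average time to restore service: {avg_restore}")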

Code monitoring

  Save
TBD

Test monitoring

  Save
Percentage of automated tests is monitored
Test execution times are monitored for all test levels and types
Test efficiency is monitored to detect false-positive or inefficient tests
Test usage is monitored to detect unused tests

System monitoring

  Save
There is a centralized System Monitoring System
There is a clear definition of the system components critical to the organization's operational well-being
Key requirements for the system's availability, stability, performance, throughput, and security indicators are defined and documented
Resource usage, state and health of every process are monitored
Resource usage, state and health of every host are monitored
Resource usage, state and health of every system component are monitored
Resource usage, state and health of the infrastructure are monitored
There is a documented dependency map that makes it possible to understand how any failure affects the rest of the system
It is possible to identify system issues using a bottom-up approach, starting at the process level
It is possible to identify system issues using a top-down approach, starting at the system component level
Software license compliance is monitored
Expiry dates of software licenses are monitored
Expiry dates of domain name registrations are monitored
Expiry dates of SSL certificates are monitored
Hosts and applications have one of the following states: In-service (e.g., OK), Unknown, Out-of-service (e.g., Critical), Some-issues (e.g., Warning), Recovered, Unstable (e.g., Flapping)
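
The host and application states listed above could be modeled as a single shared enumeration so that every tool reports states consistently. A minimal sketch, with invented value strings:

    # Sketch: a shared enumeration of monitored host/application states.
    from enum import Enum

    class MonitoredState(Enum):
        IN_SERVICE = "in-service"          # e.g., OK
        UNKNOWN = "unknown"
        OUT_OF_SERVICE = "out-of-service"  # e.g., critical
        SOME_ISSUES = "some-issues"        # e.g., warning
        RECOVERED = "recovered"
        UNSTABLE = "unstable"              # e.g., flapping

    print(MonitoredState("out-of-service"))   # MonitoredState.OUT_OF_SERVICE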

Network monitoring

  Save
There is a centralized easy-to-use Network Monitoring System
There is an up-to-date, easily accessible inventory of hosts and network equipment
It is possible to review the history of inventory changes
Hosts and network equipment of the entire network are discovered automatically
Network topology is discovered using network tomography, SNMP, or route analytics
There is a graphical representation of the network
It is possible to arrange hosts and network equipment into user-defined logical groups
It is possible to define dependencies between groups of hosts and network equipment
It is possible to separate network and application issues
It is possible to detect the specific layer of the OSI model at which an issue occurred
Network performance is monitored
Network availability is monitored
Network uptime is monitored
Network reliability is monitored

Application performance management

General
  Save
There is a centralized easy-to-use Application Performance Management System
It is possible to record and replay sessions of user activity
Alerts about application performance and the end-user experience are actionable
There are automated remediation processes triggered in response to application performance issues
It is possible to identify any system change, such as a new release or insufficient system resources, that is responsible for exceptional behavior
End-user experience
  Save
It is possible to identify issues experienced by end users, using a top-down approach
The flow of user actions and the responses to them across all system components is logged
User interactions with the system are grouped by session, making it possible to trace and analyze only the relevant events
Slow responses to user actions are identified
Rate and percentage of slow responses to user actions are monitored
Unexpected responses to user actions are identified
Rate and percentage of unexpected responses to user actions are monitored
Business transactions
  Save
Business transactions are clearly identified and documented
It is possible to search for business transactions based on context and content such as time of arrival or transaction type
Slow business transactions are identified
Rate and percentage of slow business transactions are monitored
Unsuccessful business transactions are identified
Rate and percentage of unsuccessful business transactions are monitored
Suspicious business transactions are identified
Synthetic end-user transactions are clearly defined and documented
Synthetic end-user transactions are performed periodically
Runtime application architecture (Black box)
  Save
It is possible to identify application performance issues using a bottom-up approach, starting at the system component level
Servers, networks, storage, applications and services within the environment are automatically discovered by Application Discovery and Dependency Mapping tools (ADDM)
Transactions and applications are automatically mapped to underlying infrastructure components
Servers, networks, storage, applications and services within the environment have up/down state monitoring
Deep dive component monitoring (White box)
  Save
It is possible to identify application performance issues using a bottom-up approach, starting at the code level
The call stack of application code execution and the timing associated with each method are recorded and monitored
Communications with services external to each module are recorded and monitored
Analytics / Reporting
  Save
It is easy to correlate application performance data from various sources to provide actionable information
Simple reports with application performance information are sent periodically to stakeholders and team members

Website monitoring

  Save
TBD

Deployment monitoring

  Save
TBD

Security monitoring

  Save
TBD

Security

General

  Save
Security is a priority in the organization
There is a thorough, up-to-date documentation of authentication and authorization mechanisms, network architecture, storage and hardware access
Security-related documentation access is restricted and audited
There is complete trust and transparency between the people responsible for development, operations, and security
Probability and impact of security risks are clear to everyone
Incremental improvements are preferred to following a detailed security roadmap
Security practices improve on each step of the Delivery Pipeline
Third-party software is standardized
Automated audit trails are implemented across all systems
Preparedness is tested with Security Games
The security hardening process does not slow down the pace of business activities
Automation of security processes is of high priority
Security reviews are conducted periodically
Threat modeling and risk assessment are conducted periodically
All IT and R&D employees are immediately notified about any security vulnerabilities detected in the system

Account and privilege management

  Save
Definitions of users, groups, roles, and privileges are stored in Source Control
Management of users, groups, roles, and privileges is performed explicitly through code stored in Source Control
It is possible to roll back definitions of users, groups, roles, and privileges to a known good state in response to any detected aberrations
All users, groups, roles, and privileges are carefully discussed and assigned to resources on a need-to-know basis
The practice of assigning the least-privilege model of access is applied whenever possible
Any privileged accounts are closely monitored for changes
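
A sketch of the items above about keeping user, group, role, and privilege definitions as code in Source Control: a declarative definition with a simple least-privilege check applied before use. The user names, roles, and privilege strings are invented for illustration.

    # Sketch: access definitions as code, checked for least privilege before being applied.
    ROLES = {
        "developer": {"repo:read", "repo:write", "ci:trigger"},
        "auditor":   {"repo:read", "audit-log:read"},
        "admin":     {"repo:read", "repo:write", "ci:trigger", "users:manage"},
    }

    USERS = {
        "alice": {"roles": ["developer"]},
        "bob":   {"roles": ["auditor"]},
    }

    def effective_privileges(user: str) -> set:
        privileges = set()
        for role in USERS[user]["roles"]:
            privileges |= ROLES[role]
        return privileges

    def check_least_privilege(user: str, required: set) -> None:
        granted = effective_privileges(user)
        unused = granted - required
        if unused:
            print(f"{user}: review unused privileges {sorted(unused)}")

    check_least_privilege("bob", required={"repo:read"})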

Inventory management

  Save
There is an always-up-to-date inventory of hardware, software and information assets
New assets are discovered automatically within minutes
It is easy to determine the team or person responsible for any asset
Changes in existing assets are validated as soon as they appear in the inventory
Any aberrations are automatically communicated to the responsible team or person
It is possible to roll back any inventory item to a known good state in response to any detected aberrations

Configuration and patch management

  Save
The Configuration Management System continuously applies configuration standards to new systems and enforces them on systems that deviate from those standards
There is an easily accessible catalogue of "Golden Images" with predefined core functionality, such as identity management, configuration management, secrets-as-a-service, and audit

Data security, at rest and in transit

  Save
TBD Encrypted communication channels

Logging and event management

  Save
Logs and events generated by services, applications and operating systems are automatically collected and sent to a central platform
Logs and events affecting data are automatically collected and sent to a central platform
Logs and events generated by services, applications and operating systems are closely monitored with Security Information and Event Management (SIEM) tools
It is possible to roll back the system to a known good state as a response to any aberration detected with Security Information and Event Management
Continuous Security Monitoring is fully implemented

Vulnerability scanning and assessment

  Save
Automated dynamic and static code analysis is performed as part of the delivery cycle
External host vulnerability scanning is performed periodically
Internal agent-based host vulnerability scanning is performed periodically
External network vulnerability scanning is performed periodically
Internal network vulnerability scanning is performed periodically

Research and Development

Research and Development Department is responsible for engineering a versatile, high-quality product according to the Marketing department's requirements.

Development

This section should be considered equally for both infrastructure and application development.

General

  Save
The variety of technologies in use is small
Technical debt is monitored and removed periodically

Architecture and Design

  Save
Architectural and design decisions are documented along with their context and consequences
Architecture is evolutionary and supports incremental change across multiple dimensions
Systems, components, and modules are loosely coupled
Replacing a technology with an alternative is theoretically possible
Duplications in systems, components, and modules are periodically identified and minimized
Services, APIs, etc., are treated as products for internal customers
Service provider-consumer contracts are documented
Service evolution is possible without violating existing provider-consumer contracts
Service provider-consumer contracts are automatically tested
Service provider-consumer contracts specify quality of service characteristics
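
A minimal sketch of the provider-consumer contract items above: the consumer records the response fields and quality-of-service expectations it relies on, and the provider's test suite verifies them automatically. All names here are hypothetical; dedicated consumer-driven contract tools (such as Pact) exist for this purpose.

    # Sketch: a consumer-driven contract checked in the provider's test suite.
    CONSUMER_CONTRACT = {
        "endpoint": "/invoices/{id}",
        "required_fields": {"id", "amount", "currency", "status"},
        "max_response_ms": 200,          # quality-of-service expectation in the contract
    }

    def provider_response(invoice_id: int) -> dict:
        # Stand-in for calling the real provider implementation.
        return {"id": invoice_id, "amount": 99.5, "currency": "EUR",
                "status": "paid", "issued": "2024-05-01"}

    def test_contract_fields_present():
        response = provider_response(42)
        missing = CONSUMER_CONTRACT["required_fields"] - response.keys()
        assert not missing, f"provider breaks the contract, missing fields: {missing}"

    test_contract_fields_present()
    print("contract satisfied")

Note that the provider may add fields (like "issued" above) without breaking consumers; removing or renaming a required field is what the contract test catches.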

Tools

  Save
There is an easy-to-use catalog of libraries and tools used in the company
There is a self-service process for provisioning the required tools on a developer's machine
The process of working with development tools is thoroughly documented

Development environments

  Save
Environment management, including provisioning and application deployment, is fully automated
There is a self-service process for provisioning of Development and Test environments
There is a self-service process for application deployment to Development and Test environments
Development happens in an isolated environment and does not affect the work of other team members
It is possible to provision an isolated datastore, including a lightweight version of the data used in production
It is possible to debug and profile applications in Development and Test environments
It is possible to access logs and metrics of applications and infrastructure, running in the Development or Testing environment, at any time

Source Control

  Save
There is a centralized Source Control system
Code changes are submitted to Source Control as task-level commits
Code changes are submitted to Source Control at least once a day
Source Control branches or forks are used to isolate work on every task
Code submission to Source Control triggers automated build and test processes
Only fully tested, production-ready code is integrated into the main branch
Code freeze practice does not exist

Coding

General
  Save
Coding conventions are documented
Working code is preferred over comprehensive code documentation
Code changes are organized on a task level
Existing code is easy to maintain and extend over time
Every module or class has responsibility for a single part of the functionality provided by the software
Every module, class, or function is open for extension but closed for modification
Replacing an object of class A with an object of class B, which is a subclass of A, will not break the program
Interfaces are small and defined specifically for interaction between specific suppliers and consumers
High-level modules do not depend on low-level modules - both depend on abstractions
Abstractions do not depend on details; instead, details depend on abstractions
Feature toggles are used to temporarily hide task-level code changes from end-users, without changing the code
Feature toggles are categorized by their purpose as "release toggles", "operations toggles", "experiment toggles", and "permission toggles"
Feature toggles' category-specific longevity and dynamism are monitored
Feature toggles are periodically reviewed and cleaned up
There is a centralized system for feature toggles management
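
A sketch of the feature-toggle items above: toggles categorized by purpose and resolved through one central lookup, so that task-level changes stay hidden from end users without code changes. The toggle store and names are assumptions; dedicated toggle-management systems exist.

    # Sketch: centrally managed feature toggles, categorized by purpose.
    FEATURE_TOGGLES = {
        "new-invoice-layout": {"category": "release",    "enabled": False},
        "read-only-mode":     {"category": "operations", "enabled": False},
        "pricing-experiment": {"category": "experiment", "enabled": True},
        "beta-dashboard":     {"category": "permission", "enabled": True},
    }

    def is_enabled(name: str) -> bool:
        toggle = FEATURE_TOGGLES.get(name)
        return bool(toggle and toggle["enabled"])

    def render_invoice(invoice: dict) -> str:
        if is_enabled("new-invoice-layout"):
            return f"[new layout] invoice {invoice['id']}"
        return f"invoice {invoice['id']}"   # old path stays until the toggle is cleaned up

    print(render_invoice({"id": 42}))
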
Infrastructure
  Save
TBD
Application
  Save
TBD

Application building

  Save
Application building process is automated
It is possible to run application building process with a single command
Application building processes are executed on a dedicated machine
Every application building process is executed in an isolated workspace
Build dependencies are stored in a centralized Package Management System
Static Code Analysis is performed automatically during application build process execution
Unit tests are performed automatically during application build process execution
When application building process execution fails, an alert is sent to the person who triggered the build (and anyone else who is relevant)
When application building process execution fails, fixing it is the highest priority
Any application build result can be recreated from Source Control
A build report is generated when the building process finishes
Build reports are accessible at any time
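
A sketch of the "single command" item above: one entry point that runs static checks, unit tests, and packaging in order and fails loudly on the first broken step. The step commands are placeholders for whatever tools the company has standardized on.

    # Sketch: a single-command build driver (step commands are placeholders).
    import subprocess
    import sys

    BUILD_STEPS = [
        ["python", "-m", "compileall", "src"],   # stand-in for static analysis
        ["python", "-m", "pytest", "tests"],     # unit tests
        ["python", "-m", "build"],               # packaging (requires the 'build' package)
    ]

    def run_build() -> int:
        for step in BUILD_STEPS:
            print("running:", " ".join(step))
            result = subprocess.run(step)
            if result.returncode != 0:
                print("build failed at:", " ".join(step))
                return result.returncode
        print("build succeeded")
        return 0

    if __name__ == "__main__":
        sys.exit(run_build())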

Package/Asset Management

  Save
There is a centralized Package Management System
Build results are automatically versioned and tagged
Build results are automatically stored in Package Management System
Build results contain information linking them to a specific build process execution and code revision
The build package contains all relevant information required to automatically provision the needed infrastructure, set up monitoring, and deploy the package

Testing

General

  Save
Testing processes use a centralized Monitoring System to determine whether tests passed or failed, whenever possible
Functionality of each product and its components is tested
Reliability of each product and its components is tested
Usability of each product and its components is tested
Efficiency of each product and its components is tested
Maintainability of each product and its components is tested
Test strategy is clear and documented
Test plans are documented in detail and are accessible at any time
Tests are categorized by level and type
It is possible to run a specific level or type of test
Processes for management and maintenance of test data are standardized
Test code is stored in Source Control
It is possible to run tests using a single command
Tests are fully automated
Automation tools are standardized
It is possible to run tests in a dedicated Testing environment
It is possible to create and test a version consisting of a group of pending code changes
It is possible to run any test on any version of infrastructure, application, and data state
Test failures are likely to indicate a real defect
Identified defects are fully analyzed
Identified defects are immediately assigned to a relevant team member
Fixing new defects has the highest priority
A cross-company bug tracking service exists
Test summary reports are stored and accessible at any time
There is an easy-to-use centralized Test Management System
There is a self-service process for running any test in any Development or Testing environment
A dedicated team manages the Test Management System, including defining standards and organization of existing tests
Continuous Testing is fully implemented

Work integration testing

  Save
Work integration into the main branch is only possible for fully-tested, production-ready code

Infrastructure testing

  Save
rspec, serverspec?

Static code analysis

  Save
Peer code reviews are conducted periodically
Static Code Analysis is performed automatically during application build process execution
During the Static Code Analysis process, technical debt is measured
During the Static Code Analysis process, coding conventions are verified
During the Static Code Analysis process, bad practices (anti-patterns) are detected
During the Static Code Analysis process, software metrics such as Code Coverage, Cyclomatic Complexity, Class Coupling, and Maintainability Index are calculated
During the Static Code Analysis process, security vulnerabilities are detected
Team members are automatically notified about code aberrations detected during Static Code Analysis

Unit testing

  Save
Unit tests cover the smallest independent and testable parts of the source code which are usually individual methods or OOP classes
Unit tests cover at least 80% of the code
Mocks and proxies are used for external dependencies
Unit integration tests exist
Team members are automatically notified about code aberrations detected by the Unit Testing process
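
A small sketch of the items above: a unit test that exercises the smallest testable piece of code in isolation, with the external dependency replaced by a mock. The payment gateway and function names are invented for illustration.

    # Sketch: unit test with a mocked external dependency (names are illustrative).
    import unittest
    from unittest import mock

    def charge_customer(gateway, customer_id: str, amount: float) -> bool:
        """Unit under test: charges a customer via an external payment gateway."""
        if amount <= 0:
            return False
        return gateway.charge(customer_id, amount) == "ok"

    class ChargeCustomerTest(unittest.TestCase):
        def test_successful_charge(self):
            gateway = mock.Mock()
            gateway.charge.return_value = "ok"
            self.assertTrue(charge_customer(gateway, "cust-1", 10.0))
            gateway.charge.assert_called_once_with("cust-1", 10.0)

        def test_rejects_non_positive_amount(self):
            gateway = mock.Mock()
            self.assertFalse(charge_customer(gateway, "cust-1", 0))
            gateway.charge.assert_not_called()

    if __name__ == "__main__":
        unittest.main()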

Integration testing

  Save
Integration testing is performed on the interfaces between individual modules, so defects are detected in individual modules rather than in the entire system
The Integration testing process is used to verify functional, performance, and reliability requirements placed on major design items
The Integration testing process runs only after the Unit testing process has finished successfully
Both success and error cases are being simulated during the Integration testing
Bottom-to-top Integration testing approach is NOT used
Team members are automatically notified about defects detected by Integration testing process

System testing

  Save
System testing process is performed to evaluate the system's compliance with specified requirements
The System testing process runs only after the Integration testing process has finished successfully
System testing test cases are developed to simulate real-life scenarios
Creating test cases for System testing does not require knowledge of the inner design of the code or logic
Regression testing is performed to validate that newly introduced changes to the system do not introduce new defects
Non-regression testing is performed to validate that newly introduced changes to the system have the intended effect
Smoke (Sanity) testing is performed to validate that critical functionalities of the system are working as expected
Graphical user interface testing is performed to validate that its visual representation and functionality meets specifications
Usability testing is performed by observing people trying to use the system for its intended purpose; to validate its usability
Performance testing is performed to validate the correctness of system performance in terms of responsiveness and stability under a particular workload
Scalability testing is performed to validate the ability to scale up/down or out/in in response to the system's load
Compatibility testing is performed to validate the application's compatibility with the computing environment, such as hardware and OS
Exception handling testing is performed to validate the system's correct behavior when anomalous or exceptional conditions requiring special processing occur
Security testing is performed to reveal flaws in the system's security mechanisms that protect data and maintain functionality
Accessibility testing is performed to validate the accessibility of the system to all people, regardless of disability type or severity of impairment
Team members are automatically notified about defects detected by System testing process

Acceptance testing

  Save
Acceptance testing process is used to enable the user, customer or other authorized entity to determine whether or not to accept the system, based on their needs, requirements, and business processes
Acceptance testing environments are designed to be identical, or as close as possible, to the anticipated production environment
The Acceptance testing process runs only after the System testing process has finished successfully
User acceptance tests are specified by business customers or product owners as primary stakeholders
User acceptance tests are written in Business Domain-Specific Language (such as Gherkin)
There is a manual process for User acceptance tests, performed by stakeholders
There is an automated process for User acceptance tests, performed during
Operational Acceptance Testing includes testing of component and network failover processes
Operational Acceptance Testing includes checking for presence of proper monitoring and alerts, including monitoring of SLA/OLA
Operational Acceptance Testing includes testing of data backup and recovery processes
Operational Acceptance Testing includes testing of disaster recovery processes
Operational Acceptance Testing includes checking of security vulnerabilities
Operational Acceptance Testing includes testing of deployment and rollback processes
Operational Acceptance Testing includes testing of application installation process (in cases when the application has to be installed on customer's computer)
There is a self-service procedure for creating separate environments dedicated to Operational Acceptance Testing
Team members are automatically notified about the results of Acceptance testing

Deployment

General

  Save
A centralized tool is used to provision infrastructure, deploy applications and perform data migrations to multiple target environments
Infrastructure provisioning, application deployment, and data migrations can be performed as a single, atomic process, separately for each task
Quality Gateways ensure that relevant quality checks are passed before deployment to any environment

Infrastructure provisioning

  Save
Infrastructure provisioning process is documented
Infrastructure provisioning process is performed entirely from code stored in Source Control
Infrastructure provisioning process is fully automated
It is possible to run Infrastructure provisioning process with a single command
Infrastructure is automatically validated after being provisioned
In case of failure, it is possible to roll back and reprovision a working infrastructure version
In case of failure, the rollback process is triggered automatically
There is a self-service process for provisioning of any infrastructure version to Development and Test environments

Application deployment

  Save
Application deployment process is documented
Application deployment process uses only build artifacts stored in a centralized Package Management System
The same build artifact is deployed to Test and Production environments
Application deployment process is fully automated
It is possible to run application deployment process with a single command
Applications are automatically validated after being deployed
In case of failure, it is possible to roll back and redeploy a working application version
In case of failure, the rollback process is triggered automatically
There is a self-service process for deployment of any application version to Development and Test environments

Data and schema management

  Save
Data and schema migration process is documented
Data and schema migration process is performed entirely from code stored in Source Control
Data and schema migration process is fully automated
It is possible to run data and schema migration process with a single command
Data and schema migrations are automatically validated after being performed
In case of failure, it is possible to roll back to a working data and schema state
In case of failure, the rollback process is triggered automatically
There is a self-service process for deployment of a lightweight version of data used in Production to Development and Test environments
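
A rough sketch of an automated, single-command data and schema migration process as described above: ordered migrations stored with the code, each applied exactly once and recorded in the database. The table and column names are assumptions; real projects typically use tools such as Flyway, Liquibase, or Alembic.

    # Sketch: ordered schema migrations applied once each and recorded in the database.
    import sqlite3

    MIGRATIONS = [
        ("001_create_customers", "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)"),
        ("002_add_email",        "ALTER TABLE customers ADD COLUMN email TEXT"),
    ]

    def migrate(connection: sqlite3.Connection) -> None:
        connection.execute("CREATE TABLE IF NOT EXISTS schema_migrations (id TEXT PRIMARY KEY)")
        applied = {row[0] for row in connection.execute("SELECT id FROM schema_migrations")}
        for migration_id, statement in MIGRATIONS:
            if migration_id in applied:
                continue
            connection.execute(statement)
            connection.execute("INSERT INTO schema_migrations (id) VALUES (?)", (migration_id,))
            print("applied", migration_id)
        connection.commit()

    migrate(sqlite3.connect(":memory:"))

Because already-applied migrations are skipped, the same command can be run repeatedly against Development, Test, and Production environments.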

Releasing to Production environment

  Save
The release process is documented
Continuous Operations approach is fully implemented, and the release process does not require any downtime
Deployment to production and release to production are defined and performed as two separate processes
Feature toggles are accessible through an easy-to-use interface
New features are incrementally released to groups of customers (Canary releases)
In case of failure, it is possible to roll back to a working version of the system
In case of failure, the rollback process is triggered automatically
Continuous Delivery is fully implemented
Continuous Deployment is fully implemented
Release notes are auto-generated after each release
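
A sketch of the canary-release item above: new functionality is exposed to an incrementally growing percentage of customers by a deterministic hash of the customer id, independently of the deployment itself. The rollout store and names are illustrative.

    # Sketch: deterministic canary rollout by customer id (names are illustrative).
    import hashlib

    ROLLOUT_PERCENTAGE = {"new-checkout": 10}   # raised gradually: 1 -> 10 -> 50 -> 100

    def in_canary(feature: str, customer_id: str) -> bool:
        percentage = ROLLOUT_PERCENTAGE.get(feature, 0)
        digest = hashlib.sha256(f"{feature}:{customer_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100          # stable bucket in [0, 100)
        return bucket < percentage

    for customer in ("cust-1", "cust-2", "cust-3"):
        path = "new" if in_canary("new-checkout", customer) else "current"
        print(customer, "->", path, "checkout")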

Unsorted

General

  Save
Top-down metrics catch outages
Bottom-up metrics tell you what's wrong
The effectiveness of interlinked DevOps processes across the delivery pipeline (such as test-driven development, continuous delivery, and response times) is measured
Bottlenecks within these processes are identified
Functionality of each product and its components is monitored
Reliability of each product and its components is monitored
Usability of each product and its components is monitored
Efficiency of each product and its components is monitored
Maintainability of each product and its components is monitored

Business monitoring

  Save
Expected business value of each delivered feature or improvement is monitored and verified
Revenue per User Story is monitored
Business transaction monitoring
User Interactions

User satisfaction monitoring

  Save
Social media monitoring
User reviews monitoring
Net Promoter Score

Product monitoring

  Save
Production environment is monitored for availability
Production environment is monitored for performance

Real user monitoring

  Save
TBD

Cost monitoring

  Save
Cost of execution is monitored (e.g., salaries and time)
Cost of resources is monitored (e.g., AWS)
Cost of resources is monitored (e.g., laptops)
It is possible to view the cost of work at any moment
Operating cost - //en.wikipedia.org/wiki/Operating_cost
Total CapEx and OpEx cost reduction compared to other approaches (e.g., an ROI case study)

Operations monitoring

  Save
TBD