Part II: The Production Landscape
Chapter 5: The Production Environment
The production environment is the foundation of the production system. It encompasses both the solution architecture and the operations architecture. It can also involve multiple data centers for fault tolerance and disaster recovery. This chapter examines a hypothetical, medium-scale, highly resilient, and scalable production environment. This chapter is organized into the following sections:
The Big Picture — Examines the architecture hardware diagram, including the solutions architecture and the operations architecture environments. The architecture hardware diagram is a snapshot view of the overall production environment showing the different hardware (servers and network components) it encompasses.
The Hardware Topology — Examines the usage and resiliency of individual server groups. It also looks at the high-availability and fault-tolerant features of the architecture, along with some items that should be considered. It then examines how the application can use the SAN for shared storage. Finally, it takes a quick look at network segmentation, which can be employed for improved security and performance.
The Software Topology — Describes how the software topology is put together, examining each of the server groups and the applications being used. It looks at how the applications are mapped to the servers and what sort of information should be captured and documented.
The Primary and Secondary Data Centers — Describes the primary and secondary data centers that organizations use to provide failover capabilities. It also describes how data centers can be either active-passive or active-active in their configuration, and the benefits and drawbacks of each.
Chapter 6: Activities and Environments
This chapter takes a quick look at some of the testing and proving activities involved in the development lifecycle, and some potential environments where they can be carried out. Given that there are so many options, I've decided to concentrate on the core activities and their environment requirements. Out in the field, you'll need to perform a similar exercise to determine which activities will be performed and the most appropriate environments to perform them in. This chapter is organized into the following sections:
The Types of Testing — Examines the types of tests that need to be performed to ensure that the system is fit for purpose, including Acceptance Testing, Technical Testing, Functional Testing, and Regression Testing.
Testing in the "Production" Environments — Discusses the different types of production environments where some of the preceding tests can be carried out, including the primary site environment, the secondary site environment, and the pre-production environment.
Chapter 7: Service Delivery and Operations
The service delivery and operations teams, which typically are based out of the data centers, take care of all the environments, servers, and systems. These are the folks you should work closely with to understand how the system should operate and be operated in live service. This chapter is organized into the following sections:
The Three Levels of Live Service Support — This section provides an overview of the three different levels of support and how the individual teams interact. It also provides some high-level activities that the different teams will perform in the event of a live service incident. It then takes a more detailed look at the roles and responsibilities of the operations and service delivery teams and provides insights into the routine day-to-day tasks that are performed.
The Operations Manual — Takes a look at some of the reasons why you should understand the Service Delivery and Operations organizations and what you can do to make your life and theirs a lot easier. This is typically achieved through the Operations Manual, which is one of the most important documents produced for the system.
Chapter 8: Monitoring and Alerts
Monitoring is an integral part of the organization's effectiveness. Knowing when something is or isn't functioning correctly or requires manual intervention is critical to the operation of the business and its customers. Effective monitoring will help to ensure this. This chapter is organized into the following sections:
What Is Monitoring? Examines the monitoring architecture and the various rules that can be put in place to filter and escalate information. It also discusses monitoring blackout windows, which are used to filter out alerts at certain periods of time.
What Is Monitored? Examines the various monitoring sources as well as some typical application and server monitoring. It also discusses the types of events that are captured, Windows and other third-party application performance counters, and custom performance counters updated by the application.
What Are Alerts? Examines how alerts are the trigger for incident investigation.
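As a rough illustration of the blackout windows mentioned above, the following minimal Python sketch suppresses alerts from a given source during a daily time window. The class name, the function name, and the "batch-server" source are hypothetical and chosen only for this example; real monitoring products configure blackout windows in their own ways.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class BlackoutWindow:
    """A daily period during which alerts from one source are suppressed.

    Hypothetical structure for illustration only.
    """
    source: str  # name of the monitoring source, e.g. a server group
    start: time  # inclusive start of the window
    end: time    # exclusive end of the window

    def covers(self, source: str, at: datetime) -> bool:
        """Return True if an event from `source` at `at` falls in the window."""
        if source != self.source:
            return False
        now = at.time()
        if self.start <= self.end:
            return self.start <= now < self.end
        # The window wraps past midnight, e.g. 23:00 to 02:00.
        return now >= self.start or now < self.end


def should_alert(source: str, at: datetime, windows: list[BlackoutWindow]) -> bool:
    """An alert is raised only if no blackout window covers the event."""
    return not any(w.covers(source, at) for w in windows)


# Example: suppress alerts from a hypothetical batch server overnight.
windows = [BlackoutWindow("batch-server", time(23, 0), time(2, 0))]
```

With this configuration, an event from "batch-server" at 23:30 would be filtered out, while the same event at 02:00, or any event from a different source, would still raise an alert.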
Chapter 9: Reporting and Analytics
There's often a tendency to think of reporting and analytics as simply providing business users and stakeholders with data gathered from around the system, typically from within the database (for example, a report showing sales figures for a particular period or region). Reporting and analytics are sometimes referred to as business intelligence. However, reports are not just developed for business users; they can be developed for technical staff as well. This chapter is organized into the following sections:
Reporting Overview — Provides an overview of the reporting function and the categories of reports, as well as some specifics on capturing report criteria and requirements. It also shows some typical reports and output styles.
The Reporting Architecture — Takes a look at the overall reporting architecture and how report data can be gathered from multiple sources and replicated databases.
Technical Reporting — Looks at some of the basic technical reports and how they can be used in various ways.
Analytics Overview — Provides an overview of the various web analytics software that can be used to analyze web server logs and provide reports and analysis. It also discusses the importance of understanding the analytics architecture, as it could have other implications on the solution architecture and design.
Chapter 10: Batch
This chapter examines batch and batch processing. Batch processing has its roots in the mainframe era with the earliest batch or job schedulers. Batch jobs do not have a user interface; all inputs to a batch job are supplied through command parameters, scripts, configuration files, or configuration data. Today the batch window is ever decreasing because of 24/7 availability requirements. This chapter is organized into the following sections:
Batch Processing — Discusses the batch processing required in today's software systems and the different batch processing groups. It looks at the batch window and some techniques for reducing the batch window.
The Batch Scheduler — Examines the batch scheduler, batch jobs, and schedules. It also looks at the dependencies between jobs and groups, as well as at the batch nodes and node groups.
The Batch Run Date — Highlights the importance of including a batch run date in the architecture to ensure that batch jobs process the appropriate information.
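To make the idea of a batch run date concrete, here is a minimal Python sketch in which a job receives its run date as a command parameter instead of reading the system clock. One plausible motivation, offered here as an assumption, is that a job started after midnight may still need to process the previous business day's data. The `--run-date` flag, the function names, and the order records are all hypothetical.

```python
import argparse
from datetime import date


def parse_run_date(argv: list[str]) -> date:
    """Read the batch run date from the command parameters (ISO format)."""
    parser = argparse.ArgumentParser(description="Hypothetical batch job")
    parser.add_argument(
        "--run-date",
        type=date.fromisoformat,  # accepts values such as 2024-03-31
        required=True,
        help="business date the job should process, as YYYY-MM-DD",
    )
    return parser.parse_args(argv).run_date


def select_orders(orders: list[dict], run_date: date) -> list[dict]:
    """Pick only the records that belong to the given run date."""
    return [order for order in orders if order["order_date"] == run_date]


# Example data: hypothetical order records keyed by business date.
orders = [
    {"id": 1, "order_date": date(2024, 3, 30)},
    {"id": 2, "order_date": date(2024, 3, 31)},
    {"id": 3, "order_date": date(2024, 3, 31)},
]

run_date = parse_run_date(["--run-date", "2024-03-31"])
selected = select_orders(orders, run_date)
```

Because the date is passed in explicitly, the same job can be rerun for an earlier date after a failure without any change to the code.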
Chapter 11: Incident Investigation
Incident investigation can be one of the most stressful and difficult areas of software development. The stress is compounded by how critical the situation is, how difficult the problem is, and how quickly it needs to be resolved. Some of the most critical situations include production outages and issues that endanger deadlines and milestones. Not all incidents are resolved with software or hardware modification or configuration; some incidents can be resolved by further user and/or operator training, education, and procedures. Incidents will happen, so understanding and preparing for them are vital parts of software development. This chapter is organized into the following sections:
The Investigation Roadmap — Looks at the steps involved in incident investigation and resolution.
Typical Incidents — Lists some of the typical functional and technical incidents that occur in the everyday environments.
Chapter 12: Application Maintenance
The application maintenance team is responsible for maintaining the application while it is in live service. Application maintenance functions generally include third-line support, issue resolution, bug fixing, application enhancements, testing, and release management. Application maintenance is essentially a mirror of change control and defect management in construction. This chapter is organized into the following sections:
The Application Maintenance Team — Provides a brief overview of the application maintenance team and its services. It also reinforces the message that while the system is still under construction, the development team essentially performs all the same activities and functions as the maintenance team.
Application Maintenance Functions — Examines the types of activities the application maintenance team carries out as part of its day-to-day operation.
The Developer's Guide — The Developer's Guide provides a good basis for the application maintenance activities.