ArchitectureGuide/ServicesRequirements

Revision 2 as of 2011-05-25 07:53:58

Introduction

We have some basic requirements that all services that make up Launchpad need to have for us to maintain and deploy them reliably.

No-downtime deployability

All services must be upgradable without causing downtime in the rest of the system. Whether that is achieved with haproxy plus dual running services or some other approach is service-specific. For HTTP services, haproxy is a sane default.
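As a minimal sketch of the "dual running services" idea, the following simulates upgrading instances one at a time so that at least one instance is always serving. The Backend class and rolling_upgrade function are hypothetical names for illustration only; a real deployment would drain and re-enable instances through the load balancer rather than by flipping a flag.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical stand-in for one running instance of a service."""
    name: str
    version: str
    in_rotation: bool = True


def rolling_upgrade(backends, new_version):
    """Upgrade backends one at a time so at least one always serves traffic.

    Returns the smallest number of in-rotation backends seen during the
    upgrade; a value of 0 would mean the service was down at some point.
    """
    if len(backends) < 2:
        raise ValueError("need at least two instances to upgrade without downtime")
    min_serving = len(backends)
    for backend in backends:
        backend.in_rotation = False    # drain: stop sending requests here
        serving = sum(b.in_rotation for b in backends)
        min_serving = min(min_serving, serving)
        backend.version = new_version  # upgrade while out of rotation
        backend.in_rotation = True     # rejoin before touching the next one
    return min_serving
```

With two instances, the function reports that one instance stayed in rotation throughout, which is the property this requirement asks for.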

No single point of failure (SPOF)

Some services will be very hard to deliver without SPOFs. All the rest shouldn't have any - and we definitely must not factor something out of a non-SPOF situation and replace it with a SPOF situation. A SPOF may be hard to avoid in, for example, extremely large scaling situations. Any exception to this requirement needs signoff from the Technical Architect.

Uptime and KPI monitoring

Our services are hooked into Nagios for monitoring of general availability and key elements (e.g. queue depth). While this is an ops responsibility, identifying service-specific KPIs should be done as part of the initial deploy, and for non-HTTP services, identifying how to monitor general availability should be done during the development effort.
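A queue depth check could be sketched as a Nagios-style plugin like the one below. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the "|" performance data separator follow the standard Nagios plugin conventions; the function names and thresholds are assumptions for illustration, not values used by Launchpad.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_queue_depth(depth, warn=100, crit=500):
    """Map a queue depth to an (exit_code, status_line) pair.

    The default thresholds are illustrative only.
    """
    if depth is None or depth < 0:
        return UNKNOWN, "QUEUE UNKNOWN - could not read queue depth"
    if depth >= crit:
        state, label = CRITICAL, "CRITICAL"
    elif depth >= warn:
        state, label = WARNING, "WARNING"
    else:
        state, label = OK, "OK"
    # Text after "|" is performance data in label=value;warn;crit form.
    return state, f"QUEUE {label} - depth={depth} | depth={depth};{warn};{crit}"


def main(argv):
    """Hypothetical entry point: the depth arrives as the first argument."""
    try:
        depth = int(argv[0])
    except (IndexError, ValueError):
        depth = None
    code, line = check_queue_depth(depth)
    print(line)
    return code
```

A wrapper script would finish with sys.exit(main(sys.argv[1:])) so that Nagios sees the exit code. How the depth is actually read from the service is service-specific and is left out here.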

KPI graphing

We gather metrics on the basic characteristics of services - response time, concurrent request rate, error rates. These are often gathered via a statistics API on the service. We can also gather them from haproxy when it is being used. Regardless of the source, new services need to be set up with appropriate graphing from day one.
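One way a service might collect these numbers for a statistics API is sketched below. ServiceStats and its field names are hypothetical; the sketch only counts requests, errors and in-flight requests and accumulates response time, and leaves exposing the snapshot (for example over HTTP) to the service.

```python
import threading
import time
from contextlib import contextmanager


class ServiceStats:
    """Hypothetical in-process counters for basic service metrics."""

    def __init__(self):
        self._lock = threading.Lock()
        self.requests = 0
        self.errors = 0
        self.in_flight = 0
        self.total_seconds = 0.0

    @contextmanager
    def track(self):
        """Wrap the handling of one request."""
        with self._lock:
            self.in_flight += 1
        start = time.monotonic()
        try:
            yield
        except Exception:
            with self._lock:
                self.errors += 1
            raise  # the caller still sees the failure
        finally:
            elapsed = time.monotonic() - start
            with self._lock:
                self.in_flight -= 1
                self.requests += 1
                self.total_seconds += elapsed

    def snapshot(self):
        """Return the current metrics as a plain dict, ready to serialise."""
        with self._lock:
            mean = self.total_seconds / self.requests if self.requests else 0.0
            return {
                "requests": self.requests,
                "errors": self.errors,
                "in_flight": self.in_flight,
                "mean_response_seconds": mean,
            }
```

A request handler would run its body inside "with stats.track():", and a graphing system could poll snapshot() at a fixed interval to derive rates.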

Access logs

Each service should log requests that go through it. Apache common log format is preferred, but anything will do. An unlogged service is extremely hard to diagnose.
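For a service that is not behind Apache, a common log format line can be produced by hand. The sketch below follows the usual layout (host, ident, user, timestamp, request line, status, size); the function name and parameters are assumptions for illustration, and a datetime without a timezone is assumed to be UTC.

```python
from datetime import datetime, timezone

# Fixed month names so the output does not depend on the process locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def common_log_line(host, method, path, status, size,
                    when=None, user="-", protocol="HTTP/1.1"):
    """Format one request as a common log format line."""
    when = when or datetime.now(timezone.utc)
    if when.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        when = when.replace(tzinfo=timezone.utc)
    timestamp = "{:02d}/{}/{}:{}".format(
        when.day, MONTHS[when.month - 1], when.year,
        when.strftime("%H:%M:%S %z"))
    size_field = str(size) if size else "-"  # "-" when no body was sent
    return (f'{host} - {user} [{timestamp}] '
            f'"{method} {path} {protocol}" {status} {size_field}')
```

For example, a GET of /status from 127.0.0.1 at 07:53:58 UTC on 2011-05-25 that returned 200 with 512 bytes is formatted as:

127.0.0.1 - - [25/May/2011:07:53:58 +0000] "GET /status HTTP/1.1" 200 512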

Error reporting

Errors need to be reported and aggregated - we treat failures in handling requests as critical bugs. Integrating with the OOPS tools system is the recommended way to do this; services that don't do that need something else to capture exceptions that occur within the service and collate them.
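For a service that cannot hook into the OOPS tools, capturing and collating exceptions could look roughly like the sketch below. This is not the OOPS API: ErrorCollator is a hypothetical class that groups failures by exception type and the function that raised, keeps a count per group, and retains the most recent traceback.

```python
import traceback
from collections import Counter


class ErrorCollator:
    """Hypothetical in-process collector that groups similar failures."""

    def __init__(self):
        self.counts = Counter()
        self.last_traceback = {}

    @staticmethod
    def signature(exc):
        """Group key: exception type plus the function that raised it."""
        frames = traceback.extract_tb(exc.__traceback__)
        where = frames[-1].name if frames else "?"
        return f"{type(exc).__name__} in {where}"

    def capture(self, func, *args, **kwargs):
        """Call func, recording any exception before re-raising it."""
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            key = self.signature(exc)
            self.counts[key] += 1
            self.last_traceback[key] = traceback.format_exc()
            raise
```

Wrapping each request handler in capture() gives a per-signature count that can be reported periodically, so repeated failures show up as one aggregated entry rather than as scattered log lines.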