We have some basic requirements that all services that make up Launchpad need to have for us to maintain and deploy them reliably.
No-downtime deployability
All services must be upgradeable without causing downtime in the rest of the system. Whether that is achieved via haproxy plus dual-running services or some other approach is service-specific. For HTTP services, haproxy is a sane default.
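For the haproxy dual-running approach, a minimal sketch of the idea looks like the following (the names, addresses and ports are illustrative, not our actual config); during an upgrade one server is put into maintenance, upgraded, and brought back before the other is drained:

```
frontend fe_service
    bind *:8080
    default_backend be_service

backend be_service
    # Two instances of the service behind one frontend.  To upgrade with
    # no downtime, drain svc1 (maintenance state), upgrade it, re-enable
    # it, then repeat for svc2.
    server svc1 10.0.0.11:8080 check
    server svc2 10.0.0.12:8080 check
```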
No single point of failure (SPOF)
Some services will be very hard to deliver without SPOFs; all the rest should not have any - and we must never factor something out of a non-SPOF situation and replace it with a SPOF. Extremely large scaling situations are an example of where a SPOF may be hard to avoid. Any exception to this requirement needs signoff from the Technical Architect.
Uptime and KPI monitoring
Our services are hooked into Nagios for monitoring of general availability and key elements (e.g. queue depth). While this is an ops responsibility, identifying service-specific KPIs should be done as part of the initial deploy, and for non-HTTP services, identifying how to monitor general availability should be done during the development effort.
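A service-specific KPI check of this kind can be sketched as a Nagios-style plugin; the queue-depth KPI and the thresholds below are illustrative assumptions, not values from any real service:

```python
#!/usr/bin/env python
# Sketch of a Nagios-style check for a service-specific KPI (queue depth).
# Nagios plugins communicate via exit code (0=OK, 1=WARNING, 2=CRITICAL)
# and a one-line status message on stdout.
import sys

OK, WARNING, CRITICAL = 0, 1, 2

def check_queue_depth(depth, warn=100, crit=500):
    """Return a (status, message) pair following Nagios plugin conventions."""
    if depth >= crit:
        return CRITICAL, "CRITICAL: queue depth %d >= %d" % (depth, crit)
    if depth >= warn:
        return WARNING, "WARNING: queue depth %d >= %d" % (depth, warn)
    return OK, "OK: queue depth %d" % depth

if __name__ == "__main__":
    # How the depth is obtained is service-specific; here it is just an
    # argument so the check logic can be exercised standalone.
    depth = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    status, message = check_queue_depth(depth)
    print(message)
    sys.exit(status)
```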
We gather metrics on the basic characteristics of services - response time, concurrent request rate, and error rates. These are often gathered by a statistics API on the service. We can also gather them from haproxy when it is being used. Regardless of the source, new services need to be set up with appropriate graphing from day one.
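The kind of counters a service's statistics API exposes can be sketched as below; the class and field names are illustrative, not a prescribed interface:

```python
# Sketch of in-process counters for the basic service characteristics
# mentioned above; a /stats-style endpoint would serve snapshot() as JSON.
import threading

class ServiceStats:
    """Tracks request count, error count and mean response time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.requests = 0
        self.errors = 0
        self.total_time = 0.0

    def record(self, duration, error=False):
        """Record one completed request and whether it errored."""
        with self._lock:
            self.requests += 1
            self.total_time += duration
            if error:
                self.errors += 1

    def snapshot(self):
        """Return a dict suitable for graphing or serving over HTTP."""
        with self._lock:
            mean = self.total_time / self.requests if self.requests else 0.0
            return {
                "requests": self.requests,
                "errors": self.errors,
                "mean_response_time": mean,
            }
```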
Each service should log the requests that go through it. Apache common log format is preferred, but anything will do: an unlogged service is extremely hard to diagnose.
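Emitting Apache common log format is cheap even outside a web framework; a minimal sketch (the helper name and parameters are illustrative):

```python
# Sketch of formatting one request in Apache common log format:
# host ident authuser [date] "request" status bytes
import time

def common_log_line(host, method, path, status, size, user="-", ident="-"):
    """Return a single request formatted as an Apache common log line."""
    timestamp = time.strftime("%d/%b/%Y:%H:%M:%S +0000", time.gmtime())
    return '%s %s %s [%s] "%s %s HTTP/1.1" %d %d' % (
        host, ident, user, timestamp, method, path, status, size)
```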
Errors need to be reported and aggregated - we treat failures in handling requests as critical bugs. Hooking into the OOPS tools system is the recommended way; for services that don't do that, we need something else to capture the exceptions that occur within the service and collate them.
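For a service not yet hooked into the OOPS tools, even a minimal collector is better than losing exceptions; a sketch (the class and its aggregation key are illustrative, not an OOPS API):

```python
# Sketch of capturing and collating unhandled exceptions in a service.
# Aggregating by exception type is a simplistic stand-in for what the
# OOPS system does properly.
import traceback
from collections import Counter

class ErrorCollector:
    def __init__(self):
        self.counts = Counter()   # exception type name -> occurrence count
        self.reports = []         # formatted exception summaries

    def record(self, exc):
        self.counts[type(exc).__name__] += 1
        self.reports.append(traceback.format_exception_only(type(exc), exc))

    def wrap(self, func):
        """Decorator: report any exception from func, then re-raise it."""
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                self.record(exc)
                raise
        return wrapper
```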
Access controls need to be defined - e.g. IP address restrictions, firewalling, and whatever authentication is needed.
Services are separate projects
See The anatomy of a service. We need to make a new project for the thing we deploy - see CreatingNewProjects. There are some corollaries to the data modelling, encapsulation and testing requirements.
Note that "import" here refers to Python imports - and is obviously irrelevant for a non-Python microservice ;). Take it as a broad statement: if a service were written in C, the rule would be phrased in terms of include.
- Code in the launchpad tree must not be imported into microservice trees directly or indirectly.
- Code in microservice trees must not be imported into other microservice trees, or the Launchpad tree directly or indirectly.
- Network test doubles must be supplied by microservices.
- Client code for a service should be in a language + stack specific tree. e.g. python+twisted, or go etc.
- Testing a microservice should not need any client code - services must not import their own client in test or prod [unless the service genuinely talks to itself].
- Releases (and deploys) of microservices must not need synchronisation with those of other microservices or Launchpad itself.
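The import rules above are mechanically checkable; a sketch of such a check (the function and the banned package names are illustrative, not an existing lint tool):

```python
# Sketch of enforcing the layering rules: scan a tree's Python sources
# for imports of forbidden top-level packages (e.g. importing the
# Launchpad tree from a microservice tree).
import ast
import os

def forbidden_imports(tree_root, banned=("lp",)):
    """Return (path, module) pairs for every banned import in the tree."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(tree_root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path) as source:
                module = ast.parse(source.read(), path)
            for node in ast.walk(module):
                if isinstance(node, ast.Import):
                    names = [alias.name for alias in node.names]
                elif isinstance(node, ast.ImportFrom):
                    names = [node.module or ""]
                else:
                    continue
                for modname in names:
                    if modname.split(".")[0] in banned:
                        hits.append((path, modname))
    return hits
```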
There are a few things that tie together to drive these rules about importing and how we share code.
The first one is having crisp, clean and enforced layers between services. It's extremely hard to do that (and stick to it) within a single Python VM. So we need accessing the implementation of a service to be a Big Deal - because making a clean contract for a service requires more thought than just poking at an internal aspect.
Because we're going to be doing HA, management and deployment separately from the existing appserver stack - and (in future) allowing direct use of some services - we need to treat each service as its own project needing backwards compatibility for clients (so that we can deploy the service without breaking LP). If the client code for a microservice is in the same tree, it becomes subject to loop-back testing, or echo-chamber defects. Having a completely separate client which runs its tests against the server-provided network fake gives great confidence that a release of the server has not broken the client: we can easily run the client tests against both the old and new server when changing the server, and run old and new versions of the client against the server when changing the client. Separate clients also avoid complexity with mismatched build and test systems (e.g. a Django client for the gpgverification service will want to run with the Django test runner, not e.g. Twisted's).
While in-process test doubles are nice when dealing with a fixed contract (like say HTTP), at some point we need to be sure that the code (in Launchpad, say) actually works over the network with the microservice. Having a test double that the microservice provides (and the microservice is then responsible for keeping the test and prod implementations synchronised), allows us to have great confidence that things will hang together correctly outside the test environment. We *can* also have totally in-process test doubles where appropriate, but those should be something provided by the service *client* code, not the service *server* code.
A similar echo-chamber effect exists when a service's contract is defined by a client module that depends on the service: our services really have to be simple enough to test clearly, so the syntactic sugar offered by a client must not be needed to test the service - if it is, we've failed.
We don't [yet] have test fakes for existing non-zope services. A test fake (something that exposes the same service contract but is fast to start up, lightweight, and possibly permits error injection) is a necessary condition for using a new service without sacrificing test coverage or permitting skew.
Generally speaking, a test fake should:
- start up in under a second
- accept parameters/environment variables describing where to find its dependencies
- print its dynamically allocated service point on stdout
- log to stdout (and/or stderr)
For bonus points they should:
- have a specific testing api allowing the user to inject faults (e.g. cause a request to break / work and so on)
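The checklist above can be sketched as a small standalone fake; everything here is illustrative (in particular the `/_inject` fault-injection path is a made-up name, not a convention we have):

```python
# Sketch of a test fake following the checklist: it binds port 0 so the
# OS allocates a free port, prints that port on stdout, logs to stderr,
# and has a testing-only endpoint (/_inject) that makes the next
# ordinary request fail with a 500.
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakeServiceHandler(BaseHTTPRequestHandler):
    fail_next = False

    def do_GET(self):
        if self.path == "/_inject":
            FakeServiceHandler.fail_next = True
            self._reply(200, b"fault armed")
        elif FakeServiceHandler.fail_next:
            FakeServiceHandler.fail_next = False
            self._reply(500, b"injected failure")
        else:
            self._reply(200, b"ok")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        sys.stderr.write(fmt % args + "\n")

def main():
    server = HTTPServer(("127.0.0.1", 0), FakeServiceHandler)
    # Announce the dynamically allocated service point on stdout.
    print(server.server_address[1])
    sys.stdout.flush()
    server.serve_forever()

if __name__ == "__main__":
    main()
```

A test can then start the fake, read the port, and drive the service's real client at it over the network.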
Nice to have
Nice-to-haves are just that - not mandatory (at this point in time), but if we can do them, great.
Page performance reports
We have a page performance report doing detailed analysis of the primary Launchpad Zope server. Other services don't have this, but it would be cool if they did.