Architectural Guide
In this guide you will find some expansion and clarification on the architectural values I presented in the [[https://docs.google.com/a/canonical.com/present/view?id=dgpdcfn9_4fd46fgcz&revision=_latest&start=0&theme=blank&authkey=CJWpj5EN&cwj=true|Launchpad Architectural Vision 2010]]. Be sure to look at the speaker's notes: they are the juicy bits.
All the code we write will meet these values to a greater or lesser degree. Where you can, please make choices that make your code more strongly meet these values.
Some existing code does not meet them well; this is simply an opportunity for big improvements - by increasing e.g. the transparency of existing code, operational issues and debugging headaches can be reduced without a great deal of work.
This guide is intended as a living resource: all Launchpad developers, and other interested parties, are welcome to join in and improve it.
Goals
The recommendations and suggestions in this guide are meant to help us reach a number of big-picture goals. We want Launchpad to be:
- Blazingly fast
- Always available
- Safe to change
- Simple to make, manage and use
- Flexible
(See the presentation for more details).
However, it's hard when making any particular design choice to be confident that it drives us towards these goals: they are quite high-level, and not directly related to code structure or quality.
Related documents
Of particular note, the PythonStyleGuide specifies coding style guidelines.
Values
There are a number of things, more closely related to code, which do help drive us towards our goals. These are values I (RobertCollins) hold dear; the more our code embodies them, the easier it will be to meet our goals.
The values are:
- Transparency
- Loose coupling
- Highly cohesive
- Testable
- Predictable
Transparency
Transparency speaks to the ability to analyse the system without dropping into pdb or taking guesses.
Some specific things that aid transparency:
- For blocking calls (SQL, bzr, email, librarian, backend web services, memcache) use lp.services.timeline.requesttimeline to record the call. This includes it in OOPS reports (see the sketch at the end of this section).
- fine-grained just-in-time logging (e.g. bzr's -Dhpssdetail option)
- live status information
- (+opstats, but more so)
- cron script status
- migration script status
- regular log files
- OOPS reports - lovely
We already have a lot of transparency. We can use more.
Aim for automation, developer usability, minimal losa intervention, and on-demand access.
When adding code to the system, ask yourself 'how will I troubleshoot this when it goes wrong without access to the machine it is running on'.
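For example, wrapping a blocking memcache call might look roughly like this. This is a sketch only: get_request_timeline and get_current_browser_request do exist in the tree, but the client, function and category names here are invented.
{{{#!python
# Sketch: recording a blocking backend call on the request timeline so it
# shows up in OOPS reports. memcache_client and the 'memcache-get'
# category are illustrative, not real Launchpad names.
from lazr.restful.utils import get_current_browser_request
from lp.services.timeline.requesttimeline import get_request_timeline

def cached_lookup(memcache_client, key):
    timeline = get_request_timeline(get_current_browser_request())
    action = timeline.start('memcache-get', key)  # records start + detail
    try:
        return memcache_client.get(key)
    finally:
        action.finish()  # the timed action now appears in any OOPS report
}}}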
Loose coupling
The looser the coupling between different parts of the system the easier it is to change them. Launchpad is pretty good about this in some ways due to the component architecture.
But it's not the complete story, and I think decreasing the coupling further will help the system.
I've seen some recent work on this, such as the jobs system and the buildd queue refactoring, which is excellent - generic pieces that can be used and reused. And we can go further: the job system is nice, but it's tightly coupled to the Launchpad DB; perhaps we could make it usable by other Canonical projects, or other Zope projects. Or perhaps move it to MQ and just have an adapter instead? Tasks running in a job could still talk to the DB.
The acid test for the coupling of a component is 'how hard is it to reuse?'
Of particular note, many changes in one area of the system (e.g. bugs) break tests in other areas (e.g. blueprints) - this adds a lot of developer friction and is a strong sign of overly tight coupling.
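As a small illustration, a consumer that depends on an interface rather than a concrete backend stays reusable when the backend changes. Everything below is an invented sketch, not a real Launchpad component:
{{{#!python
# Sketch only: IWorkQueue, DatabaseWorkQueue and schedule_scan are
# invented names used to show the shape of loose coupling.
from zope.interface import Interface, implementer

class IWorkQueue(Interface):
    """Something that can accept units of work."""

    def enqueue(job):
        """Add a job to the queue."""

@implementer(IWorkQueue)
class DatabaseWorkQueue:
    """A queue persisted in the database."""

    def __init__(self):
        self._jobs = []

    def enqueue(self, job):
        self._jobs.append(job)

def schedule_scan(queue, branch_id):
    # The caller depends only on IWorkQueue; swapping in an MQ-backed
    # implementation later needs no change here.
    queue.enqueue(('scan-branch', branch_id))
}}}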
Highly cohesive
The more things a component does, the harder it is to reason about it and performance tune it.
So this is "Do one thing well" in another setting.
The way I like to assess this is to look inside the component and see if it is doing one thing, or many things.
One common sign for a problem in this area is attributes (or persistent data) that are not used in many methods - that often indicates there is a separate component embedded in this one.
There are tradeoffs here due to database efficiency and normalisation, but it's still worth thinking about: narrower tables can perform better and use less memory, even if we do add extra tables to support them.
On a related note, the more clients a given component has, the wider its responsibilities and the more critical it becomes. That makes it easy to end up with too much in one component (lots of clients wanting things decreases the cohesion), leaving us with a large, unwieldy, critical component - not an ideal situation.
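As an illustration of the attribute smell mentioned above, and one possible split (all names are invented):
{{{#!python
# Before: the retry_* attributes are only touched by the retry logic -
# a sign of an embedded component.
class Mailer:
    def __init__(self, host, max_retries, retry_delay):
        self.host = host
        self.max_retries = max_retries  # only used in should_retry()
        self.retry_delay = retry_delay  # only used in should_retry()

    def should_retry(self, attempts):
        return attempts < self.max_retries

# After: the retry policy is its own small, cohesive, reusable component.
class RetryPolicy:
    def __init__(self, max_retries, retry_delay):
        self.max_retries = max_retries
        self.retry_delay = retry_delay

    def should_retry(self, attempts):
        return attempts < self.max_retries

class LeanMailer:
    def __init__(self, host, retry_policy):
        self.host = host
        self.retry_policy = retry_policy
}}}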
Testable
We write lots of unit and integration tests at the moment. However it's not always easy to test specific components - and the coupling of the components drives this.
The looser the coupling, the better in terms of having a very testable system. However loose coupling isn't enough on its own, so we should consider testability from a few angles:
- Can it be tested in isolation? If it can, developers can test it more easily and locally, without needing a full testbed environment every time.
- Can we load test it? Not everything needs this, but if we can't load test a component that we reasonably expect to see a lot of use, we may have unpleasant surprises down the track.
- Can we test it with broken backends/broken data? It is very nice to be confident that when (not if) a dependency breaks, the component will behave nicely. (A sketch of such a test follows below.)
It's also good to make sure that someone else maintaining the component later can repeat these tests and assess the impact of their changes.
Automation of this stuff rocks!
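A broken-backend test can be as simple as handing the component a stand-in dependency that always fails. The example below is entirely invented, not real Launchpad code:
{{{#!python
import unittest

class BrokenBackend:
    """A stand-in dependency that always fails."""

    def fetch(self, key):
        raise IOError('backend down')

class Fetcher:
    def __init__(self, backend):
        self.backend = backend

    def fetch_or_none(self, key):
        try:
            return self.backend.fetch(key)
        except IOError:
            return None  # degrade gracefully instead of oopsing

class TestFetcherWithBrokenBackend(unittest.TestCase):
    def test_returns_none_when_backend_down(self):
        self.assertEqual(None, Fetcher(BrokenBackend()).fetch_or_none('k'))

if __name__ == '__main__':
    unittest.main()
}}}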
Predictable
An extension of stability - servers should stay up, database load should be what it was yesterday, rollouts should move metrics in an expected direction.
Predictability isn't very sexy, but it's very useful: useful for capacity planning, useful for changing safely, useful for being highly available, and useful for letting us get on and do new/better things.
The closer to a steady state we can get, the more obvious it is when something is wrong.
Design metrics
This is an experiment: an attempt to put measurable figures on some metrics that hopefully relate well to the goals and values above. The Launchpad review team will be asking about these metrics in reviews - if your code doesn't meet one, that's *OK*: this is an experiment. Please note in the review that the metric seemed nuts/inapplicable, and we'll fold that into evolving things.
Performance
Document how components are expected to perform. Docstrings or doctests are great places to put this. E.g. "This component is expected to deal with < 100 bug tracker types; if we have more, this will need to be redesigned."
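For example, such an expectation might live in a docstring like this (the class name is made up):
{{{#!python
class BugTrackerTypeRegistry:
    """Registry of external bug tracker types.

    This component is expected to deal with < 100 bug tracker types;
    if we have more, this will need to be redesigned.
    """
}}}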
Testing
Tests for a class should complete in under 2 seconds. If they don't, spend at least a little time determining why.
Transparency
Behaviour of components should be analysable in lpnet without doing a 'losa ping': that is, if a sysadmin is needed to determine what's wrong with something, we've designed it wrong. Let's make sure there is a bug for that particular case, or, if possible, Just Fix It.
- Emit through Python logging at an appropriate importance level: warning or error for things operators need to know about, info or debug for things that don't normally indicate a problem.
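For instance (the logger name and mirror API below are invented for illustration):
{{{#!python
import logging

logger = logging.getLogger('lp.services.example')  # invented logger name

def sync_mirror(mirror):
    logger.debug('starting sync of %s', mirror.name)  # routine detail
    if mirror.is_stale():
        # Normal operation, but worth a record.
        logger.info('mirror %s was stale; resyncing', mirror.name)
    if not mirror.is_reachable():
        # Operators need to know about this.
        logger.warning('mirror %s unreachable; skipping', mirror.name)
}}}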
Coupling
A component should have no more than 5 dependencies.
Cohesion
Attributes should be used in more than 1/3 of a component's interactions. If they are used less often than that, consider deleting them or splitting them out into separate components.
- If you can split one class into two or more that would individually make sense and be simpler, do it.