Diff for "ArchitectureGuide"

Not logged in - Log In / Register

Differences between revisions 2 and 12 (spanning 10 versions)
Revision 2 as of 2010-09-01 19:36:44
Size: 1934
Editor: lifeless
Comment: more
Revision 12 as of 2010-10-13 15:49:32
Size: 8530
Editor: abentley
Comment:
Deletions are marked like this. Additions are marked like this.
Line 3: Line 3:
In this guide you will find some expansion and clarification on the architectural values I presented in the [[https://docs.google.com/a/canonical.com/present/view?id=dgpdcfn9_4fd46fgcz&revision=_latest&start=0&theme=blank&authkey=CJWpj5EN&cwj=true|Launchpad Architectural Vision 2010]] In this guide you will find some expansion and clarification on the architectural values I presented in the [[https://docs.google.com/a/canonical.com/present/view?id=dgpdcfn9_4fd46fgcz&revision=_latest&start=0&theme=blank&authkey=CJWpj5EN&cwj=true|Launchpad Architectural Vision 2010]]. Be sure to look at the speakers notes: they are the juicy bits.
Line 27: Line 27:
== Related documents ==

Of particular note the PythonStyleGuide is specifies coding style guidelines.
Line 41: Line 45:
Transparency speaks to the ability to analyse the system without dropping into pdb or taking guesses.

Some specific things that aid transparency:

 * For blocking calls (SQL, bzr, email, librarian, backend web services, memcache) use lp.services.timeline.requesttimeline to record the call. This includes it in OOPS reports.
 * fine grained just-in-time logging (e.g. bzr's -Dhpssdetail option)

 * live status information
  * (+opstats, but more so)
  * cron script status
  * migration script status

 * regular log files
 * OOPS reports - lovely

We already have a lot of transparency. We can use more.

Aim for automation, Developer usability, minimal losa-intervention, on-demand access.

When adding code to the system, ask yourself 'how will I troubleshoot this when it goes wrong without access to the machine it is running on'.
Line 42: Line 67:

The looser the coupling between different parts of the system the easier it is to change them. Launchpad is pretty good about this in some ways due to the component architecture.

But its not the complete story and I think decreasing the coupling more will help the system.

I've seen some recent work on this such as the jobs system and the buildd queue refactoring, which is excellent - generic pieces that can be used and reused.

The acid test for the coupling of a component is 'how hard is it to reuse?'

Of particular note, many changes in one area of the system (e.g. bugs) break tests in other areas (e.g. blueprints) - this adds a lot of developer friction and is a strong sign of overly tight coupling.
Line 45: Line 80:
; focused components The more things a component does, the harder it is to reason about it and performance tune it.

So this is "Do one thing well" in another setting.

The way I like to assess this is to look inside the component and see if it is doing one thing, or many things.

One common sign for a problem in this area is attributes (or persistent data) that are not used in many methods - that often indicates there is a separate component embedded in this one.

There are tradeoffs here due to database efficiency and normalisation, but its still worth thinking about: narrower tables can perform better and use less memory, even if we do add extra tables to support them.

On a related note the more clients using a given component, the wider its responsibilities and the more critical it becomes. Thats an easy situation to end up with too much in one component (lots of clients wanting things decreases the cohesion), and then we have a large unwieldy critical component - not an ideal situation.
Line 48: Line 94:
We write lots of unit and integration tests at the moment. However its not always easy to test specific components - and the coupling of the components drives this.

The looser the coupling, the better in terms of having a very testable system. However loose coupling isn't enough on its own, so we should consider testability from a few angles:

Can it be tested in isolation? If it can, it can be tested more easily by developers and locally without needing lots of testbed environment every time.

Can we load test it? Not everything needs this, but if we can't load test a component that we reasonably expect to see a lot of use, we may have unpleasant surprises down the track.

Can we test it with broken backends/broken data? It is very nice to be confident that when a dependency breaks (not if) the component will behave nicely.

Its also good to make sure that someone else maintaining the component later can repeat these tests and is able to assess the impact of their changes.

Automation of this stuff rocks!
Line 49: Line 109:
; in isolation, load tests, broken-backend tests
An extension of stability - servers should stay up, database load should be what it was yesterday, rollouts should move metrics in an expected direction.

Predictability isn't very sexy, but its very useful: useful for capacity planning, useful for changing safely, useful for being highly available, and useful for letting us get on and do new/better things.

The closer to a steady state we can get, the more obvious it is when something is wrong.

= Simple =
A design that allows for future growth is valuable, but it is not always clear how much growth to expect, or in the case of code extension, what kind. In this case, it is better to design the simplest thing that will work at the time being, and update the design when you have a better idea of what's needed. Simplicity also aids comprehension and reduces the surface area for bugs to occur.

Related ideas are KISS, YAGNI, and avoiding premature optimization, but it is always important to apply judgement. For example, avoiding premature optimization does not justify rolling your own inefficient sort function.

= Design metrics =

This is an experiment, an attempt to set a measurable figure on some metrics that hopefully relate well to the goals and values above. The Launchpad review team will be asking about these metrics in reviews - if your code doesn't meet one, thats *OK*: this is an experiment. Please note in the review that the metric seemed nuts/inapplicable, and we'll fold that into evolving things.

== Performance ==

Document how components are expected to perform. Docstrings or doctests are great places to put this. E.g. "This component is O(N) in the number of bug tasks associated with a bug." or "This component degrades very quickly as more bug tracker types are added." or "This component compares two user accounts in a reasonable time, but when comparing three or more it's unusable."

== Testing ==

Tests for a class should complete in under 2 seconds. If they aren't, spend at least a little time determining why.

== Transparency ==

Behaviour of components should be analysable in lpnet without doing a 'losa ping' : that is, if a sysadmin is needed to determine whats wrong with something; we've designed it wrong. Lets make sure there is a bug for that particular case, or if possible Just Fix It.

 * Emit through Python logging at an appropriate importance level: warning or error for things operators need to know about, info or debug for things that don't normally indicate a problem.

== Coupling ==
No more than 5 dependencies of a component.

== Cohesion ==
Attributes should be used in more than 1/3 of interactions. If they are used less often than that, consider deleting or splitting into separate components.

 * If you can split one class into two or more that would individually make sense and be simpler, do it.

Architectural Guide

In this guide you will find some expansion and clarification on the architectural values I presented in the Launchpad Architectural Vision 2010 (https://docs.google.com/a/canonical.com/present/view?id=dgpdcfn9_4fd46fgcz&revision=_latest&start=0&theme=blank&authkey=CJWpj5EN&cwj=true). Be sure to look at the speaker's notes: they are the juicy bits.

All the code we write will meet these values to a greater or lesser degree. Where you can, please make choices that make your code more strongly meet these values.

Some existing code does not meet them well; this is simply an opportunity for big improvements: by increasing, for example, the transparency of existing code, operational issues and debugging headaches can be reduced without a great deal of work.

This guide is intended as a living resource: all Launchpad developers, and other interested parties, are welcome to join in and improve it.

Goals

The recommendations and suggestions in this guide are meant to help us reach a number of big-picture goals. We want Launchpad to be:

  • Blazingly fast
  • Always available
  • Safe to change
  • Simple to make, manage and use
  • Flexible

(See the presentation for more details).

However, it's hard when making any particular design choice to be confident that it drives us towards these goals: they are quite specific, and not directly related to code structure or quality.

Related documents

Of particular note, the PythonStyleGuide specifies coding style guidelines.

Values

There are a number of things that are more closely related to code which do help drive us towards our goals. These are values I (RobertCollins) hold dear; the more our code meets them, the easier it will be to meet our goals.

The values are:

  • Transparency
  • Loose coupling
  • Highly cohesive
  • Testable
  • Predictable

Transparency

Transparency speaks to the ability to analyse the system without dropping into pdb or taking guesses.

Some specific things that aid transparency:

  • For blocking calls (SQL, bzr, email, librarian, backend web services, memcache), use lp.services.timeline.requesttimeline to record the call; this includes it in OOPS reports (see the sketch after this list).
  • Fine-grained just-in-time logging (e.g. bzr's -Dhpssdetail option)
  • Live status information
    • (+opstats, but more so)
    • cron script status
    • migration script status
  • Regular log files
  • OOPS reports - lovely
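
Here is a minimal, self-contained sketch of that recording pattern. The class and method names below are invented for illustration; in Launchpad code, prefer the real lp.services.timeline.requesttimeline API rather than this stand-in.

    # Illustrative stand-in for a request timeline: it records what blocking
    # calls happened and how long they took, which is the information an
    # OOPS report would include.
    import time
    from contextlib import contextmanager


    class Timeline:
        """Collects (category, detail, duration) entries for one request."""

        def __init__(self):
            self.actions = []

        @contextmanager
        def record_action(self, category, detail):
            start = time.monotonic()
            try:
                yield
            finally:
                self.actions.append(
                    (category, detail, time.monotonic() - start))


    timeline = Timeline()
    with timeline.record_action("sql", "SELECT * FROM bugtask WHERE bug = 1"):
        pass  # the blocking call goes here
    print(timeline.actions)  # what an OOPS report would show for this request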

We already have a lot of transparency. We can use more.

Aim for automation, developer usability, minimal LOSA intervention, and on-demand access.

When adding code to the system, ask yourself: 'How will I troubleshoot this when it goes wrong, without access to the machine it is running on?'

Loose coupling

The looser the coupling between different parts of the system the easier it is to change them. Launchpad is pretty good about this in some ways due to the component architecture.

But it's not the complete story, and I think decreasing the coupling further will help the system.

I've seen some recent work on this, such as the jobs system and the buildd queue refactoring, which is excellent: generic pieces that can be used and reused.

The acid test for the coupling of a component is 'how hard is it to reuse?'

Of particular note, many changes in one area of the system (e.g. bugs) break tests in other areas (e.g. blueprints) - this adds a lot of developer friction and is a strong sign of overly tight coupling.
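
To make the idea concrete, here is a hypothetical sketch (the classes are invented, not existing Launchpad code) of a loosely coupled component: it accepts its collaborator as a constructor argument instead of importing a concrete implementation, which is exactly what makes it easy to reuse and to test.

    # Hypothetical illustration of loose coupling via dependency injection.
    class BugNotifier:
        """Sends bug notifications through whatever mailer it is given."""

        def __init__(self, mailer):
            # Depends on a narrow interface (anything with a send method),
            # not on a concrete mail library.
            self.mailer = mailer

        def notify(self, address, bug_title):
            self.mailer.send(address, "Bug updated: %s" % bug_title)


    class RecordingMailer:
        """A stand-in mailer: handy in tests, trivially swappable."""

        def __init__(self):
            self.sent = []

        def send(self, address, body):
            self.sent.append((address, body))


    notifier = BugNotifier(RecordingMailer())
    notifier.notify("foo@example.com", "crash on startup")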

Highly cohesive

The more things a component does, the harder it is to reason about it and performance tune it.

So this is "Do one thing well" in another setting.

The way I like to assess this is to look inside the component and see if it is doing one thing, or many things.

One common sign of a problem in this area is attributes (or persistent data) that are not used by many methods; that often indicates there is a separate component embedded in this one.

There are tradeoffs here due to database efficiency and normalisation, but it's still worth thinking about: narrower tables can perform better and use less memory, even if we do add extra tables to support them.

On a related note, the more clients use a given component, the wider its responsibilities become and the more critical it is. That makes it easy to end up with too much in one component (lots of clients wanting things decreases the cohesion), leaving us with a large, unwieldy, critical component - not an ideal situation.
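
A hypothetical before-and-after sketch of that smell (invented names, not real Launchpad code): the mirror-related attributes below are only touched by one method, which suggests a separate component is hiding inside.

    # Low cohesion: only next_mirror_time() uses the mirror_* attributes.
    class Branch:
        def __init__(self, name, mirror_url, mirror_interval):
            self.name = name
            self.mirror_url = mirror_url
            self.mirror_interval = mirror_interval

        def display_name(self):
            return self.name

        def next_mirror_time(self, last_mirrored):
            return last_mirrored + self.mirror_interval


    # More cohesive: the mirroring policy becomes its own small component.
    class MirrorPolicy:
        def __init__(self, url, interval):
            self.url = url
            self.interval = interval

        def next_mirror_time(self, last_mirrored):
            return last_mirrored + self.interval


    class MirroredBranch:
        def __init__(self, name, mirror_policy):
            self.name = name
            self.mirror_policy = mirror_policy

        def display_name(self):
            return self.name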

Testable

We write lots of unit and integration tests at the moment. However, it's not always easy to test specific components - and the coupling of the components drives this.

The looser the coupling, the better in terms of having a very testable system. However loose coupling isn't enough on its own, so we should consider testability from a few angles:

Can it be tested in isolation? If it can, developers can test it more easily and locally, without needing a large testbed environment every time.

Can we load test it? Not everything needs this, but if we can't load test a component that we reasonably expect to see a lot of use, we may have unpleasant surprises down the track.

Can we test it with broken backends/broken data? It is very nice to be confident that when (not if) a dependency breaks, the component will behave gracefully.

It's also good to make sure that someone else maintaining the component later can repeat these tests and is able to assess the impact of their changes.

Automation of this stuff rocks!
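
As a sketch of a broken-backend test (the component and backend here are invented for illustration), a fake that always fails lets us assert that the component degrades gracefully:

    # Illustrative only: FeedFetcher and BrokenBackend are invented names.
    import unittest


    class FeedFetcher:
        """Returns cached items when the backend is unavailable."""

        def __init__(self, backend, cache=None):
            self.backend = backend
            self.cache = cache or []

        def latest(self):
            try:
                items = self.backend.fetch()
            except IOError:
                return self.cache  # degrade gracefully instead of oopsing
            self.cache = items
            return items


    class BrokenBackend:
        def fetch(self):
            raise IOError("backend unavailable")


    class TestFeedFetcherWithBrokenBackend(unittest.TestCase):
        def test_returns_cache_when_backend_fails(self):
            fetcher = FeedFetcher(BrokenBackend(), cache=["old item"])
            self.assertEqual(["old item"], fetcher.latest())


    if __name__ == "__main__":
        unittest.main()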

Predictable

An extension of stability - servers should stay up, database load should be what it was yesterday, rollouts should move metrics in an expected direction.

Predictability isn't very sexy, but it's very useful: useful for capacity planning, useful for changing safely, useful for being highly available, and useful for letting us get on and do new/better things.

The closer to a steady state we can get, the more obvious it is when something is wrong.

Simple

A design that allows for future growth is valuable, but it is not always clear how much growth to expect, or in the case of code extension, what kind. In that case, it is better to design the simplest thing that will work for the time being, and update the design when you have a better idea of what's needed. Simplicity also aids comprehension and reduces the surface area for bugs.

Related ideas are KISS, YAGNI, and avoiding premature optimization, but it is always important to apply judgement. For example, avoiding premature optimization does not justify rolling your own inefficient sort function.
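
For instance, a trivial sketch: Python's built-in sorted is both the simple choice and the efficient one, so "keeping it simple" never means hand-rolling your own sort.

    # The simple and efficient choice: the built-in sort, not a hand-rolled
    # (and later debugged) O(n**2) one.
    bugs = [{"id": 3, "heat": 10}, {"id": 1, "heat": 42}]
    hottest_first = sorted(bugs, key=lambda bug: bug["heat"], reverse=True)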

Design metrics

This is an experiment: an attempt to set a measurable figure on some metrics that hopefully relate well to the goals and values above. The Launchpad review team will be asking about these metrics in reviews - if your code doesn't meet one, that's *OK*: this is an experiment. Please note in the review that the metric seemed nuts/inapplicable, and we'll fold that feedback into evolving the metrics.

Performance

Document how components are expected to perform. Docstrings or doctests are great places to put this. E.g. "This component is O(N) in the number of bug tasks associated with a bug." or "This component degrades very quickly as more bug tracker types are added." or "This component compares two user accounts in a reasonable time, but when comparing three or more it's unusable."
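
For example, such a note might look like this in a docstring (the function below is hypothetical, for illustration only):

    def find_similar_bugs(bug, limit=10):
        """Return up to `limit` bugs that look similar to `bug`.

        Performance: O(N) in the number of bug tasks associated with `bug`;
        expect this to degrade quickly as more bug tracker types are added.
        """
        return []  # illustration only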

Testing

Tests for a class should complete in under 2 seconds. If they don't, spend at least a little time determining why.
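
One way to spot the slow ones locally, sketched with only the standard library (this is not a prescribed Launchpad tool, just an illustration):

    # Time each test case and report the slow ones.
    import time
    import unittest


    class TimedResult(unittest.TextTestResult):
        def startTest(self, test):
            self._started = time.monotonic()
            super().startTest(test)

        def stopTest(self, test):
            duration = time.monotonic() - self._started
            if duration > 2.0:
                print("SLOW (%.1fs): %s" % (duration, test.id()))
            super().stopTest(test)


    if __name__ == "__main__":
        unittest.main(
            testRunner=unittest.TextTestRunner(resultclass=TimedResult))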

Transparency

Behaviour of components should be analysable in lpnet without doing a 'losa ping': that is, if a sysadmin is needed to determine what's wrong with something, we've designed it wrong. Let's make sure there is a bug for that particular case, or, if possible, Just Fix It.

  • Emit through Python logging at an appropriate importance level: warning or error for things operators need to know about, info or debug for things that don't normally indicate a problem.
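
A minimal sketch of what that looks like with the standard logging module (the logger name below is hypothetical):

    import logging

    logger = logging.getLogger("lp.services.example")  # hypothetical name

    logger.debug("Fetched %d rows from the cache", 42)  # routine detail
    logger.info("Mirror run finished in %.1fs", 3.2)  # normal operation
    logger.warning("Backend slow; falling back to cache")  # operators care
    logger.error("Could not reach the librarian; giving up")  # needs action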

Coupling

A component should have no more than 5 dependencies.

Cohesion

Attributes should be used in more than 1/3 of a component's interactions. If they are used less often than that, consider deleting them or splitting them out into separate components.

  • If you can split one class into two or more that would individually make sense and be simpler, do it.
