Differences between revisions 31 and 32

The purpose of this template is to help us get ReadyToCode on features or tricky bugs as quickly as possible. See also LaunchpadEnhancementProposalProcess.

Feature Flags

(previously called Dynamic Configuration)

goal state: Launchpad has a registry of configuration options that can be changed by admins through the web ui, without restarting Launchpad.

As a Launchpad developer/operator
I want to turn features on and off without a heavyweight deployment
so that I can more adroitly test and deploy new features (A:B testing, long-running closed betas, etc)
and so that I can recover from emergencies by cutting off problem features

As a Launchpad user
I want you to tell me about impending downtime through the web ui
so that I can plan not to be using Launchpad when it's offline/readonly.

This is not a mechanism for general per-user configuration
This is not a complete replacement for static configuration or other parts of the deployment process
This only affects new code that specifically uses it; it doesn't magically affect existing code
This may become a replacement for things that are currently done through SQL queries or configuration files
This is site-wide, not per-appserver.
This does not yet replace the readonly-mode flag (implemented as a special file on disk) because it's special.
This embraces "feature flags" and more, such as site-wide notifications.

Scenarios:

Dark launches (aka embargoes: land code first, turn it on later)
Closed betas
Scram switches (omg daily builds are killing us, make it stop)
Soft/slow launch (let just a few users use it and see what happens)
Site-wide notification
Show an 'alpha', 'beta' or 'new!' badge next to a UI control, then later turn it off without a new rollout

Rationale

We want people to land features faster, and to deploy more often. Controlling when a feature is exposed to users, separately from when its code lands, may help.
This could support things like site-wide notifications which would help our users by warning them when Launchpad's about to go offline.
Some developers are interested in A:B testing of UIs and this would help with that too.
Some other sites find this very useful: see http://www.scribd.com/doc/16877392/10-Deploys-Per-Day-Dev-and-Ops-Cooperation-at-Flickr
At the 2010-02 team leads meeting there was enthusiastic support for feature flags but they've stalled.
Doing configuration changes through a branch, merge, landing and deploy is hugely expensive, compared to changing a web ui.
We've had problems with configuration being inadvertently set inconsistently across different servers.
Provides visibility into the system.

Stakeholders

Who really cares about this feature? When did you last talk to them?

LOSAs
Launchpad devs
Design group?
Architect and product strategist
Curtis
mthaddon
Gary

Constraints and Requirements

Must

A function that can be called from a template or other code, that tells you the value of a configuration item.
The function must be very cheap to call so that it does not cause performance problems even if it's called several times per request. (It should do at most one database query (of reasonable size) per request that cares about configuration.)
Feature flags can be used to hide or disable some user interface items.
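
One way to meet the cost requirement is to load every setting for the request's scopes once and answer later lookups from memory. The following is only a sketch: RequestFlags and load_settings are hypothetical names, not existing Launchpad code, and load_settings stands in for the single database query.

```python
class RequestFlags:
    """Per-request flag cache: at most one load per request (hypothetical)."""

    def __init__(self, load_settings, scopes):
        # load_settings stands in for the single database query; it takes
        # the active scopes and returns a dict of flag name -> value.
        self._load_settings = load_settings
        self._scopes = frozenset(scopes)
        self._settings = None  # filled lazily on the first lookup

    def get_flag(self, name):
        if self._settings is None:
            # Only requests that actually ask about a flag pay for the query.
            self._settings = self._load_settings(self._scopes)
        # Flags with no setting default to None.
        return self._settings.get(name)


# Usage: count how often the (fake) database is hit.
calls = []

def fake_load(scopes):
    calls.append(scopes)
    return {"soyuz.build_from_branch.ui_visible": "True"}

flags = RequestFlags(fake_load, {"default", "edge_server"})
first = flags.get_flag("soyuz.build_from_branch.ui_visible")
second = flags.get_flag("soyuz.build_from_branch.badge")
```

Here two lookups cause a single call to fake_load, and a request that never asks about a flag causes none.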

Nice to have

Configuration scopes:
- "on edge"
- "for authenticated/unauthenticated users"
- "in readonly mode"
- "for x% of users"
- "for users in the beta group"
Configuration that can be changed while in readonly mode.
Configuration is validated before it is applied: eg if something must be an IP address, we won't let the admin commit a change that makes it invalid.
Log of changes that were made, when, and by whom.
A machine-readable registry of known names, with a help string and a description of the type to be stored in them. (A little like the Mailman admin interface but much simpler.)
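
The registry and the validation step could share one structure: each known name carries a help string and a validator that is run before a change is committed. A rough sketch, in which REGISTRY, check_setting and the flag example.proxy.address are all invented for illustration:

```python
import ipaddress


def _validate_ip(value):
    # ipaddress.ip_address raises ValueError for anything that is not
    # a valid IPv4 or IPv6 address.
    ipaddress.ip_address(value)


def _validate_bool(value):
    if value not in ("True", "False"):
        raise ValueError("expected 'True' or 'False', got %r" % (value,))


# Hypothetical registry: flag name -> (help string, validator).
REGISTRY = {
    "soyuz.build_from_branch.ui_visible": (
        "Show the build-from-branch UI.", _validate_bool),
    "example.proxy.address": (
        "Address of an invented proxy, for illustration only.", _validate_ip),
}


def check_setting(name, value):
    """Return None if the setting is acceptable, else an error message."""
    if name not in REGISTRY:
        return "unknown flag: %s" % name
    _help, validator = REGISTRY[name]
    try:
        validator(value)
    except ValueError as error:
        return "invalid value for %s: %s" % (name, error)
    return None
```

The admin form would call check_setting for each edited row and refuse to commit if any call returns a message.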

Must not

Reduce test coverage by having code paths that are only hit when certain variables are set, with no tests that set those variables. (Using bzr-style scenario multiplication may help.)
Cause entanglement by having the same feature flag checked at many points in the code.

Subfeatures

Other LaunchpadEnhancementProposals that form a part of this one.

Site Wide Notification (to be written)

Workflows

What are the workflows for this feature?

Change configuration

A LOSA goes to https://launchpad.net/+config where they see a simple web form allowing them to edit the configuration.

Anyone else can see the configuration but cannot change it. (Perhaps we should hide it from people other than developers, but since they can see the source this may not matter...)

Provide mockups for each workflow.

Success

How will we know when we are done?

You can check flags in code or templates.
You can change the configuration.
People do actually change the configuration.

How will we measure how well we have done?

Adoption of feature flags.
Developers and LOSAs report satisfaction with the facility and it becomes a standard practice.

Thoughts?

Put everything else here. Better out than in.

As a general rule, each switch should be checked only once or only a few times in the codebase. We don't want to disable the same thing in the ui, the model, and the database.
Obviously it would be better not to ever have planned downtime. But...
Would this have helped with daily builds, or other things?
If we want to unify the edge and production appservers, this may help.
Having useful differences across edge and lpnet seems to imply having at least that level of scoping from the beginning.
Could get an interesting feedback loop between oops_per_second vs config changes.
How should these be tested? Perhaps we want a small number of tests that try flipping the flag and checking both ways works?
How to edit? One big textarea? How about races?
Which scope matches? Explicit ordering? Most-specific? Require no overlaps?
Perhaps you'll accumulate an ever-increasing inventory of configuration options that are never used, and will break if they are used. Perhaps a switch that has not been changed in the last year should be considered to be removed altogether.
Arguably we should couple together "this feature is only for beta users" with "this feature has a beta badge next to it", but perhaps it's simpler at this level and more flexible to just have separate flags for the two of them.
Should document a naming convention that explains what feature or thing each flag affects, and what kind of effect it has.
We need to create a culture in which people actually add and make use of flags; as part of our incident analysis we should consider whether adding a flag might have helped.
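
On the testing question, "flipping the flag and checking both ways" might look like the sketch below. It uses the standard unittest subTest helper rather than bzr-style scenario multiplication, and render_build_link is a toy stand-in for real view code.

```python
import unittest


def render_build_link(get_flag):
    """Toy view code: show the link only when the flag is set to 'True'."""
    if get_flag("soyuz.build_from_branch.ui_visible") == "True":
        return "<a>Build from branch</a>"
    return ""


class TestBuildLinkFlag(unittest.TestCase):

    def test_both_flag_states(self):
        # Each case is (flag value, whether the link should be shown).
        cases = [("True", True), ("False", False), (None, False)]
        for value, expect_link in cases:
            with self.subTest(flag_value=value):
                # The fake lookup ignores the name and returns this value.
                html = render_build_link(lambda name, v=value: v)
                self.assertEqual(expect_link, "Build from branch" in html)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBuildLinkFlag)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Passing the lookup function in keeps the test independent of whatever scopes the machine running it would compute.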

Implementation

The API is: getFlag(name, scopes=None) => value, probably living on a Zope utility.

Any particular request can be in several scopes, perhaps set(global, edge_server, beta_user, override). These can be inferred from the URL, the server static configuration, the user's group membership, perhaps other things.

The value for any flag is the highest-priority setting for any active scope.

If the scopes set is not passed to getFlag, in the web server it is computed from the request object. In other places like jobs or the code host we need to pass in some other object with similar info.

The database model is that there are various "configuration scopes" which each have a name. There is a total order between the scopes that defines the level of specificity: for instance we may have some settings that are active for the edge server, and some for beta user, and say that in case of a conflict the beta user setting has priority. A configuration variable can be defined up to once per configuration scope.

To look up the full set of active configuration variables, we look across the selected scopes and take the highest-priority setting. If we do not find a value for a particular setting it defaults to None.
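
The lookup rule can be sketched in a few lines of Python. This is an illustration rather than the planned implementation: SCOPE_ORDER follows the priority list given later in this section (the relative order of the four server scopes is a guess, and only one of them is active at a time anyway), and the "default" row for ui_visible is added here purely to show a conflict being resolved.

```python
# Scope names from most to least important.
SCOPE_ORDER = [
    "override",
    "edge_server_beta_user",
    "production_server",
    "edge_server",
    "staging_server",
    "dev_server",
    "beta_user",
    "default",
]

# (scope, name) -> value: a flag is defined at most once per scope.
SETTINGS = {
    ("edge_server_beta_user", "soyuz.build_from_branch.ui_visible"): "True",
    ("default", "soyuz.build_from_branch.ui_visible"): "False",  # illustrative
    ("default", "soyuz.build_from_branch.badge"): "beta",
    ("edge_server", "soyuz.build_from_branch.run_jobs"): "True",
}


def get_flag(name, scopes, settings=SETTINGS):
    """Return the highest-priority value among the active scopes, or None."""
    for scope in SCOPE_ORDER:
        if scope in scopes and (scope, name) in settings:
            return settings[(scope, name)]
    return None


beta_on_edge = {"default", "edge_server", "beta_user", "edge_server_beta_user"}
anonymous_on_production = {"default", "production_server"}
```

With these settings, get_flag("soyuz.build_from_branch.ui_visible", beta_on_edge) picks the edge_server_beta_user value over the default one, while the same lookup for anonymous_on_production falls through to the default.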

For any particular scope set it is a single SQL query to get the full environment of settings, something like the following (using PostgreSQL's DISTINCT ON, and assuming that a larger priority number means a more important scope):

{{{
select distinct on (configuration.name) configuration.name, value
from configuration natural join configuration_group
where group_id in %(scopes)s
order by configuration.name, configuration_group.priority desc
}}}

(Or one can of course query one value at a time.)
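
To try the single-query idea without a Launchpad database, here is a self-contained sketch using Python's sqlite3 module. SQLite is only a stand-in so that the example runs anywhere; the column layout is assumed from the table names above, a larger priority number is assumed to mean a more important scope, and load_settings is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE configuration_group (
        group_id TEXT PRIMARY KEY,
        priority INTEGER UNIQUE);
    CREATE TABLE configuration (
        group_id TEXT, name TEXT, value TEXT,
        UNIQUE (group_id, name));
    INSERT INTO configuration_group VALUES
        ('default', 0), ('edge_server', 10), ('override', 20);
    INSERT INTO configuration VALUES
        ('default', 'soyuz.build_from_branch.badge', 'beta'),
        ('default', 'soyuz.build_from_branch.run_jobs', 'False'),
        ('edge_server', 'soyuz.build_from_branch.run_jobs', 'True');
""")


def load_settings(conn, scopes):
    """Return {name: value}, taking the highest-priority active scope."""
    scopes = sorted(scopes)
    if not scopes:
        return {}
    marks = ", ".join("?" for _ in scopes)
    # For each name, keep the row whose scope has the largest priority
    # among the active scopes that define that name.
    query = """
        SELECT c.name, c.value
        FROM configuration c
        JOIN configuration_group g ON g.group_id = c.group_id
        WHERE c.group_id IN (%(marks)s)
          AND g.priority = (
              SELECT MAX(g2.priority)
              FROM configuration c2
              JOIN configuration_group g2 ON g2.group_id = c2.group_id
              WHERE c2.name = c.name AND c2.group_id IN (%(marks)s))
    """ % {"marks": marks}
    # The scope list is bound twice: once for each IN clause.
    return dict(conn.execute(query, scopes + scopes))


on_edge = load_settings(conn, {"default", "edge_server"})
elsewhere = load_settings(conn, {"default"})
```

The result is the whole environment of settings for the request, fetched in one round trip.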

Names look like dotted Python identifiers, with the form APP.FEATURE.EFFECT. The value is a Unicode string.

The admin gui can show the values grouped and sorted by scope.

We define the following scope priorities from most to least important. (The numbers are arbitrary, the order's all that matters.)

 override  -- can be used to mask out anything

 edge_server_beta_user -- set only when both are true

 production_server -- one of these is chosen based on the url or static config
 edge_server
 staging_server
 dev_server 

 beta_user -- set based on group membership; we can add more specific beta groups later

 default -- lowest priority and always set; used when we want a non-None default
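
Working out which of these scopes apply to a request might look like the sketch below. RequestInfo is a hypothetical stand-in for whatever is read from the URL, the static configuration and the user's group membership, and treating override as always active is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RequestInfo:
    """Hypothetical summary of the request details that select scopes."""
    server: str  # "production", "edge", "staging" or "dev"
    user_is_beta: bool = False


def active_scopes(info):
    """Return the set of scope names that apply to this request."""
    # 'default' is always set; 'override' is assumed to be always active
    # so that it can mask any other setting.
    scopes = {"default", "override", info.server + "_server"}
    if info.user_is_beta:
        scopes.add("beta_user")
        if info.server == "edge":
            # Set only when both conditions are true.
            scopes.add("edge_server_beta_user")
    return scopes


beta_on_edge = active_scopes(RequestInfo("edge", user_is_beta=True))
anonymous = active_scopes(RequestInfo("production"))
```

Jobs or the code host, which have no request object, could build a RequestInfo-like value from their own static configuration instead.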

examples:

scope	name	value	explanation
edge_server_beta_user	soyuz.build_from_branch.ui_visible	True
default	soyuz.build_from_branch.badge	beta	show "beta" icon next to the ui
edge_server	soyuz.build_from_branch.run_jobs	True
production_server	notification.global.message	Going down for an upgrade, should be back in 10m
production_server	notification.global.countdown_time	20101220T00:00	(show "in %d minutes" based on this)

Once the build_from_branch.ui_visible feature is stable, we would either set it to True in the default scope or, perhaps later, make it unconditionally enabled.

landing plans:

add db tables
add definitions of scopes
add utility with a method that looks up the value of a flag given a description of the active scopes (and tests)
add function that computes the right active scopes based on the request object
make sure tests are isolated from the scope determined by the machine where they're running
provide a way to test "as if on edge" etc
use it for site-wide notifications
show how to use in TAL
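
For the two testing items in this plan, one possible shape is a scope source that tests can temporarily pin, so that results do not depend on the machine running them. ScopeSource and as_if are hypothetical names used only for this sketch.

```python
from contextlib import contextmanager


class ScopeSource:
    """Supplies the active scopes, with a test-only override (hypothetical)."""

    def __init__(self, compute):
        self._compute = compute  # normal path: derive scopes from the environment
        self._forced = None      # set only while a test pins the scopes

    def current(self):
        if self._forced is not None:
            return set(self._forced)
        return set(self._compute())

    @contextmanager
    def as_if(self, *scopes):
        """Pretend, for the duration of the block, to be in these scopes."""
        previous = self._forced
        self._forced = frozenset(scopes)
        try:
            yield self
        finally:
            # Restore whatever was in force before, even if the test failed.
            self._forced = previous


source = ScopeSource(lambda: {"default", "dev_server"})
before = source.current()
with source.as_if("default", "edge_server"):
    during = source.current()
after = source.current()
```

A test that needs to run "as if on edge" wraps its body in source.as_if(...); everything else sees the normally computed scopes.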

Complicated alternative implementations

This should probably live on a zope utility? Is "config" confusable with other names, and if so what should we call it instead?

The flags are named with the same syntax as Python identifiers. All punctuation is reserved so that we can try scope selectors like server=edge/user_group=beta/soyuz_build_from_branch=True.

The value is a Unicode string.

We will add a machine-readable registry of known names, with a help string and a description of the type to be stored in them. (A little like the Mailman admin interface but much simpler.)

The values are stored in a database table, with two columns: name, value. (If we add scope selectors we'll add a third column, so you can quickly pull out all the rows possibly relevant to the name.) This means you perhaps can't change it while we're in readonly mode. Later we can split it to a separate replicated database, or to some non-sql database.

More examples:

sitewide_message=Going down for an upgrade, should be back in 10m
sitewide_countdown_time=20101220T00:00 (show "in %d minutes" based on this)
if server(edge): if user_in(beta): bug_page_new=True (show the new version of the bug page only on edge)
if user_subset(0,10): registry_layout_new=True (give users with id%10==0 a new layout to see how they like it)
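
The user_subset selector in the last example amounts to a modulus test on the user id. A minimal sketch, assuming the selector is given the user id along with the bucket and modulus (the three-argument signature is an illustration, not part of the proposal):

```python
def user_subset(user_id, bucket, modulus):
    """True when the user falls in the bucket: user_id % modulus == bucket."""
    return user_id % modulus == bucket


# Users 0..99: exactly one in ten lands in bucket 0 of 10.
selected = [uid for uid in range(100) if user_subset(uid, 0, 10)]
```

Because the test is deterministic, every flag that uses user_subset(0,10) selects the same users; varying the bucket per flag would spread experiments across different people.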

The story for how this works: a request comes in to the app server code, which calls config('bug_page_new'). (Based on this it will choose a different page template or turn on/off some parts of that template.) The config mechanism walks through the configuration settings looking for one that has the name 'bug_page_new' and matches the context. It checks for a match by looking at each selector in the setting and calling a callable looked up by the selector's name. In this case that is 'server', which looks in the request object for the vhost header.
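
That walk could be sketched as follows. The rule layout, the SELECTORS table and the context dictionary are all invented for illustration, as is the assumption that the edge server is recognised by a vhost beginning with "edge.".

```python
# Selector callables looked up by name; each gets the request context
# and the argument written in the rule.
SELECTORS = {
    "server": lambda context, arg: context.get("vhost", "").startswith(arg + "."),
    "user_in": lambda context, arg: arg in context.get("groups", ()),
}

# Each rule is (selectors, flag name, value); selectors are
# (selector name, argument) pairs that must all match.
RULES = [
    ([("server", "edge"), ("user_in", "beta")], "bug_page_new", "True"),
]


def config(name, context, rules=RULES):
    """Return the value of the first rule for `name` whose selectors all match."""
    for selectors, rule_name, value in rules:
        if rule_name != name:
            continue
        if all(SELECTORS[sel](context, arg) for sel, arg in selectors):
            return value
    return None


beta_on_edge = {"vhost": "edge.launchpad.net", "groups": ["beta"]}
beta_on_production = {"vhost": "launchpad.net", "groups": ["beta"]}
```

Here config('bug_page_new', beta_on_edge) returns "True", and the same call with beta_on_production returns None, so the new bug page appears only on edge.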

Maybe we don't need multiple levels: if we want things active only on edge for users in ~launchpad-beta, we define a selector function that composes those things.

Or we could eliminate the arguments to the selectors, and just make them simple callables.

Alternative language: put the name first and then the selectors, so that there's exactly one per name:

bug_page_new: server=edge,True,False

perhaps we should just use actual Python fragments:

bug_page_new = True if server=='edge' else False

(These could be actually evaluated by Python, or they could just look like Python.)

Or you could put all the logic into the app code and make the config a purely dumb dictionary:

  if user_in_beta or not config('bug_page_new_beta_only') ...

Perhaps the simplest thing would be to say there are several semi-statically-configured scopes, including "edge", "beta users", "everywhere" with a total ordering. We look through these in order for the relevant name. This would mean:

the configuration ui can be clear about how they interact
we don't need (or get) a minilanguage
the application code can do one query something like select * from configuration natural join configuration_group where group in ('edge', 'beta', ...) order by group_priority

-  ⇤ ← Revision 31 as of 2010-07-13 10:48:55 → 
  Size: 13686
  Editor: mbp
  Comment:
+   ← Revision 32 as of 2010-07-13 10:57:06 → ⇥
  Size: 14299
  Editor: mbp
  Comment:
 Line 130:
- * Perhaps needs a better name that "dynamic configuration" that's not confusable with static configuration.
-Line 157:
+Line 155:
+ * We need to create a culture that people do actually add and make use of flags; as part of our incident analysis we should consider whether adding a flag might have helped.
 Line 159:
-The API is: {{{config(name, scopes=None) => value}}}, probably living on a Zope utility.

If the scopes set is not specified, in the web server it is computed from the request object.  In other places like jobs or the code host we need to pass in some other object with similar info.  

The database model is that there are various "configuration scopes" which each have a name.  There is a total order between the scopes that defines the level of specificity: for instance we may have some settings that are active for the edge server, and some for beta user, and say that in case of a conflict the beta user setting has priority.  A configuration variable can be defined up to once per configuration scope.  Thus to look up the full set of active configuration variables, we look across the selected scopes and take the highest-priority setting.  If we do not find a value for a particular setting it defaults to None.

For any particular scope set it is a single SQL query to get the full environment of settings, something like: {{{select configuration.name, first(value) from configuration natural join configuration_group where group_id in %(scopes)s order by configuration_group.priority group by configuration.name}}}.  (Or one can of course query one value at a time.)
+The API is: {{{getFlag(name, scopes=None) => value}}}, probably living on a Zope utility.

Any particular request can be in several scopes, perhaps set(global, edge_server, beta_user, override).  These can be inferred from the URL, the server static configuration, the user's group membership, perhaps other things.

The value for any flag is the highest-priority setting for any active scope.

If the scopes set is not passed to getFlag, in the web server it is computed from the request object.  In other places like jobs or the code host we need to pass in some other object with similar info.  

The database model is that there are various "configuration scopes" which each have a name.  There is a total order between the scopes that defines the level of specificity: for instance we may have some settings that are active for the edge server, and some for beta user, and say that in case of a conflict the beta user setting has priority.  A configuration variable can be defined up to once per configuration scope.  

To look up the full set of active configuration variables, we look across the selected scopes and take the highest-priority setting.  If we do not find a value for a particular setting it defaults to None.

For any particular scope set it is a single SQL query to get the full environment of settings, something like: 
{{{select configuration.name, first(value)    from configuration natural join configuration_group where group_id in %(scopes)s order by configuration_group.priority group by configuration.name}}}.  (Or one can of course query one value at a time.)
-Line 171:
+Line 179:
-We define the following scope priorities from lowest to higher.  (The numbers are arbitrary, the order's all that matters.)
+We define the following scope priorities from most to least important.  (The numbers are arbitrary, the order's all that matters.)
-Line 174:
+Line 182:
-global
 200  staging_server
 210  edge_server
 220  production_server
 240  dev_server
 400  beta_user
 410  edge_server_beta_user
 420  staging_server_beta_user
 2000 override
+ override  -- can be used to mask out anything

 edge_server_beta_user -- set only when both are true

 production_server -- one of these is chosen based on the url or static config
 edge_server
 staging_server
 dev_server 

 beta_user -- set based on group membership; we can add more specific beta groups later

 default -- lowest priority and always set; used when we want a None-null default
-Line 184:
+Line 195:

launchpad development

Diff for "LEP/FeatureFlags"