The purpose of this template is to help us get ReadyToCode on features or tricky bugs as quickly as possible. See also LaunchpadEnhancementProposalProcess.
Dynamic Configuration
Launchpad has a registry of configuration options that can be changed by admins through the web ui, without restarting Launchpad.
As a Launchpad developer/operator
I want to turn features on and off without a heavyweight deployment
so that I can more adroitly test and deploy new features (A:B testing, long-running closed betas, etc)
and so that I can recover from emergencies by cutting-off problem features
As a Launchpad user
I want you to tell me about impending downtime through the web ui
so that I can plan not to be using Launchpad when it's offline/readonly.
- This is not a mechanism for general per-user configuration
- This is not a complete replacement for static configuration or other parts of the deployment process
- This only affects new code that specifically uses it; it doesn't magically affect existing code
- This may become a replacement for things that are currently done through SQL queries or configuration files
- This is site-wide not per-appserver.
- This does not yet replace the readonly-mode flag (implemented as a special file on disk) because it's special.
- This embraces "feature flags" and more, such as site-wide notifications.
Scenarios:
- Dark launches (aka embargoes: land code first, turn it on later)
- Closed betas
- Scram switches (omg daily builds are killing us, make it stop)
- Soft/slow launch (let just a few users use it and see what happens)
- Site-wide notification
- Show an 'alpha', 'beta' or 'new!' badge next to a UI control, then later turn it off without a new rollout
Rationale
- We want people to land features faster, and to deploy more often. Having control over when features are generally exposed separately from landing the code may help.
- This could support things like site-wide notifications which would help our users by warning them when Launchpad's about to go offline.
- Some developers are interested in A:B testing of UIs and this would help with that too.
- Some other sites find this very useful: see http://www.scribd.com/doc/16877392/10-Deploys-Per-Day-Dev-and-Ops-Cooperation-at-Flickr
- At the 2010-02 team leads meeting there was enthusiastic support for feature flags but they've stalled.
- Doing configuration changes through a branch, merge, landing and deploy is hugely expensive, compared to changing a web ui.
- We've had problems with configuration being inadvertently set inconsistently across different servers.
- Provides visibility into the system.
Stakeholders
Who really cares about this feature? When did you last talk to them?
- LOSAs
- Launchpad devs
- Design group?
- Architect and product strategist
- Curtis
- mthaddon
- Gary
Constraints and Requirements
Must
- A function that can be called from a template or other code, that tells you the value of a configuration item.
- The function must be very cheap to call so that it does not cause performance problems even if it's called several times per request. (It should do at most one database query (of reasonable size) per request that cares about configuration.)
- Feature flags can be used to hide or disable some user interface items.
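One way to meet the "at most one query per request" requirement is to load the whole environment lazily on first use and answer later lookups from memory. This is only a sketch: the `RequestConfig` class and the `load_all` callable are hypothetical names, not existing Launchpad code.

```python
class RequestConfig:
    """Per-request view of the configuration (hypothetical helper).

    `load_all` is any callable that returns a dict of name -> value;
    in a real deployment it would run the single database query.
    """

    def __init__(self, load_all):
        self._load_all = load_all
        self._values = None  # nothing is loaded until the first lookup

    def get(self, name, default=None):
        if self._values is None:
            # First lookup in this request: do the one expensive load.
            self._values = self._load_all()
        return self._values.get(name, default)


# Example: count how often the loader runs across several lookups.
calls = []

def fake_loader():
    calls.append(1)
    return {"soyuz.build_from_branch.ui_visible": "True"}

request_config = RequestConfig(fake_loader)
request_config.get("soyuz.build_from_branch.ui_visible")
request_config.get("soyuz.build_from_branch.badge")
```

A request that never asks for a flag never triggers the load, so pages that don't care about configuration pay nothing.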
Nice to have
- Configuration scopes:
- "on edge"
- "for authenticated/unauthenticated users"
- "in readonly mode"
- "for x% of users"
- "for users in the beta group"
- Configuration that can be changed while in readonly mode.
- Configuration is validated before it is applied: eg if something must be an IP address, we won't let the admin commit a change that makes it invalid.
- Log of changes that were made, when, and by whom.
- A machine-readable registry of known names, with a help string and a description of the type to be stored in them. (A little like the Mailman admin interface but much simpler.)
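The validation and registry items could fit together roughly as below. This is a minimal sketch: the `REGISTRY` layout, the validator functions and the `example.proxy.address` name are made up for illustration.

```python
import ipaddress


def _validate_bool(text):
    # Values are stored as strings, so accept only the two spellings we expect.
    if text not in ("True", "False"):
        raise ValueError("expected 'True' or 'False', got %r" % (text,))


def _validate_ip(text):
    # ipaddress.ip_address raises ValueError for anything that is not an address.
    ipaddress.ip_address(text)


# Hypothetical registry: name -> (help string, validator).
REGISTRY = {
    "soyuz.build_from_branch.ui_visible": (
        "Show the build-from-branch controls in the web ui.", _validate_bool),
    "example.proxy.address": (
        "Made-up setting that must hold an IP address.", _validate_ip),
}


def validate_change(name, value):
    """Return None if the change may be committed, else an error message."""
    if name not in REGISTRY:
        return "unknown configuration name: %s" % name
    _help_text, validator = REGISTRY[name]
    try:
        validator(value)
    except ValueError as error:
        return "invalid value for %s: %s" % (name, error)
    return None
```

The admin form would call `validate_change` before saving and refuse the commit if it returns a message.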
Must not
- Reduce test coverage by having code paths that are only hit when certain variables are set, with no tests that set those variables. (Using bzr-style scenario multiplication may help.)
- Cause entanglement by having the same feature flag checked at many points in the code.
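To keep both sides of a flag covered, one test can be run once per flag value. The sketch below uses the standard library's `subTest` rather than bzr's scenario multiplication; `FLAGS` and `render_build_link` are hypothetical stand-ins for the configuration store and the code under test.

```python
import unittest

# Stand-in for the configuration store (an assumption for this sketch).
FLAGS = {}


def render_build_link():
    # Hypothetical code under test; the flag is checked in exactly one place.
    if FLAGS.get("soyuz.build_from_branch.ui_visible") == "True":
        return "<a>Build from branch</a>"
    return ""


class TestBuildLink(unittest.TestCase):

    # Each scenario pairs a flag value with the output we expect for it.
    scenarios = [
        ("True", "<a>Build from branch</a>"),
        ("False", ""),
    ]

    def test_both_flag_states(self):
        for value, expected in self.scenarios:
            with self.subTest(flag=value):
                FLAGS["soyuz.build_from_branch.ui_visible"] = value
                self.assertEqual(render_build_link(), expected)
```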
Subfeatures
Other LaunchpadEnhancementProposals that form a part of this one.
- Site Wide Notification (to be written)
Workflows
What are the workflows for this feature?
Change configuration
LOSA goes to https://launchpad.net/+config where they see a simple web form allowing them to edit the configuration.
Developers can go there to see but not edit the configuration.
Normal users are not allowed to see it.
Provide mockups for each workflow.
Success
How will we know when we are done?
- You can check flags in code or templates.
- You can change the configuration.
- People do actually change the configuration.
How will we measure how well we have done?
- Adoption of feature flags.
- Developers and LOSAs report satisfaction with the facility and it becomes a standard practice.
Thoughts?
Put everything else here. Better out than in.
- Perhaps needs a better name than "dynamic configuration", one that's not confusable with static configuration.
- As a general rule, each switch should be checked only once or only a few times in the codebase. We don't want to disable the same thing in the ui, the model, and the database.
- Obviously it would be better not to ever have planned downtime. But...
- Would this have helped with daily builds, or other things?
- If we want to unify the edge and production appservers, this may help.
- Having useful differences across edge and lpnet seems to imply having at least that level of scoping from the beginning.
- Could get an interesting feedback loop between oops_per_second vs config changes.
- How should these be tested? Perhaps we want a small number of tests that try flipping the flag and checking that both ways work?
- How to edit? One big textarea? How about races?
- Which scope matches? Explicit ordering? Most-specific? Require no overlaps?
- Perhaps you'll accumulate an ever-increasing inventory of configuration options that are never used, and will break if they are used. Perhaps a switch that has not been changed in the last year should be considered for removal altogether.
- Arguably we should couple together "this feature is only for beta users" with "this feature has a beta badge next to it", but perhaps it's simpler at this level and more flexible to just have separate flags for the two of them.
- Should document a naming convention that explains what feature or thing this flag affects, and what kind of effect it has.
Implementation
The API is: config(name, scopes=None) => value, probably living on a Zope utility.
If the scopes set is not specified, in the web server it is computed from the request object. In other places like jobs or the code host we need to pass in some other object with similar info.
The database model is that there are various "configuration scopes" which each have a name and a total order between them. The order defines the level of specificity: for instance we may have some settings that are active for the edge server, and some for beta user, and say that in case of a conflict the beta user setting has priority. A configuration variable can be defined up to once per configuration scope. Thus to look up the full set of active configuration variables, we look across the selected scopes and take the highest-priority setting.
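The lookup rule described above can be sketched in plain Python. The priorities are copied from the scope list in this proposal; the `SETTINGS` dict and the function names are assumptions, and "global" is used as the lowest-priority scope.

```python
# Scope priorities as listed in this proposal; a higher number wins.
SCOPE_PRIORITY = {
    "global": 100,
    "edge_server": 210,
    "beta_user": 400,
    "edge_server_beta_user": 410,
}

# Hypothetical stored settings: at most one value per (scope, name) pair.
SETTINGS = {
    ("global", "soyuz.build_from_branch.ui_visible"): "False",
    ("edge_server_beta_user", "soyuz.build_from_branch.ui_visible"): "True",
    ("global", "soyuz.build_from_branch.run_jobs"): "False",
    ("edge_server", "soyuz.build_from_branch.run_jobs"): "True",
}


def active_settings(scopes, settings=SETTINGS):
    """Return name -> value, taking the highest-priority active scope."""
    best = {}
    for (scope, name), value in settings.items():
        if scope not in scopes:
            continue  # this scope does not apply to the current request
        priority = SCOPE_PRIORITY[scope]
        if name not in best or priority > best[name][0]:
            best[name] = (priority, value)
    return {name: value for name, (_priority, value) in best.items()}


def config(name, scopes, default=None):
    """Look up one variable for the given set of active scopes."""
    return active_settings(scopes).get(name, default)
```

With scopes `{"global", "edge_server"}`, `run_jobs` resolves to "True" because edge_server outranks global, while `ui_visible` stays "False" until the edge_server_beta_user scope is also active.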
For any particular scope set it is a single SQL query to get the full environment of settings, something like: select configuration.name, first(value) from configuration natural join configuration_group where group_id in %(scopes)s order by configuration_group.priority group by configuration.name. (Or one can of course query one value at a time.)
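A runnable variant of the single-query idea, using `sqlite3` purely as a stand-in database. The schema is an assumption: the `group_name` and `priority` column names are invented here. Because SQLite has no `first()` aggregate, this sketch orders rows by priority and lets higher-priority rows overwrite lower ones in Python. It also joins explicitly on `group_id`, since a natural join would match on every shared column name.

```python
import sqlite3


def load_settings(conn, scope_names):
    """Fetch the full environment for the given scopes in one query."""
    scope_names = list(scope_names)
    if not scope_names:
        return {}
    placeholders = ", ".join("?" for _ in scope_names)
    rows = conn.execute(
        "SELECT c.name, c.value "
        "FROM configuration AS c "
        "JOIN configuration_group AS g ON g.group_id = c.group_id "
        "WHERE g.group_name IN (%s) "
        "ORDER BY g.priority ASC" % placeholders,
        scope_names,
    ).fetchall()
    # Rows arrive lowest priority first, so later rows overwrite earlier ones.
    return dict(rows)


# Hypothetical schema and sample data for the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE configuration_group (
    group_id INTEGER PRIMARY KEY,
    group_name TEXT UNIQUE,
    priority INTEGER UNIQUE);
CREATE TABLE configuration (
    group_id INTEGER REFERENCES configuration_group,
    name TEXT,
    value TEXT,
    UNIQUE (group_id, name));
INSERT INTO configuration_group VALUES (1, 'global', 100);
INSERT INTO configuration_group VALUES (2, 'edge_server', 210);
INSERT INTO configuration VALUES (1, 'soyuz.build_from_branch.run_jobs', 'False');
INSERT INTO configuration VALUES (2, 'soyuz.build_from_branch.run_jobs', 'True');
INSERT INTO configuration VALUES (1, 'soyuz.build_from_branch.badge', 'beta');
""")

edge_settings = load_settings(conn, ["global", "edge_server"])
```

The unique constraint on `(group_id, name)` enforces "defined up to once per configuration scope", and the unique priority gives the total order between scopes.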
The name looks like dotted python identifiers, with the form APP.FEATURE.EFFECT. The value is a Unicode string.
The admin gui can show the values grouped and sorted by scope.
We define the following scopes:
100 global
200 staging_server
210 edge_server
220 production_server
230 staging_server
240 dev_server
400 beta_user
410 edge_server_beta_user
2000 override
Examples:
|| scope || name || value || explanation ||
|| edge_server_beta_user || soyuz.build_from_branch.ui_visible || True || ||
|| default || soyuz.build_from_branch.ui_visible || False || ||
|| default || soyuz.build_from_branch.badge || beta || show "beta" icon next to the ui ||
|| edge_server || soyuz.build_from_branch.run_jobs || True || ||
|| default || soyuz.build_from_branch.run_jobs || False || ||
|| production_server || notification.global.message || Going down for an upgrade, should be back in 10m || ||
|| production_server || notification.global.countdown_time || 20101220T00:00 || (show "in %d minutes" based on this) ||
Complicated alternative implementations
This should probably live on a zope utility? Is "config" confusable with other names, and if so what should we call it instead?
The flags are named with the same syntax as Python identifiers. All punctuation is reserved so that we can try scope selectors like server=edge/user_group=beta/soyuz_build_from_branch=True.
The value is a Unicode string.
We will add a machine-readable registry of known names, with a help string and a description of the type to be stored in them. (A little like the Mailman admin interface but much simpler.)
The values are stored in a database table, with two columns: name, value. (If we add scope selectors we'll add a third column, so you can quickly pull out all the rows possibly relevant to the name.) This means you perhaps can't change it while we're in readonly mode. Later we can split it to a separate replicated database, or to some non-sql database.
More examples:
sitewide_message=Going down for an upgrade, should be back in 10m
sitewide_countdown_time=20101220T00:00 (show "in %d minutes" based on this)
if server(edge): if user_in(beta): bug_page_new=True (show the new version of the bug page only on edge)
if user_subset(0,10): registry_layout_new=True (give users with id%10==0 a new layout to see how they like it)
The story for how this works: request goes in to the app server code which calls config('bug_page_new'). (Based on this it will choose a different page template or turn on/off some parts of that template.) The config mechanism walks through the configuration settings looking for one that has the name 'bug_page_new' and matches the context. It checks for matches in the context by looking at all the selectors and calling a callable looked up by name. In this case it is 'server' which will look in the request object for the vhost header.
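The walk described above might look roughly like this. Everything here is an assumption made for illustration: the request is modelled as a plain dict, the `host`, `teams` and `user_id` keys are invented, and so is the rule layout.

```python
def server(request, wanted):
    # Assumes the vhost looks like "edge.launchpad.net" for the edge server.
    return request.get("host", "").startswith(wanted + ".")


def user_in(request, group):
    return group in request.get("teams", ())


def user_subset(request, remainder, modulus):
    # Matches users whose id % modulus == remainder; anonymous users never match.
    user_id = request.get("user_id")
    return user_id is not None and user_id % modulus == remainder


# Selectors are looked up by name, as in the story above.
SELECTORS = {"server": server, "user_in": user_in, "user_subset": user_subset}

# Each rule: (list of (selector name, args), flag name, value).
RULES = [
    ([("server", ("edge",)), ("user_in", ("beta",))], "bug_page_new", "True"),
    ([("user_subset", (0, 10))], "registry_layout_new", "True"),
]


def config(name, request, rules=RULES, default=None):
    """Return the value of the first rule whose name and selectors all match."""
    for selectors, rule_name, value in rules:
        if rule_name != name:
            continue
        if all(SELECTORS[sel](request, *args) for sel, args in selectors):
            return value
    return default


edge_beta_request = {"host": "edge.launchpad.net", "teams": {"beta"}, "user_id": 20}
plain_request = {"host": "launchpad.net", "teams": set(), "user_id": 7}
```

Here `config("bug_page_new", edge_beta_request)` matches both selectors, while the same lookup for `plain_request` falls through to the default.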
Maybe we don't need multiple levels: if we want things active only on edge for users in ~launchpad-beta, we define a selector function that composes those things.
Or we could eliminate the arguments to the selectors, and just make them simple callables.
Alternative language: put the name first and then the selectors, so that there's exactly one per name:
bug_page_new: server=edge,True,False
Perhaps we should just use actual Python fragments:
bug_page_new = True if server=='edge' else False
(These could be actually evaluated by Python, or they could just look like Python.)
Or you could put all the logic into the app code and make the config a purely dumb dictionary:
if user_in_beta or not config('bug_page_new_beta_only') ...
Perhaps the simplest thing would be to say there are several semi-statically-configured scopes, including "edge", "beta users", "everywhere" with a total ordering. We look through these in order for the relevant name. This would mean:
- the configuration ui can be clear about how they interact
- we don't need (or get) a minilanguage
- the application code can do one query, something like select * from configuration natural join configuration_group where group in ('edge', 'beta', ...) order by group_priority