Branches and Revisions
The links between branches and revisions are currently (Sep 2010) handled using the BranchRevision table. There is one row in this table for every revision in the branch. This is mildly insane for the number of feature branches that we encourage projects to use as the vast majority of the revisions are common to the branches.
Consider the Launchpad project itself. There are over 90k revisions in the ancestry, so every branch adds 90k rows to the BranchRevision table.
Before we can simplify, reduce and clean up this relationship, we need to understand what the entries are used for.
Uses for the BranchRevision table.
- Branch page
- shows the recent commits for any given branch, up to 10, and includes those already in trunk
Is that actually wanted, though? I think people might prefer seeing only revisions created since branching. — AaronBentley
I think you are right for most branches. Developers of feature branches are only concerned with those since branching, but there are trunk branches where the information is (slightly) useful. Ideally I'd like this to come from loggerhead. — TimPenhey
- shows the recent commits for any given branch, up to 10, and includes those already in trunk
- Merge proposal page
- unmerged revisions (up to 10 - confusing ui)
- commits since the start of the review
- Finding the most relevant branch for any given revision (primarily used in the revision feeds)
What about just keeping track of which branch introduced the revision? — AaronBentley
This approximation is probably fine. — TimPenhey
- Allocating revision karma
Is the branch relevant here, or just the project/package? — AaronBentley
Just the project/package that the branch is connected to. — TimPenhey
- Merge detection in the scanner
- Is the tip of this branch in the ancestry of the development focus branch?
- And if scanning a series linked branch, is the tip of any unmerged branches of the same target present in my ancestry?
- I feel that this use case is the harder one to solve if we keep a limited ancestry.
Meta: For revision feeds and karma, it seems like we're using a list of all branches containing the revision to find a single branch containing the revision-- if we just store the single branch, we can be more efficient.
Possible Solutions
Delta-compress the branch-revision table
This solution is highly compatible with our existing approach. It is a trade-off of performance for space, but with care, the performance reduction may be unobservable. It applies to all use cases.
Use loggerhead
This applies only to display use cases-- Branch revision listings possibly merge proposal revision listings
Scan only for revisions in the current branch that merge the tips of other branches
In the common case, adding a revision to a branch does not enable detecting a merge, because the revision being added to the branch will already be in the ancestry of the merging branch. The exceptions are new branches (which generally should not be set to merged) and ghost-filling. Ghost-filling is believed to be extremely rare.
Store only tip revision info and do multiple DB queries
This models the underlying branches well, but has performance costs. It applies to display use cases.
Store tip revision info, and group revisions by ancestry
Storing groups of, say, 100 revisions according to ancestry would allow retrieving the latest revisions in one or two single database queries and then doing in-memory graph operations. This models the underlying branches well, and applies to display use cases. It could be implemented to provide fast ghost-filling.
Use bzr-history-db
This is similar to "Store tip revision info, and group revisions by ancestry", but only lefthand history is included in the "groups", which are referred to as "revision ranges" in bzr-history-db.
This is a form of delta compression-- for any revision, it would be possible to look up the branches that include it (by traversing revision groups) and to look up all the revisions included by a branch. However, like "store tip revision info, and group revision by ancestry", it is biased toward fast lookups of recent data.
Loggerhead will be switching to bzr-history-db, so it would be advantageous if it could share the database with Launchpad. (Perhaps a separate db that only LP could write to?)
Possible downside: Merges are stored relatively expensively-- for revisions foo, bar, baz, if foo merges bar, and bar merged baz, then baz gets two rows (one for foo, one for baz), and baz gets one row. However, merges do not need to be tracked for mainlines we don't care about. However, we probably care about more than 50% of mainlines for some projects (e.g. Launchpad).
Store most-relevant branch on Revision
Since there is only one most-relevant branch, we do not need the one-to-many relationship that BranchRevision provides. However, if the most relevant branch is deleted, we would either need to accept a NULL field or find a new most-relevant branch. If we allow the field to become NULL, we can call it "introducing branch" rather than "most-relevant" branch. This supports allocating revision karma and revision feeds.
Store introducing project/package on Revision
This supports the Revision Karma use case. It is not affected by branch deletion, but will not track branch moves. It is subject to project/package deletion.
Associate merge proposals with Revisions
This supports the use case of displaying unmerged revisions.
Store the last 10 revisions for a branch
This supports the use case of displaying branch revisions.
Analysis of history-db
Current schema
CREATE TABLE revision ( db_id INTEGER PRIMARY KEY AUTOINCREMENT, revision_id TEXT NOT NULL, gdfo INTEGER NOT NULL ); CREATE TABLE parent ( child INTEGER REFERENCES revision NOT NULL, parent INTEGER REFERENCES revision NOT NULL, parent_idx INTEGER NOT NULL, -- 0 == left-hand parent CONSTRAINT parent_is_unique UNIQUE (child, parent, parent_idx) ); CREATE TABLE ghost ( db_id INTEGER PRIMARY KEY REFERENCES revision ); CREATE TABLE dotted_revno ( tip_revision INTEGER REFERENCES revision NOT NULL, merged_revision INTEGER REFERENCES revision NOT NULL, revno TEXT NOT NULL, end_of_merge BOOL NOT NULL, merge_depth INTEGER NOT NULL, dist INTEGER NOT NULL, -- Offset from tip, so we preserve the order CONSTRAINT dotted_revno_key UNIQUE (tip_revision, merged_revision) ); CREATE TABLE mainline_parent_range ( pkey INTEGER PRIMARY KEY AUTOINCREMENT, head INTEGER REFERENCES revision NOT NULL, tail INTEGER REFERENCES revision, -- NULL indicates start-of-history -- tail is *not* included in the mainline_parent table count INTEGER NOT NULL -- num in range, inclusive ); CREATE TABLE mainline_parent ( range INTEGER REFERENCES mainline_parent_range NOT NULL, revision INTEGER REFERENCES revision NOT NULL, dist INTEGER NOT NULL -- Offset from head, so we preserve the order -- Not adding the constraint at this time, but it is logically there -- CONSTRAINT mainline_parent_rev_unique UNIQUE (range, revision) );
Use case analysis
We will use Branch.last_scanned_id for the "tip revision" in these examples.
Merge proposal page
I think the simplest approach is to use a bzrlib Graph for this. To create a Graph, we need a ParentsProvider. The contract is that ParentsProvider.get_parent_map(list_of_revids) returns a dict like {revision_id_1: list_of_parents, revision_id_2: list_of_parents}. We can do this using the tables we already have in Launchpad, but not efficiently.
We can improve efficiency by retrieving more revisions at once. For each revision_id we are requested to retrieve, we can
- Of all its entries in mainline_parent, select the range of the row where it is closest to HEAD.
- retrieve all mainline_parent entries with the same range
- retrieve all the dotted_revno entries whose "tip" revision is referenced by one of the mainline_parent revisions we retrieved
- retrieve all the revision entries that are referenced by a dotted_revno.merged_revision entry.
This provides a large slice of the ancestry of a revision, which can be cached. (The ParentsProvider can be wrapped in a CachingParentsProvider.) It is likely that all future queries will hit the cache, rather than hitting the database, because Graph calls do not traverse deeper than necessary into the ancestry.
Once we have a Graph, finding unmerged revisions is trivial: Graph.find_difference() can be used to determine which revisions are unique to the source branch. Then we use Graph.iter_lefthand_ancestry (or algorithm 1 for the Branch page) to find the last ten revisions, and determine whether they are unique to the source branch.
Branch page
Algorithm 1
Of all mainline_parent entries that refer to the tip revision, select the range of the one where it is closest to the head.
Find all mainline_parent entries (and referenced revision entries) with that range whose head_offset is greater than tip's.
- If fewer than 10 revisions (including tip) are retrieved this way, find the mainline_parent_range for them, find the tail, find the mainline_parent_range where the tail is the "tip", and repeat step 2.
Algorithm 2
Create a bzrlib Graph (see above). Call Graph.iter_lefthand_history and retrieve the first 10 revisions.
Finding the most relevant branch for any given revision (primarily used in the revision feeds)
Algorithm 1
Unlike bzr repositories, we can traverse from a revision into its (known) descendants. We can invert the operation used for the branch page:
Find all mainline_parent rows referring to the target revision.
Find all mainline_parent rows (and referenced revision rows) with same range as the target's mainline_parent rows, and whose head_offset is less than target's.
Find all mainline_parent_range rows whose tail is the same as the highest-retrieved revision.
For each mainline_parent_range, find all mainline_parent rows for this range
- GOTO 3
When no more matching mainline_parent_ranges can be found, you have a list of all the revisions that are descendants of the target revision. Now, retrieve all the branches which have one of these revisions as their tip. Now determine which branch is "most relevant".
There are a couple of problems finding children via mainline_parent_range. First, when a branch sees a new revision, it will get a new range (to include the new tip), but we have to decide what to do with the old range. The existing code leaves it, because that table is still pretty small, but eventually cruft could overwhelm it. Regardless, if you know the branches of a project by a given user, it seems better to start with that as your search tips. Note also that the 'revision' table has gdfo, so you can know if you've searched past the point where the revision could be in the ancestry. -- JohnArbashMeinel
Algorithm 2
As above, but change the data structures such that each revision can appear in only one revision range.
This option requires more investigation to ensure it has desirable performance characteristics.
Algorithm 3
Walk from parents to children using the parent table, until you find all descendant revisions of the branch.
This option is slow due to performing a database query for each step.
Algorithm 4
Start with all potentially-relevant branches and walk backwards until you find the revision. You can fail early if the current revision's GDFO is less than the target revision's. Then find all branches that have descendants as tips. Then determine which is most relevant.
This option performs best if the branches are closely related and the current revision's GDFO is high relative to the GDFOs of candidate branch tips.
Allocating revision karma
See above.
Merge detection in the scanner
With my recent changes, detection of merges into the scanned branch doesn't need any info from the database except the relevant last_scanned_ids.
If the scanned branch has a new revision that has been merged into another branch and is a descendant of the previous last_scanned_id, the merge has already been detected.
If the tip is changed arbitrarily (e.g. push --overwrite), then it's possible that a merge could be detected. In this case, the scanned branch's repository may not contain entries for the potential mergers, so a Graph must be constructed using the branch's repository's ParentsProvider and the DB-backed parents provider. (See "merge proposal page above".)
Results importing Launchpad Branches
I currently have access to 4153 branches of Launchpad hosted on Launchpad itself. (Apparently there are another 1700 or so private branches I can't see.)
This breaks down into:
137 tips/Abandoned.txt 399 tips/Development.txt 6 tips/Experimental.txt 3 tips/Mature.txt 3608 tips/Merged.txt 4153 total
I ran an import of each set, in the order of "Mature, Development, Experimental, Abandoned, Merged", and computed some basic statistics on the result.
Set |
DB Size |
total branches |
dotted_revno rows |
revision rows |
mainline_parent rows |
mainline_parent_range rows |
Mature |
32.2MB |
3 |
130,099 |
93,307 |
14,117 |
145 |
Development |
55.3MB |
402 |
489,822 |
96,303 |
38,277 |
574 |
Experimental |
55.8MB |
408 |
497,635 |
96,347 |
38,581 |
580 |
Abandoned |
60.0MB |
545 |
561,380 |
96,945 |
45,639 |
722 |
Merged |
131.6MB |
4153 |
1,705,554 |
97,023 |
240,671 |
4,402 |
From that, we can get some idea about statistics.
- It took 15s to import all of 'devel', and 5s to import 'db-devel'. It then took 12m20s to import all Merged branches. An average of 200ms per Merged branch.
- 4,150 non-Mature branches added 1,575,455 rows to dotted_revno. An average of 379 rows per branch. I have a strong feeling this is fairly skewed, and depends mostly on how much a given branch merges trunk back before it lands. (Some will have low 10s of rows, some will have 1000s). The Development branches averaged 894 rows.
Mainline_parent_rows does go up, but far slower than dotted_revno. mainline_parent_range basically ends up as 1 range per branch. (Meaning that branches average less than 100 new mainline revisions.) The code currently favors expanding to 100. So if a branch walks back 3 steps and finds a range that is 70 long, it will create a new range that is 73 long, rather than a range 3 followed by a range 70, and then range 100s.
- I ran another script which used the parent graph to determine size-of-ancestry for every revision. I then used that information to determine
how this would compare to the existing BranchRevision table. Assuming that every tip ends up having a complete ancestry. I got: 4153 tips 327,838,696 ancestors So we are down to 1.7M rows in the largest table, vs 328M rows in
BranchRevision (192:1).