Code/BranchRevisions

Not logged in - Log In / Register

Branches and Revisions

The links between branches and revisions are currently (Sep 2010) handled using the BranchRevision table. There is one row in this table for every revision in the branch. This is mildly insane for the number of feature branches that we encourage projects to use as the vast majority of the revisions are common to the branches.

Consider the Launchpad project itself. There are over 90k revisions in the ancestry, so every branch adds 90k rows to the BranchRevision table.

Before we can simplify, reduce and clean up this relationship, we need to understand what the entries are used for.

Uses for the BranchRevision table.

Meta: For revision feeds and karma, it seems like we're using a list of all branches containing the revision to find a single branch containing the revision-- if we just store the single branch, we can be more efficient.

Possible Solutions

Delta-compress the branch-revision table

This solution is highly compatible with our existing approach. It is a trade-off of performance for space, but with care, the performance reduction may be unobservable. It applies to all use cases.

Use loggerhead

This applies only to display use cases-- Branch revision listings possibly merge proposal revision listings

Scan only for revisions in the current branch that merge the tips of other branches

In the common case, adding a revision to a branch does not enable detecting a merge, because the revision being added to the branch will already be in the ancestry of the merging branch. The exceptions are new branches (which generally should not be set to merged) and ghost-filling. Ghost-filling is believed to be extremely rare.

Store only tip revision info and do multiple DB queries

This models the underlying branches well, but has performance costs. It applies to display use cases.

Store tip revision info, and group revisions by ancestry

Storing groups of, say, 100 revisions according to ancestry would allow retrieving the latest revisions in one or two single database queries and then doing in-memory graph operations. This models the underlying branches well, and applies to display use cases. It could be implemented to provide fast ghost-filling.

Use bzr-history-db

This is similar to "Store tip revision info, and group revisions by ancestry", but only lefthand history is included in the "groups", which are referred to as "revision ranges" in bzr-history-db.

This is a form of delta compression-- for any revision, it would be possible to look up the branches that include it (by traversing revision groups) and to look up all the revisions included by a branch. However, like "store tip revision info, and group revision by ancestry", it is biased toward fast lookups of recent data.

Loggerhead will be switching to bzr-history-db, so it would be advantageous if it could share the database with Launchpad. (Perhaps a separate db that only LP could write to?)

Possible downside: Merges are stored relatively expensively-- for revisions foo, bar, baz, if foo merges bar, and bar merged baz, then baz gets two rows (one for foo, one for baz), and baz gets one row. However, merges do not need to be tracked for mainlines we don't care about. However, we probably care about more than 50% of mainlines for some projects (e.g. Launchpad).

Store most-relevant branch on Revision

Since there is only one most-relevant branch, we do not need the one-to-many relationship that BranchRevision provides. However, if the most relevant branch is deleted, we would either need to accept a NULL field or find a new most-relevant branch. If we allow the field to become NULL, we can call it "introducing branch" rather than "most-relevant" branch. This supports allocating revision karma and revision feeds.

Store introducing project/package on Revision

This supports the Revision Karma use case. It is not affected by branch deletion, but will not track branch moves. It is subject to project/package deletion.

Associate merge proposals with Revisions

This supports the use case of displaying unmerged revisions.

Store the last 10 revisions for a branch

This supports the use case of displaying branch revisions.

Analysis of history-db

Current schema

CREATE TABLE revision (
    db_id INTEGER PRIMARY KEY AUTOINCREMENT,
    revision_id TEXT NOT NULL,
    gdfo INTEGER NOT NULL
);
CREATE TABLE parent (
    child INTEGER REFERENCES revision NOT NULL,
    parent INTEGER REFERENCES revision NOT NULL,
    parent_idx INTEGER NOT NULL, -- 0 == left-hand parent
    CONSTRAINT parent_is_unique UNIQUE (child, parent, parent_idx)
);
CREATE TABLE ghost (
    db_id INTEGER PRIMARY KEY REFERENCES revision
);
CREATE TABLE dotted_revno (
    tip_revision INTEGER REFERENCES revision NOT NULL,
    merged_revision INTEGER REFERENCES revision NOT NULL,
    revno TEXT NOT NULL,
    end_of_merge BOOL NOT NULL,
    merge_depth INTEGER NOT NULL,
    dist INTEGER NOT NULL, -- Offset from tip, so we preserve the order
    CONSTRAINT dotted_revno_key UNIQUE (tip_revision, merged_revision)
);
CREATE TABLE mainline_parent_range (
    pkey INTEGER PRIMARY KEY AUTOINCREMENT,
    head INTEGER REFERENCES revision NOT NULL,
    tail INTEGER REFERENCES revision, -- NULL indicates start-of-history
    -- tail is *not* included in the mainline_parent table
    count INTEGER NOT NULL -- num in range, inclusive
);
CREATE TABLE mainline_parent (
    range INTEGER REFERENCES mainline_parent_range NOT NULL,
    revision INTEGER REFERENCES revision NOT NULL,
    dist INTEGER NOT NULL -- Offset from head, so we preserve the order
    -- Not adding the constraint at this time, but it is logically there
    -- CONSTRAINT mainline_parent_rev_unique UNIQUE (range, revision)
);

Use case analysis

We will use Branch.last_scanned_id for the "tip revision" in these examples.

Merge proposal page

I think the simplest approach is to use a bzrlib Graph for this. To create a Graph, we need a ParentsProvider. The contract is that ParentsProvider.get_parent_map(list_of_revids) returns a dict like {revision_id_1: list_of_parents, revision_id_2: list_of_parents}. We can do this using the tables we already have in Launchpad, but not efficiently.

We can improve efficiency by retrieving more revisions at once. For each revision_id we are requested to retrieve, we can

  1. Of all its entries in mainline_parent, select the range of the row where it is closest to HEAD.
  2. retrieve all mainline_parent entries with the same range
  3. retrieve all the dotted_revno entries whose "tip" revision is referenced by one of the mainline_parent revisions we retrieved
  4. retrieve all the revision entries that are referenced by a dotted_revno.merged_revision entry.

This provides a large slice of the ancestry of a revision, which can be cached. (The ParentsProvider can be wrapped in a CachingParentsProvider.) It is likely that all future queries will hit the cache, rather than hitting the database, because Graph calls do not traverse deeper than necessary into the ancestry.

Once we have a Graph, finding unmerged revisions is trivial: Graph.find_difference() can be used to determine which revisions are unique to the source branch. Then we use Graph.iter_lefthand_ancestry (or algorithm 1 for the Branch page) to find the last ten revisions, and determine whether they are unique to the source branch.

Branch page

Algorithm 1

  1. Of all mainline_parent entries that refer to the tip revision, select the range of the one where it is closest to the head.

  2. Find all mainline_parent entries (and referenced revision entries) with that range whose head_offset is greater than tip's.

  3. If fewer than 10 revisions (including tip) are retrieved this way, find the mainline_parent_range for them, find the tail, find the mainline_parent_range where the tail is the "tip", and repeat step 2.

Algorithm 2

Create a bzrlib Graph (see above). Call Graph.iter_lefthand_history and retrieve the first 10 revisions.

Finding the most relevant branch for any given revision (primarily used in the revision feeds)

Algorithm 1

Unlike bzr repositories, we can traverse from a revision into its (known) descendants. We can invert the operation used for the branch page:

  1. Find all mainline_parent rows referring to the target revision.

  2. Find all mainline_parent rows (and referenced revision rows) with same range as the target's mainline_parent rows, and whose head_offset is less than target's.

  3. Find all mainline_parent_range rows whose tail is the same as the highest-retrieved revision.

  4. For each mainline_parent_range, find all mainline_parent rows for this range

  5. GOTO 3

When no more matching mainline_parent_ranges can be found, you have a list of all the revisions that are descendants of the target revision. Now, retrieve all the branches which have one of these revisions as their tip. Now determine which branch is "most relevant".

Algorithm 2

As above, but change the data structures such that each revision can appear in only one revision range.

This option requires more investigation to ensure it has desirable performance characteristics.

Algorithm 3

Walk from parents to children using the parent table, until you find all descendant revisions of the branch.

This option is slow due to performing a database query for each step.

Algorithm 4

Start with all potentially-relevant branches and walk backwards until you find the revision. You can fail early if the current revision's GDFO is less than the target revision's. Then find all branches that have descendants as tips. Then determine which is most relevant.

This option performs best if the branches are closely related and the current revision's GDFO is high relative to the GDFOs of candidate branch tips.

Allocating revision karma

See above.

Merge detection in the scanner

With my recent changes, detection of merges into the scanned branch doesn't need any info from the database except the relevant last_scanned_ids.

If the scanned branch has a new revision that has been merged into another branch and is a descendant of the previous last_scanned_id, the merge has already been detected.

If the tip is changed arbitrarily (e.g. push --overwrite), then it's possible that a merge could be detected. In this case, the scanned branch's repository may not contain entries for the potential mergers, so a Graph must be constructed using the branch's repository's ParentsProvider and the DB-backed parents provider. (See "merge proposal page above".)

Results importing Launchpad Branches

I currently have access to 4153 branches of Launchpad hosted on Launchpad itself. (Apparently there are another 1700 or so private branches I can't see.)

This breaks down into:

   137 tips/Abandoned.txt
   399 tips/Development.txt
     6 tips/Experimental.txt
     3 tips/Mature.txt
  3608 tips/Merged.txt
  4153 total

I ran an import of each set, in the order of "Mature, Development, Experimental, Abandoned, Merged", and computed some basic statistics on the result.

Set

DB Size

total branches

dotted_revno rows

revision rows

mainline_parent rows

mainline_parent_range rows

Mature

32.2MB

3

130,099

93,307

14,117

145

Development

55.3MB

402

489,822

96,303

38,277

574

Experimental

55.8MB

408

497,635

96,347

38,581

580

Abandoned

60.0MB

545

561,380

96,945

45,639

722

Merged

131.6MB

4153

1,705,554

97,023

240,671

4,402

From that, we can get some idea about statistics.

  1. It took 15s to import all of 'devel', and 5s to import 'db-devel'. It then took 12m20s to import all Merged branches. An average of 200ms per Merged branch.
  2. 4,150 non-Mature branches added 1,575,455 rows to dotted_revno. An average of 379 rows per branch. I have a strong feeling this is fairly skewed, and depends mostly on how much a given branch merges trunk back before it lands. (Some will have low 10s of rows, some will have 1000s). The Development branches averaged 894 rows.
  3. Mainline_parent_rows does go up, but far slower than dotted_revno. mainline_parent_range basically ends up as 1 range per branch. (Meaning that branches average less than 100 new mainline revisions.) The code currently favors expanding to 100. So if a branch walks back 3 steps and finds a range that is 70 long, it will create a new range that is 73 long, rather than a range 3 followed by a range 70, and then range 100s.

  4. I ran another script which used the parent graph to determine size-of-ancestry for every revision. I then used that information to determine

    how this would compare to the existing BranchRevision table. Assuming that every tip ends up having a complete ancestry. I got: 4153 tips 327,838,696 ancestors So we are down to 1.7M rows in the largest table, vs 328M rows in

    BranchRevision (192:1).

Code/BranchRevisions (last edited 2010-10-26 14:45:32 by abentley)