yellow/EmailNumberEstimation

Not logged in - Log In / Register

Rendering of reStructured text is not possible, please install Docutils.
Emails for a subscription
=========================

Overview
--------
 * BugNotification — 1 month of notifications (~90k rows)
 * BugNotificationArchive — all of them (4M rows)
 * BugActivity — linked from BugNotification, ~5M rows

Problem statement
~~~~~~~~~~~~~~~~~
 * How many emails is a certain subscription going to generate?
 * We can consider two separate cases: when we already have historical data for the subscription (i.e. it has been active and producing actual emails for a while), a "historical" case below, and a case where we don't have any historical data (i.e. it hasn't been active long enough, or it is just being created), or a "future" case below
 * Doesn't need to be completely correct, an order of magnitude is enough, though such things need to be carefully worded to users


Historical
----------
"This subscription has produced X emails in the last week/month."

Proposal solutions:

 * Directly analyze BugNotification to get this data

   * We'd want to add a direct link to StructuralSubscription/BugSubscriptionFilter to be able to easily and cheaply calculate how many emails have been produced in the last X amount of time; this might be troubling when a bug notification is part of multiple subscriptions/filters: which one do we link to?

   * We need to take into consideration that BugNotifications do not correspond 1-to-1 with produced emails: some might be discarded because they are "undo" actions, and others are simply grouped together; my suggestion would be just to find a mean ratio and use that in calculations (get a total number of corresponding BugNotifications, and divide it by the pre-set factor — factor should be based on data that we already have in the database)

   * This wouldn't perform too bad (O(n) for the database, with n being the number of bugnotifications); for potentially largest subscriptions (all of Ubuntu), n~=8000 for the past month (staging data), and BugNotification table size is ~91k rows.

   * Adding another column might force us to add it to BugNotificationArchive (all notifications older than 1 month), which is a big table (4M rows), thus enlarging our DB footprint

 * Process BugActivity through newly introduced link from BugNotification

   * This processing would be equivalent to what's already happening for email sending, which means that it might be slightly more complex solution if we want to reuse existing processing without duplicating code

   * Keep a cached value for number of emails sent asynchronously, and store that on the StructuralSubscription/BugSubscriptionFilter — we don't really care about having latest and completely precise data, so updating the cache once a week should be enough

   * Benefit of this approach is that it's very fast for quering from the web app (O(1) for the database — directly get a value on the filter), but async nature always makes stuff harder to keep track of

   * Another benefit is that estimation based on BugActivity might be useful for figuring out the number of emails a new subscription *will* generate (see below)

Summary:
~~~~~~~~
This is achievable without too much effort, and performance directly scales with effort put into the implementation.

We can also combine both approaches (process BugNotification and store a cached value on subscription itself) to get benefits of both, and avoid the complexity of the second proposal.

Future
------
"Adding this subscription will get you roughly X emails per week."

When you introduce a new subscription, it'd be nice to estimate how many emails you'll get.

 * Special case: entire pillar (eg. distribution, distroseries, product): we can query BugActivity for all bugs relating to a pillar that have seen activity in the last month; my staging tests for Ubuntu (slowest case) have shown that these queries are reasonably fast after appropriate indexes get inside the DB cache (see https://pastebin.canonical.com/43513/ — we can expect to see slowest queries to take less than a second); that'd still be too slow imo

Proposal solutions for everything else:

 * Bug activity processing — because of the number of all potential combinations of subscription filters—there's too many of them—we can't cache any data, so it'd have to be calculated live

   * It would be relatively slow, so it would have to be done async: perhaps AJAX is the best solution here, but the backend process could be the same code that estimates everything based on BugActivity for "historical" data

   * I am still not sure if this would be viable for our biggest projects (known as Ubuntu :), because we'd have to process BugActivity rows on the order of 10k (last three months for Ubuntu have been 7k, 5k, 7k)

   * I have considered basing this on previous BugNotifications for a pillar and only processing them, but that would depend on there being at least a single catch-all subscription (and for someone who doesn't ignore self-generated bug mail), so I scratched that option out

 *  Guessing based on general parameters (i.e. "projects usually get this much email", or things like "if you've got only LIFECYCLE emails selected, that on average produces only 1/3 of emails", or "Status changes are 1/2 of emails" or even something smart as "for every open bug, it gets closed 1.05 times" causing that many emails etc); this could probably made to perform reasonably well live even, but would involve a bit more investigative work

Summary:
~~~~~~~~
This is potentially doable, with the risk of it not performing too well; depending on the approach for the first step, it might be easy or not so easy.

Overlapping subscriptions
-------------------------
"This subscription overlaps with another one"

This problem boils down to the problem of estimating the number of emails for a subscription that is intersection of two subscriptions, which brings the problem to figuring out the intersection of two subscriptions.

With that we can answer all the questions using the following:

 * A, B – overlapping subscriptions
 * A∩B — intersection (this is still a subscription filter, containing only the conditions that appear in both filters)
 * n(S) — number of emails for a subscription (filter) S

Eg.

 * How many emails will A and B generate together? n(A∪B)=n(A)+n(B)-n(A∩B)
 * How many emails will I stop receiving if I remove subscription A? n(A)-n(A∩B)
 * ...

(We can easily construct a ring using operations of union and intersection, and it would be closed to the application of them, and then define a vector space with n() as the metric in it :)

Summary:
~~~~~~~~
All that $@#%!$ basically just says: yes, this is doable, and it's as complex as the problem of getting an intersection of filters is complex (provided we already have methods to estimate number of emails for an arbitrary subscription; in essense this assumes solving the "future" part of the problem).

yellow/EmailNumberEstimation (last edited 2011-02-17 11:18:27 by danilo)