## Template for LP Production Meeting logs. Just paste xchat log below and the format IRC line will take care of formatting correctly
#format IRC
#startmeeting
Meeting started at 09:00. The chair is matsubara.
Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues.
[TOPIC] Roll Call
New Topic: Roll Call
Not on the Launchpad Dev team? Welcome! Come "me" with the rest of us!
me
me
Ursinha, flacoste, bigjools, intellectronica, herb
me
me
bac, ping
me
matsubara, already answered
me
rockstar, hi
me
matsubara, hi
me
ok, stub can join later. everyone else is here.
[TOPIC] Agenda
New Topic: Agenda
* Actions from last meeting
* Oops report & Critical Bugs
* Operations report (mthaddon/herb/spm)
* DBA report (DBA contact)
[TOPIC] * Actions from last meeting
New Topic: * Actions from last meeting
* stub to investigate the fix to avoid staging restore problems
* matsubara to chase rockstar about a fix for OOPS-1138CEMAIL12
* asked jml about this. It's bug 326056 and had importance raised.
* cprov and bigjools to investigate OOPS-1145EA14
* Ursinha to file bugs:
* Bug 333072: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1143EB189
* Bug 333071: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1145EA14
Launchpad bug 326056 in launchpad-bazaar "OOPS on BadStateTransition when reviewing code by mail" [High,Triaged] https://launchpad.net/bugs/326056
Launchpad bug 333072 in soyuz "AttributeError OOPS on Build:+index" [Undecided,Invalid] https://launchpad.net/bugs/333072
Launchpad bug 333071 in soyuz "AssertionError OOPS on +copy-packages" [High,Triaged] https://launchpad.net/bugs/333071
333072 is invalid
bigjools, any news about 333071?
yes, it's not too serious, we've set it for 2.2.3
it's a corner case in the copying despite the doom-mongering error message
ok. thanks bigjools
[action] matsubara to chase stub about staging restore problems
ACTION received: matsubara to chase stub about staging restore problems
[TOPIC] * Oops report & Critical Bugs
New Topic: * Oops report & Critical Bugs
* matsubara hands Ursinha the mic
* Ursinha looks
* rockstar runs
registry, foundations, code and bugs: oopses for you
Registry:-
https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153E919
https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153A1135 (or foundations, not sure)
Foundations:-
https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153D667
Code:
https://devpad.canonical.com/~jamesh/oops.cgi/1153E919
https://devpad.canonical.com/~jamesh/oops.cgi/1153A1135
https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1152XMLP1
https://devpad.canonical.com/~jamesh/oops.cgi/1153D667
Bugs:
https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1152EA162
~
rockstar, ha!
rockstar, have you seen this one: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1152XMLP1?
Ursinha, looking at all of them now.
rockstar, you can just look at code's one :)
sinzui, hi
sinzui, I'm not sure if https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153A1135 is foundations or registry
https://devpad.canonical.com/~jamesh/oops.cgi/1153A1135
Ursinha, looks like registry
Ursinha: strange. do you see lots of those?
Ursinha: yes, looks like registry
intellectronica, no, actually not
intellectronica, but never saw one of those before so better bring to attention
intellectronica, Ursinha that one looks like caused by the rollout
Ursinha I don't know the answer either.
I will look into it and assign it. I suspect salgado-afk is working on
matsubara, even in the time it happened?
matsubara: i also thought so
* salgado-afk is now known as salgado
but it is quite early
intellectronica, I've discarded the rollout possibility because of its timestamp
sinzui, thanks for that
yeah, too early to be caused by the rollout.
intellectronica, can you take a look then, please?
Ursinha, I'll have to investigate our oops. It's the XML-RPC server, and it requires the sacrifice of a virgin goat.
check OSAs incident log to see if something happened during that time
so, this isn't really a bugs oops, but i don't know whether it's rollout-related or not. fwiw it's more than three hours before rollout, so it's hard to see how it would be related
rockstar, oh, I have a bunch here in my backyard if you need some
Ursinha, :)
intellectronica, I'll do what matsubara suggested
[action] ursinha to check OSAs incident log to help identify cause of OOPS-1152EA162
ACTION received: ursinha to check OSAs incident log to help identify cause of OOPS-1152EA162
thanks intellectronica and matsubara
[action] rockstar to investigate xmlrpc oops OOPS-1152XMLP1
ACTION received: rockstar to investigate xmlrpc oops OOPS-1152XMLP1
flacoste, hi
Translations is happy, that POFile:+translate dropped from the timeout top ten now .. btw ;)
henninge, indeed, congrats to translate team :)
* stub (n=stub@canonical/launchpad/stub) has joined #launchpad-meeting
translations
there he is :)
Ursinha: thank you, I will pass it on.
sinzui, about the other oops
Sorry - on a call and didn't realize the time
bac: can you look at it.
Ursinha: they seem to be related
(acting for sinzui today)
* sinzui is in another meeting
hmm
i'd say registry
* cumulus007 (n=sander@unaffiliated/cumulus007) has joined #launchpad-meeting
yes, i think registry for both
Ursinha are you talking about OOPS-1153A1135?
https://devpad.canonical.com/~jamesh/oops.cgi/1153A1135
bac, hi :)
so, can you take a look in both oopses? do you need me to file bugs about them?
* Ursinha looks
Ursinha: yes i'll look at them both
i can open the bugs unless you need the karma
flacoste, no, https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153D667
https://devpad.canonical.com/~jamesh/oops.cgi/1153D667
bac, haha, no
Ursinha: that's also a registry query
[action] bac to file bugs and take care of https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153E919 and https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153A1135
https://devpad.canonical.com/~jamesh/oops.cgi/1153E919
https://devpad.canonical.com/~jamesh/oops.cgi/1153A1135
[action] bac to file bugs for OOPS-1153E919 and OOPS-1153A1135
ACTION received: bac to file bugs for OOPS-1153E919 and OOPS-1153A1135
https://devpad.canonical.com/~jamesh/oops.cgi/1153E919
https://devpad.canonical.com/~jamesh/oops.cgi/1153A1135
https://devpad.canonical.com/~jamesh/oops.cgi/1153E919
https://devpad.canonical.com/~jamesh/oops.cgi/1153A1135
wow, y'all are insistent today! :)
:)
flacoste, hm. thanks
bac, can you take a look at that too?
which? promise not to paste the oops again
* danilo-afk is now known as danilos
bac, https://devpad.canonical.com/~jamesh/oops.cgi/1153D667
I tried :)
* bac looks
yes
bac, thanks
that's all from me from the oops land
[action] bac to also file a bug and take care of OOPS-1153D667 https://devpad.canonical.com/~jamesh/oops.cgi/1153D667
ACTION received: bac to also file a bug and take care of OOPS-1153D667 https://devpad.canonical.com/~jamesh/oops.cgi/1153D667
ok, thanks everyone.
[TOPIC] * Operations report (mthaddon/herb/spm)
New Topic: * Operations report (mthaddon/herb/spm)
there's one critical bug, though
argh
bad bad timing
shall I wait for the critical bug?
danilo is handling the critical bug, so won't duplicate what's in the bug report.
herb, just a second, let me check with henninge
it's bug 334787
matsubara, okay, if you say so
Launchpad bug 334787 in rosetta "Ubuntu packagers are not translation editors (assertion error)" [Critical,In progress] https://launchpad.net/bugs/334787
let's move on
go ahead herb, thanks
2009-02-20 - We had an issue that may have caused some users to experience intermittent outages on Launchpad. I worked with joey and flacosted to find the issue. joey's notes were sent to the list. I would be interested in hearing any updates we might have on this issue.
2009-02-21 and 2009-02-22 - It appears we had a bit of buggy code land on edge that caused a performance problem on both edge and production. The revision was backed out and I believe the code has been fixed.
2009-02-26 - We rolled out 2.2.2 based on r7763
We continue to see problems relating to bug #156453 and bug #118625. So much so that we're going to start bouncing codebrowse regularly to hopefully head off any issues. I want to emphasize that this will be masking the problem and we really do need to find the root cause and fix it.
Launchpad bug 156453 in loggerhead "production loggerhead branch leaks memory" [Critical,Triaged] https://launchpad.net/bugs/156453
Launchpad bug 118625 in launchpad-bazaar "codebrowse sometimes hangs" [High,Triaged] https://launchpad.net/bugs/118625
Bug #260171 continues to creep up regularly (every few days). This is already marked as high and I know that mwhudson's plate is full with codebrowse issues, but can we get an update on this one?
Bug 260171 on http://launchpad.net/bugs/260171 is private
* herb somehow managed to change flacoste into a verb.
matsubara, Ursinha: I am running tests on the critical bug fix, will let you know once it has landed
i saw! i've been flacosted!
danilos, thanks
thanks danilos
rockstar, can you bring up the codebrowse issue to the code team?
matsubara, everyday. :)
rockstar, thanks :-)
Codebrowse is being ACTIVELY worked on. It'd be nice if we knew what the issue is. Right now, we're just fixing things and hoping that was the problem.
rockstar: let the losas know if there is anything we can do to help.
herb, we certainly will. Should we be bringing in any outside help to instrument, test and diagnose the issue?
herb, anything happened to the DB during the time of this OOPS-1152EA162?
or maybe stub might know ^
matsubara: nothing in the incident log.
matsubara: That is one of the connection reaper scripts kicking in
matsubara: I think that's also on the void between LOSAs.
ah, there we go.
We kill connections idle in a transaction more than a few hours (and should be more aggressive), and appserver connections that have been in a transaction for more than 2 minutes.
stub, I see
stub, ok. so if we start seeing too many of those, we have a problem somewhere and a few is kinda normal?
The notification gets sent to the error-reports list (where we can confirm that this is indeed what happened)
stub, aha. that's better. I'll chase the lp-errors for that one
s/lp-errors/lp-errors list/
If we see many of them, we have a problem. One is probably a problem - appserver requests taking two minutes on the db means we need to investigate why the normal timeout mechanisms didn't work.
[action] matsubara to look at lp-errors list to determine cause of OOPS-1152EA162
ACTION received: matsubara to look at lp-errors list to determine cause of OOPS-1152EA162
right. thanks for the explanation
-1 second non-sql time, 0 seconds total time indicates a problem at the appserver? The request never got started?
I'll file a bug about that one and we can discuss there
hmm... might be a reconnection bug - perhaps the previous request handled by that thread got killed? I don't know if we Retry on DisconnectionError exceptions, or if it is a good idea in all cases.
ok
[TOPIC] * DBA report (stub)
New Topic: * DBA report (stub)
and thanks herb and stub
New hardware exists and is being brought online by IS.
I've realized I might need to tweak the db maintenance scripts (upgrade.py, security.py etc.) to cope with a third replica - I think it only copes with a single master and slave at the moment.
Staging can be moved by the LOSAs as soon as the hardware is available and they have time, which will move that load from the production systems.
I assume the rollout went fine as far as the db upgrade procedure goes.
I assume it did too. I didn't hear any complaints from my colleagues.
stub, great news! with the new hardware we won't have the staging restore problems anymore?
stub: what's the plan with the 3rd replica?
The staging restore problems should no longer be a problem.
* herb feels like he missed something
herb: We can start by pointing half the appservers at the new slave when it is online. We really should get a connection pool/load balancer thingy though running like pgbouncer, pgpool 1 or 2.
stub: gotcha
herb: I realized just now though that upgrade.py won't apply patches to a third replica, which would be bad. So that needs to be fixed.
yeah. that's important.
Or actually, slonik may take care of all that. I need to confirm anyway. I forget and it is too late for my brain :)
erm... late as in evening
all right. I guess that's all unless there are questions for stub
thanks stub
Thank you all for attending this week's Launchpad Production Meeting. See the channel topic for the location of the logs.
#endmeeting
Meeting finished at 09:42.
thanks matsubara
hey matsubara: question
do we need a new roll-out?
and i think it applies to everyone here
flacoste, I was on vacation and need to check that
anyone requires a new roll-out?
but I think there's at least danilos' bug to re-roll
flacoste: i don't know of any issues for us
matsubara, flacoste: yes
I thought it was policy to let enough bugs through qa to require a rerollout?
we're getting better at QA stub
even the code team weren't that late this cycle :-)
ok, so we'll need a re-roll for translations. need to check for the other teams, but so far, there's nothing on the radar
We need a counter somewhere - 'Launchpad has been running for n days without need to a release critical patch'
stub, :)
I think that's all then. thanks everyone
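
Post-log note: as a rough illustration of the check stub describes in the operations discussion above (connections sitting "idle in transaction" for hours, or appserver transactions open longer than about two minutes), here is a minimal sketch. It is not Launchpad's actual connection reaper script; it assumes psycopg2 and the PostgreSQL 9.2+ pg_stat_activity column names, and the DSN and thresholds are illustrative only.

{{{#!python
# Hypothetical helper, NOT the production reaper: list PostgreSQL backends
# whose transactions look like the ones stub says get killed.
import psycopg2

# Thresholds taken from stub's description; adjust to taste.
IDLE_IN_TXN_LIMIT = '2 hours'    # "idle in a transaction more than a few hours"
OPEN_TXN_LIMIT = '2 minutes'     # appserver transactions older than 2 minutes


def long_transactions(dsn):
    """Return (pid, user, state, transaction age) for suspicious backends."""
    conn = psycopg2.connect(dsn)
    try:
        cur = conn.cursor()
        # Column names (pid, state, xact_start) are the PostgreSQL 9.2+ ones;
        # older servers exposed procpid/current_query instead.
        cur.execute("""
            SELECT pid, usename, state, now() - xact_start AS txn_age
            FROM pg_stat_activity
            WHERE xact_start IS NOT NULL
              AND ((state = 'idle in transaction'
                    AND now() - xact_start > %s::interval)
                   OR now() - xact_start > %s::interval)
            ORDER BY txn_age DESC
        """, (IDLE_IN_TXN_LIMIT, OPEN_TXN_LIMIT))
        return cur.fetchall()
    finally:
        conn.close()


if __name__ == '__main__':
    # 'dbname=launchpad_prod' is a placeholder DSN, not a real config value.
    for row in long_transactions('dbname=launchpad_prod'):
        print(row)
}}}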