## Template for LP Production Meeting logs. Just paste xchat log below and the format IRC line will take care of formatting correctly
#format IRC
Meeting started at 10:00. The chair is matsubara.
Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues.
[TOPIC] Roll Call
New Topic: Roll Call
me
me
me
me
Ursinha:
* stub (n=stub@canonical/launchpad/stub) has joined #launchpad-meeting
me
me (on the right server this time)
* flacoste (n=francis@canonical/launchpad/flacoste) has joined #launchpad-meeting
me (if no call)
me
intellectronica: hi
me
all right, everyone here
[TOPIC] Agenda
New Topic: Agenda
 * Actions from last meeting
 * Oops report & Critical Bugs
 * Operations report (mthaddon/herb/spm)
 * DBA report (stub)
[TOPIC] * Actions from last meeting
New Topic: * Actions from last meeting
 * intellectronica to make efforts to take a look at bug 329908
 * sinzui to talk to kiko about pending cp requests
Launchpad bug 329908 in malone "DownloadFailed OOPS when reporting a bug with apport (dup-of: 349646)" [Undecided,New] https://launchpad.net/bugs/329908
Launchpad bug 349646 in malone "apport uploads not being found in +filebug" [Undecided,Fix released] https://launchpad.net/bugs/349646
matsubara: that's fixed
well, sinzui's one is not needed anymore since that's been released
thanks intellectronica
matsubara: I removed the requests because it was close to the rollout and the items were not critical
sinzui: sure. thanks for checking
moving on
[TOPIC] * Oops report & Critical Bugs
New Topic: * Oops report & Critical Bugs
 * sinzui has a question about what is critical for unmaintained app
* Notify: mthaddon is online (lindbohm.freenode.net).
* mthaddon (n=mthaddon@adsl-70-137-154-128.dsl.snfc21.sbcglobal.net) has joined #launchpad-meeting
Ursinha: ?
me
4 bugs to talk about
* flacoste has quit (Read error: 104 (Connection reset by peer))
matsubara wants to talk about bug 353530
• bigjools, bug 347194, fixed as RC but still appears on lpnet
• sinzui: bug 353863
• bigjools, bug 353568, timeout at +source/package page
Launchpad bug 353530 in malone "OOPS filing a bug using the email interface " [Undecided,New] https://launchpad.net/bugs/353530
sinzui: good question. You mean blueprint stuff?
Launchpad bug 347194 in soyuz "IntegrityError: duplicate key value violates unique constraint "binarypackagerelease_binarypackagename_key"" [High,Fix committed] https://launchpad.net/bugs/347194
Launchpad bug 353863 in launchpad-registry "TypeError when finishing creating user account in lpnet" [Undecided,New] https://launchpad.net/bugs/353863
Launchpad bug 353568 in soyuz "ubuntu/source/package/+index timing out" [High,Triaged] https://launchpad.net/bugs/353568
should we raise bug 353568 to critical?
sinzui: I think we need to raise that question in the list
cprov: what's up with the ones bigjools fixed?
* flacoste (n=francis@canonical/launchpad/flacoste) has joined #launchpad-meeting
me again
hi francis
another X lock-up
what did i miss?
we're doing the oops section
flacoste, the bugs we'll discuss
Ursinha: That looks like a critical bug to me
so far nothing for foundations
matsubara: I don't know, AFAICT it's not fixed.
Ursinha: I will give it to salgado who is already looking into login/account issues
sinzui, I couldn't reproduce that, don't know if matsubara tried that
those oopses are likely to be candidates for RC and next re-roll for sure
Ursinha: I did not
thanks sinzui
what login/account issues are we having?
Ursinha: salgado saw many oopses he could not reproduce, but I think he can at least explain why
matsubara: I will look at it this afternoon, maybe I can do something quick to stop the timeout in production
flacoste, bug 353863
Launchpad bug 353863 in launchpad-registry "TypeError when finishing creating user account in lpnet" [Undecided,New] https://launchpad.net/bugs/353863
I'll need help with this one
re: bug 353530, intellectronica could you take a look? it's about the OOPS in filing bug using the email interface but I'm not sure that specific oops is under Bugs responsibility
Launchpad bug 353530 in malone "OOPS filing a bug using the email interface " [Undecided,New] https://launchpad.net/bugs/353530
cprov: cool. thanks
matsubara: according to steve's comment that's another case of missing permissions but i'm not clear whether it was dealt with. i'll check
I'm going to add those to the CurrentRolloutBlockers page and use that page to coordinate things that will go in for the re-roll
matsubara, afaik that was just fixed by adding the user to the conf file in the server
intellectronica: seems to be dealt with, but my question is more in the sense on how we can avoid that in the future
as per spm explanations to me
so, apparently it was an unusual rollout requirement but nobody added it there
Ursinha: don't say server, we have at least 10 "servers" out there :-)
matsubara, sorry :) s/server/server in which the conf was missing/
anyway, glancing at it, could be that the slaves were missing the right config?
so it seems
matsubara, might that be a question for the db report section?
Ursinha, matsubara: we should add test for missing permission
matsubara: did you file a bug about the one you wanted me to discuss with stub?
flacoste: nope, but I have the pastebin here.
I'll file a bug about it right after the meeting
[action] matsubara to file a bug about the missing select permissions that delayed the rollout
ACTION received: matsubara to file a bug about the missing select permissions that delayed the rollout
thanks
[action] cprov to look up soyuz bugs 347194, 353568
Launchpad bug 347194 in soyuz "IntegrityError: duplicate key value violates unique constraint "binarypackagerelease_binarypackagename_key"" [High,Fix committed] https://launchpad.net/bugs/347194
ACTION received: cprov to look up soyuz bugs 347194, 353568
Launchpad bug 353568 in soyuz "ubuntu/source/package/+index timing out" [High,Triaged] https://launchpad.net/bugs/353568
matsubara: the first one is fixed
err, sorry about that, I'll edit that entry
[action] matsubara to edit #347194 out of the last action :-)
ACTION received: matsubara to edit #347194 out of the last action :-)
matsubara: some errors happened yesterday because I had to reprocess a bunch of binary uploads that failed after the rollout (due to the absence of the launchpad_auth DB user)
cprov, now it makes sense
ah, so that also affected other things other than the email interface.
thanks :)
Ursinha: yes, it was a nightmare, because the buildfarm was full and binaries could not be processed due to the lack of DB access
[action] matsubara to include francis suggestion to bug 353530 and ursinha to summarize what spm told her
Launchpad bug 353530 in malone "OOPS filing a bug using the email interface " [Undecided,New] https://launchpad.net/bugs/353530
ACTION received: matsubara to include francis suggestion to bug 353530 and ursinha to summarize what spm told her
indeed
salgado: how can we help you with that one?
matsubara, I'll let you know once I know.
:)
[action] salgado to debug and fix bug 353863
Launchpad bug 353863 in launchpad-registry "TypeError when finishing creating user account in lpnet" [Undecided,New] https://launchpad.net/bugs/353863
ACTION received: salgado to debug and fix bug 353863
I think I addressed everything
Ursinha: has there been any outcome of the timeout discussion?
so, as usual after the release we are going to monitor the oops reports constantly and coordinate with the teams about any new oopses
danilos, I'm going to talk about it with stub in his section
Ursinha: ok, thanks
sorry for not following the script, I forgot my lines :)
danilos, :)
[action] sinzui to email the list how we should address critical bugs on unmaintained apps (e.g. blueprint)
ACTION received: sinzui to email the list how we should address critical bugs on unmaintained apps (e.g. blueprint)
sinzui: ^ is that correct?
matsubara: yes
ok, I think that's all for this section. All the critical ones are being handled
thanks everyone
[TOPIC] * Operations report (mthaddon/herb/spm)
New Topic: * Operations report (mthaddon/herb/spm)
2009-03-30 - Experienced some DB problems that affected the service. Launchpad was unavailable for approximately 9 minutes. stub sent out an email summarizing the issues.
2009-03-30 - Cherry picked r8054 and part of r7999.
2009-04-01 - Rollout of 2.2.3. Total downtime was approximately 100 minutes. I think there were a few hiccups on some DB permissions, but I haven't had an opportunity to catch up with mthaddon and spm on the details.
Bug 156453 and bug 118625 continue to be a source of discomfort. I think rockstar has an update on these though.
Launchpad bug 156453 in loggerhead "production loggerhead branch leaks memory" [Critical,In progress] https://launchpad.net/bugs/156453
Launchpad bug 118625 in launchpad-bazaar "codebrowse sometimes hangs" [High,Triaged] https://launchpad.net/bugs/118625
Bug 80895 and bug 119420 are a pain point for the LOSAs.
I think something may have been scheduled for this cycle on this front. If so that's a total win from our point of view.
When do we think we'll be doing a re-roll?
Launchpad bug 80895 in malone "Give people five minutes to edit/delete their comment" [Undecided,Confirmed] https://launchpad.net/bugs/80895
Launchpad bug 119420 in launchpad-answers "Cannot edit a comment" [Medium,Triaged] https://launchpad.net/bugs/119420
herb, I can has update! :)
woo!
So we have a memory middleware currently that's allowing us to track down memory issues.
herb, also, mwhudson and jam have been writing a C-based memory profiler as well, so we can track refs even better in bzrlib itself.
excellent
herb: I'll let you know about the re-roll once we know. :-)
matsubara: appreciated.
herb, unfortunately, I can't really tell if the "sometimes hangs" bug is related to the "leaks memory" bug.
herb: re: the DB permission, I'm going to file a bug about it and flacoste and stub will discuss it :-)
rockstar: I suspect so, but fixing the memory issue would be a huge win.
it's not a bug, it was an operational issue
indeed
herb, yes. If they are unrelated, it's probably a bug in one of our dependencies.
erm... if you are talking about the same one i'm thinking of.
stub: I'm talking about the permission for the SSO user
ok. different ;)
:-)
ok, anything else for herb?
thanks herb.
thanks matsubara
and thank mthaddon and spm for handling the rollout so well too!
moving on.
matsubara: will do
[TOPIC] * DBA report (stub)
New Topic: * DBA report (stub)
Today's Database update ran in about 100 mins with all replicas enabled. Earlier calculations indicated the downtime would be a bit under three hours. The discrepancy is that staging isn't as powerful and normal staging operations are underway during the restore. This was good from a downtime perspective, but does mean we can no longer get reliable rollout timings from staging.
When rollout times are a concern, we might have to test the database upgrade process on a production server and calculate the time from there.
I want to switch our master database to the new 16 core box from the current 8 core box in the next two weeks. This will require a few minutes downtime - I think a scheduled 10 minute outage will suffice. We might want to double up if there is other downtime required in the near future.
A few days ago, generating a table bloat report managed to mess up PostgreSQL, causing all queries to the master to generate nothing but errors. A forced restart was required, causing a few minutes of downtime total. The cause has been tracked down and is being worked on upstream, and we can avoid it now we know what it is (don't feed temporary tables to pgstattuple).
I've opened a couple of bugs about batch jobs that are taking too long. I generally don't care how long things take as long as their impact is light, but staging updates and post rollout processes are approaching 24 hours...
A number of problems were caused by missing PostgreSQL authorization to the new launchpad_auth user on production. This authorization was added to staging, but missed getting into the production rollout tasks. spm sorted it a few hours after the rollout as I understand it. This is a purely operational issue outside the scope of our test suite (staging is the test bed for database connection authorizations).
Ignore OOPSes and bugs like 353
All from me.
Bug 353530
Launchpad bug 353530 in malone "OOPS filing a bug using the email interface " [Undecided,New] https://launchpad.net/bugs/353530
stub, I have one oops, I don't know if it was just a hiccup
stub, https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1188D1214
https://devpad.canonical.com/~jamesh/oops.cgi/1188D1214
[action] matsubara to talk to mrevell to announce a maintenance in the DB for about 10 min outage in the next 2 weeks. ask mrevell to talk to stub about it
ACTION received: matsubara to talk to mrevell to announce a maintenance in the DB for about 10 min outage in the next 2 weeks. ask mrevell to talk to stub about it
Ursinha: That's a bug needing fixing.
stub, I'll file a bug about it now
about the timeouts we mentioned during the week
it seems they indeed dropped
the major responsible now is the source package index page
danilos, ^
Ok. So we need to be even less aggressive doing mass data migration.
if the timeouts continue the next days, we'll have to chase another cause.
stub, Ursinha: we'll have something similar coming up, how can we make sure the impact is not felt on our production machines?
danilos: Either set the acceptable lag setting lower, or a cooldown time after each batch.
stub: or both?
stub: ok, I guess we'll have to experiment with these
or both
ok. I guess that's all for stub?
thanks stub
thanks stub
I have a minor announcement that I forgot to add to the agenda
Next week is our second performance week
so, please add the bugs you're going to work on in https://dev.launchpad.net/PerformanceWeeks/April2009
and I think that's all
anything else before I close?
3
2
1
Thank you all for attending this week's Launchpad Production Meeting. See the channel topic for the location of the logs.
stub, bug 353897
Launchpad bug 353897 in launchpad-foundations "DisallowedStore OOPS in lpnet/+login" [Undecided,New] https://launchpad.net/bugs/353897
#endmeeting
Meeting finished at 10:39.