DevelopmentMeeting20090402

   1 <MootBot> Meeting started at 10:00. The chair is matsubara.
   2 <MootBot> Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
   3 <matsubara> Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues. 
   4 <matsubara> [TOPIC] Roll Call 
   5 <MootBot> New Topic:  Roll Call
   6 <rockstar> me
   7 <herb> me
   8 <cprov> me
   9 <sinzui> me
  10 <matsubara> Ursinha: 
  11 * stub (n=stub@canonical/launchpad/stub) has joined #launchpad-meeting
  12 <Ursinha> me
  13 <stub> me (on the right server this time)
  14 * flacoste (n=francis@canonical/launchpad/flacoste) has joined #launchpad-meeting
  15 <danilos> me (if no call)
  16 <flacoste> me
  17 <matsubara> intellectronica: hi
  18 <intellectronica> me
  19 <matsubara> all right, everyone here
  20 <matsubara> [TOPIC] Agenda 
  21 <MootBot> New Topic:  Agenda
  22 <matsubara>  * Actions from last meeting
  23 <matsubara>  * Oops report & Critical Bugs 
  24 <matsubara>  * Operations report (mthaddon/herb/spm)
  25 <matsubara>  * DBA report (stub)
  26 <matsubara> [TOPIC] * Actions from last meeting
  27 <MootBot> New Topic:  * Actions from last meeting
  28 <matsubara>   * intellectronica to make efforts to take a look at bug 329908
  29 <matsubara>   * sinzui to talk to kiko about pending cp requests
  30 <ubottu> Launchpad bug 329908 in malone "DownloadFailed OOPS when reporting a bug with apport (dup-of: 349646)" [Undecided,New] https://launchpad.net/bugs/329908
  31 <ubottu> Launchpad bug 349646 in malone "apport uploads not being found in +filebug" [Undecided,Fix released] https://launchpad.net/bugs/349646
  32 <intellectronica> matsubara: that's fixed
  33 <matsubara> well, sinzui's one is not needed anymore since that's been released
  34 <matsubara> thanks intellectronica 
  35 <sinzui> matsubara: I removed the requests because it was close to the rollout and the items were not critical
  36 <matsubara> sinzui: sure. thanks for checking
  37 <matsubara> moving on
  38 <matsubara> [TOPIC] * Oops report & Critical Bugs 
  39 <MootBot> New Topic:  * Oops report & Critical Bugs
  40 * sinzui has a question about what is critical for unmaintained apps
  41 * Notify: mthaddon is online (lindbohm.freenode.net).
  42 * mthaddon (n=mthaddon@adsl-70-137-154-128.dsl.snfc21.sbcglobal.net) has joined #launchpad-meeting
  43 <matsubara> Ursinha: ?
  44 <Ursinha> me
  45 <Ursinha> 4 bugs to talk about
  46 * flacoste has quit (Read error: 104 (Connection reset by peer))
  47 <Ursinha> matsubara wants to talk about bug 353530
  48 <Ursinha> • bigjools, bug 347194, fixed as RC but still appears on lpnet
  49 <Ursinha> • sinzui: bug 353863
  50 <Ursinha> • bigjools, bug 353568, timeout at +source/package page
  51 <ubottu> Launchpad bug 353530 in malone "OOPS filing a bug using the email interface " [Undecided,New] https://launchpad.net/bugs/353530
  52 <matsubara> sinzui: good question. You mean blueprint stuff?
  53 <ubottu> Launchpad bug 347194 in soyuz "IntegrityError: duplicate key value violates unique constraint "binarypackagerelease_binarypackagename_key"" [High,Fix committed] https://launchpad.net/bugs/347194
  54 <ubottu> Launchpad bug 353863 in launchpad-registry "TypeError when finishing creating user account in lpnet" [Undecided,New] https://launchpad.net/bugs/353863
  55 <ubottu> Launchpad bug 353568 in soyuz "ubuntu/source/package/+index timing out" [High,Triaged] https://launchpad.net/bugs/353568
  56 <Ursinha> should we raise bug 353568 to critical?
  57 <matsubara> sinzui: I think we need to raise that question in the list
  58 <matsubara> cprov: what's up with the ones bigjools fixed?
  59 * flacoste (n=francis@canonical/launchpad/flacoste) has joined #launchpad-meeting
  60 <flacoste> me again
  61 <matsubara> hi francis
  62 <flacoste> another X lock-up
  63 <flacoste> what did i miss?
  64 <matsubara> we're doing the oops section
  65 <Ursinha> flacoste, the bugs we'll discuss
  66 <sinzui> Ursinha: That looks like a critical bug to me
  67 <matsubara> so far nothing for foundations
  68 <cprov> matsubara: I don't know, AFAICT it's not fixed.
  69 <sinzui> Ursinha: I will give it to salgado who is already looking into login/account issues
  70 <Ursinha> sinzui, I couldn't reproduce that, don't know if matsubara tried that
  71 <matsubara> those oopses are likely to be candidates for RC and next re-roll
  72 <Ursinha> for sure
  73 <matsubara> Ursinha: I did not
  74 <Ursinha> thanks sinzui
  75 <flacoste> what login/account issues are we having?
  76 <sinzui> Ursinha: salgado saw many oopses he could not reproduce, but I think he can at least explain why
  77 <cprov> matsubara: I will look at it this afternoon, maybe I can do something quick to stop the timeout in production
  78 <Ursinha> flacoste, bug 353863
  79 <ubottu> Launchpad bug 353863 in launchpad-registry "TypeError when finishing creating user account in lpnet" [Undecided,New] https://launchpad.net/bugs/353863
  80 <salgado> I'll need help with this one
  81 <matsubara> re: bug 353530, intellectronica could you take a look? it's about the OOPS in filing a bug using the email interface but I'm not sure that specific oops is under Bugs' responsibility
  82 <ubottu> Launchpad bug 353530 in malone "OOPS filing a bug using the email interface " [Undecided,New] https://launchpad.net/bugs/353530
  83 <matsubara> cprov: cool. thanks
  84 <intellectronica> matsubara: according to steve's comment that's another case of missing permissions
  85 <intellectronica> but i'm not clear whether it was dealt with. i'll check
  86 <matsubara> I'm going to add those to the CurrentRolloutBlockers page and use that page to coordinate things that will go in for the re-roll
  87 <Ursinha> matsubara, afaik that was just fixed by adding the user to the conf file in the server
  88 <matsubara> intellectronica: seems to be dealt with, but my question is more in the sense on how we can avoid that in the future
  89 <Ursinha> as per spm explanations
  90 <Ursinha> to me
  91 <matsubara> so, apparently it was an unusual rollout requirement but nobody added it there
  92 <matsubara> Ursinha: don't say server, we have at least 10 "servers" out there :-)
  93 <Ursinha> matsubara, sorry :) s/server/server in which the conf was missing/
  94 <matsubara> anyway, glancing at it, could be that the slaves were missing the right config?
  95 <intellectronica> so it seems
  96 <rockstar> matsubara, might that be a question for the db report section?
  97 <flacoste> Ursinha, matsubara: we should add a test for missing permissions
  98 <flacoste> matsubara: did you file a bug about the one you wanted me to discuss with stub?
  99 <matsubara> flacoste: nope, but I have the pastebin here. I'll file a bug about it right after the meeting
 100 <matsubara> [action] matsubara to file a bug about the missing select permissions that delayed the rollout
 101 <MootBot> ACTION received:  matsubara to file a bug about the missing select permissions that delayed the rollout
 102 <flacoste> thanks
 103 <matsubara> [action] cprov to look up soyuz bugs 347194, 353568
 104 <ubottu> Launchpad bug 347194 in soyuz "IntegrityError: duplicate key value violates unique constraint "binarypackagerelease_binarypackagename_key"" [High,Fix committed] https://launchpad.net/bugs/347194
 105 <MootBot> ACTION received:  cprov to look up soyuz bugs 347194, 353568
 106 <ubottu> Launchpad bug 353568 in soyuz "ubuntu/source/package/+index timing out" [High,Triaged] https://launchpad.net/bugs/353568
 107 <cprov> matsubara: the first one is fixed
 108 <matsubara> err, sorry about that, I'll edit that entry 
 109 <matsubara> [action] matsubara to edit #347194 out of the last action :-)
 110 <MootBot> ACTION received:  matsubara to edit #347194 out of the last action :-)
 111 <cprov> matsubara: some errors happened yesterday because I had to reprocess a bunch of binary uploads that failed after the rollout (due to the absence of the launchpad_auth DB user)
 112 <Ursinha> cprov, now it makes sense
 113 <matsubara> ah, so that also affected other things other than the email interface.
 114 <Ursinha> thanks :)
 115 <cprov> Ursinha: yes, it was a nightmare, because the buildfarm was full and binaries could not be processed due to the lack of DB access
 116 <matsubara> [action] matsubara to include francis suggestion to bug 353530 and ursinha to summarize what spm told her
 117 <ubottu> Launchpad bug 353530 in malone "OOPS filing a bug using the email interface " [Undecided,New] https://launchpad.net/bugs/353530
 118 <MootBot> ACTION received:  matsubara to include francis suggestion to bug 353530 and ursinha to summarize what spm told her
 119 <Ursinha> indeed
 120 <matsubara> salgado: how can we help you with that one?
 121 <salgado> matsubara, I'll let you know once I know. :)
 122 <matsubara> [action] salgado to debug and fix bug 353863
 123 <ubottu> Launchpad bug 353863 in launchpad-registry "TypeError when finishing creating user account in lpnet" [Undecided,New] https://launchpad.net/bugs/353863
 124 <MootBot> ACTION received:  salgado to debug and fix bug 353863
 125 <matsubara> I think I addressed everything
 126 <danilos> Ursinha: has there been any outcome of the timeout discussion?
 127 <matsubara> so, as usual after the release we are going to monitor the oops reports constantly and coordinate with the teams about any new oopses
 128 <Ursinha> danilos, I'm going to talk about it with stub in his section
 129 <danilos> Ursinha: ok, thanks
 130 <danilos> sorry for not following the script, I forgot my lines :)
 131 <Ursinha> danilos, :)
 132 <matsubara> [action] sinzui to email the list how we should address critical bugs on unmaintained apps (e.g. blueprint)
 133 <MootBot> ACTION received:  sinzui to email the list how we should address critical bugs on unmaintained apps (e.g. blueprint)
 134 <matsubara> sinzui: ^ is that correct?
 135 <sinzui> matsubara: yes
 136 <matsubara> ok, I think that's all for this section. All the critical ones are being handled
 137 <matsubara> thanks everyone
 138 <matsubara> [TOPIC] * Operations report (mthaddon/herb/spm)
 139 <MootBot> New Topic:  * Operations report (mthaddon/herb/spm)
 140 <herb> 2009-03-30 - Experienced some DB problems that affected the service. Launchpad was unavailable for approximately 9 minutes. stub sent out an email summarizing the issues.
 141 <herb> 2009-03-30 - Cherry picked r8054 and part of r7999.
 142 <herb> 2009-04-01 - Rollout of 2.2.3. Total downtime was approximately 100 minutes. I think there were a few hiccups on some DB permissions, but I haven't had an opportunity to catch up with mthaddon and spm on the details.
 143 <herb> Bug 156453 and bug 118625 continue to be a source of discomfort. I think rockstar has an update on these though.
 144 <ubottu> Launchpad bug 156453 in loggerhead "production loggerhead branch leaks memory" [Critical,In progress] https://launchpad.net/bugs/156453
 145 <ubottu> Launchpad bug 118625 in launchpad-bazaar "codebrowse sometimes hangs" [High,Triaged] https://launchpad.net/bugs/118625
 146 <herb> Bug 80895 and bug 119420 are a pain point for the LOSAs. I think something may have been scheduled for this cycle on this front. If so that's a total win from our point of view.
 147 <herb> When do we think we'll be doing a re-roll?
 148 <ubottu> Launchpad bug 80895 in malone "Give people five minutes to edit/delete their comment" [Undecided,Confirmed] https://launchpad.net/bugs/80895
 149 <ubottu> Launchpad bug 119420 in launchpad-answers "Cannot edit a comment" [Medium,Triaged] https://launchpad.net/bugs/119420
 150 <rockstar> herb, I can has update!
 151 <rockstar> :)
 152 <herb> woo!
 153 <rockstar> So we have a memory middleware currently that's allowing us to track down memory issues.
 154 <rockstar> herb, also, mwhudson and jam have been writing a C-based memory profiler as well, so we can track refs even better in bzrlib itself.
 155 <herb> excellent
 156 <matsubara> herb: I'll let you know about the re-roll once we know. :-)
 157 <herb> matsubara: appreciated.
 158 <rockstar> herb, unfortunately, I can't really tell if the "sometimes hangs" bug is related to the "leaks memory" bug.
 159 <matsubara> herb: re: the DB permission, I'm going to file a bug about it and flacoste and stub will discuss it :-)
 160 <herb> rockstar: I suspect so, but fixing the memory issue would be a huge win.
 161 <stub> it's not a bug, it was an operational issue
 162 <Ursinha> indeed
 163 <rockstar> herb, yes.  If they are unrelated, it's probably a bug in one of our dependencies.
 164 <stub> erm... if you are talking about the same one i'm thinking of.
 165 <matsubara> stub: I'm talking about the permission for the SSO user
 166 <stub> ok. different ;)
 167 <matsubara> :-)
 168 <matsubara> ok, anything else for herb?
 169 <matsubara> thanks herb. 
 170 <herb> thanks matsubara
 171 <matsubara> and thank mthaddon and spm for handling the rollout so well too!
 172 <matsubara> moving on.
 173 <herb> matsubara: will do
 174 <matsubara> [TOPIC] * DBA report (stub)
 175 <MootBot> New Topic:  * DBA report (stub)
 176 <stub> Today's database update ran in about 100 mins with all replicas enabled. Earlier calculations indicated the downtime would be a bit under three hours. The discrepancy is because staging isn't as powerful and normal staging operations are underway during the restore.
 177 <stub> This was good from a downtime perspective, but does mean we can no longer get reliable rollout timings from staging. When rollout times are a concern, we might have to test the database upgrade process on a production server and calculate the time from there.
 178 <stub> I want to switch our master database to the new 16 core box from the current 8 core box in the next two weeks. This will require a few minutes downtime - I think a scheduled 10 minute outage will suffice. We might want to double up if there is other downtime required in the near future.
 179 <stub> A few days ago, generating a table bloat report managed to mess up PostgreSQL, causing all queries to the master to generate nothing but errors. A forced restart was required, causing a few minutes of downtime total. The cause has been tracked down and is being worked on upstream, and we can avoid it now we know what it is (don't feed temporary tables to pgstattuple).
 180 <stub> I've opened a couple of bugs about batch jobs that are taking too long. I generally don't care how long things take as long as their impact is light, but staging updates and post rollout processes are approaching 24 hours...
 181 <stub> A number of problems were caused by missing PostgreSQL authorization to the new launchpad_auth user on production. This authorization was added to staging, but missed getting into the production rollout tasks. spm sorted it a few hours after the rollout as I understand it. This is a purely operational issue outside the scope of our test suite (staging is the test bed for database connection authorizations). Ignore OOPSes and bugs like 353
 182 <stub> All from me.
 183 <stub> Bug 353530
 184 <ubottu> Launchpad bug 353530 in malone "OOPS filing a bug using the email interface " [Undecided,New] https://launchpad.net/bugs/353530
 185 <Ursinha> stub, I have one oops, I don't know if it was just a hiccup
 186 <Ursinha> stub, https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1188D1214
 187 <ubottu> https://devpad.canonical.com/~jamesh/oops.cgi/1188D1214
 188 <matsubara> [action] matsubara to talk to mrevell to announce a maintenance in the DB for about 10 min outage in the next 2 weeks. ask mrevell to talk to stub about it
 189 <MootBot> ACTION received:  matsubara to talk to mrevell to announce a maintenance in the DB for about 10 min outage in the next 2 weeks. ask mrevell to talk to stub about it
 190 <stub> Ursinha: That's a bug needing fixing.
 191 <Ursinha> stub, I'll file a bug about it now
 192 <Ursinha> about the timeouts we mentioned during the week
 193 <Ursinha> it seems they indeed dropped
 194 <Ursinha> the major responsible now is the source package index page
 195 <Ursinha> danilos, ^
 196 <stub> Ok. So we need to be even less aggressive doing mass data migration.
 197 <Ursinha> if the timeouts continue the next days, we'll have to chase another cause.
 198 <danilos> stub, Ursinha: we'll have something similar coming up, how can we make sure the impact is not felt on our production machines?
 199 <stub> danilos: Either set the acceptable lag setting lower, or a cooldown time after each batch.
 200 <herb> stub: or both?
 201 <danilos> stub: ok, I guess we'll have to experiment with these
 202 <stub> or both
 203 <matsubara> ok. I guess that's all for stub?
 204 <matsubara> thanks stub 
 205 <Ursinha> thanks stub
 206 <matsubara> I have a minor announcement that I forgot to add to the agenda
 207 <matsubara> Next week is our second performance week
 208 <matsubara> so, please add the bugs you're going to work on in https://dev.launchpad.net/PerformanceWeeks/April2009
 209 <matsubara> and I think that's all
 210 <matsubara> anything else before I close?
 211 <matsubara> 3
 212 <matsubara> 2
 213 <matsubara> 1
 214 <matsubara> Thank you all for attending this week's Launchpad Production Meeting. See the channel topic for the location of the logs. 
 215 <Ursinha> stub, bug 353897
 216 <ubottu> Launchpad bug 353897 in launchpad-foundations "DisallowedStore OOPS in lpnet/+login" [Undecided,New] https://launchpad.net/bugs/353897
 217 <matsubara> #endmeeting 
 218 <MootBot> Meeting finished at 10:39.
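
Note on stub's suggestion in the DBA report (lower the acceptable lag setting, add a cooldown after each batch, or both): the following is only a rough sketch of that idea, not Launchpad's actual migration code. Every name in it (`migrate_in_batches`, `get_lag_seconds`, `max_lag_seconds`, `cooldown_seconds`) is hypothetical, and how replication lag is really measured is left to the caller.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def migrate_in_batches(
    items: Sequence[int],
    process_batch: Callable[[Sequence[int]], None],
    get_lag_seconds: Callable[[], float],  # hypothetical lag probe supplied by the caller
    batch_size: int = 100,
    max_lag_seconds: float = 5.0,   # the "acceptable lag" threshold (assumed unit: seconds)
    cooldown_seconds: float = 1.0,  # pause after each batch, and between lag checks
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Process `items` in batches, backing off while replicas are lagging.

    Returns the number of items processed.
    """
    processed = 0
    for batch in chunked(items, batch_size):
        # Wait until the reported lag is at or below the acceptable threshold.
        while get_lag_seconds() > max_lag_seconds:
            sleep(cooldown_seconds)
        process_batch(batch)
        processed += len(batch)
        # Cooldown after every batch so production load can recover.
        sleep(cooldown_seconds)
    return processed
```

Lowering `max_lag_seconds` makes the job yield sooner when replicas fall behind, while raising `cooldown_seconds` slows the job regardless of lag; as discussed above, the two can be combined and tuned by experiment.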

DevelopmentMeeting20090402 (last edited 2009-04-02 15:48:18 by matsubara)