DevelopmentMeeting20091015

   1 <matsubara> #startmeeting
   2 <MootBot> Meeting started at 10:00. The chair is matsubara.
   3 <MootBot> Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
   4 <matsubara> Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues. 
   5 <matsubara> [TOPIC] Roll Call 
   6 <MootBot> New Topic:  Roll Call
   7 <rockstar> ni!
   8 * herb__ has quit (Client Quit)
   9 <danilos> me
  10 <adeuring> me
  11 <adeuring> (allenap is sick)
  12 <danilos> (or "coo", if anyone knows about kin dza dza ;)
  13 <matsubara> gary_poster, Chex, bigjools: hi
  14 <mbarnett> hello
  15 <matsubara> sinzui, hi
  16 <gary_poster> me
  17 <gary_poster> and hi
  18 <matsubara> :-)
  19 <sinzui> me
  20 <mbarnett> me
  21 <matsubara> apologies from Stuart and Ursula
  22 <bigjools> me
  23 <Chex> hello
  24 <matsubara> [TOPIC] Agenda 
  25 <MootBot> New Topic:  Agenda
  26 <matsubara>  * Actions from last meeting
  27 <matsubara>  * Oops report & Critical Bugs & Broken scripts
  28 <matsubara>  * Operations report (mthaddon/Chex/spm/mbarnett)
  29 <matsubara>  * DBA report (stub)
  30 <matsubara>  * Proposed items
  31 <matsubara> [TOPIC] * Actions from last meeting
  32 <MootBot> New Topic:  * Actions from last meeting
  33 <matsubara> * matsubara to trawl logs related to high load on edge on 2009-09-09 ~1830UTC and ping Chex about it
  34 <matsubara> * matsubara to email the devel list about the new ErrorReportingUtility method
  35 <matsubara>     * done
  36 <matsubara> * matsubara to file a bug to have the HWSubmissionMissingFields oopses as informational only (note to self: see bug 438671 for more details)
  37 <matsubara>     * filed https://bugs.edge.launchpad.net/malone/+bug/446660
  38 <matsubara> * matsubara to look in lp-production-configs for the new oops prefixes.
  39 <matsubara> * all QA contacts to inform their teams about the new QA column and what they should do about it.
  40 <matsubara> * Chex to email the list about the new QA column in https://wiki.canonical.com/InformationInfrastructure/OSA/LPIncidentLog
  41 <ubottu> Launchpad bug 438671 in checkbox "HWSubmissionMissingFields OOPS on +hwdb/+submit" [Undecided,Confirmed] https://launchpad.net/bugs/438671
  42 <ubottu> Launchpad bug 446660 in malone "HWSubmissionMissingFields exceptions should be updated to be informational only" [High,Triaged]
  43 <matsubara> I still haven't checked the high load logs. Chex or mthaddon, did you notice high loads after the 2009-09-09? 
  44 <Chex> matsubara: we still have been seeing some high loads, yes
  45 <matsubara> I did look at the new prefixes in lp-production-configs. I need to update oops-tools to recognize those
  46 <matsubara> I also noticed that some oops prefixes will conflict with existing ones, so I need to sort that out with...
  47 <matsubara> losas I guess
  48 <matsubara> [action] matsubara to file a bug on oops-tools to recognize new oops prefixes and sort out conflicting prefixes with losas
  49 <MootBot> ACTION received:  matsubara to file a bug on oops-tools to recognize new oops prefixes and sort out conflicting prefixes with losas
  50 <matsubara> Chex, re: the high load, could you take on the task of analysing the logs? my idea was to correlate information from the app servers logs with the apache logs and see if that could shed some light.
  51 <matsubara> mthaddon emailed the list about the new QA column, so everyone, read it and spread the word to your teams, please.
  52 <Chex> Chex: yes sure, I can look at that.
  53 <danilos> matsubara: it has just been discussed in the TL call as well, flacoste will champion the process
  54 <Chex> matsubara: ^^  I mean..
  55 <danilos> matsubara: (about QA Info column on LP incident log)
  56 <matsubara> Chex, cool, thanks a lot. ping me if you need any info on that
  57 <matsubara> danilos, cool. thanks!
  58 <matsubara> [action] Chex to check app server logs and apache logs to see if they can shed any light on the high load issue.
  59 <MootBot> ACTION received:  Chex to check app server logs and apache logs to see if they can shed any light on the high load issue.
  60 <matsubara> [TOPIC] * Oops report & Critical Bugs & Broken scripts
  61 <MootBot> New Topic:  * Oops report & Critical Bugs & Broken scripts
  62 <matsubara> we're seeing a bunch of DisconnectionErrors which are not informational only
  63 <matsubara> which means, the Retry mechanism is not enough for those cases.
  64 <gary_poster> matsubara: are these the ones on the xmlrpc server?
  65 <gary_poster> or something else?
  66 <matsubara> gary_poster, yes, most of them on xmlrpc server
  67 <matsubara> but there are a few, like OOPS-1383I246, in login.launchpad.net
  68 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1383I246
  69 <gary_poster> matsubara: right.  I investigated and could not duplicate the ones in the xmlrpc server.  Kicking the xmlrpc server made them go away.  There's a bug number which I can get in a moment.  After discussing with flacoste, I think the best we can hope for is to figure out a way to add more diagnostic information should the problem happen again
  70 <matsubara> gary_poster, ok, I take it this is a foundations task then. let me know the bug number please (or I'll file a new one for the more diagnostic info needed issue, if that's not what the bug you mentioned is about)
  71 <gary_poster> bug 450593 .   Stuart has a follow up: check with losas if there were any unusual activity ATM
  72 <ubottu> Launchpad bug 450593 in launchpad-foundations "Lots of DisconnectionErrors on xmlrpc server - staging" [Undecided,New] https://launchpad.net/bugs/450593
  73 <gary_poster> I think a comment saying that we should address by adding diagnostic information in case there is a repeat would be sufficient.  I'll do that.
  74 <matsubara> thanks gary_poster 
  75 <matsubara> apart from that we have a bunch of oopses that will need fixing given the new zero oops policy.
  76 <matsubara> Ursula will keep an eye on those for now and let the team leads know which ones are happening more frequently
  77 <matsubara> we had some script failures last week
  78 <matsubara> the main one seems to be the branch-puller which was already discussed in the list
  79 <matsubara> checkwatches failed on the 13th, but since no other email came out, I assume it was a blip. adeuring, can you confirm?
  80 <adeuring> matsubara: erm, I have no idea...
  81 <matsubara> and the product-release finder and update-cache failed to run on the 14th
  82 <adeuring> matsubara: I'll ask Graham
  83 <matsubara> sinzui, do you know what's up with the product release finder script?
  84 <matsubara> who's owns the update-cache script? 
  85 <matsubara> s/'s//
  86 <matsubara> thanks adeuring 
  87 <gary_poster> I don't know; looking
  88 <matsubara> [action] adeuring to check with gmb about checkwatches failure
  89 <MootBot> ACTION received:  adeuring to check with gmb about checkwatches failure
  90 <sinzui> matsubara: No, but I think the issue is not that it failed, but that a long process prevented it from running
  91 <matsubara> sinzui, right, that'd explain. could you check that's the root cause and reply to the list?
  92 <sinzui> matsubara: okay
  93 <matsubara> maybe the update-cache failure happened for the same reason
  94 <gary_poster> matsubara: I don't see an update-cache script in the LP tree.  (I do see variants like update-download-cache)
  95 <matsubara> just a reminder to everyone, if a script fails and your team owns that script, please reply to the failure email saying that someone is taking a look at it.
  96 <matsubara> gary_poster, all I see is: "The script 'update-cache' didn't run on 'loganberry' between 2009-10-14 04:00:08 and 2009-10-14 22:00:08 (last seen 2009-10-13 11:36:51.345188)" not sure which script that one is monitoring.
  97 <matsubara> for the critical bugs section, we have 4 bugs, 3 fix committed and 1 in progress
  98 <matsubara> danilos, the one in progress is assigned to henning but he's on vacation
  99 <matsubara> is it really critical?
 100 <danilos> matsubara: I'd have to check, sorry for not being on top of this
 101 <gary_poster> (I also looked for update-cache in lp-production-configs.  not there either.)
 102 <matsubara> gary_poster, I think it's cronscripts/update-pkgcache.py. IIRC, the losas script monitoring tool uses the script name defined in LaunchpadCronScript
 103 <matsubara> [action] danilos to check bug 438039, assess if it's really critical. if it is, land a fix, if it's not, update the importance
 104 <MootBot> ACTION received:  danilos to check bug 438039, assess if it's really critical. if it is, land a fix, if it's not, update the importance
 105 <ubottu> Launchpad bug 438039 in rosetta "bzr branch import script oopses sometimes" [Critical,In progress] https://launchpad.net/bugs/438039
 106 <gary_poster> matsubara: oh ok, thanks.  that script is either the one salgado was talking about that he owns, or something for soyuz, seems to me.
 107 <bigjools> it's traditionally maintained by soyuz
 108 <gary_poster> bigjools: ok, thanks
 109 <bigjools> but in the new world order it could be registry
 110 <matsubara> bigjools, can you confirm that the update-cache failure described in the "Subject: Scripts failed to run: loganberry:productreleasefinder, loganberry:update-cache" refers to update-pkgcache.py and reply back to the email sent to the list?
 111 <matsubara> ok, you just did :-)
 112 <bigjools> it doesn't look like update-packagecache
 113 <bigjools> errr ah it is
 114 <matsubara> bigjools, it's the only script that has update-cache string in cronscripts/
 115 <bigjools> sorry got confused by seeing productreleasefinder
 116 <matsubara> [action] bigjools to investigate update-cache failure and reply back to the list
 117 <MootBot> ACTION received:  bigjools to investigate update-cache failure and reply back to the list
 118 <matsubara> bigjools, you might want to coordinate with sinzui since he'll check the product release failure one and suspects it might have failed because of a long running process
 119 <bigjools> matsubara: is there an oops?
 120 <sinzui> only an email that it did not run
 121 <matsubara> bigjools, nope
 122 <bigjools> and it was a one-off?
 123 <sinzui> bigjools: it did not start, and that is 99% of the time the fault of a long running process
 124 <bigjools> ok
 125 * sinzui really does not think about the issue until it happens two days in a row
 126 <bigjools> and me
 127 <matsubara> sinzui, perhaps the script monitoring should have such a feature
 128 <matsubara> but anyway, sorry for taking so long on this section
 129 <matsubara> thanks everyone
 130 <matsubara> [TOPIC] * Operations report (mthaddon/Chex/spm/mbarnett)
 131 <MootBot> New Topic:  * Operations report (mthaddon/Chex/spm/mbarnett)
 132 <Chex> hello everyone
 133 <gary_poster> hi
 134 <Chex> sorry, notes failure:
 135 <Chex> - LP Ship-it progress:
 136 <Chex> ; LP shipit is live on the new servers
 137 <Chex> ; Nigel Pugh is now in charge of approving CPs to those servers
 138 <Chex> ; We are still working on the new front-ends for LP Login and LP itself
 139 <Chex> - Buildd-manager DB restart issue/bugs: Bugs 451351 & 451349 have been
 140 <ubottu> Launchpad bug 451351 in soyuz "buildd-manager doesn't give us a good way of determining it's in a failed state" [High,Triaged] https://launchpad.net/bugs/451351
 141 <Chex> filed to address this issue, any movement to fix this problem?
 142 <Chex> - QA column in Incident Log: Tom sent an email to LP list on Oct 12, has
 143 <Chex> anyone reviewed the email and have comments/concerns about it?
 144 <matsubara> Chex, are oops reports from those new servers going to be rsync'ed to devpad? such oopses are supposed to be included in LP oops summaries?
 145 <Chex> LP Incidents of note: ; Applied: CP 9660 to lpnet, CP 9679 to lpnet
 146 <Chex>     ; Small LP outage (8 mins) : App servers (and
 147 <Chex>          librarians) didn't reconnect & had to be restarted after LP DBs
 148 <Chex> were restarted: Bug filed: 451093
 149 <Chex> and that's our report for this week. sorry for the troubles there
 150 <bigjools> Chex: I am looking into 451351 but don't expect anything soon, it's a hard problem
 151 * noodles775 has quit ("Leaving")
 152 <Chex> matsubara: I am not sure on the status of oops summaries on the new servers, I will check on that
 153 <matsubara> Chex, cool, thanks.
 154 <Chex> bigjools: ok, thanks, just looking for status of progress.
 155 <matsubara> Chex, danilos mentioned that the QA column was discussed today in the TL meeting and flacoste will champion the process.
 156 * andrea-bs has quit (Remote closed the connection)
 157 <Chex> matsubara: ok, that is great to hear.
 158 <matsubara> Chex, thanks for the report
 159 <matsubara> let me move on as we are overdue
 160 <matsubara> [TOPIC] * DBA report (stub)
 161 <MootBot> New Topic:  * DBA report (stub)
 162 <matsubara> The new replica to become the master for the authentication service has been taken offline, as the hardware was showing signs of strain keeping up with Launchpad's write load. The hardware is being beefed up to cope. The alternative is to just put the authdb replication set on this server and have the authentication service appservers connect to the main launchpad databases for the data they need to pull from the lpmain
 163 <matsubara> replication set.
 164 <matsubara> Nothing else to report.
 165 <matsubara> that came from Stuart. any questions about dba's report?
 166 <matsubara> ok, I'll take that as a no :-)
 167 <matsubara> [TOPIC] * Proposed items
 168 <MootBot> New Topic:  * Proposed items
 169 <matsubara> no new proposed items
 170 <matsubara> Thank you all for attending this week's Launchpad Production Meeting. See https://dev.launchpad.net/MeetingAgenda for the logs. 
 171 <matsubara> sorry for overrunning 
 172 <matsubara> #endmeeting
 173 <MootBot> Meeting finished at 10:51.

DevelopmentMeeting20091015 (last edited 2009-10-15 17:10:39 by matsubara)