1 <matsubara> #startmeeting
2 <MootBot> Meeting started at 10:00. The chair is matsubara.
3 <MootBot> Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
4 <matsubara> Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues.
5 <matsubara> [TOPIC] Roll Call
6 <MootBot> New Topic: Roll Call
7 <rockstar> ni!
8 * herb__ has quit (Client Quit)
9 <danilos> me
10 <adeuring> me
11 <adeuring> (allenap is sick)
12 <danilos> (or "coo", if anyone knows about kin dza dza ;)
13 <matsubara> gary_poster, Chex, bigjools: hi
14 <mbarnett> hello
15 <matsubara> sinzui, hi
16 <gary_poster> me
17 <gary_poster> and hi
18 <matsubara> :-)
19 <sinzui> me
20 <mbarnett> me
21 <matsubara> apologies from Stuart and Ursula
22 <bigjools> me
23 <Chex> hello
24 <matsubara> [TOPIC] Agenda
25 <MootBot> New Topic: Agenda
26 <matsubara> * Actions from last meeting
27 <matsubara> * Oops report & Critical Bugs & Broken scripts
28 <matsubara> * Operations report (mthaddon/Chex/spm/mbarnett)
29 <matsubara> * DBA report (stub)
30 <matsubara> * Proposed items
31 <matsubara> [TOPIC] * Actions from last meeting
32 <MootBot> New Topic: * Actions from last meeting
33 <matsubara> * matsubara to trawl logs related to high load on edge on 2009-09-09 ~1830UTC and ping Chex about it
34 <matsubara> * matsubara to email the devel list about the new ErrorReportingUtility method
35 <matsubara> * done
36 <matsubara> * matsubara to file a bug to have the HWSubmissionMissingFields oopses as informational only (note to self: see bug 438671 for more details)
37 <matsubara> * filed https://bugs.edge.launchpad.net/malone/+bug/446660
38 <matsubara> * matsubara to look in lp-production-configs for the new oops prefixes.
39 <matsubara> * all QA contacts to inform their teams about the new QA column and what they should do about it.
40 <matsubara> * Chex to email the list about the new QA column in https://wiki.canonical.com/InformationInfrastructure/OSA/LPIncidentLog
41 <ubottu> Launchpad bug 438671 in checkbox "HWSubmissionMissingFields OOPS on +hwdb/+submit" [Undecided,Confirmed] https://launchpad.net/bugs/438671
42 <ubottu> Launchpad bug 446660 in malone "HWSubmissionMissingFields exceptions should be updated to be informational only" [High,Triaged]
43 <matsubara> I still haven't checked the high load logs. Chex or mthaddon, did you notice high loads after the 2009-09-09?
44 <Chex> matsubara: we still have been seeing some high loads, yes
45 <matsubara> I did look at the new prefixes in lp-production-configs. I need to update oops-tools to recognize those
46 <matsubara> I also noticed that some oops prefixes will conflict with existing ones, so I need to sort that out with...
47 <matsubara> losas I guess
48 <matsubara> [action] matsubara to file a bug on oops-tools to recognize new oops prefixes and sort out conflicting prefixes with losas
49 <MootBot> ACTION received: matsubara to file a bug on oops-tools to recognize new oops prefixes and sort out conflicting prefixes with losas
50 <matsubara> Chex, re: the high load, could you take on the task of analysing the logs? my idea was to correlate information from the app servers logs with the apache logs and see if that could shed some light.
51 <matsubara> mthaddon emailed the list about the new QA column, so everyone, read it and spread the word to your teams, please.
52 <Chex> Chex: yes sure, I can look at that.
53 <danilos> matsubara: it has just been discussed in the TL call as well, flacoste will champion the process
54 <Chex> matsubara: ^^ I mean..
55 <danilos> matsubara: (about QA Info column on LP incident log)
56 <matsubara> Chex, cool, thanks a lot. ping me if you need any info on that
57 <matsubara> danilos, cool. thanks!
58 <matsubara> [action] Chex to check app server logs and apache logs to see if they can shed any light on the high load issue.
59 <MootBot> ACTION received: Chex to check app server logs and apache logs to see if they can shed any light on the high load issue.
60 <matsubara> [TOPIC] * Oops report & Critical Bugs & Broken scripts
61 <MootBot> New Topic: * Oops report & Critical Bugs & Broken scripts
62 <matsubara> we're seeing a bunch of DisconnectionErrors which are not informational only
63 <matsubara> which means, the Retry mechanism is not enough for those cases.
64 <gary_poster> matsubara: are these the ones on the xmlrpc server?
65 <gary_poster> or something else?
66 <matsubara> gary_poster, yes, most of them on xmlrpc server
67 <matsubara> but there are a few, like OOPS-1383I246, in login.launchpad.net
68 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1383I246
69 <gary_poster> matsubara: right. I investigated and could not duplicate the ones in the xmlrpc server. Kicking the xmlrpc server made them go away. There's a bug number which I can get in a moment. After discussing with flacoste, I think the best we can hope for is to figure out a way to add more diagnostic information should the problem happen again
70 <matsubara> gary_poster, ok, I take it this is a foundations task then. let me know the bug number please (or I'll file a new one for the more diagnostic info needed issue, if that's not what the bug you mentioned is about)
71 <gary_poster> bug 450593 . Stuart has a follow up: check with losas if there were any unusual activity ATM
72 <ubottu> Launchpad bug 450593 in launchpad-foundations "Lots of DisconnectionErrors on xmlrpc server - staging" [Undecided,New] https://launchpad.net/bugs/450593
73 <gary_poster> I think a comment saying that we should address by adding diagnostic information in case there is a repeat would be sufficient. I'll do that.
74 <matsubara> thanks gary_poster
75 <matsubara> apart from that we have a bunch of oopses that will need fixing given the new zero oops policy.
76 <matsubara> Ursula will keep an eye on those for now and let the team leads know which ones are happening more frequently
77 <matsubara> we had some script failures last week
78 <matsubara> the main one seems to be the branch-puller which was already discussed in the list
79 <matsubara> checkwatches failed on the 13th, but since no other email came out, I assume it was a blip. adeuring, can you confirm?
80 <adeuring> matsubara: erm, I have no idea...
81 <matsubara> and the product-release finder and update-cache failed to run on the 14th
82 <adeuring> matsubara: I'll ask Graham
83 <matsubara> sinzui, do you know what's up with the product release finder script?
84 <matsubara> who's owns the update-cache script?
85 <matsubara> s/'s//
86 <matsubara> thanks adeuring
87 <gary_poster> I don't know; looking
88 <matsubara> [action] adeuring to check with gmb about checkwatches failure
89 <MootBot> ACTION received: adeuring to check with gmb about checkwatches failure
90 <sinzui> matsubara: No, but I think the issue is not that it failed, but that a long process prevented it from running
91 <matsubara> sinzui, right, that'd explain. could you check that's the root cause and reply to the list?
92 <sinzui> matsubara: okay
93 <matsubara> maybe the update-cache failure happened for the same reason
94 <gary_poster> matsubara: I don't see an update-cache script in the LP tree. (I do see variants like update-download-cache)
95 <matsubara> just a reminder to everyone, if a script fails and your team owns that script, please reply to the failure email saying that someone is taking a look at it.
96 <matsubara> gary_poster, all I see is: "The script 'update-cache' didn't run on 'loganberry' between 2009-10-14 04:00:08 and 2009-10-14 22:00:08 (last seen 2009-10-13 11:36:51.345188)" not sure which script that one is monitoring.
97 <matsubara> for the critical bugs section, we have 4 bugs, 3 fix committed and 1 in progress
98 <matsubara> danilos, the one in progress is assigned to henning but he's on vacation
99 <matsubara> is it really critical?
100 <danilos> matsubara: I'd have to check, sorry for not being on top of this
101 <gary_poster> (I also looked for update-cache in lp-production-configs. not there either.)
102 <matsubara> gary_poster, I think it's cronscripts/update-pkgcache.py. IIRC, the losas script monitoring tool uses the script name defined in LaunchpadCronScript
103 <matsubara> [action] danilos to check bug 438039, assess if it's really critical. if it is, land a fix, if it's not, update the importance
104 <MootBot> ACTION received: danilos to check bug 438039, assess if it's really critical. if it is, land a fix, if it's not, update the importance
105 <ubottu> Launchpad bug 438039 in rosetta "bzr branch import script oopses sometimes" [Critical,In progress] https://launchpad.net/bugs/438039
106 <gary_poster> matsubara: oh ok, thanks. that script is either the one salgado was talking about that he owns, or something for soyuz, seems to me.
107 <bigjools> it's traditionally maintained by soyuz
108 <gary_poster> bigjools: ok, thanks
109 <bigjools> but in the new world order it could be registry
110 <matsubara> bigjools, can you confirm that the update-cache failure described in the "Subject: Scripts failed to run: loganberry:productreleasefinder, loganberry:update-cache" refers to update-pkgcache.py and reply back to the email sent to the list?
111 <matsubara> ok, you just did :-)
112 <bigjools> it doesn't look like update-packagecache
113 <bigjools> errr ah it is
114 <matsubara> bigjools, it's the only script that has update-cache string in cronscripts/
115 <bigjools> sorry got confused by seeing productreleasefinder
116 <matsubara> [action] bigjools to investigate update-cache failure and reply back to the list
117 <MootBot> ACTION received: bigjools to investigate update-cache failure and reply back to the list
118 <matsubara> bigjools, you might want to coordinate with sinzui since he'll check the product release failure one and suspects it might have failed because of a long running process
119 <bigjools> matsubara: is there an oops?
120 <sinzui> only an email that it did not run
121 <matsubara> bigjools, nope
122 <bigjools> and it was a one-off?
123 <sinzui> bigjools: it did not start, and that is 99% of the time the fault of a long running process
124 <bigjools> ok
125 * sinzui really does not think about the issue until it happens two days in a row
126 <bigjools> and me
127 <matsubara> sinzui, perhaps the script monitoring should have such a feature
128 <matsubara> but anyway, sorry for taking so long on this section
129 <matsubara> thanks everyone
130 <matsubara> [TOPIC] * Operations report (mthaddon/Chex/spm/mbarnett)
131 <MootBot> New Topic: * Operations report (mthaddon/Chex/spm/mbarnett)
132 <Chex> hello everyone
133 <gary_poster> hi
134 <Chex> sorry, notes failure:
135 <Chex> - LP Ship-it progress:
136 <Chex> ; LP shipit is live on the new servers
137 <Chex> ; Nigel Pugh is now in charge of approving CPs to those servers
138 <Chex> ; We are still working on the new front-ends for LP Login and LP itself
139 <Chex> - Buildd-manager DB restart issue/bugs: Bugs 451351 & 451349 have been
140 <ubottu> Launchpad bug 451351 in soyuz "buildd-manager doesn't give us a good way of determining it's in a failed state" [High,Triaged] https://launchpad.net/bugs/451351
141 <Chex> filed to address this issue, any movement to fix this problem?
142 <Chex> - QA column in Incident Log: Tom sent an email to LP list on Oct 12, has
143 <Chex> anyone reviewed the email and have comments/concerns about it?
144 <matsubara> Chex, are oops reports from those new servers going to be rsync'ed to devpad? are such oopses supposed to be included in LP oops summaries?
145 <Chex> LP Incidents of note: ; Applied: CP 9660 to lpnet, CP 9679 to lpnet
146 <Chex> ; Small LP outage (8 mins) : App servers (and
147 <Chex> librarians) didn't reconnect & had to be restarted after LP DBs
148 <Chex> were restarted: Bug filed: 451093
149 <Chex> and thats our report for this week. sorry for the troubles there
150 <bigjools> Chex: I am looking into 451351 but don't expect anything soon, it's a hard problem
151 * noodles775 has quit ("Leaving")
152 <Chex> matsubara: I am not sure on the status of oops summaries on the new servers, I will check on that
153 <matsubara> Chex, cool, thanks.
154 <Chex> bigjools: ok, thanks, just looking for status of progress.
155 <matsubara> Chex, danilos mentioned that the QA column thing was discussed today in the TL meeting and flacoste will champion the process.
156 * andrea-bs has quit (Remote closed the connection)
157 <Chex> matsubara: ok, that is great to hear.
158 <matsubara> Chex, thanks for the report
159 <matsubara> let me move on as we are overdue
160 <matsubara> [TOPIC] * DBA report (stub)
161 <MootBot> New Topic: * DBA report (stub)
162 <matsubara> The new replica to become the master for the authentication service has been taken offline, as the hardware was showing signs of strain keeping up with Launchpad's write load. The hardware is being beefed up to cope. The alternative is to just put the authdb replication set on this server and have the authentication service appservers connect to the main launchpad databases for the data they need to pull from the lpmain
163 <matsubara> replication set.
164 <matsubara> Nothing else to report.
165 <matsubara> that came from Stuart. any questions about dba's report?
166 <matsubara> ok, I'll take that as a no :-)
167 <matsubara> [TOPIC] * Proposed items
168 <MootBot> New Topic: * Proposed items
169 <matsubara> no new proposed items
170 <matsubara> Thank you all for attending this week's Launchpad Production Meeting. See https://dev.launchpad.net/MeetingAgenda for the logs.
171 <matsubara> sorry for overrunning
172 <matsubara> #endmeeting
173 <MootBot> Meeting finished at 10:51.