1 <matsubara> #startmeeting
2 <MootBot> Meeting started at 10:00. The chair is matsubara.
3 <MootBot> Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
4 <matsubara> Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues.
5 * danilos has quit (Nick collision from services.)
6 <matsubara> [TOPIC] Roll Call
7 <MootBot> New Topic: Roll Call
8 <matsubara> Not on the Launchpad Dev team? Welcome! Come "me" with the rest of us!
9 <sinzui> me
10 <al-maisan> me
11 <danilo_> me
12 <mrjazzcat> me
13 <mbarnett> me
14 <matsubara> sorry mrjazzcat, I always forget to ping you about the meeting. I'll add you to the "Who should be here?" section if you don't mind
15 <mrjazzcat> yes, please
16 <matsubara> on the MeetingAgenda page, I mean
17 <mrjazzcat> no worries
18 <matsubara> [action] add brian to the list of attendees in the MeetingAgenda page
19 <MootBot> ACTION received: add brian to the list of attendees in the MeetingAgenda page
20 <matsubara> Ursula won't be around today
21 <matsubara> and I'll be standing in for Gary
22 <matsubara> rockstar, hi, around?
23 <matsubara> allenap, hi
24 <matsubara> well, let's move on and then Gavin and Paul can join in later
25 <matsubara> [TOPIC] Agenda
26 <MootBot> New Topic: Agenda
27 <matsubara> * Actions from last meeting
28 <matsubara> * Oops report & Critical Bugs & Broken scripts
29 <matsubara> * Operations report (mthaddon/Chex/spm/mbarnett)
30 <matsubara> * DBA report (stub)
31 <matsubara> * Proposed items
32 <matsubara> [TOPIC] * Actions from last meeting
33 <MootBot> New Topic: * Actions from last meeting
34 <matsubara> * allenap to dig the master bug of OOPS-1474EA771
35 <matsubara> * salgado to take a look in the TypeError oopses (OOPS-1479S1000)
36 <matsubara> * already did that, this is bug 403281, it happened because mthaddon was testing the new read-only switch on staging.
37 <matsubara> * rockstar to take a look in OOPS-1480CMP1
38 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
39 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1479S1000
40 <ubottu> Launchpad bug 403281 in launchpad-foundations "public xmlrpc requests broken during read only period" [Undecided,Triaged] https://launchpad.net/bugs/403281
41 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1480CMP1
42 <matsubara> ok, so I'll re-add both items for allenap and rockstar
43 <matsubara> [action] * allenap to dig the master bug of OOPS-1474EA771
44 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
45 <MootBot> ACTION received: * allenap to dig the master bug of OOPS-1474EA771
46 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
47 <matsubara> [action] * rockstar to take a look in OOPS-1480CMP1
48 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1480CMP1
49 <MootBot> ACTION received: * rockstar to take a look in OOPS-1480CMP1
50 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1480CMP1
51 <matsubara> [TOPIC] * Oops report & Critical Bugs & Broken scripts
52 <MootBot> New Topic: * Oops report & Critical Bugs & Broken scripts
53 <matsubara> we have some oops reports but most of them are foundations issues
54 <matsubara> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA884
55 <matsubara> Looks like an anonymous user is trying to do some operation which (s)he's not allowed. Should we really log an oops for this?
56 <matsubara> maybe related to https://bugs.edge.launchpad.net/launchpad-foundations/+bug/271029
57 <matsubara> More non-informational disconnectionerrors https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489J147
58 <matsubara> InternalError after the rollout https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489C1094
59 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1488EA884
60 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1489J147
61 <ubottu> Ubuntu bug 271029 in launchpad-foundations "ForbiddenAttribute exception raised changing property of object" [Medium,Triaged]
62 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1489C1094
63 <matsubara> code team, BranchMergeProposalExists https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA174
64 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1488EA174
65 <matsubara> so, that's it and there's no one from Code to take a look at the BranchMergeProposalExists one
66 <matsubara> [action] matsubara to email Tim about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA174
67 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1488EA174
68 <MootBot> ACTION received: matsubara to email Tim about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA174
69 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1488EA174
70 <matsubara> [action] matsubara to talk to leonard about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA884
71 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1488EA884
72 <MootBot> ACTION received: matsubara to talk to leonard about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA884
73 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1488EA884
74 <matsubara> [action] matsubara to talk to salgado about More non-informational disconnectionerrors https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489J147
75 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1489J147
76 <MootBot> ACTION received: matsubara to talk to salgado about More non-informational disconnectionerrors https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489J147
77 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1489J147
78 <matsubara> [action] matsubara to talk to stub or gary about InternalError after the rollout https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489C1094
79 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1489C1094
80 <MootBot> ACTION received: matsubara to talk to stub or gary about InternalError after the rollout https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489C1094
81 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1489C1094
82 <matsubara> lovely, looks like I'm running the meeting all by myself heheh
83 <rockstar> me
84 <allenap> me
85 * noodles775 has quit (Read error: 110 (Connection timed out))
86 <al-maisan> :)
87 <matsubara> on the broken scripts side
88 <matsubara> sinzui, Scripts failed to run: loganberry:send-person-notifications seems to be broken
89 <matsubara> sinzui, could you take a look and reply to the list?
90 * noodles775 (n=miken@canonical/launchpad/noodles775) has joined #launchpad-meeting
91 <sinzui> matsubara: all scripts appear to be broken
92 <matsubara> all?
93 <sinzui> They are not running and I am tempted to say something new was added that is taking forever and a day
94 <matsubara> I only see notifications for send-person-notifications and garbo-hourly
95 <matsubara> sinzui, can you confirm and reply to the list that's the case, at least for the send-person-notifications one?
96 <matsubara> I'll ask losas and/or stub about garbo-hourly not running as well
97 <allenap> matsubara: Re. OOPS-1474EA771, it's bug 508302, and deryck is working on it today.
98 <ubottu> Launchpad bug 508302 in malone "NotImplementedError OOPS when reporting a bug" [High,In progress] https://launchpad.net/bugs/508302
99 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
100 <matsubara> thanks allenap, I'll adjust the bug link on that oops report
101 <matsubara> [action] matsubara to fix bug link on OOPS-1474EA771 to point to bug 508302
102 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
103 <MootBot> ACTION received: matsubara to fix bug link on OOPS-1474EA771 to point to bug 508302
104 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
105 <matsubara> [action] sinzui to investigate failure on send-person-notifications and reply to the list with his findings
106 <MootBot> ACTION received: sinzui to investigate failure on send-person-notifications and reply to the list with his findings
107 <matsubara> btw, updatebranches script also failed recently but that's been fixed by spm. the new rollout changed the script name and losas updated the notification thing to recognize the new name
108 <matsubara> on the critical bugs side
109 <rockstar> matsubara, updatebranches no longer runs.
110 <matsubara> we have 3 critical bugs
111 <mthaddon> matsubara: er, not quite
112 <rockstar> It's been replaced by scan_branches
113 <mthaddon> matsubara: we've had to revert it a bunch of times
114 <matsubara> mthaddon, hmm no? spm's email seems to indicate that
115 <mthaddon> matsubara: spm went to bed a while ago - a new problem was discovered since then
116 <mthaddon> matsubara: abentley and Chex have been working on it
117 <matsubara> oh, I was looking at this latest email to the list replying to one of the script failures notification
118 <matsubara> well, if they're already working on it, it's ok. :-)
119 <mthaddon> matsubara: not really...
120 <matsubara> mthaddon, no? what else is expected?
121 <mthaddon> matsubara: as I understand it, we've reverted to the old script because we still don't know what was wrong
122 <mthaddon> matsubara: and the fact that we've reverted between the old and new scripts twice now on production is a problem in itself
123 * salgado (n=salgado@canonical/launchpad/salgado) has joined #launchpad-meeting
124 <matsubara> mthaddon, I meant it's ok in the sense that people are already working on a solution and there's nothing much to be done during this meeting to have people act on it
125 <mthaddon> matsubara: and also the fact that the first we heard about the problem was from a user report
126 <mthaddon> i.e. we don't have a good measure of when this problem is even happening
127 <mthaddon> matsubara: maybe not, but I'd like a bit of discussion about this class of problem and what can be done to prevent it in the future
128 <danilo_> mthaddon: what is the exact problem that we need to be able to track? (sorry, I am not fully up to date on what broke)
129 <mthaddon> danilo_: aiui email notifications of branch updates failed to be sent out
130 <gary_poster> mthaddon: "reverted ...twice...on production": I think we all agree this sucks. However, AIUI, this was successfully QAd. Either the QA was bad, or staging is not close enough to prod in some way. I don't think we know yet.
131 <matsubara> mthaddon, I'm unaware of the details as well. My expectation is that an IncidentLog will be filed and actions to prevent it will be included in the incident log
132 <danilo_> mthaddon: ah, right, that could have a bigger impact (it might be harming us in translations as well)
133 <mthaddon> matsubara: this doesn't really qualify as an incident log item since there's no measurable service that's been interrupted (we don't have any kind of nagios monitoring of this) - I guess I'm asking how we plan to approach it from here
134 <mthaddon> and how we got into this situation
135 <danilo_> mthaddon, gary_poster, matsubara: we are obviously missing a dedicated "communications person" for this specific item (someone to keep the entire situation in check); we've discussed that approach before, it'd be nice to find someone who can offload the communication side from abentley and others working on it
136 <gary_poster> danilo_: to the degree there's a failure there (communications), it'd probably be mine as RM
137 <gary_poster> maybe we can have somebody else too
138 <gary_poster> but that's RM stuff
139 <gary_poster> but AIUI that's not the prob
140 <danilo_> gary_poster, not necessarily, we discussed this in a TL call a few weeks (months?) back where we need someone to communicate with everyone
141 <gary_poster> maybe so
142 <gary_poster> but probs I see:
143 <danilo_> gary_poster, it's mostly about having someone take responsibility for making sure problems are visible and we know what's going on
144 <gary_poster> - we didn't catch this on staging. Why?
145 <gary_poster> either QA was bad or staging is too diff
146 <gary_poster> we need to know why
147 <gary_poster> and fix it
148 <mthaddon> yep, I agree with that
149 <gary_poster> then also, unless I misunderstand, mthaddon is saying that we don't have an automated nagios-like process verifying basic success on production for this thing
150 <danilo_> gary_poster, neither of those is easy to fix (one depends on people always DTRT, another on machines always DTRT), so we need to be able to easily find out when it's broken rather than wait for users to report it
151 <gary_poster> danilo_: but doesn't that depend on one of the three things I said? (people DTRT, machines DTRT, nagios-like-thing DTRT)
152 <danilo_> gary_poster, it does, I was typing before you typed the last one :)
153 <mthaddon> gary_poster: it's possible we can't do that for *everything*, but if we decide this is a sufficiently important thing that we care about it if it fails, it sounds like we need to monitor it somehow, yeah (possibly we are already with OOPSes, but why didn't we catch it til a user told us about it?)
154 <gary_poster> :-) ok
155 <danilo_> gary_poster, the 4th is lack of coordination and communication :)
156 * salgado-lunch has quit (Read error: 110 (Connection timed out))
157 <gary_poster> mthaddon: right. For me, this gets to my "too many different kinds of moving parts" in our architecture. If we have fewer moving parts then we can institute more uniform nagios-like-checks.
158 <gary_poster> maybe the jobs system can help with this
159 <danilo_> anyway, gary_poster, I think we should just raise the importance of ensuring sufficient monitoring of this part of code-hosting by thumper, and we can be done with the topic
160 <gary_poster> maybe we can architect the jobs system to give us a nagios-like hook
161 <danilo_> gary_poster, we don't have to solve the problem here :)
162 <gary_poster> because doing it with cron scripts is a one-per job
163 <matsubara> danilo_, can you raise the topic in the next TL meeting?
164 <gary_poster> danilo_: ack. I kind of disagree with your summary though, and your action item, so that's why I'm continuing to blather :-)
165 <danilo_> matsubara, we are having a week long TL meeting next week, so it'd be best to action it for someone from code team to pass it on to thumper, imho :)
166 <gary_poster> (IOW, this is not a problem for thumper, it is a problem for Björn, team leads, etc.)
167 <danilo_> gary_poster, well, sure, I agree, but one step at a time
168 <gary_poster> matsubara: two action items: :-)
169 <matsubara> [action] rockstar to raise the importance of ensuring sufficient monitoring of this part (i.e. branch updates emails failing to be delivered) of code-hosting by thumper
170 <MootBot> ACTION received: rockstar to raise the importance of ensuring sufficient monitoring of this part (i.e. branch updates emails failing to be delivered) of code-hosting by thumper
171 <danilo_> gary_poster, there's the immediate problem and then there's the elegant solution; I'm always for fixing the immediate problem first and having the elegant solution come out of that
172 <gary_poster> yeah, that's number one
173 <gary_poster> number two is gary to bring up architecture concerns to team lead mtg :-)
174 <danilo_> gary_poster, as for the other one, I think it ties in well with what we discussed today and what we'll want to discuss anyway
175 <matsubara> [action] TLs + Bjorn to talk about "too many different kinds of moving parts" in our architecture. If we have fewer moving parts then we can institute more uniform nagios-like-checks.
176 <MootBot> ACTION received: TLs + Bjorn to talk about "too many different kinds of moving parts" in our architecture. If we have fewer moving parts then we can institute more uniform nagios-like-checks.
177 <matsubara> does that summarize it well?
178 <gary_poster> yeah thank you. though it's probably my action, since I'm the one with the bee in my bonnet :-) but that's fine
179 <danilo_> gary_poster, matsubara: I don't like action items like that because they put no responsibility on anyone in particular, thus meaning that if they get done, they get done unrelated to the action item; thus, you don't really need it
180 <gary_poster> so give it to me :-)
181 <matsubara> danilo_, I'll add it to gary's queue when I add the summary to the MeetingAgenda page
182 <danilo_> gary_poster, heh, that's ok, I am certain we would have discussed this regardless of us having any particular action item
183 <danilo_> matsubara, sure, thanks
184 <gary_poster> :-)
185 <matsubara> it serves as a reminder as well
186 <matsubara> anyway, thanks for the comments
187 <rockstar> In fairness, the "not getting branch update emails" thing was because a rather large part of the code hosting system was made into a job.
188 <gary_poster> To whom are you being fair? :-)
189 <gary_poster> Never mind, I'll be quiet :-)
190 <danilo_> :)
191 <matsubara> we have 3 critical bugs, one in progress, one fix committed
192 <rockstar> I'm not sure how "sufficient monitoring" would have fixed this.
193 <matsubara> the other one is triaged, bug 511567
194 <ubottu> Launchpad bug 511567 in launchpad-foundations "Can't remove authorised app" [Critical,Triaged] https://launchpad.net/bugs/511567
195 <rockstar> gary_poster, to the code team in general.
196 <matsubara> hmm
197 <matsubara> that's a dupe
198 <danilo_> rockstar, sufficient monitoring of scripts that do this
199 <matsubara> and I filed that bug a few days ago
200 <rockstar> danilo_, how so?
201 <matsubara> or maybe I filed the dupe
202 <gary_poster> rockstar: ah, gotcha. Tim can beat us into shape at the TL sprint so we understand.
203 <rockstar> gary_poster, yeah, I'll talk to him.
204 <gary_poster> cool
205 <danilo_> rockstar, monitoring should have caught the problem (i.e. "hey, this script is failing"); I won't pretend to understand the entire problem, so we might be entirely off base, but we should be able to check our service level
206 <rockstar> danilo_, there wasn't a script failing.
207 <rockstar> It ran fine, it was just a new script that had apparently left out some old functionality.
208 <danilo_> rockstar, right, never mind the "implementation details", the problem is: "why we didn't catch it before someone told us it's failing"; there's not necessarily a technical solution
209 <danilo_> matsubara, am I still on the channel?
210 <matsubara> yes
211 <danilo_> oh, ok, it's just everybody being quiet :)
212 <danilo_> matsubara, I think we should go on
213 <matsubara> sorry, I was looking for a bug report to dupe against 511567
214 <matsubara> anyway
215 <matsubara> thanks
216 <matsubara> [TOPIC] * Operations report (mthaddon/Chex/spm/mbarnett)
217 <MootBot> New Topic: * Operations report (mthaddon/Chex/spm/mbarnett)
218 <matsubara> hello?
219 <matsubara> Chex, mbarnett ?
220 <mbarnett> sorry
221 <Chex> sorry
222 <Chex> here is the report
223 <Chex> - LP rollout 10.01 Wednesday was successful:
224 <Chex> : See https://wiki.canonical.com/InformationInfrastructure/OSA/LPRollout20100127 for more details.
225 <Chex> : The read-only switch left idle connections to the master DB, it is currently being investigated
226 <Chex> - New LP Appserver is online, some issues with internal access, but now everything is OK.
227 <Chex> - New branch-scanner having issues, just reverted back to old again. Based on meeting discussion here,
228 <Chex> continuing to address.
229 <Chex> and that's all for us. Any questions/comments?
230 <matsubara> Chex, what's this new LP appserver online? I guess I'll have to tell oops-tools about oops reports from it?
231 <matsubara> [action] matsubara to update oops-tools to know about the new lp appserver
232 <MootBot> ACTION received: matsubara to update oops-tools to know about the new lp appserver
233 <noodles775> Chex: do you know if the new servers have access to the private librarian?
234 <mbarnett> matsubara: soybean was recently put online as a replacement for gangotri +
235 <mbarnett> noodles775: that was resolved earlier today
236 <matsubara> mbarnett, oh, so it's using the same config files?
237 <noodles775> A user was seeing about 1 in 4 requests to download a... ah, great, thanks!
238 <mbarnett> matsubara: it took over lpnet1, lpnet2, and edge1 from gangotri, stole lpnet9 from gandwana, and added a sparkly new lpnet15 standard lpnet appserver
239 <matsubara> mbarnett, ok, it's the new lpnet15 instance I care about. I'll check the configs and update oops-tools accordingly
240 <matsubara> thanks
241 <matsubara> moving on
242 <mbarnett> matsubara: thank you.
243 <matsubara> [TOPIC] * DBA report (stub)
244 <MootBot> New Topic: * DBA report (stub)
245 <matsubara> stub sent the report to the list
246 <matsubara> allenap, he mentioned something about checkwatches being very cpu intensive. it's probably of interest to the Bugs team
247 <allenap> mars: deryck has just forwarded the message to me.
248 <allenap> matsubara: ^
249 <matsubara> thanks allenap
250 <matsubara> [TOPIC] * Proposed items
251 <MootBot> New Topic: * Proposed items
252 <matsubara> no proposed items
253 <matsubara> which brings this meeting to a close
254 <matsubara> Thank you all for attending this week's Launchpad Production Meeting. See https://dev.launchpad.net/MeetingAgenda for the logs.
255 <matsubara> and sorry for the delay
256 <matsubara> #endmeeting