DevelopmentMeeting20100128

   1 <matsubara> #startmeeting
   2 <MootBot> Meeting started at 10:00. The chair is matsubara.
   3 <MootBot> Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
   4 <matsubara> Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues. 
   5 * danilos has quit (Nick collision from services.)
   6 <matsubara> [TOPIC] Roll Call 
   7 <MootBot> New Topic:  Roll Call
   8 <matsubara> Not on the Launchpad Dev team? Welcome! Come "me" with the rest of us! 
   9 <sinzui> me
  10 <al-maisan> me
  11 <danilo_> me
  12 <mrjazzcat> me
  13 <mbarnett> me
  14 <matsubara> sorry mrjazzcat, I always forget to ping you about the meeting. I'll add you to the "Who should be here?" section if you don't mind
  15 <mrjazzcat> yes, please
  16 <matsubara> on the MeetingAgenda page, I mean
  17 <mrjazzcat> no worries
  18 <matsubara> [action] add brian to the list of attendees in the MeetingAgenda page
  19 <MootBot> ACTION received:  add brian to the list of attendees in the MeetingAgenda page
  20 <matsubara> Ursula won't be around today
  21 <matsubara> and I'll be standing in for Gary
  22 <matsubara> rockstar, hi, around?
  23 <matsubara> allenap, hi
  24 <matsubara> well, let's move on and then Gavin and Paul can join in later
  25 <matsubara> [TOPIC] Agenda 
  26 <MootBot> New Topic:  Agenda
  27 <matsubara>  * Actions from last meeting
  28 <matsubara>  * Oops report & Critical Bugs & Broken scripts
  29 <matsubara>  * Operations report (mthaddon/Chex/spm/mbarnett)
  30 <matsubara>  * DBA report (stub)
  31 <matsubara>  * Proposed items
  32 <matsubara> [TOPIC] * Actions from last meeting
  33 <MootBot> New Topic:  * Actions from last meeting
  34 <matsubara>  * allenap to dig the master bug of OOPS-1474EA771
  35 <matsubara>  * salgado to take a look in the TypeError oopses (OOPS-1479S1000)
  36 <matsubara>    * already did that, this is bug 403281, it happened because mthaddon was testing the new read-only switch on staging.
  37 <matsubara>  * rockstar to take a look in OOPS-1480CMP1
  38 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
  39 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1479S1000
  40 <ubottu> Launchpad bug 403281 in launchpad-foundations "public xmlrpc requests broken during read only period" [Undecided,Triaged] https://launchpad.net/bugs/403281
  41 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1480CMP1
  42 <matsubara> ok, so I'll re-add both items for allenap and rockstar 
  43 <matsubara> [action] * allenap to dig the master bug of OOPS-1474EA771
  44 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
  45 <MootBot> ACTION received:  * allenap to dig the master bug of OOPS-1474EA771
  46 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
  47 <matsubara> [action] * rockstar to take a look in OOPS-1480CMP1
  48 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1480CMP1
  49 <MootBot> ACTION received:  * rockstar to take a look in OOPS-1480CMP1
  50 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1480CMP1
  51 <matsubara> [TOPIC] * Oops report & Critical Bugs & Broken scripts
  52 <MootBot> New Topic:  * Oops report & Critical Bugs & Broken scripts
  53 <matsubara> we have some oops reports but most of them are foundations issues
  54 <matsubara> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA884
  55 <matsubara> Looks like an anonymous user is trying to do some operation which (s)he's not allowed. Should we really log an oops for this?
  56 <matsubara> maybe related to https://bugs.edge.launchpad.net/launchpad-foundations/+bug/271029
  57 <matsubara> More non-informational disconnectionerrors https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489J147
  58 <matsubara> InternalError after the rollout https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489C1094
  59 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1488EA884
  60 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1489J147
  61 <ubottu> Ubuntu bug 271029 in launchpad-foundations "ForbiddenAttribute exception raised changing property of object" [Medium,Triaged]
  62 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1489C1094
  63 <matsubara> code team, BranchMergeProposalExists https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA174
  64 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1488EA174
  65 <matsubara> so, that's it and there's no one from Code to take a look at the BranchMergeProposalExists one
  66 <matsubara> [action] matsubara to email Tim about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA174
  67 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1488EA174
  68 <MootBot> ACTION received:  matsubara to email Tim about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA174
  69 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1488EA174
  70 <matsubara> [action] matsubara to talk to leonard about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA884
  71 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1488EA884
  72 <MootBot> ACTION received:  matsubara to talk to leonard about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA884
  73 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1488EA884
  74 <matsubara> [action] matsubara to talk to salgado about More non-informational disconnectionerrors https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489J147
  75 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1489J147
  76 <MootBot> ACTION received:  matsubara to talk to salgado about More non-informational disconnectionerrors https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489J147
  77 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1489J147
  78 <matsubara> [action] matsubara to talk to stub or gary about InternalError after the rollout https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489C1094
  79 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1489C1094
  80 <MootBot> ACTION received:  matsubara to talk to stub or gary about InternalError after the rollout https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489C1094
  81 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1489C1094
  82 <matsubara> lovely, looks like I'm running the meeting all by myself heheh
  83 <rockstar> me
  84 <allenap> me
  85 * noodles775 has quit (Read error: 110 (Connection timed out))
  86 <al-maisan> :)
  87 <matsubara> on the broken scripts side
  88 <matsubara> sinzui, Scripts failed to run: loganberry:send-person-notifications seems to be broken
  89 <matsubara> sinzui, could you take a look and reply to the list?
  90 * noodles775 (n=miken@canonical/launchpad/noodles775) has joined #launchpad-meeting
  91 <sinzui> matsubara: all scripts appear to be broken
  92 <matsubara> all?
  93 <sinzui> They are not running and I am tempted to say something new was added that is taking forever and a day
  94 <matsubara> I only see notifications for send-person-notifications and garbo-hourly
  95 <matsubara> sinzui, can you confirm and reply to the list that's the case, at least for the send-person-notifications one?
  96 <matsubara> I'll ask losas and/or stub about garbo-hourly not running as well
  97 <allenap> matsubara: Re. OOPS-1474EA771, it's bug 508302, and deryck is working on it today.
  98 <ubottu> Launchpad bug 508302 in malone "NotImplementedError OOPS when reporting a bug" [High,In progress] https://launchpad.net/bugs/508302
  99 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
 100 <matsubara> thanks allenap, I'll adjust the bug link on that oops report
 101 <matsubara> [action] matsubara to fix bug link on OOPS-1474EA771 to point to bug 508302
 102 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
 103 <MootBot> ACTION received:  matsubara to fix bug link on OOPS-1474EA771 to point to bug 508302
 104 <ubottu> https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
 105 <matsubara> [action] sinzui to investigate failure on send-person-notifications and reply to the list with his findings
 106 <MootBot> ACTION received:  sinzui to investigate failure on send-person-notifications and reply to the list with his findings
 107 <matsubara> btw, updatebranches script also failed recently but that's been fixed by spm. the new rollout changed the script name and losas updated the notification thing to recognize the new name
 108 <matsubara> on the critical bugs side
 109 <rockstar> matsubara, updatebranches no longer runs.
 110 <matsubara> we have 3 critical bugs
 111 <mthaddon> matsubara: er, not quite
 112 <rockstar> It's been replaced by scan_branches
 113 <mthaddon> matsubara: we've had to revert it a bunch of times
 114 <matsubara> mthaddon, hmm no? spm's email seems to indicate that
 115 <mthaddon> matsubara: spm went to bed a while ago - a new problem was discovered since then
 116 <mthaddon> matsubara: abentley and Chex have been working on it
 117 <matsubara> oh, I was looking at this latest email to the list replying to one of the script failures notification
 118 <matsubara> well, if they're already working on it, it's ok. :-)
 119 <mthaddon> matsubara: not really...
 120 <matsubara> mthaddon, no? what else is expected?
 121 <mthaddon> matsubara: as I understand it, we've reverted to the old script because we still don't know what was wrong
 122 <mthaddon> matsubara: and the fact that we've reverted between the old and new scripts twice now on production is a problem in itself
 123 * salgado (n=salgado@canonical/launchpad/salgado) has joined #launchpad-meeting
 124 <matsubara> mthaddon, I meant it's ok in the sense that people are already working on a solution and there's nothing much to be done during this meeting to have people act on it
 125 <mthaddon> matsubara: and also the fact that the first we heard about the problem was from a user report
 126 <mthaddon> i.e. we don't have a good measure of when this problem is even happening
 127 <mthaddon> matsubara: maybe not, but I'd like a bit of discussion about this class of problem and what can be done to prevent it in the future
 128 <danilo_> mthaddon: what is the exact problem that we need to be able to track? (sorry, I am not fully up to date on what broke)
 129 <mthaddon> danilo_: aiui email notifications of branch updates failed to be sent out
 130 <gary_poster> mthaddon: "reverted ...twice...on production": I think we all agree this sucks.  However, AIUI, this was successfully QAd.  Either the QA was bad, or staging is not close enough to prod in some way.  I don't think we know yet.
 131 <matsubara> mthaddon, I'm unaware of the details as well. My expectation is that an IncidentLog will be filed and action to prevent it will be included in the incident log
 132 <danilo_> mthaddon: ah, right, that could have a bigger impact (it might be harming us in translations as well)
 133 <mthaddon> matsubara: this doesn't really qualify as an incident log item since there's no measurable service that's been interrupted (we don't have any kind of nagios monitoring of this) - I guess I'm asking how we plan to approach it from here
 134 <mthaddon> and how we got into this situation
 135 <danilo_> mthaddon, gary_poster, matsubara: we are obviously missing a dedicated "communications person" for this specific item (someone to keep the entire situation in check); we've discussed that approach before, it'd be nice to find someone who can offload the communication side from abentley and others working on it
 136 <gary_poster> danilo_: to the degree there's a failure there (communications), it'd probably be mine as RM
 137 <gary_poster> maybe we can have somebody else too
 138 <gary_poster> but that's RM stuff
 139 <gary_poster> but AIUI that's not the prob
 140 <danilo_> gary_poster, not necessarily, we discussed this in a TL call a few weeks (months?) back where we need someone to communicate with everyone
 141 <gary_poster> maybe so
 142 <gary_poster> but probs I see:
 143 <danilo_> gary_poster, it's mostly about having someone take responsibility for making sure problems are visible and we know what's going on
 144 <gary_poster> - we didn't catch this on staging.  Why?
 145 <gary_poster> either QA was bad or staging is too diff
 146 <gary_poster> we need to know why
 147 <gary_poster> and fix it
 148 <mthaddon> yep, I agree with that
 149 <gary_poster> then also, unless I misunderstand, mthaddon is saying that we don't have an automated nagios-like process verifying basic success on production for this thing
 150 <danilo_> gary_poster, neither of those is easy to fix (one depends on people always DTRT, another on machines always DTRT), so we need to be able to easily find out when it's broken rather than wait for users to report it
 151 <gary_poster> danilo_: but doesn't that depend on one of the three things I said?  (people DTRT, machines DTRT, nagios-like-thing DTRT)
 152 <danilo_> gary_poster, it does, I was typing before you typed the last one :)
 153 <mthaddon> gary_poster: it's possible we can't do that for *everything*, but if we decide this is a sufficiently important thing that we care about it if it fails, it sounds like we need to monitor it somehow, yeah (possibly we are already with OOPSes, but why didn't we catch it til a user told us about it?)
 154 <gary_poster> :-) ok
 155 <danilo_> gary_poster, the 4th is lack of coordination and communication :)
 156 * salgado-lunch has quit (Read error: 110 (Connection timed out))
 157 <gary_poster> mthaddon: right. For me, this gets to my "too many different kinds of moving parts" in our architecture. If we have fewer moving parts then we can institute more uniform nagios-like-checks.
 158 <gary_poster> maybe the jobs system can help with this
 159 <danilo_> anyway, gary_poster, I think we should just raise the importance of ensuring sufficient monitoring of this part of code-hosting by thumper, and we can be done with the topic
 160 <gary_poster> maybe we can architect the jobs system to give us a nagios-like hook
 161 <danilo_> gary_poster, we don't have to solve the problem here :)
 162 <gary_poster> because doing it with cron scripts is a one-per job
 163 <matsubara> danilo_, can you raise the topic in the next TL meeting?
 164 <gary_poster> danilo_: ack.  I kind of disagree with your summary though, and your action item, so that's why I'm continuing to blather :-)
 165 <danilo_> matsubara, we are having a week long TL meeting next week, so it'd be best to action it for someone from code team to pass it on to thumper, imho :)
 166 <gary_poster> (IOW, this is not a problem for thumper, it is a problem for Björn, team leads, etc.)
 167 <danilo_> gary_poster, well, sure, I agree, but one step at a time
 168 <gary_poster> matsubara: two action items: :-)
 169 <matsubara> [action] rockstar to raise the importance of ensuring sufficient monitoring of this part (i.e. branch updates emails failing to be delivered) of code-hosting by thumper
 170 <MootBot> ACTION received:  rockstar to raise the importance of ensuring sufficient monitoring of this part (i.e. branch updates emails failing to be delivered) of code-hosting by thumper
 171 <danilo_> gary_poster, there's immediate problem and then there's the elegant solution; I'm always for fixing the immediate problem first and having the elegant solution come out of that
 172 <gary_poster> yeah, that's number one
 173 <gary_poster> number two is gary to bring up architecture concerns to team lead mtg :-)
 174 <danilo_> gary_poster, as for the other one, I think it ties in well with what we discussed today and what we'll want to discuss anyway
 175 <matsubara> [action] TLs + Bjorn to talk about "too many different kinds of moving parts" in our architecture. If we have fewer moving parts then we can institute more uniform nagios-like-checks.
 176 <MootBot> ACTION received:  TLs + Bjorn to talk about "too many different kinds of moving parts" in our architecture. If we have fewer moving parts then we can institute more uniform nagios-like-checks.
 177 <matsubara> does that summarize it well?
 178 <gary_poster> yeah thank you.  though it's probably my action, since I'm the one with the bee in my bonnet :-)  but that's fine
 179 <danilo_> gary_poster, matsubara: I don't like action items like that because they put no responsibility on anyone in particular, thus meaning that if they get done, they get done unrelated to the action item; thus, you don't really need it
 180 <gary_poster> so give it to me :-)
 181 <matsubara> danilo_, I'll add it to gary's queue when I add the summary to the MeetingAgenda page
 182 <danilo_> gary_poster, heh, that's ok, I am certain we would have discussed this regardless of us having any particular action item
 183 <danilo_> matsubara, sure, thanks
 184 <gary_poster> :-)
 185 <matsubara> it serves as a reminder as well
 186 <matsubara> anyway, thanks for the comments
 187 <rockstar> In fairness, the "not getting branch update emails" thing was because a rather large part of the code hosting system was made into a job.
 188 <gary_poster> To whom are you being fair? :-)
 189 <gary_poster> Never mind, I'll be quiet :-)
 190 <danilo_> :)
 191 <matsubara> we have 3 critical bugs, one in progress, one fix committed
 192 <rockstar> I'm not sure how "sufficient monitoring" would have fixed this.
 193 <matsubara> the other one is triaged, bug 511567
 194 <ubottu> Launchpad bug 511567 in launchpad-foundations "Can't remove authorised app" [Critical,Triaged] https://launchpad.net/bugs/511567
 195 <rockstar> gary_poster, to the code team in general.
 196 <matsubara> hmm
 197 <matsubara> that's a dupe
 198 <danilo_> rockstar, sufficient monitoring of scripts that do this
 199 <matsubara> and I filed that bug a few days ago
 200 <rockstar> danilo_, howso?
 201 <matsubara> or maybe I filed the dupe
 202 <gary_poster> rockstar: ah, gotcha.  Tim can beat us into shape at the TL sprint so we understand.
 203 <rockstar> gary_poster, yeah, I'll talk to him.
 204 <gary_poster> cool
 205 <danilo_> rockstar, monitoring should have caught the problem (i.e. "hey, this script is failing"); I won't pretend to understand the entire problem, so we might be entirely off base, but we should be able to check our service level
 206 <rockstar> danilo_, there wasn't a script failing.
 207 <rockstar> It ran fine, it was just a new script that had apparently left out some old functionality.
 208 <danilo_> rockstar, right, never mind the "implementation details", the problem is: "why we didn't catch it before someone told us it's failing"; there's not necessarily a technical solution
 209 <danilo_> matsubara, am I still on the channel?
 210 <matsubara> yes
 211 <danilo_> oh, ok, it's just everybody being quiet :)
 212 <danilo_> matsubara, I think we should go on
 213 <matsubara> sorry, I was looking for a bug report to dupe against 511567
 214 <matsubara> anyway
 215 <matsubara> thanks
 216 <matsubara> [TOPIC] * Operations report (mthaddon/Chex/spm/mbarnett)
 217 <MootBot> New Topic:  * Operations report (mthaddon/Chex/spm/mbarnett)
 218 <matsubara> hello?
 219 <matsubara> Chex, mbarnett ?
 220 <mbarnett> sorry
 221 <Chex> sorry
 222 <Chex> here is the report
 223 <Chex> - LP rollout 10.01 Wednesday was successful:
 224 <Chex>     : See https://wiki.canonical.com/InformationInfrastructure/OSA/LPRollout20100127 for more details.
 225 <Chex>     : The read-only switch left idle connections to the master DB, it is currently being investigated
 226 <Chex> - New LP Appserver is online, some issues with internal access, but now everything is OK.
 227 <Chex> - New branch-scanner having issues, just reverted back to old again.  Based on meeting discussion here,
 228 <Chex>         continuing to address.
 229 <Chex> and that's all for us.  Any questions/comments?
 230 <matsubara> Chex, what's this new LP appserver online? I guess I'll have to tell oops-tools about oops reports from it?
 231 <matsubara> [action] matsubara to update oops-tools to know about the new lp appserver
 232 <MootBot> ACTION received:  matsubara to update oops-tools to know about the new lp appserver
 233 <noodles775> Chex: do you know if the new servers have access to the private librarian?
 234 <mbarnett> matsubara: soybean was recently put online as a replacement for gangotri +
 235 <mbarnett> noodles775: that was resolved earlier today
 236 <matsubara> mbarnett, oh, so it's using the same config files?
 237 <noodles775> A user was seeing about 1 in 4 requests to download a... ah, great, thanks!
 238 <mbarnett> matsubara: it took over lpnet1, lpnet2, and edge1 from gangotri, stole lpnet9 from gandwana, and added a sparkly new lpnet15 standard lpnet appserver
 239 <matsubara> mbarnett, ok, it's the new lpnet15 instance I care about. I'll check the configs and update oops-tools accordingly
 240 <matsubara> thanks
 241 <matsubara> moving on
 242 <mbarnett> matsubara: thank you.
 243 <matsubara> [TOPIC] * DBA report (stub)
 244 <MootBot> New Topic:  * DBA report (stub)
 245 <matsubara> stub sent the report to the list
 246 <matsubara> allenap, he mentioned something about checkwatches being very cpu intensive. it's probably of interest to the Bugs team
 247 <allenap> mars: deryck has just forwarded the message to me.
 248 <allenap> matsubara: ^
 249 <matsubara> thanks allenap 
 250 <matsubara> [TOPIC] * Proposed items
 251 <MootBot> New Topic:  * Proposed items
 252 <matsubara> no proposed items
 253 <matsubara> which brings this meeting to a close
 254 <matsubara> Thank you all for attending this week's Launchpad Production Meeting. See https://dev.launchpad.net/MeetingAgenda for the logs. 
 255 <matsubara> and sorry for the delay
 256 <matsubara> #endmeeting 
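
Note on the monitoring discussion above: a minimal sketch, in Python, of the kind of nagios-like check gary_poster and mthaddon describe. It assumes something not stated in the meeting, namely that the time a job last produced its expected output (for example, the last branch-update email sent) is recorded somewhere. The function name check_freshness and the thresholds are hypothetical; only the exit codes follow the usual Nagios plugin convention. It checks the job's output rather than the script's exit status, since rockstar points out that the new script ran fine and simply left out old functionality.

```python
from datetime import datetime, timedelta, timezone

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_freshness(last_success, now, warn_after, crit_after):
    """Return (exit_code, message) based on how old the last result is.

    last_success: datetime of the job's last observed output, or None
                  if nothing has been recorded (hypothetical data source).
    warn_after / crit_after: timedelta thresholds, chosen per job.
    """
    if last_success is None:
        return UNKNOWN, "UNKNOWN: no successful result recorded"
    age = now - last_success
    if age >= crit_after:
        return CRITICAL, f"CRITICAL: last result {age} ago"
    if age >= warn_after:
        return WARNING, f"WARNING: last result {age} ago"
    return OK, f"OK: last result {age} ago"
```

A wrapper script would print the message and exit with the returned code, so one check of this shape could be reused per job instead of writing a separate check for each cron script.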

DevelopmentMeeting20100128 (last edited 2010-01-29 11:12:38 by matsubara)