## Template for LP Production Meeting logs. Just paste xchat log below and the format IRC line will take care of formatting correctly
#format IRC
#startmeeting
Meeting started at 10:00. The chair is matsubara.
Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues.
* danilos has quit (Nick collision from services.)
[TOPIC] Roll Call
New Topic: Roll Call
Not on the Launchpad Dev team? Welcome! Come "me" with the rest of us!
me
me
me
me
me
sorry mrjazzcat, I always forget to ping you about the meeting. I'll add you to the "Who should be here?" section if you don't mind
yes, please
on the MeetingAgenda page, I mean
no worries
[action] add brian to the list of attendees in the MeetingAgenda page
ACTION received: add brian to the list of attendees in the MeetingAgenda page
Ursula won't be around today and I'll be standing in for Gary
rockstar, hi, around?
allenap, hi
well, let's move on and then Gavin and Paul can join in later
[TOPIC] Agenda
New Topic: Agenda
* Actions from last meeting
* Oops report & Critical Bugs & Broken scripts
* Operations report (mthaddon/Chex/spm/mbarnett)
* DBA report (stub)
* Proposed items
[TOPIC] * Actions from last meeting
New Topic: * Actions from last meeting
* allenap to dig the master bug of OOPS-1474EA771
* salgado to take a look in the TypeError oopses (OOPS-1479S1000)
* already did that, this is bug 403281, it happened because mthaddon was testing the new read-only switch on staging.
* rockstar to take a look in OOPS-1480CMP1
https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
https://lp-oops.canonical.com/oops.py/?oopsid=1479S1000
Launchpad bug 403281 in launchpad-foundations "public xmlrpc requests broken during read only period" [Undecided,Triaged] https://launchpad.net/bugs/403281
https://lp-oops.canonical.com/oops.py/?oopsid=1480CMP1
ok, so I'll re-add both items for allenap and rockstar
[action] * allenap to dig the master bug of OOPS-1474EA771
https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
ACTION received: * allenap to dig the master bug of OOPS-1474EA771
https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
[action] * rockstar to take a look in OOPS-1480CMP1
https://lp-oops.canonical.com/oops.py/?oopsid=1480CMP1
ACTION received: * rockstar to take a look in OOPS-1480CMP1
https://lp-oops.canonical.com/oops.py/?oopsid=1480CMP1
[TOPIC] * Oops report & Critical Bugs & Broken scripts
New Topic: * Oops report & Critical Bugs & Broken scripts
we have some oops reports but most of them foundations issues
https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA884 Looks like an anonymous user is trying to do some operation which (s)he's not allowed. Should we really log an oops for this?
maybe related to https://bugs.edge.launchpad.net/launchpad-foundations/+bug/271029
More non-informational disconnectionerrors https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489J147
InternalError after the rollout https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489C1094
https://lp-oops.canonical.com/oops.py/?oopsid=1488EA884
https://lp-oops.canonical.com/oops.py/?oopsid=1489J147
Ubuntu bug 271029 in launchpad-foundations "ForbiddenAttribute exception raised changing property of object" [Medium,Triaged]
https://lp-oops.canonical.com/oops.py/?oopsid=1489C1094
code team, BranchMergeProposalExists https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA174
https://lp-oops.canonical.com/oops.py/?oopsid=1488EA174
so, that's it and there's no one from Code to take a look at the BranchMergeProposalExists one
[action] matsubara to email Tim about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA174
https://lp-oops.canonical.com/oops.py/?oopsid=1488EA174
ACTION received: matsubara to email Tim about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA174
https://lp-oops.canonical.com/oops.py/?oopsid=1488EA174
[action] matsubara to talk to leonard about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA884
https://lp-oops.canonical.com/oops.py/?oopsid=1488EA884
ACTION received: matsubara to talk to leonard about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA884
https://lp-oops.canonical.com/oops.py/?oopsid=1488EA884
[action] matsubara to talk to salgado about More non-informational disconnectionerrors https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489J147
https://lp-oops.canonical.com/oops.py/?oopsid=1489J147
ACTION received: matsubara to talk to salgado about More non-informational disconnectionerrors https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489J147
https://lp-oops.canonical.com/oops.py/?oopsid=1489J147
[action] matsubara to talk to stub or gary about InternalError after the rollout
https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489C1094
https://lp-oops.canonical.com/oops.py/?oopsid=1489C1094
ACTION received: matsubara to talk to stub or gary about InternalError after the rollout https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489C1094
https://lp-oops.canonical.com/oops.py/?oopsid=1489C1094
lovely, looks like I'm running the meeting all by myself
heheh
me
me
* noodles775 has quit (Read error: 110 (Connection timed out))
:)
on the broken scripts side
sinzui, Scripts failed to run: loganberry:send-person-notifications seems to be broken
sinzui, could you take a look and reply to the list?
* noodles775 (n=miken@canonical/launchpad/noodles775) has joined #launchpad-meeting
matsubara: all scripts appear to be broken
all?
They are not running and I am tempted to say something new was added that is taking forever and a day
I only see notifications for send-person-notifications and garbo-hourly
sinzui, can you confirm and reply to the list that's the case, at least for the send-person-notifications one?
I'll ask losas and/or stub about garbo-hourly not running as well
matsubara: Re. OOPS-1474EA771, it's bug 508302, and deryck is working on it today.
Launchpad bug 508302 in malone "NotImplementedError OOPS when reporting a bug" [High,In progress] https://launchpad.net/bugs/508302
https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
thanks allenap, I'll adjust the bug link on that oops report
[action] matsubara to fix bug link on OOPS-1474EA771 to point to bug 508302
https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
ACTION received: matsubara to fix bug link on OOPS-1474EA771 to point to bug 508302
https://lp-oops.canonical.com/oops.py/?oopsid=1474EA771
[action] sinzui to investigate failure on send-person-notifications and reply to the list with his findings
ACTION received: sinzui to investigate failure on send-person-notifications and reply to the list with his findings
btw, updatebranches script also failed recently but that's been fixed by spm. the new rollout changed the script name and losas updated the notification thing to recognize the new name
on the critical bugs side
matsubara, updatebranches no longer runs.
we have 3 critical bugs
matsubara: er, not quite
It's been replaced by scan_branches
matsubara: we've had to revert it a bunch of times
mthaddon, hmm no? spm's email seems to indicate that
matsubara: spm went to bed a while ago - a new problem was discovered since then
matsubara: abentley and Chex have been working on it
oh, I was looking at this latest email to the list replying to one of the script failures notification
well, if they're already working on it, it's ok. :-)
matsubara: not really...
mthaddon, no? what else is expected?
matsubara: as I understand it, we've reverted to the old script because we still don't know what was wrong
matsubara: and the fact that we've reverted between the old and new scripts twice now on production is a problem in itself
* salgado (n=salgado@canonical/launchpad/salgado) has joined #launchpad-meeting
mthaddon, I meant it's ok in the sense that people are already working on a solution and there's nothing much to be done during this meeting to have people act on it
matsubara: and also the fact that the first we heard about the problem was from a user report i.e. we don't have a good measure of when this problem is even happening
matsubara: maybe not, but I'd like a bit of discussion about this class of problem and what can be done to prevent it in the future
mthaddon: what is the exact problem that we need to be able to track? (sorry, I am not fully up to date on what broke)
danilo_: aiui email notifications of branch updates failed to be sent out
mthaddon: "reverted ...twice...on production": I think we all agree this sucks. However, AIUI, this was successfully QAd. Either the QA was bad, or staging is not close enough to prod in some way. I don't think we know yet.
mthaddon, I'm unaware of the details as well.
My expectation is that an IncidentLog will be filed and action to prevent it will be included in the incident log
mthaddon: ah, right, that could have a bigger impact (it might be harming us in translations as well)
matsubara: this doesn't really qualify as an incident log item since there's no measurable service that's been interrupted (we don't have any kind of nagios monitoring of this) - I guess I'm asking how we plan to approach it from here and how we got into this situation
mthaddon, gary_poster, matsubara: we are obviously missing a dedicated "communications person" for this specific item (someone to keep the entire situation in check); we've discussed that approach before, it'd be nice to find someone who can offload the communication side from abentley and others working on it
danilo_: to the degree there's a failure there (communications), it'd probably be mine as RM
maybe we can have somebody else too
but that's RM stuff
but AIUI that's not the prob
gary_poster, not necessarily, we discussed this in a TL call a few weeks (months?) back where we need someone to communicate with everyone
maybe so
but probs I see:
gary_poster, it's mostly about having someone take responsibility for making sure problems are visible and we know what's going on
- we didn't catch this on staging. Why? either QA was bad or staging is too diff
we need to know why and fix it
yep, I agree with that then
also, unless I misunderstand, mthaddon is saying that we don't have an automated nagios-like process verifying basic success on production for this thing
gary_poster, neither of those is easy to fix (one depends on people always DTRT, another on machines always DTRT), so we need to be able to easily find out when it's broken rather than wait for users to report it
danilo_: but doesn't that depend on one of the three things I said?
(people DTRT, machines DTRT, nagios-like-thing DTRT)
gary_poster, it does, I was typing before you typed the last one :)
gary_poster: it's possible we can't do that for *everything*, but if we decide this is a sufficiently important thing that we care about it if it fails, it sounds like we need to monitor it somehow, yeah (possibly we are already with OOPSes, but why didn't we catch it til a user told us about it?)
:-) ok
gary_poster, the 4th is lack of coordination and communication :)
* salgado-lunch has quit (Read error: 110 (Connection timed out))
mthaddon: right. For me, this gets to my "too many different kinds of moving parts" in our architecture. If we have fewer moving parts then we can institute more uniform nagios-like-checks.
maybe the jobs system can help with this
anyway, gary_poster, I think we should just raise the importance of ensuring sufficient monitoring of this part of code-hosting by thumper, and we can be done with the topic
maybe we can architect the jobs system to give us a nagios-like hook
gary_poster, we don't have to solve the problem here :)
because doing it with cron scripts is a one-per job
danilo_, can you raise the topic in the next TL meeting?
danilo_: ack. I kind of disagree with your summary though, and your action item, so that's why I'm continuing to blather :-)
matsubara, we are having a week long TL meeting next week, so it'd be best to action it for someone from code team to pass it on to thumper, imho :)
(IOW, this is not a problem for thumper, it is a problem for Björn, team leads, etc.)
gary_poster, well, sure, I agree, but one step at a time
matsubara: two action items: :-)
[action] rockstar to raise the importance of ensuring sufficient monitoring of this part (i.e. branch updates emails failing to be delivered) of code-hosting by thumper
ACTION received: rockstar to raise the importance of ensuring sufficient monitoring of this part (i.e.
branch updates emails failing to be delivered) of code-hosting by thumper
gary_poster, there's immediate problem and then there's the elegant solution; I'm always for fixing the immediate problem first and having the elegant solution come out of that
yeah, that's number one
number two is gary to bring up architecture concerns to team lead mtg :-)
gary_poster, as for the other one, I think it ties in well with what we discussed today and what we'll want to discuss anyway
[action] TLs + Bjorn to talk about "too many different kinds of moving parts" in our architecture. If we have fewer moving parts then we can institute more uniform nagios-like-checks.
ACTION received: TLs + Bjorn to talk about "too many different kinds of moving parts" in our architecture. If we have fewer moving parts then we can institute more uniform nagios-like-checks.
does that summarize it well?
yeah thank you. though it's probably my action, since I'm the one with the bee in my bonnet :-) but that's fine
gary_poster, matsubara: I don't like action items like that because they put no responsibility on anyone in particular, thus meaning that if they get done, they get done unrelated to the action item; thus, you don't really need it
so give it to me :-)
danilo_, I'll add it to gary's queue when I add the summary to the MeetingAgenda page
gary_poster, heh, that's ok, I am certain we would have discussed this regardless of us having any particular action item
matsubara, sure, thanks :-)
it serves as a reminder as well
anyway, thanks for the comments
In fairness, the "not getting branch update emails" thing was because a rather large part of the code hosting system was made into a job.
To whom are you being fair? :-)
Never mind, I'll be quiet :-)
:)
we have 3 critical bugs, one in progress, one fix committed
I'm not sure how "sufficient monitoring" would have fixed this.
the other one is triaged, bug 511567
Launchpad bug 511567 in launchpad-foundations "Can't remove authorised app" [Critical,Triaged] https://launchpad.net/bugs/511567
gary_poster, to the code team in general.
hmm that's a dupe
rockstar, sufficient monitoring of scripts that do this
and I filed that bug a few days ago
danilo_, howso?
or maybe I filed the dupe
rockstar: ah, gotcha. Tim can beat us into shape at the TL sprint so we understand.
gary_poster, yeah, I'll talk to him.
cool
rockstar, monitoring should have caught the problem (i.e. "hey, this script is failing"); I won't pretend to understand the entire problem, so we might be entirely off base, but we should be able to check our service level
danilo_, there wasn't a script failing. It ran fine, it was just a new script that had apparently left out some old functionality.
rockstar, right, never mind the "implementation details", the problem is: "why we didn't catch it before someone told us it's failing"; there's not necessarily a technical solution
matsubara, am I still on the channel?
yes
oh, ok, it's just everybody being quiet :)
matsubara, I think we should go on
sorry, I was looking for a bug report to dupe against 511567
anyway
thanks
[TOPIC] * Operations report (mthaddon/Chex/spm/mbarnett)
New Topic: * Operations report (mthaddon/Chex/spm/mbarnett)
hello?
Chex, mbarnett ?
sorry sorry here is the report
- LP rollout 10.01 Wednesday was successful:
: See https://wiki.canonical.com/InformationInfrastructure/OSA/LPRollout20100127 for more details.
: The read-only switch left idle connections to the master DB, it is currently being investigated
- New LP Appserver is online, some issues with internal access, but now everything is OK.
- New branch-scanner having issues, just reverted back to old again. Based on meeting discussion here, continuing to address.
and that's all for us. Any questions/comments?
Chex, what's this new LP appserver online? I guess I'll have to tell oops-tools about oops reports from it?
[action] matsubara to update oops-tools to know about the new lp appserver
ACTION received: matsubara to update oops-tools to know about the new lp appserver
Chex: do you know if the new servers have access to the private librarian?
matsubara: soybean was recently put online as a replacement for gangotri +
noodles775: that was resolved earlier today
mbarnett, oh, so it's using the same config files?
A user was seeing about 1 in 4 requests to download a... ah, great, thanks!
matsubara: it took over lpnet1, lpnet2, and edge1 from gangotri, stole lpnet9 from gandwana, and added a sparkly new lpnet15 standard lpnet appserver
mbarnett, ok, it's the new lpnet15 instance I care about. I'll check the configs and update oops-tools accordingly
thanks
moving on
matsubara: thank you.
[TOPIC] * DBA report (stub)
New Topic: * DBA report (stub)
stub sent the report to the list
allenap, he mentioned something about checkwatches being very cpu intensive. it's probably of interest of the Bugs team
mars: deryck has just forwarded the message to me.
matsubara: ^
thanks allenap
[TOPIC] * Proposed items
New Topic: * Proposed items
no proposed items
which brings this meeting to a close
Thank you all for attending this week's Launchpad Production Meeting. See https://dev.launchpad.net/MeetingAgenda for the logs.
and sorry for the delay
#endmeeting