Channel: Planet Plone - Where Developers And Integrators Write

Reinout van Rees: Service monitoring victory

As a company, we're handling quite a lot of customer data. This involves a lot of different software components: databases (postgres, oracle), jdbc, xml-rpc, django website, a windows server, etcetera. And sometimes one of those components falls over, bringing the data import/whatever for one or more customers to a halt.

Waiting for the phone to ring ("Hello, this is customer xyz....") isn't the best way to monitor all this. Doing a manual click-through every morning to check everything by hand isn't fun (and so it isn't guaranteed to happen).

Solution: automatic monitoring. The things we want to monitor are apparently not covered by the standard munin/nagios types of checks (I don't have enough knowledge about that part of our software to know for sure). So a colleague is working on a dashboard in Django. External checks write their data to the django database and django shows it, basically. It keeps historical records (much like munin/nagios does).
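To make the "external checks write their data, the dashboard shows it" idea concrete, here is a minimal sketch. It is not our colleague's Django code: it uses Python's built-in sqlite3 as a stand-in for the Django database, and the table name `check_result` and the functions `open_store`, `record_result` and `latest_status` are made up for illustration. Every check result is kept, so the history stays available, and the dashboard only needs the most recent row per customer and component.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the store that external checks write to.

    Hypothetical stand-in for the Django database.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS check_result ("
        " customer TEXT NOT NULL,"
        " component TEXT NOT NULL,"
        " is_up INTEGER NOT NULL,"
        " checked_at REAL NOT NULL)"
    )
    return conn


def record_result(conn, customer, component, is_up, checked_at=None):
    """Append one check result; older rows are kept as history."""
    if checked_at is None:
        checked_at = time.time()
    conn.execute(
        "INSERT INTO check_result (customer, component, is_up, checked_at)"
        " VALUES (?, ?, ?, ?)",
        (customer, component, int(is_up), checked_at),
    )
    conn.commit()


def latest_status(conn):
    """Return {(customer, component): bool} using the newest row of each pair."""
    rows = conn.execute(
        "SELECT r.customer, r.component, r.is_up FROM check_result r"
        " WHERE r.checked_at = ("
        "   SELECT MAX(checked_at) FROM check_result"
        "   WHERE customer = r.customer AND component = r.component)"
    ).fetchall()
    return {(customer, component): bool(up) for customer, component, up in rows}
```

A check script would call `record_result(conn, "xyz", "jdbc", False)` after each run, and the page that displays the state would call `latest_status(conn)`.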

The dashboard isn't finished yet, however, and Mandatory Menial Manual Morning Checking was about to be scheduled for the IT department. Menial manual tasks are something I abhor, so I asked around a bit and found an existing snippet of code that checks the component that is the cause of 99% of the downtime. Three hours of coding later, we've now got a temporary web page that lists whether that component is up for our various customers. It should be easy to integrate into the real dashboard later on.
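For the curious, a page like that can be very small. The sketch below is not the actual code: the real check reuses an existing snippet that I'm not reproducing here, so a plain TCP connection test stands in for it as an assumption. The names `is_reachable`, `check_all` and `render_status_page`, and the customer-to-endpoint mapping they take, are hypothetical.

```python
import html
import socket


def is_reachable(host, port, timeout=3.0):
    """Assumed stand-in check: can we open a TCP connection to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(endpoints, probe=is_reachable):
    """Run the probe for every customer.

    endpoints maps a customer name to a (host, port) tuple.
    Returns {customer name: True/False}.
    """
    return {name: probe(host, port) for name, (host, port) in endpoints.items()}


def render_status_page(statuses):
    """Render {customer name: bool} as a bare-bones HTML table."""
    rows = "\n".join(
        "<tr><td>{}</td><td>{}</td></tr>".format(
            html.escape(name), "up" if up else "DOWN"
        )
        for name, up in sorted(statuses.items())
    )
    return "<html><body><table>\n{}\n</table></body></html>".format(rows)
```

Calling `render_status_page(check_all({"customer xyz": ("db.example.org", 1521)}))` gives the HTML for one customer; the host and port there are placeholders. Passing the probe in as a parameter means the real check can replace the TCP test without touching the rendering.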

Note that getting someone to code such an automatic check was exactly the purpose of asking the IT department to do the manual monitoring. The trick worked :-)

This morning we had our first small victory: a colleague looked at the page and noticed three JDBC couplings were down. We restarted them and got them back on-line.

Like I said in my Hudson continuous integration article, we've got a laptop + big monitor in our IT room with our hudson on continuous display. I've now opened a second window with the temporary JDBC tester page so that we can't possibly overlook another downtime :-)

[Image: simple monitoring page]
