Nav-dev July 2008

nav-dev@lister.sikt.no

2 participants
8 discussions

Status meeting minutes, 28 July 2008
by Morten Brekkevold 28 Jul '08

Status meeting minutes, 28 July 2008
====================================

Morten
------
* Wrote a simple script to check SQL scripts when they are updated in the default branch.
  * Thomas proposed a different solution, using buildbot, which looks a lot better :-)
* Testing latest netmap changes, alertprofiles and alertengine, feedback to developers.
* Cleaned up and merged db-namespaces branch on default.
* Resumed development on ipdevpoll.
  * Looking at Twisted's adbapi (asynchronous db api)
  * Looking at how Twisted can be used for scheduling jobs.
  * Looking at strategies for repeated loading of changed model data from database into cache.
* Teknobyen-vk's old hardware is ready for reuse as new navdevdb, needs Debian reinstall.

Thomas
------
* Useradmin as good as done. Userinfo merged into the new tool. (need to test)
* Been looking at buildbot.
* Fixing loose ends in code produced so far this summer.
* testmaker is a possibly interesting tool for us.
  * Uses web logs to record replayable unit tests for web interfaces.
  * Still in development.

Jørgen
------
* Been on vacation.
* Overloading report config with local reports now works
  * Testing new code.
* Working on caching results of report SQL queries in user sessions
  * This requested feature will simplify the sum functions Jørgen is
* Shelve extension disappeared with Mercurial upgrade, which caused problems.

Magnus
------
* System for storing messages in NAV sessions now works.
* MySMS merged into new alertprofiles.
* Fixed permissions bugs and some usability fixes.
* Been on two days vacation.
* Feedback: Test database migration paths from 3.4 for alert profiles.

Kristian
--------
* Still working on Netmap and Network Explorer.
  * Only admins can save netmap layout.
* LP migration news: Imports from SF.net to LP's test box are now complete.
  * Wrote script to report NAV bugs to LP from console.
* Fixed DEBUG-setting in Django and 500.html-page.

--
Best regards,
Morten Brekkevold
UNINETT

Status meeting minutes, 21 July 2008
by Morten Brekkevold 28 Jul '08

Seems I forgot to send this out last week, here it goes:

Status meeting minutes, 21 July 2008
====================================

Morten
------
* Released 3.4.1 on Tuesday.
* Researched some more of NTNU's previously reported problems, found one of them to be another result of the HP ifindex mangling.
* Posted about cherrypicking vs. stable/unstable branches
  * Found a simpler way to retain a single head in the series repositories.
* Found some manage.sql bugs while testing Werner's latest package.
  * Looks like a 3.4.2 is imminent.
  * Should look into automatic testing of sql scripts on commits.
* Been trying out Mercurial's API (Python) to log stats for repositories found locally at navdev.

Magnus
------
* Everything's in place in Alert Profiles.
  * Authorization checks added in several places.
  * Messages system for web sessions.

Thomas
------
* Fixed quickselect bugs.
* Jabber alerts seem to be working fine.
  * Making sure we have robustness; plugins should not be able to bring down the alert daemon.
* Useradmin replacement well on the way.
  * ETA today or tomorrow.
  * Delete restriction for uids/gids < 1000.
* Django Account models have checkPassword and setPassword methods now (copied from forgetSQL "models").

Kristian
--------
* Network explorer
  * Trying to reduce number of small requests
  * Big requests may appear slow, causing Firefox to issue Javascript warnings.
* Launchpad migration:
  * Delayed: Launchpad released a new version last week, which held things up.
  * Test import of NAV's sf.net data mostly successful.
* Working on "save layout" feature for Netmap.

Discussion
----------
Adamcik: Django 1.0 possibly in September. Will NAV 3.5 come out before or after this time?

Feedback: As long as Django 1.0 is backwards compatible with the SVN revision we currently use, everything should be in order.

Assigned tasks: Kristian will make a 500.html template, and add new options in nav.conf to control Django's DEBUG and TIMEZONE settings.

--
Best regards,
Morten Brekkevold
UNINETT

Bugfixing on stable vs. unstable branches
by Morten Brekkevold 18 Jul '08

The recent release of 3.4.1 gave me some more practical experience with using Mercurial for cherrypicking bugfixes, and I've found I want to do things slightly differently than we do today.

Cherrypicking changesets is done with the transplant extension. I've never been 100% comfortable with this, as I feel it circumvents Mercurial's fine merging capabilities. It gives me much the same feeling as merging bugfixes with Subversion did, the only improvement being that each changeset is transplanted separately instead of being mashed together in one big changeset.

With transplanted changesets in the series-branches, we cannot easily compare the contents of the series-branches with the default branch, using commands such as 'hg in' or 'hg out'.

Really, the only reason I have to do cherrypicking is because we fix bugs on the unstable default branch. I can't very well pull the default branch into any of the series-repositories, as they're supposed to be stable, so I have to pick single bugfix changesets from the default branch and transplant them.

If we look at things the other way around, anything that goes into the stable series-branches should also go into the default branch (except for version bumping and tags, see my side note below). If bugfixes are committed to the latest series-branch instead of the default branch, the series-branch can easily be pulled and merged into the default branch, and no transplants need to take place.

Conclusion
----------
Starting from series/3.5.x (whenever it is branched from default), I would like everyone to commit bugfixes for stable code directly to this branch. I'm the only one with push access to these branches on MetaNAV, since I'm the only one working as a release technician at the moment, so I will need to know when and where to pull your bugfixes from.

The series/3.4.x branch is already a mix of directly committed bugfixes and transplanted bugfixes, which complicates committing bugfixes to this branch (you need to check out the latest bugfix revision and commit your changes as a child node to that changeset, then merge the whole thing with the last head). This is why I won't start requiring this method of bugfixing until 3.5.x is on its way and this can be done more consistently.

Side note
---------
For those of you who haven't looked at the series-repositories, please be aware of the specifics of how I tag a new release version on these. I'll use the 3.4.x branch as my example.

When I'm satisfied that the latest changeset (A) on the 3.4.x branch is ready to become the 3.4.1 release, I change the version number in the file VERSION to 3.4.1 and commit it (changeset B). I then tag changeset B as "3.4.1". The tag changeset itself becomes changeset C.

When committing bugfixes and other changes for a later 3.4.2 version, I check out changeset A again, and base my changes on this. This is because I don't want the VERSION file in the following changesets to say "3.4.1", but rather "3.4.x_devel", as it did in changeset A. Also, I don't want to pull the changesets that changed the VERSION file into the default branch at a later time.

This may seem a bit contrived, and it leads to having multiple heads in any series-branch, but so far I haven't thought of a more useful way of updating the VERSION file without changing it back and forth all the time.

--
Best regards,
Morten Brekkevold
UNINETT

Status meeting minutes, 15 July 2008
by Morten Brekkevold 16 Jul '08

Status meeting minutes, 15 July 2008
====================================

Morten
------
* Mailing list activity.
* Several bugfixes on series/3.4.x.
* Transplanted Magnus' alertprofiles (php) fix branch onto series/3.4.x w/simple tests and merge.
* 3.4.1 candidate in testing, should be released later today.

Kristian
--------
* Still waiting for Launchpad admins to migrate NAV project from SourceForge to Launchpad.
  * Launchpad admins will run a test import to demo.launchpad.net within the next couple of days.
  * Unknown users from SourceForge data will be mapped to inactive users on Launchpad, which can be activated later.
* Working on Netmap improvements
  * Basic support for saving layouts is ready.

Thomas
------
* Found and fixed several small bugs in time for 3.4.1 release:
  * PostgreSQL 8.3 stricter typecasting
  * Errors in web pages.
    * Idea for system to automatically crawl the dev server web pages and report 404/500 errors.
* Quickselect, the Treeselect replacement, is finished.
* Created a branch for some generic CSS/HTML-fixes, pushable to default branch.
* Found an improved and currently maintained fork of the Jabber library used in the AlertEngine Jabber plugin.
  * Trying to test what happens when a Jabber host goes down.

Magnus
------
* Still working on alertprofiles in Django.
  * Most functions finished.
  * Working on profile creation.
  * Need to do code cleanup and make more user friendly interface.

Jørgen, Vidar and John-Magne are currently on vacation.

--
Best regards,
Morten Brekkevold
UNINETT

Speeding up netbox delete / cam table optimizations
by Morten Brekkevold 14 Jul '08

Hi everyone, long post alert!

As Vidar wrote in the minutes from a meeting a few weeks back, I've been looking into optimizing netbox deletes. It is an old and recurring complaint that deleting IP devices from Edit DB can take quite some time, meaning that the web interface blocks, waiting for the results of a PostgreSQL DELETE statement. We're talking up to 20-40 minutes in the worst cases here!

We haven't really prioritized these complaints, as IP device deletion isn't considered a frequent or critical task. I've looked into this from time to time, and just before I went on vacation I tested a few schema changes and did some timings. I'll try to summarize my findings here, if my memory allows; vacations tend to wipe this sort of thing from one's memory, and I'll have to rely on my notes ;-)

Why are deletes to the netbox table so slow?
--------------------------------------------
The netbox table is central to most things in the manage schema, which means that there are many foreign keys referring to this table. A simple DELETE statement on the netbox table will cause many cascading updates and deletes.

A switch that has been monitored by NAV for a long time will tend to have tens of thousands, if not hundreds of thousands, of referring rows in the cam table. This is an extreme number of referring rows compared to other tables, such as module and alerthist. All these rows will need to be touched as a netbox is deleted.

The cam table is a log table, which retains historic information. When a cam row is created, the sysname, module and interface name of the referred-to netbox are copied into the cam table for historic reference. When a netbox is deleted, its referring cam rows are not deleted: instead, the foreign key referring to the netbox table will be set to NULL.

I believe this is a crucial point. When a netbox delete cascades to cam, PostgreSQL cannot simply mark thousands of rows as deleted - instead it needs to update thousands of rows, which under PostgreSQL's MVCC model basically means that thousands of new rows need to be written to replace the old ones, while the thousands of old rows are expired. This smells of slowness, and is why my focus has been on the cam table.

The arp table has a similar problem, but it will be less pronounced, as it contains far fewer rows, related to routers.

Analysis of cam
---------------
The first obvious observation about the cam table is that it isn't normalized very well. A netbox with 100,000 referring rows in cam will have its sysname repeated 100,000 times in the table. This is due to the historic/log nature of cam. The sheer size of the table could be greatly reduced, though, by normalizing the sysname column. Much the same reasoning could be applied to the port column.

While a box may have 100,000 referring rows in the cam table, only a fraction of those rows will be active rows, i.e. rows whose end_time is set to infinity. When the box is deleted, PostgreSQL will have to set netboxid=NULL for 100,000 cam rows, and set end_time=NOW() for a fraction of those (50 rows wouldn't be an unrealistic number). If the relationship between the netbox and cam tables were stored in a separate table (i.e. a table referring to netboxid and camid), PostgreSQL would only need to expire 100,000 rows of this new, smaller table, and update 50 rows in the cam table with end_time=NOW().

Finally, the cam table has a unique constraint on the combination of the columns netboxid, sysname, module, port, mac and start_time. Any update to the netboxid column in cam will also cause this constraint's index to be updated. So 100,000 index entries are also updated.

On another note, this index will become inordinately large. In fact, in the databases I've examined, the index takes up many more pages on disk than the cam table itself! Although this observation really has nothing to do with updates per se, this will affect index lookups greatly. The index is mainly used to verify that updates and inserts do not create duplicate cam rows, but PostgreSQL will also use it for lookups for statements that refer to the netboxid field (Machine Tracker, IP Info Center and possibly Arnold). An index should ideally fit in memory, but this index will no more fit in memory than the cam table itself.

If this constraint is even necessary, we should consider redefining the index to use hashtext(sysname) and hashtext(port) to reduce its size. The distribution of sysname and port values is concentrated around a few distinct values, considering their repetitive nature, so hash collisions for these columns are very unlikely.

(un)Real world numbers
----------------------
For my timing tests, I duplicated our production database on my workstation, and ran some tests using netbox A as my guinea pig. The cam table contains ~12M rows; netbox A has ~750K referring cam rows, 427 of which are active rows.

I proceeded to delete netbox A from the database. This took ~43 minutes (sic) on my workstation.

Starting from scratch again, I created a new table, netbox_cam, with two columns, netboxid and camid - foreign keys referring to the netbox and cam tables respectively. I populated this table with netboxid,camid pairs from the cam table, and subsequently dropped the netboxid column from cam.

I then proceeded to delete netbox A again. This time the delete only took ~10 minutes.

Although both 43 and 10 minutes are unacceptable times in real-world usage, the relative performance increase is the interesting part here. In the second scenario, PostgreSQL only needed to update 427 cam rows and expire ~750K netbox_cam rows.

I also have some numbers on the aforementioned unique constraint's index. The production cam table took up 259298 pages on disk. Said index took up 340929 pages! I replaced the constraint and index with a unique index using the hashtext function on the sysname, module, port and mac columns. The resulting index was only 67962 pages.

Other speed-up strategies
-------------------------
Although some schema changes are obviously in order, deleting an IP device from EditDB may still be a time-consuming process. Another obvious strategy to speed up the web interface response times for this operation is to run deletes on the netbox table as a background process.

Deleting an IP device in EditDB could post netbox delete events on the event queue, producing an immediate response to the end user. The response would explain to the user that the devices are being deleted in the background and may still be visible for some minutes. The event engine (or possibly some other process) would pick up the delete events and run the actual DELETE statements in PostgreSQL.

--
Best regards,
Morten Brekkevold
UNINETT
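As an illustration of the netbox_cam idea from the post above, here is a minimal sketch. It uses Python's sqlite3 module as a stand-in for PostgreSQL, so it says nothing about real timings or MVCC behaviour. The table and column names (netbox, cam, netbox_cam, netboxid, camid, end_time) come from the post; the heavily trimmed schema, the ON DELETE CASCADE clauses and the helper names build_demo_db and delete_netbox are assumptions made for the example, not NAV code.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when asked to.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE netbox (
            netboxid INTEGER PRIMARY KEY,
            sysname  TEXT NOT NULL
        );
        -- cam keeps its historic copy of sysname/port, but no netboxid column.
        CREATE TABLE cam (
            camid    INTEGER PRIMARY KEY,
            sysname  TEXT NOT NULL,
            port     TEXT NOT NULL,
            mac      TEXT NOT NULL,
            end_time TEXT NOT NULL DEFAULT 'infinity'
        );
        -- Narrow link table: the only rows that must go when a netbox goes.
        CREATE TABLE netbox_cam (
            netboxid INTEGER NOT NULL REFERENCES netbox(netboxid) ON DELETE CASCADE,
            camid    INTEGER NOT NULL REFERENCES cam(camid) ON DELETE CASCADE
        );
        """
    )
    return conn


def delete_netbox(conn, netboxid):
    """Close the netbox's active cam rows, then delete the netbox.

    Returns the number of cam rows that had to be updated.
    """
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            """
            UPDATE cam SET end_time = datetime('now')
            WHERE end_time = 'infinity'
              AND camid IN (SELECT camid FROM netbox_cam WHERE netboxid = ?)
            """,
            (netboxid,),
        )
        updated = cur.rowcount
        # The cascade removes the netbox_cam rows; historic cam rows stay put.
        conn.execute("DELETE FROM netbox WHERE netboxid = ?", (netboxid,))
    return updated
```

With this layout only the active cam rows are rewritten on delete, while the bulk of the work is removing narrow rows from the link table, which is the effect the timings in the post point at.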

UnicodeDecodeErrors in IP Info Center
by Morten Brekkevold 14 Jul '08

As per the attached message from nav-users, I've added SF#2014809 (UnicodeDecodeError in IP Device Info): http://sourceforge.net/support/tracker.php?aid=2014809

The problem stems from the fact that the integration between Django and Cheetah templates will mix unicode and str objects in NAV's Cheetah templates. Django will fill the template with unicode objects, while legacy NAV code will fill it with strings encoded as UTF-8. When Cheetah attempts to join these objects into a single string, using the regular string join method, Python will try to decode the str objects into unicode objects using the ASCII codec. This fails miserably as soon as it hits a str object containing international characters.

Our local quickfix on navdev seems to have been to place a call to sys.setdefaultencoding('utf-8') in sitecustomize.py, to make sure Python uses the utf-8 codec for these operations instead. This must be done in the sitecustomize module, because the site module will remove the function from the sys module's namespace before the actual Python program starts.

We don't want to force users to add these lines to their local sitecustomize module, so we need to find a better fix for this. I propose two different solutions, one for which I've attached a patch.

My patch alters the _cheetah_render helper function in the nav.django.shortcuts module. This function takes a string/unicode object representing a fully rendered Django template and places this in a Cheetah template variable. My patch encodes the entire unicode object from Django as a UTF-8 string, and places this in the Cheetah template.

Another possible solution is to write a Cheetah filter (an example of this can be found under "Encoding with Unicode" on this page: http://wiki.cheetahtemplate.org/cheetah-recipes.html) which makes sure that all values in a Cheetah template are either unicode objects or utf-8 encoded strings.

I've opted for the first fix, as it was smaller and quicker to implement, but if you other Django enthusiasts have comments, I would really like to hear them. I also wouldn't mind a comment from Stein Magnus, who wrote the Django/Cheetah integration in the first place :-)

--
Best regards,
Morten Brekkevold
UNINETT
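The attached patch is not reproduced in this archive. As a rough illustration of the idea behind it (encode everything to UTF-8 before the pieces are joined), here is a small sketch in Python 3 terms, where str plays the role of Python 2's unicode and bytes the role of Python 2's str. The helper names to_utf8 and join_fragments are hypothetical and are not part of nav.django.shortcuts or Cheetah.

```python
def to_utf8(value, encoding="utf-8"):
    """Return value as encoded bytes, whether it arrives as text or bytes."""
    if isinstance(value, bytes):
        # Assumed to be UTF-8 already, like the legacy NAV strings in the post.
        return value
    return str(value).encode(encoding)


def join_fragments(fragments):
    """Join a mix of text and UTF-8 byte fragments into one byte string."""
    return b"".join(to_utf8(fragment) for fragment in fragments)


# A text fragment (as Django would produce) mixed with a byte fragment
# (as legacy code would produce) joins cleanly once both are bytes.
page = join_fragments(["Jørgen ", "Tromsø".encode("utf-8")])
```

Normalizing at the single point where the two worlds meet avoids relying on an interpreter-wide default encoding, which is the drawback of the sitecustomize quickfix described above.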

Status meeting minutes, 8 July 2008
by Morten Brekkevold 09 Jul '08

Status meeting minutes, 8 July 2008
===================================

Morten
------
* Been working mostly on other projects.

Kristian
--------
* Been on vacation, worked last week from Moss.
* Evaluating bugtrackers
  * Posted findings to MetaNAV wiki and nav-dev list (http://metanav.uninett.no/bugtracker)
  * A few alternative suggestions received off-list, but mostly unrealistic ones.
  * Evaluated using these criteria:
    * UI for reporting bugs must be fast and simple to use.
    * Should have good email interface/control capability.
    * RPC interface is a big plus.
  * Main contenders reduced to Trac and Launchpad.
    * Trac missing two-way email interface.
    * Trac must be configured, hosted and maintained locally by our own staff.
    * Launchpad is ready for use now.
    * Migrating from SourceForge is relatively trivial for both tools.
  * Conclusion: Launchpad recommended.

  The meeting votes for Launchpad as the new bug/task/feature-tracking tool. Kristian will commence migration from SourceForge shortly.

* Fixed a few bugs in netmap.
* Working on Network Explorer.

  Feedback: Fix Layer2/layer3 switch and GW/GSW view on first load, as commented in previous meetings. Want to have this in 3.4.1 release.

Thomas
------
* AlertEngine (the new one) code cleanup.
* Fixing exception handling in Jabber alert plugin.
* Finished a treeselect replacement for maintenance tool.
* Looking at IP device history integration.

Magnus
------
* Been at Roskilde.
* Installed PHP4 on navdev
* Tested old Alert Profiles on PHP4, fixing bugs for 3.4.1 release.
* Set up a database using the old Norwegian schema for this bugfix session.
* Followed up on comments from last meeting.

Discussion
----------
Q: Can we put app specific javascript/css files in each subsystem instead of in the webFront subsystem?

A: YES! We don't need more pollution of the webFront file tree.

Kristian suggests looking at Graphite (a Python library) as a possible RRD replacement.

--
Best regards,
Morten Brekkevold
UNINETT

Bugtracker evaluation
by Kristian Klette 03 Jul '08

Hi all!

It's been on the wishlist for many developers and users for some time to switch to a more modern bugtracker than sourceforge.net. I have tested some alternatives and written a small page about my findings on metanav.

Please take a minute to read through it and see if I've missed anything important, be it features that should be tested or other trackers you'd like to see tested as well.

http://metanav.uninett.no/bugtracker

Thanks in advance.

- Kristian Klette

--
Best regards,
Kristian Klette
«Programs for sale: Fast, Reliable, Cheap: choose two.»

