Hi everyone, long post alert!
As Vidar wrote in the minutes from a meeting a few weeks back, I've been looking into optimizing netbox deletes. It is an old and recurring complaint that deleting IP devices from Edit DB can take quite some time, meaning that the web interface blocks, waiting for the results of a PostgreSQL DELETE statement. We're talking up to 20-40 minutes in the worst cases here! We haven't really prioritized these complaints, as IP device deletion isn't considered a frequent nor critical task.
I've looked into this from time to time, and just before I went on vacation I tested a few schema changes and did some timings. I'll try to summarize my findings here, if my memory allows; vacations tend to wipe these sort of things from ones memory, and I'll have to rely on my notes ;-)
Why are deletes to the netbox table so slow? -------------------------------------------- The netbox table is central to most things in the manage schema, which means that there are many foreign keys referring to this table. A simple DELETE statement on the netbox table will cause many cascading updates and deletes.
A switch that's been monitoried by NAV for a long time, will tend to have tens of thousands, if not hundreds of thousands of referring rows in the cam table. This is an extreme amount of referring rows compared to other tables, such as module, alerthist and such. All these rows will need to be touched as a netbox is deleted.
The cam table is a log table, which retains historic information. When a cam row is created, the sysname, module and interface name of the referred-to netbox is copied into the cam table for historic reference. When a netbox is deleted, its referring cam rows are not deleted: Instead, the foreign key referring to the netbox table will be set to NULL.
I believe this is a crucial point. When a netbox delete cascades to cam, PostgreSQL cannot simply mark thousands of rows as deleted - instead it needs to update thousands of rows, which under PostgreSQL's MVCC model basically means that thousands of new rows need to be written to replace the old ones, while the thousands of old rows are expired.
This smells of slowness, and is why my focus has been on the cam table. The arp table has a similar problem, but it will be less pronounced, as it contains much fewer rows, related to routers.
Analysis of cam --------------- The first obvious observation about the cam table is that it isn't normalized very well. A netbox with 100.000 referring rows in cam will have it's sysname repeated 100.000 times in the table. This is due to the historic/log nature of cam. The shear size of the table could be greatly reduced, though, by normalizing the sysname column. Much the same reasoning could be applied to the port column.
While a box may have 100.000 referring rows in the cam table, only a fraction of those rows will be active rows, i.e. the end_time is set to infinity. When the box is deleted, PostgreSQL will have to set netboxid=NULL for 100.000 cam rows, and set end_time=NOW() for a fraction of those (50 rows wouldn't be an unreal number).
If the relationship between netbox and cam tables were stored in a separate table (i.e. a table referring to netboxid and camid), PostgreSQL would only need to expire 100.000 rows of this new, smaller table, and update 50 rows in the cam table with end_time=NOW().
Finally, the cam table has a unique constraint on the combination of the columns netboxid, sysname, module, port, mac and start_time. Any update to the netboxid column in cam will also cause this constraint's index to be updated. So 100.000 index entries are also updated.
On another note, this index will become inordinately large. In fact, in the databases I've examined, the index itself will take up many more pages on disk than the cam table itself! Although this observation really has nothing to do with updates per se, this will affect index lookups greatly. The index is mainly used to verify that updates and inserts do not create duplicate cam rows, but PostgreSQL will also use it for lookups for statements that refer to the netboxid field (Machine Tracker, IP Info Center and possibly Arnold).
An index should ideally fit in memory, but this index will no more fit in memory than the cam table itself. If this constraint is even necessary, we should consider redefining the index to use hashtext(sysname) and hashtext(port) to reduce its size. The distribution of sysname and port values is concentrated around a few distinct values, considering their repetetive nature, so hash collisions for these columns are very unlikely.
(un)Real world numbers ---------------------- For my timing tests, I duplicated our production database on my workstation, and ran some tests using netbox A as my guinea pig.
The cam table contains ~12M rows, netbox A has ~750K referring cam rows, 427 of which are active rows.
I proceeded to delete netbox A from the database. This took ~43 minutes (sic) on my workstation.
Starting from scratch again, I created a new table, netbox_cam, with two columns, netboxid and camid - foreign keys referring to the netbox and cam tables respectively. I populated this table with netboxid,camid pairs from the cam table, and subsequently dropped the netboxid column from cam.
I then proceeded to delete netbox A again. This time the delete only took ~10 minutes.
Although both 43 and 10 minutes are unacceptable times in real-world usage, the relative performance increase is the interesting part here. In the second scenario, PostgreSQL only needed to update 427 cam rows, and expiring ~750K netbox_cam rows.
I also have some numbers on the aforementioned unique constraint's index. The production cam table took up 259298 pages on disk. Said index took up 340929 pages! I replaced the constraint and index with a unique index using the hashtext function on the sysname, module, port and mac columns. The resulting index was only 67962 pages.
Other speed-up strategies ------------------------- Although some schema changes are obviously in order, deleting an IP device from EditDB may still be a time consuming process. Another obvious strategy to speed up the web interface response times for this operation is to run deletes on the netbox table as a background process.
Deleting an IP device in EditDB could post netbox delete events on the event queue, producing an immediate response to the end user. The response would explain to the user that the devices are being deleted in the background and may still be visible for some minutes. The event engine (or possibly some other process) would pick up the delete events and run the actual DELETE statements in PostgreSQL.