Hi everyone, long post alert!
As Vidar wrote in the minutes from a meeting a few weeks back, I've
been looking into optimizing netbox deletes. It is an old and
recurring complaint that deleting IP devices from Edit DB can take
quite some time, meaning that the web interface blocks, waiting for
the results of a PostgreSQL DELETE statement. We're talking up to
20-40 minutes in the worst cases here! We haven't really prioritized
these complaints, as IP device deletion isn't considered a frequent
nor critical task.
I've looked into this from time to time, and just before I went on
vacation I tested a few schema changes and did some timings. I'll try
to summarize my findings here, if my memory allows; vacations tend to
wipe these sort of things from ones memory, and I'll have to rely on
my notes ;-)
Why are deletes to the netbox table so slow?
--------------------------------------------
The netbox table is central to most things in the manage schema, which
means that there are many foreign keys referring to this table. A
simple DELETE statement on the netbox table will cause many cascading
updates and deletes.
A switch that's been monitoried by NAV for a long time, will tend to
have tens of thousands, if not hundreds of thousands of referring rows
in the cam table. This is an extreme amount of referring rows
compared to other tables, such as module, alerthist and such. All
these rows will need to be touched as a netbox is deleted.
The cam table is a log table, which retains historic information.
When a cam row is created, the sysname, module and interface name of
the referred-to netbox is copied into the cam table for historic
reference. When a netbox is deleted, its referring cam rows are not
deleted: Instead, the foreign key referring to the netbox table will
be set to NULL.
I believe this is a crucial point. When a netbox delete cascades to
cam, PostgreSQL cannot simply mark thousands of rows as deleted -
instead it needs to update thousands of rows, which under PostgreSQL's
MVCC model basically means that thousands of new rows need to be
written to replace the old ones, while the thousands of old rows are
expired.
This smells of slowness, and is why my focus has been on the cam
table. The arp table has a similar problem, but it will be less
pronounced, as it contains much fewer rows, related to routers.
Analysis of cam
---------------
The first obvious observation about the cam table is that it isn't
normalized very well. A netbox with 100.000 referring rows in cam
will have it's sysname repeated 100.000 times in the table. This is
due to the historic/log nature of cam. The shear size of the table
could be greatly reduced, though, by normalizing the sysname column.
Much the same reasoning could be applied to the port column.
While a box may have 100.000 referring rows in the cam table, only a
fraction of those rows will be active rows, i.e. the end_time is set
to infinity. When the box is deleted, PostgreSQL will have to set
netboxid=NULL for 100.000 cam rows, and set end_time=NOW() for a
fraction of those (50 rows wouldn't be an unreal number).
If the relationship between netbox and cam tables were stored in a
separate table (i.e. a table referring to netboxid and camid),
PostgreSQL would only need to expire 100.000 rows of this new, smaller
table, and update 50 rows in the cam table with end_time=NOW().
Finally, the cam table has a unique constraint on the combination of
the columns netboxid, sysname, module, port, mac and start_time. Any
update to the netboxid column in cam will also cause this constraint's
index to be updated. So 100.000 index entries are also updated.
On another note, this index will become inordinately large. In fact,
in the databases I've examined, the index itself will take up many
more pages on disk than the cam table itself! Although this
observation really has nothing to do with updates per se, this will
affect index lookups greatly. The index is mainly used to verify that
updates and inserts do not create duplicate cam rows, but PostgreSQL
will also use it for lookups for statements that refer to the netboxid
field (Machine Tracker, IP Info Center and possibly Arnold).
An index should ideally fit in memory, but this index will no more fit
in memory than the cam table itself. If this constraint is even
necessary, we should consider redefining the index to use
hashtext(sysname) and hashtext(port) to reduce its size. The
distribution of sysname and port values is concentrated around a few
distinct values, considering their repetetive nature, so hash
collisions for these columns are very unlikely.
(un)Real world numbers
----------------------
For my timing tests, I duplicated our production database on my
workstation, and ran some tests using netbox A as my guinea pig.
The cam table contains ~12M rows, netbox A has ~750K referring cam
rows, 427 of which are active rows.
I proceeded to delete netbox A from the database. This took ~43
minutes (sic) on my workstation.
Starting from scratch again, I created a new table, netbox_cam, with
two columns, netboxid and camid - foreign keys referring to the netbox
and cam tables respectively. I populated this table with
netboxid,camid pairs from the cam table, and subsequently dropped the
netboxid column from cam.
I then proceeded to delete netbox A again. This time the delete only
took ~10 minutes.
Although both 43 and 10 minutes are unacceptable times in real-world
usage, the relative performance increase is the interesting part here.
In the second scenario, PostgreSQL only needed to update 427 cam rows,
and expiring ~750K netbox_cam rows.
I also have some numbers on the aforementioned unique constraint's
index. The production cam table took up 259298 pages on disk. Said
index took up 340929 pages! I replaced the constraint and index with
a unique index using the hashtext function on the sysname, module,
port and mac columns. The resulting index was only 67962 pages.
Other speed-up strategies
-------------------------
Although some schema changes are obviously in order, deleting an IP
device from EditDB may still be a time consuming process. Another
obvious strategy to speed up the web interface response times for this
operation is to run deletes on the netbox table as a background
process.
Deleting an IP device in EditDB could post netbox delete events on the
event queue, producing an immediate response to the end user. The
response would explain to the user that the devices are being deleted
in the background and may still be visible for some minutes. The
event engine (or possibly some other process) would pick up the delete
events and run the actual DELETE statements in PostgreSQL.
--
mvh
Morten Brekkevold
UNINETT