I wrote this entry for planet postgreql back in 2008. I am posting it here again not to lose it. The details here are a bit out of date but some of them might still be relevant.
Autovacuum, something that comes up a lot when you talk about PostgreSQL maintainence, and it is relatively new, started from PostgreSQL 8.1 to be part of the backend proper. Before that it was a contrib module IIRC. In the generally excellent PotgreSQL documentation you can find its reference here however i will also try and state a few things about autovaccum which are not in docs but perhaps should be, i will also try to keep it as end-userish as possible.
When does it run?
As the documentation tells you , it runs every autovacuum_naptime. Infact in each iteration of main server loop, it calls the autovac_start method(doesn’t fork it ),which checks if the lastautovacuum end time AND last autovacuum start time is less than autovacuum_naptime, it then forks a process if the required time has infact elapsed. i did wonder for a second why we need to check for last autovacuum start time before launching it, and the answer ofcourse is that it does that because if the previous fork call failed somehow, it will start thrashing by trying to fork it in each iteration of the main server loop.
The old contrib module had a nifty feature to automatically increase the nap time looking at the volume of vacuum done, but it didn’t make it when the code was brought into the backend, i guess people tend to think the vacuum delay settings is a better way to do this, but i imagine the old feature would easier to understand and use. To be honest i don’t know this feature first hand, but from what i read it, it sounds neat.
Does it process all databases on each iteration?
No, it just picks one database each iteration. So the immediate question that comes to mind is whats the criteria for picking up a database to vacuum.
Database vacuum ordering
Following criteria is used pick up a database to vacuum in each iteration
1. Is there any database in XID wraparound danger? if there are more than one,then pick one in the order of older forzenxid.
2. If there is no database in XID wraparound danger, pick the one which is least recently autovacuumed.
My opinion btw, on this is that the process should pick the database least recently vacuumed, it shouldn’t matter if it was autovacuumed or regular vacuumed, currently it only decides on lastautovacuum timestamp.
Once a database is selected , the autovacuum daemon proceeds to vacuum the tables in that dataabase
Which tables are vacuumed in the selected database?
Not every table is vacuumed in the selected database, only tables which have dead rows exceeding the vacuum threshold. Vacuum threshold is calculated as
reltuples is the estimated number of rows in table
Temporary tables are not vacuumed similarly any table having no entry in the stat subsystem , will be skipped.
Note that you can disable a table’s vacuuming through autovacuum daemon by making an entry in pg_autovacuum table (setting pg_autovacuum.enabled=false), however if the table is in XID warparound danger, a vacuum is forced
Table vacuum ordering
There is no particular order in which the list of tables eligible for vacuum is processed. The list is just created as the autovacuum process picks up tables from pg_class and adds them to a list of tables to vaccum , according to the criteria mentioned above, so it just has the same order as pg_class entries, which maybe on oid i guess.
This in my opinion this is also something sub-optimal in the autovacuum daemon , i think it can be impoved by ordering on how many dead rows exceed the threshold. This way we will be vacuuming tables which need it the most, first.
This is something i am working on currently and will be trying to contribute to the community if they want it.
What about toast tables?
While scanning for tables to vacuum, the autovacuum process makes a note of all tables with a toast entry (regardless if they need a vacuum or not), every table which has a toast table is vacuumed forcefully even if the base tables doesn’t need a vacuum.
Does autovacuum daemon also perform ANALYZE?
Yes, if autovacuum will also anayze the tables to keep the stats updated. The decision to analyze or not to analyze a table is again based on analyze threshold. First it determines how many tuples in the table need analysis as following
anltuples = ( table live tuples + table dead tuples – tuples analyze in previous run )
it analyzes the table if anltuples > analthresh, where analthresh is calculated as
analthresh = autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples
Keepin’ it real
Since autovacuum relies on stat collector so much, its extremely important to ‘keep it real’. An important activity which autovacuum does is to clean up the dead rows in stat collector itself. It does that by scanning through entries for all databases and tables present in stat collector , and then making sure these objects actually exist, and are not left overs form old stat collection while the object is actually dropped.
Just looking at this, i have a weired feeling that this can be an expensive process to do if the number of tables are in thousands , and there are definitely systems like that, but something which is required as well.
One alternative can be to actually remove any stat entry for the object when a drop command for it is issued, i.e. DROP TABLE etc.
Whats new in 8.3?
The 8.3 release of postgres has a cool feature of multiple workers for autovacuum. In 8.3 you can configure multiple autovacuum worker processes to vacuum different tables in parallel. Which sounds pretty promising