Tuesday, September 14, 2021

Solving "select count without where clause" performance issues in PostgreSQL databases

"select count without where clause" is a performance headache in most of the database engines, in this post we´ll talk about PostgreSQL engine.


In PostgreSQL the row count of a table is not stored anywhere, so every time you run "select count(1) from table", PostgreSQL will perform a sequential scan of the table, which requires a large amount of IO if it's a big table. We'll go through some solutions to this problem, and please don't miss the last one …


Solution 1

Increase server capacity to provide much more IO throughput, partition the table, and use parallelism, so these types of queries will perform much better, but you will spend a lot of resources (partitioning the tables, increasing the IO capacity of the server). This will not need any change from your application side and the information will be accurate.
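
For instance, recent PostgreSQL versions can parallelize the sequential scan behind the count. A minimal illustration (the setting value is just an example, tune it to your server):

-- allow more parallel workers for this session (illustrative value)
SET max_parallel_workers_per_gather = 8;

-- check that the plan uses a Gather node with a Parallel Seq Scan
EXPLAIN (ANALYZE) SELECT count(1) FROM public.tests;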


Solution 2

Remove the "select count". Yes, sometimes your application doesn't really need to run a "select count" without a where clause, so review that need with your development team. Often the "select count" is used for paging in the frontend, to calculate the number of pages and show something like "1 of 2500". Is this really necessary? Could the frontend just show something like "next"?


Solution 3

You can rely on statistical information to get an approximation of the number of rows, but it's not accurate and it depends on how often statistics are gathered for the table. The pg_stat_all_tables/pg_stat_user_tables views have a column "n_live_tup" with the estimated number of live rows. This requires a change on the application side, from "select count(1) from public.tests" to "select n_live_tup from pg_stat_user_tables where schemaname='public' and relname='tests'", and the information will not be accurate.
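
For the "tests" table used throughout this post, the exact count and the statistical estimate look like this:

-- exact but expensive: sequential scan of the whole table
select count(1) from public.tests;

-- approximate but instant: estimated live rows from the statistics views
select n_live_tup from pg_stat_user_tables where schemaname='public' and relname='tests';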


Solution 4

Use an analytical repository with precalculated metrics like "number of rows in table": instead of "select count(1) from public.tests" you can use "select numRows from aggtable where schemaname='public' and tablename='tests'", and have an ETL process or materialized view update this information frequently, maybe every 10 minutes. That way only one process every 10 minutes runs a "select count(1) from public.tests" to refresh the information in "aggtable". In this case you need to change the application query as explained, and the information will not be totally accurate.
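
A minimal sketch of this approach using a materialized view (the "aggtable" name and columns follow the example above; the refresh would be driven by your scheduler, for example cron or pg_cron, every 10 minutes):

-- precalculated metric: one row per tracked table
create materialized view aggtable as
select 'public'::text as schemaname, 'tests'::text as tablename, count(1) as numRows
from public.tests;

-- run this from the ETL/scheduler, e.g. every 10 minutes
refresh materialized view aggtable;

-- what the application runs instead of the full count
select numRows from aggtable where schemaname='public' and tablename='tests';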


Solution 5

Use a staging table with the current calculated number of rows of all the requested tables, and place a rewrite mechanism between the application and the database to automatically "transform" those "select count(1) from public.tests" queries into "SELECT totalrows FROM aggtableinfo where tableschema='public' and tablename='tests'". In this case the information will be accurate and it won't require an application change.


We'll implement the last solution using two components:


1. A staging table "aggtableinfo", plus triggers in our PostgreSQL database that update that staging table whenever there is an insert or delete on the tables whose "select count" we want to improve.

2. The open source project "pgbouncer-rr" (see GitHub - awslabs/pgbouncer-rr-patch: Adds query routing and rewriting extensions to pgbouncer), which will allow us to rewrite queries on the fly without modifying the application side.


Staging table and triggers


  • Create the staging table:


create table aggtableinfo (tableschema varchar(100),tablename varchar(100),totalrows bigint);

create index aggtableinfo_01_i on aggtableinfo(tableschema,tablename);
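
Optionally, you can seed the staging table with the current count of each table you plan to track (here the "tests" table used in this post), so the very first rewritten query already returns an accurate value:

insert into aggtableinfo (tableschema,tablename,totalrows)
select 'public','tests',count(1) from public.tests;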


  • Create the trigger function:


CREATE OR REPLACE FUNCTION public.aggtableinfo_func()
  RETURNS trigger AS
$BODY$
DECLARE
  vcount bigint;
  vinc integer;
BEGIN
  -- +1 for an inserted row, -1 for a deleted row
  IF (TG_OP = 'DELETE') THEN
    vinc=-1;
  ELSIF (TG_OP = 'INSERT') THEN
    vinc=1;
  END IF;

  -- adjust the stored count, never letting it go below zero
  UPDATE aggtableinfo set totalrows=(case when totalrows+vinc <0 then 0 else totalrows+vinc end) where tableschema=TG_TABLE_SCHEMA::text and tablename=TG_TABLE_NAME::text;
  IF NOT FOUND THEN
    -- no row for this table yet: take a full count once and store it
    EXECUTE 'SELECT COUNT(1) FROM '|| TG_TABLE_SCHEMA::text||'.'||TG_TABLE_NAME::text INTO vcount;
    INSERT INTO aggtableinfo (tableschema,tablename,totalrows) values(TG_TABLE_SCHEMA::text, TG_TABLE_NAME::text, vcount);
  END IF;
  RETURN NULL;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE SECURITY DEFINER
  COST 100;


The trigger function updates the staging table "aggtableinfo" whenever there is an insert or delete on the requested table. If there is no row yet for that table in the staging table, it executes a complete "select count" on the requested table and stores the result.


  • Create the trigger on the requested table:


CREATE TRIGGER aggtableinfo_trigger
  AFTER INSERT OR DELETE
  ON public.tests
  FOR EACH ROW
  EXECUTE PROCEDURE public.aggtableinfo_func();


Here we're creating the trigger on table "tests" with the "FOR EACH ROW" clause, so every single row affected by a delete or insert statement will fire the execution of the trigger. You need to consider this, as it could cause performance issues if you insert or delete many rows with a single statement; a statement-level alternative is sketched below.
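
If that becomes a problem, a possible alternative is a statement-level trigger with transition tables (PostgreSQL 10 and newer), which fires once per insert or delete statement instead of once per row. This is only a sketch that keeps the same "aggtableinfo" table and omits the "IF NOT FOUND" fallback of the row-level function:

CREATE OR REPLACE FUNCTION public.aggtableinfo_stmt_func()
  RETURNS trigger AS
$BODY$
DECLARE
  vdelta bigint;
BEGIN
  -- count all rows affected by the statement in one shot
  IF (TG_OP = 'INSERT') THEN
    SELECT count(1) INTO vdelta FROM new_rows;
  ELSE
    SELECT -count(1) INTO vdelta FROM old_rows;
  END IF;

  UPDATE aggtableinfo
     SET totalrows = greatest(totalrows + vdelta, 0)
   WHERE tableschema = TG_TABLE_SCHEMA::text
     AND tablename = TG_TABLE_NAME::text;
  RETURN NULL;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE SECURITY DEFINER;

CREATE TRIGGER aggtableinfo_ins_stmt_trigger
  AFTER INSERT ON public.tests
  REFERENCING NEW TABLE AS new_rows
  FOR EACH STATEMENT
  EXECUTE PROCEDURE public.aggtableinfo_stmt_func();

CREATE TRIGGER aggtableinfo_del_stmt_trigger
  AFTER DELETE ON public.tests
  REFERENCING OLD TABLE AS old_rows
  FOR EACH STATEMENT
  EXECUTE PROCEDURE public.aggtableinfo_stmt_func();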


Pgbouncer-rr 


  • Install pgbouncer-rr


Please follow the instructions in the pgbouncer-rr project: GitHub - awslabs/pgbouncer-rr-patch: Adds query routing and rewriting extensions to pgbouncer


# install required packages - see https://github.com/pgbouncer/pgbouncer#building
sudo yum install libevent-devel openssl-devel python-devel libtool git patch make -y

# download the latest tested pgbouncer distribution - 1.12
git clone https://github.com/pgbouncer/pgbouncer.git --branch "pgbouncer_1_12_0"

# download pgbouncer-rr extensions
git clone https://github.com/awslabs/pgbouncer-rr-patch.git

# merge pgbouncer-rr extensions into pgbouncer code
cd pgbouncer-rr-patch
./install-pgbouncer-rr-patch.sh ../pgbouncer

# build and install
cd ../pgbouncer
git submodule init
git submodule update
./autogen.sh
./configure ...
make
sudo make install


After that, you'll need to start pgbouncer as a service. I used these configuration files to make it work with my Azure Database for PostgreSQL server:


  • pgbouncer.ini


[pgbouncer]
listen_port = 5439
listen_addr = *
auth_type = trust
auth_file = users.txt
logfile = /tmp/pgbouncer.log
pidfile = /tmp/pgbouncer.pid
admin_users = master
ignore_startup_parameters = extra_float_digits,TimeZone
pkt_buf = 32768
pool_mode = statement
rewrite_query_py_module_file = ./rewrite_query.py
rewrite_query_disconnect_on_failure = true
[databases]
testsdb = host=<servername>.postgres.database.azure.com port=5432 dbname=testsdb



  • rewrite_query.py


import re

def rewrite_query(username, query):
    # re.match anchors the pattern at the start of the query and is case sensitive,
    # so only queries that begin with "SELECT count(1) FROM public.tests" are rewritten
    q1 = r"SELECT count\(1\) FROM public.tests"
    if re.match(q1, query):
        new_query = "SELECT totalrows FROM aggtableinfo where tableschema='public' and tablename='tests'"
    else:
        new_query = query
    return new_query


  • users.txt


"username" "password"


The important file here is "rewrite_query.py", where you can see the rewrite rule: when pgbouncer-rr finds a query starting with "SELECT count(1) FROM public.tests", it replaces it with "SELECT totalrows FROM aggtableinfo where tableschema='public' and tablename='tests'". Note that the regular expression is matched case sensitively, so adjust it (for example with re.IGNORECASE) if your application sends the query with different casing.


After that, you only need to point your application to the pgbouncer endpoint (in this case, using port 5439) and your application will get much better performance when issuing "select count(1) from public.tests", because that query will be replaced by a much cheaper one that doesn't need to sequentially scan all the blocks of the "tests" table.


With this approach you're also getting the benefits of pgbouncer, an open source connection pooler for PostgreSQL that will help you reduce overall resource consumption, as it reuses connections.


In summary, with pgbouncer-rr we have a powerful connection pooler for PostgreSQL, but also a routing and rewriting engine that extends its capabilities, creating another abstraction layer that allows us to improve our application's performance without touching a single line of code.