Monday, January 10, 2022

PostgreSQL pg_dump filtering data by using Row Level Security (RLS)

The pg_dump command doesn't provide a native way to dump only the rows that match some filter criteria.

Sometimes we need such a feature in order to dump only part of a table without creating other artifacts to do so. We might also want it to get finer control over dump parallelism within big tables.

For that reason, in this post I'll build an example that uses RLS (Row Level Security) to allow partial dumps with the pg_dump tool.

Another way to obtain a filtered dump is to use staging tables. You can create staging tables that contain just the data to be exported; in this case, you need an extra step before running pg_dump to create those tables from some SQL clause.

For example, if I only need to export rows of table "report_logs" with "report_id" between 1 and 3, I could create a staging table "report_logs_stg1" like this:

create table report_logs_stg1 as select * from report_logs where report_id between 1 and 3;

and run pg_dump just for that table:

pg_dump ... -t report_logs_stg1 ...

On the target server, after importing, I'll need an additional step to merge the data from report_logs_stg1 into the actual table:

insert into report_logs select * from report_logs_stg1;

This approach needs write access on the source server and enough free space to hold the staging tables.

Our proposed solution is to use RLS instead: you define policies, written as SQL conditional expressions, that apply to certain users when they access the table data. These users will be used only by the pg_dump processes, so the RLS policies will only affect pg_dump and will not disturb other applications working with the database.
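
As a minimal sketch of the idea (the table, column and user names here are illustrative, not the ones used in the example below):

-- dump_user only sees rows from the last 2 years; everyone else sees the whole table
-- (row level security must also be enabled on some_table for the policy to apply)
CREATE POLICY dump_only_recent ON some_table FOR SELECT TO PUBLIC
  USING (CURRENT_USER <> 'dump_user' OR ts > NOW() - INTERVAL '2 YEARS');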

Let's put it all together in an example:

Situation

I have a database with some small tables and one big table, "table1". This table has a column "report_id" with values 1, 2, 3 and 4, and a timestamp column "ts". The database is in use by other applications.

Request

I need to export the complete database with a parallelism degree of 4 (both at table level and at data level within table "table1"), discarding old rows from "table1" (those with "ts" older than 2 years).

Solution

To achieve this, we can use parallelism at table level with the "--jobs" parameter of pg_dump. With this option, pg_dump spawns multiple dump processes, each of them taking care of some of the tables (but each table will be dumped by only one process).

To get parallelism at data level within table "table1", and to discard rows based on the timestamp, we'll use RLS to filter the data based on the username while letting other applications work as usual.

Data example: table1, ..., table16, with table1 holding "report_id" values 1 to 4 and a mix of old and new rows according to column "ts":

CREATE TABLE TABLE1 (REPORT_ID BIGINT, TS TIMESTAMP(6));
INSERT INTO TABLE1 SELECT 1, NOW()-INTERVAL '10 YEARS' FROM GENERATE_SERIES(1,100);
INSERT INTO TABLE1 SELECT 2, NOW()-INTERVAL '10 YEARS' FROM GENERATE_SERIES(1,100);
INSERT INTO TABLE1 SELECT 3, NOW()-INTERVAL '10 YEARS' FROM GENERATE_SERIES(1,100);
INSERT INTO TABLE1 SELECT 4, NOW()-INTERVAL '10 YEARS' FROM GENERATE_SERIES(1,100);
INSERT INTO TABLE1 (REPORT_ID, TS) SELECT REPORT_ID, NOW() FROM TABLE1;

CREATE TABLE TABLE2 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE3 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE4 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE5 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE6 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE7 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE8 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE9 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE10 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE11 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE12 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE13 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE14 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE15 AS SELECT * FROM TABLE1 LIMIT 10;
CREATE TABLE TABLE16 AS SELECT * FROM TABLE1 LIMIT 10;
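
At this point a quick sanity check (plain aggregation, RLS is not yet in place) should show 100 old and 100 recent rows per report_id in table1:

SELECT REPORT_ID, (TS > NOW() - INTERVAL '2 YEARS') AS RECENT, COUNT(*)
FROM TABLE1 GROUP BY 1, 2 ORDER BY 1, 2;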

Users test1, ..., test4 to be used by the pg_dump parallel processes:

CREATE ROLE TEST1 WITH LOGIN NOSUPERUSER INHERIT CREATEDB NOCREATEROLE NOREPLICATION PASSWORD '<password>';
CREATE ROLE TEST2 WITH LOGIN NOSUPERUSER INHERIT CREATEDB NOCREATEROLE NOREPLICATION PASSWORD '<password>';
CREATE ROLE TEST3 WITH LOGIN NOSUPERUSER INHERIT CREATEDB NOCREATEROLE NOREPLICATION PASSWORD '<password>';
CREATE ROLE TEST4 WITH LOGIN NOSUPERUSER INHERIT CREATEDB NOCREATEROLE NOREPLICATION PASSWORD '<password>';

GRANT CONNECT ON DATABASE DB TO TEST1;
GRANT CONNECT ON DATABASE DB TO TEST2;
GRANT CONNECT ON DATABASE DB TO TEST3;
GRANT CONNECT ON DATABASE DB TO TEST4;

GRANT SELECT ON ALL TABLES IN SCHEMA PUBLIC TO TEST1;
GRANT SELECT ON ALL TABLES IN SCHEMA PUBLIC TO TEST2;
GRANT SELECT ON ALL TABLES IN SCHEMA PUBLIC TO TEST3;
GRANT SELECT ON ALL TABLES IN SCHEMA PUBLIC TO TEST4;

Drop the policies in case they already exist (IF EXISTS makes this safe on a first run):

DROP POLICY IF EXISTS DUMP_RLS_1 ON TABLE1;
DROP POLICY IF EXISTS DUMP_RLS_2 ON TABLE1;
DROP POLICY IF EXISTS DUMP_RLS_3 ON TABLE1;
DROP POLICY IF EXISTS DUMP_RLS_4 ON TABLE1;

Create the policies:

CREATE POLICY DUMP_RLS_1 ON TABLE1 FOR ALL TO PUBLIC USING ((REPORT_ID = 1 AND TS > NOW() - INTERVAL '2 YEARS' AND CURRENT_USER = 'test1') OR CURRENT_USER NOT IN ('test1','test2','test3','test4'));
CREATE POLICY DUMP_RLS_2 ON TABLE1 FOR ALL TO PUBLIC USING ((REPORT_ID = 2 AND TS > NOW() - INTERVAL '2 YEARS' AND CURRENT_USER = 'test2') OR CURRENT_USER NOT IN ('test1','test2','test3','test4'));
CREATE POLICY DUMP_RLS_3 ON TABLE1 FOR ALL TO PUBLIC USING ((REPORT_ID = 3 AND TS > NOW() - INTERVAL '2 YEARS' AND CURRENT_USER = 'test3') OR CURRENT_USER NOT IN ('test1','test2','test3','test4'));
CREATE POLICY DUMP_RLS_4 ON TABLE1 FOR ALL TO PUBLIC USING ((REPORT_ID = 4 AND TS > NOW() - INTERVAL '2 YEARS' AND CURRENT_USER = 'test4') OR CURRENT_USER NOT IN ('test1','test2','test3','test4'));

Note that CREATE ROLE TEST1 actually creates the role "test1" (unquoted identifiers are folded to lowercase), so the CURRENT_USER comparisons must use lowercase names.

Let's explain the dump_rls_1 policy: it only lets through rows matching "report_id" = 1 and "ts" > 2 years ago when the current user is test1 (so a connection made as test1 will only retrieve report_id = 1 rows newer than 2 years). Apart from that, there is an "OR CURRENT_USER NOT IN ('test1','test2','test3','test4')" branch, so any user other than test1, ..., test4 can see all the data in the table without restriction. Since these are permissive policies, PostgreSQL combines them with OR, so each testN user ends up seeing exactly the slice granted by its own policy.

So, let's summarize the combined criteria based on the connected user:
test1 => report_id=1 and ts > 2 years ago from table "table1"
test2 => report_id=2 and ts > 2 years ago from table "table1"
test3 => report_id=3 and ts > 2 years ago from table "table1"
test4 => report_id=4 and ts > 2 years ago from table "table1"
other user => all data from table "table1"

Enable row level security for table1 and force it to apply even to the table owner:

ALTER TABLE TABLE1 ENABLE ROW LEVEL SECURITY;
ALTER TABLE TABLE1 FORCE ROW LEVEL SECURITY;
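
To double-check the setup before dumping, you can inspect the RLS flags in pg_catalog and count rows as one of the dump users (a sketch; <servername> is a placeholder):

SELECT RELNAME, RELROWSECURITY, RELFORCEROWSECURITY FROM PG_CLASS WHERE RELNAME = 'table1';

psql -h <servername> -U test1 -d db -c 'SELECT REPORT_ID, COUNT(*) FROM TABLE1 GROUP BY REPORT_ID;'

Connected as test1 you should only get report_id=1 with 100 rows (the recent ones), while any application user should still see all 800 rows.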

Dump processes:

export PGPASSWORD=<password>
pg_dump -h <servername> -U test1 -s -d db -f db.schema.dump
nohup pg_dump -h <servername> -U test1 -Fd -j 4 --enable-row-security -a -d db -f db.data1.dump &
nohup pg_dump -h <servername> -U test2 -t table1 --enable-row-security -a -d db -f db.data2.dump &
nohup pg_dump -h <servername> -U test3 -t table1 --enable-row-security -a -d db -f db.data3.dump &
nohup pg_dump -h <servername> -U test4 -t table1 --enable-row-security -a -d db -f db.data4.dump &

The first pg_dump is a schema-only export, to get all the table definitions without data.

The second pg_dump is a directory-format export with 4 parallel connections; it takes a complete database dump, parallelized at table level, using user "test1", so from the big table "table1" it only takes the rows allowed by the dump_rls_1 policy criteria.

The third and subsequent pg_dump processes export only "table1", using users test2, ..., test4, so each dump takes only the rows allowed by the policy definitions dump_rls_2, ..., dump_rls_4.

As you can see, the last 4 pg_dump commands start with "nohup" and end with "&", so they are all launched in parallel. That means we could spawn up to 7 parallel dump processes (4 for the first data pg_dump, plus 3 for the other pg_dump commands). As our database has only 1 big table, the first data pg_dump will finish the small tables quickly and keep only 1 connection busy, so the initial spike in the number of dump processes will be short; after that, only 4 parallel processes will remain exporting rows from "table1".
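
On the target server, a possible restore sequence could be the following (a sketch; it assumes the target database "db" already exists and <targetserver> is a placeholder; db.data1.dump is a directory-format dump so it goes through pg_restore, while the other data dumps are plain SQL and are replayed with psql):

psql -h <targetserver> -d db -f db.schema.dump
pg_restore -h <targetserver> -d db -j 4 -a db.data1.dump
psql -h <targetserver> -d db -f db.data2.dump
psql -h <targetserver> -d db -f db.data3.dump
psql -h <targetserver> -d db -f db.data4.dump

The last three psql commands could also be launched with nohup ... & to run in parallel, since they only append rows to "table1".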

Result

You can use the RLS approach to execute partial exports from big tables without needing to create staging tables on the source server (for example, when your source server is running out of space and you can't create any other object). You can also use the RLS approach to get parallel exports at data level, not only at table level.
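
Once the dumps are done, you will probably want to undo the setup on the source server. A possible cleanup sketch (DROP OWNED also revokes the privileges previously granted to the test roles):

ALTER TABLE TABLE1 NO FORCE ROW LEVEL SECURITY;
ALTER TABLE TABLE1 DISABLE ROW LEVEL SECURITY;
DROP POLICY IF EXISTS DUMP_RLS_1 ON TABLE1;
DROP POLICY IF EXISTS DUMP_RLS_2 ON TABLE1;
DROP POLICY IF EXISTS DUMP_RLS_3 ON TABLE1;
DROP POLICY IF EXISTS DUMP_RLS_4 ON TABLE1;
DROP OWNED BY TEST1, TEST2, TEST3, TEST4;
DROP ROLE TEST1, TEST2, TEST3, TEST4;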