tl;dr
Why does an update ... from ... statement make a CTE behave differently than an update ... where ... statement with a subquery in PostgreSQL?
Full Context
Disclaimer: The information below has been sanitized to maintain privacy.
While working on a legacy system (a design about 5 years old), I have a Postgres documents table that looks like this:
    create table documents (
        uuid uuid not null primary key,
        data jsonb not null,
        created_dt timestamp with time zone not null,
        deleted_dt timestamp with time zone
    );
The information in the data column looks roughly like this:
    {
      "uuid": "959bd856-707a-4c5e-a76e-4c1496a6103a",
      "bucket": "fake_cloud_bucket_01",
      "md5Sum": "233ea5afd52636b83cf75f7fd39c1f2a",
      "contentType": "application/pdf",
      "tags": {
        "APPLICATION_UUID": "3e979827-36df-4bb0-9012-22cceb54d5e9",
        "IDENTITY_UUID": "9e137538-b0ad-4322-a6a4-2047bab984e4"
      },
      "createdDate": 1704757787264
    }
The application that uses this database has a search-by-tag feature, so in the original design an index was added to the documents table like this:
create index documents_tags_idxgin on documents using gin ((data -> 'tags'::text) jsonb_path_ops);
Fast-forward 5 years to today: the documents table has some 20 million rows, and a simple query like the one below takes about 3 minutes to finish, which is unacceptable:
    select *
    from documents
    where data->'tags'->>'key' = 'IDENTITY_UUID'
      and data->'tags'->>'value' = '9e137538-b0ad-4322-a6a4-2047bab984e4';
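For comparison, a GIN index built with jsonb_path_ops serves containment lookups (the @> operator), not ->> equality comparisons. A minimal sketch of such a lookup follows, assuming the tags are stored as a flat object of tag name to tag value, as in the sample document above. Whether the planner actually picks documents_tags_idxgin for it would need to be confirmed with explain.

```sql
-- Sketch only: assumes data -> 'tags' is a flat {"TAG_NAME": "tag value"} object,
-- as in the sample document above.
select *
from documents
where (data -> 'tags') @> '{"IDENTITY_UUID": "9e137538-b0ad-4322-a6a4-2047bab984e4"}'::jsonb;
```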
To make the search-by-tag feature perform better, I decided to extract the tags into their own table and apply more performant indices to the data. So now we have a new table called tags and an index for it that look like this:
    create table tags (
        document_uuid uuid not null references documents(uuid) on delete restrict,
        tag text not null check (tag <> ''),
        value text not null check (value <> ''),
        primary key (document_uuid, tag)
    );

    create index tags_tag_value_idx on tags (tag, value);
With the tags table added, this query now returns a result in single-digit milliseconds on average:
    select D.*
    from documents as D
    join tags as T on D.uuid = T.document_uuid
    where T.tag = 'IDENTITY_UUID'
      and T.value = '9e137538-b0ad-4322-a6a4-2047bab984e4';
The Question
To convert the data from the jsonb tags object in the data column of the documents table into rows in the tags table, I wrote a query that expands the tags object of each document into one row per tag, ready to be inserted into the tags table:
    insert into tags (document_uuid, tag, value)
    select D.uuid, T.key, replace((T.value)::text, '"', '')
    from documents as D
    join jsonb_each(data->'tags') as T on true
    where D.tags_processed_on is null
    for update skip locked
    limit 2500
    on conflict do nothing
    returning document_uuid, tag, value
Given the jsonb object in the context section above, two rows are inserted into the tags table, and the rows returned by the above query look like this:
    document_uuid, tag, value
    '959bd856-707a-4c5e-a76e-4c1496a6103a', 'APPLICATION_UUID', '3e979827-36df-4bb0-9012-22cceb54d5e9'
    '959bd856-707a-4c5e-a76e-4c1496a6103a', 'IDENTITY_UUID', '9e137538-b0ad-4322-a6a4-2047bab984e4'
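As a side note on the replace((T.value)::text, '"', '') expression in the insert: it is there to strip the double quotes that appear when a jsonb string is cast to text. A possible alternative is jsonb_each_text, which returns each value as text directly. The sketch below runs against a literal copy of the sample tags object rather than the real table. One difference to be aware of: replace removes every double quote, including any embedded in a value, whereas jsonb_each_text leaves embedded quotes in place.

```sql
-- Sketch only: expands a literal copy of the sample tags object into one row per tag.
-- jsonb_each_text returns string values as plain text, without surrounding quotes.
select T.key as tag, T.value
from jsonb_each_text(
    '{"APPLICATION_UUID": "3e979827-36df-4bb0-9012-22cceb54d5e9",
      "IDENTITY_UUID": "9e137538-b0ad-4322-a6a4-2047bab984e4"}'::jsonb
) as T;
```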
To keep track of the documents processed, we also added a temporary column to the documents table:
alter table documents add column tags_processed_on timestamp default null;
So my query to insert into the tags table and mark the documents as processed looked like this:
    with docs_to_tags as (
        insert into tags (document_uuid, tag, value)
        select D.uuid, T.key, replace((T.value)::text, '"', '')
        from documents as D
        join jsonb_each(data->'tags') as T on true
        where D.tags_processed_on is null
        for update skip locked
        limit 2500
        on conflict do nothing
        returning document_uuid, tag, value
    )
    update documents as D1
    set tags_processed_on = now()
    from docs_to_tags as DTT
    where D1.uuid = DTT.document_uuid;
The problem with the above CTE is that when I ran it, it inserted only the first tag for any given document and then moved on to the next document. But when I used a subquery in the where clause of the outer update statement instead, the insert into the tags table behaved as I expected and inserted all tags for each document. So the CTE that works looks like this:
    with docs_to_tags as (
        insert into tags (document_uuid, tag, value)
        select D.uuid, T.key, replace((T.value)::text, '"', '')
        from documents as D
        join jsonb_each(data->'tags') as T on true
        where D.tags_processed_on is null
        for update skip locked
        limit 2500
        on conflict do nothing
        returning document_uuid, tag, value
    )
    update documents as D1
    set tags_processed_on = now()
    where D1.uuid in (
        select distinct D2.document_uuid
        from docs_to_tags as D2
    );
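To tell the two behaviors apart after a run, a check along these lines can list processed documents whose row count in tags differs from the number of keys in their tags object. This is a minimal sketch: the column aliases expected_tags and inserted_tags are made up for illustration, and it assumes every data -> 'tags' value is a JSON object (jsonb_object_keys raises an error otherwise).

```sql
-- Sketch only: lists processed documents where the number of rows in tags
-- does not match the number of keys in the document's jsonb tags object.
select D.uuid,
       (select count(*) from jsonb_object_keys(D.data -> 'tags')) as expected_tags,
       (select count(*) from tags as T where T.document_uuid = D.uuid) as inserted_tags
from documents as D
where D.tags_processed_on is not null
  and (select count(*) from jsonb_object_keys(D.data -> 'tags'))
      <> (select count(*) from tags as T where T.document_uuid = D.uuid);
```

With the first form of the CTE I would expect this to return the documents that have more than one tag; with the second form it should return no rows.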
So finally my question is this:
Why does an update ... from ... statement make a CTE behave differently than an update ... where ... statement with a subquery in PostgreSQL?
... or more succinctly: Why does the outer update statement have any influence on what is returned by the inner insert query inside of the CTE?