
PostgreSQL CTE behavior influenced by outer `update` statement: Why?

tl;dr

Why does an `update ... from ...` statement make a CTE behave differently than an `update ... where ... in (subquery)` statement in PostgreSQL?

Full Context

Disclaimer: The information below has been sanitized to maintain privacy.

While working on a legacy system (about a 5-year-old design), I have a Postgres documents table that looks like this:

create table documents (
    uuid       uuid                     not null primary key,
    data       jsonb                    not null,
    created_dt timestamp with time zone not null,
    deleted_dt timestamp with time zone
);

And the information in the data column looks roughly like this:

{"uuid": "959bd856-707a-4c5e-a76e-4c1496a6103a","bucket": "fake_cloud_bucket_01","md5Sum": "233ea5afd52636b83cf75f7fd39c1f2a","contentType": "application/pdf","tags": {"APPLICATION_UUID": "3e979827-36df-4bb0-9012-22cceb54d5e9","IDENTITY_UUID": "9e137538-b0ad-4322-a6a4-2047bab984e4"    },"createdDate": 1704757787264}

The application that uses this database has a search-by-tag feature, so in the original design, an index was added to the documents table like this:

create index documents_tags_idxgin
    on documents using gin ((data -> 'tags'::text) jsonb_path_ops);

Now fast-forward 5 years to today: the documents table has some 20 million rows, and a simple query like the one below takes about 3 minutes to finish, which is unacceptable:

select *
from documents
where data->'tags'->>'key' = 'IDENTITY_UUID'
  and data->'tags'->>'value' = '9e137538-b0ad-4322-a6a4-2047bab984e4';
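(Worth noting for context: a jsonb_path_ops GIN index only supports containment-style operators such as @>, so the ->> comparisons above cannot use documents_tags_idxgin at all, which likely explains the sequential-scan runtime. A containment query such as the following could use the index; this is just an illustration against the structure shown earlier, not a query from the original system:)

-- Illustration only: a containment predicate is the form the
-- jsonb_path_ops GIN index on (data -> 'tags') can actually serve.
select *
from documents
where data->'tags' @> '{"IDENTITY_UUID": "9e137538-b0ad-4322-a6a4-2047bab984e4"}'::jsonb;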

To make the search-by-tag feature perform better, I decided to extract the tags into their own table and apply more performant indices to the data. So now we have a new table called tags, and an index for it, that look like this:

create table tags (
    document_uuid uuid not null references documents(uuid) on delete restrict,
    tag           text not null check (tag <> ''),
    value         text not null check (value <> ''),
    primary key (document_uuid, tag)
);

create index tags_tag_value_idx on tags (tag, value);

With the above addition of the tags table, this query now returns a result in single-digit milliseconds on average:

select D.*
from documents as D
join tags as T on D.uuid = T.document_uuid
where T.tag = 'IDENTITY_UUID'
  and T.value = '9e137538-b0ad-4322-a6a4-2047bab984e4';
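(As a sanity check, not from the original post: an EXPLAIN of the join should show an index scan on tags_tag_value_idx rather than a scan of the whole table:)

-- Verifies the plan actually uses the new btree index on (tag, value).
explain (analyze, buffers)
select D.*
from documents as D
join tags as T on D.uuid = T.document_uuid
where T.tag = 'IDENTITY_UUID'
  and T.value = '9e137538-b0ad-4322-a6a4-2047bab984e4';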

The Question

To convert the jsonb tags object in the data column of the documents table into rows in the tags table, I wrote a query that spreads each document's tags object into one row per tag:

insert into tags (document_uuid, tag, value)
    select D.uuid, T.key, replace((T.value)::text, '"', '')
    from documents as D
        join jsonb_each(data->'tags') as T on true
    where D.tags_processed_on is null
    for update skip locked
    limit 2500
    on conflict do nothing
    returning document_uuid, tag, value

Given the jsonb object from the context section above, two rows are inserted into the tags table, and the rows returned by the above query look like this:

document_uuid, tag, value
'959bd856-707a-4c5e-a76e-4c1496a6103a', 'APPLICATION_UUID', '3e979827-36df-4bb0-9012-22cceb54d5e9'
'959bd856-707a-4c5e-a76e-4c1496a6103a', 'IDENTITY_UUID', '9e137538-b0ad-4322-a6a4-2047bab984e4'
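(A side note on the mechanics, since it may not be obvious: jsonb_each() expands an object into one (key, value) row per member, and its value column is still jsonb, so a string value keeps its double quotes when cast to text; that is what the replace(..., '"', '') above strips. A standalone illustration, not from the original post:)

-- jsonb_each() returns one row per key; casting the jsonb value to text
-- keeps the surrounding double quotes, hence the replace() in the query.
select T.key,
       (T.value)::text as raw,
       replace((T.value)::text, '"', '') as stripped
from jsonb_each('{"APPLICATION_UUID": "3e979827-36df-4bb0-9012-22cceb54d5e9",
                  "IDENTITY_UUID": "9e137538-b0ad-4322-a6a4-2047bab984e4"}'::jsonb) as T;

-- key              | raw                                     | stripped
-- APPLICATION_UUID | "3e979827-36df-4bb0-9012-22cceb54d5e9"  | 3e979827-36df-4bb0-9012-22cceb54d5e9
-- IDENTITY_UUID    | "9e137538-b0ad-4322-a6a4-2047bab984e4"  | 9e137538-b0ad-4322-a6a4-2047bab984e4

(jsonb_each_text() would return plain text values and avoid the replace() altogether.)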

To keep track of which documents have been processed, we also added a temporary column to the documents table:

alter table documents add column tags_processed_on timestamp default null;
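(An aside, not part of the original migration as posted: since each batch has to find the remaining unprocessed rows among roughly 20 million, a partial index along these lines would keep that lookup cheap. The index name here is hypothetical:)

-- Hypothetical helper index, not in the original post: lets each batch
-- locate still-unprocessed documents without scanning the whole table.
create index documents_tags_unprocessed_idx
    on documents (uuid)
    where tags_processed_on is null;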

So my query to insert into the tags table and mark documents as processed looked like this:

with docs_to_tags as (
    insert into tags (document_uuid, tag, value)
        select D.uuid, T.key, replace((T.value)::text, '"', '')
        from documents as D
            join jsonb_each(data->'tags') as T on true
        where D.tags_processed_on is null
        for update skip locked
        limit 2500
        on conflict do nothing
        returning document_uuid, tag, value
)
update documents as D1
set tags_processed_on = now()
from docs_to_tags as DTT
where D1.uuid = DTT.document_uuid;

The problem with the above CTE is that when I ran it, it would insert only the first tag for any given document before moving on to the next document. But when I used a subquery in the outer update statement instead, the insert into the tags table behaved as I expected, inserting all tags. So the CTE that works looks like this:

with docs_to_tags as (
    insert into tags (document_uuid, tag, value)
        select D.uuid, T.key, replace((T.value)::text, '"', '')
        from documents as D
            join jsonb_each(data->'tags') as T on true
        where D.tags_processed_on is null
        for update skip locked
        limit 2500
        on conflict do nothing
        returning document_uuid, tag, value
)
update documents as D1
set tags_processed_on = now()
where D1.uuid in (
    select distinct D2.document_uuid
    from docs_to_tags as D2
);
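(For completeness: with the insert capped at 2500 rows per pass, a statement like this presumably has to be re-run until nothing is left to process. The original post doesn't show how the batches were driven; below is a minimal sketch of one way to do it, assuming PostgreSQL 11+ so that COMMIT is allowed inside a DO block:)

-- Sketch only (my assumption of the driver loop; not from the original post).
-- Re-runs the working batch statement until an iteration updates zero rows.
do $$
declare
    n bigint;
begin
    loop
        with docs_to_tags as (
            insert into tags (document_uuid, tag, value)
                select D.uuid, T.key, replace((T.value)::text, '"', '')
                from documents as D
                    join jsonb_each(data->'tags') as T on true
                where D.tags_processed_on is null
                for update skip locked
                limit 2500
                on conflict do nothing
                returning document_uuid, tag, value
        )
        update documents as D1
        set tags_processed_on = now()
        where D1.uuid in (
            select distinct D2.document_uuid
            from docs_to_tags as D2
        );

        get diagnostics n = row_count;  -- rows marked by the outer update
        exit when n = 0;
        commit;  -- PostgreSQL 11+: releases the row locks between batches
    end loop;
end $$;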

So finally my question is this:

Why does an `update ... from ...` statement make a CTE behave differently than an `update ... where ... in (subquery)` statement in PostgreSQL?

... or, more succinctly: why does the outer update statement have any influence on what is returned by the inner insert query inside the CTE?

