I'm using PostgreSQL 17.
I am modelling a package index for the Haskell ecosystem, and one useful feature is determining transitive dependencies. Haskell packages can be normalised as:
Package (name + package-specific metadata) -> Releases (version + release-specific metadata like synopsis, attached data files) -> Components (library, executable, test suite, benchmark suite) -> Dependencies (each component declares a dependency on a package name, with a version expression).
(Each of these sections is a table, and they are linked together by one-to-many relationships: one package links to many releases, each release links to many components, and each component links to many dependencies.)
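To make the four levels concrete, here is a rough sketch of what such a schema might look like. All table and column names are hypothetical placeholders for illustration, not the real schema, and the key types are kept as bigint for simplicity:

```sql
-- Illustrative sketch only: every name below is an assumption.
create table packages_sketch (
    package_id bigint generated always as identity primary key,
    name text unique not null
);

-- One package has many releases.
create table releases_sketch (
    release_id bigint generated always as identity primary key,
    package_id bigint not null references packages_sketch,
    version int[] not null,
    unique (package_id, version)
);

-- One release has many components (library, executable, ...).
create table components_sketch (
    component_id bigint generated always as identity primary key,
    release_id bigint not null references releases_sketch,
    component_type text not null
        check (component_type in ('library', 'executable', 'test-suite', 'benchmark')),
    name text not null
);

-- One component has many dependencies, each naming a package
-- together with a version expression such as '>= 0.0.7.0'.
create table component_dependencies_sketch (
    dependency_id bigint generated always as identity primary key,
    component_id bigint not null references components_sketch,
    package_id bigint not null references packages_sketch,
    requirement text not null
);
```

Note that a dependency points at a package, not at a release, which is why resolving transitive dependencies needs an extra hop through the releases table.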
For the purpose of my own enlightenment, I first reduced the complexity of the model to create a CTE that does what I expect. In particular, my real codebase uses UUIDs rather than bigints as primary keys.
(Full dbfiddle is available at https://dbfiddle.uk/hVOmMdYQ)

    -- Data model where packages and versions are combined,
    -- and dependencies refer to packages
    create table packages (
        package_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name text unique not null,
        version int[] not null
    );
    create unique index on packages(name, version);
    create table dependencies (
        dependency_id bigint generated always as identity PRIMARY KEY,
        dependent_id bigint references packages(package_id),
        depended_id bigint references packages(package_id)
    );
    create unique index on dependencies(dependent_id, depended_id);

And here is the data:

    insert into packages (name, version) values ('base', '{1,0,0,0}');
    insert into packages (name, version) values ('vector', '{0,0,7,0}');
    insert into packages (name, version) values ('random', '{0,1,5,8}');
    insert into packages (name, version) values ('unix', '{1,2,1,0}');
    insert into packages (name, version) values ('time', '{3,14,1,2}');
    insert into dependencies (dependent_id, depended_id) values (2, 1);
    insert into dependencies (dependent_id, depended_id) values (3, 1);
    insert into dependencies (dependent_id, depended_id) values (3, 2);
    insert into dependencies (dependent_id, depended_id) values (4, 1);
    insert into dependencies (dependent_id, depended_id) values (5, 1);
    insert into dependencies (dependent_id, depended_id) values (5, 3);
    insert into dependencies (dependent_id, depended_id) values (5, 4);

Here is a preliminary result:

    select dependent.package_id
         , dependent.name as dependent
         , depended.name as depended
    from dependencies as d1
    inner join packages as dependent on d1.dependent_id = dependent.package_id
    inner join packages as depended on d1.depended_id = depended.package_id;

package_id | dependent | depended |
---|---|---|
2 | vector | base |
3 | random | base |
3 | random | vector |
4 | unix | base |
5 | time | base |
5 | time | random |
5 | time | unix |
So far, everything looks good. I then wrote this recursive CTE to create a view of transitive dependencies, with breadcrumbs:

    with recursive transitive_dependencies (dependent_id, dependent, depended_id, breadcrumbs) as (
        -- Non-recursive term: direct dependencies of package 5
        select dependent.package_id as dependent_id
             , dependent.name as dependent
             , depended.package_id as depended_id
             , concat_ws(' > ', dependent.name, depended.name) as breadcrumbs
        from dependencies as d1
        inner join packages as dependent on d1.dependent_id = dependent.package_id
        inner join packages as depended on d1.depended_id = depended.package_id
        where d1.dependent_id = 5
        union all
        -- Recursive term: dependencies of the packages found so far
        select dependent.package_id as dependent_id
             , dependent.name as dependent
             , depended.package_id as depended_id
             , concat_ws(' > ', t2.breadcrumbs, depended.name) as breadcrumbs
        from dependencies as d1
        inner join packages as dependent on d1.dependent_id = dependent.package_id
        inner join packages as depended on d1.depended_id = depended.package_id
        inner join transitive_dependencies as t2 on t2.depended_id = dependent.package_id -- ← This is where we refer to the CTE
    )
    cycle dependent_id set is_cycle using path
    select t3.dependent_id
         , t3.dependent
         , t3.depended_id
         , t3.breadcrumbs
    from transitive_dependencies as t3;

dependent_id | dependent | depended_id | breadcrumbs |
---|---|---|---|
5 | time | 1 | time > base |
5 | time | 3 | time > random |
5 | time | 4 | time > unix |
3 | random | 1 | time > random > base |
3 | random | 2 | time > random > vector |
4 | unix | 1 | time > unix > base |
2 | vector | 1 | time > random > vector > base |
Behold, it works!
Now, I am looking into splitting things a bit further. Namely, package and release will be separated. This is due to the fact that there is some metadata specific to the Haskell ecosystem that targets the notion of "package" and some that is only relevant to "releases", and they are not interchangeable.

    -- Data model where packages and releases are separated
    create table packages2 (
        package_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name text unique not null
    );
    create table releases2 (
        release_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        package_id bigint references packages2,
        -- Version components, e.g. '{1,0,0,0}', as in the first model
        version int[] not null
    );
    create unique index on releases2(package_id, version);
    create table dependencies2 (
        dependency_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        release_id bigint references releases2 not null,
        package_id bigint references packages2 not null,
        -- Version expression, e.g. '== 1.0.0.0', so this must be text rather than int[]
        requirement text not null
    );

And here is the data:

    insert into packages2 (name) values ('base');   -- 1
    insert into packages2 (name) values ('vector'); -- 2
    insert into packages2 (name) values ('random'); -- 3
    insert into packages2 (name) values ('unix');   -- 4
    insert into packages2 (name) values ('time');   -- 5
    insert into releases2 (package_id, version) values (1, '{1,0,0,0}');
    insert into releases2 (package_id, version) values (2, '{0,0,7,0}');
    insert into releases2 (package_id, version) values (3, '{0,1,5,8}');
    insert into releases2 (package_id, version) values (4, '{1,2,1,0}');
    insert into releases2 (package_id, version) values (5, '{3,14,1,2}');
    insert into dependencies2 (release_id, package_id, requirement) values (2, 1, '== 1.0.0.0');
    insert into dependencies2 (release_id, package_id, requirement) values (3, 1, '== 1.0.0.0');
    insert into dependencies2 (release_id, package_id, requirement) values (3, 2, '>= 0.0.7.0');
    insert into dependencies2 (release_id, package_id, requirement) values (4, 1, '== 1.0.0.0');
    insert into dependencies2 (release_id, package_id, requirement) values (5, 1, '== 1.0.0.0');
    insert into dependencies2 (release_id, package_id, requirement) values (5, 3, '<= 0.1.5.8');
    insert into dependencies2 (release_id, package_id, requirement) values (5, 4, '== 1.2.1.0');

And I tried to apply the lesson of the CTE above to this schema:

    with recursive transitive_dependencies2 (dependent_id, dependent, dependency_id, breadcrumbs) as (
        select p2.package_id as dependent_id
             , p2.name as dependent
             , p3.package_id as dependency_id
             , concat_ws(' > ', p2.name, p3.name) as breadcrumbs
        from dependencies2 as d0
        -- Dependent releases
        inner join releases2 as r1 on d0.release_id = r1.release_id
        -- Dependent packages
        inner join packages2 as p2 on r1.package_id = p2.package_id
        -- Dependencies packages
        inner join packages2 as p3 on d0.package_id = p3.package_id
        where r1.release_id = 5
        union
        select p2.package_id as dependent_id
             , p2.name as dependent
             , p3.package_id as dependency_id
             , concat_ws(' > ', p2.name, p3.name) as breadcrumbs
        from dependencies2 as d0
        -- Dependent releases
        inner join releases2 as r1 on d0.release_id = r1.release_id
        -- Dependent packages
        inner join packages2 as p2 on r1.package_id = p2.package_id
        -- Dependencies packages
        inner join packages2 as p3 on d0.package_id = p3.package_id
        inner join transitive_dependencies2 as t2 on t2.dependency_id = p2.package_id -- ← This is where we refer to the CTE
    )
    cycle dependent_id set is_cycle using path
    select t3.dependent_id
         , t3.dependent
         , t3.dependency_id
         , t3.breadcrumbs
    from transitive_dependencies2 as t3;

Quite unfortunately, this does not give the expected result:
dependent_id | dependent | dependency_id | breadcrumbs |
---|---|---|---|
5 | time | 1 | time > base |
5 | time | 3 | time > random |
5 | time | 4 | time > unix |
3 | random | 1 | random > base |
3 | random | 2 | random > vector |
4 | unix | 1 | unix > base |
2 | vector | 1 | vector > base |
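Comparing the two queries side by side, one difference stands out: the recursive term of the first query extends `t2.breadcrumbs`, whereas the second one rebuilds the breadcrumb from `p2.name` on every step and combines the terms with `union` instead of `union all`. The following is a minimal sketch of a variant that carries the breadcrumb forward. It assumes the `packages2`, `releases2` and `dependencies2` tables and sample data above, and it deliberately ignores `requirement`, following every release of a depended-upon package (a simplifying assumption that only holds while each package has a single release):

```sql
with recursive transitive_dependencies2 (dependent_id, dependent, dependency_id, breadcrumbs) as (
    -- Non-recursive term: direct dependencies of release 5
    select p2.package_id
         , p2.name
         , p3.package_id
         , concat_ws(' > ', p2.name, p3.name)
    from dependencies2 as d0
    inner join releases2 as r1 on d0.release_id = r1.release_id
    inner join packages2 as p2 on r1.package_id = p2.package_id
    inner join packages2 as p3 on d0.package_id = p3.package_id
    where r1.release_id = 5
    union all
    -- Recursive term: start from the rows found so far, hop from the
    -- depended-upon package to its releases, then to their dependencies
    select p2.package_id
         , p2.name
         , p3.package_id
         , concat_ws(' > ', t2.breadcrumbs, p3.name) -- extend the existing trail
    from transitive_dependencies2 as t2
    inner join releases2 as r1 on r1.package_id = t2.dependency_id
    inner join dependencies2 as d0 on d0.release_id = r1.release_id
    inner join packages2 as p2 on r1.package_id = p2.package_id
    inner join packages2 as p3 on d0.package_id = p3.package_id
)
cycle dependent_id set is_cycle using path
select t3.dependent_id
     , t3.dependent
     , t3.dependency_id
     , t3.breadcrumbs
from transitive_dependencies2 as t3;
```

With the sample data, this should produce breadcrumbs of the same shape as the first model (for example `time > random > vector > base`), though it is only a sketch and does not yet select which release satisfies each version expression.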
My question is as follows: how can I build my intuition for adapting a working CTE to a schema split over more granular tables? I'm still very new to all of this, and this is my first "real-world" use case for CTEs.
Happy to clarify or disambiguate things.
Through this, I'm also interested in best practices when it comes to data modelling. For instance, I was warned in the past against storing arrays of foreign keys, and advised to strive for normal forms and to split entities that have different life cycles.
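To illustrate the warning about arrays of foreign keys, here is a small sketch contrasting the two designs. All table and column names are hypothetical and exist only for this comparison:

```sql
-- Hypothetical tables, for illustration only.
create table pkg_example (
    package_id bigint generated always as identity primary key,
    name text unique not null
);

-- Variant A: array of foreign keys. PostgreSQL cannot declare a foreign key
-- on array elements, so nothing stops depends_on from holding dangling ids.
create table release_array_example (
    release_id bigint generated always as identity primary key,
    package_id bigint not null references pkg_example,
    depends_on bigint[] not null default '{}'
);

-- Variant B: one row per dependency. Each id is checked by a real
-- foreign key, and the rows can be joined directly in a recursive CTE.
create table release_example (
    release_id bigint generated always as identity primary key,
    package_id bigint not null references pkg_example
);
create table release_dependency_example (
    release_id bigint not null references release_example,
    package_id bigint not null references pkg_example,
    primary key (release_id, package_id)
);
```

Variant B is the shape used by the `dependencies` tables in this question; Variant A is shown only to make clear what the warning refers to.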