Quantcast
Channel: Active questions tagged cte - Database Administrators Stack Exchange
Viewing all articles
Browse latest Browse all 207

How can I efficiently traverse graph data with this pattern?

$
0
0

I have some relations embodying a directed acyclic graph that includes patterns similar to the following:

enter image description here

I'm looking for an efficient way to traverse this graph data. Here is an example of the seemingly simple task of counting descendants of node 0:

db<>fiddle

DROP TABLE IF EXISTS #edges;CREATE TABLE #edges(tail int, head int);INSERT INTO #edges(tail,head) VALUES    (0,1), (5, 6),  (10,11), (15,16),    (0,2), (5, 7),  (10,12), (15,17),    (1,2), (6, 7),  (11,12), (16,17),    (1,3), (7, 8),  (11,13), (17,18),    (2,3), (7, 9),  (12,13), (17,19),    (2,4), (8, 9),  (12,14), (18,19),    (3,4), (8,10),  (13,14),    (3,5), (9,10),  (13,15),    (4,5), (9,11),  (14,15),    (4,6),          (14,16);WITH descendents(node)AS(    SELECT 0 as node    UNION ALL    SELECT head as node FROM descendents as prior    JOIN #edges ON prior.node = tail)SELECT    (SELECT COUNT(node) FROM descendents)             as total_nodes,    (SELECT COUNT(node) FROM        (SELECT DISTINCT node FROM descendents) as d) as distinct_nodes

This results in the following:

total_nodes | distinct_nodes      10512 |             20

Every path is visited instead of each node once

total_nodes seems to grow at about 2^n where n is the number of nodes in the example. This is because every possible path is traversed rather than each node once. This n=29 example results in 1,305,729 total_nodes and took 75 seconds to complete on my local instance of SQL Server Express.

The obvious strategy is to only visit previously unvisited nodes on each iteration.

Excluding redundant additions using WHERE...NOT IN does not seem to be supported

The most direct method of preventing visits to previously visited nodes would seem to be to filter right in the recursive member of the CTE as follows:

SELECT head as node FROM descendents as priorJOIN #edges ON prior.node = tailWHERE head NOT IN (SELECT node from descendents)

This produces the error "Recursive member of a common table expression 'descendents' has multiple recursive references." I suspect this is an example of this restriction from the documenation:

The FROM clause of a recursive member must refer only one time to the CTE expression_name.

No success avoiding previously visited nodes in the CTE's recursive member

I've tried a few different techniques for avoiding previously visited nodes as recursion progresses in the CTE. I haven't found a technique that works with this graph pattern, however. The limitation seems to boil down to the following statement from the documentation:

...aggregate functions in the recursive part of the CTE are applied to the set for the current recursion level and not to the set for the CTE. Functions like ROW_NUMBER operate only on the subset of data passed to them by the current recursion level and not the entire set of data passed to the recursive part of the CTE.

Previously visited nodes appear at different recursion levels. The above limitation seems to rule out the deduplication across recursion levels that would be needed to make this query efficient.

Query Analysis

The following is from a run of a query run on 42 nodes invoked on my local instance of SQL Server Express. (Note that half the nodes are commented out in the fiddle because running with all 42 nodes causes the fiddle to timeout.)

The query was run as follows:

set statistics xml on;SET STATISTICS IO ON;SET STATISTICS TIME ON;WITH descendents(node)AS(    SELECT 0 as node    UNION ALL    SELECT head as node FROM descendents as prior    JOIN #edges ON prior.node = tail)SELECT DISTINCT    nodeFROM descendentsSET STATISTICS TIME OFF;SET STATISTICS IO OFF;set statistics xml off;

This resulted in the following messages:

Started executing query at Line 1(76 rows affected)

SQL Server parse and compile time:CPU time = 0 ms, elapsed time = 14 ms.(42 rows affected)

Table 'Worktable'. Scan count 2, logical reads 943777243, physical reads 0, read-ahead reads 12117, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#edges______________________________________________________________________________________________________________000000002516'. Scan count 1, logical reads 157296207, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

SQL Server Execution Times:CPU time = 2972344 ms, elapsed time = 3319271 ms.Total execution time: 00:55:19.424

Here is the execution plan:

execution plan at brentozar.com

And these are screenshots of the tooltips for what would seem to be the most time consuming steps:

Sort tooltipTable Scan tooltip

Possible alternatives

I can think of a few programmatic approaches that achieve node traversal without relying on CTE recursion. I'm reluctant to prematurely abandon CTEs given the abundance of advice to prefer set logic where possible. I expect that there are some built-in SQL Server features that would support some improved approaches. The other approaches I've contemplated are as follows:

I have limited or no experience with each of these approaches and so am interested in advice about the tradeoffs of these approaches for the above example.

Questions

  1. Is there a more efficient way to traverse the graph in the above example using recursive CTEs?
  2. Of the above alternatives, what are the tradeoffs?
  3. Are there any other alternatives I should consider?

Viewing all articles
Browse latest Browse all 207

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>