Free Join: A Practical WCOJ?

The Big Idea

WCOJ: NPRR

Question: Can n-way joins do better than binary joins?

Answer: NPRR (Ngo, Porat, Re’, Rudra, PODS 2012). Yes, sometimes!

Background: AGM

Evolution

Let’s Revisit Binary join

Here is a standard “query plan” using binary joins and a Hashjoin-like scheme where we index the right relation on the fly (build a hashtable) and scan the left relation to probe the index. I like drawing query plans because they can be read either as (a) parse trees over collection types (relations), or (b) dataflow implementations that iterate over tuples. To me this is a nice middle-ground between declarative (SQL) and imperative (for loops).

flowchart TD
direction LR
classDef attr fill:#9f9,stroke:#333,stroke-width:2px;
classDef tuple fill:#f9f,stroke:#333,stroke-width:2px;
  subgraph BJ["Binary Join"]
    direction BT
    R[R] --> j1[⨝_IX] --> j2[⨝_IX]
    S[S] --> SIx{S} -->j1
    T[T] --> TIx{T} --> j2
  end

An Aside: Column Stores and Datalog Variables

In the late 1990s, early 2000s, the DB community got excited about optimizing big warehouse queries by storing tables 1 column per file (“columnar”) rather than all columns in a file (row storage). Various benefits:

A nice, simple WCOJ: Generic Join

Generic Join is a WCOJ, meaning it reaches the inherent asymptotic complexity of producing the join (you can’t do better in big-O terms). It’s kind of like Yannakakis but it works for any query.

Generic join works by ordering the query variables rather than the tables in the query.

An Aside: The Triangle Query

The simplest cyclic query is the “triangle” query: R(x, y) ⨝ S(y, z) ⨝ T(x, z), i.e.

SELECT *
  FROM R, S, T
 WHERE R.to = S.from
   AND S.to = T.from
   AND T.to = R.from

This is the general version of actually finding triangles in graph, which is the self-join version:

Back to Generic Join

I prefer to think of this as dataflow. Obviously is a simple binary operator (a join algorithm). The array indexing (S[a], etc) is literally Index Join of the loop variable values (bindings of a) with an index on another relation(S). Given that the loop variable is “a set of variable bindings”, this is actually semi-join!

This gives us a “query plan” of binary operators, alternating between project/intersect and index join. The remaining trick is to put it together at the output, for which we may want to carry around our loop indexes. The pseudocode above doesn’t need the loop indexes, it gets them “for free” via control-flow, but we can make the loop index explicit and do it via Merge Join.

flowchart TD
direction LR
classDef attr fill:#9f9,stroke:#333,stroke-width:2px;
classDef tuple fill:#f9f,stroke:#333,stroke-width:2px;
  subgraph BJ["Binary Join"]
    direction BT
    R[R] --> j1[⨝_IX] --> j2[⨝_IX]
    S[S] --> SIx{S:y=>z} -->j1
    T[T] --> TIx{T:z=>x} --> j2
  end
  subgraph BJ'["Binary Join sideways. Forces mermaid to shrink the GJ diagram to 1 page"]
    direction LR
    R'[R] --> j1'[⨝_IX] --> j2'[⨝_IX]
    S'[S] --> SIx'{S:y=>z} -->j1'
    T'[T] --> TIx'{T:z=>x} --> j2'
  end
  subgraph GJ["Generic Join for QΔ"]
    direction BT
    subgraph Sub1["x bindings"]
        Rgj[R] --> p1["𝜋_x"] --> i1["∩"]
        Tgj[T] --> p2[𝜋_x] --> i1
        i1 --> e1[enumerate] -- (i_x<=>x) --> agj((a))
    end
    subgraph Sub2["semijoin on x bindings"]
        Rgj --> RgjIx{R:x=>y}
        agj --> gj1[⨝_IX]
        RgjIx --> gj1 -- (i_x=>(x,y)) --> rgj(r)
        agj --> gj2[⨝_IX]
        Tgj --> TgjIx{T:x=>z} --> gj2[⨝_IX] -- (i_x=>(x,z)) --> tgj(t)
    end
    subgraph Sub3["y bindings (filtered by x)"]
        rgj --> p3["𝜋_(i_x,y)"] --> i2[semijoin]
        Sgj[S] --> p4[𝜋_y] --> i2 
        i2 --> e2[enumerate_y] -- ((i_x, i_y)=>y) --> bgj((b))
    end
    subgraph Sub4["semijoins on y bindings"]
        bgj --> gj3[⨝_IX]
        Sgj --> SgjIx{S:y=>z} --> gj3 -- ((i_x, i_y)=>(y=>z)) --> sgj(s)
    end
    subgraph Sub5["z bindings (filtered by x, y)"]
        sgj --> p5["𝜋_(i_x,i_y,z)"] -- ((i_x, i_y)=>z) --> i3[semijoin]
        tgj -- (i_x, z) --> p6["𝜋_(i_x,z)"] --> i3 
        i3 --> cgj((c))
    end
    agj -- (i_x<>=>x) --> align[⨝_MJ]
    bgj -- ((i_x, i_y)=>y) --> align
    cgj -- ((i_x, i_y)=>z) --> align
    class agj,bgj,cgj attr;
    class r,s,t tuple;
  end

FreeJoin

Note some funky things above:

Aside: Generalized Hash Trie: Column Stores meet Hashtables

FreeJoin Plan

Generalize binary and generic joins!

Execution is a recursive algorithm.

Optimizing a FreeJoin Plan

binary2fj(bin_plan):

To optimize, pass information sideways! - i.e. pull any probeable subgoals into earlier nodes - need to rethink the GHTs as a result

COLT: Column-Oriented Lazy Trie - all columns stored separately, as in columnstore - OUT OF TIME TO PREP

First, let’s look at a new query, the “clover” query: Q♣(x,a,b,c) = R(x, a), S(x, b), T(x, c), i.e.:

SELECT *
  FROM R, S, T
 WHERE R.1 = S.1
   AND S.1 = T.1
flowchart TD
direction LR
classDef attr fill:#9f9,stroke:#333,stroke-width:2px;
classDef tuple fill:#f9f,stroke:#333,stroke-width:2px;

subgraph GFJ["Generic Free Join for Q♣"]
    direction BT
    subgraph prep[" "]
        Rgfj
    end
        Rgfj[R] --> pgfj1["𝜋_x"] --> RgfjIx{R.x}
    subgraph fora[" "]
        Rgfj --> gfj6[⨝_IX]
        RgfjIx --> gfj6 -- (x=>a) --> agfj((a))
    end
    subgraph forb[" "]
        Rgfj --> pgfj2["𝜋_x"] --> gfj2[⨝_IX]
        Sgfj[S] --> SgfjIx{S.x} --> gfj2 -- (x=>b) --> bgfj((b))
    end
    subgraph forc[" "]
        Rgfj --> pgfj3["𝜋_x"] --> gfj3[⨝_IX]
        Tgfj[T] --> TgfjIx{T.x} --> gfj3 -- (x=>c) --> cgfj((c))
    end
    agfj --> gfj4[⨝_MJ]
    bgfj --> gfj4
    gfj4 --> gfj5[⨝_MJ]
    cgfj  --> gfj5

    class agfj,bgfj,cgfj attr;
    class r,s,t attr;
end
subgraph BFJ["Binary Join for Q♣"]
    direction BT
    Rbfj[R] --> bfj4[⨝_IX] --> bfj5[⨝_IX]
    Sbfj[S] --> SbfjIx{S: x=>b} --> bfj4
    Tbfj[T] --> TbfjIx{T: y=>c} --> bfj5

    class a1,b1,c1 attr;
    class r1,s1,t1 attr;
end
subgraph FFJ["Factorized Free Join for Q♣"]
    direction BT
    subgraph prep2[" "]
        Rffj[R] --> ffj1[⨝_IX]
        Sffj[S] --> SffjIx{S: x=>b} --> ffj1 -- ((x,a)=>b) --> s2((s))
        Rffj --> ffj2[⨝_IX]
        Tffj[T]  --> TffjIx{T: x=>c} --> ffj2 -- ((x,a)=>c) --> t2((t))
    end
    s2 --> ffj3[⨝_MJ]
    t2 --> ffj3
    %% ffj3 --> ffj4[⨝_NL]
    %% t2 --> ffj4
    class s2,t2 attr;
end