Free Join: A Practical WCOJ?
The Big Idea
 Traditional query processing implements declarative queries (SQL/Datalog/etc.) using binary join operators from relational algebra.
 In principle, we could have n-way join operators.
 The query processor (dataflow engine) doesn’t mind
 The query optimizer (a la Cascades) doesn’t mind
 In fact, we saw this with eddies, esp. with SteMs/STAIRS.
 That is an n-way join
 And we had no real trouble hacking it into PostgreSQL
WCOJ: NPRR
Question: Can n-way joins do better than binary joins?
Answer: NPRR (Ngo, Porat, Ré, Rudra, PODS 2012). Yes, sometimes!
 Even better, they are optimal in a big-O (worst-case) sense
 Even in cases where no sequence of binary joins is optimal.
Background: AGM
 How big can the output of an equijoin of 3 tables of size N be?
 AGM bound: Atserias, Grohe, Marx, FOCS 2008: tight bound N^(3/2) on output cardinality of a “full” conjunctive query (every column is in the output), given input/output cardinalities <= N.
 How bad can the output of a binary sub-join of 2 tables be? AHA!
 We’ve seen that Yannakakis/LIP avoids this!
 How well does it do??
 But Yannakakis is only for “acyclic” queries
 Acyclic queries have a rather technical definition. (I like Paris Koutris’ lecture notes; see lectures 4 and 5.)
 For example, the Triangle Query is cyclic. More on this below.
 Is there an n-way join algorithm that achieves the bound in general?
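As a concrete sanity check (my own illustration, not from the lecture): the classic instance showing the AGM bound is tight for the triangle query takes each relation to be the complete bipartite graph on √N × √N values, so each input has N tuples and every (x, y, z) combination is a triangle:

```python
from itertools import product
import math

N = 16
k = math.isqrt(N)  # sqrt(N) distinct values per variable

# Each relation is all k x k pairs: exactly N tuples each.
R = set(product(range(k), range(k)))  # R(x, y)
S = set(product(range(k), range(k)))  # S(y, z)
T = set(product(range(k), range(k)))  # T(x, z)

# Join R, S, T on shared variables; every combination survives,
# so the output has k^3 = N^(3/2) tuples.
out = [(x, y, z)
       for (x, y) in R
       for (y2, z) in S if y2 == y
       if (x, z) in T]
assert len(R) == N
assert len(out) == int(N ** 1.5)  # 64 triangles for N = 16
```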
Evolution

The NPRR paper was hard to read (Chris Ré told me not to assign it to PhD students), and the constructive algorithm wasn’t practical.

Coincidentally (??), Todd Veldhuizen at LogicBlox invented a very understandable join algorithm called Leapfrog Trie Join, and then realized (because he was hanging out with Hung Ngo?) that it is a WCOJ!
 NPRR → Leapfrog Trie Join → Generic Join → Free Join.
 We’re on the road to clearing up this whole business.
 I think there’s one or two more steps to go!
Let’s Revisit Binary join
 Start by assuming equijoin queries, with Hash Join implementation in memory
 Consider the query R(a, b) ⨝ S(b, c), i.e. SELECT * FROM R, S WHERE R.b = S.b
Here is a standard “query plan” using binary joins and a Hash-join-like scheme where we index the right relation on the fly (build a hashtable) and scan the left relation to probe the index. I like drawing query plans because they can be read either as (a) parse trees over collection types (relations), or (b) dataflow implementations that iterate over tuples. To me this is a nice middle ground between declarative (SQL) and imperative (for loops).
flowchart TD
direction LR
classDef attr fill:#9f9,stroke:#333,stroke-width:2px;
classDef tuple fill:#f9f,stroke:#333,stroke-width:2px;
subgraph BJ["Binary Join"]
direction BT
R[R] --> j1[⨝_IX] --> j2[⨝_IX]
S[S] --> SIx{S} --> j1
T[T] --> TIx{T} --> j2
end
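The build-right / probe-left scheme in the plan above can be sketched in a few lines of Python (a minimal in-memory illustration, not the lecture’s code):

```python
from collections import defaultdict

def hash_join(R, S):
    """R(a, b) ⨝ S(b, c): build a hashtable on S's join key, probe with R."""
    # build phase: index the right relation S on b
    ix = defaultdict(list)
    for (b, c) in S:
        ix[b].append(c)
    # probe phase: scan the left relation R and look up matches
    for (a, b) in R:
        for c in ix.get(b, []):
            yield (a, b, c)

R = [(1, 10), (2, 20)]
S = [(10, 'x'), (10, 'y'), (30, 'z')]
assert list(hash_join(R, S)) == [(1, 10, 'x'), (1, 10, 'y')]
```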
An Aside: Column Stores and Datalog Variables
In the late 1990s, early 2000s, the DB community got excited about optimizing big warehouse queries by storing tables 1 column per file (“columnar”) rather than all columns in a file (row storage). Various benefits:
 Projection pushdown into storage (scan only columns you need)
 Easy to compress/decompress columnar storage (single data type)
 Easy to “vectorize” (i.e. batch) process in query plans:
 a batch of column values fits in cache at once
 no pointer-chasing or arithmetic to access a column within a tuple
 We’ll read more about this later in the semester
 Column-stores are typically viewed by an optimizer as some kind of “materialized view” of a projection query.
 The DB theory literature focuses heavily on join variables, which are apparent in Datalog notation. Thinking at the level of variables rather than relations is common in the PODS literature, including Yannakakis and WCOJ.
Output(x, y, z) :- R(x, y), S(y, z)
 Equivalent to SELECT * FROM R, S WHERE R.2 = S.1;
 x, y, z are “variables”. Shared variables express equijoin.
 The FreeJoin paper is the first I’ve seen to put these ideas together usefully!
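To make the columnar layout from the aside concrete (a toy illustration of the projection-pushdown benefit, my own example):

```python
# Row layout: one tuple per entry for R(a, b, c).
rows = [(1, 'x', 3.0), (2, 'y', 4.0), (3, 'z', 5.0)]

# Columnar layout: one array per column, positions aligned across arrays.
cols = {"a": [1, 2, 3],
        "b": ['x', 'y', 'z'],
        "c": [3.0, 4.0, 5.0]}

# SELECT a FROM R touches only cols["a"] in the columnar layout,
# while the row layout must scan (and decode) every full tuple.
assert cols["a"] == [r[0] for r in rows]
# Single-typed arrays are also what makes compression and
# cache-friendly batch ("vectorized") processing easy.
```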
A nice, simple WCOJ: Generic Join
Generic Join is a WCOJ, meaning its runtime matches the worst-case (AGM) bound on output size asymptotically — you can’t do better in big-O terms. It’s kind of like Yannakakis, but it works for any query.
 I’d like to connect Generic Join and Yannakakis directly. Didn’t happen before today’s lecture!
Generic join works by ordering the query variables rather than the tables in the query.
An Aside: The Triangle Query
The simplest cyclic query is the “triangle” query: R(x, y) ⨝ S(y, z) ⨝ T(x, z), i.e.
SELECT *
FROM R, S, T
WHERE R.to = S.from
AND S.to = T.from
AND T.to = R.from
This is the general version of actually finding triangles in a graph, which is the self-join version:
Edge(x, y) ⨝ Edge(y, z) ⨝ Edge(x, z)
, i.e. SELECT * FROM Edge AS R, Edge AS S, Edge AS T WHERE R.to = S.from AND S.to = T.from AND T.to = R.from
Back to Generic Join
 Pick an order like x, y, z, and alternate phases of projection/intersection and lookup:

    for a in R.x ∩ T.x:
        r = R[a]; t = T[a]
        for b in r.y ∩ S.y:
            s = S[b]
            for c in s.z ∩ t.z:
                output(a, b, c)
 For our purposes below, let’s use an enumerate to make the loop indexes explicit:

    for (i_x, a) in enumerate(R.x ∩ T.x):
        r = R[a]; t = T[a]
        for (i_y, b) in enumerate(r.y ∩ S.y):
            s = S[b]
            for (i_z, c) in enumerate(s.z ∩ t.z):
                # at loop index (i_x, i_y, i_z)
                output(a, b, c)
I prefer to think of this as dataflow. Obviously ∩ is a simple binary operator (a join algorithm). The array indexing (S[b], etc.) is literally an Index Join of the loop variable values (bindings of a) with an index on another relation (S). Given that the loop variable is “a set of variable bindings”, this is actually a semijoin!
This gives us a “query plan” of binary operators, alternating between project/intersect and index join. The remaining trick is to put it together at the output, for which we may want to carry around our loop indexes. The pseudocode above doesn’t need the loop indexes — it gets them “for free” via control flow — but we can make the loop index explicit and do it via Merge Join.
 Note: due to the extra loop indexes, intersect (∩) has to become semijoin. Might that be better than the for loops?!
flowchart TD
direction LR
classDef attr fill:#9f9,stroke:#333,stroke-width:2px;
classDef tuple fill:#f9f,stroke:#333,stroke-width:2px;
subgraph BJ["Binary Join"]
direction BT
R[R] --> j1[⨝_IX] --> j2[⨝_IX]
S[S] --> SIx{S:y=>z} --> j1
T[T] --> TIx{T:z=>x} --> j2
end
subgraph BJ'["Binary Join sideways. Forces mermaid to shrink the GJ diagram to 1 page"]
direction LR
R'[R] --> j1'[⨝_IX] --> j2'[⨝_IX]
S'[S] --> SIx'{S:y=>z} --> j1'
T'[T] --> TIx'{T:z=>x} --> j2'
end
subgraph GJ["Generic Join for QΔ"]
direction BT
subgraph Sub1["x bindings"]
Rgj[R] --> p1["𝜋_x"] --> i1["∩"]
Tgj[T] --> p2[𝜋_x] --> i1
i1 --> e1[enumerate] -- "(i_x=>x)" --> agj((a))
end
subgraph Sub2["semijoin on x bindings"]
Rgj --> RgjIx{R:x=>y}
agj --> gj1[⨝_IX]
RgjIx --> gj1 -- "(i_x=>(x,y))" --> rgj(r)
agj --> gj2[⨝_IX]
Tgj --> TgjIx{T:x=>z} --> gj2[⨝_IX] -- "(i_x=>(x,z))" --> tgj(t)
end
subgraph Sub3["y bindings (filtered by x)"]
rgj --> p3["𝜋_(i_x,y)"] --> i2[semijoin]
Sgj[S] --> p4[𝜋_y] --> i2
i2 --> e2[enumerate_y] -- "((i_x, i_y)=>y)" --> bgj((b))
end
subgraph Sub4["semijoins on y bindings"]
bgj --> gj3[⨝_IX]
Sgj --> SgjIx{S:y=>z} --> gj3 -- "((i_x, i_y)=>(y=>z))" --> sgj(s)
end
subgraph Sub5["z bindings (filtered by x, y)"]
sgj --> p5["𝜋_(i_x,i_y,z)"] -- "((i_x, i_y)=>z)" --> i3[semijoin]
tgj -- "(i_x, z)" --> p6["𝜋_(i_x,z)"] --> i3
i3 --> cgj((c))
end
agj -- "(i_x=>x)" --> align[⨝_MJ]
bgj -- "((i_x, i_y)=>y)" --> align
cgj -- "((i_x, i_y)=>z)" --> align
class agj,bgj,cgj attr;
class r,s,t tuple;
end
FreeJoin
Note some funky things above:
 The pseudocode kept “projecting” in order to intersect (a = R.x ∩ T.x, etc.), and then “re-joining” via array index (T[a], etc.). In the dataflow version, we had to carry around the enumerated loop index counters (i_x, etc.) to emulate the for-loop indexes. Can’t we keep the right variables around as we go?
 FreeJoin does this.
Aside: Generalized Hash Trie: Column Stores meet Hashtables
 Consider table R(x, y).
 We could store the column R.x, an array of distinct x values.
 We could do the same for R.y, which is ordered to zip with R.x.
  That is, R.y[k] is the y value associated with R.x[k].
 OR we could have R.x be a hashmap from x to a range of offsets in R.y.
  or no need for an R.y file: store the R.y subarray for each x in its own location.
 Do this recursively for some column ordering; this gives you a hash trie.
  For SQL, the rightmost column is a hashmap from value to multiplicity (# of dups of that tuple).
 This looks awfully like our enumerated loop indexes above!!
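A toy version of the structure just described, for R(x, y) with column order [x, y] (my sketch; the paper’s GHT is lazier and more general): the first level maps each distinct x to the sub-structure for y, and the rightmost level maps each value to its multiplicity.

```python
from collections import defaultdict

def build_ght(R):
    """Two-level generalized hash trie for a binary relation R(x, y)."""
    trie = defaultdict(lambda: defaultdict(int))
    for (x, y) in R:
        trie[x][y] += 1  # rightmost level: value -> # of duplicate tuples
    return trie

R = [(1, 'a'), (1, 'b'), (2, 'a'), (1, 'a')]
ght = build_ght(R)
assert set(ght.keys()) == {1, 2}   # first level: the distinct x values
assert ght[1]['a'] == 2            # duplicate tuple (1, 'a') counted twice
assert ght[2]['a'] == 1
```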
FreeJoin Plan
Generalize binary and generic joins!
 A list of lists like [[R(x), S(x), T(x)], [R(a)], [S(b)], [T(c)]]
 Inner list is called a “node”
 Consists of “subatoms”, i.e. single-table projections
 Constraints:
 Subatoms in a FreeJoin plan form a partition of the attributes of each table
 First element of a “node” must be a cover
 it contains all variables to the right of that node that do not appear in previous nodes
 think of a cover as a set of variables that joins two subqueries
 the previous nodes are a subquery
 the remaining subatoms in this node are a subquery
 the cover ensures we don’t have a cross-product
 because we have covers, we will not need enumerated loop indexes!
Execution is a recursive algorithm.
 Hold that thought – we’ll look at dataflow versions below.
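As a preview, here is a minimal executor for the Free Join plan [[R(x), S(x), T(x)], [R(a)], [S(b)], [T(c)]] over the clover query (my own code, flattened and specialized to this one plan rather than the paper’s general recursive algorithm): the first node iterates R’s x values and probes S and T; the remaining singleton nodes just enumerate the stored subcolumns.

```python
from collections import defaultdict

def trie(rel):
    """Two-level trie for a binary relation, keyed on its first column."""
    t = defaultdict(list)
    for k, v in rel:
        t[k].append(v)
    return t

def clover_free_join(R, S, T):
    """Q♣(x,a,b,c) = R(x,a), S(x,b), T(x,c)."""
    Rt, St, Tt = trie(R), trie(S), trie(T)
    out = []
    # node [R(x), S(x), T(x)]: iterate R's x values, probe S and T
    for x in Rt:
        if x in St and x in Tt:
            # nodes [R(a)], [S(b)], [T(c)]: enumerate the subcolumns
            for a in Rt[x]:
                for b in St[x]:
                    for c in Tt[x]:
                        out.append((x, a, b, c))
    return out

R = [(1, 'a1'), (2, 'a2')]
S = [(1, 'b1'), (1, 'b2')]
T = [(1, 'c1')]
assert clover_free_join(R, S, T) == [(1, 'a1', 'b1', 'c1'),
                                     (1, 'a1', 'b2', 'c1')]
```

Note the factorization: x = 2 is filtered out once in the first node, without enumerating its a values.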
Optimizing a FreeJoin Plan
binary2fj(bin_plan):

    current_node = [leftmost table]
    for each inner table X in the binary plan:
        for previously-bound attrs [a] of X, form X.[a] and append to current_node
        add current_node to plan
        current_node = [X − X.[a]]
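A runnable sketch of the binary2fj conversion above (my own data representation — subatoms as (table, attrs) pairs — since the pseudocode doesn’t fix one):

```python
def binary2fj(leftmost, inner_tables, schema, join_attrs):
    """
    leftmost: name of the leftmost (probe-side) table.
    inner_tables: the inner (build-side) tables, in join order.
    schema[t]: list of attributes of table t.
    join_attrs[t]: attrs of t bound by earlier tables (the probe key).
    """
    plan = []
    current_node = [(leftmost, schema[leftmost])]
    for X in inner_tables:
        bound = join_attrs[X]
        current_node.append((X, bound))        # probe X on its bound attrs
        plan.append(current_node)
        rest = [a for a in schema[X] if a not in bound]
        current_node = [(X, rest)]             # X minus X.[a] starts next node
    plan.append(current_node)
    return plan

# Left-deep binary plan for the clover query: R ⨝ S ⨝ T, all joining on x.
schema = {"R": ["x", "a"], "S": ["x", "b"], "T": ["x", "c"]}
plan = binary2fj("R", ["S", "T"], schema, {"S": ["x"], "T": ["x"]})
assert plan == [[("R", ["x", "a"]), ("S", ["x"])],
                [("S", ["b"]), ("T", ["x"])],
                [("T", ["c"])]]
```

Note how the result interleaves probing and enumeration per table — a binary-join-shaped Free Join plan, in contrast to the generic-join-shaped plan [[R(x), S(x), T(x)], [R(a)], [S(b)], [T(c)]].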
To optimize, pass information sideways!
 i.e. pull any probe-able subgoals into earlier nodes
 need to rethink the GHTs as a result
COLT: Column-Oriented Lazy Trie
 all columns stored separately, as in a column-store
 OUT OF TIME TO PREP
 Open discussion!
First, let’s look at a new query, the “clover” query: Q♣(x,a,b,c) = R(x, a), S(x, b), T(x, c), i.e.:
SELECT *
FROM R, S, T
WHERE R.1 = S.1
AND S.1 = T.1
flowchart TD
direction LR
classDef attr fill:#9f9,stroke:#333,stroke-width:2px;
classDef tuple fill:#f9f,stroke:#333,stroke-width:2px;
subgraph GFJ["Generic Free Join for Q♣"]
direction BT
subgraph prep[" "]
Rgfj
end
Rgfj[R] --> pgfj1["𝜋_x"] --> RgfjIx{R.x}
subgraph fora[" "]
Rgfj --> gfj6[⨝_IX]
RgfjIx --> gfj6 -- "(x=>a)" --> agfj((a))
end
subgraph forb[" "]
Rgfj --> pgfj2["𝜋_x"] --> gfj2[⨝_IX]
Sgfj[S] --> SgfjIx{S.x} --> gfj2 -- "(x=>b)" --> bgfj((b))
end
subgraph forc[" "]
Rgfj --> pgfj3["𝜋_x"] --> gfj3[⨝_IX]
Tgfj[T] --> TgfjIx{T.x} --> gfj3 -- "(x=>c)" --> cgfj((c))
end
agfj --> gfj4[⨝_MJ]
bgfj --> gfj4
gfj4 --> gfj5[⨝_MJ]
cgfj --> gfj5
class agfj,bgfj,cgfj attr;
class r,s,t attr;
end
subgraph BFJ["Binary Join for Q♣"]
direction BT
Rbfj[R] --> bfj4[⨝_IX] --> bfj5[⨝_IX]
Sbfj[S] --> SbfjIx{S: x=>b} --> bfj4
Tbfj[T] --> TbfjIx{T: x=>c} --> bfj5
class a1,b1,c1 attr;
class r1,s1,t1 attr;
end
subgraph FFJ["Factorized Free Join for Q♣"]
direction BT
subgraph prep2[" "]
Rffj[R] --> ffj1[⨝_IX]
Sffj[S] --> SffjIx{S: x=>b} --> ffj1 -- "((x,a)=>b)" --> s2((s))
Rffj --> ffj2[⨝_IX]
Tffj[T] --> TffjIx{T: x=>c} --> ffj2 -- "((x,a)=>c)" --> t2((t))
end
s2 --> ffj3[⨝_MJ]
t2 --> ffj3
%% ffj3 --> ffj4[⨝_NL]
%% t2 --> ffj4
class s2,t2 attr;
end