Reposted from: https://docs.citusdata.com/en/v7.5/articles/outer_joins.html

SQL is a very powerful language for analyzing and reporting against data. At the core of SQL is the idea of joins and how you combine various tables together. One such type of join, the outer join, is useful when we need to retain rows even if they have no match on the other side.

While the most common type of join, the inner join, against tables A and B returns only the tuples that have a match in both A and B, outer joins let us return all rows of, say, table A even if they have no corresponding match in table B. For example, let’s say you keep customers in one table and purchases in another. When you want to see all purchases of customers, you may want to see all customers in the result even if they have not made any purchases yet. For that you need an outer join. In this post we’ll look at what outer joins are, and then at how we support them in a distributed fashion on Citus.

Let’s say we have two tables, customer and purchase:

customer table:

 customer_id |      name
-------------+-----------------
           1 | Corra Ignacio
           3 | Warren Brooklyn
           2 | Jalda Francis

purchase table:

 purchase_id | customer_id | category |           comment
-------------+-------------+----------+------------------------------
        1000 |           1 | books    | Nice to Have!
        1001 |           1 | chairs   | Comfortable
        1002 |           2 | books    | Good Read, cheap price
        1003 |          -1 | hardware | Not very cheap
        1004 |          -1 | laptops  | Good laptop but expensive...

The following queries and their results help clarify the behavior of inner and outer joins:

SELECT customer.name, purchase.comment
FROM customer JOIN purchase ON customer.customer_id = purchase.customer_id
ORDER BY purchase.comment;

     name      |        comment
---------------+------------------------
 Corra Ignacio | Comfortable
 Jalda Francis | Good Read, cheap price
 Corra Ignacio | Nice to Have!

SELECT customer.name, purchase.comment
FROM customer INNER JOIN purchase ON customer.customer_id = purchase.customer_id
ORDER BY purchase.comment;

     name      |        comment
---------------+------------------------
 Corra Ignacio | Comfortable
 Jalda Francis | Good Read, cheap price
 Corra Ignacio | Nice to Have!

SELECT customer.name, purchase.comment
FROM customer LEFT JOIN purchase ON customer.customer_id = purchase.customer_id
ORDER BY purchase.comment;

      name       |        comment
-----------------+------------------------
 Corra Ignacio   | Comfortable
 Jalda Francis   | Good Read, cheap price
 Corra Ignacio   | Nice to Have!
 Warren Brooklyn |

SELECT customer.name, purchase.comment
FROM customer RIGHT JOIN purchase ON customer.customer_id = purchase.customer_id
ORDER BY purchase.comment;

     name      |           comment
---------------+------------------------------
 Corra Ignacio | Comfortable
 Jalda Francis | Good Read, cheap price
               | Good laptop but expensive...
 Corra Ignacio | Nice to Have!
               | Not very cheap

SELECT customer.name, purchase.comment
FROM customer FULL JOIN purchase ON customer.customer_id = purchase.customer_id
ORDER BY purchase.comment;

      name       |           comment
-----------------+------------------------------
 Corra Ignacio   | Comfortable
 Jalda Francis   | Good Read, cheap price
                 | Good laptop but expensive...
 Corra Ignacio   | Nice to Have!
                 | Not very cheap
 Warren Brooklyn |
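
To make the difference between the join types concrete, here is a minimal Python sketch that reproduces the row sets above in plain code. It is only an illustration of the semantics, not how PostgreSQL executes joins; the `join` helper and its `kind` argument are names made up for this example, and the rows are returned unordered.

```python
# Rows from the customer and purchase tables shown above.
customers = [(1, "Corra Ignacio"), (3, "Warren Brooklyn"), (2, "Jalda Francis")]
purchases = [
    (1000, 1, "Nice to Have!"),
    (1001, 1, "Comfortable"),
    (1002, 2, "Good Read, cheap price"),
    (1003, -1, "Not very cheap"),
    (1004, -1, "Good laptop but expensive..."),
]


def join(kind):
    """Return (name, comment) pairs for kind in {'inner', 'left', 'right', 'full'}.

    Hypothetical helper for illustration only.
    """
    rows = []
    matched_purchases = set()
    for customer_id, name in customers:
        hits = [p for p in purchases if p[1] == customer_id]
        for purchase_id, _, comment in hits:
            rows.append((name, comment))
            matched_purchases.add(purchase_id)
        # LEFT and FULL joins keep customers that have no purchase.
        if not hits and kind in ("left", "full"):
            rows.append((name, None))
    # RIGHT and FULL joins keep purchases that have no customer.
    if kind in ("right", "full"):
        for purchase_id, _, comment in purchases:
            if purchase_id not in matched_purchases:
                rows.append((None, comment))
    return rows
```

Calling `join("left")` keeps Warren Brooklyn with an empty comment, while `join("right")` keeps the two purchases whose customer_id is -1 with an empty name.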

Distributed Outer Joins with Citus

The Citus extension allows PostgreSQL to distribute big tables into smaller fragments called “shards”. Performing outer joins on these distributed tables becomes a bit more challenging, since the union of outer joins between individual shards does not always give the correct result. Currently, Citus supports distributed outer joins under the following criteria:

  • Outer joins should be between distributed (sharded) tables only, i.e. it is not possible to outer join a sharded table with a regular PostgreSQL table.
  • Join criteria should be on partition columns of the distributed tables.
  • The query should join the distributed tables on the equality of partition columns (table1.a = table2.a).
  • Shards of the distributed tables should match one to one, i.e. each shard of table A should overlap with one and only one shard from table B.
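
The last criterion can be sketched in a few lines of Python. The sketch assumes that each hash-distributed table splits a 32-bit hash space evenly into contiguous ranges, one per shard; the function names are hypothetical and this is not the check Citus itself runs.

```python
HASH_MIN, HASH_MAX = -2**31, 2**31 - 1  # assumed 32-bit signed hash space


def shard_ranges(shard_count):
    """Split the hash space into shard_count contiguous ranges of equal size."""
    step = 2**32 // shard_count
    ranges = []
    for i in range(shard_count):
        low = HASH_MIN + i * step
        # The last shard absorbs any remainder up to HASH_MAX.
        high = HASH_MAX if i == shard_count - 1 else low + step - 1
        ranges.append((low, high))
    return ranges


def overlaps(a, b):
    """True if the inclusive ranges a and b share at least one hash value."""
    return a[0] <= b[1] and b[0] <= a[1]


def matches_one_to_one(left_count, right_count):
    """True if every shard on each side overlaps exactly one shard on the other."""
    left, right = shard_ranges(left_count), shard_ranges(right_count)
    left_ok = all(sum(overlaps(lhs, rhs) for rhs in right) == 1 for lhs in left)
    right_ok = all(sum(overlaps(rhs, lhs) for lhs in left) == 1 for rhs in right)
    return left_ok and right_ok
```

Under these assumptions, `matches_one_to_one(4, 4)` holds, while `matches_one_to_one(4, 8)` does not, because each shard of the 4-shard table overlaps two shards of the 8-shard table.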

For example, let’s assume we have three hash-distributed tables X, Y and Z (user, purchase and comment in the statements below), where X and Y have 4 shards while Z has 8 shards.

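-- "user" is a reserved word in PostgreSQL, so the table name is quoted.
-- citus.shard_count sets how many shards create_distributed_table creates.
SET citus.shard_count = 4;

CREATE TABLE "user" (user_id int, name text);
SELECT create_distributed_table('user', 'user_id');

CREATE TABLE purchase (user_id int, amount int);
SELECT create_distributed_table('purchase', 'user_id');

SET citus.shard_count = 8;

CREATE TABLE comment (user_id int, comment text, rating int);
SELECT create_distributed_table('comment', 'user_id');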

The following query would work since distributed tables user and purchase have the same number of shards and the join criteria is equality of partition columns:

SELECT * FROM "user" LEFT OUTER JOIN purchase ON "user".user_id = purchase.user_id;

The following queries are not supported out of the box:

-- user and comment tables don’t have the same number of shards:
SELECT * FROM "user" LEFT OUTER JOIN comment ON "user".user_id = comment.user_id;

-- join condition is not on the partition columns:
SELECT * FROM "user" LEFT OUTER JOIN purchase ON "user".user_id = purchase.amount;

-- join condition is not equality:
SELECT * FROM "user" LEFT OUTER JOIN purchase ON "user".user_id < purchase.user_id;

How Citus Processes OUTER JOINs

When one-to-one matching between shards exists, performing an outer join on large tables is equivalent to combining the outer join results of corresponding shards.
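
The equivalence can be checked with a toy model in Python. Here rows are placed into shards by key modulo the shard count, which is only a stand-in for real hash distribution, and shard i of one table is always paired with shard i of the other; all function and variable names are made up for this sketch.

```python
def full_join(left, right):
    """FULL JOIN two lists of (key, value) rows on equality of key."""
    rows = []
    matched_right = set()
    for key, left_value in left:
        hit = False
        for index, (right_key, right_value) in enumerate(right):
            if key == right_key:
                rows.append((key, left_value, right_value))
                matched_right.add(index)
                hit = True
        if not hit:
            rows.append((key, left_value, None))
    # Keep right-hand rows that found no partner.
    for index, (right_key, right_value) in enumerate(right):
        if index not in matched_right:
            rows.append((right_key, None, right_value))
    return rows


def split(rows, shard_count):
    """Toy distribution: place each row by key modulo shard_count."""
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[row[0] % shard_count].append(row)
    return shards


def sharded_full_join(left, right, left_shards, right_shards):
    """Join shard i of left with shard i of right and concatenate the results."""
    pairs = zip(split(left, left_shards), split(right, right_shards))
    return [
        row
        for left_shard, right_shard in pairs
        for row in full_join(left_shard, right_shard)
    ]


table1 = [(1, "a1"), (2, "a2"), (5, "a5")]
table2 = [(1, "b1"), (5, "b5"), (6, "b6")]

# Sort by repr so rows containing None can be compared.
expected = sorted(full_join(table1, table2), key=repr)
colocated = sorted(sharded_full_join(table1, table2, 4, 4), key=repr)
mismatched = sorted(sharded_full_join(table1, table2, 4, 8), key=repr)
```

With 4 shards on both sides, `colocated` equals `expected`. With 4 shards against 8, the rows for keys 5 and 6 of table2 land in shards that are never paired with their counterparts, so `mismatched` differs from `expected`.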

Let’s look at how Citus handles an outer join query:

SELECT table1.a, table1.b AS b1, table2.b AS b2, table3.b AS b3, table4.b AS b4
FROM table1
FULL JOIN table2 ON table1.a = table2.a
FULL JOIN table3 ON table1.a = table3.a
FULL JOIN table4 ON table1.a = table4.a;

First, the query goes through the standard PostgreSQL planner, and Citus uses the resulting plan to generate a distributed plan, performing various checks on whether Citus supports the query. Then Citus generates the individual queries that will go to the workers for the distributed table fragments:

SELECT table1.a, table1.b AS b1, table2.b AS b2, table3.b AS b3, table4.b AS b4
FROM (((table1_102359 table1
FULL JOIN table2_102363 table2 ON ((table1.a = table2.a)))
FULL JOIN table3_102367 table3 ON ((table1.a = table3.a)))
FULL JOIN table4_102371 table4 ON ((table1.a = table4.a))) WHERE true

SELECT table1.a, table1.b AS b1, table2.b AS b2, table3.b AS b3, table4.b AS b4
FROM (((table1_102360 table1
FULL JOIN table2_102364 table2 ON ((table1.a = table2.a)))
FULL JOIN table3_102368 table3 ON ((table1.a = table3.a)))
FULL JOIN table4_102372 table4 ON ((table1.a = table4.a))) WHERE true

SELECT table1.a, table1.b AS b1, table2.b AS b2, table3.b AS b3, table4.b AS b4
FROM (((table1_102361 table1
FULL JOIN table2_102365 table2 ON ((table1.a = table2.a)))
FULL JOIN table3_102369 table3 ON ((table1.a = table3.a)))
FULL JOIN table4_102373 table4 ON ((table1.a = table4.a))) WHERE true

SELECT table1.a, table1.b AS b1, table2.b AS b2, table3.b AS b3, table4.b AS b4
FROM (((table1_102362 table1
FULL JOIN table2_102366 table2 ON ((table1.a = table2.a)))
FULL JOIN table3_102370 table3 ON ((table1.a = table3.a)))
FULL JOIN table4_102374 table4 ON ((table1.a = table4.a))) WHERE true

The resulting queries may seem complex at first, but you can see that they are actually the same as the original query; only the table names are a bit different. This is because Citus stores the data in standard PostgreSQL tables called shards, each named after its table followed by an underscore and the shard ID (for example, table1_102359). With one-to-one matching of shards, the distributed outer join is equivalent to the union of all outer joins of individual matching shards. In many cases you don’t even have to think about this, as Citus simply takes care of it for you. If you’re sharding on some shared id, as is common in certain use cases, then Citus will do the join on the appropriate node without any inter-worker communication.
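
As a rough illustration of the renaming, the following Python sketch rewrites a two-table version of the query into one query per shard. It works on the query text with a regular expression purely for brevity, whereas Citus works from the planner’s output as described earlier; `worker_queries` is a hypothetical helper, and the shard IDs are the ones from the worker queries shown above.

```python
import re

QUERY = (
    "SELECT table1.a, table1.b AS b1, table2.b AS b2 "
    "FROM table1 FULL JOIN table2 ON table1.a = table2.a"
)

# Shard IDs taken from the worker queries shown above, in shard order.
SHARD_IDS = {
    "table1": [102359, 102360, 102361, 102362],
    "table2": [102363, 102364, 102365, 102366],
}


def worker_queries(query, shard_ids):
    """Build one query per shard index by renaming tables after FROM/JOIN.

    Hypothetical helper for illustration only.
    """
    shard_count = len(next(iter(shard_ids.values())))
    queries = []
    for index in range(shard_count):
        rewritten = query
        for table, ids in shard_ids.items():
            # Rename only the table reference after FROM/JOIN and keep the
            # original name as an alias so column references still resolve.
            rewritten = re.sub(
                rf"\b(FROM|JOIN) {table}\b",
                rf"\1 {table}_{ids[index]} {table}",
                rewritten,
            )
        queries.append(rewritten)
    return queries
```

Each returned query joins one shard of table1 with the matching shard of table2, for example `table1_102359` with `table2_102363`, and leaves the column references untouched because the original table names survive as aliases.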

We hope you found this insight into how we perform distributed outer joins valuable. If you’re curious about trying Citus or learning more about how it works, we encourage you to join the conversation with us on Slack.

 
 
 
 
