Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semi join #103

Open
lucasxteixeira opened this issue Aug 6, 2023 · 1 comment
Open

Semi join #103

lucasxteixeira opened this issue Aug 6, 2023 · 1 comment

Comments

@lucasxteixeira
Copy link

Hi,

I recently encountered an issue while using dplyr::semi_join with Clickhouse. The default code generated by dplyr produces a subquery with dependencies, and this isn't supported in Clickhouse (or am I wrong?). However, I noticed that Clickhouse does support LEFT SEMI JOIN. Consequently, I've wrote the following function to address this:

#' @export
#' @importFrom dbplyr sql_query_semi_join
sql_query_semi_join.ClickhouseConnection <- function(con, x, y, anti, by, where, vars, ..., lvl = 0) {

  x <- dbplyr:::dbplyr_sql_subquery(con, x, name = by$x_as, lvl = lvl)
  y <- dbplyr:::dbplyr_sql_subquery(con, y, name = by$y_as, lvl = lvl)

  on <- dbplyr:::sql_join_tbls(con, by)

  JOIN <- ifelse(anti, dplyr::sql("ANTI LEFT JOIN"), dplyr::sql("SEMI LEFT JOIN"))

  # Wrap with SELECT since callers assume a valid query is returned
  clauses <- list(
    dbplyr:::sql_clause_select(con, vars),
    dbplyr:::sql_clause_from(x),
    dbplyr:::sql_clause(JOIN, y),
    dbplyr:::sql_clause("ON", on, sep = " AND", parens = TRUE, lvl = 1)
  )
  dbplyr:::sql_format_clauses(clauses, lvl, con)
}

Nonetheless, I'm aware that my function uses some internal dbplyr functions, and I'm uncertain about the permissibility of this approach. Could someone provide some directions on how to refine this function for a potential PR?

Thank you in advance.

@inkrement
Copy link
Member

Hi Lucas! We appreciate your help! The package comprises two parts: (1) the Clickhouse-Driver and (2) the dbplyr interface. Your code does dplyr-code generation and, thus, targets the second part. We started to develop the package at a time when Clickhouse's Join-support was super basic, not in line with the standards, and highly restricted, but that changed a lot. Consequently, you can do a much better job than we did.

I am thrilled that you want to invest time here, and a pull request is welcome. Ideally, you would adapt the R/dplyr.R file, as this is where the dplyr-specific code is located. But here are a few more settings to keep in mind:

  • The code generation should be in line with the settings. Not all join algorithms support all join clauses.
  • Moreover, the function should also consider ANY/ALL (just in case someone changed join_default_strictness).
  • Please keep in mind that also any_join_distinct_right_table_keys is likely to affect the semantics.
  • Also test your code with the new planner/analyzer (allow_experimental_analyzer). ClickHouse#42648

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants