Is connector-x beneficial for complex queries with a lot of joins? #194

rsampaths16 · 2021-12-13T11:26:16Z

I have a complex query with many operations on different columns (e.g. CASE WHEN ... END, CONCAT, SUM, REPLACE, etc.) and joins across multiple tables (5 to 7 tables). I want to know whether connector-x is the right tool for my use case.

I ran a benchmark with timeit on the smallest query we have, which returns about 1,000,000 rows (roughly 1 GB). On that query, pymysql performed better than connector-x:

pymysql	connector-x
138.1542160679819	170.457891981001
136.12365255801706	163.348922684032
134.96259305998683	165.39596602495294
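
A comparison like the one above can be reproduced with a small `timeit` harness. The sketch below is only an assumption about how such a benchmark might be structured, not the code that produced these numbers: `benchmark` and the loader names are hypothetical, and each loader would wrap the real pymysql or connector-x call.

```python
import timeit


def benchmark(loaders, repeat=3):
    """Time each zero-argument loader `repeat` times, one call per measurement.

    Returns a dict mapping each loader name to its list of elapsed seconds.
    """
    return {
        name: timeit.repeat(load, number=1, repeat=repeat)
        for name, load in loaders.items()
    }


if __name__ == "__main__":
    # Stand-in loaders; replace them with callables that run the same query
    # through pymysql and through connector-x and build the same result type.
    loaders = {
        "loader_a": lambda: sum(range(100_000)),
        "loader_b": lambda: sum(range(200_000)),
    }
    for name, timings in benchmark(loaders).items():
        print(name, timings)
```

Using `number=1` keeps each measurement to a single full fetch, which matters when one run already takes minutes.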

We have other queries we want to try this on, returning about 40,000,000 rows (roughly 60 GB), and we are unsure whether connector-x would give us any gains.

Is connector-x only optimised for single-table loading?

Answered by wangxiaoying

wangxiaoying (Maintainer) · 2021-12-13T17:29:01Z

Hi @rsampaths16, thanks for bringing up this issue!

May I ask what destination dataframe you want? Also, what is the raw query you are referring to here? ConnectorX converts the query result into a dataframe for further analysis, so a fair comparison would be against other tools (e.g. pandas, turbodbc) that also produce a dataframe as the result.

ConnectorX mainly targets the scenario of fetching large query results. It speeds up the process by optimizing client-side execution and saturating both network and machine resources through parallelism. When the query gets complex, there is overhead from metadata fetching. In ConnectorX, up to three pieces of information are fetched before the query is issued to the database:

MIN, MAX query for partition range (if partition is applied)
COUNT query (if return_type=pandas)
schema fetching query
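
To see why these steps get more expensive as the query grows, it can help to picture each one as a statement that wraps the full user query. The sketch below is purely illustrative: `metadata_queries`, the `cx_tmp` alias, and the `LIMIT 1` stand-in for schema fetching are assumptions, not the SQL that ConnectorX actually generates, which also differs per database.

```python
def metadata_queries(query, partition_on=None, return_type="pandas"):
    """Return illustrative SQL for the pre-fetch steps listed above.

    `query` is the user's SQL text without a trailing semicolon.
    """
    wrapped = f"({query}) AS cx_tmp"  # hypothetical subquery alias
    steps = []
    if partition_on is not None:
        # Partition range: only needed when partitioning is applied.
        steps.append(
            f"SELECT MIN({partition_on}), MAX({partition_on}) FROM {wrapped}"
        )
    if return_type == "pandas":
        # Row count: only needed for the pandas destination.
        steps.append(f"SELECT COUNT(*) FROM {wrapped}")
    # Schema: always needed; LIMIT 1 is just a stand-in for the real mechanism.
    steps.append(f"SELECT * FROM {wrapped} LIMIT 1")
    return steps


if __name__ == "__main__":
    q = "SELECT a.id, b.v FROM a JOIN b ON a.id = b.id"
    for sql in metadata_queries(q, partition_on="id"):
        print(sql)
```

In this picture, a query with many joins may have to be evaluated by the database once per step, which is why dropping the COUNT step can matter.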

Say we do not use partitioning and want pandas as the final result. When the query gets complex, to avoid the potentially costly COUNT query, we suggest using Arrow as an intermediate destination from ConnectorX and converting it into pandas with Arrow's to_pandas API. For example:

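import connectorx as cx

# db_uri and query are placeholders for your connection string and SQL text.
# return_type must be a string; "arrow" avoids the COUNT query used for pandas.
table = cx.read_sql(db_uri, query, return_type="arrow")
df = table.to_pandas(split_blocks=False, date_as_object=False)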

As for the schema fetching query, may I ask which database you are using? For now, we have optimized schema fetching on Postgres and MySQL. So if you are using one of these two databases, the performance should be better than other baselines even on complex queries (even with small query results) without partitioning. We are still looking for ways to speed up other databases.

1 reply

rsampaths16 Feb 7, 2022
Author

Hi @wangxiaoying, we did observe improvements by setting the destination to Arrow:

N	PyMySQL	Connector-X
1	139.0824223581003	74.11710930301342
2	137.4936218999792	73.39678541501053
3	138.94776023400482	74.96836991701275
4	138.31066780001856	90.68306346703321
5	136.89600106095895	74.58728482399601
6	137.3009505419759	71.24305403605103
7	138.165417968994	72.34290853701532
8	137.28496502304915	72.15110429900233
9	137.69805334298871	72.61519615002908
10	138.33424504601862	73.73904214904178

The database is Azure managed MySQL.
By raw query I meant a query that just selects all fields from the joins of the required tables, without any additional operations.
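
The ten runs reported in this reply can be summarized with a few lines of arithmetic. This is a minimal sketch that only averages the posted figures; the variable names are made up here, and the timings are in whatever unit the original benchmark reported.

```python
from statistics import mean

# Timings copied from the table in this reply (runs 1 to 10).
pymysql_runs = [
    139.0824223581003, 137.4936218999792, 138.94776023400482,
    138.31066780001856, 136.89600106095895, 137.3009505419759,
    138.165417968994, 137.28496502304915, 137.69805334298871,
    138.33424504601862,
]
connectorx_runs = [
    74.11710930301342, 73.39678541501053, 74.96836991701275,
    90.68306346703321, 74.58728482399601, 71.24305403605103,
    72.34290853701532, 72.15110429900233, 72.61519615002908,
    73.73904214904178,
]

pymysql_mean = mean(pymysql_runs)
connectorx_mean = mean(connectorx_runs)
speedup = pymysql_mean / connectorx_mean  # > 1 means connector-x was faster

print(f"PyMySQL mean:     {pymysql_mean:.2f}")
print(f"Connector-X mean: {connectorx_mean:.2f}")
print(f"Speedup:          {speedup:.2f}x")
```

On these figures, connector-x with the Arrow destination averages a bit more than half the PyMySQL time, a ratio of roughly 1.8.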
