Data Discrepancy Issue Between Nebula Studio and Spark Connector #715

vaibhavkgiradkar · 2023-12-18T14:06:20Z

Summary:
The data and counts returned by Nebula Studio differ from those returned by the Spark connector. The counts are consistent when the data is read with the Spark connector alone. The data in question was written with the Spark connector.

Steps to Reproduce:

1. Write data to Nebula Graph using the Spark connector.
2. Query the data using Nebula Studio.
3. Query the same data using the Spark connector.
4. Compare the results and observe the discrepancy.

Expected Behavior:
The data and count retrieved from Nebula Studio should match the data and count obtained through the Spark connector.

Actual Behavior:
There is a discrepancy in the data or count between Nebula Studio and the Spark connector, even though the count matches when using the Spark connector alone.

Environment Details:

Nebula Studio Version: 3.7.0
Spark Connector Version: 3.6.0
Nebula Graph Version: 3.6.0
Spark Version: 3.4.1

vaibhavkgiradkar · 2023-12-20T10:11:20Z

@wey-gu any thoughts on this?

wey-gu · 2023-12-21T03:00:32Z

Could you share how you query the same data in each of the two, e.g. the query patterns?

vaibhavkgiradkar · 2023-12-21T08:22:21Z

match (v1:vertex_a)-[:edge_a]->(v2:vertex_b) return count(*)

(
    spark.read.format("com.vesoft.nebula.connector.NebulaDataSource")
    .option("type", "edge")
    .option("spaceName", "sample_space")
    .option("label", "edge_a")
    .option("returnCols", "")
    .option("passwd", "nebula")
    .option("user", "root")
    .option("metaAddress", "")
    .option("operateType", "read")
    .option("partitionNumber", 10)
    .load()
)

wey-gu · 2023-12-22T09:17:34Z

Can we assume that every edge_a edge has a source vertex tagged vertex_a and a destination vertex tagged vertex_b?

If not, the two queries are not equivalent.

If so, dangling edges can still cause the difference between the two.

If some edge_a edges were inserted without their source or destination vertices also being inserted, they are dangling edges. A storage-side scan (as with Spark) returns them, but some queries, such as match (v1:vertex_a)-[:edge_a]->(v2:vertex_b), do not.
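To make the effect concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not NebulaGraph or Spark connector code: the vertex IDs, the `vertex_tags` dictionary, and all three helper functions are hypothetical names chosen for illustration. It only shows why a scan over stored edges can report more rows than a pattern that requires tagged vertices at both ends.

```python
# Toy in-memory model; every name here is illustrative, not a NebulaGraph API.

# Vertex ID -> set of tags that were actually inserted for that vertex.
vertex_tags = {
    "a1": {"vertex_a"},
    "a2": {"vertex_a"},
    "b1": {"vertex_b"},
}

# (src, dst) pairs stored for edge type edge_a.
# "b2" and "a3" were never inserted as vertices, so two edges dangle.
edge_a = [
    ("a1", "b1"),
    ("a2", "b1"),
    ("a2", "b2"),  # dangling: destination vertex missing
    ("a3", "b1"),  # dangling: source vertex missing
]


def storage_scan_count(edges):
    """Count every stored edge, the way a storage-side scan sees them."""
    return len(edges)


def match_count(edges, tags, src_tag, dst_tag):
    """Count edges whose endpoints both exist and carry the required tags."""
    return sum(
        1
        for src, dst in edges
        if src_tag in tags.get(src, set()) and dst_tag in tags.get(dst, set())
    )


def dangling_edges(edges, tags):
    """Return edges with at least one endpoint that has no vertex record."""
    return [(src, dst) for src, dst in edges if src not in tags or dst not in tags]


if __name__ == "__main__":
    print(storage_scan_count(edge_a))
    print(match_count(edge_a, vertex_tags, "vertex_a", "vertex_b"))
    print(dangling_edges(edge_a, vertex_tags))
```

In this toy data the scan sees all four stored edges, while the tag-constrained pattern only sees the two whose endpoints both exist with the right tags. The same kind of check against real data would need the vertex IDs of both tags alongside the edge list.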

Also, when possible, SHOW STATS is the recommended way to query counts.
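As a rough sketch of how that could be used for the comparison: SHOW STATS reports the figures gathered by the most recent SUBMIT JOB STATS job, so that job has to be run (and finish) first. The helper below assumes the result rows have already been fetched as (Type, Name, Count) tuples. The `count_from_stats` function and the sample rows are hypothetical and made up for illustration, not output from a real cluster.

```python
# nGQL to run beforehand (in Nebula Studio or the console):
#   SUBMIT JOB STATS;
#   SHOW STATS;

# Hypothetical rows shaped like SHOW STATS output: (Type, Name, Count).
sample_rows = [
    ("Tag", "vertex_a", 3),
    ("Tag", "vertex_b", 2),
    ("Edge", "edge_a", 4),
    ("Space", "vertices", 5),
    ("Space", "edges", 4),
]


def count_from_stats(rows, kind, name):
    """Return the Count for the row matching (Type, Name), or None if absent."""
    for row_type, row_name, row_count in rows:
        if row_type == kind and row_name == name:
            return row_count
    return None


if __name__ == "__main__":
    # Compare this figure with the row count of the Spark DataFrame for edge_a.
    print(count_from_stats(sample_rows, "Edge", "edge_a"))
```

The figure for edge_a can then be compared with the row count of the DataFrame loaded through the Spark connector and with the MATCH result.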

ref: https://docs.nebula-graph.io/3.6.0/8.service-tuning/2.graph-modeling/#about_dangling_edges

QingZ11 added the type/question label on Dec 21, 2023

wey-gu mentioned this issue Dec 23, 2023

Weekly Report 2023-12-22 vesoft-inc/nebula-community#422

Closed
