Data Discrepancy Issue Between Nebula Studio and Spark Connector #715

Open
vaibhavkgiradkar opened this issue Dec 18, 2023 · 4 comments
Labels
type/question Type: question about the product

Comments

@vaibhavkgiradkar

Summary:
There is an observed difference in data/count when fetching data from Nebula Studio compared to using the Spark connector. However, the count matches when reading data with the Spark connector. The data in question has been written with the help of the Spark connector.

Steps to Reproduce:

  1. Write data using the Spark connector to Nebula Graph.
  2. Query the data using Nebula Studio.
  3. Query the same data using the Spark connector.
  4. Compare the results and observe the discrepancy.

Expected Behavior:
The data and count retrieved from Nebula Studio should match the data and count obtained through the Spark connector.

Actual Behavior:
There is a discrepancy in the data or count between Nebula Studio and the Spark connector, even though the count matches when using the Spark connector alone.

Environment Details:

  • Nebula Studio Version: 3.7.0
  • Spark Connector Version: 3.6.0
  • Nebula Graph Version: 3.6.0
  • Spark Version: 3.4.1
@vaibhavkgiradkar
Author

@wey-gu any thoughts on this?

@wey-gu
Contributor

wey-gu commented Dec 21, 2023

Could you share how you query the same data in each of the two, for example the query pattern you use?

@vaibhavkgiradkar
Author

Nebula Studio query:

MATCH (v1:vertex_a)-[:edge_a]->(v2:vertex_b) RETURN count(*)

Spark connector read:

df = (
    spark.read.format("com.vesoft.nebula.connector.NebulaDataSource")
    .option("type", "edge")
    .option("spaceName", "sample_space")
    .option("label", "edge_a")
    .option("returnCols", "")
    .option("passwd", "nebula")
    .option("user", "root")
    .option("metaAddress", "")
    .option("operateType", "read")
    .option("partitionNumber", 10)
    .load()
)

@QingZ11 QingZ11 added the type/question Type: question about the product label Dec 21, 2023
@wey-gu
Contributor

wey-gu commented Dec 22, 2023

Can we assume that every edge_a edge has a vertex_a source and a vertex_b destination?

If not, the two queries are not equivalent.

If yes, dangling edges can still cause the difference between the two.

If some edge_a edges were inserted without their source or destination vertices, they are dangling edges. A storage-side scan (as done through Spark) returns them, but some queries, such as match (v1:vertex_a)-[:edge_a]->(v2:vertex_b), do not.

Also, when possible, SHOW STATS is the recommended way to query counts.

ref: https://docs.nebula-graph.io/3.6.0/8.service-tuning/2.graph-modeling/#about_dangling_edges
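
To make the dangling-edge explanation concrete, here is a toy sketch in plain Python. It is not how NebulaGraph or the Spark connector work internally; the vertex IDs, the data, and the helper names (storage_scan_count, tagged_match_count, dangling_edges) are all made up for illustration. It only shows why a scan over stored edges can report more rows than a pattern that also requires both endpoints to exist with the expected tags.

```python
# Toy model only: this does not reflect NebulaGraph's real execution.
# All IDs, data, and function names are hypothetical.

# Vertices that were actually inserted, mapped to their tag.
vertices = {
    "a1": "vertex_a",
    "a2": "vertex_a",
    "b1": "vertex_b",
}

# edge_a edges as (src, dst) pairs. The last two are dangling:
# "b2" and "a3" were never inserted as vertices.
edges = [
    ("a1", "b1"),
    ("a2", "b1"),
    ("a1", "b2"),
    ("a3", "b1"),
]


def storage_scan_count(edges):
    """Count every stored edge, regardless of whether its endpoints exist."""
    return len(edges)


def tagged_match_count(edges, vertices):
    """Count edges whose source is a vertex_a and whose destination is a vertex_b."""
    return sum(
        1
        for src, dst in edges
        if vertices.get(src) == "vertex_a" and vertices.get(dst) == "vertex_b"
    )


def dangling_edges(edges, vertices):
    """Return edges with at least one endpoint missing from the vertex set."""
    return [(src, dst) for src, dst in edges if src not in vertices or dst not in vertices]


print(storage_scan_count(edges))            # 4
print(tagged_match_count(edges, vertices))  # 2
print(dangling_edges(edges, vertices))      # [('a1', 'b2'), ('a3', 'b1')]
```

In this toy data the two counts differ by exactly the number of dangling edges. In a real space the gap may also include edges whose endpoints exist but carry other tags.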
