
[BUG] Failure to communicate with tenant in West US #2347

Open · 2 of 19 tasks
dbeavon opened this issue Feb 11, 2025 · 7 comments

Comments

@dbeavon commented Feb 11, 2025

SynapseML version

Fabric 1.3 (com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader)
SynapseML '1.0.8-spark3.5'

System information

  • Language version python 3.11, scala 2.12
  • Spark Version 3.5
  • Spark Platform Fabric Runtime 1.3

Describe the problem

This library (SynapseML) is causing problems inside of Fabric. It appears to be running inside of Fabric while executing Spark SQL statements against a semantic model:
com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader

We already turned off the automatic logging of ML for experiments and models. (That had been causing problems for us in the past. Hopefully it is not a problem to turn that stuff off.)

The errors in my spark job are meaningless and seem to be unrelated to the actual work that I'm doing. The errors appear to be related to some perfunctory interaction with our Fabric tenant hosted in West US.

Here are the details:


Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 8) (vm-a9200333 executor 1): java.net.SocketTimeoutException: PowerBI service comm failed (https://WABI-WEST-US-C-PRIMARY-redirect.analysis.windows.net/v1.0/myOrg/internalMetrics/query)
	at com.microsoft.azure.synapse.ml.powerbi.PBISchemas.post(PBISchemas.scala:100)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.$anonfun$executeQuery$1(PBIMeasurePartitionReader.scala:107)
	at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb(SynapseMLLogging.scala:163)
	at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb$(SynapseMLLogging.scala:160)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.logVerb(PBIMeasurePartitionReader.scala:17)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.executeQuery(PBIMeasurePartitionReader.scala:105)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.init(PBIMeasurePartitionReader.scala:142)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIReaderFactory.createReader(PBIMeasureScan.scala:26)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)

Here is a screenshot of the query and the error:

[screenshot: the Spark SQL query and the resulting error]

Notice that I'm simply using "semantic-link" to run a query against a PBI dataset. I'm guessing that 95% of the work is performed on the driver.

I'm hoping to get some support here. The error seems related to this project, and not so much to Fabric itself. Otherwise I will wait a couple of weeks for Mindtree to respond (pro support); in the end they would probably need help from this community anyway to understand the behavior of SynapseML in Fabric.

Any tips would be very much appreciated.

Code to reproduce issue

%%sql

SELECT
    `Fiscal Week[Fiscal Week]`,
    `Random[Code]`,
    -- SUM(PriceMbfUsd)
    SUM(`USD Price MBF`),
    SUM(`USD Price MSF`)
FROM
    pbi.RandomLengthModel._Metrics
WHERE
    `Fiscal Week[Fiscal Year Number]` = 2025
    AND `Fiscal Week[Fiscal Week Number]` = 2
GROUP BY
    `Fiscal Week[Fiscal Week]`,
    `Random[Code]`

Other info / logs

None

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations
dbeavon added the bug label Feb 11, 2025

@dbeavon (Author) commented Feb 11, 2025

Hi @eisber

Sorry to bother you here, but Semantic Link has an odd dependency on SynapseML. I was wondering if you have any knowledge of this dependency stack.

I'm guessing that the problem I'm facing is a bug in Semantic Link for Spark (SparkLink). The relevant code is not published on the internet. ... If my problem were in SynapseML itself, then I'm guessing I would be able to find the call stacks here in this community.

The problem may be related to the fact that our tenant lives in West US and the capacity lives in North Central. We are getting some basic timeout/connectivity issues. This is not an intermittent issue. I have a ticket open, but I'm worried that the Mindtree folks will take weeks to set up a repro and contact you.

I don't think the scenario involves components that are still in "preview".

@eisber (Collaborator) commented Feb 18, 2025

Can you find a RAID (root activity ID) in the logs?
How long does the same query take if you use sempy evaluate_measure?
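
For reference, a minimal sketch of that driver-side call, assuming the documented sempy.fabric.evaluate_measure signature and reusing the measure/column names from the SQL repro above (hypothetical; adjust to the real model):

import sempy.fabric as fabric

# Evaluate the two measures on the driver, grouped and filtered the same
# way as the Spark SQL statement; returns a pandas DataFrame.
df = fabric.evaluate_measure(
    dataset="RandomLengthModel",
    measure=["USD Price MBF", "USD Price MSF"],
    groupby_columns=["Fiscal Week[Fiscal Week]", "Random[Code]"],
    filters={
        "Fiscal Week[Fiscal Year Number]": [2025],
        "Fiscal Week[Fiscal Week Number]": [2],
    },
)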

@dbeavon (Author) commented Feb 19, 2025

Hi @eisber, I can ask the engineer for his RAID. I was able to send a repro over to the Mindtree side of things.

The full case is reported with the following title and Mindtree case number.
Spark job failing when using semantic link - TrackingID#2502110040012091
I don't think we have created an ICM with Microsoft yet.

Here is an example of a spark native connector query that errors (seemingly because of synapse.ml.powerbi):

[screenshot: spark native connector query that errors]

The main problems appear when introducing WHERE clauses. That part of the query appears to be parsed for syntax, but it doesn't seem to have any impact on the SQL Profiler queries against the PBI dataset. Moreover, in some cases I can omit the WHERE clause as a way to avoid the error messages.

FYI, here is a comparable DAX query that works great when it is crafted by hand.

[screenshot: hand-crafted DAX query that works]

Notice it should take ~5 ms and return under 200 rows.
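
The screenshot is not reproduced here, but a hypothetical reconstruction of a hand-crafted DAX query of that shape, submitted from the driver via sempy's fabric.evaluate_dax, might look roughly like this (names taken from the SQL repro; the actual query in the screenshot may differ):

import sempy.fabric as fabric

# Hypothetical DAX roughly equivalent to the Spark SQL repro; assumes
# "USD Price MBF" / "USD Price MSF" are model measures.
dax = """
EVALUATE
SUMMARIZECOLUMNS(
    'Fiscal Week'[Fiscal Week],
    'Random'[Code],
    TREATAS({2025}, 'Fiscal Week'[Fiscal Year Number]),
    TREATAS({2}, 'Fiscal Week'[Fiscal Week Number]),
    "USD Price MBF", [USD Price MBF],
    "USD Price MSF", [USD Price MSF]
)
"""
df = fabric.evaluate_dax(dataset="RandomLengthModel", dax_string=dax)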

I'm having a hard time understanding the behavior of this "spark native connector", and I can't distinguish the functionality that works reliably from the functionality that is broken. My biggest concern is that WHERE clauses seem to be ignored. The secondary concern is that a restrictive TOPN() is applied to the generated DAX query. That restriction rarely retrieves enough data from the dataset model, especially when the WHERE clause is omitted:

[screenshot: generated DAX query showing the restrictive TOPN()]

It sounds like you are encouraging us to use the sempy ("evaluate_whatever") methods on the driver as a fallback whenever the spark native connector is misbehaving. Is that so? Are both of them supported as GA features of Fabric?

@eisber (Collaborator) commented Feb 19, 2025

The spark native connector path you're using maps to this function in sempy: https://learn.microsoft.com/en-us/python/api/semantic-link-sempy/sempy.fabric?view=semantic-link-python#sempy-fabric-evaluate-measure

If you don't want to join queries over the semantic model and Spark, I'd strongly recommend using the Python API on the driver node.

@dbeavon (Author) commented Feb 19, 2025

BTW, should we move this discussion to a different GitHub repo? It seems like this is only loosely related to the open source synapse.ml.

I think I understand that (A) the spark native connector is using (B) evaluate_measure ... but perhaps it is doing so on a remote executor rather than the driver. So I think you are saying that a problem in one of these (A or B) will always affect the other, and that I can simplify the repro by swapping one for the other?

I have a tendency to gravitate towards spreading requests out to the executors, given my past experiences with Apache Spark. If a Spark developer does everything on the driver, people will tell them they are doing it wrong (i.e. why use a cluster at all?). The hope is that someday there will be optimizations or query hints that allow the work to be distributed across executors and thereby improve the overall execution time.

Of course, the bottleneck will ultimately just move: slow operations on the Spark cluster may be made faster, but the bottleneck will end up at the PBI dataset model. So it is really doubtful that it makes any difference whether queries are submitted from executors or the driver.

In the end, the only real benefit I expected to get out of the spark native connector is to avoid as much DAX as possible. ;) I love MDX and SQL, but have a love/hate relationship with DAX.

Is the spark native connector at least supported?

... I have started to have doubts about that, given the obscure error:
java.net.SocketTimeoutException: PowerBI service comm failed (https://WABI-WEST-US-C-PRIMARY-redirect.analysis.windows.net)

... resulting from a slight change in query syntax.

As per your strong recommendation, I have no doubt that I would be able to get the Python API working on the driver, by hook or by crook. My only question at this point is whether to avoid the spark native connector for future PySpark workloads.

@eisber (Collaborator) commented Feb 19, 2025

I don't know your dataset size, but from past experiments, for most standard semantic model sizes you won't see any improvement by using Spark or by trying to optimize by moving compute to the executors. In general, the recommendation to perform computation on the executors for Spark jobs is reasonable, but that's for datasets of multiple GB/TB.

If your dataset fits into memory on the driver node, you're probably even faster, since you don't incur any distributed-system overhead.

@dbeavon (Author) commented Feb 19, 2025

Right. Most of my PBI datasets are small. At my company I'm guessing that 99% of our PBI datasets are under 5 GB (and could fit easily in DuckDB or SQLite).

Still, when running a solution on a Spark cluster, people expect to follow Spark patterns. I'm assuming that is why Microsoft created the spark native connector in the first place. Using SQL statements against PBI datasets is also appealing.

Based on your recommendation, I'll start using sempy on the driver ... in the pandas ecosystem ... and subsequently build the Spark frame after the fact when I need one, e.g. via:
https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html#DataFrame-Creation
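
A minimal sketch of that workflow, assuming sempy's fabric.evaluate_measure on the driver plus the standard spark.createDataFrame conversion (measure/column names reused from the repro above, hypothetical):

import sempy.fabric as fabric

# Driver-side query; returns a pandas DataFrame.
pdf = fabric.evaluate_measure(
    dataset="RandomLengthModel",
    measure=["USD Price MBF", "USD Price MSF"],
    groupby_columns=["Fiscal Week[Fiscal Week]", "Random[Code]"],
)

# Promote to a Spark DataFrame only when the result needs to join other
# Spark data (delta tables, etc.).
sdf = spark.createDataFrame(pdf)
sdf.createOrReplaceTempView("usd_prices")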

In the future we may need to combine this data (PBI models) with some other pre-existing Spark solution, or a delta table, or whatever. Whenever that happens, it feels a bit "dirty" if one piece of data forces everything to be collect()'ed up to the driver. To avoid that dirty operation, a Spark developer would typically push even the small datasets down to the workers. (And the spark native connector would theoretically save us from writing that code ourselves.)
