
[BUG] Failure to communicate with tenant in West US #2347

Open · 2 of 19 tasks
dbeavon opened this issue Feb 11, 2025 · 7 comments

Comments

@dbeavon commented Feb 11, 2025

SynapseML version

Fabric 1.3 (com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader)
SynapseML '1.0.8-spark3.5'

System information

  • Language version python 3.11, scala 2.12
  • Spark Version 3.5
  • Spark Platform Fabric Runtime 1.3

Describe the problem

This library (SynapseML) is causing problems inside of Fabric. It appears to be running inside of Fabric while executing Spark SQL statements against a semantic model:
com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader

We already turned off the automatic logging of ML for experiments and models. (That had been causing problems for us in the past. Hopefully it is not a problem to turn that stuff off.)

The errors in my spark job are meaningless and seem to be unrelated to the actual work that I'm doing. The errors appear to be related to some perfunctory interaction with our Fabric tenant hosted in West US.

Here are the details:


Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 8) (vm-a9200333 executor 1): java.net.SocketTimeoutException: PowerBI service comm failed (https://WABI-WEST-US-C-PRIMARY-redirect.analysis.windows.net/v1.0/myOrg/internalMetrics/query)
	at com.microsoft.azure.synapse.ml.powerbi.PBISchemas.post(PBISchemas.scala:100)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.$anonfun$executeQuery$1(PBIMeasurePartitionReader.scala:107)
	at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb(SynapseMLLogging.scala:163)
	at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb$(SynapseMLLogging.scala:160)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.logVerb(PBIMeasurePartitionReader.scala:17)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.executeQuery(PBIMeasurePartitionReader.scala:105)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.init(PBIMeasurePartitionReader.scala:142)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIReaderFactory.createReader(PBIMeasureScan.scala:26)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)

Here is a screenshot of the query and the error:

[screenshot: the Spark SQL query and the resulting error]

Notice that I'm simply using "semantic-link" to run a query against a PBI dataset. I'm guessing that 95% of the work is performed on the driver.

I'm hoping to get some support here. The error seems related to this project, and not so much to Fabric itself. Otherwise I will wait a couple of weeks for Mindtree to respond (pro support); in the end they would probably need help from this community anyway to understand the behavior of SynapseML in Fabric.

Any tips would be very much appreciated.

Code to reproduce issue

%%sql

SELECT
    `Fiscal Week[Fiscal Week]`,
    `Random[Code]`,
    -- SUM(PriceMbfUsd)
    SUM(`USD Price MBF`),
    SUM(`USD Price MSF`)
FROM
    pbi.RandomLengthModel._Metrics
WHERE
    `Fiscal Week[Fiscal Year Number]` = 2025
    AND `Fiscal Week[Fiscal Week Number]` = 2
GROUP BY
    `Fiscal Week[Fiscal Week]`,
    `Random[Code]`

Other info / logs

None

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations
dbeavon added the bug label Feb 11, 2025

@dbeavon (Author) commented Feb 11, 2025

Hi @eisber

Sorry to bother you here, but Semantic Link has an odd dependency on SynapseML. I was wondering if you have any knowledge of this dependency stack.

I'm guessing that the problem I'm facing is a bug in Semantic Link for Spark (SparkLink). The relevant code is not published on the internet. ... If my problem were in SynapseML itself, then I'm guessing I would be able to find the call stacks here in this community.

The problem may be related to the fact that our tenant lives in West US and the capacity lives in North Central. We are getting some basic timeout/connectivity issues. This is not an intermittent issue. I have a ticket open, but I'm worried that the Mindtree folks will take weeks to set up a repro and contact you.

I don't think the scenario involves components that are still in "preview".

@eisber (Collaborator) commented Feb 18, 2025

Can you find a RAID (root activity ID) in the logs?
How long does the same query take if you use sempy evaluate_measure?
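
For reference, a minimal sketch of that driver-side call, assuming the documented sempy.fabric.evaluate_measure signature and reusing the measure/column names from the SQL repro above (hypothetical; adjust to the real model):

import sempy.fabric as fabric

# Evaluate the two measures on the driver, grouped and filtered the same
# way as the Spark SQL statement; returns a pandas DataFrame.
df = fabric.evaluate_measure(
    dataset="RandomLengthModel",
    measure=["USD Price MBF", "USD Price MSF"],
    groupby_columns=["Fiscal Week[Fiscal Week]", "Random[Code]"],
    filters={
        "Fiscal Week[Fiscal Year Number]": [2025],
        "Fiscal Week[Fiscal Week Number]": [2],
    },
)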

@dbeavon (Author) commented Feb 19, 2025

Hi @eisber, I can ask the engineer for his RAID. I was able to send a repro over to the Mindtree side of things.

The full case is reported with the following title and Mindtree case number.
Spark job failing when using semantic link - TrackingID#2502110040012091
I don't think we have created an ICM with Microsoft yet.

Here is an example of a spark native connector query that errors (seemingly because of synapse.ml.powerbi):

[screenshot: spark native connector query that errors]

The main problems appear when introducing WHERE clauses. That part of the query appears to be parsed for syntax, but it doesn't seem to have any impact on the SQL Profiler queries against the PBI dataset. Moreover, in some cases I can omit the WHERE clause as a way to avoid the error messages.

FYI, here is a comparable DAX query that works great when it is crafted by hand.

[screenshot: hand-crafted DAX query that works]

Notice it should take ~5 ms and return under 200 rows.
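
The screenshot is not reproduced here, but a hypothetical reconstruction of a hand-crafted DAX query of that shape, submitted from the driver via sempy's fabric.evaluate_dax, might look roughly like this (names taken from the SQL repro; the actual query in the screenshot may differ):

import sempy.fabric as fabric

# Hypothetical DAX roughly equivalent to the Spark SQL repro; assumes
# "USD Price MBF" / "USD Price MSF" are model measures.
dax = """
EVALUATE
SUMMARIZECOLUMNS(
    'Fiscal Week'[Fiscal Week],
    'Random'[Code],
    TREATAS({2025}, 'Fiscal Week'[Fiscal Year Number]),
    TREATAS({2}, 'Fiscal Week'[Fiscal Week Number]),
    "USD Price MBF", [USD Price MBF],
    "USD Price MSF", [USD Price MSF]
)
"""
df = fabric.evaluate_dax(dataset="RandomLengthModel", dax_string=dax)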

I'm having a hard time understanding the behavior of this "spark native connector", and I can't distinguish the functionality that works reliably from the functionality that is broken. My biggest concern is that WHERE clauses seem to be ignored. The secondary concern is that a restrictive TOPN() is applied to the generated DAX query. That restriction rarely retrieves enough data from the dataset model, especially when the WHERE clause is omitted:

[screenshot: generated DAX query showing the restrictive TOPN()]

It sounds like you are encouraging us to use the sempy ("evaluate_whatever") methods on the driver as a fallback whenever the spark native connector is misbehaving. Is that so? Are both of them supported as GA features of Fabric?

@eisber (Collaborator) commented Feb 19, 2025

The spark native connector path you're using maps to this function in sempy: https://learn.microsoft.com/en-us/python/api/semantic-link-sempy/sempy.fabric?view=semantic-link-python#sempy-fabric-evaluate-measure

If you don't want to join queries over the semantic model and Spark, I'd strongly recommend using the Python API on the driver node.

@dbeavon (Author) commented Feb 19, 2025

BTW, should we move this discussion to a different GitHub repo? It seems like this is only loosely related to the open source synapse.ml.

I think I understand that (A) the spark native connector is using (B) evaluate_measure ... but perhaps it is doing so on a remote executor rather than the driver. So I think you are saying that a problem in one of these (A or B) will always affect the other, and that I can simplify the repro by swapping one for the other?

I have a tendency to gravitate towards spreading requests out to the executors, given my past experiences with Apache Spark. If a Spark developer does everything on the driver, people will tell them they are doing it wrong (i.e. why use a cluster at all?). The hope is that someday there will be optimizations or query hints that allow the work to be distributed across executors and thereby improve the overall execution time.

Of course, the bottleneck will ultimately just move: slow operations on the Spark cluster may be made faster, but the bottleneck will end up at the PBI dataset model. So it is really doubtful that it makes any difference whether queries are submitted from executors or the driver.

In the end, the only real benefit I expected to get out of the spark native connector is to avoid as much DAX as possible. ;) I love MDX and SQL, but have a love/hate relationship with DAX.

Is the spark native connector at least supported?

... I have started to have doubts about that, given the obscure error:
java.net.SocketTimeoutException: PowerBI service comm failed (https://WABI-WEST-US-C-PRIMARY-redirect.analysis.windows.net)

... resulting from a slight change in query syntax.

As per your strong recommendation, I have no doubt that I would be able to get the Python API working on the driver, by hook or by crook. My only question at this point is whether to avoid the spark native connector for future PySpark workloads.

@eisber (Collaborator) commented Feb 19, 2025

I don't know your dataset size, but from past experiments, for most standard semantic model sizes you won't see any improvement by using Spark or by trying to optimize by moving compute to the executors. In general, the recommendation to perform computation on the executors for Spark jobs is reasonable, but that's for datasets of multiple GB/TB.

If your dataset fits into memory on the driver node, you're probably even faster, since you don't incur any distributed-system overhead.

@dbeavon (Author) commented Feb 19, 2025

Right. Most of my PBI datasets are small. At my company I'm guessing that 99% of our PBI datasets are under 5 GB (and could fit easily in DuckDB or SQLite).

Still, when running a solution on a Spark cluster, people expect to follow Spark patterns. I'm assuming that is why Microsoft created the spark native connector in the first place. Using SQL statements against PBI datasets is also appealing.

Based on your recommendation, I'll start using sempy on the driver ... in the pandas ecosystem ... and subsequently build the Spark frame after the fact when I need one, e.g. via:
https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html#DataFrame-Creation
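
A minimal sketch of that workflow, assuming sempy's fabric.evaluate_measure on the driver plus the standard spark.createDataFrame conversion (measure/column names reused from the repro above, hypothetical):

import sempy.fabric as fabric

# Driver-side query; returns a pandas DataFrame.
pdf = fabric.evaluate_measure(
    dataset="RandomLengthModel",
    measure=["USD Price MBF", "USD Price MSF"],
    groupby_columns=["Fiscal Week[Fiscal Week]", "Random[Code]"],
)

# Promote to a Spark DataFrame only when the result needs to join other
# Spark data (delta tables, etc.).
sdf = spark.createDataFrame(pdf)
sdf.createOrReplaceTempView("usd_prices")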

In the future we may need to combine this data (PBI models) with some other pre-existing Spark solution, or a delta table, or whatever. Whenever that happens, it feels a bit "dirty" if one piece of data forces everything to be collect()'ed up to the driver. To avoid that dirty operation, a Spark developer would typically push even the small datasets down to the workers. (And the spark native connector would theoretically save us from writing that code ourselves.)
