Json4s parse error on ResourceMetadata while running a few models in spark-nlp #14327

Open

nimesh1601 opened this issue Jun 11, 2024 · 5 comments
Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

Trying out an example similar to https://sparknlp.org/api/com/johnsnowlabs/nlp/embeddings/BertSentenceEmbedding

Current Behavior

We are getting a json4s exception while Spark NLP tries to fetch the pretrained resource metadata.
Exception stack trace:

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: org.json4s.MappingException: Parsed JSON values do not match with class constructor
args=
arg types=
executable=Executable(Constructor(public com.johnsnowlabs.nlp.pretrained.ResourceMetadata(java.lang.String,scala.Option,scala.Option,scala.Option,boolean,java.sql.Timestamp,boolean,scala.Option,java.lang.String,scala.Option)))
cause=wrong number of arguments
types comparison result=MISSING(java.lang.String),MISSING(scala.Option),MISSING(scala.Option),MISSING(scala.Option),MISSING(boolean),MISSING(java.sql.Timestamp),MISSING(boolean),MISSING(scala.Option),MISSING(java.lang.String),MISSING(scala.Option)
	at org.json4s.reflect.package$.fail(package.scala:53)
	at org.json4s.Extraction$ClassInstanceBuilder.instantiate(Extraction.scala:724)
	at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:767)
	at org.json4s.Extraction$.$anonfun$extract$10(Extraction.scala:462)
	at org.json4s.Extraction$.$anonfun$customOrElse$1(Extraction.scala:780)
	at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
	at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
	at scala.PartialFunction$$anon$1.applyOrElse(PartialFunction.scala:257)
	at org.json4s.Extraction$.customOrElse(Extraction.scala:780)
	at org.json4s.Extraction$.extract(Extraction.scala:454)
	at org.json4s.Extraction$.extract(Extraction.scala:56)
	at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:22)
	at com.johnsnowlabs.util.JsonParser$.parseObject(JsonParser.scala:28)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:104)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:136)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:134)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at scala.collection.Iterator$$anon$13.next(Iterator.scala:593)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
	at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
	at scala.collection.AbstractIterator.to(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350)
	at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350)
	at scala.collection.AbstractIterator.toList(Iterator.scala:1431)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:134)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:128)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:58)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:69)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:228)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:562)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:782)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)

Expected Behavior

The model runs successfully.

Steps To Reproduce

Run the https://sparknlp.org/api/com/johnsnowlabs/nlp/embeddings/BertSentenceEmbeddings example.
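A minimal sketch of the reproduction (the model name below is the documented default for BertSentenceEmbeddings and is assumed here; any pretrained() call fails at the same metadata step):

import sparknlp
from sparknlp.annotator import BertSentenceEmbeddings

spark = sparknlp.start()

# The exception is thrown while the pretrained metadata is resolved,
# before any data is processed.
embeddings = (
    BertSentenceEmbeddings.pretrained("sent_small_bert_L2_768", "en")
    .setInputCols(["document"])
    .setOutputCol("sentence_embeddings")
)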

Spark NLP version and Apache Spark

spark-nlp version - 5.3.3
Spark version - 3.3.2
Python version - 3.9
Scala version - 2.12
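For context, a sketch of how a session for this stack is typically created (the Maven coordinate is the standard spark-nlp 5.3.3 / Scala 2.12 artifact; our actual submit configuration may differ):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Spark NLP Example")
    # Scala 2.12 build of Spark NLP 5.3.3 (coordinate from Maven Central)
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)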

Type of Spark Application

Python Application

Java Version

jdk-11

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

Other packages installed via pip

  • numpy==1.25.0
  • pandas==2.0.2
  • statsmodels==0.14.0
  • h3core-package==2.14.1.1.post1
  • haversine==2.8.0
  • joblib==1.2.0
  • LatLon==1.0.2
  • patsy==0.5.3
  • polyline==2.0.0
  • py4j==0.10.9.7
  • pyproj==3.6.0
  • python-dateutil==2.8.2
  • pytz==2023.3
  • scikit-learn==1.2.2
  • scipy==1.10.1
  • shapely==2.0.1
  • six==1.16.0
  • BeautifulSoup4==4.12.2
  • Cheetah3==3.2.6.post1
  • FormEncode==2.0.1
  • IPy==1.1
  • Jinja2==3.1.2
  • MarkupSafe==2.1.3
  • Paste==3.5.3
  • PyYAML==6.0
  • SQLAlchemy==2.0.16
  • Tempita==0.5.2
  • ansicolors==1.1.8
  • backports-abc==0.5
  • boto==2.49.0
  • bottle==0.12.25
  • cffi==1.15.1
  • click==8.1.3
  • colorama==0.4.6
  • configobj==5.0.8
  • configparser==5.3.0
  • cronwrap==1.4
  • cryptography==41.0.1
  • decorator==5.1.1
  • dnspython==2.3.0
  • docker-py==1.10.6
  • enum34==1.1.10
  • futures==3.0.5
  • google-common==0.0.1
  • hash-ring==1.3.1
  • html5lib==1.1
  • ipcalc==1.99.0
  • ipython==8.14.0
  • jsonpatch==1.33
  • jsonpointer==2.4
  • jsonschema==4.17.3
  • kafka-python==2.0.2
  • kazoo==2.9.0
  • m3==4.3.3
  • mock==5.0.2
  • msgpack-python==0.5.6
  • mutornadomon==0.5.1
  • ndg-httpsclient==0.5.1
  • oauth==1.0.1
  • pexpect==4.8.0
  • ply==3.11
  • prettytable==3.8.0
  • protobuf==4.23.3
  • psutil==5.9.5
  • psycopg2==2.9.6
  • pyasn1==0.5.0
  • pycparser==2.21
  • pycrypto==2.6.1
  • py3dns==3.2.1
  • pydot==1.4.2
  • pymegacli==0.1.5.3
  • pyparsing==3.1.0
  • pyrasite==2.0
  • pyserial==3.5
  • pysparklines==1.4
  • python-ldap==3.4.3
  • python-snappy==0.6.1
  • python-systemd==0.0.9
  • pyzmq==25.0.2
  • redis==4.5.5
  • salt==3006.1
  • send-nsca==0.1.4.1
  • setproctitle==1.3.2
  • simplegeneric==0.8.1
  • simplejson==3.19.1
  • singledispatch==4.0.0
  • thriftrw==1.9.0
  • tornado==5.1.1
  • ujson==5.8.0
  • urwid==2.1.2
  • uservice==0.1.16
  • virtualenv==20.23.1
  • websocket-client==1.6.0
  • zope.interface==6.0
@maziyarpanahi
Member

Please provide your full code, preferably in Colab, so we can reproduce it.

@olcayc

olcayc commented Jun 11, 2024

Hi Maziyar, this is a minimal script to recreate what we're doing. The error happens when Spark NLP tries to download a model. This same flow worked correctly for us under Spark 3.0, but somehow it is failing under the Spark 3.3 environment:

import sparknlp
from sparknlp.base import EmbeddingsFinisher, DocumentAssembler
from sparknlp.common import AnnotatorType
from sparknlp.annotator import E5Embeddings
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession


# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Spark NLP Example") \
    .getOrCreate()

spark.sparkContext.setCheckpointDir("/path/to/checkpoint/dir")

# input_df is a dataframe with column 'text' containing text to embed
input_df = ...

# Build pipeline
documentAssembler = (
    DocumentAssembler().setInputCol("text").setOutputCol("document")
)

embeddings = E5Embeddings.pretrained()

embeddingsFinisher = (
    EmbeddingsFinisher()
    .setInputCols(["sentence_embeddings"])
    .setOutputCols("unpooled_embeddings")
    .setOutputAsVector(True)
    .setCleanAnnotations(False)
)

embeddings = embeddings.setInputCols(["document"]).setOutputCol(
    "sentence_embeddings"
)
pipeline = Pipeline().setStages(
    [documentAssembler, embeddings, embeddingsFinisher]
)

input_df = input_df.repartition(400).checkpoint()

result_df = pipeline.fit(input_df).transform(input_df).checkpoint()
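Downstream we only read the finished vectors back out, roughly like this (the inspection itself is just illustrative):

from pyspark.sql import functions as F

# Peek at a few finished vectors; the column name comes from the finisher above.
result_df.select(F.explode("unpooled_embeddings").alias("embedding")).show(5, truncate=80)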

@olcayc

olcayc commented Jun 12, 2024

@maziyarpanahi As per the code snippet above, we are not doing anything particularly complex, just generating some embeddings. We get the same error with other pretrained models as well. The code worked under Spark 3.0, but now we are getting this JSON4s parsing error under Spark 3.3.
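For illustration, swapping in another pretrained annotator hits the same metadata call (model name taken from the Models Hub; the exact name is not important):

from sparknlp.annotator import BertSentenceEmbeddings

# Fails with the same MappingException while resolving ResourceMetadata.
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128", "en")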

Is spark-nlp 5.3.3 tested under PySpark 3.3.2, JVM/JRE 11, Scala 2.12, and Python 3.9? What's the closest configuration that you've tested successfully on your side?

@Siddharth-Latthe-07

This exception typically occurs when the JSON data being parsed does not match the expected format defined by the ResourceMetadata class constructor. This could be due to missing or extra fields, incorrect data types, or changes in the JSON structure.
Here are some steps that might help; let me know if they don't:

  1. Check the JSON response
  2. Verify the class constructor
  3. Update the Spark NLP version (see the version check after the sample code)
  4. Add custom parsing logic
  5. Inspection, e.g. with the sample code below:
from pyspark.sql import SparkSession
import requests
import json

# Initialize Spark session
spark = SparkSession.builder \
    .appName("SparkNLPExample") \
    .getOrCreate()

# Function to log the JSON response for inspection
def log_json_response(resource_url):
    response = requests.get(resource_url)
    if response.status_code == 200:
        print(json.dumps(response.json(), indent=4))
    else:
        print(f"Failed to fetch resource: {response.status_code}")

# Example resource URL (replace with the actual URL you are using)
resource_url = "https://sparknlp.org/api/com/johnsnowlabs/nlp/embeddings/BertSentenceEmbeddings"
log_json_response(resource_url)
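For step 3 above, a quick way to confirm the installed versions (a minimal sketch; assumes the standard sparknlp.version() helper):

import pyspark
import sparknlp

print("spark-nlp:", sparknlp.version())
print("pyspark:", pyspark.__version__)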

Hope this helps,
Thanks

@maziyarpanahi
Member

I am not sure what causes this, but please test the latest 5.4.1 release instead, just in case. This is a pretty simple setup with your minimal code, and it works without worrying about any of those versions:

https://colab.research.google.com/drive/1qgD75n8KcSf5ehkZ7obpDTOls_17fiKJ?usp=sharing
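A sketch of the suggested upgrade (package name assumed to follow the usual pip coordinates):

# pip install --upgrade spark-nlp==5.4.1
import sparknlp

spark = sparknlp.start()
print(sparknlp.version())  # expect 5.4.1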
