
MultiDateMatcher only returning 1 element #14085

Open
1 task done
TommyDong1998 opened this issue Dec 8, 2023 · 0 comments

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

Finding dates in a string.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Wrap the raw text column as a document annotation
documentAssembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

# Extract dates; the anchor date is only used to resolve relative dates
date = (
    MultiDateMatcher()
    .setInputCols("document")
    .setOutputCol("date")
    .setAnchorDateYear(2020)
    .setAnchorDateMonth(1)
    .setAnchorDateDay(11)
    .setOutputFormat("yyyy/MM/dd")
)

pipeline = Pipeline().setStages([
    documentAssembler,
    date
])

# A single row containing two absolute dates
data = spark.createDataFrame([["Nov 29 2023, Dec 1 2024"]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(date) as dates").show(truncate=False)

Current Behavior

Currently, when I pass the following input to MultiDateMatcher:
["Nov 29 2023, Dec 1 2024"]
it returns only the first date (2023/11/29) instead of both dates.

+-----------------------------------------------+
|dates                                          |
+-----------------------------------------------+
|{date, 10, 20, 2023/11/29, {sentence -> 0}, []}|
+-----------------------------------------------+

Expected Behavior

Both dates are returned: 2023/11/29 and 2024/12/01.
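
To make the expected result concrete, here is a minimal plain-Python sketch of the output I would expect for this input. It does not use Spark NLP and says nothing about how MultiDateMatcher works internally; `find_dates` and `DATE_PATTERN` are hypothetical names introduced only for this illustration, and the regex only covers the "Mon D YYYY" shape used in the example.

```python
import re
from datetime import datetime

# Hypothetical pattern for this illustration only: "Mon D YYYY", e.g. "Nov 29 2023"
DATE_PATTERN = re.compile(r"\b[A-Z][a-z]{2} \d{1,2} \d{4}\b")


def find_dates(text):
    """Return every 'Mon D YYYY' date found in text, formatted as yyyy/MM/dd."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        try:
            parsed = datetime.strptime(match.group(0), "%b %d %Y")
        except ValueError:
            # Skip matches that look like dates but are not valid (e.g. "Foo 99 2023")
            continue
        found.append(parsed.strftime("%Y/%m/%d"))
    return found


print(find_dates("Nov 29 2023, Dec 1 2024"))
```

For the input in this issue, the list contains two entries, one per date, which is what I would expect the exploded `date` column to show as two rows.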

Steps To Reproduce

https://colab.research.google.com/drive/1xGE1MqqcsjOL9kyOoOwkiqnMa4LabETK?usp=sharing

I copied the example code from the documentation and replaced the input text with the dates (Nov 29 2023, Dec 1 2024). The code is identical to the snippet under "What are you working on?" above.

Spark NLP version and Apache Spark

Spark NLP 5.1.4
Apache Spark 3.5.0

Type of Spark Application

Python Application

Java Version

openjdk version "11.0.21" 2023-10-17
OpenJDK Runtime Environment (build 11.0.21+9-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.21+9-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)

Java Home Directory

N/A

Setup and installation

Google Colab

Operating System and Version

Google Colab (Ubuntu Linux)

Link to your project (if available)

https://colab.research.google.com/drive/1xGE1MqqcsjOL9kyOoOwkiqnMa4LabETK?usp=sharing

Additional Information

https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/MultiDateMatcher$.html
