Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more! #13946
maziyarpanahi
announced in
Announcement
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
📢 And RAG whispered to Spark NLP, you complete me!
It's a well-established principle: any LLM, whether open-source or proprietary, isn't dependable without a RAG. And truly, there can't be an effective RAG without an NLP library that is production-ready, natively distributed, state-of-the-art, and user-friendly. This holds true in our 5.1.0 release!
Release Summary:
We're excited to unveil Spark NLP 🚀 5.1.0 with:
We want to thank our community for their valuable feedback, feature requests, and contributions. Our Models Hub now contains over 18,000+ free and truly open-source models & pipelines. 🎉
Don't miss our free Webinar: From GPT-4 to Llama-2: Supercharging State-of-the-Art Embeddings for Vector Databases
🔥 New Features
Spark NLP ❤️ ONNX (toujours)
In Spark NLP 5.1.0, we're persisting with our commitment to ONNX Runtime support. Following our introduction of ONNX Runtime in Spark NLP 5.0.0—which has notably augmented the performance of models like BERT—we're further integrating features to bolster model efficiency. Our endeavors include optimizing existing models and expanding our ONNX-compatible offerings. For a detailed overview of ONNX compatibility in Spark NLP, refer to this issue.
NEW: In the 5.1.0 release, we've extended ONNX support to the E5 embedding annotator and introduced 15 new E5 models in ONNX format. This includes both optimized and quantized versions. Impressively, the enhanced ONNX support and these new models showcase a performance boost ranging from 2.3x to 3.4x when compared to the TensorFlow versions released in the 5.0.0 update.
OpenAI Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
NEW: Introducing WhisperForCTC annotator in Spark NLP 🚀.
WhisperForCTC
can load all state-of-the-art Whisper models inherited fromOpenAI Whisper
for Robust Speech Recognition. Whisper was trained and open-sourced that approaches human level robustness and accuracy on English speech recognition.MPNet: Masked and Permuted Pre-training for Language Understanding
NEW: Introducing MPNetEmbeddings annotator in Spark NLP 🚀.
MPNetEmbeddings
can load all state-of-the-art MPNet Models for Text Embeddings.Available new state-of-the-art BGE, TGE, E5, and INSTRUCTOR models for Text Embeddings are currently dominating the top of the MTEB leaderboard positioning themselves way above OpenAI text-embedding-ada-002
New OpenAI Embeddings and Completions
NEW: In Spark NLP 5.1.0, we're thrilled to introduce the integration of OpenAI Embeddings and Completions transformers. By merging the prowess of OpenAI's language model with the robust NLP processing capabilities of Spark NLP, we've created a powerful synergy. Specifically, with the newly introduced
OpenAIEmbeddings
andOpenAICompletion
transformers, users can now make direct API calls to OpenAI's Embeddings and Completion endpoints right from an Apache Spark DataFrame. This enhancement promises to elevate the efficiency and versatility of data processing workflows within Spark NLP pipelines.Unified Support for All Major Cloud Storage
In Spark NLP 5.1.0, we're thrilled to announce a holistic integration of all major cloud and distributed file storage systems. Building on our existing support for
AWS
,DBFS
, andHDFS
, we've now introduced seamless operations withGoogle Cloud Platform (GCP)
andAzure
. Here's a brief overview of what's been added and improved:cache_pretrained
property now offers unified cloud access, making it easier to cache models from any supported platform.We're confident these updates will provide a smoother, more unified experience for users across all cloud platforms for the following features:
cache_pretrained
directoryBART: New multi-lingual Zero-Shot Text Classification
BartForZeroShotClassification
annotator for text classification with your labels! 💯Let's see how easy it is to just use any set of labels our trained model has never seen via the
setCandidateLabels()
param:For Zero-Short Multi-class Text Classification:
For Zero-Short Multi-class Text Classification:
⭐🐛 Improvements & Bug Fixes
💾 Models
Spark NLP 5.1.0 comes with more than 200+ new state-of-the-art pre-trained transformer models in multi-languages.
Featured Models
xx
xx
en
en
en
en
en
en
The complete list of all 18400+ models & pipelines in 230+ languages is available on Models Hub
📓 New Notebooks
📖 Documentation
❤️ Community support
and show off how you use Spark NLP!
Installation
Python
#PyPI pip install spark-nlp==5.1.0
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):
GPU
Apple Silicon (M1 & M2)
AArch64
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
spark-nlp-gpu:
spark-nlp-silicon:
spark-nlp-aarch64:
FAT JARs
CPU on Apache Spark 3.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.1.0.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.1.0.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.1.0.jar
What's Changed
Full Changelog: 5.0.2...5.1.0
This discussion was created from the release Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more!.
Beta Was this translation helpful? Give feedback.
All reactions