Major new release provides production-grade, scalable, and trainable versions of the most accurate natural language processing techniques available to the open-source community today
Delaware, 12th May 2020 - John Snow Labs is thrilled to announce the immediate availability of the new major version of Spark NLP – the world’s most widely used natural language processing library in the enterprise. The library can be used from Python, Java, and Scala APIs and comes with over 150 pre-trained models & pipelines.
“When we started planning for Spark NLP 2.5 a few months ago, the world was a different place. We have been blown away by the use of Natural Language Processing for early outbreak detections, question-answering chatbot services, text analysis of medical records, monitoring efforts to minimize the spread of COVID-19, and many more.” – said Maziyar Panahi, a lead contributor to Spark NLP.
Spark NLP 2.5 is another milestone in John Snow Labs’ quest to provide the open-source community with the most accurate NLP algorithms & models ever invented. Making the most recent academic advances available as a production-grade, scalable, and trainable software library helps the global data science community make faster progress towards putting AI to good use. Here are the major accuracy-enhancing capabilities this new release makes available.
ALBERT and XLNet embeddings
“Beyond BERT” embeddings have been part of Spark NLP for Healthcare for a while and are now coming to the open-source package. ALBERT is “a Lite BERT” and provides almost the same accuracy as BERT (for example, when used for named entity recognition) while requiring only about 6% of the memory. You can use it on memory-limited edge devices, or when loading models quickly on startup is a priority.
XLNet is a more advanced contextual embedding architecture than BERT and is known to perform particularly well on tasks like question answering. It is now available within Spark NLP – and the library takes care of the engineering heavy lifting required for caching, distributing, tokenizing, and reusing it across NLP pipelines.
Spark NLP already has native support for word, chunk, sentence, and document encodings. The Universal Sentence Encoder has been part of the library since 2.4 and does well at measuring semantic similarity between sentences.
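Semantic similarity between sentence encodings is commonly scored with cosine similarity. The following is a minimal sketch of that comparison in plain Python. The three-dimensional vectors are made-up stand-ins, not real Universal Sentence Encoder output, and `cosine_similarity` is a helper defined here for illustration rather than a Spark NLP function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for sentence encodings (assumed values, not model output).
weather_report = [0.9, 0.1, 0.0]
forecast = [0.8, 0.2, 0.1]
recipe = [0.0, 0.2, 0.9]

similar = cosine_similarity(weather_report, forecast)
different = cosine_similarity(weather_report, recipe)
```

With real sentence encodings the vectors would have hundreds of dimensions, but the comparison works the same way: sentences with related meaning should score closer to 1 than unrelated ones.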
New Contextual Spell Checker
This is a whole new, trainable, deep-learning-based spell checking algorithm that takes into account a word’s context before recommending how to correct it:
"I will call my siter." [sister]
"Due to bad weather , we had to move to a different siter." [site]
"We travelled to three siter in the summer." [sites]
"During the summer we have the best ueather." [weather]
"I have a black ueather jacket, so nice." [leather]
"I introduce you to my sister, she is called ueather." [Heather]
See how the model handles singular vs. plural nouns and personal names well (these examples use the pre-trained English model). This model delivers a word error rate of 8.09% for fully automatic correction on the Holbrook benchmark. This is the best we are aware of – compare with the 20.24% error rate that JamSpell attains on the same benchmark.
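To make the error-rate figures concrete, here is a small Python sketch. It assumes a simplified, position-wise definition of word error rate (the share of words that differ from the intended sentence); the release does not describe the benchmark's exact scoring procedure, so `word_error_rate` is an illustrative helper only. The last lines compute the relative error reduction implied by the two reported rates.

```python
def word_error_rate(predicted, reference):
    """Fraction of reference words the prediction gets wrong, compared position by position.

    Simplifying assumption: both sentences split into the same number of words.
    """
    pred_words = predicted.split()
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    if len(pred_words) != len(ref_words):
        raise ValueError("this sketch only handles equal-length sentences")
    errors = sum(p != r for p, r in zip(pred_words, ref_words))
    return errors / len(ref_words)


# First example from the release: the uncorrected input vs. the intended sentence.
raw = "I will call my siter."
intended = "I will call my sister."
rate_before = word_error_rate(raw, intended)     # one of five words is wrong
rate_after = word_error_rate(intended, intended)  # a perfect correction

# Relative error reduction implied by the two rates reported above (in percent).
spark_nlp_rate = 8.09
jamspell_rate = 20.24
relative_reduction = (jamspell_rate - spark_nlp_rate) / jamspell_rate
```

Under this reading, the reported figures correspond to roughly 60% fewer word errors than JamSpell on the same benchmark.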
New Deep-Learning Sentiment Analysis
The SentimentDL annotator applies contextual embeddings and a state-of-the-art deep learning architecture to train multi-class sentiment analysis models. Two pre-trained models are also part of this release: one trained on IMDB reviews with an accuracy of 91%, and one trained on "Twitter sentiment 140 - 1.6 million tweets" with an accuracy of 89%.
SentimentDL can also handle neutral statements (in addition to positive and negative ones) and returns a ratio between 0 and 1 for how positive (or negative) a statement is.
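The release does not say how the neutral label is derived. One common approach, sketched below in plain Python, is to fall back to "neutral" when no class score reaches a confidence threshold. The function name `label_with_neutral`, the 0.6 threshold, and the example scores are all hypothetical and are not taken from the Spark NLP API.

```python
def label_with_neutral(scores, threshold=0.6, neutral_label="neutral"):
    """Return the highest-scoring label, or the neutral label when confidence is low.

    `scores` maps each label to a value between 0 and 1 (hypothetical input format).
    """
    if not scores:
        raise ValueError("scores must contain at least one label")
    best_label = max(scores, key=scores.get)
    if scores[best_label] < threshold:
        return neutral_label
    return best_label


# Hypothetical scores for two statements.
confident = label_with_neutral({"positive": 0.92, "negative": 0.08})
uncertain = label_with_neutral({"positive": 0.55, "negative": 0.45})
```

Here the first statement is labeled "positive", while the second falls below the threshold and is treated as "neutral".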
The deep-learning Document Classification annotator now supports classifying between 100 classes (up from 50 in the previous release). It also comes with two new pre-trained models – trained with the TREC-6 and TREC-50 benchmark datasets for question classification.
The Spark NLP community has been growing rapidly, with monthly downloads up by over 50% from January to April 2020 alone. This release broadens that community substantially by providing direct support for 14 new languages and adding 87 new out-of-the-box NLP models. As always, we thank our community for their feedback, bug reports, and contributions that made this release possible.
ABOUT JOHN SNOW LABS
John Snow Labs is an award-winning AI and NLP company, accelerating progress in data science by providing state-of-the-art models, data and platforms. Founded in 2015, the firm helps healthcare and life sciences organizations build, scale, deploy, and operate AI products and services (through on-premise or cloud architectures) in order to collect and prepare patient data for analysis. In 2018, the company won the CIO Review's AI Solution Provider of the Year award. It also won the Strata Data Award in 2019 for delivering Spark NLP – the world’s most widely used NLP library in the enterprise, and the AI Excellence Award 2020. Today, John Snow Labs is managed by a global team of data specialists who hold a Ph.D., M.D., or Master’s degree in disciplines covering data science, medicine, data engineering, pharma, data security, and DataOps. Presently, top medical and pharmaceutical brands, such as Johnson & Johnson, Roche, and Kaiser Permanente, are part of John Snow Labs’ clientele.