
Knowledge Mining solution accelerator

This repository contains all the code for deploying an end-to-end Knowledge Mining solution based on Azure Cognitive Search.

It is built on top of standard Azure services such as Functions, Web App Services, Cognitive Services & Cognitive Search. It provides a deployment framework allowing quick and easy setup of CI/CD pipelines for your projects.

For detailed documentation please refer to the docs section of the repo containing the solution wiki.

Before you start

To successfully set up your solution, you will need access to (or to have provisioned) the following:

  • Access to an Azure subscription (required)
  • Access to an Azure DevOps subscription (optional)

An Owner or Contributor role is assumed on the Azure subscription or the targeted Resource Group.

Getting Started

Please refer to the README to deploy this solution accelerator.

The directions provided in all guides assume you have a fundamental working knowledge of the Azure portal, Azure Functions, Azure Cognitive Search, Azure Storage and Azure Cognitive Services.

For additional training and support, please see:

Knowledge Mining overview

Knowledge mining (KM) is an emerging discipline in artificial intelligence (AI) that uses a combination of intelligent services to quickly learn from vast amounts of information. It allows organizations to deeply understand and easily explore information, uncover hidden insights, and find relationships and patterns at scale.

Knowledge Mining in Azure

What is this solution accelerator?

This KM solution accelerator aims to provide you with a workable end-to-end Knowledge Mining solution composed of:

  • Ingestion
    • Data ingestion from Azure Data Lake
  • Enrichment
    • Data enrichment with Azure Applied AI and Cognitive Services
  • Exploration
    • Keyword and Semantic search (see the query sketch after this list)
    • Support for multiple search indexes
    • Content security model (permissions)
    • Modular User Interface
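
As a concrete illustration of the exploration features above, the sketch below issues a single query against an Azure Cognitive Search index from PowerShell, combining semantic ranking with a security-trimming filter. The index name, semantic configuration name and the permissions field are illustrative assumptions, not the accelerator's actual schema.

    # Minimal sketch: semantic query with security trimming against Azure Cognitive Search.
    # The index name "km-documents", the semantic configuration "default" and the
    # "permissions" field are assumptions for illustration only.
    $searchService = "<your-search-service-name>"
    $apiKey        = "<your-query-api-key>"

    $body = @{
        search                = "contract renewal terms"
        queryType             = "semantic"
        queryLanguage         = "en-us"
        semanticConfiguration = "default"
        # Security trimming: only return documents the caller's groups are allowed to see.
        filter                = "permissions/any(g: search.in(g, 'group-a, group-b'))"
        top                   = 10
    } | ConvertTo-Json

    Invoke-RestMethod -Method Post `
        -Uri "https://$searchService.search.windows.net/indexes/km-documents/docs/search?api-version=2021-04-30-Preview" `
        -Headers @{ "api-key" = $apiKey } `
        -ContentType "application/json" `
        -Body $body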

With this cloud-based accelerator you get an end-to-end solution with the tools to deploy, extend, operate & monitor it.

In that respect, the solution provides:

  • Azure Web App Authentication support
  • High configurability (json)
  • Full Extensibility
  • Operations (PowerShell-based)
  • Azure Pipelines for CI/CD
  • Deployment framework (manual or through CI/CD)
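
Because so much of the solution is driven by JSON configuration, everyday changes are typically configuration edits rather than code changes. The snippet below is a purely hypothetical illustration of inspecting such a file from a PowerShell operations script; the real file names and schema are described in the Configuration documentation.

    # Hypothetical illustration only: the actual configuration file names and
    # properties are defined in the repo's configuration folder and documentation.
    $config = Get-Content -Path ".\configuration\search-settings.json" -Raw | ConvertFrom-Json
    Write-Host "UI will query index: $($config.searchIndexName)"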

Why a knowledge mining solution accelerator?

This Knowledge Mining solution accelerator is inspired by another accelerator, the Knowledge Mining Solution Accelerator.

Based on our field experience, we built features and skills to address common unstructured data challenges, focusing on usability and the data exploration experience.

Below is a non-exhaustive list of key highlights:

  • Embedded image indexing

    • Images embedded in documents are indexed as documents in their own right, not just for keyword search recall.
    • PDF pages are extracted as images (configurable).
    • A custom version of Apache Tika is used for image extraction.
    • Overcomes the limit of 1,000 normalized images.
  • Image normalization:

    • Handling oversized images for OCR completeness
    • Support for the TIFF format
    • Thumbnail creation for UI support
  • Metadata

    • Using Apache Tika, we give you access to all metadata present in each document or image. A common scenario is images with geo-location metadata, i.e. EXIF GPS coordinates.
  • HTML Conversion

    • Having an HTML representation of a document can ease some NLP work.
    • A table of contents is a common structure, which we expose in the HTML representation of a PDF.
  • Table extraction: tabular information is common in unstructured data corpora. The solution will extract, index and project tables to a dedicated knowledge store (optional).

  • Translation: there are two translation features in this solution

    • Text Translation: non-native content and titles are normalized to a defined language (default is English).
    • Document Translation: non-native documents are translated and then follow the same document processing as any other document. Translated documents will provide you with translated tables, for instance.
  • Text Analytics: extracts entities (named, linked) from any document and from OCR'ed image text (see the skillset sketch after this list).

  • Export to Excel: a popular ask when exploring unstructured data.

  • Configurable UI: building a UI is time-consuming, so we wanted to provide great UI configurability, letting you bring new KM solutions to life in a timely manner.
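
The enrichment features above are implemented as Azure Cognitive Search skills (built-in and custom). As a minimal sketch of how such skills are wired together, the snippet below creates a deliberately tiny skillset with only the built-in translation and entity recognition skills via the Cognitive Search REST API. The skillset name and field paths are assumptions; the accelerator ships its own, much richer skillset definitions in its configuration.

    # Minimal sketch: create or update a two-skill Cognitive Search skillset via the REST API.
    # "km-skillset-sample" and the field paths are illustrative only.
    $searchService = "<your-search-service-name>"
    $adminKey      = "<your-admin-api-key>"

    $skillset = @{
        name   = "km-skillset-sample"
        skills = @(
            @{
                "@odata.type"         = "#Microsoft.Skills.Text.TranslationSkill"
                context               = "/document"
                defaultToLanguageCode = "en"
                inputs                = @( @{ name = "text"; source = "/document/content" } )
                outputs               = @( @{ name = "translatedText"; targetName = "translated_text" } )
            },
            @{
                "@odata.type" = "#Microsoft.Skills.Text.V3.EntityRecognitionSkill"
                context       = "/document"
                inputs        = @( @{ name = "text"; source = "/document/translated_text" } )
                outputs       = @( @{ name = "organizations"; targetName = "organizations" } )
            }
        )
    } | ConvertTo-Json -Depth 10

    Invoke-RestMethod -Method Put `
        -Uri "https://$searchService.search.windows.net/skillsets/km-skillset-sample?api-version=2021-04-30-Preview" `
        -Headers @{ "api-key" = $adminKey } `
        -ContentType "application/json" `
        -Body $skillset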

What Knowledge Mining scenarios does this accelerator target?

The spirit of this solution accelerator is that of a Content Research KM scenario.

Nevertheless, since its architecture is open, you could use it as a foundation for more specialized KM scenarios.

This solution accelerator is not targeted at any particular domain, although its extensibility gives you the tools to make it domain-specific.

Some inspirational use-cases

  • AI-driven Data & Web Exploration
  • Unstructured data Insights extraction (mine the unseen value)
  • AI-Driven Strategy planning tool
  • Intranet Semantic Search
  • R&D portal for data discovery, patterns extraction & patents exploration
  • etc.

You may consider productizing this accelerator for your organization.

Who is the target audience?

This solution accelerator targets anyone who wants to:

  • Build a Proof of Concept to showcase Knowledge Mining to stakeholders
  • Deploy an end-to-end KM solution for immediate production use
  • Learn how to build a KM solution on Azure
  • Have a playground for evaluating Azure Machine Learning, Cognitive & Applied AI Services

Data Science Toolkit Integration

This solution accelerator also aims to ease the integration of Data Science modules into your knowledge mining solution.

The Data Science Toolkit team has built accelerators for your data science workload.

Solution Description
Verseagility Verseagility is a Python-based toolkit to ramp up your custom natural language processing (NLP) task, allowing you to bring your own data, use your preferred frameworks and bring models into production. It is a central component of the Microsoft Data Science Toolkit.
MLOps Base This repository contains the basic repository structure for machine learning projects based on Azure technologies (Azure ML and Azure DevOps). The folder names and files are chosen based on personal experience. You can find the principles and ideas behind the structure, which we recommend following when customizing your own project and MLOps process. Users are expected to be familiar with Azure Machine Learning concepts and how to use the technology.
MLOps for DataBricks This repository contains the Databricks development framework for delivering any Data Engineering projects, and machine learning projects based on the Azure Technologies.
Classification Solution Accelerator This repository contains the basic repository structure for delivering classification solutions for machine learning (ML) projects based on Azure technologies (Azure ML and Azure DevOps).
Object Detection Solution Accelerator This repository contains all the code for training TensorFlow object detection models within Azure Machine Learning (AML) with setups for training on Azure compute, experiment monitoring and endpoint deployment as a webservice. It is built on the MLOps Accelerator and provides end-to-end training and deployment pipelines allowing quick and easy setup of CI/CD pipelines for your projects.

Documentation

You may refer to the solution accelerator documentation as follows:

Topic Description Documentation Link
Pre-Requisites What do you need to deploy & operate the solution README
Architecture How the solution is architected README
Deployment How to deploy this solution accelerator README
Configuration All you need to know about the solution accelerator configuration README
Data Science Integration with Data Science README
Deployment How to get started by deploying the solution README
Monitoring How to monitor the solution README
Search How search is configured and managed README
Search & Explore (UI) User Interface to Search & Explore README

Repository Structure

The repository structure of this accelerator is as follows:


  • azure-pipelines - Azure DevOps pipelines to set up your CI/CD
  • configuration - solution configuration
  • data - sample data to validate the solution deployment.
    • documents : sample documents for your KM solution
  • deployment - Configuration & scripts for deployment & operations
    • config : contains the entire solution base configuration
    • modules : PowerShell modules
    • scripts : Deployment scripts
    • init_env.ps1 : Environment initialization script
  • docs - contains solution documentation wiki in .md format. Designed to be imported as an Azure DevOps wiki.
  • overlay - Source code
  • src - Source code
    • CognitiveSearch.Skills : Custom skills
    • CognitiveSearch.UI : User Interface (.NET Core MVC)
    • Data Science : placeholder to add your data science modules.

How to use this accelerator?

Clone or download this repository and then navigate to the Deployment folder, following the steps outlined in the deployment guide.

When you complete all of the steps, you'll have a working end-to-end knowledge mining solution that combines data source ingestion with data enrichment skills and a web app powered by Azure Cognitive Search.
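
A minimal sketch of those first steps from a PowerShell prompt is shown below; the deployment folder and the init_env.ps1 script are part of the repository structure described above, and any parameters they expect are covered in the deployment guide rather than here.

    # Get the code and move to the deployment folder.
    git clone https://github.com/microsoft/dstoolkit-km-solution-accelerator.git
    Set-Location .\dstoolkit-km-solution-accelerator\deployment

    # Initialize your environment, then follow the deployment guide step by step.
    .\init_env.ps1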

Credits

This solution is inspired by the original work of the

Core contributors to this solution accelerator are

Special Thanks

The data science toolkit sponsorship team

For the great conversation on Knowledge Mining and Unstructured Data

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.