Use topics to find semantically and structurally similar documents

SeeligA/doctopic

DocTopic

What is DocTopic?

A topic modelling and similarity retrieval interface that helps you manage your documents. DocTopic uses Gensim, a popular Python library designed for implementing key NLP algorithms at scale.

  • Some excellent tutorials can be found on the Gensim website. The Gensim team also offers support and professional services.
  • An interactive introduction to similarity search can be found here.

Features

  • Create a searchable corpus from your multilingual documents with two clicks.
  • Use unsupervised training algorithms such as Latent Semantic Analysis and Latent Dirichlet Allocation for topic modelling.
  • Query your corpus to retrieve documents that are structurally similar or belong to a similar domain.
  • Update your search indices with new files so that they can be retrieved later.
  • Use the Jupyter notebook implementation to run the app on a remote server.

Use cases for translation service providers

Identify relevant resources from historical project data such as:

  • previous translations to be used as templates
  • translation vendors who are experts in their field
  • project parameters such as turn-around times, pre-processing steps, etc.

Quickly assess the similarity of files within a project to help with:

  • staggered/cascading deliveries
  • assigning files to multiple vendors

Classify documents automatically and create topic clusters to better understand:

  • the translation needs of your customer segments
  • your level of specialization and how you can use it to build your brand

Installation

DocTopic was created with Python 3.7. It requires Gensim as well as NumPy, SciPy and PyQt5/qtpy. You will probably want to use a virtual environment, for example one managed by conda. The Anaconda distribution already includes NumPy, SciPy and PyQt5/qtpy. Then install Gensim:

pip install -U gensim

Questions

If you found any of the content from this repo helpful, confusing or missing, I would like to hear from you.
