This is a simple library for easily customizable text processing pipelines.
It provides:
- a run-time registry of processors
- a simple mechanism for creating and automatically registering new processors
- a set of built-in processors
- an easy way to use the processors and pipe them together
The registry can be accessed thus:
from stringprocess import registry
The registry is a plain {identifier: function} dict, so it is easy to look up processors, query or filter them, or generate simple documentation from them.
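Because the registry is a plain dict, ordinary dict operations are enough to inspect it. The sketch below uses a hypothetical stand-in dict (`demo_registry`, with made-up identifiers `ws` and `qr`) instead of the real `registry`, so that it runs on its own. The same loops should apply to the real one, assuming its values are functions with docstrings.

```python
# Stand-in for the real registry: a plain {identifier: function} dict.
# The identifiers and functions here are hypothetical, for illustration only.
def strip_whitespace(term):
    "remove leading and trailing whitespace"
    return term.strip()

def quote_replacer(term):
    "replace double quotes with single quotes"
    return term.replace('"', "'")

demo_registry = {"ws": strip_whitespace, "qr": quote_replacer}

# Look up a processor by identifier and call it directly.
cleaned = demo_registry["ws"]("  hello  ")

# Filter: identifiers of processors whose docstring mentions quotes.
quote_ids = sorted(
    ident for ident, func in demo_registry.items()
    if "quote" in (func.__doc__ or "")
)

# Generate simple documentation, one line per processor.
doc_lines = [
    f"{ident}: {func.__name__} - {func.__doc__}"
    for ident, func in sorted(demo_registry.items())
]
print("\n".join(doc_lines))
```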
There are currently three types of processors:
- removers remove parts of the input text
- converters replace parts of the text
- validators drop an entire input term based on some criterion
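As a rough illustration of the three kinds, each can be thought of as a plain function of one term. The function names below are hypothetical, and returning `None` from a validator to signal a dropped term is an assumption made for this sketch; the text above does not say how the library signals a drop.

```python
import re

def remove_digits(term):
    "remover: delete every digit from the term"
    return re.sub(r"\d", "", term)

def lowercase(term):
    "converter: replace the term with its lower-cased form"
    return term.lower()

def min_length_3(term):
    "validator: keep the term only if it has at least 3 characters"
    # Assumption for this sketch: None means 'drop this term'.
    return term if len(term) >= 3 else None

def run(term, processors):
    "apply processors in order; stop as soon as one drops the term"
    for proc in processors:
        term = proc(term)
        if term is None:
            return None
    return term

results = [run(t, [remove_digits, lowercase, min_length_3]) for t in ["Abc123", "X9"]]
```

Here the first term survives all three steps, while the second is shortened by the remover and then dropped by the validator.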
The built-in processors are listed in
Writing a new one is as simple as adding a function to a, say,
from stringprocess.processors.registry import converter

# Assumption: the decorator takes the identifier; 'qr' is the one used below.
@converter('qr')
def quote_replacer(term):
    "replace double quotes with single quotes"
    return term.replace('"', "'")
To use the processor, the module only needs to be imported before the registry is accessed or terms are processed, because the decorator does the registration at import time. So:
>>> import myconverters
>>> from stringprocess import process_terms
>>> terms = [
...     'he said "who knows" and was silent',
...     'and I responded "yes, who knows" and shrugged',
... ]
>>> tuple(process_terms('qr', terms))
("he said 'who knows' and was silent", "and I responded 'yes, who knows' and shrugged")
The built-in ones are registered automatically when the registry is imported.
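The registration-by-decorator idea can be sketched in a few lines. This is not the library's implementation: `demo_registry`, `register`, and `process` are hypothetical names, and passing the identifier as a decorator argument is an assumption. The sketch only shows why importing a module is enough to make its processors available, and how identifiers can be piped together.

```python
demo_registry = {}  # hypothetical stand-in for the library's registry

def register(identifier):
    "return a decorator that stores the function under `identifier`"
    def decorator(func):
        demo_registry[identifier] = func
        return func  # leave the function itself unchanged
    return decorator

# The decorator runs when this definition is executed, i.e. at import time.
@register("qr")
def quote_replacer(term):
    "replace double quotes with single quotes"
    return term.replace('"', "'")

@register("up")
def upper(term):
    "convert to upper case"
    return term.upper()

def process(identifiers, terms):
    "pipe each term through the named processors, in order"
    for term in terms:
        for ident in identifiers:
            term = demo_registry[ident](term)
        yield term

out = tuple(process(["qr", "up"], ['say "hi"']))
```

Returning the original function from the decorator keeps it usable as an ordinary function as well as through the registry.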