Textwash

UPDATE: Textwash is now available for Dutch! See below for details of how you can run the Dutch anonymization model.

Textwash is an automated text anonymisation tool written in Python. The tool can be used to anonymise unstructured text data. To achieve this, Textwash identifies personally identifiable information (e.g., names, dates) in the text and replaces each identified entity with a generic identifier (e.g., Jane Doe is replaced with PERSON_FIRSTNAME_1 PERSON_LASTNAME_1).
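The replacement step described above can be illustrated with a minimal sketch. It is not the Textwash implementation: the function name replace_entities and the (phrase, entity type) input format are assumptions made for illustration, and the detection step is taken as given.

```python
import re
from collections import defaultdict


def replace_entities(text, entities):
    """Replace detected phrases with numbered generic identifiers.

    `entities` is a hypothetical list of (phrase, entity_type) pairs,
    e.g. [("Jane", "PERSON_FIRSTNAME"), ("Doe", "PERSON_LASTNAME")].
    The same phrase always receives the same identifier.
    """
    counters = defaultdict(int)  # next index per entity type
    mapping = {}                 # phrase -> generic identifier
    for phrase, entity_type in entities:
        if phrase not in mapping:
            counters[entity_type] += 1
            mapping[phrase] = f"{entity_type}_{counters[entity_type]}"
    if not mapping:
        return text
    # Try longer phrases first and only match whole words, in a single pass,
    # so that already inserted identifiers are never rewritten.
    alternatives = "|".join(
        re.escape(p) for p in sorted(mapping, key=len, reverse=True)
    )
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(replace_entities(
    "Jane Doe met Jane.",
    [("Jane", "PERSON_FIRSTNAME"), ("Doe", "PERSON_LASTNAME")],
))
```

Numbering identifiers per entity type keeps repeated mentions of the same phrase linked in the output, so the anonymised text remains readable.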

Why is this software special?

Textwash was designed to be a tool that meets the highest standards that we have for text anonymisation. The following principles guided our development decisions:

Complete and transparent evaluation: you can find a full empirical evaluation of this tool in the paper linked below. We put the tool to various tests and show what it can(not) do - this includes a motivated intruder test where humans try to re-identify persons from Textwash-anonymised documents.
Data never leave your system: at no point does the Textwash tool require you to upload (text) data or use an API. The tool is entirely functional offline (you can try it by switching off your Internet connection). This feature is essential to avoid any data leakage or possible risks for your data.
Open source: the code base is open source and can be inspected, used and modified in line with the GNU General Public License 3 (GPL-3.0). We do this because we think it is essential that you know what this tool does.
Learning-based anonymisation: since the information that can reveal personal data is complex, we are not using a dictionary-based approach (e.g., looking up keywords in a static database). Instead, the core of Textwash is a machine learning model that assigns category probabilities to phrases and anonymises them accordingly.
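To make the idea of category probabilities concrete, here is a minimal sketch of how a phrase could be mapped to a category once a model has scored it. This is an assumption for illustration only: the function pick_category, the "NONE" label for non-identifying phrases, and the choice of the most probable category are hypothetical and may differ from what the Textwash model does.

```python
def pick_category(probabilities, non_entity_label="NONE"):
    """Return the most probable category for a phrase, or None.

    `probabilities` is a hypothetical mapping from category name to
    probability. `non_entity_label` is an assumed label for phrases
    that do not need to be anonymised.
    """
    best = max(probabilities, key=probabilities.get)
    return None if best == non_entity_label else best


# A phrase scored as most likely being a first name is anonymised as such.
print(pick_category({"PERSON_FIRSTNAME": 0.9, "LOCATION": 0.05, "NONE": 0.05}))
# A phrase scored as most likely non-identifying is left unchanged.
print(pick_category({"NONE": 0.8, "DATE": 0.2}))
```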

Note for researchers/organisations/other users

We would be glad if Textwash is helpful to you. But even if you prefer to use another tool, we strongly encourage you to ask its developers for, at a bare minimum, (i) an evaluation of their tool that shows empirically what it can and cannot do (you can even point them to our evaluation approach and ask them to show how their tool performs on our evaluation dataset), and (ii) their reasons for requiring you to send your data to online services or an API (you should never do this, and good software does not require it).

If they refuse to provide this, you should be skeptical.

Note for commercial anonymisation tools

We have looked hard to find a tool that is as transparent, open and data-averse (as in: not unnecessarily collecting data) as ours. We did not find any.

If you have a tool that meets these requirements, we would be glad to promote it and list it here. If you think your tool is better, we would love to see your evaluation results - you can use all the data we used and we'd be happy to assist with setting up the human intruder evaluation.

Quick start guide

Textwash is built in Python 3. To run the software, we recommend first creating an Anaconda environment and installing the required dependencies. For details on how to get and install Anaconda, click here.

$ conda create -n textwash python=3.7
$ conda activate textwash
$ pip install -r requirements.txt

Additionally, you need to download the trained model folders from here. Once you have downloaded the tgz file, unpack it and place its contents in the data directory. Important: the model folders (en and nl) should be directly in ./data and not in the models parent directory. The relative paths to the models should be ./data/en and ./data/nl. Otherwise, you will encounter the Repo id must be in the form 'repo_name' ... error.
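Before running the tool, the expected layout can be checked with a few lines of Python. This is a sketch, not part of Textwash: the helper name missing_model_dirs is hypothetical, and it only tests that the ./data/en and ./data/nl folders exist, not that their contents are complete.

```python
from pathlib import Path


def missing_model_dirs(data_dir="data", languages=("en", "nl")):
    """Return the expected model folders that do not exist under data_dir."""
    data_path = Path(data_dir)
    return [
        str(data_path / lang)
        for lang in languages
        if not (data_path / lang).is_dir()
    ]


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing model folders:", ", ".join(missing))
    else:
        print("Model folders found in ./data")
```

Run it from the repository root. If only the English model is needed, pass languages=("en",).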

Using Textwash

Textwash can be used to anonymise txt files. To do this, run anon.py and provide the language via --language ('en' for English and 'nl' for Dutch), the path to the input files via --input_dir and the path to the output folder via --output_dir. For example, running

$ python3 anon.py --language en --input_dir examples --output_dir anonymised_examples --cpu

anonymises the three example texts in the examples directory. In doing so, Textwash loads the downloaded model into memory, then automatically anonymises the inputs and writes the anonymised files to the provided output folder anonymised_examples.

Textwash works best when running on a GPU. If no GPU is available, you should use the --cpu flag as in the snippet above. If you have a GPU, remove the --cpu flag and Textwash will use PyTorch with CUDA support.
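If you call Textwash from a script, the choice between GPU and CPU can be automated. The sketch below is an assumption about how one might wrap the command shown above; the helpers cuda_available and build_command are hypothetical and not part of the repository. It uses only the flags documented in this README.

```python
import importlib.util
import subprocess
import sys


def cuda_available():
    """Return True only if PyTorch is installed and reports a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


def build_command(language, input_dir, output_dir, use_gpu):
    """Build the anon.py command line, adding --cpu when no GPU is used."""
    cmd = [
        sys.executable, "anon.py",
        "--language", language,
        "--input_dir", input_dir,
        "--output_dir", output_dir,
    ]
    if not use_gpu:
        cmd.append("--cpu")
    return cmd


if __name__ == "__main__":
    command = build_command(
        "en", "examples", "anonymised_examples", use_gpu=cuda_available()
    )
    # Run from the repository root so that anon.py and ./data are found.
    subprocess.run(command, check=True)
```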

Entity selection

Textwash can furthermore be restricted to only consider a subset of all available entity types for anonymisation.

The complete list of available entity types is as follows:

ADDRESS
DATE
EMAIL_ADDRESS
LOCATION
NUMERIC
OCCUPATION
ORGANIZATION
OTHER_IDENTIFYING_ATTRIBUTE
PERSON_FIRSTNAME
PERSON_LASTNAME
PHONE_NUMBER
PRONOUN
TIME

Using the --entities flag, individual entity types can be selected for anonymisation. Multiple entity types need to be separated by commas.

For example, if you would only like to anonymise the LOCATION and PERSON_FIRSTNAME entity types, run

$ python3 anon.py --input_dir examples --output_dir anonymised_examples --cpu --entities LOCATION,PERSON_FIRSTNAME
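A misspelled entity type is easy to miss on the command line. The following sketch shows one way to validate a selection against the list above and build the comma-separated value for --entities. The helper entities_argument is hypothetical and not part of Textwash; how anon.py itself reacts to unknown entity types is not covered here.

```python
AVAILABLE_ENTITIES = (
    "ADDRESS", "DATE", "EMAIL_ADDRESS", "LOCATION", "NUMERIC",
    "OCCUPATION", "ORGANIZATION", "OTHER_IDENTIFYING_ATTRIBUTE",
    "PERSON_FIRSTNAME", "PERSON_LASTNAME", "PHONE_NUMBER",
    "PRONOUN", "TIME",
)


def entities_argument(selected):
    """Validate entity types and join them with commas for --entities."""
    unknown = [name for name in selected if name not in AVAILABLE_ENTITIES]
    if unknown:
        raise ValueError(f"Unknown entity types: {', '.join(unknown)}")
    return ",".join(selected)


print(entities_argument(["LOCATION", "PERSON_FIRSTNAME"]))
```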

Examples

You can find examples of detail-rich person descriptions in the examples directory, and the corresponding versions anonymised by Textwash in the examples_anonymised directory.

Who can use Textwash?

Textwash is developed with non-profit open science principles. If you are a researcher, a research organization, working in the public sector or a non-profit organization, you are free to use this software. Please make sure you cite our work as follows:

(will be added soon)

If you intend to use this software commercially without our consent, please be advised that this software is released under the GNU General Public License 3 (GPL-3.0).

You may copy, distribute and modify the software as long as you track changes/dates in source files and keep modifications under GPL. You can distribute your application using a GPL library commercially, but you must also provide the source code.

Who developed Textwash?

Textwash is a multi-year project that is led by Maximilian Mozes (University College London) and Bennett Kleinberg (Tilburg University and University College London).

The work is supported by a SAGE Proof of Concept Grant and an Open Science grant from the Dutch Research Council (NWO).

Questions and Comments

Please open a GitHub Issue if you have any questions or remarks.
