Skip to content

A library that removes common unicode confusables/homoglyphs from strings.

License

Notifications You must be signed in to change notification settings

null8626/decancer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

6ec919c Β· Mar 20, 2025
Mar 20, 2025
Mar 20, 2025
Mar 20, 2025
Oct 29, 2024
Oct 29, 2024
Oct 29, 2024
Oct 29, 2024
Oct 28, 2024
Oct 29, 2024
Oct 29, 2024
Mar 20, 2025
Oct 29, 2024
Oct 28, 2024

Repository files navigation

decancer npm crates.io npm downloads crates.io downloads codacy ko-fi

A library that removes common unicode confusables/homoglyphs from strings.

  • Its core is written in Rust and utilizes a form of Binary Search to ensure speed!
  • By default, it's capable of filtering 221,529 (19.88%) different unicode codepoints like:
  • Unlike other packages, this package is unicode bidi-aware where it also interprets right-to-left characters in the same way as it were to be rendered by an application!
  • Its behavior is also highly customizable to your liking!
  • And it's available in the following languages:

Installation

Rust (v1.65 or later)

In your Cargo.toml:

decancer = "3.2.8"
JavaScript (Node.js)

In your shell:

npm install decancer

In your code (CommonJS):

const decancer = require('decancer')

In your code (ESM):

import decancer from 'decancer'
JavaScript (Browser)

In your code:

<script type="module">
  import init from 'https://cdn.jsdelivr.net/gh/null8626/decancer@v3.2.8/bindings/wasm/bin/decancer.min.js'

  const decancer = await init()
</script>
Java

As a JAR file

You can download the latest JAR file here.

As a dependency

In your build.gradle:

repositories {
  mavenCentral()
  maven { url 'https://jitpack.io' }
}

dependencies {
  implementation 'io.github.null8626:decancer:3.2.8'
}

In your pom.xml:

<repositories>
  <repository>
    <id>central</id>
    <url>https://repo.maven.apache.org/maven2</url>
  </repository>
  <repository>
    <id>jitpack.io</id>
    <url>https://jitpack.io</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>io.github.null8626</groupId>
    <artifactId>decancer</artifactId>
    <version>3.2.8</version>
  </dependency>
</dependencies>

Building from source

Windows:

git clone https://github.com/null8626/decancer.git --depth 1
cd .\decancer\bindings\java
powershell -NoLogo -NoProfile -NonInteractive -Command "Expand-Archive -Path .\bin\bindings.zip -DestinationPath .\bin -Force"
gradle build -x test

macOS/Linux:

git clone https://github.com/null8626/decancer.git --depth 1
cd ./decancer/bindings/java
unzip ./bin/bindings.zip -d ./bin
chmod +x ./gradlew
./gradlew build -x test

Tip: You can shrink the size of the resulting JAR file by removing binaries in the bin directory for the platforms you don't want to support.

C/C++

Download

Building from source

Building from source requires Rust v1.65 or later.

git clone https://github.com/null8626/decancer.git --depth 1
cd decancer/bindings/native
cargo build --release

And the binary files should be generated in the target/release directory.

Examples

Rust

For more information, please read the documentation.

let mut cured = decancer::cure!(r"vοΌ₯ⓑ𝔂 π”½π•ŒΕ‡β„•ο½™ ţ乇𝕏𝓣 wWiIiIIttHh l133t5p3/-\|<").unwrap();

assert_eq!(cured, "very funny text with leetspeak");

// WARNING: it's NOT recommended to coerce this output to a Rust string
//          and process it manually from there, as decancer has its own
//          custom comparison measures, including leetspeak matching!
assert_ne!(cured.as_str(), "very funny text with leetspeak");

assert!(cured.contains("funny"));

cured.censor("funny", '*');
assert_eq!(cured, "very ***** text with leetspeak");

cured.censor_multiple(["very", "text"], '-');
assert_eq!(cured, "---- ***** ---- with leetspeak");
JavaScript (Node.js)
const assert = require('assert')
const cured = decancer('vοΌ₯ⓑ𝔂 π”½π•ŒΕ‡β„•ο½™ ţ乇𝕏𝓣 wWiIiIIttHh l133t5p3/-\\|<')

assert(cured.equals('very funny text with leetspeak'))

// WARNING: it's NOT recommended to coerce this output to a JavaScript string
//          and process it manually from there, as decancer has its own
//          custom comparison measures, including leetspeak matching!
assert(cured.toString() !== 'very funny text with leetspeak')
console.log(cured.toString())
// => very funny text wwiiiiitthh l133t5p3/-\|<

assert(cured.contains('funny'))

cured.censor('funny', '*')
console.log(cured.toString())
// => very ***** text wwiiiiitthh l133t5p3/-\|<

cured.censorMultiple(['very', 'text'], '-')
console.log(cured.toString())
// => ---- ***** ---- wwiiiiitthh l133t5p3/-\|<
JavaScript (Browser)
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>Decancerer!!! (tm)</title>
    <style>
      textarea {
        font-size: 30px;
      }
    
      #cure {
        font-size: 20px;
        padding: 5px 30px;
      }
    </style>
  </head>
  <body>
    <h3>Input cancerous text here:</h3>
    <textarea rows="10" cols="30"></textarea>
    <br />
    <button id="cure" onclick="cure()">cure!</button>
    <script type="module">
      import init from 'https://cdn.jsdelivr.net/gh/null8626/decancer@v3.2.8/bindings/wasm/bin/decancer.min.js'
    
      const decancer = await init()
    
      window.cure = function () {
        const textarea = document.querySelector('textarea')
        
        if (!textarea.value.length) {
          return alert("There's no text!!!")
        }
        
        textarea.value = decancer(textarea.value).toString()
      }
    </script>
  </body>
</html>

See this in action here.

Java

For more information, please read the documentation.

import io.github.null8626.decancer.CuredString;

public class Program {
  public static void main(String[] args) {
    CuredString cured = new CuredString("vοΌ₯ⓑ𝔂 π”½π•ŒΕ‡β„•ο½™ ţ乇𝕏𝓣 wWiIiIIttHh l133t5p3/-\\|<");
    
    assert cured.equals("very funny text with leetspeak");
    
    // WARNING: it's NOT recommended to coerce this output to a Java String
    //          and process it manually from there, as decancer has its own
    //          custom comparison measures, including leetspeak matching!
    assert !cured.toString().equals("very funny text with leetspeak");
    System.out.println(cured.toString());
    // => very funny text wwiiiiitthh l133t5p3/-\|<
    
    assert cured.contains("funny");
    
    cured.censor("funny", '*');
    System.out.println(cured.toString());
    // => very ***** text wwiiiiitthh l133t5p3/-\|<
    
    String[] keywords = { "very", "text" };
    cured.censorMultiple(keywords, '-');
    System.out.println(cured.toString());
    // => ---- ***** ---- wwiiiiitthh l133t5p3/-\|<
    
    cured.destroy();
  }
}
C/C++

For more information, please read the documentation.

UTF-8 example:

#include <decancer.h>

#include <string.h>
#include <stdlib.h>
#include <stdio.h>

#define decancer_assert(expr, notes)                           \
  if (!(expr)) {                                               \
    fprintf(stderr, "assertion failure at " notes "\n");       \
    ret = 1;                                                   \
    goto END;                                                  \
  }

int main(void) {
  int ret = 0;

  // UTF-8 bytes for "vοΌ₯ⓑ𝔂 π”½π•ŒΕ‡β„•ο½™ ţ乇𝕏𝓣"
  uint8_t input[] = {0x76, 0xef, 0xbc, 0xa5, 0xe2, 0x93, 0xa1, 0xf0, 0x9d, 0x94, 0x82, 0x20, 0xf0, 0x9d,
                     0x94, 0xbd, 0xf0, 0x9d, 0x95, 0x8c, 0xc5, 0x87, 0xe2, 0x84, 0x95, 0xef, 0xbd, 0x99,
                     0x20, 0xc5, 0xa3, 0xe4, 0xb9, 0x87, 0xf0, 0x9d, 0x95, 0x8f, 0xf0, 0x9d, 0x93, 0xa3};

  decancer_error_t error;
  decancer_cured_t cured = decancer_cure(input, sizeof(input), DECANCER_OPTION_DEFAULT, &error);

  if (cured == NULL) {
    fprintf(stderr, "curing error: %.*s\n", (int)error.message_length, error.message);
    return 1;
  }

  decancer_assert(decancer_contains(cured, "funny", 5), "decancer_contains");

END:
  decancer_cured_free(cured);
  return ret;
}

UTF-16 example:

#include <decancer.h>

#include <string.h>
#include <stdlib.h>
#include <stdio.h>

#define decancer_assert(expr, notes)                           \
  if (!(expr)) {                                               \
    fprintf(stderr, "assertion failure at " notes "\n");       \
    ret = 1;                                                   \
    goto END;                                                  \
  }

int main(void) {
  int ret = 0;

  // UTF-16 bytes for "vοΌ₯ⓑ𝔂 π”½π•ŒΕ‡β„•ο½™ ţ乇𝕏𝓣"
  uint16_t input[] = {
    0x0076, 0xff25, 0x24e1,
    0xd835, 0xdd02, 0x0020,
    0xd835, 0xdd3d, 0xd835,
    0xdd4c, 0x0147, 0x2115,
    0xff59, 0x0020, 0x0163,
    0x4e47, 0xd835, 0xdd4f,
    0xd835, 0xdce3
  };

  // UTF-16 bytes for "funny"
  uint16_t funny[] = { 0x66, 0x75, 0x6e, 0x6e, 0x79 };

  decancer_error_t error;
  decancer_cured_t cured = decancer_cure_utf16(input, sizeof(input) / sizeof(uint16_t), DECANCER_OPTION_DEFAULT, &error);

  if (cured == NULL) {
    fprintf(stderr, "curing error: %.*s\n", (int)error.message_length, error.message);
    return 1;
  }

  decancer_assert(decancer_contains_utf16(cured, funny, sizeof(funny) / sizeof(uint16_t)), "decancer_contains_utf16");

END:
  decancer_cured_free(cured);
  return ret;
}

Donations

If you want to support my eyes for manually looking at thousands of unicode characters, consider donating! ❀

ko-fi

Contributing

Please read CONTRIBUTING.md for newbie contributors who want to contribute!