The kEACC field in Unihan 6.2 is woefully out of date. Compared with the mappings in the latest MARC-8 Code Table at the Library of Congress (LoC), it has 8 mappings that differ and is missing 235.
This directory contains an updated table for Unihan derived from the LoC data.
- Unihan_OtherMappings.txt (Unihan 6.2) from the Unicode Consortium
- “MARC-8 to Unicode XML mapping file” from the Library of Congress
loc-eacc-ucs.txt was generated from the LoC MARC-8 table with the loc.xslt XSLT script.
- loc.xslt: XSLT script that extracts the Han ideograph mappings from the LoC XML file. It handles the cases where the EACC code maps to both the PUA and to U+3013. Its output is a file containing two tab-separated columns:
  - The 3-byte EACC code as six hexadecimal digits
  - The USV (Unicode scalar value) of the corresponding Unicode character
- eacc-loc-unihan.lisp: functions for reading the mapping tables and comparing their entries. It uses the CL-PPCRE library, which is easily installed via Quicklisp. It was tested with Clozure Common Lisp and should work with any implementation.
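As an illustration of the two-column format described above, here is a minimal sketch of parsing one line of loc-eacc-ucs.txt using only standard Common Lisp. The function name parse-loc-line is hypothetical and is not taken from eacc-loc-unihan.lisp. The sketch also assumes that the second column is written in hexadecimal, possibly with a "U+" prefix, which the description above does not state.

```lisp
;; Hypothetical helper, not part of eacc-loc-unihan.lisp.
(defun parse-loc-line (line)
  "Split LINE at the first tab. Return two values: the EACC code as a
six-character uppercase string and the code point as an integer.
Return NIL if the line has no tab or either field is malformed."
  (let ((tab (position #\Tab line)))
    (when tab
      (let* ((eacc (string-upcase (subseq line 0 tab)))
             (usv (string-trim '(#\Space #\Return) (subseq line (1+ tab))))
             ;; Assumption: the code point is hexadecimal and may
             ;; carry an optional "U+" prefix, which is skipped.
             (start (if (and (>= (length usv) 2)
                             (string-equal "U+" usv :end2 2))
                        2
                        0))
             (code (parse-integer usv :start start :radix 16
                                      :junk-allowed t)))
        ;; Accept the line only if the EACC field is six hex digits
        ;; and the second column yielded a number.
        (when (and code
                   (= (length eacc) 6)
                   (every (lambda (c) (digit-char-p c 16)) eacc))
          (values eacc code))))))

;; Example with one pair taken from the LoC side of the comparison:
(parse-loc-line (format nil "215C32~C96F6" #\Tab))
;; returns "215C32" and #x96F6 as two values
```

Keeping the EACC code as a string preserves its six-digit form for printing, while the code point becomes an integer so that values can be compared numerically.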
Load eacc-loc-unihan.lisp into your Lisp image and switch to the EACC package.
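Assuming Quicklisp is already set up in the image and eacc-loc-unihan.lisp is in the current directory, those steps might look like the following. Whether the file loads CL-PPCRE on its own is not stated here, so the explicit quickload is only a precaution.

```lisp
;; Assumption: Quicklisp is installed and loaded in this image.
(ql:quickload "cl-ppcre")

;; Assumption: the file sits in the current working directory.
(load "eacc-loc-unihan.lisp")

;; Switch to the package shown in the REPL prompt (EACC).
(in-package :eacc)
```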
    EACC> (defvar *unihan* (read-unihan-eacc-mappings "Unihan_OtherMappings.txt"))
    *UNIHAN*
    EACC> (defvar *loc* (read-loc-eacc-mappings "loc-eacc-ucs.txt"))
    *LOC*
    EACC> (compare-entries *UNIHAN* *LOC*)
    4B5F58 0F9B2 096F6
    215C32 0FA25 09038
    215061 0FA1D 07CBE
    4B7421 0F9A9 056F9
    4B4B3E 0F9AD 073B2
    215F71 0FA1C 09756
    4B333E 0F92E 051B7
    214339 0FA12 06674
    NIL
The output of the call to compare-entries shows the 8 ideographs in EACC whose mapping in Unihan (e.g., U+F9B2) differs from the mapping in the LoC table (e.g., U+96F6). Each line gives the EACC code, the Unihan mapping, and the LoC mapping.
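The idea behind this kind of comparison can be sketched with plain hash tables. This is not the compare-entries function from eacc-loc-unihan.lisp, whose implementation is not shown here. The name differing-mappings and the table layout (EACC string keys, integer code point values) are assumptions made for illustration.

```lisp
;; Hypothetical sketch, not the author's compare-entries.
(defun differing-mappings (a b)
  "Return a list of (EACC VALUE-IN-A VALUE-IN-B) for every key that is
present in both hash tables A and B with different values, sorted by
EACC code."
  (let ((result '()))
    (maphash (lambda (eacc value-a)
               (multiple-value-bind (value-b found) (gethash eacc b)
                 (when (and found (/= value-a value-b))
                   (push (list eacc value-a value-b) result))))
             a)
    (sort result #'string< :key #'first)))

;; Two tiny hand-built tables. The first entry is taken from the
;; session above; the second is an illustrative entry that is simply
;; assumed to agree in both tables.
(let ((unihan (make-hash-table :test #'equal))
      (loc (make-hash-table :test #'equal)))
  (setf (gethash "4B5F58" unihan) #xF9B2
        (gethash "4B5F58" loc)    #x96F6
        (gethash "213021" unihan) #x4E00
        (gethash "213021" loc)    #x4E00)
  ;; Print each difference as: EACC, value in A, value in B.
  (loop for (eacc u l) in (differing-mappings unihan loc)
        do (format t "~A ~5,'0X ~5,'0X~%" eacc u l)))
```

The tables use the EQUAL test because the keys are strings. The second return value of gethash distinguishes a key that is absent from one that is present, so entries missing from B are skipped here and not reported as differences.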
Comparing in the other direction shows the 235 characters that have mappings in the LoC table but no kEACC mapping in Unihan:
    EACC> (compare-entries *LOC* *UNIHAN*)
    4B3474 0537F
    213F53 061F2
    4B5361 089D2
    214456 06813
    ;;; lots deleted
    216053 0985E
    216044 09818
    3A284C 053A9
    45564B 0865E
    NIL
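Finding entries that exist in one table but not in the other is a key-set difference. The following is a minimal sketch under the same assumptions as before: hash tables keyed by EACC strings with integer code point values. The name missing-mappings is hypothetical and this is not the code in eacc-loc-unihan.lisp.

```lisp
;; Hypothetical sketch, not the author's compare-entries.
(defun missing-mappings (a b)
  "Return a list of (EACC . VALUE) for every key of hash table A that
has no entry in hash table B, sorted by EACC code."
  (let ((result '()))
    (maphash (lambda (eacc value)
               ;; The second value of GETHASH is true only when the
               ;; key is actually present in B.
               (unless (nth-value 1 (gethash eacc b))
                 (push (cons eacc value) result)))
             a)
    (sort result #'string< :key #'car)))

;; The LoC side holds one entry from the session above that Unihan
;; lacks; the Unihan table is left empty for the illustration.
(let ((loc (make-hash-table :test #'equal))
      (unihan (make-hash-table :test #'equal)))
  (setf (gethash "213F53" loc) #x61F2)
  (loop for (eacc . value) in (missing-mappings loc unihan)
        do (format t "~A ~5,'0X~%" eacc value)))
```

Swapping the two arguments reports the entries that only the second table lacks, which mirrors the way the session above is run once in each direction.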
The source code is in the public domain: do with it what you will.