Sebastian J. Mielke

A simple, reversible, language-agnostic tokenizer

2018-04-22

tl;dr: download it here.[1]

Use it as python3 reversible_tokenize.py --tok < infile > outfile to tokenize and python3 reversible_tokenize.py --detok < infile > outfile to detokenize.

Described in “Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model” (arXiv / page).

Universal character categories

The Unicode standard defines all symbols in use in current computer systems. In it, each symbol is assigned to exactly one “General category”, e.g., Lu for “Letter, Uppercase”, Ll for “Letter, Lowercase”, Sc for “Symbol, Currency”, or Cc for “Other, Control”. We define the set of “weird” characters, i.e., the characters we want to break the string on, as those whose category starts neither with L (letters), nor with M (marks, such as accents), nor with N (numbers), and that are not “space” either, where “space” means any character for which Python's str.isspace() method returns true. It would be tempting to use Z, the “Separator” category, for this third condition, but Python also classifies some control characters (i.e., characters in Cc) as spaces, so deferring to str.isspace() ensures compatibility with Python's whitespace splitting.
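These categories can be inspected directly with Python's unicodedata module; a quick sketch of how a few sample characters are classified, including the tab character that motivates using str.isspace() over category Z:

```python
import unicodedata

# General categories of a few sample characters:
for ch in ['A', 'a', '$', ',', '5', '\u0301']:
    print(repr(ch), unicodedata.category(ch))
# 'A' is Lu (Letter, Uppercase), 'a' is Ll, '$' is Sc (Symbol, Currency),
# ',' is Po (Punctuation, Other), '5' is Nd (Number, Decimal digit),
# and U+0301 (combining acute accent) is Mn (Mark, Nonspacing).

# Tab is a control character (Cc), not a separator (Z),
# yet Python still treats it as whitespace:
print(unicodedata.category('\t'), '\t'.isspace())  # Cc True
```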

Tokenize

To tokenize a string, we look at each character cᵢ of the string:

  1. If it is not weird, output it as it is.
  2. If it is weird, we need to split and leave markers for detokenization:
    1. If cᵢ₋₁ is not a space (i.e., we are really introducing a new split before this weird character), output a space and a merge symbol “↹”.
    2. Output cᵢ.
    3. If cᵢ₊₁ is not a space (i.e., we are really introducing a new split after this weird character) and not weird (if it is weird, it will split itself off from its left context anyway, so there is no need to split now), output a merge symbol “↹” and a space.

Tokenization thus turns a string like “Some of 100,000 households (usually, a minority) ate breakfast.” into “Some of 100 ↹,↹ 000 households (↹ usually ↹, a minority ↹) ate breakfast ↹.”.
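The steps above can be sketched as a string-returning function (a variant of the stdout-streaming script, for easier experimentation):

```python
import unicodedata

MERGESYMBOL = '↹'

def is_weird(c):
    # "Weird" = not a letter (L), mark (M), or number (N), and not whitespace.
    return not (unicodedata.category(c)[0] in ('L', 'M', 'N') or c.isspace())

def tokenize(s):
    out = []
    for i, c in enumerate(s):
        prev = s[i-1] if i > 0 else c
        nxt = s[i+1] if i < len(s) - 1 else c
        if not is_weird(c):
            out.append(c)
        else:
            if not prev.isspace():
                out.append(' ' + MERGESYMBOL)  # split before the weird char
            out.append(c)
            if not nxt.isspace() and not is_weird(nxt):
                out.append(MERGESYMBOL + ' ')  # split after the weird char
    return ''.join(out)

print(tokenize("Some of 100,000 households (usually, a minority) ate breakfast."))
# → Some of 100 ↹,↹ 000 households (↹ usually ↹, a minority ↹) ate breakfast ↹.
```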

Detokenize

Again, we look at each character cᵢ of the string that is to be detokenized:

  1. If cᵢ is a space, cᵢ₊₁ is the merge symbol “↹”, and cᵢ₊₂ is weird, skip ahead to cᵢ₊₂ (i.e., undo a right split).
  2. Otherwise, if cᵢ is weird, cᵢ₊₁ is the merge symbol “↹”, and cᵢ₊₂ is a space, output cᵢ and move on to cᵢ₊₃ (i.e., undo a left split).
  3. Otherwise, just write out cᵢ and then continue to cᵢ₊₁.
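As a string-returning sketch (again deviating from the stdout-streaming script only in that it builds and returns the result), the detokenizer undoes exactly the markers the tokenizer inserted, so applying it to the tokenized example from above recovers the original sentence:

```python
import unicodedata

MERGESYMBOL = '↹'

def is_weird(c):
    # "Weird" = not a letter (L), mark (M), or number (N), and not whitespace.
    return not (unicodedata.category(c)[0] in ('L', 'M', 'N') or c.isspace())

def detokenize(s):
    out = []
    i = 0
    while i < len(s):
        c = s[i]
        c_n = s[i+1] if i < len(s) - 1 else c
        c_nn = s[i+2] if i < len(s) - 2 else c
        if c == ' ' and c_n == MERGESYMBOL and is_weird(c_nn):
            i += 2                # drop the " ↹" marker before a weird char
        elif is_weird(c) and c_n == MERGESYMBOL and c_nn == ' ':
            out.append(c)         # keep the weird char, drop the "↹ " marker
            i += 3
        else:
            out.append(c)
            i += 1
    return ''.join(out)

tokenized = "Some of 100 ↹,↹ 000 households (↹ usually ↹, a minority ↹) ate breakfast ↹."
print(detokenize(tokenized))
# → Some of 100,000 households (usually, a minority) ate breakfast.
```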

Python implementation

In summary, the relevant methods look like this in Python (note that is_weird also accepts marks, category M, as non-weird, per the fix described in the footnote):

```python
import unicodedata
from sys import stdout

MERGESYMBOL = '↹'

def is_weird(c):
    # "Weird" = not a letter (L), mark (M), or number (N), and not a space.
    return not (unicodedata.category(c)[0] in ('L', 'M', 'N') or c.isspace())

def tokenize(instring):
    for i in range(len(instring)):
        c = instring[i]
        c_p = instring[i-1] if i > 0 else c
        c_n = instring[i+1] if i < len(instring) - 1 else c
        if not is_weird(c):
            stdout.write(c)
        else:
            if not c_p.isspace():
                stdout.write(' ' + MERGESYMBOL)
            stdout.write(c)
            if not c_n.isspace() and not is_weird(c_n):
                stdout.write(MERGESYMBOL + ' ')

def detokenize(instring):
    i = 0
    while i < len(instring):
        c = instring[i]
        c_n = instring[i+1] if i < len(instring) - 1 else c
        c_nn = instring[i+2] if i < len(instring) - 2 else c
        if c + c_n == ' ' + MERGESYMBOL and is_weird(c_nn):
            i += 2  # undo a right split
        elif is_weird(c) and c_n + c_nn == MERGESYMBOL + ' ':
            stdout.write(c)
            i += 3  # undo a left split
        else:
            stdout.write(c)
            i += 1
```



  1. An old version of the script did not include a link to the paper, had no license, and failed to count “marks” (M) as “non-weird” characters, resulting in oversplitting for NFKD encoded strings (or languages that don't have consolidated alternatives to these marks). It was replaced on 2018-06-29. Thanks to Graham Neubig for noticing.