Text Normalization - Ethan Young

Text normalization is a way to preserve as much signal as possible while minimizing noise. It’s important that you execute the filters in the correct order and apply them to content and queries **identically**. Steps include:

1. Unicode normalization to Normalization Form D (NFD, canonical decomposition), which splits each accented character into a base letter followed by combining marks: https://docs.python.org/2/library/unicodedata.html, https://unicode.org/faq/normalization.html
2. Removing accents ("diacritics"), which relies on the decomposition from step 1: https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#stripAccents-java.lang.String-
3. Case folding: converting all strings to lowercase with https://docs.python.org/2/library/stdtypes.html#str.lower

Be careful not to be so finicky about precision that you give up too much recall. The character filters described here are conservative, and you’ll probably want all of them in order to achieve effective string matching.
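The three steps above can be sketched in Python using only the standard library. This is a minimal illustration, not a definitive implementation: the function name `normalize_text` is hypothetical, it assumes Python 3 (where `str` is Unicode), and the accent-removal step is an approximation of what `StringUtils.stripAccents` does in Java, written here by dropping every character with a nonzero canonical combining class.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFD decomposition, strip combining marks, then lowercase.

    Hypothetical helper for illustration; apply the same function to
    both indexed content and incoming queries.
    """
    # Step 1: canonical decomposition, e.g. "é" becomes "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)

    # Step 2: drop combining marks (assumption: a nonzero canonical
    # combining class is a good enough test for "accent" here).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.combining(ch) == 0
    )

    # Step 3: lowercase with str.lower, as in the list above.
    return stripped.lower()


if __name__ == "__main__":
    print(normalize_text("Crème Brûlée"))  # creme brulee
```

Because the same function runs on both sides, a query such as `CREME BRULEE` and a document containing `Crème Brûlée` reduce to the same string. The order matters: if the accent filter ran before decomposition, a precomposed character such as `é` would contain no separate combining mark to remove.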