[normalization] Add NORMALIZE_STRING_SIMPLE_LATIN_ASCII option so the parser can normalize punctuation, HTML entities, etc. without touching the alphanumeric parts of the original input

2016-08-21 19:45:32 -04:00
parent 8b9702b43d
commit 58851a9088
2 changed files with 39 additions and 9 deletions
--- a/src/normalize.h
+++ b/src/normalize.h
@@ -46,6 +46,7 @@ As well as normalizations for individual string tokens:
 #define NORMALIZE_STRING_TRIM 1 << 5
 #define NORMALIZE_STRING_REPLACE_HYPHENS 1 << 6
 #define NORMALIZE_STRING_COMPOSE 1 << 7
+#define NORMALIZE_STRING_SIMPLE_LATIN_ASCII 1 << 8

 #define NORMALIZE_TOKEN_REPLACE_HYPHENS 1 << 0
 #define NORMALIZE_TOKEN_DELETE_HYPHENS 1 << 1
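
The NORMALIZE_STRING_* values in the hunk above are single-bit masks, so the new option can be OR-ed together with the existing ones and tested with a bitwise AND. The following is a minimal sketch of that pattern, not code from this commit: the macro values are copied from the hunk, but the parentheses around them, the uint64_t options type, and the helpers has_option and example_options are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the hunk above. The parentheses are added in this
   sketch so the macros stay safe inside larger expressions. */
#define NORMALIZE_STRING_TRIM (1 << 5)
#define NORMALIZE_STRING_COMPOSE (1 << 7)
#define NORMALIZE_STRING_SIMPLE_LATIN_ASCII (1 << 8)

/* Hypothetical helper (not part of the library): returns true if the
   given flag bit is set in the options mask. */
static bool has_option(uint64_t options, uint64_t flag) {
    assert(flag != 0); /* an empty mask can never be "set" */
    return (options & flag) != 0;
}

/* Hypothetical helper: builds a mask that trims the string and applies
   the simple Latin-ASCII normalization added in this commit. */
static uint64_t example_options(void) {
    return NORMALIZE_STRING_TRIM | NORMALIZE_STRING_SIMPLE_LATIN_ASCII;
}
```

Because each flag occupies its own bit, a caller could enable the new option without disturbing any other flag already present in the mask.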