[normalize] New token normalization option that replaces digits with 'D' to mask numbers, e.g. when learning patterns (so 1234 and 5678 both normalize to DDDD). It shouldn't be used by the libpostal API, only by the feature extractors in the machine learning models. Also adds better possessive handling.
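
A minimal sketch of the digit-masking idea described above, not the code from this commit. The flag value and DIGIT_CHAR are copied from normalize.h. The helper name mask_digits, its buffer-based signature, and the ASCII-only digit check are assumptions made for illustration; the actual implementation lives in normalize.c and may treat Unicode digits differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values as they appear in normalize.h. The flag is parenthesized here
   (an addition in this sketch) so it is safe inside larger expressions. */
#define NORMALIZE_TOKEN_REPLACE_DIGITS (1 << 7)
#define DIGIT_CHAR "D"

/* Hypothetical helper, not part of libpostal: copies src into dst and, when
   NORMALIZE_TOKEN_REPLACE_DIGITS is set in options, writes DIGIT_CHAR in
   place of every ASCII digit. Output is truncated to fit dst_size and is
   always NUL-terminated. Returns the number of bytes written, excluding
   the terminator. */
size_t mask_digits(char *dst, size_t dst_size, const char *src, uint64_t options) {
    assert(dst != NULL && src != NULL && dst_size > 0);

    size_t n = 0;
    for (; *src != '\0' && n + 1 < dst_size; src++) {
        char c = *src;
        int is_ascii_digit = (c >= '0' && c <= '9');

        if ((options & NORMALIZE_TOKEN_REPLACE_DIGITS) && is_ascii_digit) {
            dst[n++] = DIGIT_CHAR[0];
        } else {
            dst[n++] = c;
        }
    }
    dst[n] = '\0';
    return n;
}
```

With the flag set, "1234" and "5678" both come out as "DDDD", so a feature extractor sees one shared pattern instead of two unrelated tokens. Because the replacement is a capital letter, it only stays distinguishable from ordinary letters when the input has already been lowercased, which matches the comment on DIGIT_CHAR in the header.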

2015-09-23 19:40:51 -04:00
parent a1d272077d
commit f6c30778bf
2 changed files with 22 additions and 16 deletions
--- a/src/normalize.h
+++ b/src/normalize.h
@@ -52,13 +52,17 @@ As well as normalizations for individual string tokens:
 #define NORMALIZE_TOKEN_DROP_ENGLISH_POSSESSIVES 1 << 4
 #define NORMALIZE_TOKEN_DELETE_OTHER_APOSTROPHE 1 << 5
 #define NORMALIZE_TOKEN_SPLIT_ALPHA_FROM_NUMERIC 1 << 6
+#define NORMALIZE_TOKEN_REPLACE_DIGITS 1 << 7
+
+// Replace digits with capital D e.g. 10013 => DDDDD, intended for use with lowercased strings
+#define DIGIT_CHAR "D"

 char *normalize_string_utf8(char *str, uint64_t options);

 char *normalize_string_latin(char *str, size_t len, uint64_t options);

 // Takes NORMALIZE_TOKEN_* options
-void append_normalized_token(char_array *array, char *str, token_t token, uint64_t options);
+void add_normalized_token(char_array *array, char *str, token_t token, uint64_t options);
 void normalize_token(cstring_array *array, char *str, token_t token, uint64_t options);

 // Takes NORMALIZE_STRING_* options