12 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Al | 89d0fd5718 | [fix] Alpha-numeric splitting | 2015-10-03 16:40:10 -04:00 |
| Al | f6c30778bf | [normalize] New token normalization option for replacing digits with 'D' for masking numbers e.g. when learning patterns (so 1234 and 5678 both normalize to DDDD). Shouldn't be used by libpostal API, just by the feature extractors in the machine learning models. Also adding better possessive handling. | 2015-09-23 19:41:01 -04:00 |
| Al | 66a71ab70d | [normalize] Need to do a Latin-ASCII transliteration even if the string is entirely ASCII since it may contain HTML escapes | 2015-08-11 23:36:08 -04:00 |
| Al | 4bc6adf669 | [normalize] Adding the original script as an alternative in transliteration mode as well | 2015-08-10 17:48:48 -04:00 |
| Al | 0f77ca1213 | [normalize] Adding a char_array version of normalize token | 2015-08-10 16:11:34 -04:00 |
| Al | 46141a6c36 | [normalize] Adding an option when normalizing tokens to split tokens of the form [\w]+[\.\-]?[\d]+ for cases like I35, CR123, R-66, RN.7, etc. where the alpha component is an expansion | 2015-08-02 14:34:36 -06:00 |
| Al | 551904d202 | [normalize] cstring_array instead of string_tree for token-based normalization | 2015-07-28 19:09:50 -04:00 |
| Al | 053b987d58 | [normalize] adding an option for string trimming in normalize | 2015-07-27 01:59:14 -04:00 |
| Al | a38b924c5d | [fix] add_token_alternatives | 2015-07-21 17:26:59 -04:00 |
| Al | 6ff91fef6b | [normalization] adding a normalize_string_latin method | 2015-07-05 23:38:01 -04:00 |
| Al | a08d59c277 | [fix] NFD normalization should be the default in normalize.c, not NFKD, as NFKD does some unwanted things like converting superscripts and the Latin-ASCII transliterator does a better, more thorough job while staying faithful to the original string | 2015-07-05 15:28:07 -04:00 |
| Al | 6cfbab9969 | [normalization] string normalization module for tokens and full strings | 2015-07-01 14:52:28 -04:00 |
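
Three of the commit messages above (f6c30778bf, 46141a6c36, a08d59c277) describe token-level behaviour that may be easier to follow with a small example. The sketch below is illustrative Python using only the standard library. It is not libpostal's C implementation, and the function names `mask_digits` and `split_alpha_numeric` are hypothetical. The regular expression is a simplified reading of the pattern quoted in 46141a6c36: it assumes the leading part is letters only, followed by an optional `.` or `-`, followed by digits.

```python
import re
import unicodedata


def mask_digits(token: str) -> str:
    """Replace every decimal digit with 'D' (digit masking, as in f6c30778bf)."""
    return re.sub(r"\d", "D", token)


# Assumed simplification of [\w]+[\.\-]?[\d]+ : letters, optional '.' or '-', digits.
# [^\W\d_] matches a word character that is neither a digit nor an underscore.
_ALPHA_NUMERIC = re.compile(r"([^\W\d_]+)[.\-]?(\d+)")


def split_alpha_numeric(token: str):
    """Return (alpha, digits) for tokens like 'CR123' or 'R-66', else None."""
    match = _ALPHA_NUMERIC.fullmatch(token)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Digit masking: different numbers collapse to the same pattern.
masked_a = mask_digits("1234")
masked_b = mask_digits("5678")

# Alpha-numeric splitting for the cases named in the commit message.
splits = {tok: split_alpha_numeric(tok) for tok in ["I35", "CR123", "R-66", "RN.7"]}

# NFD vs NFKD (a08d59c277): NFKD rewrites the superscript two as a plain '2',
# while NFD leaves it untouched. Both decompose accented letters.
nfd_superscript = unicodedata.normalize("NFD", "m\u00b2")
nfkd_superscript = unicodedata.normalize("NFKD", "m\u00b2")
nfd_accent = unicodedata.normalize("NFD", "\u00e9")

if __name__ == "__main__":
    print(masked_a, masked_b)
    print(splits)
    print(repr(nfd_superscript), repr(nfkd_superscript), repr(nfd_accent))
```

In this sketch the alpha component is returned separately so that a caller could expand it on its own (for example treating `CR` as an abbreviation), which is the motivation the commit message gives for the split. The superscript case shows why NFKD can be lossy for strings that should stay faithful to the original.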