libpostal

Author	SHA1	Message	Date
Al	be7b696cb2	[fix] actually that temporary array is unnecessary altogether, eliminating	2016-03-21 17:00:11 -04:00
Al	e0f7638372	[fix] Freeing up temporary char_array	2016-03-21 16:50:48 -04:00
Al	c32ef9ccf8	[fix] freeing up iterator in normalize_string	2016-02-09 01:06:51 -05:00
Al	0695738253	[fix] cleaning up memory in normalize_string_languages	2016-02-08 02:43:12 -05:00
Al	afd5844f21	[normalize] Permuting transliterators only once on the entire string rather than at each script break (so # permutations is bounded and can't get huge). Fixing some spacing issues. Adding method to check for an alpha+numeric token in normalization.	2016-02-08 01:16:47 -05:00
Al	ff75c5cc50	[normalize] Adding normalize_string_languages method which can use additional transliterators	2015-12-31 03:50:36 -05:00
Al	3fbb3c587a	[fix] using a char_array instead of copying the string in normalize_string	2015-12-23 19:21:54 -05:00
Al	f8da44e8b0	[fix] Making a copy even on pure Latin-script transliteration since string_trim modifies in-place, occasionally causes issues	2015-12-19 01:31:56 -05:00
Al	40918812e2	[normalize] Adding hyphen elimination as a string option (changes tokenization)	2015-10-27 13:32:47 -04:00
Al	1a1d74785c	[fix] Compiler warnings for casts/printf	2015-10-26 18:52:18 -04:00
Al	89d0fd5718	[fix] Alpha-numeric splitting	2015-10-03 16:40:10 -04:00
Al	f6c30778bf	[normalize] New token normalization option for replacing digits with 'D' for masking numbers e.g. when learning patterns (so 1234 and 5678 both normalize to DDDD). Shouldn't be used by libpostal API, just by the feature extractors in the machine learning models. Also adding better possessive handling.	2015-09-23 19:41:01 -04:00
Al	66a71ab70d	[normalize] Need to do a Latin-ASCII transliteration even if the string is entirely ASCII since it may contain HTML escapes	2015-08-11 23:36:08 -04:00
Al	4bc6adf669	[normalize] Adding the original script as an alternative in transliteration mode as well	2015-08-10 17:48:48 -04:00
Al	0f77ca1213	[normalize] Adding a char_array version of normalize token	2015-08-10 16:11:34 -04:00
Al	46141a6c36	[normalize] Adding an option when normalizing tokens to split tokens of the form [\w]+[\.\-]?[\d]+ for cases like I35, CR123, R-66, RN.7, etc. where the alpha component is an expansion	2015-08-02 14:34:36 -06:00
Al	551904d202	[normalize] cstring_array instead of string_tree for token-based normalization	2015-07-28 19:09:50 -04:00
Al	053b987d58	[normalize] adding an option for string trimming in normalize	2015-07-27 01:59:14 -04:00
Al	a38b924c5d	[fix] add_token_alternatives	2015-07-21 17:26:59 -04:00
Al	6ff91fef6b	[normalization] adding a normalize_string_latin method	2015-07-05 23:38:01 -04:00
Al	a08d59c277	[fix] NFD normalization should be the default in normalize.c, not NFKD, as NFKD does some unwanted things like converting superscripts and the Latin-ASCII transliterator does a better, more thorough job while staying faithful to the original string	2015-07-05 15:28:07 -04:00
Al	6cfbab9969	[normalization] string normalization module for tokens and full strings	2015-07-01 14:52:28 -04:00

22 Commits
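
Three of the commit messages above describe behavior concrete enough to illustrate: digit masking (f6c30778bf), alpha+numeric token splitting (46141a6c36), and the NFD versus NFKD choice (a08d59c277). The following is a minimal Python sketch of those ideas, not libpostal's C implementation. The function names `mask_digits`, `split_alpha_numeric`, and `decompose` are hypothetical, and the splitting regex is an assumed letters-only tightening of the `[\w]+[\.\-]?[\d]+` form quoted in the commit message.

```python
import re
import unicodedata

# All names below are hypothetical illustrations, not libpostal functions.


def mask_digits(token):
    """Replace every digit with 'D' so numbers of equal length share one pattern."""
    return re.sub(r"\d", "D", token)


# Assumed pattern: one or more letters, an optional '.' or '-', then digits.
# [^\W\d_] matches word characters that are neither digits nor underscore.
_ALPHA_NUMERIC = re.compile(r"^([^\W\d_]+)[.\-]?(\d+)$")


def split_alpha_numeric(token):
    """Split tokens such as 'CR123' or 'R-66' into their alpha and numeric parts.

    Tokens that do not fit the pattern are returned unchanged in a one-item list.
    """
    match = _ALPHA_NUMERIC.match(token)
    if match is None:
        return [token]
    return [match.group(1), match.group(2)]


def decompose(text, form="NFD"):
    """Apply a Unicode normalization form; NFD is the default, as in a08d59c277."""
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    print(mask_digits("1234"), mask_digits("5678"))
    for tok in ("I35", "CR123", "R-66", "RN.7"):
        print(tok, split_alpha_numeric(tok))
    # NFKD folds the superscript two into a plain '2'; NFD leaves it alone.
    print(decompose("x\u00b2", "NFKD") == "x2", decompose("x\u00b2", "NFD") == "x\u00b2")
```

The last line shows the concern raised in a08d59c277: compatibility decomposition (NFKD) rewrites characters such as superscripts, while canonical decomposition (NFD) only separates base letters from their combining marks.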