libpostal

Author	SHA1	Message	Date
Al	c5bb9d8daa	[normalize/api] exposing normalize_string_languages and normalized_tokens_languages to the API for pre-normalizing numeric expressions at tokenization time	2018-02-22 18:47:36 -05:00
Al	053dca82ba	[expand] adding a normalization for a single non-acronym internal period where there's an expansion at the prefix/suffix (for #218 and https://github.com/openvenues/libpostal/issues/216#issuecomment-306617824 ). Helps in cases like "St.Michaels" or "Jln.Utara" without needing to specify concatenated prefix phrases for every possibility	2017-10-28 02:38:15 -04:00
Al	448ca6a61a	[merge] merging commit from v1.1	2017-10-12 01:41:04 -04:00
Al	cddc368533	[numex] adding one form of normalization which strips ordinal suffixes so {96th, Ninety-sixth} => 96. This is an additional form of normalization, so there's still one form where the suffixes are kept. One case that's still not handled is something like "IXe Arrondissement"	2017-04-18 21:39:54 -04:00
Al	58851a9088	[normalization] Adding NORMALIZE_STRING_SIMPLE_LATIN_ASCII option so parser can normalize punctuation and HTML entities, etc. without touching the alphanumeric parts of the original input	2016-08-21 19:45:32 -04:00
Al	6c39c663ff	[normalize] Adding NORMALIZE_STRING_COMPOSE for NFC unicode normalization	2016-07-21 17:04:57 -04:00
Al	6dad58c696	[fix][ci skip] last remaining instance of vignt in libpostal	2016-03-29 12:51:19 -04:00
Al	afd5844f21	[normalize] Permuting transliterators only once on the entire string rather than at each script break (so # permutations is bounded and can't get huge). Fixing some spacing issues. Adding method to check for an alpha+numeric token in normalization.	2016-02-08 01:16:47 -05:00
Al	ff75c5cc50	[normalize] Adding normalize_string_languages method which can use additional transliterators	2015-12-31 03:50:36 -05:00
Al	40918812e2	[normalize] Adding hyphen elimination as a string option (changes tokenization)	2015-10-27 13:32:47 -04:00
Al	f6c30778bf	[normalize] New token normalization option for replacing digits with 'D' for masking numbers e.g. when learning patterns (so 1234 and 5678 both normalize to DDDD). Shouldn't be used by libpostal API, just by the feature extractors in the machine learning models. Also adding better possessive handling.	2015-09-23 19:41:01 -04:00
Al	0f77ca1213	[normalize] Adding a char_array version of normalize token	2015-08-10 16:11:34 -04:00
Al	9b69d1f67a	[fix] Removing C++ checks from all but the main API functions	2015-08-07 17:15:39 -04:00
Al	359a1efb03	[fix] Adding stdint.h include to most of the header files for portability	2015-08-07 02:43:44 -04:00
Al	46141a6c36	[normalize] Adding an option when normalizing tokens to split tokens of the form [\w]+[\.\-]?[\d]+ for cases like I35, CR123, R-66, RN.7, etc. where the alpha component is an expansion	2015-08-02 14:34:36 -06:00
Al	551904d202	[normalize] cstring_array instead of string_tree for token-based normalization	2015-07-28 19:09:50 -04:00
Al	053b987d58	[normalize] adding an option for string trimming in normalize	2015-07-27 01:59:14 -04:00
Al	ee96dab93c	[fix] unnecessary headers	2015-07-25 13:49:42 -04:00
Al	5239c365d0	[docs] Adding some documentation for normalize.h options	2015-07-24 15:23:25 -04:00
Al	a38b924c5d	[fix] add_token_alternatives	2015-07-21 17:26:59 -04:00
Al	6ff91fef6b	[normalization] adding a normalize_string_latin method	2015-07-05 23:38:01 -04:00
Al	6cfbab9969	[normalization] string normalization module for tokens and full strings	2015-07-01 14:52:28 -04:00

22 Commits
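
Commit f6c30778bf describes a token normalization option that replaces digits with 'D', so that 1234 and 5678 both normalize to DDDD. The following is a minimal sketch of that idea in C, assuming ASCII digits in a mutable, NUL-terminated buffer. The name `mask_digits` is hypothetical and chosen for illustration only; it is not libpostal's actual function or option name.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of libpostal's API): overwrite every ASCII
 * digit in a NUL-terminated string with 'D', in place.
 * Returns the number of characters that were replaced. */
size_t mask_digits(char *s) {
    size_t replaced = 0;
    for (; *s != '\0'; s++) {
        /* Cast to unsigned char: passing a negative char value to
         * isdigit() is undefined behavior. */
        if (isdigit((unsigned char)*s)) {
            *s = 'D';
            replaced++;
        }
    }
    return replaced;
}
```

Called on a buffer holding "1234" or "5678", this sketch leaves "DDDD" in both cases, which is the collapsing behavior the commit message describes for pattern learning. Non-digit characters are left untouched, so "R-66" becomes "R-DD".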