Al | 6ea2273263 | [fix] terminate the char_array if input token is zero-length in add_normalized_token | 2017-04-28 11:25:07 -04:00
Al | cddc368533 | [numex] adding one form of normalization which strips ordinal suffixes so {96th, Ninety-sixth} => 96. This is an additional form of normalization, so there's still one form where the suffixes are kept. One case that's still not handled is something like "IXe Arrondissement" | 2017-04-18 21:39:54 -04:00
Al | b88487f633 | [utils] string_replace_char does single byte/character replacement, new string_replace to do full string replacement, again using char_array for safety, string_replace_with_array function for memory reuse | 2017-02-17 13:58:51 -05:00
Al | a78937f265 | [normalize] use the new utf8proc lowercasing (as opposed to case folding), free copies since none of the string functions operate in-place any more, add minimal HTML escaping transliterator even to ASCII text | 2017-01-01 20:06:32 -05:00
Al | 58b063b632 | [strings] making string_tree_iterator_done more meaningful (returns true if the iterator has no paths left to traverse) | 2016-12-31 00:54:36 -05:00
Al | 42cf686b8e | [normalization] adding LATIN_ASCII_SIMPLE option to normalize_string_latin | 2016-12-26 04:15:58 -05:00
Al | 6f37f9ae86 | [merge] merging in master changes | 2016-12-21 15:40:25 -05:00
Al | b639fa5127 | [utils] string_replace also creates a copy | 2016-11-30 10:09:33 -08:00
Al | 89f6611c4e | [strings] string_trim makes a copy rather than modifying the pointer | 2016-11-28 15:06:07 -08:00
Al | 58851a9088 | [normalization] Adding NORMALIZE_STRING_SIMPLE_LATIN_ASCII option so parser can normalize punctuation and HTML entities, etc. without touching the alphanumeric parts of the original input | 2016-08-21 19:45:32 -04:00
Al | 6c39c663ff | [normalize] Adding NORMALIZE_STRING_COMPOSE for NFC unicode normalization | 2016-07-21 17:04:57 -04:00
Al | 2f5f226faa | [fix] Add original string to normalizations if all options were set to false | 2016-07-15 13:23:23 -04:00
Al | be7b696cb2 | [fix] actually that temporary array is unnecessary altogether, eliminating | 2016-03-21 17:00:11 -04:00
Al | e0f7638372 | [fix] Freeing up temporary char_array | 2016-03-21 16:50:48 -04:00
Al | c32ef9ccf8 | [fix] freeing up iterator in normalize_string | 2016-02-09 01:06:51 -05:00
Al | 0695738253 | [fix] cleaning up memory in normalize_string_languages | 2016-02-08 02:43:12 -05:00
Al | afd5844f21 | [normalize] Permuting transliterators only once on the entire string rather than at each script break (so # permutations is bounded and can't get huge). Fixing some spacing issues. Adding method to check for an alpha+numeric token in normalization. | 2016-02-08 01:16:47 -05:00
Al | ff75c5cc50 | [normalize] Adding normalize_string_languages method which can use additional transliterators | 2015-12-31 03:50:36 -05:00
Al | 3fbb3c587a | [fix] using a char_array instead of copying the string in normalize_string | 2015-12-23 19:21:54 -05:00
Al | f8da44e8b0 | [fix] Making a copy even on pure Latin-script transliteration since string_trim modifies in-place, occasionally causes issues | 2015-12-19 01:31:56 -05:00
Al | 40918812e2 | [normalize] Adding hyphen elimination as a string option (changes tokenization) | 2015-10-27 13:32:47 -04:00
Al | 1a1d74785c | [fix] Compiler warnings for casts/printf | 2015-10-26 18:52:18 -04:00
Al | 89d0fd5718 | [fix] Alpha-numeric splitting | 2015-10-03 16:40:10 -04:00
Al | f6c30778bf | [normalize] New token normalization option for replacing digits with 'D' for masking numbers e.g. when learning patterns (so 1234 and 5678 both normalize to DDDD). Shouldn't be used by libpostal API, just by the feature extractors in the machine learning models. Also adding better possessive handling. | 2015-09-23 19:41:01 -04:00
Al | 66a71ab70d | [normalize] Need to do a Latin-ASCII transliteration even if the string is entirely ASCII since it may contain HTML escapes | 2015-08-11 23:36:08 -04:00
Al | 4bc6adf669 | [normalize] Adding the original script as an alternative in transliteration mode as well | 2015-08-10 17:48:48 -04:00
Al | 0f77ca1213 | [normalize] Adding a char_array version of normalize token | 2015-08-10 16:11:34 -04:00
Al | 46141a6c36 | [normalize] Adding an option when normalizing tokens to split tokens of the form [\w]+[\.\-]?[\d]+ for cases like I35, CR123, R-66, RN.7, etc. where the alpha component is an expansion | 2015-08-02 14:34:36 -06:00
Al | 551904d202 | [normalize] cstring_array instead of string_tree for token-based normalization | 2015-07-28 19:09:50 -04:00
Al | 053b987d58 | [normalize] adding an option for string trimming in normalize | 2015-07-27 01:59:14 -04:00
Al | a38b924c5d | [fix] add_token_alternatives | 2015-07-21 17:26:59 -04:00
Al | 6ff91fef6b | [normalization] adding a normalize_string_latin method | 2015-07-05 23:38:01 -04:00
Al | a08d59c277 | [fix] NFD normalization should be the default in normalize.c, not NFKD, as NFKD does some unwanted things like converting superscripts and the Latin-ASCII transliterator does a better, more thorough job while staying faithful to the original string | 2015-07-05 15:28:07 -04:00
Al | 6cfbab9969 | [normalization] string normalization module for tokens and full strings | 2015-07-01 14:52:28 -04:00