716 Commits

Author SHA1 Message Date
Al
0fa1c2389c [fix] Leak in expanding strings that have a separable prefix and suffix; other than that, ran through 78 million expansions with no discernible memory issues 2015-12-26 17:19:59 -05:00
Al
deeb8f007e [fix] Check for result.len > 0 in false start continuation numex parsing, plus additional safety check during replacement 2015-12-24 02:26:53 -05:00
Al
507dd631f8 [build] Adding json_encode.c to the address parser client sources 2015-12-23 19:37:28 -05:00
Al
5e6d24ff7e [unicode] Upgrading to latest utf8proc from JuliaLang (Unicode 8) 2015-12-23 19:33:09 -05:00
Al
3fbb3c587a [fix] using a char_array instead of copying the string in normalize_string 2015-12-23 19:21:54 -05:00
Al
2eea999692 [fix] Fixing false start continuations in numex parsing 2015-12-23 19:19:14 -05:00
Al
850d82de6e [fix] In trie search, moving fall-off and tail checks inside the inner character loop, adding tail position as a separate variable from offset in the string 2015-12-23 19:16:43 -05:00
Al
19173d3a6e [transliteration] In set match checks, use the current index, not current index - char_len 2015-12-23 13:12:30 -05:00
Al
e9e05bb929 [transliteration] Distinguishing between variables with numbers and backreferences in transliteration rules 2015-12-23 13:07:44 -05:00
Al
aaa1fc0387 [fix] Stepping through codepoints first then through chars in trie_search_prefixes_from_index (used in transliteration and numex) 2015-12-23 01:58:39 -05:00
Al
baa8e3cc3f [fix] Compare the remaining part of the current UTF-8 character using simple string comparison, since it may be in the middle of a valid UTF-8 character 2015-12-21 20:34:15 -05:00
Al
ceda863e9f [fix] Encode strings as JSON in address parser cli 2015-12-21 17:45:09 -05:00
Al
e55ff54be1 [fix] Adding Korean-Latin-BGN to excluded transliterators 2015-12-21 16:24:50 -05:00
Al
c7fb7f685d [transliteration] Fixing group replacement in transliteration in the case of multiple groups, not adding to phrase length when checking context 2015-12-21 16:06:04 -05:00
Al
ab124465e6 [fix] regenerating transliteration data 2015-12-20 15:41:42 -05:00
Al
5439f4679f [fix] Special tokens like emails/urls/phone numbers bypass normalization 2015-12-20 03:07:36 -05:00
Al
cf2a0efa11 [fix] Prefixes and suffixes that are the same length as the original token should be handled as regular expansions 2015-12-19 17:29:26 -05:00
Al
aaecd7961a [fix] Options out of order 2015-12-19 15:05:50 -05:00
Al
48cb2b5c7b [api] Node was complaining about non-trivial designated initializers (probably the bit fields), so converting to old-school initializer 2015-12-19 02:34:31 -05:00
Al
97906c86a8 [fix] Strip punctuation in final output in cases where there are no expansions 2015-12-19 02:10:41 -05:00
Al
4497c4501e [fix] do not add a token if prefix/suffix expansions are inseparable and canonical 2015-12-19 01:36:02 -05:00
Al
f8da44e8b0 [fix] Making a copy even on pure Latin-script transliteration since string_trim modifies in-place, occasionally causes issues 2015-12-19 01:31:56 -05:00
Al
39e83961ef [fix] Bug in suffix expansion affecting inseparable suffixes like burg as well as ordinal suffixes like first=>1st 2015-12-19 01:30:08 -05:00
Al
b4a8a69226 [expansion] Fixing extra space on prefix/suffix expansions 2015-12-18 20:28:59 -05:00
Al
df47dad817 [fix] Partial matches, ultimate misses in concatenated suffixes 2015-12-18 17:37:06 -05:00
Al
66073c17d5 [fix] Handling case of concatenated suffixes like straße when they stand alone 2015-12-18 17:17:35 -05:00
Al
31ed88bf6a [api] Adding a --json option to expand cli 2015-12-17 13:46:55 -05:00
Al
41ea105bb4 [api] Simple JSON encoding for strings, UTF-8 rather than Unicode 2015-12-17 12:25:05 -05:00
Al
af78614f62 [fix] Print usage info on -h/--help to libpostal cli 2015-12-16 22:21:13 -05:00
Al
e0c0ed2d04 [numex] Return true if numex table already loaded 2015-12-15 14:28:40 -05:00
Al
b9bf5c629e [fix] Moving address_parser_response_destroy into libpostal so caller can free 2015-12-15 00:52:24 -05:00
Al
b59c830ba6 [fix] warning about size_t 2015-12-14 18:17:09 -05:00
Al
406f9c533d [api] Separating parser setup/teardown into two separate methods 2015-12-14 18:15:57 -05:00
Al
43b212a09b [fix] size_t in benchmark script 2015-12-14 14:57:11 -05:00
Al
dc03c83bb2 [math] Adding an aligned memory allocator for vectors to help with vectorization/SIMD 2015-12-14 14:56:38 -05:00
Al
bd1e8ecaf8 [fix] default address parser dir 2015-12-12 12:55:37 -05:00
Al
2950358697 [build] address_parser client now links to libpostal, adding address_parser to download script with an "all" option 2015-12-12 12:49:50 -05:00
Al
88836e56e1 [api] Adding parse_address implementation to the libpostal API. GeoDB and address parser are now required. Stripping punctuation from the normalized output 2015-12-12 12:47:44 -05:00
Al
bce6ba2595 [fix] typedef 2015-12-12 11:58:41 -05:00
Al
a8d6cc4053 [api] Moving parse_address definition into libpostal.h 2015-12-12 03:55:31 -05:00
Al
fe4c528f26 [parser] Using different char_array for each of the potential phrases as token i 2015-12-12 03:23:26 -05:00
Al
e6303f70f3 [fix] removing printf 2015-12-11 02:53:22 -05:00
Al
671dd4a5d2 [parser] Fixing possible invalid writes in training for values beginning with a separator 2015-12-11 02:05:05 -05:00
Al
743b74aea5 [parser] Simplifying args in address_parser_data_set_tokenize_line 2015-12-10 18:48:23 -05:00
Al
88b8023ac8 [fix] Bug in address parser feature extraction, can hold onto the wrong pointer 2015-12-10 18:42:28 -05:00
Al
3de59506ae [parser] Internal separators for parsing purposes include open/close parens, at sign, semicolon, etc. Ignore stray colons not internal to a word (as in Swedish abbreviations) 2015-12-10 18:08:51 -05:00
Al
71d6d3c5e1 [utils] Removing kvec and using similar implementation with pointers that can be passed around 2015-12-10 17:52:23 -05:00
Al
ab205eff96 [utils] Adding a default small size to all arrays based on a look at malloc/realloc usage 2015-12-09 19:46:09 -05:00
Al
f252869671 [dictionaries] adding ste to English dictionaries 2015-12-08 22:29:52 -05:00
Al
fe37286bcf [fix] Fixes to matrix methods 2015-12-08 17:33:38 -05:00