Commit Graph

88 Commits

Author SHA1 Message Date
Al
835de327c3 [dedupe] for near-dupe hashing, remove whitespace from root expansions so something like "Ocean Walk Dr" and "Oceanwalk Dr" will have a chance of matching downstream 2018-02-24 00:34:09 -05:00
Al
591891951d [utils] adding utf8 case insensitive comparison 2018-02-23 01:22:58 -05:00
Al
0ee18b4f6c [dedupe] adding a function to acronyms module to detect existing/known acronyms like MS for middle school, HS for high school, etc. Forms like MS have to be defined in the dictionaries specifically but any acronym written like M.S. will be detected as such by the tokenizer 2018-01-15 23:47:16 -05:00
Al
c78566c241 [utils] adding cstring_array_extend and string_tree_clear 2017-12-24 01:46:20 -05:00
Al
e4e84f0147 [utils] adding unicode_common_prefix/unicode_common_suffix, string_hyphen_prefix_len and string_hyphen_suffix_len to string_utils 2017-12-08 14:28:30 -05:00
Al
cfa5b1ce42 [similarity] adding a stopword-aware acronym alignment method for matching U.N. with United Nations, Museum of Modern Art with MoMA, as well as things like University of California - Los Angeles with UCLA. All of these should work across languages, including non-Latin character sets like Cyrillic (but not ideograms as the concept doesn't make as much sense there). Skipping tokens like "of" or "the" depends only on the stopwords dictionary being defined for a given language. 2017-12-04 15:21:44 -05:00
Al
ec4d683d1b Merge branch 'master' into lieu_api 2017-11-29 15:49:52 -05:00
AeroXuk
26ac9ab5c2 Removing EXPORT statements from all source files and most header files, leaving only the exports for the main API in libpostal.h. Modified Makefiles so that all the test apps build without having extra functions exported from libpostal. 2017-11-25 04:35:28 +00:00
AeroXuk
f0246e7333 Fix bug in strndup fix for windows. Move all includes out of headers and into code for strndup.h and move it to be the last include. 2017-11-23 19:11:25 +00:00
AeroXuk
f07ab765cb Adding the export marker to all functions used in tests. 2017-11-20 20:58:37 +00:00
Al
665b780422 [utils] adding unicode_equals function in string_utils for testing equality of unicode char arrays 2017-11-11 02:45:41 -05:00
Al
6d430f7e9b [utils] adding functions for finding the next index of a full stop/period character in a string 2017-10-27 04:07:28 -04:00
Al
245aa226e0 [utils] function to create an array of uint32_t codepoints from a UTF-8 string, a few bug fixes to string_utils 2017-10-19 04:48:50 -04:00
Al
09fbb02042 [utils] adding utf8_equal_ignore_separators to string utils 2017-10-14 01:36:56 -04:00
Al
f8a808e254 [utils] adding utf8_len function for strings, and utf8_is_digit 2017-10-12 11:16:53 -04:00
Oliver Keyes
35821f975e Remove unused variable. What it says on the tin! 2017-04-18 21:25:00 -07:00
Al
1b2696b3b5 [utils] adding string_is_digit function, similar to Python's (i.e. counts if it's in the Nd unicode category) 2017-03-15 13:04:39 -04:00
Al
b88487f633 [utils] string_replace_char does single byte/character replacement, new string_replace to do full string replacement, again using char_array for safety, string_replace_with_array function for memory reuse 2017-02-17 13:58:51 -05:00
Al
ae35da8d17 [fix] uninitialized var 2017-02-08 01:58:53 -05:00
Al
ec3a563591 Merge branch 'master' into parser-data 2017-01-14 13:06:25 -05:00
Rinigus
67624f89d0 cstring_array_from_char_array: return an empty, initialized cstring_array for an empty string 2017-01-14 10:43:47 +02:00
Al
b320aed9ac [merge] merging master 2017-01-13 19:58:49 -05:00
Al
e1f258171f [fix] handle cstring_array_from_char_array where char_array is NULL or 0-length 2017-01-13 16:52:41 -05:00
Al
953a26e54e [utils] char_array_add_vjoined to stay consistent (add_* methods NUL-terminate) 2017-01-09 16:10:07 -05:00
Al
4ad3a52fe1 [strings] fix lowercasing in string_utils.c 2017-01-01 20:08:34 -05:00
Al
7d6c85aeec [fix] new string tree iterator, don't decrement permutations on rollovers 2017-01-01 13:34:08 -05:00
Al
1780c5e053 [fix] moving enum 2016-12-31 13:01:57 -05:00
Al
475aa3dbfa [strings] fixing and simplifying string tree iterator. This version is inspired by Python's itertools.product (itertoolsmodule.c has so many goodies) 2016-12-31 03:22:27 -05:00
Al
58b063b632 [strings] making string_tree_iterator_done more meaningful (returns true if the iterator has no paths left to traverse) 2016-12-31 00:54:36 -05:00
Al
8978000320 [strings] adding latest utf8proc, new functions for utf8_lower (instead of case folding) and utf8_upper, and a utf8_is_whitespace that takes things like tabs into account 2016-12-31 00:52:12 -05:00
Al
0284913aa7 [utils] ignore initial separators when splitting on delimiter 2016-12-26 04:14:20 -05:00
Al
3ac2c93e1c [utils] renaming char_array_append_vjoined to char_array_add_vjoined to follow convention that add_* calls NUL-terminate while append_* calls do not 2016-12-18 15:26:58 -05:00
Al
3939dd0ca6 [fix] cstring_array_split calls 2016-12-12 11:37:27 -05:00
Al
b1816e9b70 [utils] Adding cstring_array_split_ignore_consecutive 2016-12-12 11:37:27 -05:00
Al
b639fa5127 [utils] string_replace also creates a copy 2016-11-30 10:09:33 -08:00
Al
89f6611c4e [strings] string_trim makes a copy rather than modifying the pointer 2016-11-28 15:06:07 -08:00
Al
92e66fd60c [utils] string_next_hyphen_index 2016-08-16 12:49:52 -04:00
Al
b8d43dc601 [fix] cstring_array_split calls 2016-07-21 17:04:57 -04:00
Al
b664ab1cea [utils] Adding cstring_array_split_ignore_consecutive 2016-07-21 17:04:57 -04:00
Al
98c395d34c [numex] Concatenating a string of numeric expressions with no intervening tokens like Seventeen Eighty or Ten Oh Four 2016-02-10 09:21:31 -05:00
Al
7b300639f1 [fix] Trie prefix search tail comparison 2016-01-17 20:56:37 -05:00
Al
0d5cf0d6d7 [utils] char_array_cat_printf was forcing a doubling of the size of the buffer, which is bad if calling many times. Now only initiates a realloc if the char_array is almost full. Also adding cstring_array_from_strings which takes a list of char *s 2016-01-06 22:56:01 -05:00
Al
d0aaff1482 [utils] string_equals with NULL check 2015-12-01 13:12:08 -05:00
Al
40918812e2 [normalize] Adding hyphen elimination as a string option (changes tokenization) 2015-10-27 13:32:47 -04:00
Al
6428c0ae20 [utils] cstring_array_cat 2015-10-03 16:00:13 -04:00
Al
3fab0f984f [fix] fixing some compiler warnings, using type-specific abs functions for vector_math 2015-09-19 16:11:09 -04:00
Al
35b9122a1a [utils] inlining a few functions 2015-09-10 16:33:54 -07:00
Al
0ddf50cb5f [utils] add to feature array with printf syntax 2015-09-10 10:24:51 -07:00
Al
b3f89a207a [utils] Version of string_split for single character delimiters which modifies the input string directly rather than creating (essentially) a copy 2015-09-09 18:07:31 -07:00
Al
9d2ca08fc2 [utils] Adding _copy and _new_copy methods to vectors (the former copies data to a pre-allocated vector, the latter allocates a new vector) 2015-09-06 21:01:26 -07:00
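
Note on commit 835de327c3 ([dedupe]): the idea of removing whitespace so that "Ocean Walk Dr" and "Oceanwalk Dr" can produce the same near-dupe key can be sketched roughly as below. This is only an illustration, not the code from the commit: near_dupe_key is a hypothetical helper, it works on the raw string rather than on root expansions, and it handles ASCII whitespace and ASCII case only (bytes of multi-byte UTF-8 sequences are copied through unchanged).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of libpostal.
 * Writes a lowercased copy of `src` with ASCII whitespace removed into
 * `dst`, which has room for `dst_size` bytes including the NUL terminator.
 * Output is truncated if it does not fit.
 * Returns the number of bytes written, excluding the NUL terminator. */
size_t near_dupe_key(const char *src, char *dst, size_t dst_size) {
    size_t n = 0;

    if (dst_size == 0) {
        return 0;
    }

    for (; *src != '\0' && n + 1 < dst_size; src++) {
        /* Cast to unsigned char: passing a negative char to the
         * <ctype.h> functions is undefined behavior. */
        unsigned char c = (unsigned char)*src;

        if (isspace(c)) {
            continue; /* drop spaces, tabs, newlines, etc. */
        }
        dst[n++] = (char)tolower(c);
    }

    dst[n] = '\0';
    return n;
}
```

With this sketch, both example strings from the commit message reduce to the same key, so a downstream hash of that key would collide as intended.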