Commit Graph

79 Commits

Author SHA1 Message Date
Al
835de327c3 [dedupe] for near-dupe hashing, remove whitespace from root expansions so something like "Ocean Walk Dr" and "Oceanwalk Dr" will have a chance of matching downstream 2018-02-24 00:34:09 -05:00
Al
591891951d [utils] adding utf8 case insensitive comparison 2018-02-23 01:22:58 -05:00
Al
0ee18b4f6c [dedupe] adding a function to acronyms module to detect existing/known acronyms like MS for middle school, HS for high school, etc. Forms like MS have to be defined in the dictionaries specifically but any acronym written like M.S. will be detected as such by the tokenizer 2018-01-15 23:47:16 -05:00
Al
c78566c241 [utils] adding cstring_array_extend and string_tree_clear 2017-12-24 01:46:20 -05:00
Al
d03ce4e058 [expand] remove blank expansions and strip spaces 2017-12-18 18:17:16 -05:00
Al
e4e84f0147 [utils] adding unicode_common_prefix/unicode_common_suffix, string_hyphen_prefix_len and string_hyphen_suffix_len to string_utils 2017-12-08 14:28:30 -05:00
Al
cfa5b1ce42 [similarity] adding a stopword-aware acronym alignment method for matching U.N. with United Nations, Museum of Modern Art with MoMA, as well as things like University of California - Los Angeles with UCLA. All of these should work across languages, including non-Latin character sets like Cyrillic (but not ideograms as the concept doesn't make as much sense there). Skipping tokens like "of" or "the" depends only on the stopwords dictionary being defined for a given language. 2017-12-04 15:21:44 -05:00
Al
665b780422 [utils] adding unicode_equals function in string_utils for testing equality of unicode char arrays 2017-11-11 02:45:41 -05:00
Al
6d430f7e9b [utils] adding functions for finding the next index of a full stop/period character in a string 2017-10-27 04:07:28 -04:00
Al
b7eda37e44 [utils] adding utf8_is_digit to string_utils.h 2017-10-20 02:46:00 -04:00
Al
245aa226e0 [utils] function to create an array of uint32_t codepoints from a UTF-8 string, a few bug fixes to string_utils 2017-10-19 04:48:50 -04:00
Al
09fbb02042 [utils] adding utf8_equal_ignore_separators to string utils 2017-10-14 01:36:56 -04:00
Al
f8a808e254 [utils] adding utf8_len function for strings, and utf8_is_digit 2017-10-12 11:16:53 -04:00
Al
1b2696b3b5 [utils] adding string_is_digit function, similar to Python's (i.e. counts if it's in the Nd unicode category) 2017-03-15 13:04:39 -04:00
Al
b88487f633 [utils] string_replace_char does single byte/character replacement, new string_replace to do full string replacement, again using char_array for safety, string_replace_with_array function for memory reuse 2017-02-17 13:58:51 -05:00
Al
b320aed9ac [merge] merging master 2017-01-13 19:58:49 -05:00
Al
953a26e54e [utils] char_array_add_vjoined to stay consistent (add_* methods NUL terminate) 2017-01-09 16:10:07 -05:00
Al
77035fbdbd [strings] adding utf8_is_whitespace to the header so it can be referenced from multiple files 2017-01-02 02:23:21 -05:00
Al
4ad3a52fe1 [strings] fix lowercasing in string_utils.c 2017-01-01 20:08:34 -05:00
Al
0b5cc96654 [transliteration] add decompose option when stripping accents 2017-01-01 13:54:20 -05:00
Al
475aa3dbfa [strings] fixing and simplifying string tree iterator. This version is inspired by Python's itertools.product (itertoolsmodule.c has so many goodies) 2016-12-31 03:22:27 -05:00
Al
261ec3888a [strings] header changes for new utf8 lower/upper functions 2016-12-31 03:20:43 -05:00
Al
b1816e9b70 [utils] Adding cstring_array_split_ignore_consecutive 2016-12-12 11:37:27 -05:00
Al
b639fa5127 [utils] string_replace also creates a copy 2016-11-30 10:09:33 -08:00
Al
89f6611c4e [strings] string_trim makes a copy rather than modifying the pointer 2016-11-28 15:06:07 -08:00
Al
92e66fd60c [utils] string_next_hyphen_index 2016-08-16 12:49:52 -04:00
Al
b664ab1cea [utils] Adding cstring_array_split_ignore_consecutive 2016-07-21 17:04:57 -04:00
Al
98c395d34c [numex] Concatenating a string of numeric expressions with no intervening tokens like Seventeen Eighty or Ten Oh Four 2016-02-10 09:21:31 -05:00
Al
7b300639f1 [fix] Trie prefix search tail comparison 2016-01-17 20:56:37 -05:00
Al
2e67afab09 [fix] adding functions to string_utils header 2016-01-06 23:03:16 -05:00
Al
d0aaff1482 [utils] string_equals with NULL check 2015-12-01 13:12:08 -05:00
Al
40918812e2 [normalize] Adding hyphen elimination as a string option (changes tokenization) 2015-10-27 13:32:47 -04:00
Al
bf596b9184 [utils] integer string sizes 2015-10-09 15:40:47 -04:00
Al
6428c0ae20 [utils] cstring_array_cat 2015-10-03 16:00:13 -04:00
Al
3fab0f984f [fix] fixing some compiler warnings, using type-specific abs functions for vector_math 2015-09-19 16:11:09 -04:00
Al
17cfdb0625 [fix] adding char_array_append_* methods to header 2015-09-18 13:19:42 -04:00
Al
0ddf50cb5f [utils] add to feature array with printf syntax 2015-09-10 10:24:51 -07:00
Al
b3f89a207a [utils] Version of string_split for single character delimiters which modifies the input string directly rather than creating (essentially) a copy 2015-09-09 18:07:31 -07:00
Al
aa454c4430 [fix] removing char_array_copy from header 2015-09-07 23:58:05 -07:00
Al
ec3ab7234a [utils] Adding index to cstring_array_foreach, similar to Python's enumerate 2015-09-04 19:34:06 -04:00
Al
a13e5117b5 [utils] string_tree_num_strings method 2015-08-10 17:46:37 -04:00
Al
064b6b5898 [utils] char_array_append_reversed for adding reversed strings without a malloc 2015-08-10 16:10:05 -04:00
Al
9b69d1f67a [fix] Removing C++ checks from all but the main API functions 2015-08-07 17:15:39 -04:00
Al
359a1efb03 [fix] Adding stdint.h include to most of the header files for portability 2015-08-07 02:43:44 -04:00
Al
0738a57caa [fix] restoring ctype.h include 2015-08-07 01:52:08 -04:00
Al
d7ebcd046e [fix] includes 2015-08-07 01:00:26 -04:00
Al
3178eda501 [utils] string_contains_hyphen method 2015-08-02 14:35:18 -06:00
Al
7aee159c0c [utils] string_tree_num_tokens 2015-07-27 12:36:34 -04:00
Al
a67ec44a08 [utils] cstring_array_terminate, moving msgpack_utils to separate file 2015-07-25 18:41:02 -04:00
Al
e549e76806 [utils] string_tree_iterator_foreach_token 2015-07-25 13:49:02 -04:00