Al
|
835de327c3
|
[dedupe] for near-dupe hashing, remove whitespace from root expansions so something like "Ocean Walk Dr" and "Oceanwalk Dr" will have a chance of matching downstream
|
2018-02-24 00:34:09 -05:00 |
|
Al
|
591891951d
|
[utils] adding utf8 case insensitive comparison
|
2018-02-23 01:22:58 -05:00 |
|
Al
|
0ee18b4f6c
|
[dedupe] adding a function to acronyms module to detect existing/known acronyms like MS for middle school, HS for high school, etc. Forms like MS have to be defined in the dictionaries specifically but any acronym written like M.S. will be detected as such by the tokenizer
|
2018-01-15 23:47:16 -05:00 |
|
Al
|
c78566c241
|
[utils] adding cstring_array_extend and string_tree_clear
|
2017-12-24 01:46:20 -05:00 |
|
Al
|
d03ce4e058
|
[expand] remove blank expansions and strip spaces
|
2017-12-18 18:17:16 -05:00 |
|
Al
|
e4e84f0147
|
[utils] adding unicode_common_prefix/unicode_common_suffix, string_hyphen_prefix_len and string_hyphen_suffix_len to string_utils
|
2017-12-08 14:28:30 -05:00 |
|
Al
|
cfa5b1ce42
|
[similarity] adding a stopword-aware acronym alignment method for matching U.N. with United Nations, Museum of Modern Art with MoMA, as well as things like University of California - Los Angeles with UCLA. All of these should work across languages, including non-Latin character sets like Cyrillic (but not ideograms as the concept doesn't make as much sense there). Skipping tokens like "of" or "the" depends only on the stopwords dictionary being defined for a given language.
|
2017-12-04 15:21:44 -05:00 |
|
Al
|
665b780422
|
[utils] adding unicode_equals function in string_utils for testing equality of unicode char arrays
|
2017-11-11 02:45:41 -05:00 |
|
Al
|
6d430f7e9b
|
[utils] adding functions for finding the next index of a full stop/period character in a string
|
2017-10-27 04:07:28 -04:00 |
|
Al
|
b7eda37e44
|
[utils] adding utf8_is_digit to string_utils.h
|
2017-10-20 02:46:00 -04:00 |
|
Al
|
245aa226e0
|
[utils] function to create an array of uint32_t codepoints from a UTF-8 string, a few bug fixes to string_utils
|
2017-10-19 04:48:50 -04:00 |
|
Al
|
09fbb02042
|
[utils] adding utf8_equal_ignore_separators to string utils
|
2017-10-14 01:36:56 -04:00 |
|
Al
|
f8a808e254
|
[utils] adding utf8_len function for strings, and utf8_is_digit
|
2017-10-12 11:16:53 -04:00 |
|
Al
|
1b2696b3b5
|
[utils] adding string_is_digit function, similar to Python's (i.e. counts if it's in the Nd unicode category)
|
2017-03-15 13:04:39 -04:00 |
|
Al
|
b88487f633
|
[utils] string_replace_char does single byte/character replacement; new string_replace does full string replacement, again using char_array for safety; string_replace_with_array function for memory reuse
|
2017-02-17 13:58:51 -05:00 |
|
Al
|
b320aed9ac
|
[merge] merging master
|
2017-01-13 19:58:49 -05:00 |
|
Al
|
953a26e54e
|
[utils] char_array_add_vjoined to stay consistent (add_* methods NUL terminate)
|
2017-01-09 16:10:07 -05:00 |
|
Al
|
77035fbdbd
|
[strings] adding utf8_is_whitespace to the header so it can be referenced from multiple files
|
2017-01-02 02:23:21 -05:00 |
|
Al
|
4ad3a52fe1
|
[strings] fix lowercasing in string_utils.c
|
2017-01-01 20:08:34 -05:00 |
|
Al
|
0b5cc96654
|
[transliteration] add decompose option when stripping accents
|
2017-01-01 13:54:20 -05:00 |
|
Al
|
475aa3dbfa
|
[strings] fixing and simplifying string tree iterator. This version is inspired by Python's itertools.product (itertoolsmodule.c has so many goodies)
|
2016-12-31 03:22:27 -05:00 |
|
Al
|
261ec3888a
|
[strings] header changes for new utf8 lower/upper functions
|
2016-12-31 03:20:43 -05:00 |
|
Al
|
b1816e9b70
|
[utils] Adding cstring_array_split_ignore_consecutive
|
2016-12-12 11:37:27 -05:00 |
|
Al
|
b639fa5127
|
[utils] string_replace also creates a copy
|
2016-11-30 10:09:33 -08:00 |
|
Al
|
89f6611c4e
|
[strings] string_trim makes a copy rather than modifying the pointer
|
2016-11-28 15:06:07 -08:00 |
|
Al
|
92e66fd60c
|
[utils] string_next_hyphen_index
|
2016-08-16 12:49:52 -04:00 |
|
Al
|
b664ab1cea
|
[utils] Adding cstring_array_split_ignore_consecutive
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
98c395d34c
|
[numex] Concatenating a string of numeric expressions with no intervening tokens like Seventeen Eighty or Ten Oh Four
|
2016-02-10 09:21:31 -05:00 |
|
Al
|
7b300639f1
|
[fix] Trie prefix search tail comparison
|
2016-01-17 20:56:37 -05:00 |
|
Al
|
2e67afab09
|
[fix] adding functions to string_utils header
|
2016-01-06 23:03:16 -05:00 |
|
Al
|
d0aaff1482
|
[utils] string_equals with NULL check
|
2015-12-01 13:12:08 -05:00 |
|
Al
|
40918812e2
|
[normalize] Adding hyphen elimination as a string option (changes tokenization)
|
2015-10-27 13:32:47 -04:00 |
|
Al
|
bf596b9184
|
[utils] integer string sizes
|
2015-10-09 15:40:47 -04:00 |
|
Al
|
6428c0ae20
|
[utils] cstring_array_cat
|
2015-10-03 16:00:13 -04:00 |
|
Al
|
3fab0f984f
|
[fix] fixing some compiler warnings, using type-specific abs functions for vector_math
|
2015-09-19 16:11:09 -04:00 |
|
Al
|
17cfdb0625
|
[fix] adding char_array_append_* methods to header
|
2015-09-18 13:19:42 -04:00 |
|
Al
|
0ddf50cb5f
|
[utils] add to feature array with printf syntax
|
2015-09-10 10:24:51 -07:00 |
|
Al
|
b3f89a207a
|
[utils] Version of string_split for single character delimiters which modifies the input string directly rather than creating (essentially) a copy
|
2015-09-09 18:07:31 -07:00 |
|
Al
|
aa454c4430
|
[fix] removing char_array_copy from header
|
2015-09-07 23:58:05 -07:00 |
|
Al
|
ec3ab7234a
|
[utils] Adding index to cstring_array_foreach, similar to Python's enumerate
|
2015-09-04 19:34:06 -04:00 |
|
Al
|
a13e5117b5
|
[utils] string_tree_num_strings method
|
2015-08-10 17:46:37 -04:00 |
|
Al
|
064b6b5898
|
[utils] char_array_append_reversed for adding reversed strings without a malloc
|
2015-08-10 16:10:05 -04:00 |
|
Al
|
9b69d1f67a
|
[fix] Removing C++ checks from all but the main API functions
|
2015-08-07 17:15:39 -04:00 |
|
Al
|
359a1efb03
|
[fix] Adding stdint.h include to most of the header files for portability
|
2015-08-07 02:43:44 -04:00 |
|
Al
|
0738a57caa
|
[fix] restoring ctype.h include
|
2015-08-07 01:52:08 -04:00 |
|
Al
|
d7ebcd046e
|
[fix] includes
|
2015-08-07 01:00:26 -04:00 |
|
Al
|
3178eda501
|
[utils] string_contains_hyphen method
|
2015-08-02 14:35:18 -06:00 |
|
Al
|
7aee159c0c
|
[utils] string_tree_num_tokens
|
2015-07-27 12:36:34 -04:00 |
|
Al
|
a67ec44a08
|
[utils] cstring_array_terminate, moving msgpack_utils to separate file
|
2015-07-25 18:41:02 -04:00 |
|
Al
|
e549e76806
|
[utils] string_tree_iterator_foreach_token
|
2015-07-25 13:49:02 -04:00 |
|