Commit Graph

50 Commits

Author SHA1 Message Date
Al
95e97c0585 [fix/utf8] reviewed and fixed all points where utf8proc_iterate is called and may return an error which can cause the iteration not to make forward progress. This includes fixing a bug where injecting invalid UTF-8 through a series of HTML-encoded codepoints can cause the C library to hang. Note: we're not fixing all the garbage encoding in the world, so if encoding is bad the output of expand_address may not be useful but it won't hang. Fixes #448 2025-07-02 00:10:49 -04:00
Al
1c5afcafd2 [phrases] when skipping/ignoring hyphens in trie search, make sure that the new longer phrase ends at a word boundary (space, hyphen, end of string, etc.) 2017-10-20 02:43:39 -04:00
Iestyn Pryce
6aa3cb61fd Fix log_* formats which expect long long int but receive int64_t. 2017-05-21 10:29:34 +01:00
Iestyn Pryce
87a76bf967 Fix log_{debug,info} formats which expect size_t but receive int. 2017-05-17 22:40:53 +01:00
Al
278679b7fb [fix] in tokenized trie_search, in the case of a partial failed match, reset to the root node before rolling the pointer back to phrase start + 1 2017-04-21 13:51:07 -04:00
Al
dfabd25e5d [phrases] set node data only when we're sure we have a correct match, otherwise the longer phrase may actually be matched 2017-03-17 03:40:29 -04:00
Al
56f68e4399 [phrases] fixing trie suffix search 2017-02-14 03:36:29 -05:00
Al
6e4f641743 [phrases] adding token_phrase_memberships to trie_search for reuse 2017-02-08 01:59:39 -05:00
Al
bdb51a244e [phrases] fix case in trie search when searching for tokens in a string tail. If we're on the last token in a sequence and the token matches the tail, check that the tail is complete, and if so return the match before exiting the loop. Affects multiword phrases that tend to appear toward the end of a sequence (long country names like "United States of America", etc.) 2016-12-29 16:17:09 -05:00
Al
f78281456a [fix] header definition 2016-11-27 01:00:25 -08:00
Al
330edc2c93 [utils] cstring_array_get_phrase requires a char_array to be passed in so it doesn't have to do any memory allocation 2016-08-16 13:11:45 -04:00
Al
965bac1833 [trie] Making methods to construct string phrases from phrase matches available through trie_search.h 2016-07-30 17:06:20 -04:00
Al
41ae742285 [fix] tokenized trie search when falling off the trie at the start of a valid phrase 2016-07-21 17:04:57 -04:00
Al
12c2477359 [phrases] Another fix to tail token search 2016-02-08 17:55:21 -05:00
Al
39f162b029 [phrases] fix in tokenized tail search when whitespace tokens are preserved 2016-02-08 16:37:52 -05:00
Al
9ac0379a65 [phrases] Case where trie search finds a match, makes progress beyond the next token but has to fall back. Adding trie search test case 2016-02-08 01:07:56 -05:00
Al
7b300639f1 [fix] Trie prefix search tail comparison 2016-01-17 20:56:37 -05:00
Al
850d82de6e [fix] In trie search, moving fall-off and tail checks inside the inner character loop, adding tail position as a separate variable from offset in the string 2015-12-23 19:16:43 -05:00
Al
aaa1fc0387 [fix] Stepping through codepoints first then through chars in trie_search_prefixes_from_index (used in transliteration and numex) 2015-12-23 01:58:39 -05:00
Al
baa8e3cc3f [fix] Compare the remaining part of the current UTF-8 character using simple string comparison, since it may be in the middle of a valid UTF-8 character 2015-12-21 20:34:15 -05:00
Al
39e83961ef [fix] Bug in suffix expansion affecting inseparable suffixes like burg as well as ordinal suffixes like first=>1st 2015-12-19 01:30:08 -05:00
Al
df47dad817 [fix] Partial matches, ultimate misses in concatenated suffixes 2015-12-18 17:37:06 -05:00
Al
66073c17d5 [fix] Handling case of concatenated suffixes like straße when they stand alone 2015-12-18 17:17:35 -05:00
Al
596c5ffdd3 [fix] Tokenized trie search 2015-12-05 15:21:52 -05:00
Al
25e89bcc41 [fix] tokenized trie search edge case where tail is stored on the space node 2015-12-03 12:25:21 -05:00
Al
1a1d74785c [fix] Compiler warnings for casts/printf 2015-10-26 18:52:18 -04:00
Al
2394f817e4 [phrases] Fixing fallback at the end of a string in trie search 2015-10-11 00:13:21 -05:00
Al
f2f7db92ff [fix] phrases 2015-09-18 13:19:18 -04:00
Al
23103a21d4 [phrases] Adding with_phrases versions of trie search methods for pre-allocated phrases 2015-09-16 21:23:34 -04:00
Al
e511eede74 [phrases] Prefix/suffix trie search using the new characters, fixing length of matched prefixes/suffixes and exiting early on falling off the trie 2015-08-10 16:02:38 -04:00
Al
11a9881988 [phrases] adding _from_index_get_prefix_char/_from_index_get_suffix_char methods 2015-08-09 03:41:20 -04:00
Al
2eb67ad850 [phrases] trie_search_prefixes/trie_search_suffixes now take a length param 2015-08-09 02:01:37 -04:00
Al
5acf7a4f3e [phrases] resetting node position when continuation falls off the trie 2015-08-08 22:18:05 -04:00
Al
b27030e39f [fix] tokenized trie search was skipping tokens in some cases 2015-08-02 14:36:21 -06:00
Al
0f5b69c06b [fix] transition to SEARCH_STATE_NO_MATCH in trie_search_tokens_from_index on a return to the start node 2015-07-27 16:35:27 -04:00
Al
8ff4ace63b [phrases] Allowing trie_search to process tokenized input with or without whitespace, and to handle ideographic characters correctly 2015-07-26 23:41:57 -04:00
Al
90a91cadd0 [search] Modifying trie_search_prefixes to use the new key schema 2015-07-24 15:59:49 -04:00
Al
9337bf9aea [phrases] trie_search_suffixes uses the NUL-byte prefix by default but the _from_index version can start from another node. fixing single character suffixes 2015-06-25 17:24:19 -04:00
Al
c159f83f9b [fix] trie_search logging 2015-06-12 16:17:41 -04:00
Al
6b60446dbe [phrases] no longer ignoring spaces in the input string, just trying different methods for hyphens, getting indexes right in the case where a space or hyphen precedes the match and backtracking on matches if the rest of the string falls off the trie 2015-06-12 11:30:24 -04:00
Al
6841ed8fb3 [phrases] Ignoring separators and dashes in trie_search_prefixes so it can be used for languages like German where numbers, phrases, etc. may just be concatenated together as a single token 2015-06-11 11:05:56 -04:00
Al
cb603562e0 [phrases] Adding *_from_index methods to trie_search 2015-06-09 11:14:42 -04:00
Al
2856c2b401 [utils] string_utils category functions take a category instead of a codepoint 2015-06-05 16:55:21 -04:00
Al
0177fd4b13 [fix] trie_search using proper length in utf8proc_iterate 2015-05-27 16:08:19 -04:00
Al
eecee39904 [fix] giving constant trie node names more specificity 2015-05-18 14:24:39 -04:00
Al
1373843b86 [fix] setting last_node in tokenized trie search in the case where a prefix phrase matches but the longer string doesn't. 2015-04-27 01:49:08 -04:00
Al
908e3dc03c [phrases] trie_search now only takes the original string and the token array. Fixed a bug where certain phrases were being found in string search but not in tokenized search 2015-04-19 09:32:20 -04:00
Al
79fd7a8ded [tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string 2015-04-05 16:33:14 -04:00
Al
310acbed2c [phrases] Adding prefix-only trie searches, primarily with Germanic languages in mind (spelled out numbers, concatenated prefixes). Making the prefix/suffix APIs for single tokens more consistent with trie searches over longer strings/token arrays 2015-04-01 02:52:57 -04:00
Al
5dd3896c4a [phrases] trie_search module for searching for millions of patterns in a trie simultaneously. Works for strings, token sequences, and can search for suffixes. 2015-03-03 13:51:01 -05:00