Al
|
95e97c0585
|
[fix/utf8] reviewed and fixed all points where utf8proc_iterate is called and may return an error which can cause the iteration not to make forward progress. This includes fixing a bug where injecting invalid UTF-8 through a series of HTML-encoded codepoints can cause the C library to hang. Note: we're not fixing all the garbage encoding in the world, so if encoding is bad the output of expand_address may not be useful but it won't hang. Fixes #448
|
2025-07-02 00:10:49 -04:00 |
|
Al
|
1c5afcafd2
|
[phrases] when skipping/ignoring hyphens in trie search, make sure that the new longer phrase ends at a word boundary (space, hyphen, end of string, etc.)
|
2017-10-20 02:43:39 -04:00 |
|
Iestyn Pryce
|
6aa3cb61fd
|
Fix log_* formats which expect long long int but receive int64_t.
|
2017-05-21 10:29:34 +01:00 |
|
Iestyn Pryce
|
87a76bf967
|
Fix log_{debug,info} formats which expect size_t but receive int.
|
2017-05-17 22:40:53 +01:00 |
|
Al
|
278679b7fb
|
[fix] in tokenized trie_search, in the case of a partial failed match, reset to the root node before rolling the pointer back to phrase start + 1
|
2017-04-21 13:51:07 -04:00 |
|
Al
|
dfabd25e5d
|
[phrases] set node data only when we're sure we have a correct match, otherwise the longer phrase may actually be matched
|
2017-03-17 03:40:29 -04:00 |
|
Al
|
56f68e4399
|
[phrases] fixing trie suffix search
|
2017-02-14 03:36:29 -05:00 |
|
Al
|
6e4f641743
|
[phrases] adding token_phrase_memberships to trie_search for reuse
|
2017-02-08 01:59:39 -05:00 |
|
Al
|
bdb51a244e
|
[phrases] fix case in trie search when searching for tokens in a string tail. If we're on the last token in a sequence and the token matches the tail, check that the tail is complete, and if so return the match before exiting the loop. Affects multiword phrases that tend to appear toward the end of a sequence (long country names like "United States of America", etc.)
|
2016-12-29 16:17:09 -05:00 |
|
Al
|
f78281456a
|
[fix] header definition
|
2016-11-27 01:00:25 -08:00 |
|
Al
|
330edc2c93
|
[utils] cstring_array_get_phrase requires a char_array to be passed in so it doesn't have to do any memory allocation
|
2016-08-16 13:11:45 -04:00 |
|
Al
|
965bac1833
|
[trie] Making methods to construct string phrases from phrase matches available through trie_search.h
|
2016-07-30 17:06:20 -04:00 |
|
Al
|
41ae742285
|
[fix] tokenized trie search when falling off the trie at the start of a valid phrase
|
2016-07-21 17:04:57 -04:00 |
|
Al
|
12c2477359
|
[phrases] Another fix to tail token search
|
2016-02-08 17:55:21 -05:00 |
|
Al
|
39f162b029
|
[phrases] fix in tokenized tail search when whitespace tokens are preserved
|
2016-02-08 16:37:52 -05:00 |
|
Al
|
9ac0379a65
|
[phrases] Case where trie search finds a match, makes progress beyond the next token but has to fall back. Adding trie search test case
|
2016-02-08 01:07:56 -05:00 |
|
Al
|
7b300639f1
|
[fix] Trie prefix search tail comparison
|
2016-01-17 20:56:37 -05:00 |
|
Al
|
850d82de6e
|
[fix] In trie search, moving fall-off and tail checks inside the inner character loop, adding tail position as a separate variable from offset in the string
|
2015-12-23 19:16:43 -05:00 |
|
Al
|
aaa1fc0387
|
[fix] Stepping through codepoints first then through chars in trie_search_prefixes_from_index (used in transliteration and numex)
|
2015-12-23 01:58:39 -05:00 |
|
Al
|
baa8e3cc3f
|
[fix] Compare the remaining part of the current UTF-8 character using simple string comparison, since it may be in the middle of a valid UTF-8 character
|
2015-12-21 20:34:15 -05:00 |
|
Al
|
39e83961ef
|
[fix] Bug in suffix expansion affecting inseparable suffixes like burg as well as ordinal suffixes like first=>1st
|
2015-12-19 01:30:08 -05:00 |
|
Al
|
df47dad817
|
[fix] Partial matches, ultimate misses in concatenated suffixes
|
2015-12-18 17:37:06 -05:00 |
|
Al
|
66073c17d5
|
[fix] Handling case of concatenated suffixes like straße when they stand alone
|
2015-12-18 17:17:35 -05:00 |
|
Al
|
596c5ffdd3
|
[fix] Tokenized trie search
|
2015-12-05 15:21:52 -05:00 |
|
Al
|
25e89bcc41
|
[fix] tokenized trie search edge case where tail is stored on the space node
|
2015-12-03 12:25:21 -05:00 |
|
Al
|
1a1d74785c
|
[fix] Compiler warnings for casts/printf
|
2015-10-26 18:52:18 -04:00 |
|
Al
|
2394f817e4
|
[phrases] Fixing fallback at the end of a string in trie search
|
2015-10-11 00:13:21 -05:00 |
|
Al
|
f2f7db92ff
|
[fix] phrases
|
2015-09-18 13:19:18 -04:00 |
|
Al
|
23103a21d4
|
[phrases] Adding with_phrases versions of trie search methods for pre-allocated phrases
|
2015-09-16 21:23:34 -04:00 |
|
Al
|
e511eede74
|
[phrases] Prefix/suffix trie search using the new characters, fixing length of matched prefixes/suffixes and exiting early on falling off the trie
|
2015-08-10 16:02:38 -04:00 |
|
Al
|
11a9881988
|
[phrases] adding _from_index_get_prefix_char/_from_index_get_suffix_char methods
|
2015-08-09 03:41:20 -04:00 |
|
Al
|
2eb67ad850
|
[phrases] trie_search_prefixes/trie_search_suffixes now take a length param
|
2015-08-09 02:01:37 -04:00 |
|
Al
|
5acf7a4f3e
|
[phrases] resetting node position when continuation falls off the trie
|
2015-08-08 22:18:05 -04:00 |
|
Al
|
b27030e39f
|
[fix] tokenized trie search was skipping tokens in some cases
|
2015-08-02 14:36:21 -06:00 |
|
Al
|
0f5b69c06b
|
[fix] transition to SEARCH_STATE_NO_MATCH in trie_search_tokens_from_index on a return to the start node
|
2015-07-27 16:35:27 -04:00 |
|
Al
|
8ff4ace63b
|
[phrases] Allowing trie_search to process tokenized input with or without whitespace, and to handle ideographic characters correctly
|
2015-07-26 23:41:57 -04:00 |
|
Al
|
90a91cadd0
|
[search] Modifying trie_search_prefixes to use the new key schema
|
2015-07-24 15:59:49 -04:00 |
|
Al
|
9337bf9aea
|
[phrases] trie_search_suffixes uses the NUL-byte prefix by default but the _from_index version can start from another node. fixing single character suffixes
|
2015-06-25 17:24:19 -04:00 |
|
Al
|
c159f83f9b
|
[fix] trie_search logging
|
2015-06-12 16:17:41 -04:00 |
|
Al
|
6b60446dbe
|
[phrases] no longer ignoring spaces in the input string, just trying different methods for hyphens, getting indexes right in the case where a space or hyphen precedes the match and backtracking on matches if the rest of the string falls off the trie
|
2015-06-12 11:30:24 -04:00 |
|
Al
|
6841ed8fb3
|
[phrases] Ignoring separators and dashes in trie_search_prefixes so it can be used for languages like German where numbers, phrases, etc. may just be concatenated together as a single token
|
2015-06-11 11:05:56 -04:00 |
|
Al
|
cb603562e0
|
[phrases] Adding *_from_index methods to trie_search
|
2015-06-09 11:14:42 -04:00 |
|
Al
|
2856c2b401
|
[utils] string_utils category functions take a category instead of a codepoint
|
2015-06-05 16:55:21 -04:00 |
|
Al
|
0177fd4b13
|
[fix] trie_search using proper length in utf8proc_iterate
|
2015-05-27 16:08:19 -04:00 |
|
Al
|
eecee39904
|
[fix] giving constant trie node names more specificity
|
2015-05-18 14:24:39 -04:00 |
|
Al
|
1373843b86
|
[fix] setting last_node in tokenized trie search in the case where a prefix phrase matches but the longer string doesn't.
|
2015-04-27 01:49:08 -04:00 |
|
Al
|
908e3dc03c
|
[phrases] trie_search now only takes the original string and the token array. Fixed a bug where certain phrases were being found in string search but not in tokenized search
|
2015-04-19 09:32:20 -04:00 |
|
Al
|
79fd7a8ded
|
[tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string
|
2015-04-05 16:33:14 -04:00 |
|
Al
|
310acbed2c
|
[phrases] Adding prefix-only trie searches, primarily with Germanic languages in mind (spelled out numbers, concatenated prefixes). Making the prefix/suffix APIs for single tokens more consistent with trie searches over longer strings/token arrays
|
2015-04-01 02:52:57 -04:00 |
|
Al
|
5dd3896c4a
|
[phrases] trie_search module for searching for millions of patterns in a trie simultaneously. Works for strings, token sequences, and can search for suffixes.
|
2015-03-03 13:51:01 -05:00 |
|