37 Commits

Author SHA1 Message Date
Al
12c2477359 [phrases] Another fix to tail token search 2016-02-08 17:55:21 -05:00
Al
39f162b029 [phrases] fix in tokenized tail search when whitespace tokens are preserved 2016-02-08 16:37:52 -05:00
Al
9ac0379a65 [phrases] Case where trie search finds a match, makes progress beyond the next token but has to fall back. Adding trie search test case 2016-02-08 01:07:56 -05:00
Al
7b300639f1 [fix] Trie prefix search tail comparison 2016-01-17 20:56:37 -05:00
Al
850d82de6e [fix] In trie search, moving fall-off and tail checks inside the inner character loop, adding tail position as a separate variable from offset in the string 2015-12-23 19:16:43 -05:00
Al
aaa1fc0387 [fix] Stepping through codepoints first then through chars in trie_search_prefixes_from_index (used in transliteration and numex) 2015-12-23 01:58:39 -05:00
Al
baa8e3cc3f [fix] Compare the remaining part of the current UTF-8 character using simple string comparison, since it may be in the middle of a valid UTF-8 character 2015-12-21 20:34:15 -05:00
Al
39e83961ef [fix] Bug in suffix expansion affecting inseparable suffixes like burg as well as ordinal suffixes like first=>1st 2015-12-19 01:30:08 -05:00
Al
df47dad817 [fix] Partial matches, ultimate misses in concatenated suffixes 2015-12-18 17:37:06 -05:00
Al
66073c17d5 [fix] Handling case of concatenated suffixes like straße when they stand alone 2015-12-18 17:17:35 -05:00
Al
596c5ffdd3 [fix] Tokenized trie search 2015-12-05 15:21:52 -05:00
Al
25e89bcc41 [fix] tokenized trie search edge case where tail is stored on the space node 2015-12-03 12:25:21 -05:00
Al
1a1d74785c [fix] Compiler warnings for casts/printf 2015-10-26 18:52:18 -04:00
Al
2394f817e4 [phrases] Fixing fallback at the end of a string in trie search 2015-10-11 00:13:21 -05:00
Al
f2f7db92ff [fix] phrases 2015-09-18 13:19:18 -04:00
Al
23103a21d4 [phrases] Adding with_phrases versions of trie search methods for pre-allocated phrases 2015-09-16 21:23:34 -04:00
Al
e511eede74 [phrases] Prefix/suffix trie search using the new characters, fixing length of matched prefixes/suffixes and exiting early on falling off the trie 2015-08-10 16:02:38 -04:00
Al
11a9881988 [phrases] adding _from_index_get_prefix_char/_from_index_get_suffix_char methods 2015-08-09 03:41:20 -04:00
Al
2eb67ad850 [phrases] trie_search_prefixes/trie_search_suffixes now take a length param 2015-08-09 02:01:37 -04:00
Al
5acf7a4f3e [phrases] resetting node position when continuation falls off the trie 2015-08-08 22:18:05 -04:00
Al
b27030e39f [fix] tokenized trie search was skipping tokens in some cases 2015-08-02 14:36:21 -06:00
Al
0f5b69c06b [fix] transition to SEARCH_STATE_NO_MATCH in trie_search_tokens_from_index on a return to the start node 2015-07-27 16:35:27 -04:00
Al
8ff4ace63b [phrases] Allowing trie_search to process tokenized input with or without whitespace, and to handle ideographic characters correctly 2015-07-26 23:41:57 -04:00
Al
90a91cadd0 [search] Modifying trie_search_prefixes to use the new key schema 2015-07-24 15:59:49 -04:00
Al
9337bf9aea [phrases] trie_search_suffixes uses the NUL-byte prefix by default but the _from_index version can start from another node. fixing single character suffixes 2015-06-25 17:24:19 -04:00
Al
c159f83f9b [fix] trie_search logging 2015-06-12 16:17:41 -04:00
Al
6b60446dbe [phrases] no longer ignoring spaces in the input string, just trying different methods for hyphens, getting indexes right in the case where a space or hyphen precedes the match and backtracking on matches if the rest of the string falls off the trie 2015-06-12 11:30:24 -04:00
Al
6841ed8fb3 [phrases] Ignoring separators and dashes in trie_search_prefixes so it can be used for languages like German where numbers, phrases, etc. may just be concatenated together as a single token 2015-06-11 11:05:56 -04:00
Al
cb603562e0 [phrases] Adding *_from_index methods to trie_search 2015-06-09 11:14:42 -04:00
Al
2856c2b401 [utils] string_utils category functions take a category instead of a codepoint 2015-06-05 16:55:21 -04:00
Al
0177fd4b13 [fix] trie_search using proper length in utf8proc_iterate 2015-05-27 16:08:19 -04:00
Al
eecee39904 [fix] giving constant trie node names more specificity 2015-05-18 14:24:39 -04:00
Al
1373843b86 [fix] setting last_node in tokenized trie search in the case where a prefix phrase matches but the longer string doesn't. 2015-04-27 01:49:08 -04:00
Al
908e3dc03c [phrases] trie_search now only takes the original string and the token array. Fixed a bug where certain phrases were being found in string search but not in tokenized search 2015-04-19 09:32:20 -04:00
Al
79fd7a8ded [tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string 2015-04-05 16:33:14 -04:00
Al
310acbed2c [phrases] Adding prefix-only trie searches, primarily with Germanic languages in mind (spelled out numbers, concatenated prefixes). Making the prefix/suffix APIs for single tokens more consistent with trie searches over longer strings/token arrays 2015-04-01 02:52:57 -04:00
Al
5dd3896c4a [phrases] trie_search module for searching for millions of patterns in a trie simultaneously. Works for strings, token sequences, and can search for suffixes. 2015-03-03 13:51:01 -05:00