Commit Graph

1172 Commits

Author SHA1 Message Date
Al
e8fdd4564d [utils] adding string_tree for listing sets of token alternatives and string_tree_iterator to generate permutations over the strings, needed for transliteration and ambiguous address elements/place names 2015-05-16 23:16:10 -04:00
Al
f151a2232c [transliteration] new transliteration rules data file 2015-05-16 23:14:47 -04:00
Al
5983cb6af0 [i18n] Adding NUM_SCRIPTS to the end of the scripts enum 2015-05-16 12:19:40 -04:00
Al
8699409f15 [transliteration] resulting data file 2015-05-14 16:34:49 -04:00
Al
2d49369e78 [utils] Adding read/write for 64-bit ints to file_utils 2015-05-13 17:51:03 -04:00
Al
6898f8ecd9 [transliteration] Adding context types back to transliteration rule struct since they don't matter in the actual transliteration table 2015-05-13 16:51:07 -04:00
Al
b777b60e07 [transliteration] new data file 2015-05-13 16:21:16 -04:00
Al
cbe83376f2 [transliteration] Adding new, even smaller, generated data file 2015-05-12 18:58:38 -04:00
Al
0984fb9ea4 [transliteration] new, more compact transliteration data file 2015-05-12 12:13:11 -04:00
Al
2a69488f9b [fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using non-NUL byte for the empty transition. Adding resulting data file. 2015-05-08 17:14:26 -04:00
Al
10ebaf147a [transliteration] literal ^ and $ escaped 2015-05-01 19:16:36 -04:00
Al
ff851a464c [fix] escaping curly braces for regex compilation 2015-04-30 13:27:17 -04:00
Al
fa43abd8d9 [transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key 2015-04-29 14:31:15 -04:00
Al
1c25238af7 [fix] string lengths on the various transliteration rules 2015-04-27 13:51:35 -04:00
Al
1373843b86 [fix] setting last_node in tokenized trie search in the case where a prefix phrase matches but the longer string doesn't. 2015-04-27 01:49:08 -04:00
Al
b2ba629f95 [fix] trie_get methods just return node index rather than data value 2015-04-27 01:28:05 -04:00
Al
8fb9bacfa6 [phrases] New trie_add_nodes_only method for concatenating strings to the trie, plus boolean return values on trie_add_* APIs 2015-04-27 01:01:43 -04:00
Al
8bc77372ef [phrases] exposing trie_add_at_index and trie_get_from_index for more control in the transliteration tries 2015-04-26 22:24:02 -04:00
Al
6ebea11640 [transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters 2015-04-26 19:47:54 -04:00
Al
ff9b6735f8 [transliteration] Adding header + generated C data file for simplified transliteration rules 2015-04-25 15:44:36 -04:00
Al
1b33744956 [tokenization] Numeric tokens must end in number or letter 2015-04-22 14:55:18 -04:00
Al
9c0126a01c [utils] two set types in collections.h 2015-04-19 09:32:53 -04:00
Al
908e3dc03c [phrases] trie_search now only takes the original string and the token array. Fixed a bug where certain phrases were being found in string search but not in tokenized search 2015-04-19 09:32:20 -04:00
Al
606a669c01 [tokenization] breaking dashes or double hyphens break a word while other dashes don't 2015-04-17 19:14:42 -04:00
Al
6718182443 [tokenization] non-breaking dashes can be mid-word, em-dashes, etc. break words 2015-04-17 15:21:22 -04:00
Al
e21873635c [utils] Using token offsets to calculate lengths for contiguous string arrays, inlining a few functions 2015-04-15 20:17:03 -04:00
Al
e241c1dfc8 [rm] Removing dependency on sds, char_array and cstring_array have similar benefits/functionality with fewer drawbacks 2015-04-12 18:07:33 -04:00
Al
83813bb980 [geodisambig] Models for geonames with msgpack serialization/deserialization 2015-04-12 16:47:01 -04:00
Al
1f9da05dd5 [geodisambig] C msgpack serialization dependency 2015-04-12 15:14:01 -04:00
Al
0234754c20 [fix] warnings in string_utils 2015-04-12 12:16:32 -04:00
Al
3a7f18581e [utils] Adding min, max, argmin, argmax and log_sum_exp to generic vector math header 2015-04-12 12:11:04 -04:00
Al
4729dfe178 [utils] string_[rl]strip => string_[rl]trim, removing warning about allocation 2015-04-06 02:19:19 -04:00
Al
53844067b1 [fix] better allocation sizes for tokenized strings 2015-04-05 22:02:31 -04:00
Al
198e51b8a3 [utils] more/better char_array methods 2015-04-05 22:01:46 -04:00
Al
79fd7a8ded [tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string 2015-04-05 16:33:14 -04:00
Al
5f3d74de18 [fix] contiguous string array 2015-04-03 11:22:50 -04:00
Al
c81aa72254 [utils] a few changes to contiguous string arrays 2015-04-01 19:02:11 -04:00
Al
fa59b63ab2 [fix] type name/import 2015-04-01 02:54:14 -04:00
Al
310acbed2c [phrases] Adding prefix-only trie searches, primarily with Germanic languages in mind (spelled out numbers, concatenated prefixes). Making the prefix/suffix APIs for single tokens more consistent with trie searches over longer strings/token arrays 2015-04-01 02:52:57 -04:00
Al
1ac4438e39 [utils] More consistent naming in string_utils 2015-03-27 21:12:08 -04:00
Al
127a61d492 [utils] adding pop method on the improved vectors 2015-03-27 21:00:03 -04:00
Al
3678d4a3ca [gazetteers] string name doesn't need to be part of the gazetteer itself, adding two new dictionary types to the config for named people (e.g. JFK) and named organizations (e.g. UN) 2015-03-27 20:59:21 -04:00
Al
4ccd1b1fe2 [fix] update feature arrays to use the new APIs 2015-03-27 20:57:42 -04:00
Al
6768936953 [utils] Adding vector_math.h with some inline methods for vector operations (sum, dot product, arithmetic, etc.). Works with kvec dynamic vectors. 2015-03-27 20:57:03 -04:00
Al
70195fffd5 [utils] new methods on string_utils for better dynamic strings which retain the benefits of sds without having to worry about the pointer changing, renaming contiguous string array methods to something more succinct 2015-03-27 20:55:36 -04:00
Al
2d1c24a6e9 [tokenization] Adding url, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types 2015-03-24 16:43:53 -04:00
Al
7ffe788913 [unicode] header 2015-03-18 17:25:53 -04:00
Al
d5a9041cd3 [unicode] Adding generated unicode script data 2015-03-18 17:01:03 -04:00
Al
d2ceb5f418 [fix] removing struct definition from scanner.re for future generation of scanner.c 2015-03-17 19:46:40 -04:00
Al
f794ef7222 [tokenization] Exposing some of the scanner's methods in header for use in the Python scanner so it can avoid the additional allocation 2015-03-17 18:38:30 -04:00