103 Commits

Author | SHA1 | Message | Date
Al | 10ebaf147a | [transliteration] literal ^ and $ escaped | 2015-05-01 19:16:36 -04:00
Al | ff851a464c | [fix] escaping curly braces for regex compilation | 2015-04-30 13:27:17 -04:00
Al | fa43abd8d9 | [transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key | 2015-04-29 14:31:15 -04:00
Al | 1c25238af7 | [fix] string lengths on the various transliteration rules | 2015-04-27 13:51:35 -04:00
Al | 1373843b86 | [fix] setting last_node in tokenized trie search in the case where a prefix phrase matches but the longer string doesn't. | 2015-04-27 01:49:08 -04:00
Al | b2ba629f95 | [fix] trie_get methods just return node index rather than data value | 2015-04-27 01:28:05 -04:00
Al | 8fb9bacfa6 | [phrases] New trie_add_nodes_only method for concatenating strings to the trie, plus boolean return values on trie_add_* APIs | 2015-04-27 01:01:43 -04:00
Al | 8bc77372ef | [phrases] exposing trie_add_at_index and trie_get_from_index for more control in the transliteration tries | 2015-04-26 22:24:02 -04:00
Al | 6ebea11640 | [transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters | 2015-04-26 19:47:54 -04:00
Al | ff9b6735f8 | [transliteration] Adding header + generated C data file for simplified transliteration rules | 2015-04-25 15:44:36 -04:00
Al | be29874f13 | [transliteration] Parser for CLDR transforms to generate (simple) C transform rules | 2015-04-25 15:42:21 -04:00
Al | 1b33744956 | [tokenization] Numeric tokens must end in number or letter | 2015-04-22 14:55:18 -04:00
Al | 9c0126a01c | [utils] two set types in collections.h | 2015-04-19 09:32:53 -04:00
Al | 908e3dc03c | [phrases] trie_search now only takes the original string and the token array. Fixed a bug where certain phrases were being found in string search but not in tokenized search | 2015-04-19 09:32:20 -04:00
Al | 606a669c01 | [tokenization] breaking dashes or double hyphens break a word while other dashes don't | 2015-04-17 19:14:42 -04:00
Al | 6718182443 | [tokenization] non-breaking dashes can be mid-word, em-dashes, etc. break words | 2015-04-17 15:21:22 -04:00
Al | e21873635c | [utils] Using token offsets to calculate lengths for contiguous string arrays, inlining a few functions | 2015-04-15 20:17:03 -04:00
Al | 24e62b1c6c | [tokenization] Script to generate TR-29 ranges for re2c scanner | 2015-04-14 15:50:50 -04:00
Al | 5fa03587fb | [cldr] simple Python scanner for creating dynamic scanners for CLDR rule parsing | 2015-04-14 15:49:24 -04:00
Al | efdcbc9eef | [project] adding a Python .gitignore for scripts, Python lib, etc. | 2015-04-14 15:48:43 -04:00
Al | 6e9295154a | [fix] local dirs for cldr data | 2015-04-14 15:46:15 -04:00
Al | 744231c148 | [fix] cldr supplemental uses local copy | 2015-04-13 19:03:44 -04:00
Al | a8b9981c9b | [fix] vars | 2015-04-13 19:03:14 -04:00
Al | d1267145f7 | [fix] args to wget | 2015-04-13 19:02:50 -04:00
Al | d771da7c78 | [i18n] unicode scripts file downloaded and cached locally | 2015-04-13 19:02:29 -04:00
Al | cc4d2d08eb | [cldr] Adding script to download latest cldr release instead of pulling from the repo | 2015-04-13 01:03:15 -04:00
Al | e241c1dfc8 | [rm] Removing dependency on sds, char_array and cstring_array have similar benefits/functionality with fewer drawbacks | 2015-04-12 18:07:33 -04:00
Al | 83813bb980 | [geodisambig] Models for geonames with msgpack serialization/deserialization | 2015-04-12 16:47:01 -04:00
Al | acb575c84c | [fix] splitting out methods for unicode scripts | 2015-04-12 15:21:23 -04:00
Al | 1f9da05dd5 | [geodisambig] C msgpack serialization dependency | 2015-04-12 15:14:01 -04:00
Al | 0234754c20 | [fix] warnings in string_utils | 2015-04-12 12:16:32 -04:00
Al | d50d7d182e | [fix] geonames import script for admin 1 codes | 2015-04-12 12:16:08 -04:00
Al | 888baa86f3 | [fix] English dictionaries | 2015-04-12 12:15:47 -04:00
Al | 3a7f18581e | [utils] Adding min, max, argmin, argmax and log_sum_exp to generic vector math header | 2015-04-12 12:11:04 -04:00
Al | fdd0c489f3 | [fix] refactoring unicode script fetching into more reusable functions | 2015-04-09 02:18:13 -04:00
Al | 4729dfe178 | [utils] string_[rl]strip => string_[rl]trim, removing warning about allocation | 2015-04-06 02:19:19 -04:00
Al | 53844067b1 | [fix] better allocation sizes for tokenized strings | 2015-04-05 22:02:31 -04:00
Al | 198e51b8a3 | [utils] more/better char_array methods | 2015-04-05 22:01:46 -04:00
Al | 79fd7a8ded | [tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string | 2015-04-05 16:33:14 -04:00
Al | 5f3d74de18 | [fix] contiguous string array | 2015-04-03 11:22:50 -04:00
Al | fcaeebd656 | [dictionaries] fixes to French dictionary | 2015-04-01 19:02:38 -04:00
Al | c81aa72254 | [utils] a few changes to contiguous string arrays | 2015-04-01 19:02:11 -04:00
Al | fa59b63ab2 | [fix] type name/import | 2015-04-01 02:54:14 -04:00
Al | 310acbed2c | [phrases] Adding prefix-only trie searches, primarily with Germanic languages in mind (spelled out numbers, concatenated prefixes). Making the prefix/suffix APIs for single tokens more consistent with trie searches over longer strings/token arrays | 2015-04-01 02:52:57 -04:00
Al | 1ac4438e39 | [utils] More consistent naming in string_utils | 2015-03-27 21:12:08 -04:00
Al | 70831b5005 | [dictionaries] French elisions | 2015-03-27 21:03:55 -04:00
Al | 127a61d492 | [utils] adding pop method on the improved vectors | 2015-03-27 21:00:03 -04:00
Al | 3678d4a3ca | [gazetteers] string name doesn't need to be part of the gazetteer itself, adding two new dictionary types to the config for named people (e.g. JFK) and named organizations (e.g. UN) | 2015-03-27 20:59:21 -04:00
Al | 4ccd1b1fe2 | [fix] update feature arrays to use the new APIs | 2015-03-27 20:57:42 -04:00
Al | 6768936953 | [utils] Adding vector_math.h with some inline methods for vector operations (sum, dot product, arithmetic, etc.). Works with kvec dynamic vectors. | 2015-03-27 20:57:03 -04:00