Al
|
cbe83376f2
|
[transliteration] Adding new, even smaller, generated data file
|
2015-05-12 18:58:38 -04:00 |
|
Al
|
0984fb9ea4
|
[transliteration] new, more compact transliteration data file
|
2015-05-12 12:13:11 -04:00 |
|
Al
|
2a69488f9b
|
[fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using a non-NUL byte for the empty transition. Adding resulting data file.
|
2015-05-08 17:14:26 -04:00 |
|
Al
|
10ebaf147a
|
[transliteration] literal ^ and $ escaped
|
2015-05-01 19:16:36 -04:00 |
|
Al
|
ff851a464c
|
[fix] escaping curly braces for regex compilation
|
2015-04-30 13:27:17 -04:00 |
|
Al
|
fa43abd8d9
|
[transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key
|
2015-04-29 14:31:15 -04:00 |
|
Al
|
1c25238af7
|
[fix] string lengths on the various transliteration rules
|
2015-04-27 13:51:35 -04:00 |
|
Al
|
1373843b86
|
[fix] setting last_node in tokenized trie search in the case where a prefix phrase matches but the longer string doesn't.
|
2015-04-27 01:49:08 -04:00 |
|
Al
|
b2ba629f95
|
[fix] trie_get methods just return node index rather than data value
|
2015-04-27 01:28:05 -04:00 |
|
Al
|
8fb9bacfa6
|
[phrases] New trie_add_nodes_only method for concatenating strings to the trie, plus boolean return values on trie_add_* APIs
|
2015-04-27 01:01:43 -04:00 |
|
Al
|
8bc77372ef
|
[phrases] exposing trie_add_at_index and trie_get_from_index for more control in the transliteration tries
|
2015-04-26 22:24:02 -04:00 |
|
Al
|
6ebea11640
|
[transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters
|
2015-04-26 19:47:54 -04:00 |
|
Al
|
ff9b6735f8
|
[transliteration] Adding header + generated C data file for simplified transliteration rules
|
2015-04-25 15:44:36 -04:00 |
|
Al
|
1b33744956
|
[tokenization] Numeric tokens must end in number or letter
|
2015-04-22 14:55:18 -04:00 |
|
Al
|
9c0126a01c
|
[utils] two set types in collections.h
|
2015-04-19 09:32:53 -04:00 |
|
Al
|
908e3dc03c
|
[phrases] trie_search now only takes the original string and the token array. Fixed a bug where certain phrases were being found in string search but not in tokenized search
|
2015-04-19 09:32:20 -04:00 |
|
Al
|
606a669c01
|
[tokenization] breaking dashes or double hyphens break a word while other dashes don't
|
2015-04-17 19:14:42 -04:00 |
|
Al
|
6718182443
|
[tokenization] non-breaking dashes can be mid-word; em-dashes, etc. break words
|
2015-04-17 15:21:22 -04:00 |
|
Al
|
e21873635c
|
[utils] Using token offsets to calculate lengths for contiguous string arrays, inlining a few functions
|
2015-04-15 20:17:03 -04:00 |
|
Al
|
e241c1dfc8
|
[rm] Removing dependency on sds, char_array and cstring_array have similar benefits/functionality with fewer drawbacks
|
2015-04-12 18:07:33 -04:00 |
|
Al
|
83813bb980
|
[geodisambig] Models for geonames with msgpack serialization/deserialization
|
2015-04-12 16:47:01 -04:00 |
|
Al
|
1f9da05dd5
|
[geodisambig] C msgpack serialization dependency
|
2015-04-12 15:14:01 -04:00 |
|
Al
|
0234754c20
|
[fix] warnings in string_utils
|
2015-04-12 12:16:32 -04:00 |
|
Al
|
3a7f18581e
|
[utils] Adding min, max, argmin, argmax and log_sum_exp to generic vector math header
|
2015-04-12 12:11:04 -04:00 |
|
Al
|
4729dfe178
|
[utils] string_[rl]strip => string_[rl]trim, removing warning about allocation
|
2015-04-06 02:19:19 -04:00 |
|
Al
|
53844067b1
|
[fix] better allocation sizes for tokenized strings
|
2015-04-05 22:02:31 -04:00 |
|
Al
|
198e51b8a3
|
[utils] more/better char_array methods
|
2015-04-05 22:01:46 -04:00 |
|
Al
|
79fd7a8ded
|
[tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string
|
2015-04-05 16:33:14 -04:00 |
|
Al
|
5f3d74de18
|
[fix] contiguous string array
|
2015-04-03 11:22:50 -04:00 |
|
Al
|
c81aa72254
|
[utils] a few changes to contiguous string arrays
|
2015-04-01 19:02:11 -04:00 |
|
Al
|
fa59b63ab2
|
[fix] type name/import
|
2015-04-01 02:54:14 -04:00 |
|
Al
|
310acbed2c
|
[phrases] Adding prefix-only trie searches, primarily with Germanic languages in mind (spelled out numbers, concatenated prefixes). Making the prefix/suffix APIs for single tokens more consistent with trie searches over longer strings/token arrays
|
2015-04-01 02:52:57 -04:00 |
|
Al
|
1ac4438e39
|
[utils] More consistent naming in string_utils
|
2015-03-27 21:12:08 -04:00 |
|
Al
|
127a61d492
|
[utils] adding pop method on the improved vectors
|
2015-03-27 21:00:03 -04:00 |
|
Al
|
3678d4a3ca
|
[gazetteers] string name doesn't need to be part of the gazetteer itself, adding two new dictionary types to the config for named people (e.g. JFK) and named organizations (e.g. UN)
|
2015-03-27 20:59:21 -04:00 |
|
Al
|
4ccd1b1fe2
|
[fix] update feature arrays to use the new APIs
|
2015-03-27 20:57:42 -04:00 |
|
Al
|
6768936953
|
[utils] Adding vector_math.h with some inline methods for vector operations (sum, dot product, arithmetic, etc.). Works with kvec dynamic vectors.
|
2015-03-27 20:57:03 -04:00 |
|
Al
|
70195fffd5
|
[utils] new methods on string_utils for better dynamic strings that retain the benefits of sds without having to worry about the pointer changing, renaming contiguous string array methods to something more succinct
|
2015-03-27 20:55:36 -04:00 |
|
Al
|
2d1c24a6e9
|
[tokenization] Adding url, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types
|
2015-03-24 16:43:53 -04:00 |
|
Al
|
7ffe788913
|
[unicode] header
|
2015-03-18 17:25:53 -04:00 |
|
Al
|
d5a9041cd3
|
[unicode] Adding generated unicode script data
|
2015-03-18 17:01:03 -04:00 |
|
Al
|
d2ceb5f418
|
[fix] removing struct definition from scanner.re for future generation of scanner.c
|
2015-03-17 19:46:40 -04:00 |
|
Al
|
f794ef7222
|
[tokenization] Exposing some of the scanner's methods in header for use in the Python scanner so it can avoid the additional allocation
|
2015-03-17 18:38:30 -04:00 |
|
Al
|
daf3f8706b
|
[utils] adding tab and comma constants to file_utils for parsing CSV/TSV files
|
2015-03-17 18:35:45 -04:00 |
|
Al
|
f787851754
|
[unicode] Upgrading to JuliaLang's utf8proc (Unicode 7, maintained)
|
2015-03-17 12:20:08 -04:00 |
|
Al
|
0df849b440
|
[features] Feature array, a special case of contiguous string array for adding namespaced features in CRF-like sequence models
|
2015-03-14 18:37:41 -04:00 |
|
Al
|
53aa9bccb1
|
[geodisambig] adding MurmurHash3, used by the Bloom filter
|
2015-03-11 17:47:57 -04:00 |
|
Al
|
cf613ee475
|
[geodisambig] Bloom filter implementation for quick probabilistic set membership tests before hitting disk. 100% recall and bounded precision, saves disk seeks for keys that definitely do not exist (useful for Geonames disambiguation-related lookups and in-process deduping).
|
2015-03-11 17:47:15 -04:00 |
|
Al
|
eb391bf4d5
|
[dictionaries] Making address_components bit set a 16 bit int so we can bit pack trie values
|
2015-03-11 17:36:38 -04:00 |
|
Al
|
a446290829
|
[fix] IDEOGRAM class name
|
2015-03-11 17:33:53 -04:00 |
|