| Al | 1373843b86 | [fix] setting last_node in tokenized trie search in the case where a prefix phrase matches but the longer string doesn't. | 2015-04-27 01:49:08 -04:00 |
| Al | b2ba629f95 | [fix] trie_get methods just return node index rather than data value | 2015-04-27 01:28:05 -04:00 |
| Al | 8fb9bacfa6 | [phrases] New trie_add_nodes_only method for concatenating strings to the trie, plus boolean return values on trie_add_* APIs | 2015-04-27 01:01:43 -04:00 |
| Al | 8bc77372ef | [phrases] exposing trie_add_at_index and trie_get_from_index for more control in the transliteration tries | 2015-04-26 22:24:02 -04:00 |
| Al | 6ebea11640 | [transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters | 2015-04-26 19:47:54 -04:00 |
| Al | ff9b6735f8 | [transliteration] Adding header + generated C data file for simplified transliteration rules | 2015-04-25 15:44:36 -04:00 |
| Al | 1b33744956 | [tokenization] Numeric tokens must end in number or letter | 2015-04-22 14:55:18 -04:00 |
| Al | 9c0126a01c | [utils] two set types in collections.h | 2015-04-19 09:32:53 -04:00 |
| Al | 908e3dc03c | [phrases] trie_search now only takes the original string and the token array. Fixed a bug where certain phrases were being found in string search but not in tokenized search | 2015-04-19 09:32:20 -04:00 |
| Al | 606a669c01 | [tokenization] breaking dashes or double hyphens break a word while other dashes don't | 2015-04-17 19:14:42 -04:00 |
| Al | 6718182443 | [tokenization] non-breaking dashes can be mid-word, em-dashes, etc. break words | 2015-04-17 15:21:22 -04:00 |
| Al | e21873635c | [utils] Using token offsets to calculate lengths for contiguous string arrays, inlining a few functions | 2015-04-15 20:17:03 -04:00 |
| Al | e241c1dfc8 | [rm] Removing dependency on sds, char_array and cstring_array have similar benefits/functionality with fewer drawbacks | 2015-04-12 18:07:33 -04:00 |
| Al | 83813bb980 | [geodisambig] Models for geonames with msgpack serialization/deserialization | 2015-04-12 16:47:01 -04:00 |
| Al | 1f9da05dd5 | [geodisambig] C msgpack serialization dependency | 2015-04-12 15:14:01 -04:00 |
| Al | 0234754c20 | [fix] warnings in string_utils | 2015-04-12 12:16:32 -04:00 |
| Al | 3a7f18581e | [utils] Adding min, max, argmin, argmax and log_sum_exp to generic vector math header | 2015-04-12 12:11:04 -04:00 |
| Al | 4729dfe178 | [utils] string_[rl]strip => string_[rl]trim, removing warning about allocation | 2015-04-06 02:19:19 -04:00 |
| Al | 53844067b1 | [fix] better allocation sizes for tokenized strings | 2015-04-05 22:02:31 -04:00 |
| Al | 198e51b8a3 | [utils] more/better char_array methods | 2015-04-05 22:01:46 -04:00 |
| Al | 79fd7a8ded | [tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string | 2015-04-05 16:33:14 -04:00 |
| Al | 5f3d74de18 | [fix] contiguous string array | 2015-04-03 11:22:50 -04:00 |
| Al | c81aa72254 | [utils] a few changes to contiguous string arrays | 2015-04-01 19:02:11 -04:00 |
| Al | fa59b63ab2 | [fix] type name/import | 2015-04-01 02:54:14 -04:00 |
| Al | 310acbed2c | [phrases] Adding prefix-only trie searches, primarily with Germanic languages in mind (spelled out numbers, concatenated prefixes). Making the prefix/suffix APIs for single tokens more consistent with trie searches over longer strings/token arrays | 2015-04-01 02:52:57 -04:00 |
| Al | 1ac4438e39 | [utils] More consistent naming in string_utils | 2015-03-27 21:12:08 -04:00 |
| Al | 127a61d492 | [utils] adding pop method on the improved vectors | 2015-03-27 21:00:03 -04:00 |
| Al | 3678d4a3ca | [gazetteers] string name doesn't need to be part of the gazetteer itself, adding two new dictionary types to the config for named people (e.g. JFK) and named organizations (e.g. UN) | 2015-03-27 20:59:21 -04:00 |
| Al | 4ccd1b1fe2 | [fix] update feature arrays to use the new APIs | 2015-03-27 20:57:42 -04:00 |
| Al | 6768936953 | [utils] Adding vector_math.h with some inline methods for vector operations (sum, dot product, arithmetic, etc.). Works with kvec dynamic vectors. | 2015-03-27 20:57:03 -04:00 |
| Al | 70195fffd5 | [utils] new methods on string_utils for better dynamic strings which retains the benefits of sds without having to worry about the pointer changing, renaming contiguous string array methods to something more succinct | 2015-03-27 20:55:36 -04:00 |
| Al | 2d1c24a6e9 | [tokenization] Adding url, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types | 2015-03-24 16:43:53 -04:00 |
| Al | 7ffe788913 | [unicode] header | 2015-03-18 17:25:53 -04:00 |
| Al | d5a9041cd3 | [unicode] Adding generated unicode script data | 2015-03-18 17:01:03 -04:00 |
| Al | d2ceb5f418 | [fix] removing struct definition from scanner.re for future generation of scanner.c | 2015-03-17 19:46:40 -04:00 |
| Al | f794ef7222 | [tokenization] Exposing some of the scanner's methods in header for use in the Python scanner so it can avoid the additional allocation | 2015-03-17 18:38:30 -04:00 |
| Al | daf3f8706b | [utils] adding tab and comma constants to file_utils for parsing CSV/TSV files | 2015-03-17 18:35:45 -04:00 |
| Al | f787851754 | [unicode] Upgrading to JuliaLang's utf8proc (Unicode 7, maintained) | 2015-03-17 12:20:08 -04:00 |
| Al | 0df849b440 | [features] Feature array, a special case of contiguous string array for adding namespaced features in CRF-like sequence models | 2015-03-14 18:37:41 -04:00 |
| Al | 53aa9bccb1 | [geodisambig] adding MurmurHash3, used by the Bloom filter | 2015-03-11 17:47:57 -04:00 |
| Al | cf613ee475 | [geodisambig] Bloom filter implementation for quick probabilistic set membership tests before hitting disk. 100% recall and bounded precision, saves disk seeks for keys that definitely do not exist (useful for Geonames disambiguation-related lookups and in-process deduping). | 2015-03-11 17:47:15 -04:00 |
| Al | eb391bf4d5 | [dictionaries] Making address_components bit set a 16 bit int so we can bit pack trie values | 2015-03-11 17:36:38 -04:00 |
| Al | a446290829 | [fix] IDEOGRAM class name | 2015-03-11 17:33:53 -04:00 |
| Al | a5f7c73374 | [utils] is_relative_path | 2015-03-11 17:31:08 -04:00 |
| Al | 5157a0fd8b | [utils] float and double arrays in collections.h | 2015-03-11 17:30:26 -04:00 |
| Al | 94805fb1a7 | [tokenization] Better scanner support for ideographic languages (Chinese, Japanese, Korean, etc.) with an IDEOGRAM token class in the scanner so we know when we're dealing with those languages vs. other random characters | 2015-03-11 17:29:37 -04:00 |
| Al | 642d3697d4 | [dictionaries] additions to German dictionaries, including a separable prefix dictionary | 2015-03-08 17:55:57 -04:00 |
| Al | 38ec03bf2b | [phrases] default constructor for a trie uses a default alphabet derived from Wikipedia character frequencies for convenience. In practice the alphabet size/ordering matters only for very small tries or specialized alphabets. Mostly just use trie_new() | 2015-03-05 13:40:52 -05:00 |
| Al | 939c3af293 | [dictionaries] gazetteers.h has the config for in-memory dictionaries' directory structure | 2015-03-04 16:01:16 -05:00 |
| Al | 6d9c6a6fe7 | [utils] geohash | 2015-03-03 18:51:49 -05:00 |
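
The Bloom filter commit (cf613ee475) describes a probabilistic membership test with 100% recall and bounded precision: a negative answer is definitive, so the disk lookup can be skipped, while a positive answer only means the key may exist. The following is a minimal sketch of that idea in C, not the repository's implementation. All names here (`bloom_t`, `bloom_new`, `bloom_add`, `bloom_might_contain`, `bloom_free`) are hypothetical, and a seeded FNV-1a hash stands in for the MurmurHash3 mentioned in the log so that the example stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical Bloom filter type; not the struct used in the repository. */
typedef struct {
    uint8_t *bits;       /* bit array, num_bits bits long */
    size_t num_bits;     /* m: number of bits in the filter */
    uint32_t num_hashes; /* k: number of bit positions set per key */
} bloom_t;

/* Seeded 64-bit FNV-1a. Assumption: used here only as a simple stand-in
   for MurmurHash3, which the commit log says the real filter uses. */
static uint64_t fnv1a_64(const char *key, size_t len, uint64_t seed) {
    uint64_t h = 14695981039346656037ULL ^ seed;
    for (size_t i = 0; i < len; i++) {
        h ^= (uint8_t)key[i];
        h *= 1099511628211ULL;
    }
    return h;
}

bloom_t *bloom_new(size_t num_bits, uint32_t num_hashes) {
    assert(num_bits > 0 && num_hashes > 0);
    bloom_t *b = malloc(sizeof(bloom_t));
    if (b == NULL) return NULL;
    b->bits = calloc((num_bits + 7) / 8, 1); /* all bits start at zero */
    if (b->bits == NULL) {
        free(b);
        return NULL;
    }
    b->num_bits = num_bits;
    b->num_hashes = num_hashes;
    return b;
}

void bloom_free(bloom_t *b) {
    if (b != NULL) {
        free(b->bits);
        free(b);
    }
}

/* Double hashing: the i-th bit position is (h1 + i * h2) mod m. */
static size_t bloom_index(const bloom_t *b, uint64_t h1, uint64_t h2, uint32_t i) {
    return (size_t)((h1 + (uint64_t)i * h2) % b->num_bits);
}

void bloom_add(bloom_t *b, const char *key) {
    size_t len = strlen(key);
    uint64_t h1 = fnv1a_64(key, len, 0);
    uint64_t h2 = fnv1a_64(key, len, 0x9e3779b97f4a7c15ULL) | 1; /* never zero */
    for (uint32_t i = 0; i < b->num_hashes; i++) {
        size_t idx = bloom_index(b, h1, h2, i);
        b->bits[idx / 8] |= (uint8_t)(1u << (idx % 8));
    }
}

/* false: the key was definitely never added (safe to skip the disk seek).
   true: the key may have been added (false positives are possible). */
bool bloom_might_contain(const bloom_t *b, const char *key) {
    size_t len = strlen(key);
    uint64_t h1 = fnv1a_64(key, len, 0);
    uint64_t h2 = fnv1a_64(key, len, 0x9e3779b97f4a7c15ULL) | 1;
    for (uint32_t i = 0; i < b->num_hashes; i++) {
        size_t idx = bloom_index(b, h1, h2, i);
        if ((b->bits[idx / 8] & (1u << (idx % 8))) == 0) return false;
    }
    return true;
}
```

In a lookup path, a caller would check `bloom_might_contain` first and go to disk only on `true`. Every key that was added always returns `true`, which is the "100% recall" property; the false positive rate depends on the bit count, the number of hashes, and how many keys have been added.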