Commit Graph

255 Commits

Author SHA1 Message Date
Al 9360ff2c4b [geodb] geodb_builder using new trie_get/set_data_at_index methods 2015-07-20 16:53:48 -04:00
Al 9374745140 [fix] var name and placement 2015-07-20 16:53:19 -04:00
Al 9f697e0256 [transliteration] transliterate now using the new trie_get_data_at_index API 2015-07-20 16:47:56 -04:00
Al 7f96726e82 [phrases] Adding trie_get_data/trie_set_data + at_index methods 2015-07-20 16:39:58 -04:00
Al b9771921fc [fix] Path joins in geodb_builder use new char_array methods 2015-07-20 16:31:43 -04:00
Al d55d505329 [phrases] trie_get_data and trie_set_data interface for simpler dictionary-style trie get/set 2015-07-20 16:29:48 -04:00
Al 1d7247d7e1 [polygons] Adding Belgium regional languages 2015-07-17 00:53:25 -04:00
Al 5f2be3022b [expansion] dictionary_type_t enum instead of uint64_t 2015-07-16 03:49:37 -04:00
Al f713c53993 [utils] Adding an option to char_array_add_joined to strip separators for path manipulation 2015-07-16 03:49:00 -04:00
Al f181c04e7a [expansion] expansion rule structs and Python script to generate rules from dictionaries tree. Note that a canonical_index of -1 indicates that a given phrase is the canonical (saves space) 2015-07-16 02:49:53 -04:00
Al a8b2fb5b90 [tokenization] Regenerating scanner file 2015-07-14 18:16:24 -04:00
Al 43293d0ae3 [tokenization] Fixing a tokenization where mid-number characters appear in the middle of a word+numeric sequence e.g. Zigor,2 should be 3 separate tokens. Sequences like 35,37,39 are still treated as a single token for the moment. 2015-07-14 18:15:58 -04:00
Al a9967ec9bd [numex] Regenerating numex file 2015-07-13 01:16:39 -04:00
Al 86fe289320 [numex] Re-generated numex data file 2015-07-13 00:56:48 -04:00
Al fbef0a15fe [geodb] Adding sparkey dependency 2015-07-09 15:26:11 -04:00
Al 4f1b4756d0 [geodb] Adding builder program (requires 11GB disk space and ~4GB RAM to build, but only ~300MB RAM to use after building) 2015-07-09 15:25:29 -04:00
Al 8889a5c0c3 [geodb] GeoDB memory allocation and I/O 2015-07-09 15:01:06 -04:00
Al 2d5641892a [config] lower Bloom filter error rate 2015-07-09 14:59:23 -04:00
Al 20c6436e6d [geodisambig] Return success if admin1/admin2 IDs are 0 2015-07-09 04:19:49 -04:00
Al 20303ad94f [geohash] Adding bounds checks from python-geohash 2015-07-09 04:13:53 -04:00
Al 722904ce59 [fix] geoname_clear needs to clear feature code as well 2015-07-09 03:08:52 -04:00
Al 14500f8c7e [config] Adding GeoDB default bloom filter size and error rate 2015-07-08 20:50:52 -04:00
Al 0e2a0aa56d [geodisambig] adding new methods to header 2015-07-08 19:05:08 -04:00
Al ce54a2146b [fix] geo disambiguation features 2015-07-08 19:03:39 -04:00
Al fc32a66d95 [fix] geonames I/O 2015-07-08 19:02:45 -04:00
Al 8c02073b54 [geonames] Adding country_geonames_id to both geoname and postal code structs 2015-07-08 18:44:21 -04:00
Al 9af0b0ab65 [geodisambig] adding a few more features to geonames disambiguation 2015-07-08 18:43:28 -04:00
Al 742079cc6a [geonames] Re-generating postal/geonames fields headers 2015-07-08 17:02:59 -04:00
Al b76f9e47d1 [utils] max string size for int8_t and int16_t 2015-07-08 16:46:12 -04:00
Al c0a5607f5e [fix] Adding NUM_BOUNDARY_TYPES for enumeration purposes 2015-07-08 16:43:57 -04:00
Al 24835fd088 [geonames] namespace specificity 2015-07-07 03:38:48 -04:00
Al af1a5f6213 [trie] trie_set_data_node method 2015-07-07 03:38:17 -04:00
Al 53908ac604 [config] Adding geonames dir as a separate #define 2015-07-06 17:09:02 -04:00
Al c4fd48e7f7 [config] geodb dir 2015-07-06 16:55:11 -04:00
Al e7a3987656 [geodisambig] renaming module 2015-07-06 16:53:53 -04:00
Al d7f73e62f1 [utils] Adding cstring_array_clear method 2015-07-06 12:48:26 -04:00
Al 0df816fd31 [geodisambig] Helper methods to add features for a given geoname/postal_code 2015-07-06 12:41:10 -04:00
Al 6ff91fef6b [normalization] adding a normalize_string_latin method 2015-07-05 23:38:01 -04:00
Al a08d59c277 [fix] NFD normalization should be the default in normalize.c, not NFKD, as NFKD does some unwanted things like converting superscripts and the Latin-ASCII transliterator does a better, more thorough job while staying faithful to the original string 2015-07-05 15:28:07 -04:00
Al 47ed2e58fd [geodisambig] feature functions for GeoNames disambiguation 2015-07-04 10:35:56 -04:00
Al 20a8b9611d [fix] Removing feature length variables from geonames.c 2015-07-04 10:33:08 -04:00
Al 3f07cc6c71 [geohash] Modified geohash implementation (based on python-geohash) with no mallocs 2015-07-04 01:30:30 -04:00
Al 4fd4fa7dca [fix] moving int string size constants to string_utils.h 2015-07-02 17:50:09 -04:00
Al 055e6d8905 [fix] typo in constant 2015-07-02 16:12:24 -04:00
Al e273caac22 [geonames] generated postal code TSV fields 2015-07-02 16:00:06 -04:00
Al fd28ee27bf [geonames] generated geonames TSV fields 2015-07-02 15:59:54 -04:00
Al 6cfbab9969 [normalization] string normalization module for tokens and full strings 2015-07-01 14:52:28 -04:00
Al 46e51ae91e [transliterate] no need to strdup transliterator names if they are lowercased, breaking on NUL byte 2015-07-01 14:51:22 -04:00
Al b58877ec6c [utils] string_is_lower/string_is_upper method 2015-07-01 14:49:22 -04:00
Al d0db015667 [geodisambig] Adding new fields to geonames struct, plus I/O 2015-07-01 13:02:00 -04:00