Commit Graph

544 Commits

Author SHA1 Message Date
Al
243f327928 [fix] NULL check 2015-07-27 16:32:01 -04:00
Al
7aee159c0c [utils] string_tree_num_tokens 2015-07-27 12:36:34 -04:00
Al
b812d90c59 [fix] specifying numex dir with cross-platform PATH_SEPARATOR 2015-07-27 12:36:06 -04:00
Al
7ff9a6054d [geodb] trim strings in geodb builder 2015-07-27 02:37:20 -04:00
Al
053b987d58 [normalize] adding an option for string trimming in normalize 2015-07-27 01:59:14 -04:00
Al
b94526a27b [utils] Making string_trim handle all kinds of UTF-8 whitespace/separators 2015-07-27 01:55:46 -04:00
Al
eab4c554d6 [numex] Regenerating numex data file 2015-07-27 01:53:13 -04:00
Al
0ab1434f20 [numex] Making all languages except the ideographic writing systems (CJK) whole_tokens_only for numex. Otherwise non-number prefixes may accidentally get converted into numbers. May add some more options around this in the future. 2015-07-27 01:52:44 -04:00
Al
d2539f5b57 [numex] Fixing case of hyphen/space-initial phrases in numex, as well as whole token only languages with ordinals 2015-07-27 01:44:33 -04:00
Al
8ff4ace63b [phrases] Allowing trie_search to process tokenized input with or without whitespace, and to handle ideographic characters correctly 2015-07-26 23:41:57 -04:00
Al
38b10b9dd0 [fix] Clearing paths before reuse in geodb_builder 2015-07-26 23:36:34 -04:00
Al
93042761ac [fix] warnings in string_utils.c 2015-07-26 23:36:03 -04:00
Al
50ee95ff7d [geodb] Adding a msgpack'd list of ids for naked string keys in geodb builder 2015-07-25 18:42:13 -04:00
Al
a67ec44a08 [utils] cstring_array_terminate, moving msgpack_utils to separate file 2015-07-25 18:41:02 -04:00
Al
42f6be7434 [fix] county road 2015-07-25 14:19:38 -04:00
Al
2ff8c0fd1e [transliteration] fixing length-based transliteration 2015-07-25 13:53:28 -04:00
Al
71ffdf9cbc [expansion] tokenized version of search_address_dictionaries 2015-07-25 13:50:53 -04:00
Al
ee96dab93c [fix] unnecessary headers 2015-07-25 13:49:42 -04:00
Al
e549e76806 [utils] string_tree_iterator_foreach_token 2015-07-25 13:49:02 -04:00
Al
2adaf475c2 [utils] cstring_array (contiguous) to array of malloc'd strings 2015-07-25 12:14:01 -04:00
Al
e9277d7339 [utils] vector extend method 2015-07-25 01:33:45 -04:00
Al
cdb9afddd3 [fix] address training data carriage returns 2015-07-25 00:35:27 -04:00
Al
9fb1eae877 [expansion] Regenerating address data file 2015-07-24 16:09:22 -04:00
Al
cff72a0cb3 [dictionaries] Adding a few versions of the phrase "centro commerical" in French, Spanish and Italian after a review of addresses in those languages 2015-07-24 16:07:34 -04:00
Al
351c7c8c2e [expansion] Add concatenated suffixes to the suffix keyspace of the address dictionary trie and concatenated prefixes and elisions to the prefix keyspace 2015-07-24 16:02:47 -04:00
Al
90a91cadd0 [search] Modifying trie_search_prefixes to use the new key schema 2015-07-24 15:59:49 -04:00
Al
bb7688d8d1 [phrases] trie_add_prefix method and a schema for prefix keys, e.g. elisions in French and Italian, separable prefixes like Hinter in German, etc. 2015-07-24 15:56:19 -04:00
Al
359cd62e20 [numex] Adding a replace_numeric_expressions method (returns NULL if no replacements were made), fixing lengths in situations where two unrelated numbers are joined by a stopword, e.g. in the phrase "one and one" the "and" acts as a delimiter, vs. a phrase where the stopword acts as a joiner, like "one hundred and twenty" 2015-07-24 15:31:05 -04:00
Al
12959aa483 [numex] Re-generating numex data 2015-07-24 15:24:03 -04:00
Al
5239c365d0 [docs] Adding some documentation for normalize.h options 2015-07-24 15:23:25 -04:00
Al
caf714f06f [fix] typo and frivolous key 2015-07-24 15:22:57 -04:00
Al
87566bb6a5 [numex] Adding validation checks for numex JSON 2015-07-24 15:22:07 -04:00
Al
96538469dd [utils] Adding a cstring_array_foreach macro 2015-07-23 15:57:12 -04:00
Al
27af28eacf [expansion] Changes to address_expansion struct to allow for multiple dictionaries per record. Only adding unique canonical strings to the string array 2015-07-22 20:35:29 -04:00
Al
454be89121 [expansion] generated header and data files 2015-07-22 20:31:54 -04:00
Al
b27af13f8a [expansion] Adding an array of dictionaries to each (phrase, canonical) pair 2015-07-22 20:24:14 -04:00
Al
0a9e92f11f [expansion] Adding both key (for membership tests) and language-prefixed key to address dictionary 2015-07-22 17:21:09 -04:00
Al
09004aa5f1 [expansion] Constant for the "all" dictionary 2015-07-22 17:18:19 -04:00
Al
f61d993157 [expansion] removing the self param from address_dictionary methods, adding search_address_dictionaries method which searches a string for phrases in a particular language 2015-07-22 03:51:28 -04:00
Al
3da4b5d8c2 [numex] New numex generated data file 2015-07-22 02:24:16 -04:00
Al
ba8ff2b0c6 [expansion] Language prefixed keys 2015-07-22 02:16:22 -04:00
Al
157727d249 [fix] method name, strlen and fclose 2015-07-22 02:15:45 -04:00
Al
64a63fdf51 [mv] Moving all repo data files to a resources dir, data is only for runtime files 2015-07-21 18:11:36 -04:00
Al
a38b924c5d [fix] add_token_alternatives 2015-07-21 17:26:59 -04:00
Al
71be52275d [tokenization] Adding a version of tokenize which keeps whitespace tokens 2015-07-21 17:26:20 -04:00
Al
5d21cb1604 [expansion] Address dictionary builder 2015-07-21 16:46:57 -04:00
Al
6eccde0df8 [fix] trie_set_data_at_index 2015-07-21 16:46:38 -04:00
Al
c798876b3d [expansion] Address dictionary allocation, I/O, get/set 2015-07-21 16:46:15 -04:00
Al
2114b21399 [fix] A few anomalies in the Wikipedia/Wiktionary-generated given names 2015-07-21 16:07:28 -04:00
Al
3509b203f8 [gazetteers] Moving data out of the header file 2015-07-21 16:06:49 -04:00