Commit Graph

881 Commits

Author SHA1 Message Date
Al
5113a1bc32 [utils] tracking keys added in trie construction from hash 2017-03-06 15:28:26 -05:00
Al
dd4f3eb84c [parser] simpler feature names for the state-transition features 2017-03-06 15:25:10 -05:00
Al
39fa8ff1a5 [parser] counting num classes in address parser init for models where it is needed a priori 2017-03-06 15:17:52 -05:00
Al
5f19e63cbe [parser] more logging in init 2017-03-06 15:11:39 -05:00
Al
bb922e4ce4 [parser] adding log message 2017-03-06 12:25:22 -05:00
Al
b97de96ab4 [parser] fixing chunked shuffle, making awk splitting work on Mac 2017-03-05 15:06:02 -05:00
Al
0e49fc580a [parser] uint64_t chunk size, no warning if gshuf is available 2017-03-05 14:50:47 -05:00
Al
b76b7b8527 [parser] adding chunked shuffle as a C function (writes each line to one of n random files, runs shuf on each file and concatenates the result). Adding a version which allows specifying a specific chunk size, and using a 2GB limit for address parser training. Allowing gshuf again for Mac as it seems the only problem there was not having enough memory when testing on a Mac laptop. The new limited-memory version should be fast enough. 2017-03-05 02:15:11 -05:00
Al
e39d4d2f00 [parser] check for non-null prev/prev2 before creating tag-based features 2017-02-24 02:57:16 -05:00
Al
182d60b623 [fix] removing include 2017-02-23 22:45:03 -05:00
Al
6a079e86b3 [fix] using size_t instead of int in address_parser/address_parser_train 2017-02-20 19:22:13 -08:00
Al
8ea5405c20 [parser] using separate arrays for features requiring tag history and making the tagger responsible for those features so the feature function does not require passing in prev and prev2 explicitly (i.e. don't need to run the feature function multiple times if using global best-sequence prediction) 2017-02-19 14:21:58 -08:00
Al
715520f681 [parser] using new zeros API in averaged_perceptron.c 2017-02-19 14:02:54 -08:00
Al
b88487f633 [utils] string_replace_char does single byte/character replacement, new string_replace to do full string replacement, again using char_array for safety, string_replace_with_array function for memory reuse 2017-02-17 13:58:51 -05:00
Al
da856ea5c3 [parser] adding phrase features for category, unit, level, entrance, staircase, and po_box phrases from the libpostal dictionaries, excluding phrases which match the toponyms dictionary (e.g. US states that can also be found in street/venue names, useful for expansion but not here), if the current token is part of both an address dictionary phrase and a component phrase derived from the training data, use the longer of the two, or both if they are the same length 2017-02-17 03:00:48 -05:00
Al
c380b3e91b [parser] phrase search with address dictionaries should not use the language given at training time since it's not currently available at runtime (without pulling in the language classifier, which may be warranted at some point, especially if the model can be made smaller/sparser) 2017-02-15 22:32:30 -05:00
Al
a3e51db32d [api] include some of the new components in default address_components for the libpostal expansion API 2017-02-15 22:29:22 -05:00
Al
32fb483e96 [gazetteers] adding ADDRESS_PO_BOX component 2017-02-15 22:23:28 -05:00
Al
ba0ccc82a3 [fix] var name in address_parser_train 2017-02-15 22:22:33 -05:00
Al
0196fe8736 [utils] fixing key_type in hash_get, adding int64_double map 2017-02-15 22:20:36 -05:00
Al
8abfa766fd [fix] paren 2017-02-15 02:26:18 -05:00
Al
8eafc5730b [parser] adding long-context features which help classify the first token in the string by finding the relative positions of a) the first numeric token and b) the first street-level phrase like "Ave" or "Calle" 2017-02-14 18:42:51 -05:00
Al
56f68e4399 [phrases] fixing trie suffix search 2017-02-14 03:36:29 -05:00
Al
2f4bcaeec2 [parser] address_parser_test memory cleanup, add print-errors option to print individual parser errors on held-out data 2017-02-12 16:05:11 -05:00
Al
b1e178b7b2 [fix] is_numeric_token includes IDEOGRAPHIC_NUMBER 2017-02-12 15:11:56 -05:00
Al
b570855b78 [parser] adding postcode context features and associated data structures to the parser. Masking digits, which should hopefully help with generalization. Creating positive/negative features for postcode with and without context support. Note: even with known postcodes in known contexts, only use the masked digits to avoid creating too many features that are redundant with the index. 2017-02-10 03:41:14 -05:00
Al
9a93e95938 [api] removing geodb from setup functions 2017-02-10 01:02:52 -05:00
Al
ff245d74f8 [parser] building an index of postal codes and their valid admin contexts (city, state, country, etc.) during training e.g. "11216" => ["brooklyn", "ny"]. Postal code phrases like CP in Spanish are removed when constructing the index. 2017-02-10 00:50:48 -05:00
Al
1aacb5bccc Merge branch 'master' into parser-data 2017-02-09 15:09:28 -05:00
Al
ea168279bd [fix] free json-encoded string in parser client output 2017-02-09 14:34:15 -05:00
Al
38c6c26146 [fix] freeing normalized string in address_parser_parse 2017-02-09 14:33:13 -05:00
Al
8aa3749cfb [utils] some convenience functions for generic hashtables (incr, get, etc) 2017-02-08 19:01:13 -05:00
Al
a6844c8ec1 [parser] structural changes for postal codes index 2017-02-08 18:52:45 -05:00
Al
6e4f641743 [phrases] adding token_phrase_memberships to trie_search for reuse 2017-02-08 01:59:39 -05:00
Al
ae35da8d17 [fix] uninitialized var 2017-02-08 01:58:53 -05:00
Al
0380f565d2 [parser] shorter first word feature 2017-01-29 22:10:28 -05:00
Al
ec3a563591 Merge branch 'master' into parser-data 2017-01-14 13:06:25 -05:00
Rinigus
67624f89d0 cstring_array_from_char_array: return empty initialized cstring_array from empty string 2017-01-14 10:43:47 +02:00
Al
b320aed9ac [merge] merging master 2017-01-13 19:58:49 -05:00
Al
df89387b5c [fix] calloc instead of malloc when performing initialization on structs that may fail halfway and need to clean up while partially initialized (calloc will set all the bytes to zero so the member pointers are NULL instead of garbage memory) 2017-01-13 18:30:04 -05:00
Al
1398df1260 [fix] accept 0 for array_new_size 2017-01-13 17:49:31 -05:00
Al
e1f258171f [fix] handle cstring_array_from_char_array where char_array is NULL or 0-length 2017-01-13 16:52:41 -05:00
Al
a3506131fe [build] adding libpostal_setup_datadir, libpostal_setup_parser_datadir, libpostal_setup_language_classifier_datadir functions for configuring the datadir at runtime 2017-01-09 16:11:26 -05:00
Al
953a26e54e [utils] char_array_add_vjoined to stay consistent (add_* methods NUL terminate) 2017-01-09 16:10:07 -05:00
Al
7a8f94330b [parser] only adding ngrams in a hyphenated word if the subword is not rare 2017-01-09 02:53:33 -05:00
Al
7a31802a04 [fix] also fix german-ascii transliteration on uppercase U with umlaut 2017-01-05 04:07:29 -05:00
Rinigus
26aeb0ebec drop AC_FUNC_MALLOC and _REALLOC and check for them as regular functions; add extra cflags for scanner 2017-01-05 07:34:24 +02:00
Al
ccd555d020 [transliteration] regenerated transliteration_scripts_data.c 2017-01-02 13:52:48 -05:00
Al
77035fbdbd [strings] adding utf8_is_whitespace to the header so it can be referenced from multiple files 2017-01-02 02:23:21 -05:00
Al
182976214c [logging] converting most of the steps in building the transliteration table to use debug logging 2017-01-02 00:41:11 -05:00
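
The chunked shuffle described in commit b76b7b8527 (write each line to one of n random files, shuffle each file, concatenate the results) can be sketched roughly as follows. This is an illustrative Python sketch, not the C function from that commit: the name `chunked_shuffle` and its parameters are hypothetical, and an in-process `random.shuffle` stands in for the external `shuf` call mentioned in the message.

```python
import os
import random
import tempfile


def chunked_shuffle(input_path, output_path, num_chunks, seed=None):
    """Shuffle the lines of input_path into output_path via num_chunks temp files.

    Hypothetical helper: only one chunk is held in memory at a time, so peak
    memory is roughly (file size / num_chunks) rather than the whole file.
    """
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmp_dir:
        chunk_paths = [
            os.path.join(tmp_dir, f"chunk_{i}.txt") for i in range(num_chunks)
        ]
        chunk_files = [open(p, "w", encoding="utf-8") for p in chunk_paths]
        try:
            # Pass 1: scatter each input line into a randomly chosen chunk file.
            with open(input_path, encoding="utf-8") as src:
                for line in src:
                    if not line.endswith("\n"):
                        line += "\n"  # keep a final unterminated line separate
                    rng.choice(chunk_files).write(line)
        finally:
            for f in chunk_files:
                f.close()

        # Pass 2: shuffle each chunk in memory and append it to the output.
        # (The commit message describes running shuf on each file instead.)
        with open(output_path, "w", encoding="utf-8") as dst:
            for path in chunk_paths:
                with open(path, encoding="utf-8") as chunk:
                    lines = chunk.readlines()
                rng.shuffle(lines)
                dst.writelines(lines)
```

A chunk-size variant like the one the commit mentions could, under the same assumptions, derive `num_chunks` from the input file size divided by the desired limit before calling a function like this one.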