Al
|
da856ea5c3
|
[parser] adding phrase features for category, unit, level, entrance, staircase, and po_box phrases from the libpostal dictionaries, excluding phrases which match the toponyms dictionary (e.g. US states that can also be found in street/venue names, useful for expansion but not here). If the current token is part of both an address dictionary phrase and a component phrase derived from the training data, use the longer of the two, or both if they are the same length
|
2017-02-17 03:00:48 -05:00 |
|
Al
|
c380b3e91b
|
[parser] phrase search with address dictionaries should not use the language given at training time since it's not currently available at runtime (without pulling in the language classifier, which may be warranted at some point, especially if the model can be made smaller/sparser)
|
2017-02-15 22:32:30 -05:00 |
|
Al
|
a3e51db32d
|
[api] include some of the new components in default address_components for the libpostal expansion API
|
2017-02-15 22:29:22 -05:00 |
|
Al
|
32fb483e96
|
[gazetteers] adding ADDRESS_PO_BOX component
|
2017-02-15 22:23:28 -05:00 |
|
Al
|
ba0ccc82a3
|
[fix] var name in address_parser_train
|
2017-02-15 22:22:33 -05:00 |
|
Al
|
0196fe8736
|
[utils] fixing key_type in hash_get, adding int64_double map
|
2017-02-15 22:20:36 -05:00 |
|
Al
|
8abfa766fd
|
[fix] paren
|
2017-02-15 02:26:18 -05:00 |
|
Al
|
8eafc5730b
|
[parser] adding long-context features which help classify the first token in the string by finding the relative positions of a) the first numeric token and b) the first street-level phrase like "Ave" or "Calle"
|
2017-02-14 18:42:51 -05:00 |
|
Al
|
56f68e4399
|
[phrases] fixing trie suffix search
|
2017-02-14 03:36:29 -05:00 |
|
Al
|
2f4bcaeec2
|
[parser] address_parser_test memory cleanup, add print-errors option to print individual parser errors on held-out data
|
2017-02-12 16:05:11 -05:00 |
|
Al
|
b1e178b7b2
|
[fix] is_numeric_token includes IDEOGRAPHIC_NUMBER
|
2017-02-12 15:11:56 -05:00 |
|
Al
|
b570855b78
|
[parser] adding postcode context features and associated data structures to the parser. Masking digits, which should hopefully help with generalization. Creating positive/negative features for postcode with and without context support. Note: even with known postcodes in known contexts, only use the masked digits to avoid creating too many features that are redundant with the index.
|
2017-02-10 03:41:14 -05:00 |
|
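The digit masking mentioned in b570855b78 can be illustrated with a minimal sketch. The helper name `mask_digits` and the mask character `'D'` are assumptions made for illustration; the commit message does not say which placeholder or function libpostal actually uses.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not libpostal's actual function): copy `src` into
 * `dst`, replacing every ASCII digit with 'D'. Writes at most
 * dst_size - 1 bytes plus a NUL terminator and returns the number of
 * bytes written, excluding the terminator. */
size_t mask_digits(const char *src, char *dst, size_t dst_size) {
    size_t i = 0;
    if (dst_size == 0) return 0;
    for (; src[i] != '\0' && i + 1 < dst_size; i++) {
        /* Cast to unsigned char: isdigit is undefined for negative values. */
        dst[i] = isdigit((unsigned char)src[i]) ? 'D' : src[i];
    }
    dst[i] = '\0';
    return i;
}
```

Under this assumption every five-digit postcode maps to the same masked string, so a feature built from the masked form can generalize to postcodes never seen in training.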
Al
|
9a93e95938
|
[api] removing geodb from setup functions
|
2017-02-10 01:02:52 -05:00 |
|
Al
|
ff245d74f8
|
[parser] building an index of postal codes and their valid admin contexts (city, state, country, etc.) during training, e.g. "11216" => ["brooklyn", "ny"]. Postal code phrases like CP in Spanish are removed when constructing the index.
|
2017-02-10 00:50:48 -05:00 |
|
Al
|
1aacb5bccc
|
Merge branch 'master' into parser-data
|
2017-02-09 15:09:28 -05:00 |
|
Al
|
ea168279bd
|
[fix] free json-encoded string in parser client output
|
2017-02-09 14:34:15 -05:00 |
|
Al
|
38c6c26146
|
[fix] freeing normalized string in address_parser_parse
|
2017-02-09 14:33:13 -05:00 |
|
Al
|
8aa3749cfb
|
[utils] some convenience functions for generic hashtables (incr, get, etc)
|
2017-02-08 19:01:13 -05:00 |
|
Al
|
a6844c8ec1
|
[parser] structural changes for postal codes index
|
2017-02-08 18:52:45 -05:00 |
|
Al
|
6e4f641743
|
[phrases] adding token_phrase_memberships to trie_search for reuse
|
2017-02-08 01:59:39 -05:00 |
|
Al
|
ae35da8d17
|
[fix] uninitialized var
|
2017-02-08 01:58:53 -05:00 |
|
Al
|
0380f565d2
|
[parser] shorter first word feature
|
2017-01-29 22:10:28 -05:00 |
|
Al
|
ec3a563591
|
Merge branch 'master' into parser-data
|
2017-01-14 13:06:25 -05:00 |
|
Rinigus
|
67624f89d0
|
cstring_array_from_char_array: return an empty, initialized cstring_array when given an empty string
|
2017-01-14 10:43:47 +02:00 |
|
Al
|
b320aed9ac
|
[merge] merging master
|
2017-01-13 19:58:49 -05:00 |
|
Al
|
df89387b5c
|
[fix] calloc instead of malloc when performing initialization on structs that may fail halfway and need to clean up while partially initialized (calloc will set all the bytes to zero so the member pointers are NULL instead of garbage memory)
|
2017-01-13 18:30:04 -05:00 |
|
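The calloc pattern described in df89387b5c might look roughly like the sketch below. The struct `model_t` and the functions `model_new` and `model_destroy` are hypothetical names invented for this example, not libpostal types.

```c
#include <stdlib.h>

/* Hypothetical struct for illustration; not a libpostal type. */
typedef struct {
    double *weights;
    char *name;
} model_t;

void model_destroy(model_t *m) {
    if (m == NULL) return;
    /* free(NULL) is a no-op, so members that were never allocated are
     * safe to free as long as the struct started out zeroed. */
    free(m->weights);
    free(m->name);
    free(m);
}

/* Assumes n_weights > 0; calloc(0, ...) may legitimately return NULL. */
model_t *model_new(size_t n_weights, size_t name_len) {
    /* calloc zeroes the struct, so both member pointers start as NULL
     * (on mainstream platforms, where NULL is all-zero bytes). */
    model_t *m = calloc(1, sizeof(model_t));
    if (m == NULL) return NULL;

    m->weights = calloc(n_weights, sizeof(double));
    if (m->weights == NULL) goto fail;

    m->name = calloc(name_len + 1, 1);
    if (m->name == NULL) goto fail;

    return m;

fail:
    /* Cleanup works on a partially initialized struct. */
    model_destroy(m);
    return NULL;
}
```

Had the struct come from malloc, a failure on the first member allocation would leave `name` holding garbage, and the cleanup path would pass that garbage to free.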
Al
|
1398df1260
|
[fix] accept 0 for array_new_size
|
2017-01-13 17:49:31 -05:00 |
|
Al
|
e1f258171f
|
[fix] handle cstring_array_from_char_array where char_array is NULL or 0-length
|
2017-01-13 16:52:41 -05:00 |
|
Al
|
a3506131fe
|
[build] adding libpostal_setup_datadir, libpostal_setup_parser_datadir, libpostal_setup_language_classifier_datadir functions for configuring the datadir at runtime
|
2017-01-09 16:11:26 -05:00 |
|
Al
|
953a26e54e
|
[utils] char_array_add_vjoined to stay consistent (add_* methods NUL-terminate)
|
2017-01-09 16:10:07 -05:00 |
|
Al
|
7a8f94330b
|
[parser] only adding ngrams in a hyphenated word if the subword is not rare
|
2017-01-09 02:53:33 -05:00 |
|
Al
|
7a31802a04
|
[fix] also fix german-ascii transliteration on uppercase U with umlaut
|
2017-01-05 04:07:29 -05:00 |
|
Rinigus
|
26aeb0ebec
|
drop AC_FUNC_MALLOC and _REALLOC and check for them as regular functions; add extra cflags for scanner
|
2017-01-05 07:34:24 +02:00 |
|
Al
|
ccd555d020
|
[transliteration] regenerated transliteration_scripts_data.c
|
2017-01-02 13:52:48 -05:00 |
|
Al
|
77035fbdbd
|
[strings] adding utf8_is_whitespace to the header so it can be referenced from multiple files
|
2017-01-02 02:23:21 -05:00 |
|
Al
|
182976214c
|
[logging] converting most of the steps in building the transliteration table to use debug logging
|
2017-01-02 00:41:11 -05:00 |
|
Al
|
d8d3840700
|
[transliteration] constant for the html-escape transliterator
|
2017-01-02 00:40:12 -05:00 |
|
Al
|
4ad3a52fe1
|
[strings] fix lowercasing in string_utils.c
|
2017-01-01 20:08:34 -05:00 |
|
Al
|
a78937f265
|
[normalize] use the new utf8proc lowercasing (as opposed to case folding), free copies since none of the string functions operate in-place any more, add minimal HTML escaping transliterator even to ASCII text
|
2017-01-01 20:06:32 -05:00 |
|
Al
|
5c56a44faa
|
[strings] reverting to utf8proc v1.3.1, as 2.0 and above can chop off certain sequences
|
2017-01-01 20:03:23 -05:00 |
|
Al
|
fe88630f78
|
[dictionaries] regenerating address_expansion_data.c from upstream changes
|
2017-01-01 14:26:54 -05:00 |
|
Al
|
101bbcc02d
|
Merge remote-tracking branch 'origin/master' into parser-data
|
2017-01-01 14:25:37 -05:00 |
|
Travis
|
d61e90a33d
|
[auto][ci skip] Adding data files from Travis build #188
|
2017-01-01 19:20:54 +00:00 |
|
Al
|
0b5cc96654
|
[transliteration] add decompose option when stripping accents
|
2017-01-01 13:54:20 -05:00 |
|
Al
|
7d6c85aeec
|
[fix] new string tree iterator, don't decrement permutations on rollovers
|
2017-01-01 13:34:08 -05:00 |
|
Al
|
1780c5e053
|
[fix] moving enum
|
2016-12-31 13:01:57 -05:00 |
|
Al
|
475aa3dbfa
|
[strings] fixing and simplifying string tree iterator. This version is inspired by Python's itertools.product (itertoolsmodule.c has so many goodies)
|
2016-12-31 03:22:27 -05:00 |
|
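An itertools.product-style traversal, as referenced in 475aa3dbfa, can be sketched as an odometer over per-position indices. The type `product_iter_t`, its fields, and the `MAX_POSITIONS` limit are assumptions for illustration and do not reflect libpostal's actual string tree iterator.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_POSITIONS 8

/* Hypothetical iterator; not libpostal's string_tree_iterator_t. */
typedef struct {
    size_t num_positions;
    size_t sizes[MAX_POSITIONS];   /* number of alternatives per position */
    size_t indices[MAX_POSITIONS]; /* currently selected alternative */
    bool done;                     /* true once no paths are left */
} product_iter_t;

bool product_iter_init(product_iter_t *it, const size_t *sizes, size_t n) {
    if (n == 0 || n > MAX_POSITIONS) return false;
    it->num_positions = n;
    it->done = false;
    for (size_t i = 0; i < n; i++) {
        it->sizes[i] = sizes[i];
        it->indices[i] = 0;
        if (sizes[i] == 0) it->done = true; /* empty position: no paths */
    }
    return true;
}

/* Advance like an odometer: bump the rightmost index; on rollover,
 * reset it to zero and carry into the position to its left. */
void product_iter_next(product_iter_t *it) {
    if (it->done) return;
    size_t i = it->num_positions;
    while (i > 0) {
        i--;
        if (++it->indices[i] < it->sizes[i]) return;
        it->indices[i] = 0;
    }
    it->done = true; /* every position rolled over: all paths visited */
}
```

A caller would loop with `for (; !it.done; product_iter_next(&it))` and read `it.indices` on each pass to pick one alternative per position.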
Al
|
261ec3888a
|
[strings] header changes for new utf8 lower/upper functions
|
2016-12-31 03:20:43 -05:00 |
|
Al
|
58b063b632
|
[strings] making string_tree_iterator_done more meaningful (returns true if the iterator has no paths left to traverse)
|
2016-12-31 00:54:36 -05:00 |
|
Al
|
8978000320
|
[strings] adding latest utf8proc, new functions for utf8_lower (instead of case folding) and utf8_upper, and a utf8_is_whitespace that takes things like tabs into account
|
2016-12-31 00:52:12 -05:00 |
|