libpostal

Author	SHA1	Message	Date
Al	8abfa766fd	[fix] paren	2017-02-15 02:26:18 -05:00
Al	8eafc5730b	[parser] adding long-context features which help classify the first token in the string by finding the relative positions of a) the first numeric token and b) the first street-level phrase like "Ave" or "Calle"	2017-02-14 18:42:51 -05:00
Al	56f68e4399	[phrases] fixing trie suffix search	2017-02-14 03:36:29 -05:00
Al	2f4bcaeec2	[parser] address_parser_test memory cleanup, add print-errors option to print individual parser errors on held-out data	2017-02-12 16:05:11 -05:00
Al	b1e178b7b2	[fix] is_numeric_token includes IDEOGRAPHIC_NUMBER	2017-02-12 15:11:56 -05:00
Al	b570855b78	[parser] adding postcode context features and associated data structures to the parser. Masking digits, which should hopefully help with generalization. Creating positive/negative features for postcode with and without context support. Note: even with known postcodes in known contexts, only use the masked digits to avoid creating too many features that are redundant with the index.	2017-02-10 03:41:14 -05:00
Al	9a93e95938	[api] removing geodb from setup functions	2017-02-10 01:02:52 -05:00
Al	ff245d74f8	[parser] building an index of postal codes and their valid admin contexts (city, state, country, etc.) during training e.g. "11216" => ["brooklyn", "ny"]. Postal code phrases like CP in Spanish are removed when constructing the index.	2017-02-10 00:50:48 -05:00
Al	1aacb5bccc	Merge branch 'master' into parser-data	2017-02-09 15:09:28 -05:00
Al	ea168279bd	[fix] free json-encoded string in parser client output	2017-02-09 14:34:15 -05:00
Al	38c6c26146	[fix] freeing normalized string in address_parser_parse	2017-02-09 14:33:13 -05:00
Al	8aa3749cfb	[utils] some convenience functions for generic hashtables (incr, get, etc)	2017-02-08 19:01:13 -05:00
Al	a6844c8ec1	[parser] structural changes for postal codes index	2017-02-08 18:52:45 -05:00
Al	6e4f641743	[phrases] adding token_phrase_memberships to trie_search for reuse	2017-02-08 01:59:39 -05:00
Al	ae35da8d17	[fix] uninitialized var	2017-02-08 01:58:53 -05:00
Al	0380f565d2	[parser] shorter first word feature	2017-01-29 22:10:28 -05:00
Al	ec3a563591	Merge branch 'master' into parser-data	2017-01-14 13:06:25 -05:00
Rinigus	67624f89d0	cstring_array_from_char_array: return an empty, initialized cstring_array for an empty string	2017-01-14 10:43:47 +02:00
Al	b320aed9ac	[merge] merging master	2017-01-13 19:58:49 -05:00
Al	df89387b5c	[fix] calloc instead of malloc when performing initialization on structs that may fail halfway and need to clean up while partially initialized (calloc will set all the bytes to zero so the member pointers are NULL instead of garbage memory)	2017-01-13 18:30:04 -05:00
Al	1398df1260	[fix] accept 0 for array_new_size	2017-01-13 17:49:31 -05:00
Al	e1f258171f	[fix] handle cstring_array_from_char_array where char_array is NULL or 0-length	2017-01-13 16:52:41 -05:00
Al	a3506131fe	[build] adding libpostal_setup_datadir, libpostal_setup_parser_datadir, libpostal_setup_language_classifier_datadir functions for configuring the datadir at runtime	2017-01-09 16:11:26 -05:00
Al	953a26e54e	[utils] char_array_add_vjoined to stay consistent (add_* methods NUL terminate)	2017-01-09 16:10:07 -05:00
Al	7a8f94330b	[parser] only adding ngrams in a hyphenated word if the subword is not rare	2017-01-09 02:53:33 -05:00
Al	7a31802a04	[fix] also fix german-ascii transliteration on uppercase U with umlaut	2017-01-05 04:07:29 -05:00
Rinigus	26aeb0ebec	drop AC_FUNC_MALLOC and _REALLOC and check for them as regular functions; add extra cflags for scanner	2017-01-05 07:34:24 +02:00
Al	ccd555d020	[transliteration] regenerated transliteration_scripts_data.c	2017-01-02 13:52:48 -05:00
Al	77035fbdbd	[strings] adding utf8_is_whitespace to the header so it can be referenced from multiple files	2017-01-02 02:23:21 -05:00
Al	182976214c	[logging] converting most of the steps in building the transliteration table to use debug logging	2017-01-02 00:41:11 -05:00
Al	d8d3840700	[transliteration] constant for the html-escape transliterator	2017-01-02 00:40:12 -05:00
Al	4ad3a52fe1	[strings] fix lowercasing in string_utils.c	2017-01-01 20:08:34 -05:00
Al	a78937f265	[normalize] use the new utf8proc lowercasing (as opposed to case folding), free copies since none of the string functions operate in-place any more, add minimal HTML escaping transliterator even to ASCII text	2017-01-01 20:06:32 -05:00
Al	5c56a44faa	[strings] reverting to utf8proc v1.3.1, as 2.0 and above can chop off certain sequences	2017-01-01 20:03:23 -05:00
Al	fe88630f78	[dictionaries] regenerating address_expansion_data.c from upstream changes	2017-01-01 14:26:54 -05:00
Al	101bbcc02d	Merge remote-tracking branch 'origin/master' into parser-data	2017-01-01 14:25:37 -05:00
Travis	d61e90a33d	[auto][ci skip] Adding data files from Travis build #188	2017-01-01 19:20:54 +00:00
Al	0b5cc96654	[transliteration] add decompose option when stripping accents	2017-01-01 13:54:20 -05:00
Al	7d6c85aeec	[fix] new string tree iterator, don't decrement permutations on rollovers	2017-01-01 13:34:08 -05:00
Al	1780c5e053	[fix] moving enum	2016-12-31 13:01:57 -05:00
Al	475aa3dbfa	[strings] fixing and simplifying string tree iterator. This version is inspired by Python's itertools.product (itertoolsmodule.c has so many goodies)	2016-12-31 03:22:27 -05:00
Al	261ec3888a	[strings] header changes for new utf8 lower/upper functions	2016-12-31 03:20:43 -05:00
Al	58b063b632	[strings] making string_tree_iterator_done more meaningful (returns true if the iterator has no paths left to traverse)	2016-12-31 00:54:36 -05:00
Al	8978000320	[strings] adding latest utf8proc, new functions for utf8_lower (instead of case folding) and utf8_upper, and a utf8_is_whitespace that takes things like tabs into account	2016-12-31 00:52:12 -05:00
Al	db16e656ca	[parser/cli] adding .print_features option in address_parser client for debugging	2016-12-31 00:20:35 -05:00
Al	bdb51a244e	[phrases] fix case in trie search when searching for tokens in a string tail. If we're on the last token in a sequence and the token matches the tail, check that the tail is complete, and if so return the match before exiting the loop. Affects multiword phrases that tend to appear toward the end of a sequence (long country names like "United States of America", etc.)	2016-12-29 16:17:09 -05:00
Al	05732f6718	[build] Makefile changes for new parser feature extraction	2016-12-29 02:39:29 -05:00
Al	091167ed3c	[api] remove geodb from libpostal.c	2016-12-29 02:35:43 -05:00
Al	acd953ce51	[parser] first pass at new parser feature extraction: removing geodb phrases; use Latin-ASCII-simple transliteration (no umlauts, etc.); no digit normalization for admin component phrases and postcodes; tag = START + word, special feature for first word in the sequence; add the new admin boundary categories; for hyphenated non-phrase words, add each sub-word; for rare and unknown words, add ngram features of 3-6 characters with underscores to indicate beginnings and endings (similar to language classifier features); defines notion of "rare words" (known words with a frequency <= n where n > the unknown word threshold), so known words can share statistical strength with artificial and real unknown words	2016-12-29 02:17:35 -05:00
Al	e62101b8bf	[parser] remove geodb from address_parser_test, sort confusion matrix	2016-12-29 02:14:40 -05:00
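
Commit df89387b5c above explains why calloc is preferred over malloc for structs whose initialization can fail halfway. A minimal sketch of that pattern follows. The struct example_t, the functions example_new and example_destroy, and the fail_midway flag are hypothetical names made up for illustration; none of them are libpostal code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct for illustration only; not a libpostal type. */
typedef struct {
    char *name;
    int *values;
} example_t;

/* Safe to call on a partially initialized struct: free(NULL) is a no-op,
 * and calloc left every member that was never assigned as NULL. */
static void example_destroy(example_t *self) {
    if (self == NULL) return;
    free(self->name);
    free(self->values);
    free(self);
}

/* fail_midway (hypothetical) simulates a failure after the first member
 * has been allocated but before the second one has. */
static example_t *example_new(const char *name, size_t n, int fail_midway) {
    /* calloc zeroes the struct, so both member pointers start out NULL
     * instead of holding garbage as they would after malloc. */
    example_t *self = calloc(1, sizeof(example_t));
    if (self == NULL) return NULL;

    self->name = malloc(strlen(name) + 1);
    if (self->name == NULL) goto exit_destroy;
    strcpy(self->name, name);

    if (fail_midway) goto exit_destroy; /* values is still NULL here */

    self->values = calloc(n, sizeof(int));
    if (self->values == NULL) goto exit_destroy;

    return self;

exit_destroy:
    example_destroy(self);
    return NULL;
}
```

Had the struct come from malloc, the early exit would pass an uninitialized values pointer to free, which is undefined behavior.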

861 Commits
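
Commit b570855b78 mentions masking digits in postcode features to help generalization. The sketch below shows one way such masking could work, assuming every ASCII digit is replaced by a single placeholder character. The function name mask_digits and the placeholder 'D' are assumptions for illustration and are not taken from the libpostal source.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed placeholder character; the real parser may use something else. */
#define DIGIT_MASK 'D'

/* Returns a newly allocated copy of str in which each ASCII digit is
 * replaced by DIGIT_MASK. All other bytes are copied unchanged.
 * The caller owns the result and must free it. Returns NULL on
 * allocation failure. */
static char *mask_digits(const char *str) {
    size_t len = strlen(str);
    char *masked = malloc(len + 1);
    if (masked == NULL) return NULL;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)str[i];
        masked[i] = isdigit(c) ? DIGIT_MASK : (char)c;
    }
    masked[len] = '\0';
    return masked;
}
```

Under this assumption "11216" and "10001" both map to "DDDDD", so a feature built from the masked form is shared across all five-digit postcodes instead of duplicating what the postcode index already stores.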