libpostal

Author	SHA1	Message	Date
Al	db16e656ca	[parser/cli] adding .print_features option in address_parser client for debugging	2016-12-31 00:20:35 -05:00
Al	bdb51a244e	[phrases] fix case in trie search when searching for tokens in a string tail. If we're on the last token in a sequence and the token matches the tail, check that the tail is complete, and if so return the match before exiting the loop. Affects multiword phrases that tend to appear toward the end of a sequence (long country names like "United States of America", etc.)	2016-12-29 16:17:09 -05:00
Al	05732f6718	[build] Makefile changes for new parser feature extraction	2016-12-29 02:39:29 -05:00
Al	091167ed3c	[api] remove geodb from libpostal.c	2016-12-29 02:35:43 -05:00
Al	acd953ce51	[parser] first pass at new parser feature extraction - removing geodb phrases - use Latin-ASCII-simple transliteration (no umlauts, etc.) - no digit normalization for admin component phrases and postcodes - tag = START + word, special feature for first word in the sequence - add the new admin boundary categories - for hyphenated non-phrase words, add each sub-word - for rare and unknown words, add ngram features of 3-6 characters with underscores to indicate beginnings and endings (similar to language classifier features) - defines notion of "rare words" (known words with a frequency <= n where n > the unknown word threshold), so known words can share statistical strength with artificial and real unknown words	2016-12-29 02:17:35 -05:00
Al	e62101b8bf	[parser] remove geodb from address_parser_test, sort confusion matrix	2016-12-29 02:14:40 -05:00
Al	174529e8d0	[parser] remove geodb and fix small memory leak in address_parser_train	2016-12-29 02:12:06 -05:00
Al	bde5fdfaad	[merge] merging in master	2016-12-29 02:00:31 -05:00
Al	646d96e13e	Merge remote-tracking branch 'origin/master' into parser-data	2016-12-29 01:58:38 -05:00
Travis	6c35eb9e65	[auto][ci skip] Adding data files from Travis build #186	2016-12-28 06:29:35 +00:00
Travis	dc528affd5	[auto][ci skip] Adding data files from Travis build #184	2016-12-27 23:45:40 +00:00
Al	654fc2c463	[fix] memory cleanup in address_parser_data_set, logging any bad input lines	2016-12-26 16:18:15 -05:00
Al	e6d7b09e08	[expansions] adding generated expansion data	2016-12-26 16:16:59 -05:00
Al	4cdd245dc2	[logging] log error in address_dictionary_get_expansions	2016-12-26 16:16:26 -05:00
Al	42cf686b8e	[normalization] adding LATIN_ASCII_SIMPLE option to normalize_string_latin	2016-12-26 04:15:58 -05:00
Al	0284913aa7	[utils] ignore initial separators when splitting on delimiter	2016-12-26 04:14:20 -05:00
Brad Hards	fb68e22bbf	[fix] Use UTC date reference to avoid repeating S3 downloads. Resolves https://github.com/openvenues/libpostal/issues/143	2016-12-26 12:04:02 +11:00
Al	dd744c6d99	[merge] configure/Makefile changes from master	2016-12-22 12:37:27 -05:00
Al	8fe7958969	[build] allowing --disable-data-download option to configure. N.B. this is mostly for people building Docker images. The data files are NOT optional.	2016-12-22 12:31:27 -05:00
Al	dbe801fa08	[ngrams] changing args to ngrams	2016-12-21 18:09:45 -05:00
Al	6f37f9ae86	[merge] merging in master changes	2016-12-21 15:40:25 -05:00
Al	09b4e2ba2f	[build] pulling in change from parser-data that allows user to pass CFLAGS	2016-12-21 14:39:27 -05:00
Al	3ac2c93e1c	[utils] renaming char_array_append_vjoined to char_array_add_vjoined to follow the convention that add_* calls NUL-terminate while append_* calls do not	2016-12-18 15:26:58 -05:00
Al	3ed95a175e	[ngrams] adding function to extract an array of ngrams from a string, with optional special prefixes/suffixes for the edges	2016-12-17 01:33:18 -05:00
Al	8f1e69960f	[fix] loading transliteration module in address_parser_test.c as well	2016-12-12 11:37:27 -05:00
Al	3939dd0ca6	[fix] cstring_array_split calls	2016-12-12 11:37:27 -05:00
Al	a42d0e917a	[fix] brace	2016-12-12 11:37:27 -05:00
Al	ced8f9ae27	[parser] Ignore multiple spaces in parser input post-normalization. If normalizing the string creates several distinct tokens (namely in Vulgar fractions e.g. ½ => 1/2), add all the sub-tokens with the same label as the parent	2016-12-12 11:37:27 -05:00
Al	b1816e9b70	[utils] Adding cstring_array_split_ignore_consecutive	2016-12-12 11:37:27 -05:00
Al	6baa7087fe	[fix] calls and NULL checks	2016-12-12 11:37:27 -05:00
Al	5e07f5e8c5	[fix] tokenized_string_t should copy its source string	2016-12-12 11:37:27 -05:00
Al	521a094a47	[fix] Need to load transliteration module for Latin-ASCII normalization	2016-12-12 11:37:27 -05:00
Al	d575caba8a	[data] using UTC for libpostal data files on the Mac version of the download script as well	2016-12-09 19:43:05 -05:00
Al	c3f3896b48	[fix] update test for date function in data download script	2016-12-09 19:29:00 -05:00
Al	318773ffe7	[parser] header changes for the data set struct	2016-12-09 13:37:45 -05:00
Al	22c4e99ea0	[parser] As part of reading/tokenizing the address parser data set, several copies of the same training example will be generated. 1. with only lowercasing 2. with simple Latin-ASCII normalization (no umlauts, only things that are common to all languages) 3. basic UTF-8 normalizations (accent stripping) 4. language-specific Latin-ASCII transliteration (e.g. ü => ue in German) This will apply both on the initial passes when building the phrase gazetteers and during each iteration of training. In this way, only the most basic normalizations like lowercasing need to be done at runtime and it's possible to use only minimal normalizations like lowercasing. May have a small effect on randomization as examples are created in a deterministic order. However, this should not lead to cycles since the base examples are shuffled, thus still satisfying the random permutation requirement of an online/stochastic learning algorithm.	2016-12-02 13:09:03 -05:00
Al	4b35da629f	[numex] regenerated numex data file	2016-11-30 15:58:55 -08:00
Al	4677874610	[parser] stripping postal codes of phrases like CP (in Spanish) before adding them to the gazetteers, whether it's concatenated or a separate token. Adding a command-line argument for the number of iterations	2016-11-30 15:58:03 -08:00
Al	0e29cdd9fd	[parser] fixing some uninitialized value issues during parser training	2016-11-30 15:42:09 -08:00
Al	f5a6bd0f36	[fix] sparse_matrix_new_from_matrix uses new matrix types	2016-11-30 10:15:12 -08:00
Al	b639fa5127	[utils] string_replace also creates a copy	2016-11-30 10:09:33 -08:00
Al	89f6611c4e	[strings] string_trim makes a copy rather than modifying the pointer	2016-11-28 15:06:07 -08:00
Al	d922d9a60a	[expansion] regenerated address_expansion_data.c	2016-11-28 10:47:15 -08:00
Al	f78281456a	[fix] header definition	2016-11-27 01:00:25 -08:00
Al	eea11beb6a	[expansion] using easier-to-access data structure for address dictionaries	2016-11-27 00:56:48 -08:00
Al	7298c895c8	[utils] adding a chunked shuffle as the concatenated file sizes may get larger than memory	2016-11-21 14:04:34 -05:00
Travis	04f8130c46	[auto][ci skip] Adding data files from Travis build #168	2016-10-07 00:46:48 +00:00
Al	01afbf80ef	[data] Each curl process will retry the chunk up to 3 times	2016-08-25 23:18:39 -04:00
Travis	de1255af00	[auto][ci skip] Adding data files from Travis build #161	2016-08-23 22:48:20 +00:00
Travis	f19c9852aa	[auto][ci skip] Adding data files from Travis build #160	2016-08-23 22:24:19 +00:00

817 Commits