libpostal

Author	SHA1	Message	Date
Al	7a31802a04	[fix] also fix german-ascii transliteration on uppercase U with umlaut	2017-01-05 04:07:29 -05:00
Al	ccd555d020	[transliteration] regenerated transliteration_scripts_data.c	2017-01-02 13:52:48 -05:00
Al	77035fbdbd	[strings] adding utf8_is_whitespace to the header so it can be referenced from multiple files	2017-01-02 02:23:21 -05:00
Al	182976214c	[logging] converting most of the steps in building the transliteration table to use debug logging	2017-01-02 00:41:11 -05:00
Al	d8d3840700	[transliteration] constant for the html-escape transliterator	2017-01-02 00:40:12 -05:00
Al	4ad3a52fe1	[strings] fix lowercasing in string_utils.c	2017-01-01 20:08:34 -05:00
Al	a78937f265	[normalize] use the new utf8proc lowercasing (as opposed to case folding), free copies since none of the string functions operate in-place any more, add minimal HTML escaping transliterator even to ASCII text	2017-01-01 20:06:32 -05:00
Al	5c56a44faa	[strings] reverting to utf8proc v1.3.1, as 2.0 and above can chop off certain sequences	2017-01-01 20:03:23 -05:00
Al	fe88630f78	[dictionaries] regenerating address_expansion_data.c from upstream changes	2017-01-01 14:26:54 -05:00
Al	101bbcc02d	Merge remote-tracking branch 'origin/master' into parser-data	2017-01-01 14:25:37 -05:00
Travis	d61e90a33d	[auto][ci skip] Adding data files from Travis build #188	2017-01-01 19:20:54 +00:00
Al	0b5cc96654	[transliteration] add decompose option when stripping accents	2017-01-01 13:54:20 -05:00
Al	7d6c85aeec	[fix] new string tree iterator, don't decrement permutations on rollovers	2017-01-01 13:34:08 -05:00
Al	1780c5e053	[fix] moving enum	2016-12-31 13:01:57 -05:00
Al	475aa3dbfa	[strings] fixing and simplifying string tree iterator. This version is inspired by Python's itertools.product (itertoolsmodule.c has so many goodies)	2016-12-31 03:22:27 -05:00
Al	261ec3888a	[strings] header changes for new utf8 lower/upper functions	2016-12-31 03:20:43 -05:00
Al	58b063b632	[strings] making string_tree_iterator_done more meaningful (returns true if the iterator has no paths left to traverse)	2016-12-31 00:54:36 -05:00
Al	8978000320	[strings] adding latest utf8proc, new functions for utf8_lower (instead of case folding) and utf8_upper, and a utf8_is_whitespace that takes things like tabs into account	2016-12-31 00:52:12 -05:00
Al	db16e656ca	[parser/cli] adding .print_features option in address_parser client for debugging	2016-12-31 00:20:35 -05:00
Al	bdb51a244e	[phrases] fix case in trie search when searching for tokens in a string tail. If we're on the last token in a sequence and the token matches the tail, check that the tail is complete, and if so return the match before exiting the loop. Affects multiword phrases that tend to appear toward the end of a sequence (long country names like "United States of America", etc.)	2016-12-29 16:17:09 -05:00
Al	05732f6718	[build] Makefile changes for new parser feature extraction	2016-12-29 02:39:29 -05:00
Al	091167ed3c	[api] remove geodb from libpostal.c	2016-12-29 02:35:43 -05:00
Al	acd953ce51	[parser] first pass at new parser feature extraction: removing geodb phrases; use Latin-ASCII-simple transliteration (no umlauts, etc.); no digit normalization for admin component phrases and postcodes; tag = START + word, special feature for first word in the sequence; add the new admin boundary categories; for hyphenated non-phrase words, add each sub-word; for rare and unknown words, add ngram features of 3-6 characters with underscores to indicate beginnings and endings (similar to language classifier features); defines notion of "rare words" (known words with a frequency <= n where n > the unknown word threshold), so known words can share statistical strength with artificial and real unknown words	2016-12-29 02:17:35 -05:00
Al	e62101b8bf	[parser] remove geodb from address_parser_test, sort confusion matrix	2016-12-29 02:14:40 -05:00
Al	174529e8d0	[parser] remove geodb and fix small memory leak in address_parser_train	2016-12-29 02:12:06 -05:00
Al	bde5fdfaad	[merge] merging in master	2016-12-29 02:00:31 -05:00
Al	646d96e13e	Merge remote-tracking branch 'origin/master' into parser-data	2016-12-29 01:58:38 -05:00
Travis	6c35eb9e65	[auto][ci skip] Adding data files from Travis build #186	2016-12-28 06:29:35 +00:00
Travis	dc528affd5	[auto][ci skip] Adding data files from Travis build #184	2016-12-27 23:45:40 +00:00
Al	654fc2c463	[fix] memory cleanup in address_parser_data_set, logging any bad input lines	2016-12-26 16:18:15 -05:00
Al	e6d7b09e08	[expansions] adding generated expansion data	2016-12-26 16:16:59 -05:00
Al	4cdd245dc2	[logging] log error in address_dictionary_get_expansions	2016-12-26 16:16:26 -05:00
Al	42cf686b8e	[normalization] adding LATIN_ASCII_SIMPLE option to normalize_string_latin	2016-12-26 04:15:58 -05:00
Al	0284913aa7	[utils] ignore initial separators when splitting on delimiter	2016-12-26 04:14:20 -05:00
Brad Hards	fb68e22bbf	[fix] Use UTC date reference to avoid repeating S3 downloads. Resolves https://github.com/openvenues/libpostal/issues/143	2016-12-26 12:04:02 +11:00
Al	dd744c6d99	[merge] configure/Makefile changes from master	2016-12-22 12:37:27 -05:00
Al	8fe7958969	[build] allowing --disable-data-download option to configure. N.B. this is mostly for people building Docker images. The data files are NOT optional.	2016-12-22 12:31:27 -05:00
Al	dbe801fa08	[ngrams] changing args to ngrams	2016-12-21 18:09:45 -05:00
Al	6f37f9ae86	[merge] merging in master changes	2016-12-21 15:40:25 -05:00
Al	09b4e2ba2f	[build] pulling in change from parser-data that allows user to pass CFLAGS	2016-12-21 14:39:27 -05:00
Al	3ac2c93e1c	[utils] renaming char_array_append_vjoined to char_array_add_vjoined to follow convention that add_* calls NUL-terminate while append_* calls do not	2016-12-18 15:26:58 -05:00
Al	3ed95a175e	[ngrams] adding function to extract an array of ngrams from a string, with optional special prefixes/suffixes for the edges	2016-12-17 01:33:18 -05:00
Al	8f1e69960f	[fix] loading transliteration module in address_parser_test.c as well	2016-12-12 11:37:27 -05:00
Al	3939dd0ca6	[fix] cstring_array_split calls	2016-12-12 11:37:27 -05:00
Al	a42d0e917a	[fix] brace	2016-12-12 11:37:27 -05:00
Al	ced8f9ae27	[parser] Ignore multiple spaces in parser input post-normalization. If normalizing the string creates several distinct tokens (namely in Vulgar fractions e.g. ½ => 1/2), add all the sub-tokens with the same label as the parent	2016-12-12 11:37:27 -05:00
Al	b1816e9b70	[utils] Adding cstring_array_split_ignore_consecutive	2016-12-12 11:37:27 -05:00
Al	6baa7087fe	[fix] calls and NULL checks	2016-12-12 11:37:27 -05:00
Al	5e07f5e8c5	[fix] tokenized_string_t should copy its source string	2016-12-12 11:37:27 -05:00
Al	521a094a47	[fix] Need to load transliteration module for Latin-ASCII normalization	2016-12-12 11:37:27 -05:00
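
Note on the ngram features mentioned in acd953ce51 and 3ed95a175e: the sketch below is one possible reading of "ngram features of 3-6 characters with underscores to indicate beginnings and endings". It is written in Python for brevity and is not libpostal's C implementation. The function name `edge_ngrams`, its parameters, and the choice to pad the word with a marker on both sides are assumptions made for illustration only.

```python
def edge_ngrams(word, min_n=3, max_n=6, marker="_"):
    """Return character n-grams of length min_n..max_n for a word.

    Assumption: word edges are marked by padding the word with `marker`
    on both sides, so an n-gram touching the start or end of the word
    carries an underscore. The real libpostal scheme may differ.
    """
    padded = marker + word + marker
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n characters over the padded word.
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


if __name__ == "__main__":
    # N-grams like these could let a rare or unknown word share
    # features with known words that contain the same substrings.
    print(edge_ngrams("main", 3, 3))
```

Python strings are indexed by code point, so the windows do not split multi-byte UTF-8 characters; a C version would need to step over UTF-8 sequences explicitly.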

835 Commits