libpostal

Author	SHA1	Message	Date
Al	318773ffe7	[parser] header changes for the data set struct	2016-12-09 13:37:45 -05:00
Al	22c4e99ea0	[parser] While reading/tokenizing the address parser data set, several copies of each training example are generated: (1) with only lowercasing, (2) with simple Latin-ASCII normalization (no umlauts, only things that are common to all languages), (3) with basic UTF-8 normalizations (accent stripping), and (4) with language-specific Latin-ASCII transliteration (e.g. ü => ue in German). This applies both on the initial passes when building the phrase gazetteers and during each iteration of training. In this way, only minimal normalizations like lowercasing need to be done at runtime. May have a small effect on randomization, as examples are created in a deterministic order. However, this should not lead to cycles since the base examples are shuffled, thus still satisfying the random permutation requirement of an online/stochastic learning algorithm.	2016-12-02 13:09:03 -05:00
Al	4b35da629f	[numex] regenerated numex data file	2016-11-30 15:58:55 -08:00
Al	4677874610	[parser] stripping postal codes of phrases like CP (in Spanish) before adding them to the gazetteers, whether it's concatenated or a separate token. Adding a command-line argument for the number of iterations	2016-11-30 15:58:03 -08:00
Al	0e29cdd9fd	[parser] fixing some uninitialized value issues during parser training	2016-11-30 15:42:09 -08:00
Al	f5a6bd0f36	[fix] sparse_matrix_new_from_matrix uses new matrix types	2016-11-30 10:15:12 -08:00
Al	b639fa5127	[utils] string_replace also creates a copy	2016-11-30 10:09:33 -08:00
Al	89f6611c4e	[strings] string_trim makes a copy rather than modifying the pointer	2016-11-28 15:06:07 -08:00
Al	d922d9a60a	[expansion] regenerated address_expansion_data.c	2016-11-28 10:47:15 -08:00
Al	f78281456a	[fix] header definition	2016-11-27 01:00:25 -08:00
Al	eea11beb6a	[expansion] using easier-to-access data structure for address dictionaries	2016-11-27 00:56:48 -08:00
Al	7298c895c8	[utils] adding a chunked shuffle as the concatenated file sizes may get larger than memory	2016-11-21 14:04:34 -05:00
Travis	04f8130c46	[auto][ci skip] Adding data files from Travis build #168	2016-10-07 00:46:48 +00:00
Al	01afbf80ef	[data] Each curl process will retry the chunk up to 3 times	2016-08-25 23:18:39 -04:00
Travis	de1255af00	[auto][ci skip] Adding data files from Travis build #161	2016-08-23 22:48:20 +00:00
Travis	f19c9852aa	[auto][ci skip] Adding data files from Travis build #160	2016-08-23 22:24:19 +00:00
Travis	d797d6c863	[auto][ci skip] Adding data files from Travis build #159	2016-08-23 22:14:07 +00:00
Al	58851a9088	[normalization] Adding NORMALIZE_STRING_SIMPLE_LATIN_ASCII option so parser can normalize punctuation and HTML entities, etc. without touching the alphanumeric parts of the original input	2016-08-21 19:45:32 -04:00
Al	8b9702b43d	[error handling] Checking that resize succeeded in transliterate.c	2016-08-21 19:43:09 -04:00
Al	2644fed18f	[transliteration] Adding LATIN_ASCII_SIMPLE constant to transliterate.h	2016-08-21 19:42:10 -04:00
Al	4375bdea3b	[transliteration] strduping transliterator name while building table	2016-08-21 19:41:34 -04:00
Al	bde8776bc2	[transliteration] Regenerating transliteration data files	2016-08-21 19:41:11 -04:00
Al	330edc2c93	[utils] cstring_array_get_phrase requires a char_array to be passed in so it doesn't have to do any memory allocation	2016-08-16 13:11:45 -04:00
Al	92e66fd60c	[utils] string_next_hyphen_index	2016-08-16 12:49:52 -04:00
Al	3137ef5c6a	[build] configure/Makefile changes to use SIMD exp and BLAS when available	2016-08-06 00:43:24 -04:00
Al	59e28c6c2a	[math] double_array definition in collections.h to use new vectorized exp	2016-08-06 00:40:38 -04:00
Al	46cd725c13	[math] Generic dense matrix implementation using BLAS calls for matrix-matrix multiplication if available	2016-08-06 00:40:01 -04:00
Al	d4a792f33c	[math] Adding fast SIMD exponent using the Remez algorithm for vectorized exp	2016-08-06 00:31:16 -04:00
Al	161f18575d	[utils] Adding realloc checks to vector implementation	2016-08-05 23:02:52 -04:00
Al	20aad99a38	[parser] enum just lists boundary types	2016-07-30 17:07:23 -04:00
Al	965bac1833	[trie] Making methods to construct string phrases from phrase matches available through trie_search.h	2016-07-30 17:06:20 -04:00
Al	08f39d6b80	[parser] Adding address_parser_rewind to make multiple passes through the file when compiling the phrase tries	2016-07-28 17:13:58 -04:00
Al	1b09b7f2e5	[fix] Adding country_region to address_parser_train	2016-07-28 16:18:32 -04:00
Al	c6af5cc071	[parser] Adding country_region label to parser as a boundary component	2016-07-28 15:19:48 -04:00
Tom Davis	18c8e90eb3	Use `xargs` to start workers as soon as possible	2016-07-27 17:46:44 -04:00
Tom Davis	11abf6cb22	Use posix `sh` for systems without `bash`	2016-07-26 20:17:18 -04:00
Al Barrentine	65c4688f89	Merge pull request #97 from uberbaud/multipart_edgecase Don't call `download_multipart` for 1 chunk	2016-07-24 00:03:51 -04:00
Travis	3f0eff228e	[auto][ci skip] Adding data files from Travis build #145	2016-07-23 22:28:32 +00:00
Tom Davis	2991ffd193	Don't call `download_multipart` for 1 chunk. Previously, when a file was larger than `$LARGE_FILE_SIZE` but smaller than `$CHUNK_SIZE*2`, `download_multipart` would be called but would download only one chunk containing the whole file. This fix keeps the same download performance as before but skips the unnecessary chunk processing.	2016-07-23 16:41:04 -04:00
Tom Davis	24e0314e71	Remove call to `seq` which may not exist	2016-07-23 01:03:15 -04:00
Al	64f167f045	[tokenization] Re-generating scanner	2016-07-21 17:04:57 -04:00
Al	81b4a4a1cb	[tokenization] Hyphens, etc. between non-ASCII digits (e.g. Unicode full-width numbers) should be single tokens	2016-07-21 17:04:57 -04:00
Al	be5fd79a48	[expansion] Prefix/suffix expansions by default can apply to ADDRESS_ANY but also inherit the types of any dictionary that lists their canonical form (so we can add suffixes without worrying about whether they're for streets or place names, etc.)	2016-07-21 17:04:57 -04:00
Al	8926293063	[parser/cli] Using NFC normalization on the output in the parser client (closes #30). Optional command-line arg for parser output dir, useful for spot-checking different experiments	2016-07-21 17:04:57 -04:00
Al	44908ff95a	[parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces	2016-07-21 17:04:57 -04:00
Al	41ae742285	[fix] tokenized trie search when falling off the trie at the start of a valid phrase	2016-07-21 17:04:57 -04:00
Al	6e60b3bbda	[fix] semicolon in #define	2016-07-21 17:04:57 -04:00
Al	b5d4dd6f37	[tokenization] Including full-width numbers in numeric tokens	2016-07-21 17:04:57 -04:00
Al	dd7ef6fabf	[dictionaries] Making new component for near/nearby prepositions	2016-07-21 17:04:57 -04:00
Al	2454b98c6d	[tokenization] Reverting commit for tokenizing initial/final apostrophes as part of words as it may be more effective to handle during post-processing	2016-07-21 17:04:57 -04:00
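
The training-time normalization described in commit 22c4e99ea0 can be illustrated with a rough Python sketch. This is not libpostal's code (the library is written in C): the function names, the two replacement tables, and the choice to drop duplicate variants are all assumptions made for illustration only.

```python
import unicodedata

# Hypothetical per-language transliteration table; only German is sketched.
LANGUAGE_TRANSLITERATIONS = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
}

# Hypothetical "simple" Latin-ASCII replacements that leave letters and
# digits alone (punctuation only), so they are safe in every language.
SIMPLE_LATIN_ASCII = {"\u2019": "'", "\u201c": '"', "\u201d": '"', "\u2013": "-"}


def replace_all(text, table):
    """Apply every source -> target replacement in the table."""
    for src, dst in table.items():
        text = text.replace(src, dst)
    return text


def strip_accents(text):
    """Decompose, drop combining marks, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def normalized_variants(text, language=None):
    """Return the distinct normalized copies of one example, in a fixed order."""
    lowered = text.lower()
    candidates = [
        lowered,                                    # 1. lowercasing only
        replace_all(lowered, SIMPLE_LATIN_ASCII),   # 2. simple Latin-ASCII
        strip_accents(lowered),                     # 3. accent stripping
    ]
    if language in LANGUAGE_TRANSLITERATIONS:       # 4. language-specific
        candidates.append(replace_all(lowered, LANGUAGE_TRANSLITERATIONS[language]))

    # Assumption: identical copies are dropped; the commit does not say so.
    variants = []
    for candidate in candidates:
        if candidate not in variants:
            variants.append(candidate)
    return variants
```

For example, `normalized_variants("Müllerstraße 5", "de")` yields the lowercased form, the accent-stripped form `mullerstraße 5`, and the German transliteration `muellerstrasse 5`. The variants come out in a fixed order, which is why the commit notes that the base examples still need to be shuffled.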

983 Commits