libpostal

Author	SHA1	Message	Date
Al	a3214b7914	[readme] Readme fixes and additions	2015-09-26 23:32:19 -04:00
Al	5b829cd5a7	[fix] blank values containing punctuation in formatting	2015-09-26 21:49:28 -04:00
Al	dac0440be8	[fix] rsplit	2015-09-26 21:07:54 -04:00
Al	e255ae0e09	[dictionaries] Luxembourgish dictionaries	2015-09-26 18:31:07 -04:00
Al	3fe56d029d	[dictionaries] German Swiss dictionaries	2015-09-26 18:30:55 -04:00
Al	ae93552455	[osm/formatting] Moving back to openvenues repo pending resolution of the Turkish address issue	2015-09-26 03:56:52 -04:00
Al	0c792a2cc3	[osm/formatting] Changing the way the formatter eliminates inter-component separators, changing repo back to OpenCageData after pull request merge	2015-09-26 03:21:26 -04:00
Al	856198a352	[tokenization] Regenerated scanner.c	2015-09-26 02:27:45 -04:00
Al	07f1f361e2	[transliteration] Regenerating transliteration data with new categories	2015-09-26 00:07:39 -04:00
Al	172263af58	[tokenization] Adding updated token classes to scanner.re	2015-09-26 00:05:23 -04:00
Al	5417b4e602	[unicode] Downloading latest UnicodeData.txt instead of using builtin Python module (out of date) e.g. for getting unicode codepoint categories	2015-09-25 23:59:38 -04:00
Al	8fe791a14a	[fix] ensure_dir in file downloads	2015-09-25 17:05:22 -04:00
Al	646b9f7248	[osm/formatting] Continuing to use openvenues formatter for the India fix	2015-09-25 13:36:24 -04:00
Al	5a6b47d0fd	[api] Adding LIBPOSTAL_DEFAULT_OPTIONS to libpostal.h	2015-09-25 01:53:29 -04:00
Al	f5bb72c6f5	[readme] missed a dictionary type	2015-09-24 23:32:36 -04:00
Al	f243b9cfa6	[fix] phrasing	2015-09-24 23:30:03 -04:00
Al	dc31019604	[readme] Heading	2015-09-24 23:20:23 -04:00
Al	cfef3059bb	[readme] Moving paragraph	2015-09-24 23:19:53 -04:00
Al	f62cfb9551	[readme] README changes	2015-09-24 23:16:07 -04:00
Al	3e256404b9	[readme] More informative README	2015-09-24 23:02:09 -04:00
Al	9901dd2aac	[fix] Switching address formatter back to OpenCageData repo	2015-09-24 18:42:17 -04:00
Al	accd8a57e7	[expansion] Regenerating expansion data	2015-09-24 16:38:20 -04:00
Al	fa320defb7	[dictionaries] Afrikaans dictionaries for better disambiguation in South Africa	2015-09-24 16:37:16 -04:00
Al	050a850fb9	[dictionaries] Dutch directionals, separating out the west vs westen forms	2015-09-24 16:36:52 -04:00
Al	fe5d665533	[dictionaries] Arc in English needn't always expand to Arcade	2015-09-24 16:36:21 -04:00
Al	bcac6a41be	[dictionaries] Separating out Austrian toponym abbreviations	2015-09-24 16:35:56 -04:00
Al	3ce1669c30	[fix] import	2015-09-24 01:25:00 -04:00
Al	c85ce0b11d	[osm/formatting] Tagging separators as well in tagged output of the address formatter	2015-09-24 01:22:49 -04:00
Al	f6c30778bf	[normalize] New token normalization option for replacing digits with 'D' for masking numbers e.g. when learning patterns (so 1234 and 5678 both normalize to DDDD). Shouldn't be used by libpostal API, just by the feature extractors in the machine learning models. Also adding better possessive handling.	2015-09-23 19:41:01 -04:00
Al	a1d272077d	[doc] Averaged perceptron tagger	2015-09-23 19:37:55 -04:00
Al	4a0da67aa1	[fix] warning	2015-09-23 04:06:54 -04:00
Al	88bd0cd158	[unicode] better segmentation on script breaks	2015-09-23 04:06:34 -04:00
Al	377c947541	[transliteration] Regenerating transliteration data files	2015-09-23 04:04:38 -04:00
Al	abfb1d4a60	[transliteration] Wide char support in transliteration data generator	2015-09-23 03:56:12 -04:00
Al	7e057b0fb8	[utils] basic functions for wide char support for narrow Python builds (unichr, ord, unicode iteration)	2015-09-23 00:42:54 -04:00
Al	8562c7a5cb	[unicode] Adding wide char support for language disambiguation (comes up in venue names), despite the likelihood of running on a narrow Python build. Rolling back common script chars at a script break, so in the case of e.g. Cyrillic name (Latin name), the segmentation is done at the space before the paren.	2015-09-23 00:37:59 -04:00
Al	19e5457a0f	[unicode] Regenerated unicode scripts data file, using simple integers instead of repeating the enum types for succinctness	2015-09-23 00:36:29 -04:00
Al	4ad3fac627	[unicode] Regenerated unicode script types (ignore extraneous scripts, they're not used, just reside in the upper unicode planes)	2015-09-23 00:35:08 -04:00
Al	13bcc35523	[unicode] Allowing wide chars in unicode properties	2015-09-23 00:34:07 -04:00
Al	f13e9fad90	[tokenization] Regenerated scanner.c	2015-09-23 00:33:27 -04:00
Al	b4593b6f88	[unicode/tokenization] Using new character classes including wide chars in scanner	2015-09-23 00:33:14 -04:00
Al	a76831df7a	[unicode] Wide version of word breaks	2015-09-22 18:55:33 -04:00
Al	25917cfb17	[fix] scripts	2015-09-22 15:15:30 -04:00
Al	b405a53fe1	[fix] chars out of range in get_string_script Python version	2015-09-22 08:14:27 -04:00
Al	ca25b48687	[fix] Not writing empty fields in formatted addresses	2015-09-22 08:13:55 -04:00
Al	747de1944b	[fix] Accounting for unknown scripts in disambiguation	2015-09-21 18:05:28 -04:00
Al	3ac89d7ed9	[setup] fixing packaging	2015-09-21 17:31:15 -04:00
Al	236737eab3	[tokenization/osm] Using utf8 encoded version of string for tokens in python tokenizer	2015-09-21 17:27:43 -04:00
Al	134cf616d6	[osm] Using street for language disambiguation in training data	2015-09-21 04:09:15 -04:00
Al	ccac4a5a90	[fix] package directory	2015-09-21 04:01:36 -04:00

854 Commits