Commit Graph

854 Commits

Author SHA1 Message Date
Al
a3214b7914 [readme] Readme fixes and additions 2015-09-26 23:32:19 -04:00
Al
5b829cd5a7 [fix] blank values containing punctuation in formatting 2015-09-26 21:49:28 -04:00
Al
dac0440be8 [fix] rsplit 2015-09-26 21:07:54 -04:00
Al
e255ae0e09 [dictionaries] Luxembourgish dictionaries 2015-09-26 18:31:07 -04:00
Al
3fe56d029d [dictionaries] German Swiss dictionaries 2015-09-26 18:30:55 -04:00
Al
ae93552455 [osm/formatting] Moving back to openvenues repo pending resolution of the Turkish address issue 2015-09-26 03:56:52 -04:00
Al
0c792a2cc3 [osm/formatting] Changing the way the formatter eliminates inter-component separators, changing repo back to OpenCageData after pull request merge 2015-09-26 03:21:26 -04:00
Al
856198a352 [tokenization] Regenerated scanner.c 2015-09-26 02:27:45 -04:00
Al
07f1f361e2 [transliteration] Regenerating transliteration data with new categories 2015-09-26 00:07:39 -04:00
Al
172263af58 [tokenization] Adding updated token classes to scanner.re 2015-09-26 00:05:23 -04:00
Al
5417b4e602 [unicode] Downloading latest UnicodeData.txt instead of using builtin Python module (out of date) e.g. for getting unicode codepoint categories 2015-09-25 23:59:38 -04:00
Al
8fe791a14a [fix] ensure_dir in file downloads 2015-09-25 17:05:22 -04:00
Al
646b9f7248 [osm/formatting] Continuing to use openvenues formatter for the India fix 2015-09-25 13:36:24 -04:00
Al
5a6b47d0fd [api] Adding LIBPOSTAL_DEFAULT_OPTIONS to libpostal.h 2015-09-25 01:53:29 -04:00
Al
f5bb72c6f5 [readme] missed a dictionary type 2015-09-24 23:32:36 -04:00
Al
f243b9cfa6 [fix] phrasing 2015-09-24 23:30:03 -04:00
Al
dc31019604 [readme] Heading 2015-09-24 23:20:23 -04:00
Al
cfef3059bb [readme] Moving paragraph 2015-09-24 23:19:53 -04:00
Al
f62cfb9551 [readme] README changes 2015-09-24 23:16:07 -04:00
Al
3e256404b9 [readme] More informative README 2015-09-24 23:02:09 -04:00
Al
9901dd2aac [fix] Switching address formatter back to OpenCageData repo 2015-09-24 18:42:17 -04:00
Al
accd8a57e7 [expansion] Regenerating expansion data 2015-09-24 16:38:20 -04:00
Al
fa320defb7 [dictionaries] Afrikaans dictionaries for better disambiguation in South Africa 2015-09-24 16:37:16 -04:00
Al
050a850fb9 [dictionaries] Dutch directionals, separating out the west vs westen forms 2015-09-24 16:36:52 -04:00
Al
fe5d665533 [dictionaries] Arc in English needn't always expand to Arcade 2015-09-24 16:36:21 -04:00
Al
bcac6a41be [dictionaries] Separating out Austrian toponym abbreviations 2015-09-24 16:35:56 -04:00
Al
3ce1669c30 [fix] import 2015-09-24 01:25:00 -04:00
Al
c85ce0b11d [osm/formatting] Tagging separators as well in tagged output of the address formatter 2015-09-24 01:22:49 -04:00
Al
f6c30778bf [normalize] New token normalization option for replacing digits with 'D' for masking numbers e.g. when learning patterns (so 1234 and 5678 both normalize to DDDD). Shouldn't be used by libpostal API, just by the feature extractors in the machine learning models. Also adding better possessive handling. 2015-09-23 19:41:01 -04:00
Al
a1d272077d [doc] Averaged perceptron tagger 2015-09-23 19:37:55 -04:00
Al
4a0da67aa1 [fix] warning 2015-09-23 04:06:54 -04:00
Al
88bd0cd158 [unicode] better segmentation on script breaks 2015-09-23 04:06:34 -04:00
Al
377c947541 [transliteration] Regenerating transliteration data files 2015-09-23 04:04:38 -04:00
Al
abfb1d4a60 [transliteration] Wide char support in transliteration data generator 2015-09-23 03:56:12 -04:00
Al
7e057b0fb8 [utils] basic functions for wide char support for narrow Python builds (unichr, ord, unicode iteration) 2015-09-23 00:42:54 -04:00
Al
8562c7a5cb [unicode] Adding wide char support for language disambiguation (comes up in venue names), despite the likelihood of running on a narrow Python build. Rolling back common script chars at a script break, so in the case of e.g. Cyrillic name (Latin name), the segmentation is done at the space before the paren. 2015-09-23 00:37:59 -04:00
Al
19e5457a0f [unicode] Regenerated unicode scripts data file, using simple integers instead of repeating the enum types for succinctness 2015-09-23 00:36:29 -04:00
Al
4ad3fac627 [unicode] Regenerated unicode script types (ignore extraneous scripts, they're not used, just reside in the upper unicode planes) 2015-09-23 00:35:08 -04:00
Al
13bcc35523 [unicode] Allowing wide chars in unicode properties 2015-09-23 00:34:07 -04:00
Al
f13e9fad90 [tokenization] Regenerated scanner.c 2015-09-23 00:33:27 -04:00
Al
b4593b6f88 [unicode/tokenization] Using new character classes including wide chars in scanner 2015-09-23 00:33:14 -04:00
Al
a76831df7a [unicode] Wide version of word breaks 2015-09-22 18:55:33 -04:00
Al
25917cfb17 [fix] scripts 2015-09-22 15:15:30 -04:00
Al
b405a53fe1 [fix] chars out of range in get_string_script Python version 2015-09-22 08:14:27 -04:00
Al
ca25b48687 [fix] Not writing empty fields in formatted addresses 2015-09-22 08:13:55 -04:00
Al
747de1944b [fix] Accounting for unknown scripts in disambiguation 2015-09-21 18:05:28 -04:00
Al
3ac89d7ed9 [setup] fixing packaging 2015-09-21 17:31:15 -04:00
Al
236737eab3 [tokenization/osm] Using utf8 encoded version of string for tokens in python tokenizer 2015-09-21 17:27:43 -04:00
Al
134cf616d6 [osm] Using street for language disambiguation in training data 2015-09-21 04:09:15 -04:00
Al
ccac4a5a90 [fix] package directory 2015-09-21 04:01:36 -04:00
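
Commit f6c30778bf above describes a normalization option that replaces digits with 'D' so that 1234 and 5678 both normalize to DDDD. The following is only a rough sketch of that idea, not the code from the commit: the function name `mask_digits` and its `mask` parameter are hypothetical, and treating "digit" as any Unicode decimal digit is an assumption.

```python
def mask_digits(token: str, mask: str = "D") -> str:
    """Replace every decimal digit in a token with a mask character.

    Hypothetical helper for illustration; not part of the libpostal API.
    Assumption: "digit" means any Unicode decimal digit (str.isdecimal).
    """
    return "".join(mask if ch.isdecimal() else ch for ch in token)


if __name__ == "__main__":
    # Tokens that differ only in their digits collapse to the same pattern.
    for token in ("1234", "5678", "221B"):
        print(token, mask_digits(token))
```

As the commit message notes, masking of this kind is meant for feature extraction when learning patterns, not for output returned to API users, since it discards the actual numbers.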