Commit Graph

821 Commits

Author SHA1 Message Date
Al
abfb1d4a60 [transliteration] Wide char support in transliteration data generator 2015-09-23 03:56:12 -04:00
Al
7e057b0fb8 [utils] basic functions for wide char support for narrow Python builds (unichr, ord, unicode iteration) 2015-09-23 00:42:54 -04:00
Al
8562c7a5cb [unicode] Adding wide char support for language disambiguation (comes up in venue names), despite the likelihood of running on a narrow Python build. Rolling back common script chars at a script break, so in the case of e.g. Cyrillic name (Latin name), the segmentation is done at the space before the paren. 2015-09-23 00:37:59 -04:00
Al
19e5457a0f [unicode] Regenerated unicode scripts data file, using simple integers instead of repeating the enum types for succinctness 2015-09-23 00:36:29 -04:00
Al
4ad3fac627 [unicode] Regenerated unicode script types (ignore extraneous scripts, they're not used, just reside in the upper unicode planes) 2015-09-23 00:35:08 -04:00
Al
13bcc35523 [unicode] Allowing wide chars in unicode properties 2015-09-23 00:34:07 -04:00
Al
f13e9fad90 [tokenization] Regenerated scanner.c 2015-09-23 00:33:27 -04:00
Al
b4593b6f88 [unicode/tokenization] Using new character classes including wide chars in scanner 2015-09-23 00:33:14 -04:00
Al
a76831df7a [unicode] Wide version of word breaks 2015-09-22 18:55:33 -04:00
Al
25917cfb17 [fix] scripts 2015-09-22 15:15:30 -04:00
Al
b405a53fe1 [fix] chars out of range in get_string_script Python version 2015-09-22 08:14:27 -04:00
Al
ca25b48687 [fix] Not writing empty fields in formatted addresses 2015-09-22 08:13:55 -04:00
Al
747de1944b [fix] Accounting for unknown scripts in disambiguation 2015-09-21 18:05:28 -04:00
Al
3ac89d7ed9 [setup] fixing packaging 2015-09-21 17:31:15 -04:00
Al
236737eab3 [tokenization/osm] Using utf8 encoded version of string for tokens in python tokenizer 2015-09-21 17:27:43 -04:00
Al
134cf616d6 [osm] Using street for language disambiguation in training data 2015-09-21 04:09:15 -04:00
Al
ccac4a5a90 [fix] package directory 2015-09-21 04:01:36 -04:00
Al
5f912ddcd3 [fix] std=c99 2015-09-21 03:25:32 -04:00
Al
5b2fd0be50 [fix] pytokenize compilation on Ubuntu/gcc 2015-09-21 03:24:14 -04:00
Al
cffa5a4a20 [fix] stdint include 2015-09-20 20:10:47 -04:00
Al
25b3338600 [setup] setup.py for pypostal so it can be installed from the Github url 2015-09-20 20:07:59 -04:00
Al
84cf21df88 [osm] Separating address formatter into its own module, adding some documentation of the various training sets with examples 2015-09-20 20:05:46 -04:00
Al
5485ea2197 [python] Adding initial pypostal bindings for tokenize so we can remove address_normalizer dependency. Not tested on Python 3. 2015-09-20 14:59:39 -04:00
Al
3fab0f984f [fix] fixing some compiler warnings, using type-specific abs functions for vector_math 2015-09-19 16:11:09 -04:00
Al
6731395ca0 [osm] Separating tagged from untagged output 2015-09-19 14:11:47 -04:00
Al
2940cc15b8 [fix] tokenized string destroy frees original string 2015-09-19 01:40:41 -04:00
Al
2b13871341 [constants] max country code length 2015-09-19 01:39:58 -04:00
Al
0396823772 [fix] geodb path separator 2015-09-19 01:39:31 -04:00
Al
17cfdb0625 [fix] adding char_array_append_* methods to header 2015-09-18 13:19:42 -04:00
Al
f2f7db92ff [fix] phrases 2015-09-18 13:19:18 -04:00
Al
b74e92adad [fix] include 2015-09-18 13:18:49 -04:00
Al
2a869894d9 [fix] geodb 2015-09-18 13:18:26 -04:00
Al
9e9131bda0 [parser] Averaged perceptron tagger 2015-09-17 05:51:24 -04:00
Al
8a86f7ec64 [parser] Adding context struct to feature function 2015-09-17 05:48:00 -04:00
Al
87ed7d9a0f [geodb] Adding trie search methods for finding geodb phrases 2015-09-16 22:11:10 -04:00
Al
e62c75b9c6 [phrases] Adding _with_phrases versions of address dictionary methods for pre-allocated phrases 2015-09-16 21:24:28 -04:00
Al
23103a21d4 [phrases] Adding with_phrases versions of trie search methods for pre-allocated phrases 2015-09-16 21:23:34 -04:00
Al
d5ec005787 [transliteration] Similar init method for transliteration 2015-09-16 21:14:02 -04:00
Al
b11362ab98 [numex] using module init method for building, otherwise passing NULL path uses the default path 2015-09-16 21:13:05 -04:00
Al
3cba2e8df3 [api] Using default setup methods for submodules in libpostal setup 2015-09-15 14:01:33 -04:00
Al
e122824448 [expansion] Adding the ability to search address dictionary phrases with a NULL language, will return phrases in any language 2015-09-15 14:00:26 -04:00
Al
c47ff1b113 [utils] Adding source string to tokenized_string struct 2015-09-15 13:21:51 -04:00
Al
b2f690b6f6 [api] Error logging if modules can't be found 2015-09-15 13:21:15 -04:00
Al
9de3029dd3 [parser] Averaged perceptron training does full examples (greedily). During training, features are a hashtable, sorted and converted to a trie during finalize 2015-09-14 17:38:45 -04:00
Al
a5b5f80b04 [fix] new_copy 2015-09-14 16:50:23 -04:00
Al
3ea6358f77 [fix] vector zeros allocation 2015-09-14 16:50:08 -04:00
Al
c21f61b9b4 [parser] Default address parser path 2015-09-11 15:05:38 -07:00
Al
32c180528f [tokens] Adding a copy_tokens option for tokenized_string 2015-09-11 15:04:29 -07:00
Al
9ce658b7a3 [collections] Adding string_array for an array of char pointers 2015-09-10 16:34:16 -07:00
Al
35b9122a1a [utils] inlining a few functions 2015-09-10 16:33:54 -07:00