Al
|
f6c30778bf
|
[normalize] New token normalization option for replacing digits with 'D' for masking numbers e.g. when learning patterns (so 1234 and 5678 both normalize to DDDD). Shouldn't be used by libpostal API, just by the feature extractors in the machine learning models. Also adding better possessive handling.
|
2015-09-23 19:41:01 -04:00 |
|
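As a rough illustration of the digit masking described in the commit above, here is a minimal Python sketch. The function name `mask_digits` and its `mask` parameter are hypothetical, chosen for this example only; they are not libpostal's actual normalization option or API.

```python
def mask_digits(token, mask="D"):
    """Replace every digit character in a token with a mask character.

    Hypothetical helper for illustration only; not part of libpostal.
    """
    # str.isdigit() is used here for simplicity; it also matches some
    # non-ASCII digit characters, which may or may not be desired.
    return "".join(mask if ch.isdigit() else ch for ch in token)


if __name__ == "__main__":
    # Tokens with the same digit "shape" collapse to the same pattern.
    print(mask_digits("1234"))
    print(mask_digits("5678"))
    print(mask_digits("221B"))
```

Masking like this lets a feature extractor learn the shape of a number (for example, a four-digit house number) rather than memorizing individual values.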
Al
|
a1d272077d
|
[doc] Averaged perceptron tagger
|
2015-09-23 19:37:55 -04:00 |
|
Al
|
4a0da67aa1
|
[fix] warning
|
2015-09-23 04:06:54 -04:00 |
|
Al
|
88bd0cd158
|
[unicode] better segmentation on script breaks
|
2015-09-23 04:06:34 -04:00 |
|
Al
|
377c947541
|
[transliteration] Regenerating transliteration data files
|
2015-09-23 04:04:38 -04:00 |
|
Al
|
abfb1d4a60
|
[transliteration] Wide char support in transliteration data generator
|
2015-09-23 03:56:12 -04:00 |
|
Al
|
7e057b0fb8
|
[utils] basic functions for wide char support for narrow Python builds (unichr, ord, unicode iteration)
|
2015-09-23 00:42:54 -04:00 |
|
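On a narrow Python build, characters above U+FFFF are stored as UTF-16 surrogate pairs, so helpers like the ones this commit mentions have to join and split those pairs by hand. The sketch below shows the underlying arithmetic on plain integer code units. All function names are hypothetical and are not taken from the repository; it is written to run on a current Python 3 interpreter rather than on a narrow build.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def iter_code_points(units):
    """Yield code points from UTF-16 code units, joining surrogate pairs."""
    units = list(units)
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            yield from_surrogate_pair(unit, units[i + 1])
            i += 2
        else:
            # BMP character (or an unpaired surrogate, passed through as-is).
            yield unit
            i += 1
```

For example, U+1F600 splits into the pair (0xD83D, 0xDE00), and iterating over the units `[0x41, 0xD83D, 0xDE00]` yields the two code points 0x41 and 0x1F600.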
Al
|
8562c7a5cb
|
[unicode] Adding wide char support for language disambiguation (comes up in venue names), despite the likelihood of running on a narrow Python build. Rolling back common script chars at a script break, so in the case of e.g. Cyrillic name (Latin name), the segmentation is done at the space before the paren.
|
2015-09-23 00:37:59 -04:00 |
|
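The rollback described in the commit above can be sketched as follows. This is only an approximation under stated assumptions: the script of a letter is guessed from the first word of its Unicode character name (a crude stand-in for real script data), every non-letter is treated as "common", and the names `script_of` and `segment_by_script` are hypothetical rather than the repository's own.

```python
import unicodedata


def script_of(ch):
    """Crude script guess: first word of the Unicode name for letters."""
    if not ch.isalpha():
        return "common"
    name = unicodedata.name(ch, "")
    return name.split(" ")[0].lower() if name else "unknown"


def segment_by_script(text):
    """Split text into (script, substring) runs.

    Common characters (spaces, punctuation, digits) trailing a run are
    rolled back at a script break so they start the next run instead.
    """
    segments = []
    current = None        # script of the run being built
    start = 0             # start index of the run being built
    last_non_common = 0   # index just past the run's last scripted char
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script == "common":
            continue
        if current is None:
            current = script
        elif script != current:
            # Script break: close the run at its last scripted character,
            # handing any trailing common characters to the new run.
            segments.append((current, text[start:last_non_common]))
            start = last_non_common
            current = script
        last_non_common = i + 1
    if text[start:]:
        segments.append((current or "common", text[start:]))
    return segments
```

With the input `"Москва (Moscow)"`, this sketch breaks at the space before the parenthesis, giving a Cyrillic run `"Москва"` followed by a Latin run `" (Moscow)"`.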
Al
|
19e5457a0f
|
[unicode] Regenerated unicode scripts data file, using simple integers instead of repeating the enum types for succinctness
|
2015-09-23 00:36:29 -04:00 |
|
Al
|
4ad3fac627
|
[unicode] Regenerated unicode script types (ignore extraneous scripts, they're not used, just reside in the upper unicode planes)
|
2015-09-23 00:35:08 -04:00 |
|
Al
|
13bcc35523
|
[unicode] Allowing wide chars in unicode properties
|
2015-09-23 00:34:07 -04:00 |
|
Al
|
f13e9fad90
|
[tokenization] Regenerated scanner.c
|
2015-09-23 00:33:27 -04:00 |
|
Al
|
b4593b6f88
|
[unicode/tokenization] Using new character classes including wide chars in scanner
|
2015-09-23 00:33:14 -04:00 |
|
Al
|
a76831df7a
|
[unicode] Wide version of word breaks
|
2015-09-22 18:55:33 -04:00 |
|
Al
|
25917cfb17
|
[fix] scripts
|
2015-09-22 15:15:30 -04:00 |
|
Al
|
b405a53fe1
|
[fix] chars out of range in get_string_script Python version
|
2015-09-22 08:14:27 -04:00 |
|
Al
|
ca25b48687
|
[fix] Not writing empty fields in formatted addresses
|
2015-09-22 08:13:55 -04:00 |
|
Al
|
747de1944b
|
[fix] Accounting for unknown scripts in disambiguation
|
2015-09-21 18:05:28 -04:00 |
|
Al
|
3ac89d7ed9
|
[setup] fixing packaging
|
2015-09-21 17:31:15 -04:00 |
|
Al
|
236737eab3
|
[tokenization/osm] Using utf8 encoded version of string for tokens in python tokenizer
|
2015-09-21 17:27:43 -04:00 |
|
Al
|
134cf616d6
|
[osm] Using street for language disambiguation in training data
|
2015-09-21 04:09:15 -04:00 |
|
Al
|
ccac4a5a90
|
[fix] package directory
|
2015-09-21 04:01:36 -04:00 |
|
Al
|
5f912ddcd3
|
[fix] std=c99
|
2015-09-21 03:25:32 -04:00 |
|
Al
|
5b2fd0be50
|
[fix] pytokenize compilation on Ubuntu/gcc
|
2015-09-21 03:24:14 -04:00 |
|
Al
|
cffa5a4a20
|
[fix] stdint include
|
2015-09-20 20:10:47 -04:00 |
|
Al
|
25b3338600
|
[setup] setup.py for pypostal so it can be installed from the GitHub URL
|
2015-09-20 20:07:59 -04:00 |
|
Al
|
84cf21df88
|
[osm] Separating address formatter into its own module, adding some documentation of the various training sets with examples
|
2015-09-20 20:05:46 -04:00 |
|
Al
|
5485ea2197
|
[python] Adding initial pypostal bindings for tokenize so we can remove address_normalizer dependency. Not tested on Python 3.
|
2015-09-20 14:59:39 -04:00 |
|
Al
|
3fab0f984f
|
[fix] fixing some compiler warnings, using type-specific abs functions for vector_math
|
2015-09-19 16:11:09 -04:00 |
|
Al
|
6731395ca0
|
[osm] Separating tagged from untagged output
|
2015-09-19 14:11:47 -04:00 |
|
Al
|
2940cc15b8
|
[fix] tokenized string destroy frees original string
|
2015-09-19 01:40:41 -04:00 |
|
Al
|
2b13871341
|
[constants] max country code length
|
2015-09-19 01:39:58 -04:00 |
|
Al
|
0396823772
|
[fix] geodb path separator
|
2015-09-19 01:39:31 -04:00 |
|
Al
|
17cfdb0625
|
[fix] adding char_array_append_* methods to header
|
2015-09-18 13:19:42 -04:00 |
|
Al
|
f2f7db92ff
|
[fix] phrases
|
2015-09-18 13:19:18 -04:00 |
|
Al
|
b74e92adad
|
[fix] include
|
2015-09-18 13:18:49 -04:00 |
|
Al
|
2a869894d9
|
[fix] geodb
|
2015-09-18 13:18:26 -04:00 |
|
Al
|
9e9131bda0
|
[parser] Averaged perceptron tagger
|
2015-09-17 05:51:24 -04:00 |
|
Al
|
8a86f7ec64
|
[parser] Adding context struct to feature function
|
2015-09-17 05:48:00 -04:00 |
|
Al
|
87ed7d9a0f
|
[geodb] Adding trie search methods for finding geodb phrases
|
2015-09-16 22:11:10 -04:00 |
|
Al
|
e62c75b9c6
|
[phrases] Adding _with_phrases versions of address dictionary methods for pre-allocated phrases
|
2015-09-16 21:24:28 -04:00 |
|
Al
|
23103a21d4
|
[phrases] Adding with_phrases versions of trie search methods for pre-allocated phrases
|
2015-09-16 21:23:34 -04:00 |
|
Al
|
d5ec005787
|
[transliteration] Similar init method for transliteration
|
2015-09-16 21:14:02 -04:00 |
|
Al
|
b11362ab98
|
[numex] using module init method for building, otherwise passing NULL path uses the default path
|
2015-09-16 21:13:05 -04:00 |
|
Al
|
3cba2e8df3
|
[api] Using default setup methods for submodules in libpostal setup
|
2015-09-15 14:01:33 -04:00 |
|
Al
|
e122824448
|
[expansion] Adding the ability to search address dictionary phrases with a NULL language, will return phrases in any language
|
2015-09-15 14:00:26 -04:00 |
|
Al
|
c47ff1b113
|
[utils] Adding source string to tokenized_string struct
|
2015-09-15 13:21:51 -04:00 |
|
Al
|
b2f690b6f6
|
[api] Error logging if modules can't be found
|
2015-09-15 13:21:15 -04:00 |
|
Al
|
9de3029dd3
|
[parser] Averaged perceptron training does full examples (greedily). During training, features are a hashtable, sorted and converted to a trie during finalize
|
2015-09-14 17:38:45 -04:00 |
|
Al
|
a5b5f80b04
|
[fix] new_copy
|
2015-09-14 16:50:23 -04:00 |
|