libpostal

Author	SHA1	Message	Date
Al	f6c30778bf	[normalize] New token normalization option for replacing digits with 'D' for masking numbers e.g. when learning patterns (so 1234 and 5678 both normalize to DDDD). Shouldn't be used by libpostal API, just by the feature extractors in the machine learning models. Also adding better possessive handling.	2015-09-23 19:41:01 -04:00
Al	a1d272077d	[doc] Averaged perceptron tagger	2015-09-23 19:37:55 -04:00
Al	4a0da67aa1	[fix] warning	2015-09-23 04:06:54 -04:00
Al	88bd0cd158	[unicode] better segmentation on script breaks	2015-09-23 04:06:34 -04:00
Al	377c947541	[transliteration] Regenerating transliteration data files	2015-09-23 04:04:38 -04:00
Al	abfb1d4a60	[transliteration] Wide char support in transliteration data generator	2015-09-23 03:56:12 -04:00
Al	7e057b0fb8	[utils] basic functions for wide char support for narrow Python builds (unichr, ord, unicode iteration)	2015-09-23 00:42:54 -04:00
Al	8562c7a5cb	[unicode] Adding wide char support for language disambiguation (comes up in venue names), despite the likelihood of running on a narrow Python build. Rolling back common script chars at a script break, so in the case of e.g. Cyrillic name (Latin name), the segmentation is done at the space before the paren.	2015-09-23 00:37:59 -04:00
Al	19e5457a0f	[unicode] Regenerated unicode scripts data file, using simple integers instead of repeating the enum types for succinctness	2015-09-23 00:36:29 -04:00
Al	4ad3fac627	[unicode] Regenerated unicode script types (ignore extraneous scripts, they're not used, just reside in the upper unicode planes)	2015-09-23 00:35:08 -04:00
Al	13bcc35523	[unicode] Allowing wide chars in unicode properties	2015-09-23 00:34:07 -04:00
Al	f13e9fad90	[tokenization] Regenerated scanner.c	2015-09-23 00:33:27 -04:00
Al	b4593b6f88	[unicode/tokenization] Using new character classes including wide chars in scanner	2015-09-23 00:33:14 -04:00
Al	a76831df7a	[unicode] Wide version of word breaks	2015-09-22 18:55:33 -04:00
Al	25917cfb17	[fix] scripts	2015-09-22 15:15:30 -04:00
Al	b405a53fe1	[fix] chars out of range in get_string_script Python version	2015-09-22 08:14:27 -04:00
Al	ca25b48687	[fix] Not writing empty fields in formatted addresses	2015-09-22 08:13:55 -04:00
Al	747de1944b	[fix] Accounting for unknown scripts in disambiguation	2015-09-21 18:05:28 -04:00
Al	3ac89d7ed9	[setup] fixing packaging	2015-09-21 17:31:15 -04:00
Al	236737eab3	[tokenization/osm] Using utf8 encoded version of string for tokens in python tokenizer	2015-09-21 17:27:43 -04:00
Al	134cf616d6	[osm] Using street for language disambiguation in training data	2015-09-21 04:09:15 -04:00
Al	ccac4a5a90	[fix] package directory	2015-09-21 04:01:36 -04:00
Al	5f912ddcd3	[fix] std=c99	2015-09-21 03:25:32 -04:00
Al	5b2fd0be50	[fix] pytokenize compilation on Ubuntu/gcc	2015-09-21 03:24:14 -04:00
Al	cffa5a4a20	[fix] stdint include	2015-09-20 20:10:47 -04:00
Al	25b3338600	[setup] setup.py for pypostal so it can be installed from the Github url	2015-09-20 20:07:59 -04:00
Al	84cf21df88	[osm] Separating address formatter into its own module, adding some documentation of the various training sets with examples	2015-09-20 20:05:46 -04:00
Al	5485ea2197	[python] Adding initial pypostal bindings for tokenize so we can remove address_normalizer dependency. Not tested on Python 3.	2015-09-20 14:59:39 -04:00
Al	3fab0f984f	[fix] fixing some compiler warnings, using type-specific abs functions for vector_math	2015-09-19 16:11:09 -04:00
Al	6731395ca0	[osm] Separating tagged from untagged output	2015-09-19 14:11:47 -04:00
Al	2940cc15b8	[fix] tokenized string destroy frees original string	2015-09-19 01:40:41 -04:00
Al	2b13871341	[constants] max country code length	2015-09-19 01:39:58 -04:00
Al	0396823772	[fix] geodb path separator	2015-09-19 01:39:31 -04:00
Al	17cfdb0625	[fix] adding char_array_append_* methods to header	2015-09-18 13:19:42 -04:00
Al	f2f7db92ff	[fix] phrases	2015-09-18 13:19:18 -04:00
Al	b74e92adad	[fix] include	2015-09-18 13:18:49 -04:00
Al	2a869894d9	[fix] geodb	2015-09-18 13:18:26 -04:00
Al	9e9131bda0	[parser] Averaged perceptron tagger	2015-09-17 05:51:24 -04:00
Al	8a86f7ec64	[parser] Adding context struct to feature function	2015-09-17 05:48:00 -04:00
Al	87ed7d9a0f	[geodb] Adding trie search methods for finding geodb phrases	2015-09-16 22:11:10 -04:00
Al	e62c75b9c6	[phrases] Adding _with_phrases versions of address dictionary methods for pre-allocated phrases	2015-09-16 21:24:28 -04:00
Al	23103a21d4	[phrases] Adding with_phrases versions of trie search methods for pre-allocated phrases	2015-09-16 21:23:34 -04:00
Al	d5ec005787	[transliteration] Similar init method for transliteration	2015-09-16 21:14:02 -04:00
Al	b11362ab98	[numex] using module init method for building, otherwise passing NULL path uses the default path	2015-09-16 21:13:05 -04:00
Al	3cba2e8df3	[api] Using default setup methods for submodules in libpostal setup	2015-09-15 14:01:33 -04:00
Al	e122824448	[expansion] Adding the ability to search address dictionary phrases with a NULL language, will return phrases in any language	2015-09-15 14:00:26 -04:00
Al	c47ff1b113	[utils] Adding source string to tokenized_string struct	2015-09-15 13:21:51 -04:00
Al	b2f690b6f6	[api] Error logging if modules can't be found	2015-09-15 13:21:15 -04:00
Al	9de3029dd3	[parser] Averaged perceptron training does full examples (greedily). During training, features are a hashtable, sorted and converted to a trie during finalize	2015-09-14 17:38:45 -04:00
Al	a5b5f80b04	[fix] new_copy	2015-09-14 16:50:23 -04:00

826 Commits
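
Commit f6c30778bf above describes a token normalization option that replaces digits with 'D', so that 1234 and 5678 both normalize to DDDD. The following is a minimal sketch of that idea only, not libpostal's implementation; the function name mask_digits is hypothetical and chosen for illustration.

```python
import re

# Matches any single decimal digit (on str, \d also covers non-ASCII digits).
_DIGIT = re.compile(r"\d")


def mask_digits(token: str) -> str:
    """Hypothetical helper: replace every digit in a token with 'D'.

    Non-digit characters are left untouched, so the length and shape of
    the token are preserved while the specific number is masked.
    """
    return _DIGIT.sub("D", token)


if __name__ == "__main__":
    for token in ("1234", "5678", "221B"):
        print(token, "->", mask_digits(token))
```

As the commit message notes, this kind of masking is meant for feature extraction in the machine learning models rather than for output returned by the libpostal API, since the original number cannot be recovered from the masked token.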