libpostal

Author	SHA1	Message	Date
Al	9337bf9aea	[phrases] trie_search_suffixes uses the NUL-byte prefix by default but the _from_index version can start from another node. fixing single character suffixes	2015-06-25 17:24:19 -04:00
Al	82e85732c4	[fix] Setting codepoint in utf8proc_iterate_reversed	2015-06-25 17:20:55 -04:00
Al	4fbcb72368	[fix] utf8proc option	2015-06-25 10:07:37 -04:00
Al	c376bcef3d	[utils] get_string_script returns a struct rather than modifying a pointer for the length	2015-06-25 10:06:38 -04:00
Al	bcee9832b3	[utils] cstring_array_get_token=>cstring_array_get_string	2015-06-25 10:05:35 -04:00
Al	2b69c185fa	[tokenization] Adding a tokenizer method for appending to an existing tokens array (e.g. can stop/start tokenizing on a script change)	2015-06-25 10:03:34 -04:00
Al	581cf406a6	[utf8] Adding length argument to string_script function	2015-06-24 13:39:09 -05:00
Al	5e71a9d805	[utf8] Adding method to get the script of a string and the length of the span (rolls Common script up with the previous script)	2015-06-24 13:29:40 -05:00
Al	85348e1178	[fix] enum value conflicted with existing name	2015-06-23 15:38:59 -05:00
Al	077e7fd5e2	[transliteration] Adding script/language lookups and I/O	2015-06-23 15:35:52 -05:00
Al	423d9ca7b7	[transliteration] table builder adds script/language rules	2015-06-23 15:35:16 -05:00
Al	c3143e5291	[transliteration] Adding structs/header stuff for transliterator lookup by script/language	2015-06-23 15:34:38 -05:00
Al	8fb6a28e9c	[fix] using empty string instead of NULL for script languages so we can use fixed length arrays	2015-06-23 15:20:09 -05:00
Al	f2d03a7937	[fix] renaming structure	2015-06-23 02:12:24 -05:00
Al	7dd772de0f	[fix] implementation of cstring_array_split	2015-06-23 02:11:24 -05:00
Al	d4cae97fd3	[transliteration] regenerated scripts data file	2015-06-23 02:10:10 -05:00
Al	b21c3a3a2f	[transliteration] using different struct in script data header file	2015-06-22 22:06:16 -05:00
Al	2e54ca3575	[transliteration] including script data file, adding len to transliterate API for tokenized transliteration	2015-06-21 05:42:20 -05:00
Al	79530ae974	[transliteration] Adding transliteration script data file	2015-06-21 05:39:06 -05:00
Al	c2b4744f55	[transliteration] Using a data file instead of a header for transliteration scripts	2015-06-21 05:37:56 -05:00
Al	b2e201f297	[fix] trailing comma	2015-06-20 15:14:41 -05:00
Al	f8bff25948	[bloom] bloom filter I/O	2015-06-20 12:29:11 -05:00
Al	0ed80c3f6e	[geonames] Geonames generic serialization/deserialization	2015-06-20 12:00:15 -05:00
Al	d4087be40c	[geonames] Pre-escaping tabs, no quoting in geonames/postal code TSVs	2015-06-20 11:54:47 -05:00
Al	ab1fb3669f	[geonames] Only take alternative names that are != to the canonical name, sort by name, population desc, geonames_id	2015-06-19 15:47:50 -05:00
Al	bc306fc6c8	[fix] removing unused vars	2015-06-18 00:33:03 -04:00
Al	8792c38b52	[transliteration] Getting pre-context matching correct for > 1 char contexts, refining pre/post context matching in cases with an empty transition or an empty repeat, falling back to the original character in cases e.g. if there are Latin characters in a Hangul token	2015-06-17 23:51:19 -04:00
Al	be8353ad9b	[transliteration] Regenerated script data	2015-06-17 23:46:29 -04:00
Al	2408cfa6f0	[transliteration] Re-generating data file	2015-06-17 23:45:56 -04:00
Al	84b9a6ff33	[transliteration] Adding Hangul-Latin and Jamo-Latin back into the mix with a restricted filter. Reversing all previous contexts by character group	2015-06-17 23:42:31 -04:00
Al	880d444881	[tokenization] Re-generating scanner	2015-06-16 12:52:37 -04:00
Al	77760f207c	[tokenization] Adding a Hangul syllable class in tokenization for syllables written out as Jamo	2015-06-16 12:52:04 -04:00
Al	f04fad0e93	[i18n] Generating Hangul syllable classes	2015-06-16 12:50:48 -04:00
Al	cb2035867b	[fix] osm geodata imports	2015-06-15 18:36:01 -04:00
Al	d2d25ead6f	[utils] Adding unicode_csv module	2015-06-15 18:06:54 -04:00
Al	651f91fc11	[polygons] Adding language exceptions, now including osm relation ids	2015-06-15 18:04:44 -04:00
Al	ccb64f7ac2	[polygons] Adding address_normalizer polygons package	2015-06-15 17:55:27 -04:00
Al	22fa81b33f	[fix] __init__.py	2015-06-15 17:54:27 -04:00
Al	41dbd97bf2	[geodisambig] quattroshapes download can use default or specified location, unzips files	2015-06-15 17:54:08 -04:00
Al	037d4575ae	[geodisambig] Modifying GeoNames TSV again. Using files again and sorting	2015-06-15 17:51:09 -04:00
Al	67bd9f1a31	[i18n] Adding languages.py	2015-06-15 17:48:47 -04:00
Al	073fe43698	[geodisambig] Adding quattroshapes download script	2015-06-15 17:46:11 -04:00
Al	73f37fe66b	[fix] Moving default Geonames DB path to a shared module	2015-06-15 12:53:00 -04:00
Al	7a4fa7d443	[geodisambig] Canonical country names from CLDR, adding alpha-2 and alpha-3 surface forms, writing results to stdout or a file for streaming	2015-06-15 01:58:43 -04:00
Al	43e023077c	[fix] Changing logging to stderr for the Geonames scripts	2015-06-14 15:38:57 -04:00
Al	e3dffc177c	[fix] gazetteers typo	2015-06-12 17:26:14 -04:00
Al	5f5efad6ac	[numex] Working numex implementation. Tested on most languages, Germanic, Latin/whole_tokens_only, English concatenated or with separators, French numerals like quatre-vingt-douze, Spanish multiple-token ordinals, Japanese numerals, etc. All looking good	2015-06-12 16:21:36 -04:00
Al	c159f83f9b	[fix] trie_search logging	2015-06-12 16:17:41 -04:00
Al	a100cd83c9	[numex] Re-generated numex data file	2015-06-12 16:15:53 -04:00
Al	8520df96c8	[utils] utf8 comparison can handle a non-valid UTF-8 sequence e.g. for trie suffix comparison where we may be in the middle of a multi-byte character. Adding a standard utf8_common_prefix method	2015-06-12 16:11:40 -04:00

290 Commits