libpostal

Author	SHA1	Message	Date
Al	8d64c9301e	[transliteration] Re-generating transliteration data file	2015-06-29 15:03:59 -04:00
Al	3279b31b09	[tokenization] Adding an acronym token type for things like U.N. so we can delete internal periods on those tokens	2015-06-29 03:00:46 -04:00
Al	47efce4b7e	[transliteration] Stopping set check loop on empty transition	2015-06-28 20:46:23 -04:00
Al	cc0401a8d1	[utf8] Adding a boolean struct member for string_script_t return values, set to true if the string is ASCII (no transliteration needed, should be frequent for English addresses)	2015-06-28 19:37:58 -04:00
Al	f0bf7e750c	[transliteration] Fixing edge case in transliteration where a naked character fails context matching but the set-wrapped version matches	2015-06-28 15:19:19 -04:00
Al	a5dacf3d2b	[utils] Adding method to get a particular token alternative from a string tree	2015-06-28 15:15:29 -04:00
Al	246237c1f1	[transliteration] Adding a get_transliteration_table() to foreach_transliterator macro since it lives in the header	2015-06-28 15:14:49 -04:00
Al	7c161ee5b6	[numex] Regenerating numex data file	2015-06-26 12:36:40 -04:00
Al	6a8ab48662	[numex] Adding method to get ordinal suffixes, using single representation	2015-06-25 17:28:06 -04:00
Al	9337bf9aea	[phrases] trie_search_suffixes uses the NUL-byte prefix by default but the _from_index version can start from another node. fixing single character suffixes	2015-06-25 17:24:19 -04:00
Al	82e85732c4	[fix] Setting codepoint in utf8proc_iterate_reversed	2015-06-25 17:20:55 -04:00
Al	4fbcb72368	[fix] utf8proc option	2015-06-25 10:07:37 -04:00
Al	c376bcef3d	[utils] get_string_script returns a struct rather than modifying a pointer for the length	2015-06-25 10:06:38 -04:00
Al	bcee9832b3	[utils] cstring_array_get_token=>cstring_array_get_string	2015-06-25 10:05:35 -04:00
Al	2b69c185fa	[tokenization] Adding a tokenizer method for appending to an existing tokens array (e.g. can stop/start tokenizing on a script change)	2015-06-25 10:03:34 -04:00
Al	581cf406a6	[utf8] Adding length argument to string_script function	2015-06-24 13:39:09 -05:00
Al	5e71a9d805	[utf8] Adding method to get the script of a string and the length of the span (rolls Common script up with the previous script)	2015-06-24 13:29:40 -05:00
Al	85348e1178	[fix] enum value conflicted with existing name	2015-06-23 15:38:59 -05:00
Al	077e7fd5e2	[transliteration] Adding script/language lookups and I/O	2015-06-23 15:35:52 -05:00
Al	423d9ca7b7	[transliteration] table builder adds script/language rules	2015-06-23 15:35:16 -05:00
Al	c3143e5291	[transliteration] Adding structs/header stuff for transliterator lookup by script/language	2015-06-23 15:34:38 -05:00
Al	8fb6a28e9c	[fix] using empty string instead of NULL for script languages so we can use fixed length arrays	2015-06-23 15:20:09 -05:00
Al	f2d03a7937	[fix] renaming structure	2015-06-23 02:12:24 -05:00
Al	7dd772de0f	[fix] implementation of cstring_array_split	2015-06-23 02:11:24 -05:00
Al	d4cae97fd3	[transliteration] regenerated scripts data file	2015-06-23 02:10:10 -05:00
Al	2e54ca3575	[transliteration] including script data file, adding len to transliterate API for tokenized transliteration	2015-06-21 05:42:20 -05:00
Al	79530ae974	[transliteration] Adding transliteration script data file	2015-06-21 05:39:06 -05:00
Al	f8bff25948	[bloom] bloom filter I/O	2015-06-20 12:29:11 -05:00
Al	0ed80c3f6e	[geonames] Geonames generic serialization/deserialization	2015-06-20 12:00:15 -05:00
Al	bc306fc6c8	[fix] removing unused vars	2015-06-18 00:33:03 -04:00
Al	8792c38b52	[transliteration] Getting pre-context matching correct for > 1 char contexts, refining pre/post context matching in cases with an empty transition or an empty repeat, falling back to the original character in cases e.g. if there are Latin characters in a Hangul token	2015-06-17 23:51:19 -04:00
Al	be8353ad9b	[transliteration] Regenerated script data	2015-06-17 23:46:29 -04:00
Al	2408cfa6f0	[transliteration] Re-generating data file	2015-06-17 23:45:56 -04:00
Al	880d444881	[tokenization] Re-generating scanner	2015-06-16 12:52:37 -04:00
Al	77760f207c	[tokenization] Adding a Hangul syllable class in tokenization for syllables written out as Jamo	2015-06-16 12:52:04 -04:00
Al	e3dffc177c	[fix] gazetteers typo	2015-06-12 17:26:14 -04:00
Al	5f5efad6ac	[numex] Working numex implementation. Tested on most languages, Germanic, Latin/whole_tokens_only, English concatenated or with separators, French numerals like quatre-vingt-douze, Spanish multiple-token ordinals, Japanese numerals, etc. All looking good	2015-06-12 16:21:36 -04:00
Al	c159f83f9b	[fix] trie_search logging	2015-06-12 16:17:41 -04:00
Al	a100cd83c9	[numex] Re-generated numex data file	2015-06-12 16:15:53 -04:00
Al	8520df96c8	[utils] utf8 comparison can handle a non-valid UTF-8 sequence e.g. for trie suffix comparison where we may be in the middle of a multi-byte character. Adding a standard utf8_common_prefix method	2015-06-12 16:11:40 -04:00
Al	5c2839e534	[numex] header and table builder changes to support whole words languages	2015-06-12 16:10:57 -04:00
Al	6b60446dbe	[phrases] no longer ignoring spaces in the input string, just trying different methods for hyphens, getting indexes right in the case where a space or hyphen precedes the match and backtracking on matches if the rest of the string falls off the trie	2015-06-12 11:30:24 -04:00
Al	3442b9ad92	[utils] require at least one non-space/non-hyphen match in utf8_common_prefix_len_ignore_separators	2015-06-12 11:19:37 -04:00
Al	6841ed8fb3	[phrases] Ignoring separators and dashes in trie_search_prefixes so it can be used for languages like German where numbers, phrases, etc. may just be concatenated together as a single token	2015-06-11 11:05:56 -04:00
Al	ab5ea6d791	[utils] Common prefix-style return value instead of a utf8 strcmp	2015-06-11 10:59:51 -04:00
Al	aad5f3edd3	[utils] UTF-8 lowercasing and string comparison, including a version which ignores dashes/spaces	2015-06-10 18:27:14 -04:00
Al	cb603562e0	[phrases] Adding *_from_index methods to trie_search	2015-06-09 11:14:42 -04:00
Al	81be8e771e	[numex] regen data file. utf8_is_hyphen requires a character, all other methods use category	2015-06-08 21:32:38 -04:00
Al	c1bed8b410	[numex] header changes	2015-06-08 21:29:36 -04:00
Al	fd1ebba720	[numex] Initial implementation of multilingual numeric expression parser	2015-06-08 21:29:04 -04:00

203 Commits