libpostal

Author	SHA1	Message	Date
Al	1a64ad682b	[merge] merging in the Ohio expansion numex changes from master	2017-11-29 11:51:43 -05:00
Al Barrentine	7d001489ef	Merge pull request #274 from openvenues/fix_oh_expansion: Context-sensitive expansion of words like "oh" inside vs. outside numeric expressions	2017-11-24 17:13:24 -05:00
Al	ebe7fc9be9	[test] missing paren in Columbus, OH test. Adding test for "oh" as part of a number in Nineteen oh one W El Segundo Blvd	2017-11-24 16:11:07 -05:00
Al	d7f22544b4	[test] adding an expansion test for the Columbus, OH case	2017-11-24 15:44:37 -05:00
Al	ef098fd2e7	[numex] implementing the numex concat_only_if_number left context, which helps in the case of e.g. Columbus, OH in #271	2017-11-24 15:42:50 -05:00
Al	c276cf1529	[numex] adding a new type of left context for numeric expressions called concat_only_if_number (for something like "oh", which can be "Columbus, OH" or something like "Twenty-One Oh One")	2017-11-24 15:36:53 -05:00
Travis	7d6e648fc3	[auto][ci skip] Adding data files from Travis build #271	2017-11-17 19:36:25 +00:00
Al Barrentine	27b3e99515	Merge pull request #269 from Jeffrey04/ms-dictionary-expansion-1.0: Ms dictionary expansion for 1.0	2017-11-17 14:20:43 -05:00
jeffrey04	86c3105d44	new names with alternate spelling	2017-11-16 11:23:20 +08:00
jeffrey04	e9d2ab6400	reordered list of synonyms	2017-11-16 11:22:42 +08:00
jeffrey04	b3d306456f	new synonyms	2017-11-16 11:22:14 +08:00
jeffrey04	0d76d190e1	updated street types	2017-11-16 11:21:39 +08:00
jeffrey04	f726970d2b	updated qualifiers	2017-11-16 11:20:20 +08:00
jeffrey04	39fd7f0cb1	list of titles update	2017-11-16 11:18:18 +08:00
jeffrey04	865f99a0c1	sorted place names	2017-11-16 11:04:49 +08:00
jeffrey04	ceae1257af	new place names	2017-11-16 11:00:07 +08:00
jeffrey04	f3b76c1f28	some new company types in malay	2017-11-16 10:55:03 +08:00
jeffrey04	c9d22d228f	rearrange according to alphabetical order	2017-11-16 10:53:52 +08:00
jeffrey04	5e9d8f0a1e	rearrange into alphabetical order as in other languages	2017-11-16 10:51:53 +08:00
jeffrey04	6d54cbcc82	new building types	2017-11-16 10:43:58 +08:00
Choon-Siang Lai	867c3b825c	Merge pull request #1 from openvenues/master: Synching from upstream	2017-11-15 14:35:47 +08:00
Al	fbf88aee88	[similarity] adding possible abbreviation functions to header, making everything const char *	2017-11-12 04:48:26 -05:00
Al	b34e578366	[similarity] using new sequence alignment breakdown by operation to tell if any two words are an abbreviation. The loose variant requires that the alignment covers all characters in the shortest string, which matches things like Services vs. Svc, whereas the strict variant requires that either the shorter string is a prefix of the longer one (Inc and Incorporated) or that the two strings share both a prefix and a suffix (Dept and Department). Both variants require that the strings share at least the first letter in common.	2017-11-11 04:02:28 -05:00
Al	751873e56b	[similarity] a NEW sequence alignment algorithm which builds on Smith-Waterman-Gotoh with affine gap penalties. Like Smith-Waterman, it performs a local alignment, and like the cost-only version of Gotoh's improvement, it needs O(mn) time and O(m) space (where m is the length of the longer string). However, this version of the algorithm stores and returns a breakdown of the number and specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to return/compute the full alignment as in Needleman-Wunsch or Hirschberg's variant	2017-11-11 03:07:39 -05:00
Al	665b780422	[utils] adding unicode_equals function in string_utils for testing equality of unicode char arrays	2017-11-11 02:45:41 -05:00
Al	5f0e394ea8	[fix] README badges	2017-11-01 20:12:36 -04:00
Al	669e52b329	[build] adding --no-same-owner explicitly when untarring the data files for #267	2017-11-01 20:05:36 -04:00
Al	3c6629ae3d	[dictionaries] adding variants of & as synonyms in all languages	2017-10-28 17:22:14 -04:00
Al	bc9f11d6e3	[similarity] exposing unicode versions of Damerau-Levenshtein and Jaro-Winkler distances	2017-10-28 02:45:48 -04:00
Al	2d6079b06f	[expand] added search_address_dictionaries_substring to support the new use case (i.e. returns "is this substring in the trie?" regardless of whether it's stored under the special prefixes/suffixes namespaces)	2017-10-28 02:40:14 -04:00
Al	053dca82ba	[expand] adding a normalization for a single non-acronym internal period where there's an expansion at the prefix/suffix (for #218 and https://github.com/openvenues/libpostal/issues/216#issuecomment-306617824 ). Helps in cases like "St.Michaels" or "Jln.Utara" without needing to specify concatenated prefix phrases for every possibility	2017-10-28 02:38:15 -04:00
Al	6d430f7e9b	[utils] adding functions for finding the next index of a full stop/period character in a string	2017-10-27 04:07:28 -04:00
Al	e38e57b8e8	[numex] fixing edge case where something like "IV Michael" could cause a partial Roman numeral to get added for the MI portion of "Michael"	2017-10-27 04:04:12 -04:00
Al	e8ae3bbbaf	[similarity] using NULL-terminated varargs in double metaphone instead of specifying the number of arguments, should be more maintainable	2017-10-23 15:20:04 -04:00
Al	5c0ecf8963	[dedupe] Jaccard similarity	2017-10-21 10:34:12 -04:00
Al	4ccc2a9e9f	[fix] making string args const in string_similarity module	2017-10-21 02:45:22 -04:00
Al	5c927e780f	[expand] adding ability to expand Roman numerals with ordinal suffixes like IXe in French	2017-10-20 02:51:26 -04:00
Al	b7eda37e44	[utils] adding utf8_is_digit to string_utils.h	2017-10-20 02:46:00 -04:00
Al	1fbc238b60	[numex] adding functions to parse and validate a Roman numeral	2017-10-20 02:45:32 -04:00
Al	1c5afcafd2	[phrases] when skipping/ignoring hyphens in trie search, make sure that the new longer phrase ends at a word boundary (space, hyphen, end of string, etc.)	2017-10-20 02:43:39 -04:00
Al	9d2a111286	[numex] when parsing numex, bail on rules in whole_tokens_only languages if there are contiguous rules with no right context rules (example: something that wouldn't make sense like VL in Latin)	2017-10-20 02:34:30 -04:00
Al	bd477976d1	[similarity] string similarity measures for Damerau-Levenshtein and Jaro-Winkler distances. Both operate on unicode points internally for lengths, etc. instead of byte strings and the Levenshtein distance uses only one array instead of needing to store the full matrix of transitions.	2017-10-19 04:51:33 -04:00
Al	245aa226e0	[utils] function to create an array of uint32_t codepoints from a UTF-8 string, a few bug fixes to string_utils	2017-10-19 04:48:50 -04:00
Al	c61007388b	[similarity] bug fixes and additional French, Spanish, Italian, and Slavic phonetics	2017-10-18 13:31:35 -04:00
Al	3a3aca8490	[similarity] adding basic double metaphone implementation	2017-10-18 03:59:05 -04:00
Al	2f2d3da722	[test] test for utf8_equal_ignore_separators	2017-10-14 01:42:08 -04:00
Al	09fbb02042	[utils] adding utf8_equal_ignore_separators to string utils	2017-10-14 01:36:56 -04:00
Al	f8a808e254	[utils] adding utf8_len function for strings, and utf8_is_digit	2017-10-12 11:16:53 -04:00
Al	448ca6a61a	[merge] merging commit from v1.1	2017-10-12 01:41:04 -04:00
Travis	bb277fb326	[auto][ci skip] Adding data files from Travis build #268	2017-10-10 18:58:10 +00:00
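
Commit b34e578366 above describes a strict abbreviation test: the two strings must share their first letter, and either the shorter string is a prefix of the longer one (Inc and Incorporated) or the two share both a prefix and a suffix (Dept and Department). A minimal Python sketch of that rule follows. It is not the library's C code, which per the commit message works from a sequence-alignment breakdown. The function name is hypothetical, and reading "share both a prefix and a suffix" as "the shorter string splits into a prefix and a suffix of the longer one" is an assumption.

```python
def is_possible_abbreviation_strict(a: str, b: str) -> bool:
    """Hypothetical helper sketching the strict rule from the commit message."""
    # Compare case-insensitively; order the pair so `shorter` is never longer.
    shorter, longer = sorted((a.casefold(), b.casefold()), key=len)

    # Both variants in the commit require a shared first letter.
    if not shorter or shorter[0] != longer[0]:
        return False

    # Case 1: the shorter string is a prefix of the longer one (Inc / Incorporated).
    if longer.startswith(shorter):
        return True

    # Case 2 (assumed interpretation): the shorter string splits into a
    # non-empty prefix and a non-empty suffix of the longer one (Dept / Department).
    for i in range(1, len(shorter)):
        if longer.startswith(shorter[:i]) and longer.endswith(shorter[i:]):
            return True
    return False
```

Under this reading, Svc vs. Services fails the strict test, which is consistent with the commit message presenting that pair as an example for the loose variant.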


5084 Commits
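
Commits 1fbc238b60 and 9d2a111286 mention parsing and validating Roman numerals and rejecting sequences that would not make sense, such as VL. The Python sketch below illustrates one common way to do both steps, assuming standard subtractive notation for values 1 to 3999. The names parse_roman, _ROMAN_RE, and _VALUES are hypothetical, and this is not the numex implementation.

```python
import re
from typing import Optional

# Standard-form Roman numerals from 1 to 3999 (assumed grammar, not libpostal's).
_ROMAN_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")
_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def parse_roman(text: str) -> Optional[int]:
    """Return the integer value of a valid Roman numeral, or None if invalid."""
    numeral = text.upper()
    # The pattern also matches the empty string, so reject that explicitly.
    if not numeral or not _ROMAN_RE.fullmatch(numeral):
        return None

    total = 0
    for i, ch in enumerate(numeral):
        value = _VALUES[ch]
        # A smaller symbol directly before a larger one is subtracted (IV, XC).
        if i + 1 < len(numeral) and _VALUES[numeral[i + 1]] > value:
            total -= value
        else:
            total += value
    return total
```

Validating the whole token before summing is what rejects VL here. It does not by itself address the "IV Michael" edge case from commit e38e57b8e8, which concerns where a numeral may begin and end inside a longer string.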