Commit Graph

5060 Commits

Author SHA1 Message Date
Al
b34e578366 [similarity] using new sequence alignment breakdown by operation to tell if any two words are an abbreviation. The loose variant requires that the alignment covers all characters in the shortest string, which matches things like Services vs. Svc, whereas the strict variant requires that either the shorter string is a prefix of the longer one (Inc and Incorporated) or that the two strings share both a prefix and a suffix (Dept and Department). Both variants require that the strings share at least the first letter in common. 2017-11-11 04:02:28 -05:00
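The strict variant described in this commit can be illustrated with plain string comparisons. This is only a sketch under stated assumptions: the function name is hypothetical, it works on ASCII bytes and is case-sensitive, and it replaces the alignment breakdown with direct prefix/suffix matching. The requirement that the shared prefix and suffix together cover the whole shorter string is an assumption of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not libpostal's API: ASCII-only, case-sensitive. */
bool is_strict_abbreviation_sketch(const char *s1, const char *s2) {
    assert(s1 != NULL && s2 != NULL);
    size_t len1 = strlen(s1), len2 = strlen(s2);
    const char *shorter = len1 <= len2 ? s1 : s2;
    const char *longer = len1 <= len2 ? s2 : s1;
    size_t short_len = len1 <= len2 ? len1 : len2;
    size_t long_len = len1 <= len2 ? len2 : len1;

    /* Both variants require at least the first letter in common. */
    if (short_len == 0 || shorter[0] != longer[0]) return false;

    /* Case 1: the shorter string is a prefix of the longer one (Inc / Incorporated). */
    size_t prefix = 0;
    while (prefix < short_len && shorter[prefix] == longer[prefix]) prefix++;
    if (prefix == short_len) return true;

    /* Case 2: shared prefix plus shared suffix (Dept / Department).
       Assumption: together they must account for every character of the
       shorter string, and they may not overlap within it. */
    size_t suffix = 0;
    while (suffix < short_len - prefix &&
           shorter[short_len - 1 - suffix] == longer[long_len - 1 - suffix]) {
        suffix++;
    }
    return suffix > 0 && prefix + suffix == short_len;
}
```

With this sketch, "Inc"/"Incorporated" and "Dept"/"Department" are accepted, while "Svc"/"Services" is rejected, which is the kind of pair the commit says only the loose variant matches.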
Al
751873e56b [similarity] a *NEW* sequence alignment algorithm which builds on Smith-Waterman-Gotoh with affine gap penalties. Like Smith-Waterman, it performs a local alignment, and like the cost-only version of Gotoh's improvement, it needs O(mn) time and O(m) space (where m is the length of the longer string). However, this version of the algorithm stores and returns a breakdown of the number and specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to return/compute the full alignment as in Needleman-Wunsch or Hirschberg's variant 2017-11-11 03:07:39 -05:00
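For orientation, the space bound mentioned above can be seen in a much simpler relative of this algorithm. The sketch below is plain score-only Smith-Waterman with a linear gap penalty, kept in a single row of the dynamic programming table. It is not the commit's algorithm: it has no affine gaps, no transpositions, and no per-operation breakdown, and the function name and the scoring constants are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative scores, chosen for this sketch only. */
enum { SW_MATCH = 2, SW_MISMATCH = -1, SW_GAP = -1 };

/* Best local alignment score of a and b, or -1 if allocation fails.
   Uses one row of strlen(b) + 1 cells instead of the full matrix. */
int local_alignment_score_sketch(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    size_t n = strlen(a), m = strlen(b);
    int *row = calloc(m + 1, sizeof(int));   /* row[j] holds H[i-1][j] on entry */
    if (row == NULL) return -1;

    int best = 0;
    for (size_t i = 1; i <= n; i++) {
        int diag = 0;                        /* H[i-1][0], always 0 */
        for (size_t j = 1; j <= m; j++) {
            int up = row[j];                 /* H[i-1][j] before overwrite */
            int score = diag + (a[i - 1] == b[j - 1] ? SW_MATCH : SW_MISMATCH);
            if (up + SW_GAP > score) score = up + SW_GAP;
            if (row[j - 1] + SW_GAP > score) score = row[j - 1] + SW_GAP; /* H[i][j-1] */
            if (score < 0) score = 0;        /* local alignment: never go negative */
            row[j] = score;
            if (score > best) best = score;
            diag = up;                       /* becomes H[i-1][j] for column j+1 */
        }
    }
    free(row);
    return best;
}
```

Extending a sketch like this to affine gaps would need separate running scores for "in a gap" states, and returning a breakdown would mean carrying operation counts alongside each score; neither is shown here.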
Al
665b780422 [utils] adding unicode_equals function in string_utils for testing equality of unicode char arrays 2017-11-11 02:45:41 -05:00
Al
3c6629ae3d [dictionaries] adding variants of & as synonyms in all languages 2017-10-28 17:22:14 -04:00
Al
bc9f11d6e3 [similarity] exposing unicode versions of Damerau-Levenshtein and Jaro-Winkler distances 2017-10-28 02:45:48 -04:00
Al
2d6079b06f [expand] added search_address_dictionaries_substring to support the new use case (i.e. returns "is this substring in the trie?" regardless of whether it's stored under the special prefixes/suffixes namespaces) 2017-10-28 02:40:14 -04:00
Al
053dca82ba [expand] adding a normalization for a single non-acronym internal period where there's an expansion at the prefix/suffix (for #218 and https://github.com/openvenues/libpostal/issues/216#issuecomment-306617824). Helps in cases like "St.Michaels" or "Jln.Utara" without needing to specify concatenated prefix phrases for every possibility 2017-10-28 02:38:15 -04:00
Al
6d430f7e9b [utils] adding functions for finding the next index of a full stop/period character in a string 2017-10-27 04:07:28 -04:00
Al
e38e57b8e8 [numex] fixing edge case where something like "IV Michael" could cause a partial Roman numeral to get added for the MI portion of "Michael" 2017-10-27 04:04:12 -04:00
Al
e8ae3bbbaf [similarity] using NULL-terminated varargs in double metaphone instead of specifying the number of arguments, should be more maintainable 2017-10-23 15:20:04 -04:00
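The NULL-terminated varargs pattern this commit refers to can be sketched as below. The function name is hypothetical and is not taken from the double metaphone code; the point is only that the callee walks the argument list until it sees a NULL sentinel, so callers do not have to keep an argument count in sync with the list.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true if str equals any of the following
   string arguments. The list MUST end with a (char *)NULL sentinel. */
bool string_is_any_of(const char *str, ...) {
    assert(str != NULL);
    va_list args;
    va_start(args, str);

    bool found = false;
    const char *candidate;
    /* Read arguments until the sentinel instead of relying on a count. */
    while ((candidate = va_arg(args, char *)) != NULL) {
        if (strcmp(str, candidate) == 0) {
            found = true;
            break;
        }
    }

    va_end(args);
    return found;
}
```

A call looks like `string_is_any_of(s, "CH", "SCH", (char *)NULL)`. The cast on the sentinel matters in variadic calls, because a bare `NULL` may be defined as the integer `0` and is not guaranteed to be passed as a pointer.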
Al
5c0ecf8963 [dedupe] Jaccard similarity 2017-10-21 10:34:12 -04:00
Al
4ccc2a9e9f [fix] making string args const in string_similarity module 2017-10-21 02:45:22 -04:00
Al
5c927e780f [expand] adding ability to expand Roman numerals with ordinal suffixes like IXe in French 2017-10-20 02:51:26 -04:00
Al
b7eda37e44 [utils] adding utf8_is_digit to string_utils.h 2017-10-20 02:46:00 -04:00
Al
1fbc238b60 [numex] adding functions to parse and validate a Roman numeral 2017-10-20 02:45:32 -04:00
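One common way to parse and validate a Roman numeral in a single pass is sketched below. It is an assumption-laden illustration, not the numex implementation: the function name is hypothetical, only uppercase ASCII numerals from 1 to 3999 are handled, and validation is done by rebuilding the canonical numeral for the parsed value and comparing it with the input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int roman_digit_value_sketch(char c) {
    switch (c) {
        case 'I': return 1;
        case 'V': return 5;
        case 'X': return 10;
        case 'L': return 50;
        case 'C': return 100;
        case 'D': return 500;
        case 'M': return 1000;
        default:  return 0;   /* not a Roman digit */
    }
}

/* Hypothetical helper: returns the value (1..3999) of a canonical
   uppercase Roman numeral, or 0 if the string is not one. */
int parse_roman_numeral_sketch(const char *s) {
    assert(s != NULL);
    size_t len = strlen(s);
    /* The longest canonical numeral, MMMDCCCLXXXVIII, has 15 characters. */
    if (len == 0 || len > 15) return 0;

    /* Subtractive rule: a digit smaller than its successor is subtracted. */
    int total = 0;
    for (size_t i = 0; i < len; i++) {
        int v = roman_digit_value_sketch(s[i]);
        if (v == 0) return 0;
        int next = (i + 1 < len) ? roman_digit_value_sketch(s[i + 1]) : 0;
        total += (v < next) ? -v : v;
    }
    if (total < 1 || total > 3999) return 0;

    /* Validate by round trip: rebuild the canonical form and compare. */
    static const int values[] = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
    static const char *symbols[] = {"M", "CM", "D", "CD", "C", "XC", "L", "XL",
                                    "X", "IX", "V", "IV", "I"};
    char canonical[16] = "";
    int rest = total;
    for (size_t k = 0; k < sizeof(values) / sizeof(values[0]); k++) {
        while (rest >= values[k]) {
            strcat(canonical, symbols[k]);
            rest -= values[k];
        }
    }
    return strcmp(canonical, s) == 0 ? total : 0;
}
```

Under this scheme "IX" parses to 9 and "MCMXCIV" to 1994, while sequences such as "VL" or "IIII" are rejected because their canonical forms ("XLV" and "IV") differ from the input.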
Al
1c5afcafd2 [phrases] when skipping/ignoring hyphens in trie search, make sure that the new longer phrase ends at a word boundary (space, hyphen, end of string, etc.) 2017-10-20 02:43:39 -04:00
Al
9d2a111286 [numex] when parsing numex, bail on rules in whole_tokens_only languages if there are contiguous rules with no right context rules (example: something that wouldn't make sense like VL in Latin) 2017-10-20 02:34:30 -04:00
Al
bd477976d1 [similarity] string similarity measures for Damerau-Levenshtein and Jaro-Winkler distances. Both operate on unicode points internally for lengths, etc. instead of byte strings and the Levenshtein distance uses only one array instead of needing to store the full matrix of transitions. 2017-10-19 04:51:33 -04:00
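The single-array idea mentioned here can be sketched for plain Levenshtein distance over codepoint arrays. This is an illustration under assumptions: the function name and signature are hypothetical, and transpositions (the Damerau part) are deliberately left out to keep the one-row update easy to follow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: Levenshtein distance between two arrays of
   Unicode codepoints, using one row of b_len + 1 cells.
   Returns SIZE_MAX if allocation fails. */
size_t levenshtein_codepoints_sketch(const uint32_t *a, size_t a_len,
                                     const uint32_t *b, size_t b_len) {
    assert(a != NULL && b != NULL);
    if (a_len == 0) return b_len;
    if (b_len == 0) return a_len;

    size_t *row = malloc((b_len + 1) * sizeof(size_t));
    if (row == NULL) return SIZE_MAX;
    for (size_t j = 0; j <= b_len; j++) row[j] = j;   /* D[0][j] = j */

    for (size_t i = 1; i <= a_len; i++) {
        size_t diag = row[0];                /* D[i-1][0] */
        row[0] = i;                          /* D[i][0] = i */
        for (size_t j = 1; j <= b_len; j++) {
            size_t up = row[j];              /* D[i-1][j] before overwrite */
            size_t cost = diag + (a[i - 1] != b[j - 1] ? 1 : 0); /* substitute/match */
            if (up + 1 < cost) cost = up + 1;                    /* deletion */
            if (row[j - 1] + 1 < cost) cost = row[j - 1] + 1;    /* insertion */
            row[j] = cost;
            diag = up;
        }
    }

    size_t result = row[b_len];
    free(row);
    return result;
}
```

Because the inputs are codepoints rather than bytes, "café" and "cafe" differ by one edit here, whereas a byte-level comparison of their UTF-8 encodings would count two.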
Al
245aa226e0 [utils] function to create an array of uint32_t codepoints from a UTF-8 string, a few bug fixes to string_utils 2017-10-19 04:48:50 -04:00
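Converting a UTF-8 string into an array of `uint32_t` codepoints can be sketched with a small hand-written decoder. The function below is hypothetical and is not the one added to string_utils: it writes into a caller-supplied buffer, stops at the first malformed or truncated sequence, and does not reject overlong encodings or surrogate values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes str into out (at most out_cap codepoints)
   and returns the number of codepoints written. Decoding stops early at
   an invalid lead byte or a bad continuation byte. */
size_t utf8_to_codepoints_sketch(const char *str, uint32_t *out, size_t out_cap) {
    assert(str != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)str;
    size_t n = 0;

    while (*p != '\0' && n < out_cap) {
        uint32_t cp;
        int extra;   /* number of continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }  /* 0xxxxxxx */
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }  /* 110xxxxx */
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }  /* 1110xxxx */
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }  /* 11110xxx */
        else return n;                                                /* invalid lead byte */
        p++;

        for (int k = 0; k < extra; k++) {
            /* Continuation bytes look like 10xxxxxx; a NUL terminator fails this test too. */
            if ((*p & 0xC0) != 0x80) return n;
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            p++;
        }
        out[n++] = cp;
    }
    return n;
}
```

For example, the six-byte string "h\xc3\xa9llo" decodes to five codepoints, the second of which is U+00E9.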
Al
c61007388b [similarity] bug fixes and additional French, Spanish, Italian, and Slavic phonetics 2017-10-18 13:31:35 -04:00
Al
3a3aca8490 [similarity] adding basic double metaphone implementation 2017-10-18 03:59:05 -04:00
Al
2f2d3da722 [test] test for utf8_equal_ignore_separators 2017-10-14 01:42:08 -04:00
Al
09fbb02042 [utils] adding utf8_equal_ignore_separators to string utils 2017-10-14 01:36:56 -04:00
Al
f8a808e254 [utils] adding utf8_len function for strings, and utf8_is_digit 2017-10-12 11:16:53 -04:00
Al
448ca6a61a [merge] merging commit from v1.1 2017-10-12 01:41:04 -04:00
Travis
bb277fb326 [auto][ci skip] Adding data files from Travis build #268 2017-10-10 18:58:10 +00:00
Al Barrentine
e60139757f Merge pull request #257 from mkaranta/patch-1
Add 'bld' as an abbreviation for 'building'
2017-10-10 14:42:29 -04:00
mkaranta
c96a042e86 Add 'bld' as an abbreviation for 'building'
I noticed this was missing while testing a batch of addresses. Hopefully it doesn't introduce much noise.
2017-10-10 14:19:09 -04:00
Al
c984dca459 [fix] removing log error for sequences of length 0 2017-09-19 23:20:03 -04:00
Al Barrentine
94a0e842e7 [fix] typo 2017-08-16 15:04:15 -04:00
Al Barrentine
34e2c4772e [code of conduct] adding stronger, more specific language about hate speech in code of conduct 2017-08-16 15:03:38 -04:00
Al Barrentine
2bfa8efefb [docs] updating README examples of normalization now that canonical forms are no longer transliterated 2017-08-16 12:15:22 -04:00
Al
0c6af2b74c [fix] normalize canonical strings (after expanding abbreviations, concatenated suffixes, etc.) with Latin-ASCII, Latin-ASCII-Simple or simple UTF-8 normalization depending on the options 2017-08-03 14:08:05 -06:00
Al
ed011e50d5 [docs][ci skip] update contributing section in README 2017-08-01 00:27:50 -04:00
Al
caf2415938 [fix][ci skip] updates to contributions guide 2017-08-01 00:25:36 -04:00
Al
da2affbacb [fix][ci skip] removing repetition in contributing guide 2017-08-01 00:13:55 -04:00
Al
2c06f26f3d [docs][ci skip] adding contributing guide for how to submit issues 2017-08-01 00:10:40 -04:00
Al Barrentine
6ca6493d0b Merge pull request #231 from michaelkrog/patch-1
Changes front matter of is.yaml to correct description
2017-07-27 11:21:34 -04:00
Michael Krog
a36dcc8b9c Update is.yaml 2017-07-27 13:24:54 +02:00
Al Barrentine
7352dc74c6 Moving language around in code of conduct 2017-07-21 12:58:35 -04:00
Al Barrentine
4cde250463 Adding a custom libpostal Code of Conduct 2017-07-21 02:35:07 -04:00
Al Barrentine
dab3b95ae1 Merge pull request #229 from openvenues/32bit_numex_fix
32-bit safety in numex table loading
2017-07-20 18:11:02 -04:00
Al
97044f5a8b [fix] 32-bit safety in numex table loading 2017-07-20 17:55:43 -04:00
Al Barrentine
0cb8c61fb0 Merge pull request #215 from xiamx/patch-2
Add Elixir language binding to README.md
2017-06-05 16:26:11 -04:00
Mengxuan Xia
abcf72be2e Add Elixir language binding to Readme 2017-06-05 16:05:19 -04:00
Al Barrentine
50cf14846c Merge pull request #214 from iestynpryce/master
Fix remaining log_* compile format warnings
2017-05-30 08:45:28 -04:00
Iestyn Pryce
b96a687182 Merge https://github.com/openvenues/libpostal 2017-05-29 18:23:03 +01:00
Travis
8dd84b71ba [auto][ci skip] Adding data files from Travis build #250 2017-05-24 05:05:06 +00:00
Al Barrentine
e9696e9166 Merge pull request #212 from openvenues/bbraunay-master
modified Indonesian dictionary updates
2017-05-24 00:54:05 -04:00
Al
1948634bf3 [dictionaries] adding a separable prefix for Jl. and Jln. so things like Jl.Utara get separated and expanded 2017-05-24 00:26:32 -04:00