libpostal

Author	SHA1	Message	Date
Al	55ba627c3c	[similarity] needed to add utf8proc_category and invert the indices for counting transposes in affine gap	2017-12-08 14:27:23 -05:00
Al	b34e578366	[similarity] using the new sequence alignment breakdown by operation to tell whether two words are abbreviations of each other. The loose variant requires that the alignment covers all characters in the shorter string, which matches things like Services vs. Svc, whereas the strict variant requires either that the shorter string is a prefix of the longer one (Inc and Incorporated) or that the two strings share both a prefix and a suffix (Dept and Department). Both variants require that the strings share at least their first letter.	2017-11-11 04:02:28 -05:00
Al	751873e56b	[similarity] a new sequence alignment algorithm that builds on Smith-Waterman-Gotoh with affine gap penalties. Like Smith-Waterman, it performs a local alignment, and like the cost-only version of Gotoh's improvement, it needs O(mn) time and O(m) space (where m is the length of the longer string). However, this version stores and returns a breakdown of the number and specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and it does so without needing to compute or return the full alignment as in Needleman-Wunsch or Hirschberg's variant.	2017-11-11 03:07:39 -05:00
Al	bc9f11d6e3	[similarity] exposing unicode versions of Damerau-Levenshtein and Jaro-Winkler distances	2017-10-28 02:45:48 -04:00
Al	4ccc2a9e9f	[fix] making string args const in string_similarity module	2017-10-21 02:45:22 -04:00
Al	bd477976d1	[similarity] string similarity measures for Damerau-Levenshtein and Jaro-Winkler distances. Both operate internally on Unicode code points rather than bytes (for lengths, etc.), and the Levenshtein distance uses only one array instead of storing the full matrix of transitions.	2017-10-19 04:51:33 -04:00
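
The strict abbreviation rule described in b34e578366 can be illustrated with a small sketch. This is not libpostal's implementation, which works from the alignment breakdown; it is a simplified, ASCII-only approximation. The function name `is_strict_abbreviation_sketch` is hypothetical, and the requirement that the shared prefix and suffix together cover the whole shorter string is an assumption made for this example.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libpostal's API. ASCII-only and
 * case-insensitive. Returns true when the shorter string is a prefix of
 * the longer one, or when a shared prefix plus a shared suffix cover the
 * whole shorter string (an assumption of this sketch). */
static bool is_strict_abbreviation_sketch(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    const char *shorter = (la <= lb) ? a : b;
    const char *longer  = (la <= lb) ? b : a;
    size_t ls = (la <= lb) ? la : lb;
    size_t ll = (la <= lb) ? lb : la;

    if (ls == 0) return false;

    /* Length of the common prefix. */
    size_t prefix = 0;
    while (prefix < ls &&
           tolower((unsigned char)shorter[prefix]) ==
           tolower((unsigned char)longer[prefix])) {
        prefix++;
    }

    /* Both variants require at least the first letter to match. */
    if (prefix == 0) return false;

    /* Case 1: the shorter string is a prefix (Inc / Incorporated). */
    if (prefix == ls) return true;

    /* Length of the common suffix, not overlapping the prefix. */
    size_t suffix = 0;
    while (suffix < ls - prefix &&
           tolower((unsigned char)shorter[ls - 1 - suffix]) ==
           tolower((unsigned char)longer[ll - 1 - suffix])) {
        suffix++;
    }

    /* Case 2: shared prefix and suffix (Dept / Department). */
    return suffix > 0 && prefix + suffix == ls;
}
```

Under these assumptions, Inc/Incorporated and Dept/Department are accepted, while Svc/Services is rejected, since that pair only matches under the loose variant.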

6 Commits
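
The single-array idea mentioned in bd477976d1 can be sketched as follows. This is a minimal illustration, not libpostal's code: it computes plain Levenshtein distance (no transpositions, so it is not Damerau-Levenshtein), and it assumes the strings have already been decoded into arrays of Unicode code points. The name `levenshtein_one_row` is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch, not libpostal's function. Computes plain
 * Levenshtein distance between two code point sequences using one row of
 * n + 1 cells instead of the full (m + 1) x (n + 1) matrix.
 * Returns (size_t)-1 if the allocation fails. */
static size_t levenshtein_one_row(const uint32_t *a, size_t m,
                                  const uint32_t *b, size_t n) {
    size_t *row = malloc((n + 1) * sizeof *row);
    if (row == NULL) return (size_t)-1;

    /* Row 0: turning an empty prefix of a into b[0..j) takes j inserts. */
    for (size_t j = 0; j <= n; j++) row[j] = j;

    for (size_t i = 1; i <= m; i++) {
        size_t diag = row[0];  /* value of D[i-1][j-1] for j = 1 */
        row[0] = i;            /* D[i][0]: i deletions */
        for (size_t j = 1; j <= n; j++) {
            size_t up = row[j];  /* D[i-1][j], saved before overwriting */
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            size_t best = diag + cost;                      /* substitute or match */
            if (up + 1 < best) best = up + 1;               /* delete */
            if (row[j - 1] + 1 < best) best = row[j - 1] + 1; /* insert */
            row[j] = best;
            diag = up;  /* becomes D[i-1][j-1] for the next column */
        }
    }

    size_t distance = row[n];
    free(row);
    return distance;
}
```

Because the comparison is on code points rather than bytes, a pair such as "café" and "cafe" differs by one edit regardless of how many bytes the accented character occupies in UTF-8.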