18eb5ef9eeMerge pull request #272 from AeroXuk/master
Al Barrentine
2017-11-28 21:35:46 -05:00
19ae97d527Adding include config.h to strndup.c so that the function is not compiled and doesn't cause errors when the system has its own implementation.
AeroXuk
2017-11-27 23:40:46 +00:00
9090811826Modified the libpostal API to add an extra function libpostal_parser_print_features to toggle debugging info. Updated address_parser app to use the new function.
AeroXuk
2017-11-27 19:20:37 +00:00
69e0d5d963Updated linenoise to be MSys2/MinGW compatible. Updated address_parser app to use the defined libpostal api and not include internal components directly. Removed windows src Makefile as it is now the same as the standard one.
AeroXuk
2017-11-27 01:42:25 +00:00
bb5535602aAdding libpostal.h to the AppVeyor package.
AeroXuk
2017-11-25 10:13:14 +00:00
26ac9ab5c2Removing EXPORT statements from all source files and most header files, leaving only the exports for the main API in libpostal.h. Modified Makefiles so that all the test apps build without having extra functions exported from libpostal.
AeroXuk
2017-11-25 04:35:28 +00:00
15b3758be8[auto][ci skip] Adding data files from Travis build #284
Travis
2017-11-24 22:29:45 +00:00
7d001489efMerge pull request #274 from openvenues/fix_oh_expansion
Al Barrentine
2017-11-24 17:13:24 -05:00
ebe7fc9be9[test] missing paren in Columbus, OH test. Adding test for "oh" as part of a number in Nineteen oh one W El Segundo Blvd
Al
2017-11-24 16:11:07 -05:00
d7f22544b4[test] adding an expansion test for the Columbus, OH case
Al
2017-11-24 15:44:37 -05:00
ef098fd2e7[numex] implementing the numex concat_only_if_number left context, which helps in the case of e.g. Columbus, OH in #271
Al
2017-11-24 15:42:50 -05:00
c276cf1529[numex] adding a new type of left context for numeric expressions called concat_only_if_number (for something like "oh" which can be "Columbus, OH" or something like "Twenty-One Oh One")
Al
2017-11-24 15:36:50 -05:00
f0246e7333Fix bug in strndup fix for windows. Move all includes out of headers and into code for strndup.h and move it to be the last include.
AeroXuk
2017-11-23 19:11:25 +00:00
d205f4d2bbAdding artifacts to AppVeyor config.
AeroXuk
2017-11-23 02:24:06 +00:00
f07ab765cbAdding the export marker to all functions used in tests.
AeroXuk
2017-11-20 20:58:37 +00:00
ad682b7592Altered Makefile to include strndup.c on the other programs which require it. For the windows version of the Makefile, commented out address_parser lines as it has dependencies on includes we don't have.
AeroXuk
2017-11-20 20:24:11 +00:00
dbf232b8f8Fix bugs in AppVeyor config and build script. Added call to test script.
AeroXuk
2017-11-19 13:35:08 +00:00
2d3b420d35Merging changes from AeroXuk/libpostal_windows.
AeroXuk
2017-11-19 12:44:38 +00:00
7d6e648fc3[auto][ci skip] Adding data files from Travis build #271
Travis
2017-11-17 19:36:25 +00:00
27b3e99515Merge pull request #269 from Jeffrey04/ms-dictionary-expansion-1.0
Al Barrentine
2017-11-17 14:20:43 -05:00
86c3105d44new names with alternate spelling
jeffrey04
2017-11-16 11:23:20 +08:00
e9d2ab6400reordered list of synonyms
jeffrey04
2017-11-16 11:22:42 +08:00
39fd7f0cb1list of titles update
jeffrey04
2017-11-16 11:18:18 +08:00
865f99a0c1sorted place names
jeffrey04
2017-11-16 11:04:49 +08:00
ceae1257afnew place names
jeffrey04
2017-11-16 11:00:07 +08:00
f3b76c1f28some new company types in malay
jeffrey04
2017-11-16 10:55:03 +08:00
c9d22d228frearrange according to alphabetical order
jeffrey04
2017-11-16 10:53:52 +08:00
5e9d8f0a1erearrange into alphabetical order as in other languages
jeffrey04
2017-11-16 10:51:53 +08:00
6d54cbcc82new building types
jeffrey04
2017-11-16 10:43:58 +08:00
867c3b825cMerge pull request #1 from openvenues/master
Choon-Siang Lai
2017-11-15 14:35:47 +08:00
fbf88aee88[similarity] adding possible abbreviation functions to header, making everything const char *
Al
2017-11-12 04:48:26 -05:00
b34e578366[similarity] using new sequence alignment breakdown by operation to tell if any two words are an abbreviation. The loose variant requires that the alignment covers all characters in the shortest string, which matches things like Services vs. Svc, whereas the strict variant requires that either the shorter string is a prefix of the longer one (Inc and Incorporated) or that the two strings share both a prefix and a suffix (Dept and Department). Both variants require that the strings share at least the first letter in common.
Al
2017-11-11 04:02:28 -05:00
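The loose and strict abbreviation checks described in the commit above can be illustrated with a rough Python sketch. This is not libpostal's implementation (which is in C and works from the sequence alignment breakdown). Here the loose variant is approximated with a plain subsequence test, and the strict variant treats any non-empty common suffix as "sharing a suffix". The function names are hypothetical.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def is_subsequence(short: str, long: str) -> bool:
    """True if every character of short appears in long, in order."""
    it = iter(long)
    return all(ch in it for ch in short)


def possible_abbreviation(a: str, b: str, strict: bool = False) -> bool:
    # Assumption: comparison is case-insensitive.
    a, b = a.casefold(), b.casefold()
    short, long = sorted((a, b), key=len)
    # Both variants require the same first letter.
    if not short or short[0] != long[0]:
        return False
    if not strict:
        # Loose: all characters of the shorter string are covered
        # (approximated here by a subsequence test, not an alignment).
        return is_subsequence(short, long)
    # Strict: the shorter string is a prefix of the longer one ...
    if long.startswith(short):
        return True
    # ... or the strings share a suffix as well as the first letter.
    return common_prefix_len(short[::-1], long[::-1]) > 0
```

With these assumptions, "Svc" vs. "Services" passes only the loose check, while "Inc" vs. "Incorporated" and "Dept" vs. "Department" also pass the strict one.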
751873e56b[similarity] a *NEW* sequence alignment algorithm which builds on Smith-Waterman-Gotoh with affine gap penalties. Like Smith-Waterman, it performs a local alignment, and like the cost-only version of Gotoh's improvement, it needs O(mn) time and O(m) space (where m is the length of the longer string). However, this version of the algorithm stores and returns a breakdown of the number and specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to return/compute the full alignment as in Needleman-Wunsch or Hirschberg's variant
Al
2017-11-11 03:07:39 -05:00
665b780422[utils] adding unicode_equals function in string_utils for testing equality of unicode char arrays
Al
2017-11-11 02:45:41 -05:00
5f0e394ea8[fix] README badges
Al
2017-11-01 20:12:36 -04:00
669e52b329[build] adding --no-same-owner explicitly when untarring the data files for #267
Al
2017-11-01 20:05:33 -04:00
3c6629ae3d[dictionaries] adding variants of & as synonyms in all languages
Al
2017-10-28 17:22:14 -04:00
bc9f11d6e3[similarity] exposing unicode versions of Damerau-Levenshtein and Jaro-Winkler distances
Al
2017-10-28 02:45:48 -04:00
2d6079b06f[expand] added search_address_dictionaries_substring to support the new use case (i.e. returns "is this substring in the trie?" regardless of whether it's stored under the special prefixes/suffixes namespaces)
Al
2017-10-28 02:40:14 -04:00
053dca82ba[expand] adding a normalization for a single non-acronym internal period where there's an expansion at the prefix/suffix (for #218 and https://github.com/openvenues/libpostal/issues/216#issuecomment-306617824). Helps in cases like "St.Michaels" or "Jln.Utara" without needing to specify concatenated prefix phrases for every possibility
Al
2017-10-28 02:38:15 -04:00
6d430f7e9b[utils] adding functions for finding the next index of a full stop/period character in a string
Al
2017-10-27 04:07:28 -04:00
e38e57b8e8[numex] fixing edge case where something like "IV Michael" could cause a partial Roman numeral to get added for the MI portion of "Michael"
Al
2017-10-27 04:04:06 -04:00
e8ae3bbbaf[similarity] using NULL-terminated varargs in double metaphone instead of specifying the number of arguments, should be more maintainable
Al
2017-10-23 15:20:04 -04:00
5c0ecf8963[dedupe] Jaccard similarity
Al
2017-10-21 10:34:12 -04:00
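For reference, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal Python sketch follows, assuming the inputs are already tokenized. The function name and the choice to return 0.0 when both inputs are empty are assumptions, not taken from libpostal.

```python
from typing import Hashable, Iterable


def jaccard_similarity(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """|A intersect B| / |A union B| over the distinct items of a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty inputs are given similarity 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)
```

For example, the token lists `["main", "st"]` and `["main", "street"]` share one of three distinct tokens, giving a similarity of 1/3.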
4ccc2a9e9f[fix] making string args const in string_similarity module
Al
2017-10-21 02:45:08 -04:00
5c927e780f[expand] adding ability to expand Roman numerals with ordinal suffixes like IXe in French
Al
2017-10-20 02:51:26 -04:00
b7eda37e44[utils] adding utf8_is_digit to string_utils.h
Al
2017-10-20 02:45:55 -04:00
1fbc238b60[numex] adding functions to parse and validate a Roman numeral
Al
2017-10-20 02:45:32 -04:00
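Parsing and validating a Roman numeral can be sketched in Python as below. This is an illustration only, not the numex code itself: it assumes the usual canonical form for values 1 to 3999, and the names `parse_roman`, `_ROMAN_RE` and `_VALUES` are hypothetical.

```python
import re
from typing import Optional

# Canonical Roman numerals: thousands, hundreds, tens, ones.
_ROMAN_RE = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)
_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def parse_roman(s: str) -> Optional[int]:
    """Return the integer value of s, or None if s is not a valid numeral."""
    s = s.upper()
    if not s or not _ROMAN_RE.fullmatch(s):
        return None
    total = 0
    for i, ch in enumerate(s):
        value = _VALUES[ch]
        nxt = _VALUES.get(s[i + 1], 0) if i + 1 < len(s) else 0
        # A smaller symbol before a larger one is subtracted (IV, XC, ...).
        total += -value if nxt > value else value
    return total
```

Under these rules "IX" parses to 9 and "MCMXCIV" to 1994, while sequences such as "VL" or "IIII" are rejected.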
1c5afcafd2[phrases] when skipping/ignoring hyphens in trie search, make sure that the new longer phrase ends at a word boundary (space, hyphen, end of string, etc.)
Al
2017-10-20 02:43:39 -04:00
9d2a111286[numex] when parsing numex, bail on rules in whole_tokens_only languages if there are contiguous rules with no right context rules (example: something that wouldn't make sense like VL in Latin)
Al
2017-10-20 02:34:30 -04:00
bd477976d1[similarity] string similarity measures for Damerau-Levenshtein and Jaro-Winkler distances. Both operate on unicode points internally for lengths, etc. instead of byte strings and the Levenshtein distance uses only one array instead of needing to store the full matrix of transitions.
Al
2017-10-19 04:51:28 -04:00
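The single-array idea mentioned in the commit above can be sketched in Python. This version computes plain Levenshtein distance only; the transposition handling of Damerau-Levenshtein is left out, and it is not a port of the C code. Python strings are sequences of Unicode code points, so lengths are counted in code points rather than bytes.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using one row of the DP table instead of the full matrix."""
    # Keep the row as short as possible by iterating over the longer string.
    if len(a) < len(b):
        a, b = b, a
    row = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        diag = row[0]  # value of D[i-1][j-1] for j = 1
        row[0] = i     # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            up = row[j]  # D[i-1][j], saved before it is overwritten
            cost = 0 if ca == cb else 1
            row[j] = min(
                up + 1,          # deletion
                row[j - 1] + 1,  # insertion
                diag + cost,     # match or substitution
            )
            diag = up
    return row[-1]
```

The only extra state beyond the row is the saved diagonal value, which is what allows the full matrix to be dropped.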
245aa226e0[utils] function to create an array of uint32_t codepoints from a UTF-8 string, a few bug fixes to string_utils
Al
2017-10-19 04:48:50 -04:00
c61007388b[similarity] bug fixes and additional French, Spanish, Italian, and Slavic phonetics
Al
2017-10-18 04:00:57 -04:00
3a3aca8490[similarity] adding basic double metaphone implementation
Al
2017-10-18 03:59:05 -04:00
2f2d3da722[test] test for utf8_equal_ignore_separators
Al
2017-10-14 01:42:08 -04:00
09fbb02042[utils] adding utf8_equal_ignore_separators to string utils
Al
2017-10-14 01:36:56 -04:00
f8a808e254[utils] adding utf8_len function for strings, and utf8_is_digit
Al
2017-10-12 11:16:53 -04:00
448ca6a61a[merge] merging commit from v1.1
Al
2017-08-14 04:04:58 -06:00
bb277fb326[auto][ci skip] Adding data files from Travis build #268
Travis
2017-10-10 18:58:10 +00:00
e60139757fMerge pull request #257 from mkaranta/patch-1
Al Barrentine
2017-10-10 14:42:29 -04:00
c96a042e86Add 'bld' as an abbreviation for 'building'
mkaranta
2017-10-10 14:19:09 -04:00
c984dca459[fix] removing log error for sequences of length 0
Al
2017-09-19 23:20:03 -04:00
94a0e842e7[fix] typo
Al Barrentine
2017-08-16 15:04:15 -04:00
34e2c4772e[code of conduct] adding stronger, more specific language about hate speech in code of conduct
Al Barrentine
2017-08-16 15:03:38 -04:00
2bfa8efefb[docs] updating README examples of normalization now that canonical forms are no longer transliterated
Al Barrentine
2017-08-16 12:15:22 -04:00
0c6af2b74c[fix] normalize canonical strings (after expanding abbreviations, concatenated suffixes, etc.) with Latin-ASCII, Latin-ASCII-Simple or simple UTF-8 normalization depending on the options
Al
2017-08-03 14:08:05 -06:00
ed011e50d5[docs][ci skip] update contributing section in README
Al
2017-08-01 00:27:50 -04:00
caf2415938[fix][ci skip] updates to contributions guide
Al
2017-08-01 00:25:36 -04:00
da2affbacb[fix][ci skip] removing repetition in contributing guide
Al
2017-08-01 00:13:53 -04:00
2c06f26f3d[docs][ci skip] adding contributing guide for how to submit issues
Al
2017-08-01 00:06:56 -04:00
6ca6493d0bMerge pull request #231 from michaelkrog/patch-1
Al Barrentine
2017-07-27 11:21:34 -04:00
a36dcc8b9cUpdate is.yaml
Michael Krog
2017-07-27 13:24:54 +02:00
7352dc74c6Moving language around in code of conduct
Al Barrentine
2017-07-21 12:58:35 -04:00
4cde250463Adding a custom libpostal Code of Conduct
Al Barrentine
2017-07-21 02:35:07 -04:00
dab3b95ae1Merge pull request #229 from openvenues/32bit_numex_fix
Al Barrentine
2017-07-20 18:11:02 -04:00
97044f5a8b[fix] 32-bit safety in numex table loading
Al
2017-07-20 17:50:53 -04:00
0cb8c61fb0Merge pull request #215 from xiamx/patch-2
Al Barrentine
2017-06-05 16:26:11 -04:00
abcf72be2eAdd Elixir language binding to Readme
Mengxuan Xia
2017-06-05 16:05:19 -04:00
50cf14846cMerge pull request #214 from iestynpryce/master
Al Barrentine
2017-05-30 08:45:28 -04:00
8dd84b71ba[auto][ci skip] Adding data files from Travis build #250
Travis
2017-05-24 05:05:06 +00:00
e9696e9166Merge pull request #212 from openvenues/bbraunay-master
Al Barrentine
2017-05-24 00:54:05 -04:00
1948634bf3[dictionaries] adding a separable prefix for Jl. and Jln. so things like Jl.Utara get separated and expanded
Al
2017-05-24 00:26:32 -04:00
3b5b5d8baa[dictionaries] adding ambiguous expansions for all Indonesian abbreviations 1-2 characters as they could also be initials, etc.
Al
2017-05-23 18:04:09 -04:00
f507102457[dictionaries] removing English words from Indonesian unit types
Al
2017-05-23 18:01:38 -04:00
4b24699e1f[fix] changing national to nasional in Indonesian
Al
2017-05-23 18:00:20 -04:00
4df48fb412[dictionaries] moving Kampong to normalize to Kampung in Indonesian, better if there's one canonical form
Al
2017-05-23 17:57:34 -04:00
ec79c610eb[dictionaries] removing a few English words and dupes from Indonesian place names
Al
2017-05-23 17:55:59 -04:00
77365a56a5[dictionaries] removing no fixed address from Indonesian dictionaries
Al
2017-05-23 17:51:15 -04:00
8a35cfcd80[dictionaries] removing level/platform/podium from Indonesian level types
Al
2017-05-23 17:50:25 -04:00
364b00da01[dictionaries] separating Mas and Abang
Al
2017-05-23 17:46:45 -04:00
83378049ee[dictionaries] remove Doktor from academic degrees in Indonesian dictionaries
Al
2017-05-23 17:35:53 -04:00
52593c6374[dictionaries] remove nonprofit from Indonesian company types
Al
2017-05-23 17:27:11 -04:00
08524f4b07[dictionaries] moving some of the existing chain stores for Indonesia to the all/chains.txt dictionary
Al
2017-05-23 17:25:59 -04:00