Commit Graph

  • f963e175e4 [tests] Expansion tests with and without language classifier Al 2016-01-28 16:35:32 -05:00
  • fed599ac39 [version] bumping version to 0.3 for consistency Al 2016-01-28 16:34:41 -05:00
  • 87899050b2 [tests] Using greatest (https://github.com/silentbicycle/greatest) for automated testing Al 2016-01-28 16:31:32 -05:00
  • 0bad3adf07 [docs] Removing the coming soon label from language classification, cleaning up the README a bit Al 2016-01-27 14:44:48 -05:00
  • 95a7978131 [build] Adding relevant language_classifier sources to build Al 2016-01-27 03:34:35 -05:00
  • 93ed2bf15b [api] Making language optional in libpostal cli Al 2016-01-27 03:32:29 -05:00
  • 789db8f582 [build] Adding language classifier to data file download script. As the current file is rather large, added multipart downloads from S3 to speed things up Al 2016-01-27 03:31:45 -05:00
  • 42d169feee [api] Libpostal expand API will now detect language automatically using a high accuracy language classifier trained on OSM streets/addresses/toponyms. Hooray batch geocoding! Al 2016-01-27 03:20:55 -05:00
  • 71c51f2e45 [language_classification] Making directory optional on language_classifier client/test program Al 2016-01-27 03:18:53 -05:00
  • c770468d03 [expansion] Regenerated address_expansion_data.c Al 2016-01-27 03:17:59 -05:00
  • 36f52d9707 [fix] Removing feature printing Al 2016-01-26 15:34:56 -05:00
  • 239f8adec6 [docs] README updates now that the Python repo is separate Al 2016-01-26 02:40:07 -05:00
  • cffc7e1034 [rm] Removing Python bindings from this project, moving to https://github.com/openvenues/pypostal Al 2016-01-26 02:17:23 -05:00
  • 5077462754 [fix] temporary files for language classifier training Al 2016-01-26 01:42:21 -05:00
  • 426edccbf8 [language_classification] Simple accuracy-based test program for language classifier. Al 2016-01-26 01:27:55 -05:00
  • 9abbf42bf4 [language_classifier] Command-line client for language classification Al 2016-01-26 01:20:59 -05:00
  • 314b65e192 [build] Adding shuffle.c to language_classifier_train Al 2016-01-26 01:18:35 -05:00
  • ababb8f2d0 [fix] sign comparison in regularized gradient computation for logistic regression Al 2016-01-26 01:16:11 -05:00
  • ae2b839f17 [build] Adding language classifier train/test/cli programs to the build Al 2016-01-26 00:09:02 -05:00
  • 299998d8b5 [languages] Making Basque the only default in the Basque region. Al 2016-01-24 19:35:03 -05:00
  • b4dcb83e10 [fix] sets of potential languages in case phrase matches multiple dictionaries Al 2016-01-24 17:57:12 -05:00
  • b713d102d1 [languages] using whole phrase len, not first token, in disambiguation. Using single unambiguous observed default language or unambiguous observed language Al 2016-01-24 17:43:14 -05:00
  • b3e730d83f [languages] If there's a single default language, assume ambiguous abbreviations are the default Al 2016-01-24 17:15:02 -05:00
  • fffaeecfc6 [languages] Only count regional defaults when returning languages Al 2016-01-24 16:35:14 -05:00
  • b735c79326 [languages] Adding Spanish in as a secondary default in Spain to supplement regional language defaults so we're more careful in disambiguation Al 2016-01-24 16:34:23 -05:00
  • f8a0463aa0 [languages] Language disambiguation treats the national languages as non-default Al 2016-01-24 15:09:51 -05:00
  • 87aff60a7e [dictionaries] Gulch Al 2016-01-24 03:23:40 -05:00
  • f04360732c [languages] Single character cannot be sufficient to disambiguate with multiple languages (Avenue A for example) Al 2016-01-24 03:17:18 -05:00
  • cb914ae85b [dictionaries] Adding a few terms to English dictionaries for automated disambiguation in the US/Canada Al 2016-01-24 03:15:10 -05:00
  • 00ce71223f [osm] Using the default probabilities for abbreviations in ways training data Al 2016-01-24 00:53:41 -05:00
  • bab7a0f961 [osm] splitting streets (way names) on semicolons Al 2016-01-24 00:42:25 -05:00
  • 3485738c2b [fix] regional languages in French Canada Al 2016-01-24 00:20:34 -05:00
  • 7646adfc0f [osm] Adding abbreviated street names in addition to the originals Al 2016-01-23 23:23:58 -05:00
  • 67130383ce [fix] converting semicolons to commas in OSM house numbers and picking one at random Al 2016-01-23 23:16:19 -05:00
  • 1bb797f783 [fix] spacing in phrases Al 2016-01-23 21:59:49 -05:00
  • 3a8c3dfcf6 [fix] spacing in phrases at end of string Al 2016-01-23 21:51:40 -05:00
  • 78450bfad9 [fix] Spaces in abbreviation Al 2016-01-23 21:36:20 -05:00
  • 308ceb5a5f [fix] convert UTF8 slices back to unicode before using with the Python trie Al 2016-01-23 20:20:23 -05:00
  • 5eb6bb309b [fix] Only adding whitespace back into tokenized strings during abbreviation if it existed in the original string Al 2016-01-23 20:09:45 -05:00
  • d61207e95a [fix] var name Al 2016-01-23 18:01:02 -05:00
  • e44cba1d06 [fix] geonames db not required in OSM training data Al 2016-01-23 17:59:55 -05:00
  • 4f03711e60 [osm] Adding abbreviated training examples to ways language training data Al 2016-01-23 14:10:47 -05:00
  • c9fb4ee69d [osm/formatting] Dropping state more often than not, except in the US and Canada where those fields are more commonly used Al 2016-01-22 17:58:18 -05:00
  • ea9bb3f2d5 [fix] Abbreviation probabilities should only apply once, not once per dictionary. Also fixing issues where some of the abbreviations were doubled Al 2016-01-22 15:48:21 -05:00
  • f9f6558e06 [fix] simple whitespace field splits for the limited format training data (used for language classification) Al 2016-01-22 04:34:36 -05:00
  • cd1db7b288 [fix] Making sure rare components are dropped first, adding state and country back in Al 2016-01-22 04:17:19 -05:00
  • adc3a00264 [fix] var name Al 2016-01-22 04:10:16 -05:00
  • 261beffa36 [fix] Actually better to remove country and state from rare components and let them use the standard dropout probabilities Al 2016-01-22 04:00:45 -05:00
  • a6cc3d0114 [fix] Adding state to the more frequently dropped components Al 2016-01-22 03:56:38 -05:00
  • bca3dae004 [fix] state full name probabilities for limited vs. full formatted OSM training sets Al 2016-01-22 03:54:20 -05:00
  • d1cf253092 [osm/formatting] Higher probability of dropout for rare components like counties, etc. Al 2016-01-22 03:39:35 -05:00
  • 9dd965a6fa [fix] removing gazetteer configuration from disambiguation module Al 2016-01-22 03:18:18 -05:00
  • b22646ee30 [mv] Moving gazetteers into their own module Al 2016-01-22 03:15:56 -05:00
  • 5a68e7aeef [fix] import Al 2016-01-22 03:00:43 -05:00
  • 6ac72576bc [osm/formatting] Randomly abbreviating street names and venue names using all the available libpostal dictionaries. Refactoring OSM formatting into separate methods which can be individually tested. Adding override for special phrases like UK Al 2016-01-22 02:56:31 -05:00
  • f4995d4f0f [languages] Adding several different types of dictionaries for name expansion/abbreviation in OSM Al 2016-01-22 00:51:32 -05:00
  • 89aa039692 [dictionaries] Adding some Italian month abbreviations Al 2016-01-21 15:12:46 -05:00
  • 26cbb1eb8d [languages] Fixing multiple expansions in the same dictionary for Python trie, adding length for prefixes/suffixes Al 2016-01-21 04:29:14 -05:00
  • 0269d92e3d [languages] Adding canonical string and dictionary type to Python trie, modifying disambiguate_languages accordingly, and adding lists of alternate forms Al 2016-01-21 02:30:02 -05:00
  • 2e15db06dd [text] making normalize_string directly callable from Python geodata Al 2016-01-21 02:07:46 -05:00
  • 71e01e6133 [fix] prefix/suffix phrase search in Python trie search Al 2016-01-19 03:43:51 -05:00
  • 39667b73a2 [build] std=gnu99 in geodata build Al 2016-01-19 03:23:56 -05:00
  • 8b94a018e6 [languages] encoding in language disambiguation Al 2016-01-19 03:22:03 -05:00
  • 3262d2ccd3 [fix] arg count Al 2016-01-19 03:16:14 -05:00
  • 5d5d5713cc [transliteration] Regenerating transliterator scripts Al 2016-01-18 12:04:14 -05:00
  • fe8f3158f6 [fix] missing file in geodata Al 2016-01-17 22:23:44 -05:00
  • 5fd9dc7e2b [scripts] relative dirs in setup.py for geodata Al 2016-01-17 22:22:50 -05:00
  • da62ff309e [transliteration] Fixing Malayalam script Al 2016-01-17 22:15:56 -05:00
  • 5385cb71d6 [languages] Adding English dictionaries to Indonesia Al 2016-01-17 22:08:06 -05:00
  • 8030b235e6 [languages] Changing the definition in script languages so only languages that appear on street signs will be used Al 2016-01-17 22:03:41 -05:00
  • 0dfd8d6439 [language_classification] Adding script feature for any non-Latin script. Even if the script doesn't directly identify the language, it can act as a modified intercept (all Han script addresses will share the Han feature, even if we haven't seen one of the > 80k Han characters) Al 2016-01-17 21:37:45 -05:00
  • b9a3230f65 [language_classification] Removing the per-country classifier, text-based alone is doing close to 99% accuracy now Al 2016-01-17 21:13:14 -05:00
  • f808f74271 [language_classification] Automatic hyperparameter optimization using either the cross-validation set or two distinct subsets of the training set Al 2016-01-17 21:11:37 -05:00
  • af5689ee52 [fix] removing unused var Al 2016-01-17 21:00:12 -05:00
  • 7d727fc8f0 [optimization] Using adapted learning rate in stochastic gradient descent (if lambda > 0) Al 2016-01-17 20:59:47 -05:00
  • 7b300639f1 [fix] Trie prefix search tail comparison Al 2016-01-17 20:56:37 -05:00
  • 70dbfdd560 [unicode] Regenerating unicode_script_data.c Al 2016-01-17 20:53:28 -05:00
  • de240d2b94 [fix] tokenize_add_tokens respects specified length Al 2016-01-17 20:51:43 -05:00
  • 10cadc67d7 [io] matrix_read using array I/O functions Al 2016-01-17 20:40:18 -05:00
  • baba826d21 [io] Cutting down on system calls in trie_read Al 2016-01-17 20:39:19 -05:00
  • cba2acc21f [io] Sparse matrix using array I/O methods Al 2016-01-17 20:38:16 -05:00
  • 46b35c5202 [utils] Adding functions to read numeric arrays from files Al 2016-01-17 20:36:57 -05:00
  • 3d7dd8966e [languages] Using unicode script in language disambiguation in addition to dictionaries. Eliminating dependency on address_normalizer Al 2016-01-17 18:28:19 -05:00
  • fa32eacdd1 [phrases] Adding Python phrase filter from address_normalizer until a Python wrapper around libpostal's trie_search is available Al 2016-01-17 15:45:02 -05:00
  • f79a3c5bf4 [osm/polygons] Allowing polygons that GEOS claims are invalid in OSM polygon index (there were some glaring omissions from the index like the polygons for the UK or Berlin). For some reason .buffer(0) creates weird multipolygons that no longer contain their centroids, etc. and aren't useful in reverse geocoding Al 2016-01-17 15:43:21 -05:00
  • 04f251c1cc [polygons] Don't call fix_polygon (force polygon validity) by default Al 2016-01-16 21:21:27 -05:00
  • 19a5541a85 [polygons/osm] append polygon nodes by vertices that connect to each other Al 2016-01-16 21:20:49 -05:00
  • d4143c1685 [parsing] Adding an optimization to the parser API where, if the entire input is a single known geographic phrase like New York, it returns the most likely label from the training data. That way e.g. a search for 'Florida' doesn't get tagged as 'house.' This doesn't affect training, only prediction. Al 2016-01-15 20:07:21 -05:00
  • 24b4a680c3 [languages] Adding English dictionaries for Bangladesh Al 2016-01-14 13:36:07 -05:00
  • edebdf73e0 [dictionaries] Using long forms as canonical for English degrees as new language models may do some auto-abbreviating Al 2016-01-14 13:35:41 -05:00
  • 58e53cab1c [scripts] Adding the tokenize/normalize wrappers directly into the internal geodata package so pypostal can be maintained in an independent repo Al 2016-01-12 13:26:55 -05:00
  • 622dc354e7 [optimization] Adding learning rate to lazy sparse update in stochastic gradient descent Al 2016-01-12 11:02:12 -05:00
  • 79f2b7c192 [build] Removing source from libpostal shared lib Al 2016-01-12 10:31:19 -05:00
  • 6a9c1e8c6d [build] Adding trie_utils.c to address parser train/test Al 2016-01-12 10:22:30 -05:00
  • 7cc201dec3 [optimization] Moving gamma_t calculation to the header in SGD Al 2016-01-11 16:40:50 -05:00
  • 25ae5bed33 [unicode] Adding SCRIPT_INHERITED as a common script so diacritics like COMBINING CEDILLA don't break the current script and produce false word breaks Al 2016-01-11 16:39:15 -05:00
  • 3260edcf18 [math] Adding sparse dot sparse given a dense output matrix (suitable for the minibatch use case), fixing sparse dot vector Al 2016-01-11 13:55:54 -05:00
  • 736bc7c70d [config] language_classifier data dir Al 2016-01-10 03:05:36 -05:00
  • ebaedb6bcf [language_classifier] Language classifier training using L2-regularized logistic regression and stochastic gradient descent Al 2016-01-10 01:31:18 -05:00
  • 56710cce21 [language_classifier] Language classifier data set I/O Al 2016-01-10 01:22:29 -05:00