Commit Graph

  • 40641209ee [build] Build shared lib in site-packages Al 2015-12-15 12:19:40 -05:00
  • 04430f1a8e [fix] var Al 2015-12-15 10:51:56 -05:00
  • d8f731b672 [build] setup.py include/library dirs Al 2015-12-15 10:50:57 -05:00
  • faf8b00596 [python] libpostal includes Al 2015-12-15 02:56:02 -05:00
  • d2426d3777 [build] build_ext Al 2015-12-15 02:31:48 -05:00
  • cb648b63da [build] Adding include and library dirs based on autoconf prefix Al 2015-12-15 02:21:15 -05:00
  • 7cf48acd20 [fix] standard headers in new extensions Al 2015-12-15 01:18:33 -05:00
  • bec43750d5 [build] bumping Python version Al 2015-12-15 00:58:11 -05:00
  • 33fdb912b6 [build] setup.py changes for parser extension Al 2015-12-15 00:56:53 -05:00
  • c40ab06dd6 [python] Forgot expand.py Al 2015-12-15 00:56:34 -05:00
  • 842ef4526b [python] Adding address parser Python API Al 2015-12-15 00:55:41 -05:00
  • b9bf5c629e [fix] Moving address_parser_response_destroy into libpostal so caller can free Al 2015-12-15 00:52:24 -05:00
  • ab3ba249d7 [python/build] Modified install command for setup.py allowing --datadir and --prefix to be passed in. If there's a virtualenv active and nothing else is specified, install libpostal and its data files there by default Al 2015-12-14 18:21:21 -05:00
  • 7af0e2d967 [python] Adding Python bindings to the expand API Al 2015-12-14 18:18:16 -05:00
  • b59c830ba6 [fix] warning about size_t Al 2015-12-14 18:17:09 -05:00
  • 406f9c533d [api] Separating parser setup/teardown into two separate methods Al 2015-12-14 18:15:57 -05:00
  • 0f52f97621 [fix] Python 3 version of tokenize/normalize Al 2015-12-14 18:14:57 -05:00
  • 3401045b4f [fix] changing labels in Python normalize, adding a NULL check Al 2015-12-14 14:59:57 -05:00
  • 43b212a09b [fix] size_t in benchmark script Al 2015-12-14 14:57:11 -05:00
  • dc03c83bb2 [math] Adding an aligned memory allocator for vectors to help with vectorization/SIMD Al 2015-12-14 14:56:38 -05:00
  • bd1e8ecaf8 [fix] default address parser dir Al 2015-12-12 12:55:37 -05:00
  • 2950358697 [build] address_parser client now links to libpostal, adding address_parser to download script with an "all" option Al 2015-12-12 12:49:34 -05:00
  • 88836e56e1 [api] Adding parse_address implementation to the libpostal API. GeoDB and address parser are now required. Stripping punctuation from the normalized output Al 2015-12-12 12:40:19 -05:00
  • bce6ba2595 [fix] typedef Al 2015-12-12 11:58:41 -05:00
  • a8d6cc4053 [api] Moving parse_address definition into libpostal.h Al 2015-12-12 03:54:51 -05:00
  • fe4c528f26 [parser] Using different char_array for each of the potential phrases as token i Al 2015-12-12 03:23:26 -05:00
  • e6303f70f3 [fix] removing printf Al 2015-12-11 02:53:22 -05:00
  • 671dd4a5d2 [parser] Fixing possible invalid writes in training for values beginning with a separator Al 2015-12-11 02:05:05 -05:00
  • 743b74aea5 [parser] Simplifying args in address_parser_data_set_tokenize_line Al 2015-12-10 18:48:23 -05:00
  • 1d288954d7 [osm] Fixing an issue in the training data with house numbers in OSM (seen mostly in Uruguay) where a comma separated list of house numbers is entered. Al 2015-12-10 18:45:37 -05:00
  • 88b8023ac8 [fix] Bug in address parser feature extraction, can hold onto the wrong pointer Al 2015-12-10 18:42:28 -05:00
  • 3de59506ae [parser] Internal separators for parsing purposes include open/close parens, at sign, semicolon, etc. Ignore stray colons not internal to a word (as in Swedish abbreviations) Al 2015-12-10 18:08:51 -05:00
  • 71d6d3c5e1 [utils] Removing kvec and using similar implementation with pointers that can be passed around Al 2015-12-10 02:50:34 -05:00
  • ab205eff96 [utils] Adding a default small size to all arrays based on a look at malloc/realloc usage Al 2015-12-09 19:46:05 -05:00
  • 779298360c [osm] In cases with more than one official language and where the address language can be determined, use it for looking up language-specific OSM polygons Al 2015-12-09 01:00:59 -05:00
  • aeb72d7d26 [osm] Randomly select up to n components for state_district OSM boundaries. For all other fields select one name at random Al 2015-12-09 00:20:20 -05:00
  • 2c254ebc5e [fix] Belgium cities again Al 2015-12-08 23:09:28 -05:00
  • f252869671 [dictionaries] adding ste to English dictionaries Al 2015-12-08 22:29:52 -05:00
  • 69a469d9d3 [osm] Choosing a language at random in countries with multilingual addresses for the parser training data so we get some monolingual examples Al 2015-12-08 20:38:32 -05:00
  • fe37286bcf [fix] Fixes to matrix methods Al 2015-12-08 17:33:38 -05:00
  • d9d53ce17e [math] Matrix method updates Al 2015-12-08 15:39:52 -05:00
  • 48ee665e71 [scripts] Benchmark script using default options Al 2015-12-08 15:38:44 -05:00
  • 2fcc72ae07 [fix] multitoken canonical strings Al 2015-12-08 15:38:04 -05:00
  • a857138d95 [api] Adding place name expansions by default Al 2015-12-08 15:31:36 -05:00
  • beec43fe15 [expansion] regenerating expansion data Al 2015-12-08 15:28:49 -05:00
  • 35db855819 [fix] canonical index in address expansion data, should be -1 for all canonical phrases Al 2015-12-08 15:09:51 -05:00
  • e1ea2ac704 [expansion] Toponym dictionaries can apply to street names and place names Al 2015-12-08 02:10:22 -05:00
  • bfc517ae42 [fix] Belgium districts Al 2015-12-07 22:11:11 -05:00
  • cbe5cd7429 [expansion] The ambiguous expansions dictionary shouldn't add to the component bitset Al 2015-12-07 20:36:56 -05:00
  • d35f519629 [expansion] Fixing case where non-ideographic tokens like # can potentially be concatenated with surrounding tokens and should be normalized with whitespace in between Al 2015-12-07 19:18:46 -05:00
  • f5739dd42b [math] Signatures for array_exp and array_log Al 2015-12-07 18:10:04 -05:00
  • 0d8d396108 [expansion] Fixing cases like ML King where a global (all languages) expansion subsumes the specific language expansion (like English) Al 2015-12-07 18:09:20 -05:00
  • 9bab70909d [numex] Always adding a version of the string without Roman numeral expansion since many times those tokens can be ambiguous Al 2015-12-07 14:29:13 -05:00
  • f8a3081d0f [fix] city name in OSM formatting Al 2015-12-07 02:33:12 -05:00
  • a066ee9aad [math] Only reallocate on matrix_resize if needed Al 2015-12-07 01:20:16 -05:00
  • cfd0dc69f2 [parsing] Using the entire phrase as the ith word Al 2015-12-07 01:19:38 -05:00
  • 8186e2606e [dictionaries] Regenerating address expansion data file Al 2015-12-06 16:56:18 -05:00
  • 4dba0c54e4 [dictionaries] Adding state abbreviations for US, CA and AU into dictionaries Al 2015-12-06 16:47:36 -05:00
  • b25a738000 [osm] Doing more deduping in the OSM training data to avoid confusing the parser when city, state, district all have the same name Al 2015-12-06 16:14:02 -05:00
  • 44f7fd0844 [math] Matrix resize Al 2015-12-06 03:20:03 -05:00
  • dd8f8b4d7b [fix] prefix/suffix regexes Al 2015-12-05 18:41:22 -05:00
  • 5fcb6d2c30 [fix] typo Al 2015-12-05 16:23:55 -05:00
  • 3a7ba0288f [fix] .get Al 2015-12-05 16:13:15 -05:00
  • c92a6de477 [fix] name Al 2015-12-05 15:49:50 -05:00
  • 2a4210f93f [osm] Stripping standard city prefixes/suffixes e.g. Township of Al 2015-12-05 15:42:22 -05:00
  • 596c5ffdd3 [fix] Tokenized trie search Al 2015-12-05 15:21:52 -05:00
  • 24208c209f [parsing] Adding a training data derived index of complete phrases from suburb up to country. Only adding bias and word features for non phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (not meeting minimum vocab count threshold). Al 2015-12-05 14:34:06 -05:00
  • f41158b8b3 [osm] Avoid using the alternate name (e.g. Brooklyn instead of Kings County) when it is the same as city Al 2015-12-05 14:21:07 -05:00
  • 7c26317903 [fix] osm components Al 2015-12-03 19:30:15 -05:00
  • 42a8890652 [osm] Only removing local language city if there are prior components from OSM Al 2015-12-03 19:11:03 -05:00
  • ab0a4e622d [formatting] Switching back over to OpenCageData Al 2015-12-03 18:03:21 -05:00
  • 5af95ee613 [osm] Adding GeoNames abbreviated city names in a small percentage of cases to get variations like NYC, BK, SF, etc. in the training data Al 2015-12-03 18:00:05 -05:00
  • 25e89bcc41 [fix] tokenized trie search edge case where tail is stored on the space node Al 2015-12-03 12:25:21 -05:00
  • 218361f43f [osm] Removing multilinestring boundaries from OSM polygon index (often partial boundaries e.g. France-Germany) Al 2015-12-03 00:51:09 -05:00
  • 43287db90a [normalization/phrases] Fixing a bug which occurs with an already-separated elision Al 2015-12-02 16:04:39 -05:00
  • 87c04b4d37 [fix] path in setup.py Al 2015-12-02 14:22:07 -05:00
  • 09a3e2ab64 [fix] pip install command Al 2015-12-02 13:43:57 -05:00
  • 746b5d0f34 [fix] transliterate using string_equals Al 2015-12-02 13:09:43 -05:00
  • d0aaff1482 [utils] string_equals with NULL check Al 2015-12-01 13:09:29 -05:00
  • f322ae0a1c [build] adding shuffle.c to Makefile rule Al 2015-12-01 11:28:33 -05:00
  • b94264b745 [parser] Forgot to add shuffle.h/.c Al 2015-12-01 11:25:28 -05:00
  • 116fe857db [parser] gshuf (Mac equivalent of shuf) is quite a bit slower than shuf, so removing it. Need to train on Linux unless a better alternative is found for shuffling large files on Mac Al 2015-12-01 11:24:38 -05:00
  • 8484d4fffd [fix] venue names should be removed probabilistically in the training data, giving neighborhoods a slightly better chance of being included Al 2015-11-30 23:28:12 -05:00
  • 6ef40c1769 [fix] dupe checking Al 2015-11-30 18:43:11 -05:00
  • af170de019 [fix] Smaller probabilities on adding neighborhoods and admin polygons, eliminating duplicates on the row level Al 2015-11-30 18:35:31 -05:00
  • 621fd79002 [fix] var Al 2015-11-30 18:19:56 -05:00
  • b430fb7657 [osm/formatting] Adding pick random name logic to neighborhoods as well, getting rid of drop probabilities as they're covered elsewhere, adding several forms of venue names to the training data Al 2015-11-30 18:10:18 -05:00
  • d4b6450f19 [formatting] Not applying template replacements from address formatting by default Al 2015-11-30 16:11:13 -05:00
  • 839a12b212 [osm/formatting] Changing drop probabilities and doing it in random order Al 2015-11-30 15:27:35 -05:00
  • 5f13041140 [parsing/build] Makefile changes for address parser Al 2015-11-30 14:51:43 -05:00
  • 4ca911baf8 [parsing] Adding a command-line client (with history) to test address parsing Al 2015-11-30 14:51:01 -05:00
  • 89677d94a3 [parsing] Initial commit of the address parser, training/testing, feature function, I/O Al 2015-11-30 14:48:13 -05:00
  • e62eb1e697 [math] Matrix file I/O Al 2015-11-30 12:52:57 -05:00
  • 5682c347ac [fix] close file handle Al 2015-11-30 12:51:13 -05:00
  • 9a8ba14887 [osm/formatting] Adding per-field drop probabilities to OSM training data to make some fields more likely to be dropped, although it might create more training data Al 2015-11-30 11:10:07 -05:00
  • c8e4602d4c [fix] Neighborhoods reverse geocoder discriminates between OSM matched with Zetashapes and OSM matched with Quattroshapes Al 2015-11-30 10:59:50 -05:00
  • feab77970b [cli] Adding antirez's linenoise for command-line interfaces Al 2015-11-29 11:28:31 -05:00
  • 15d9e00121 [osm/formatting] Adding in more ISO alpha-3 codes for countries in the training data Al 2015-11-28 14:08:07 -05:00
  • d3040036ec [fix] moving separator definitions Al 2015-11-28 13:53:13 -05:00
  • 66778737ff [fix] non-local language states Al 2015-11-28 13:48:59 -05:00