libpostal

tommy/libpostal

40641209ee [build] Build shared lib in site-packages Al 2015-12-15 12:19:40 -05:00
04430f1a8e [fix] var Al 2015-12-15 10:51:56 -05:00
d8f731b672 [build] setup.py include/library dirs Al 2015-12-15 10:50:57 -05:00
faf8b00596 [python] libpostal includes Al 2015-12-15 02:56:02 -05:00
d2426d3777 [build] build_ext Al 2015-12-15 02:31:48 -05:00
cb648b63da [build] Adding include and library dirs based on autoconf prefix Al 2015-12-15 02:21:15 -05:00
7cf48acd20 [fix] standard headers in new extensions Al 2015-12-15 01:18:33 -05:00
bec43750d5 [build] bumping Python version Al 2015-12-15 00:58:11 -05:00
33fdb912b6 [build] setup.py changes for parser extension Al 2015-12-15 00:56:53 -05:00
c40ab06dd6 [python] Forgot expand.py Al 2015-12-15 00:56:34 -05:00
842ef4526b [python] Adding address parser Python API Al 2015-12-15 00:55:41 -05:00
b9bf5c629e [fix] Moving address_parser_response_destroy into libpostal so caller can free Al 2015-12-15 00:52:24 -05:00
ab3ba249d7 [python/build] Modified install command for setup.py allowing --datadir and --prefix to be passed in. If there's a virtualenv active and nothing else is specified, install libpostal and its data files there by default Al 2015-12-14 18:21:21 -05:00
7af0e2d967 [python] Adding Python bindings to the expand API Al 2015-12-14 18:18:16 -05:00
b59c830ba6 [fix] warning about size_t Al 2015-12-14 18:17:09 -05:00
406f9c533d [api] Separating parser setup/teardown into two separate methods Al 2015-12-14 18:15:57 -05:00
0f52f97621 [fix] Python 3 version of tokenize/normalize Al 2015-12-14 18:14:57 -05:00
3401045b4f [fix] changing labels in Python normalize, adding a NULL check Al 2015-12-14 14:59:57 -05:00
43b212a09b [fix] size_t in benchmark script Al 2015-12-14 14:57:11 -05:00
dc03c83bb2 [math] Adding an aligned memory allocator for vectors to help with vectorization/SIMD Al 2015-12-14 14:56:38 -05:00
bd1e8ecaf8 [fix] default address parser dir Al 2015-12-12 12:55:37 -05:00
2950358697 [build] address_parser client now links to libpostal, adding address_parser to download script with an "all" option Al 2015-12-12 12:49:34 -05:00
88836e56e1 [api] Adding parse_address implementation to the libpostal API. GeoDB and address parser are now required. Stripping punctuation from the normalized output Al 2015-12-12 12:40:19 -05:00
bce6ba2595 [fix] typedef Al 2015-12-12 11:58:41 -05:00
a8d6cc4053 [api] Moving parse_address definition into libpostal.h Al 2015-12-12 03:54:51 -05:00
fe4c528f26 [parser] Using different char_array for each of the potential phrases as token i Al 2015-12-12 03:23:26 -05:00
e6303f70f3 [fix] removing printf Al 2015-12-11 02:53:22 -05:00
671dd4a5d2 [parser] Fixing possible invalid writes in training for values beginning with a separator Al 2015-12-11 02:05:05 -05:00
743b74aea5 [parser] Simplifying args in address_parser_data_set_tokenize_line Al 2015-12-10 18:48:23 -05:00
1d288954d7 [osm] Fixing an issue in the training data with house numbers in OSM (seen mostly in Uruguay) where a comma separated list of house numbers is entered. Al 2015-12-10 18:45:37 -05:00
88b8023ac8 [fix] Bug in address parser feature extraction, can hold onto the wrong pointer Al 2015-12-10 18:42:28 -05:00
3de59506ae [parser] Internal separators for parsing purposes include open/close parens, at sign, semicolon, etc. Ignore stray colons not internal to a word (as in Swedish abbreviations) Al 2015-12-10 18:08:51 -05:00
71d6d3c5e1 [utils] Removing kvec and using similar implementation with pointers that can be passed around Al 2015-12-10 02:50:34 -05:00
ab205eff96 [utils] Adding a default small size to all arrays based on a look at malloc/realloc usage Al 2015-12-09 19:46:05 -05:00
779298360c [osm] In cases with more than one official language and where the address language can be determined, use it for looking up language-specific OSM polygons Al 2015-12-09 01:00:59 -05:00
aeb72d7d26 [osm] Randomly select up to n components for state_district OSM boundaries. For all other fields select one name at random Al 2015-12-09 00:20:20 -05:00
2c254ebc5e [fix] Belgium cities again Al 2015-12-08 23:09:28 -05:00
f252869671 [dictionaries] adding ste to English dictionaries Al 2015-12-08 22:29:52 -05:00
69a469d9d3 [osm] Choosing a language at random in countries with multilingual addresses for the parser training data so we get some monolingual examples Al 2015-12-08 20:38:32 -05:00
fe37286bcf [fix] Fixes to matrix methods Al 2015-12-08 17:33:38 -05:00
d9d53ce17e [math] Matrix method updates Al 2015-12-08 15:39:52 -05:00
48ee665e71 [scripts] Benchmark script using default options Al 2015-12-08 15:38:44 -05:00
2fcc72ae07 [fix] multitoken canonical strings Al 2015-12-08 15:38:04 -05:00
a857138d95 [api] Adding place name expansions by default Al 2015-12-08 15:31:36 -05:00
beec43fe15 [expansion] regenerating expansion data Al 2015-12-08 15:28:49 -05:00
35db855819 [fix] canonical index in address expansion data, should be -1 for all canonical phrases Al 2015-12-08 15:09:51 -05:00
e1ea2ac704 [expansion] Toponym dictionaries can apply to street names and place names Al 2015-12-08 02:10:22 -05:00
bfc517ae42 [fix] Belgium districts Al 2015-12-07 22:11:11 -05:00
cbe5cd7429 [expansion] The ambiguous expansions dictionary shouldn't add to the component bitset Al 2015-12-07 20:36:56 -05:00
d35f519629 [expansion] Fixing case where non-ideographic tokens like # can potentially be concatenated with surrounding tokens and should be normalized with whitespace in between Al 2015-12-07 19:18:46 -05:00
f5739dd42b [math] Signatures for array_exp and array_log Al 2015-12-07 18:10:04 -05:00
0d8d396108 [expansion] Fixing cases like ML King where a global (all languages) expansion subsumes the specific language expansion (like English) Al 2015-12-07 18:09:20 -05:00
9bab70909d [numex] Always adding a version of the string without Roman numeral expansion since many times those tokens can be ambiguous Al 2015-12-07 14:29:13 -05:00
f8a3081d0f [fix] city name in OSM formatting Al 2015-12-07 02:33:12 -05:00
a066ee9aad [math] Only reallocate on matrix_resize if needed Al 2015-12-07 01:20:16 -05:00
cfd0dc69f2 [parsing] Using the entire phrase as the ith word Al 2015-12-07 01:19:38 -05:00
8186e2606e [dictionaries] Regenerating address expansion data file Al 2015-12-06 16:56:18 -05:00
4dba0c54e4 [dictionaries] Adding state abbreviations for US, CA and AU into dictionaries Al 2015-12-06 16:47:36 -05:00
b25a738000 [osm] Doing more deduping in the OSM training data to avoid confusing the parser when city, state, district all have the same name Al 2015-12-06 16:14:02 -05:00
44f7fd0844 [math] Matrix resize Al 2015-12-06 03:20:03 -05:00
dd8f8b4d7b [fix] prefix/suffix regexes Al 2015-12-05 18:41:22 -05:00
5fcb6d2c30 [fix] typo Al 2015-12-05 16:23:55 -05:00
3a7ba0288f [fix] .get Al 2015-12-05 16:13:15 -05:00
c92a6de477 [fix] name Al 2015-12-05 15:49:50 -05:00
2a4210f93f [osm] Stripping standard city prefixes/suffixes e.g. Township of Al 2015-12-05 15:42:22 -05:00
596c5ffdd3 [fix] Tokenized trie search Al 2015-12-05 15:21:52 -05:00
24208c209f [parsing] Adding a training data derived index of complete phrases from suburb up to country. Only adding bias and word features for non phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (not meeting minimum vocab count threshold). Al 2015-12-05 14:34:06 -05:00
f41158b8b3 [osm] Avoid using the alternate name (e.g. Brooklyn instead of Kings County) when it is the same as city Al 2015-12-05 14:21:07 -05:00
7c26317903 [fix] osm components Al 2015-12-03 19:30:15 -05:00
42a8890652 [osm] Only removing local language city if there are prior components from OSM Al 2015-12-03 19:11:03 -05:00
ab0a4e622d [formatting] Switching back over to OpenCageData Al 2015-12-03 18:03:21 -05:00
5af95ee613 [osm] Adding GeoNames abbreviated city names in a small percentage of cases to get variations like NYC, BK, SF, etc. in the training data Al 2015-12-03 18:00:05 -05:00
25e89bcc41 [fix] tokenized trie search edge case where tail is stored on the space node Al 2015-12-03 12:25:21 -05:00
218361f43f [osm] Removing multilinestring boundaries from OSM polygon index (often partial boundaries e.g. France-Germany) Al 2015-12-03 00:51:09 -05:00
43287db90a [normalization/phrases] Fixing a bug which occurs with an already-separated elision Al 2015-12-02 16:04:39 -05:00
87c04b4d37 [fix] path in setup.py Al 2015-12-02 14:22:07 -05:00
09a3e2ab64 [fix] pip install command Al 2015-12-02 13:43:57 -05:00
746b5d0f34 [fix] transliterate using string_equals Al 2015-12-02 13:09:43 -05:00
d0aaff1482 [utils] string_equals with NULL check Al 2015-12-01 13:09:29 -05:00
f322ae0a1c [build] adding shuffle.c to Makefile rule Al 2015-12-01 11:28:33 -05:00
b94264b745 [parser] Forgot to add shuffle.h/.c Al 2015-12-01 11:25:28 -05:00
116fe857db [parser] gshuf (Mac equivalent of shuf) is quite a bit slower than shuf, so removing it. Need to train on Linux unless a better alternative is found for shuffling large files on Mac Al 2015-12-01 11:24:38 -05:00
8484d4fffd [fix] venue names should be removed probabilistically in the training data, giving neighborhoods a slightly better chance of being included Al 2015-11-30 23:28:12 -05:00
6ef40c1769 [fix] dupe checking Al 2015-11-30 18:43:11 -05:00
af170de019 [fix] Smaller probabilities on adding neighborhoods and admin polygons, eliminating duplicates on the row level Al 2015-11-30 18:35:31 -05:00
621fd79002 [fix] var Al 2015-11-30 18:19:56 -05:00
b430fb7657 [osm/formatting] Adding pick random name logic to neighborhoods as well, getting rid of drop probabilities as they're covered elsewhere, adding several forms of venue names to the training data Al 2015-11-30 18:10:18 -05:00
d4b6450f19 [formatting] Not applying template replacements from address formatting by default Al 2015-11-30 16:11:13 -05:00
839a12b212 [osm/formatting] Changing drop probabilities and doing it in random order Al 2015-11-30 15:27:35 -05:00
5f13041140 [parsing/build] Makefile changes for address parser Al 2015-11-30 14:51:43 -05:00
4ca911baf8 [parsing] Adding a command-line client (with history) to test address parsing Al 2015-11-30 14:51:01 -05:00
89677d94a3 [parsing] Initial commit of the address parser, training/testing, feature function, I/O Al 2015-11-30 14:48:13 -05:00
e62eb1e697 [math] Matrix file I/O Al 2015-11-30 12:52:57 -05:00
5682c347ac [fix] close file handle Al 2015-11-30 12:51:13 -05:00
9a8ba14887 [osm/formatting] Adding per-field drop probabilities to OSM training data to make some fields more likely to be dropped, although it might create more training data Al 2015-11-30 11:10:07 -05:00
c8e4602d4c [fix] Neighborhoods reverse geocoder discriminates between OSM matched with Zetashapes and OSM matched with Quattroshapes Al 2015-11-30 10:59:50 -05:00
feab77970b [cli] Adding antirez's linenoise for command-line interfaces Al 2015-11-29 11:28:31 -05:00
15d9e00121 [osm/formatting] Adding in more ISO alpha-3 codes for countries in the training data Al 2015-11-28 14:08:07 -05:00
d3040036ec [fix] moving separator definitions Al 2015-11-28 13:53:13 -05:00
66778737ff [fix] non-local language states Al 2015-11-28 13:48:59 -05:00
