d8f731b672[build] setup.py include/library dirs
Al
2015-12-15 10:50:57 -05:00
faf8b00596[python] libpostal includes
Al
2015-12-15 02:56:02 -05:00
d2426d3777[build] build_ext
Al
2015-12-15 02:31:48 -05:00
cb648b63da[build] Adding include and library dirs based on autoconf prefix
Al
2015-12-15 02:21:15 -05:00
7cf48acd20[fix] standard headers in new extensions
Al
2015-12-15 01:18:33 -05:00
bec43750d5[build] bumping Python version
Al
2015-12-15 00:58:11 -05:00
33fdb912b6[build] setup.py changes for parser extension
Al
2015-12-15 00:56:53 -05:00
c40ab06dd6[python] Forgot expand.py
Al
2015-12-15 00:56:34 -05:00
842ef4526b[python] Adding address parser Python API
Al
2015-12-15 00:55:41 -05:00
b9bf5c629e[fix] Moving address_parser_response_destroy into libpostal so caller can free
Al
2015-12-15 00:52:24 -05:00
ab3ba249d7[python/build] Modified install command for setup.py allowing --datadir and --prefix to be passed in. If there's a virtualenv active and nothing else is specified, install libpostal and its data files there by default
Al
2015-12-14 18:21:21 -05:00
7af0e2d967[python] Adding Python bindings to the expand API
Al
2015-12-14 18:18:16 -05:00
b59c830ba6[fix] warning about size_t
Al
2015-12-14 18:17:09 -05:00
406f9c533d[api] Separating parser setup/teardown into two separate methods
Al
2015-12-14 18:15:57 -05:00
0f52f97621[fix] Python 3 version of tokenize/normalize
Al
2015-12-14 18:14:57 -05:00
3401045b4f[fix] changing labels in Python normalize, adding a NULL check
Al
2015-12-14 14:59:57 -05:00
43b212a09b[fix] size_t in benchmark script
Al
2015-12-14 14:57:11 -05:00
dc03c83bb2[math] Adding an aligned memory allocator for vectors to help with vectorization/SIMD
Al
2015-12-14 14:56:38 -05:00
bd1e8ecaf8[fix] default address parser dir
Al
2015-12-12 12:55:37 -05:00
2950358697[build] address_parser client now links to libpostal, adding address_parser to download script with an "all" option
Al
2015-12-12 12:49:34 -05:00
88836e56e1[api] Adding parse_address implementation to the libpostal API. GeoDB and address parser are now required. Stripping punctuation from the normalized output
Al
2015-12-12 12:40:19 -05:00
bce6ba2595[fix] typedef
Al
2015-12-12 11:58:41 -05:00
a8d6cc4053[api] Moving parse_address definition into libpostal.h
Al
2015-12-12 03:54:51 -05:00
fe4c528f26[parser] Using different char_array for each of the potential phrases as token i
Al
2015-12-12 03:23:26 -05:00
e6303f70f3[fix] removing printf
Al
2015-12-11 02:53:22 -05:00
671dd4a5d2[parser] Fixing possible invalid writes in training for values beginning with a separator
Al
2015-12-11 02:05:05 -05:00
743b74aea5[parser] Simplifying args in address_parser_data_set_tokenize_line
Al
2015-12-10 18:48:23 -05:00
1d288954d7[osm] Fixing an issue in the training data with house numbers in OSM (seen mostly in Uruguay) where a comma-separated list of house numbers is entered.
Al
2015-12-10 18:45:37 -05:00
88b8023ac8[fix] Bug in address parser feature extraction, can hold onto the wrong pointer
Al
2015-12-10 18:42:28 -05:00
3de59506ae[parser] Internal separators for parsing purposes include open/close parens, at sign, semicolon, etc. Ignore stray colons not internal to a word (as in Swedish abbreviations)
Al
2015-12-10 18:08:51 -05:00
71d6d3c5e1[utils] Removing kvec and using similar implementation with pointers that can be passed around
Al
2015-12-10 02:50:34 -05:00
ab205eff96[utils] Adding a default small size to all arrays based on a look at malloc/realloc usage
Al
2015-12-09 19:46:05 -05:00
779298360c[osm] In cases with more than one official language and where the address language can be determined, use it for looking up language-specific OSM polygons
Al
2015-12-09 01:00:59 -05:00
aeb72d7d26[osm] Randomly select up to n components for state_district OSM boundaries. For all other fields select one name at random
Al
2015-12-09 00:20:20 -05:00
2c254ebc5e[fix] Belgium cities again
Al
2015-12-08 23:09:28 -05:00
f252869671[dictionaries] adding ste to English dictionaries
Al
2015-12-08 22:29:52 -05:00
69a469d9d3[osm] Choosing a language at random in countries with multilingual addresses for the parser training data so we get some monolingual examples
Al
2015-12-08 20:38:32 -05:00
fe37286bcf[fix] Fixes to matrix methods
Al
2015-12-08 17:33:38 -05:00
d9d53ce17e[math] Matrix method updates
Al
2015-12-08 15:39:52 -05:00
48ee665e71[scripts] Benchmark script using default options
Al
2015-12-08 15:38:44 -05:00
2fcc72ae07[fix] multitoken canonical strings
Al
2015-12-08 15:38:04 -05:00
a857138d95[api] Adding place name expansions by default
Al
2015-12-08 15:31:36 -05:00
beec43fe15[expansion] regenerating expansion data
Al
2015-12-08 15:28:49 -05:00
35db855819[fix] canonical index in address expansion data, should be -1 for all canonical phrases
Al
2015-12-08 15:09:51 -05:00
e1ea2ac704[expansion] Toponym dictionaries can apply to street names and place names
Al
2015-12-08 02:10:22 -05:00
bfc517ae42[fix] Belgium districts
Al
2015-12-07 22:11:11 -05:00
cbe5cd7429[expansion] The ambiguous expansions dictionary shouldn't add to the component bitset
Al
2015-12-07 20:36:56 -05:00
d35f519629[expansion] Fixing case where non-ideographic tokens like # can potentially be concatenated with surrounding tokens and should be normalized with whitespace in between
Al
2015-12-07 19:18:46 -05:00
f5739dd42b[math] Signatures for array_exp and array_log
Al
2015-12-07 18:10:04 -05:00
0d8d396108[expansion] Fixing cases like ML King where a global (all languages) expansion subsumes the specific language expansion (like English)
Al
2015-12-07 18:09:20 -05:00
9bab70909d[numex] Always adding a version of the string without Roman numeral expansion since many times those tokens can be ambiguous
Al
2015-12-07 14:29:13 -05:00
f8a3081d0f[fix] city name in OSM formatting
Al
2015-12-07 02:33:12 -05:00
a066ee9aad[math] Only reallocate on matrix_resize if needed
Al
2015-12-07 01:20:16 -05:00
cfd0dc69f2[parsing] Using the entire phrase as the ith word
Al
2015-12-07 01:19:38 -05:00
8186e2606e[dictionaries] Regenerating address expansion data file
Al
2015-12-06 16:56:18 -05:00
4dba0c54e4[dictionaries] Adding state abbreviations for US, CA and AU into dictionaries
Al
2015-12-06 16:47:36 -05:00
b25a738000[osm] Doing more deduping in the OSM training data to avoid confusing the parser when city, state, district all have the same name
Al
2015-12-06 16:14:02 -05:00
44f7fd0844[math] Matrix resize
Al
2015-12-06 03:20:03 -05:00
dd8f8b4d7b[fix] prefix/suffix regexes
Al
2015-12-05 18:41:22 -05:00
5fcb6d2c30[fix] typo
Al
2015-12-05 16:23:55 -05:00
3a7ba0288f[fix] .get
Al
2015-12-05 16:13:15 -05:00
c92a6de477[fix] name
Al
2015-12-05 15:49:50 -05:00
2a4210f93f[osm] Stripping standard city prefixes/suffixes e.g. Township of
Al
2015-12-05 15:42:22 -05:00
596c5ffdd3[fix] Tokenized trie search
Al
2015-12-05 15:21:52 -05:00
24208c209f[parsing] Adding a training-data-derived index of complete phrases from suburb up to country. Only adding bias and word features for non-phrases, using UNKNOWN_WORD and UNKNOWN_NUMERIC for infrequent tokens (not meeting minimum vocab count threshold).
Al
2015-12-05 14:34:06 -05:00
f41158b8b3[osm] Avoid using the alternate name (e.g. Brooklyn instead of Kings County) when it is the same as city
Al
2015-12-05 14:21:07 -05:00
7c26317903[fix] osm components
Al
2015-12-03 19:30:15 -05:00
42a8890652[osm] Only removing local language city if there are prior components from OSM
Al
2015-12-03 19:11:03 -05:00
ab0a4e622d[formatting] Switching back over to OpenCageData
Al
2015-12-03 18:03:21 -05:00
5af95ee613[osm] Adding GeoNames abbreviated city names in a small percentage of cases to get variations like NYC, BK, SF, etc. in the training data
Al
2015-12-03 18:00:05 -05:00
25e89bcc41[fix] tokenized trie search edge case where tail is stored on the space node
Al
2015-12-03 12:25:21 -05:00
218361f43f[osm] Removing multilinestring boundaries from OSM polygon index (often partial boundaries e.g. France-Germany)
Al
2015-12-03 00:51:09 -05:00
43287db90a[normalization/phrases] Fixing a bug which occurs with an already-separated elision
Al
2015-12-02 16:04:39 -05:00
87c04b4d37[fix] path in setup.py
Al
2015-12-02 14:22:07 -05:00
09a3e2ab64[fix] pip install command
Al
2015-12-02 13:43:57 -05:00
746b5d0f34[fix] transliterate using string_equals
Al
2015-12-02 13:09:43 -05:00
d0aaff1482[utils] string_equals with NULL check
Al
2015-12-01 13:09:29 -05:00
f322ae0a1c[build] adding shuffle.c to Makefile rule
Al
2015-12-01 11:28:33 -05:00
b94264b745[parser] Forgot to add shuffle.h/.c
Al
2015-12-01 11:25:28 -05:00
116fe857db[parser] gshuf (Mac equivalent of shuf) is quite a bit slower than shuf, so removing it. Need to train on Linux unless a better alternative is found for shuffling large files on Mac
Al
2015-12-01 11:24:38 -05:00
8484d4fffd[fix] venue names should be removed probabilistically in the training data, giving neighborhoods a slightly better chance of being included
Al
2015-11-30 23:28:12 -05:00
6ef40c1769[fix] dupe checking
Al
2015-11-30 18:43:11 -05:00
af170de019[fix] Smaller probabilities on adding neighborhoods and admin polygons, eliminating duplicates on the row level
Al
2015-11-30 18:35:31 -05:00
b430fb7657[osm/formatting] Adding pick random name logic to neighborhoods as well, getting rid of drop probabilities as they're covered elsewhere, adding several forms of venue names to the training data
Al
2015-11-30 18:10:18 -05:00
d4b6450f19[formatting] Not applying template replacements from address formatting by default
Al
2015-11-30 16:11:13 -05:00
839a12b212[osm/formatting] Changing drop probabilities and doing it in random order
Al
2015-11-30 15:27:35 -05:00
5f13041140[parsing/build] Makefile changes for address parser
Al
2015-11-30 14:51:43 -05:00
4ca911baf8[parsing] Adding a command-line client (with history) to test address parsing
Al
2015-11-30 14:51:01 -05:00
89677d94a3[parsing] Initial commit of the address parser, training/testing, feature function, I/O
Al
2015-11-30 14:48:13 -05:00
e62eb1e697[math] Matrix file I/O
Al
2015-11-30 12:52:57 -05:00
5682c347ac[fix] close file handle
Al
2015-11-30 12:51:13 -05:00
9a8ba14887[osm/formatting] Adding per-field drop probabilities to OSM training data to make some fields more likely to be dropped, although it might create more training data
Al
2015-11-30 11:10:07 -05:00
c8e4602d4c[fix] Neighborhoods reverse geocoder discriminates between OSM matched with Zetashapes and OSM matched with Quattroshapes
Al
2015-11-30 10:59:50 -05:00
feab77970b[cli] Adding antirez's linenoise for command-line interfaces
Al
2015-11-29 11:28:31 -05:00
15d9e00121[osm/formatting] Adding in more ISO alpha-3 codes for countries in the training data
Al
2015-11-28 14:08:07 -05:00
d3040036ec[fix] moving separator definitions
Al
2015-11-28 13:53:13 -05:00
66778737ff[fix] non-local language states
Al
2015-11-28 13:48:59 -05:00