Al
|
3cb513a8f2
|
[utils] hash_get is no longer a string-only function, can be used for generic hashtables
|
2017-04-02 23:28:17 -04:00 |
|
Al
|
95e39ad91c
|
[utils] removing default chunk size from address_parser_train
|
2017-04-02 23:26:51 -04:00 |
|
Al
|
a4431dbb27
|
[classification] removing regularization update from gradient computation in logistic regression, as that's now handled by the optimizer
|
2017-04-02 14:32:14 -04:00 |
|
Al
|
64c049730a
|
[classification] flexible logistic regression trainer that can handle either SGD (with either L1 or L2) or FTRL as optimizers
|
2017-04-02 14:30:14 -04:00 |
|
Al
|
cf88bc7f65
|
[optimization] implemented Google's FTRL-Proximal, adapted for the multiclass/multinomial case. It is L1 and L2 regularized, and should encourage sparsity with the L1 penalty while being robust to collinearity of features due to the L2 penalty. Ref: https://research.google.com/pubs/archive/41159.pdf
|
2017-04-02 14:28:25 -04:00 |
|
Al
|
ed05aaabb1
|
[utils] adding default chunk size to shuffle.h
|
2017-04-02 13:51:45 -04:00 |
|
Al
|
96e1ca5e89
|
[utils] sparse_matrix_add_unique_columns_alias, adds the actual column indices to hashtable/array and aliases those in the table from 1 to N (where N is the number of unique columns in this batch). This way it's compatible with smaller matrices of batch weights.
|
2017-04-02 13:48:46 -04:00 |
|
Al
|
a2563a4dcd
|
[optimization] new sgd_trainer struct to manage weights in stochastic gradient descent. Allows L1 or L2 regularization, with cumulative penalties instead of exponential decay. SGD using L1 regularization encourages sparsity and can produce a sparse matrix after training rather than a dense one
|
2017-04-02 13:44:59 -04:00 |
|
Al
|
19fe084974
|
[utils] adding non-branching sign functions
|
2017-04-02 13:41:57 -04:00 |
|
Al
|
74a281e332
|
[dictionaries] more abbreviations for MLK
|
2017-04-01 00:54:14 -04:00 |
|
Al
|
7f30fb8e38
|
[openaddresses] add OSM boundaries to King, NC
|
2017-03-31 21:13:32 -04:00 |
|
Al
|
b52f137b5d
|
[openaddresses] adding units to Chelan County, WA, adding Island County, WA
|
2017-03-31 18:08:43 -04:00 |
|
Al
|
6ec4c1fdc9
|
[openaddresses] adding units to city of Columbia, MO
|
2017-03-31 17:44:04 -04:00 |
|
Al
|
f349607412
|
[openaddresses] adding units in Boone County, MO
|
2017-03-31 17:27:35 -04:00 |
|
Al
|
bd8de15886
|
[openaddresses] OSM boundaries no longer needed in Alamance County, NC. Ignore city when it's {ALAMANCECOUNTY, COUNTY}
|
2017-03-31 17:24:45 -04:00 |
|
Al
|
267be6c05c
|
[data] 12 worker pool in data download instead of 10 to download the new parser in one shot
|
2017-03-31 15:52:17 -04:00 |
|
Al
|
7f8c2f0ad3
|
[fix] remove bloom.c from libpostal sources
|
2017-03-31 15:22:48 -04:00 |
|
Al
|
a64c81b45b
|
[data/models] updating libpostal download script to download new models. The simple data files are stored by libpostal major version, whereas the models are stored by the version of the training data they used. A file called "latest" is stored in S3 to indicate the latest version of the model and checked on make
|
2017-03-31 13:35:07 -04:00 |
|
Al
|
6d4c7984df
|
[api] doing this now since we're bumping a major version. Using a libpostal prefix for all public header functions and definitions
|
2017-03-31 03:35:51 -04:00 |
|
Al
|
f8d7bdf364
|
[build] defining libpostal .so version in configure.ac, removing dependency on mmap and sparkey
|
2017-03-31 03:24:19 -04:00 |
|
Al
|
f7b695c642
|
[build] add /usr/local/include as default include path for test Makefile as well
|
2017-03-30 15:57:17 -04:00 |
|
Al
|
ace40bf0aa
|
[rm] removing ax_blas.m4
|
2017-03-30 15:53:53 -04:00 |
|
Al
|
27426e90d0
|
[build] wrap CBLAS check in a check for the cblas.h header
|
2017-03-30 15:53:38 -04:00 |
|
Al
|
db726d5ce1
|
[openaddresses] OSM boundaries no longer needed in Allen County, IN, needed in Clark County, NV
|
2017-03-30 02:10:12 -04:00 |
|
Al
|
979f866678
|
[openaddresses] Ignoring cities starting with UT in St Louis County, MN
|
2017-03-30 02:01:06 -04:00 |
|
Al
|
8980131f3f
|
[openaddresses] adding Mexico countrywide, removing add_osm_boundaries from New Orleans
|
2017-03-30 01:57:42 -04:00 |
|
Al
|
65fadbeea3
|
[fix] add CBLAS_LIBS in the test Makefile
|
2017-03-29 21:38:54 -04:00 |
|
Al
|
f7889bf138
|
[fix] removing WIP
|
2017-03-29 20:46:56 -04:00 |
|
Al
|
d757baaf56
|
[fix] HAVE_CBLAS in matrix.h, memcpy needs to use sizeof(type)
|
2017-03-29 19:01:13 -04:00 |
|
Al
|
a7c9b919e9
|
[build] trying a CBLAS-specific macro that doesn't rope in Fortran
|
2017-03-29 18:57:32 -04:00 |
|
Al
|
9636ef6393
|
[fix] typo
|
2017-03-29 18:55:03 -04:00 |
|
Al
|
3e051a30da
|
[places] allowing training examples in the US and Canada with no city 5% of the time so the road=>{county,state} transition is more likely
|
2017-03-29 16:43:08 -04:00 |
|
Al
|
b90a703746
|
[openaddresses] adding units to Denver
|
2017-03-29 13:50:32 -04:00 |
|
Al
|
f0d37cc56d
|
[openaddresses] adding Piemonte
|
2017-03-29 13:06:01 -04:00 |
|
Al
|
0bd1bdb6f2
|
[test] adding Brazil and Romania parses for demo
|
2017-03-29 13:03:05 -04:00 |
|
Al
|
03ceb18a41
|
[test] adding US tests for parser demo queries
|
2017-03-28 15:04:00 -04:00 |
|
Al
|
22d97a0a35
|
[openaddresses] adding Belmont County, OH
|
2017-03-28 14:46:46 -04:00 |
|
Al
|
5ac891c484
|
[openaddresses] add McKinney, TX
|
2017-03-28 13:03:42 -04:00 |
|
Al
|
40f594e3be
|
[dictionaries] adding Dep. as an abbreviation for departamento in Spanish
|
2017-03-27 10:03:22 -04:00 |
|
Al
|
c0bded412c
|
[openaddresses] Sibley County, MN
|
2017-03-27 09:59:24 -04:00 |
|
Al
|
217de3a8a2
|
[addresses] adding the ability to hyphenate the generated unit/floor numbers, either for ranges or simple hyphenated numbers, including hyphenated variants of the letter + number or number + letter forms. Implementing for English but something similar can be done in the other configs.
|
2017-03-27 01:48:32 -04:00 |
|
Al
|
56f00250c2
|
[addresses] allowing number/ordinal spellout in the Trappa/Trappor Upp syntax in Swedish, didn't make it into the release
|
2017-03-26 20:56:43 -04:00 |
|
Al
|
61d008f349
|
[test] making some of the test cases simpler/easier so they don't fail. In general this should just be for examples that are/are going to be in the docs. Improving overall aggregate statistics like held-out accuracy over time is preferable to worrying about one individual test failure.
|
2017-03-26 20:27:32 -04:00 |
|
Al
|
81c59e116a
|
[countries] use ISO 3166 country name 5% of the time for general addresses, 10% of the time for OpenAddresses. Gives the parser examples of names like "Korea, Republic of" in #168
|
2017-03-25 19:41:59 -04:00 |
|
Al
|
ecfa6855e7
|
[openaddresses] adding Korea countrywide dataset
|
2017-03-25 16:49:16 -04:00 |
|
Al
|
8e4b909013
|
[formatting] adding postcode before city insertion for former USSR countries
|
2017-03-25 01:12:07 -04:00 |
|
Al
|
9fccfa0997
|
[places] increase state_district probability in India
|
2017-03-25 01:01:15 -04:00 |
|
Al
|
3aaa628b25
|
[test] add LaSalle, Montréal tests
|
2017-03-21 14:24:13 -04:00 |
|
Al
|
1f1dbe25e1
|
[test] adding a number of user-contributed test cases from Moz in #21. Almost all are working under the CRF parser trained on 10% of the data. There are a few problematic ones in the UK still that have been omitted here. We currently don't correctly format the training data for the locality + postal town pattern, which are both considered "city" by libpostal and thus one will usually get lumped in with the road or something like that. There may also be some utility in modelling comma usage (training data has commas, but they're ignored by the parser both at train and run time - might be useful to train on them but drop out randomly so the parser doesn't become too dependent on having them)
|
2017-03-21 03:08:09 -04:00 |
|
Al
|
7fe84e6247
|
[matrix/utils] adding resize_fill_zeros
|
2017-03-21 01:37:08 -04:00 |
|