e4ed759f0d[math] using new matrix methods in softmax
Al
2017-04-02 23:29:52 -04:00
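A minimal sketch of the numerically stable row-wise softmax the commit above refers to, written against a simple row-major double matrix; the struct and function names here are illustrative, not libpostal's actual matrix API.

```c
#include <math.h>
#include <stddef.h>

/* Illustrative row-major matrix: values[i * n + j] is row i, column j. */
typedef struct {
    size_t m, n;
    double *values;
} matrix_t;

/* Numerically stable softmax applied to each row in place:
   subtract the row max before exponentiating to avoid overflow. */
static void softmax_rows(matrix_t *mat) {
    for (size_t i = 0; i < mat->m; i++) {
        double *row = mat->values + i * mat->n;

        double max_val = row[0];
        for (size_t j = 1; j < mat->n; j++) {
            if (row[j] > max_val) max_val = row[j];
        }

        double sum = 0.0;
        for (size_t j = 0; j < mat->n; j++) {
            row[j] = exp(row[j] - max_val);
            sum += row[j];
        }

        for (size_t j = 0; j < mat->n; j++) {
            row[j] /= sum;
        }
    }
}
```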
3aab15a0a0[math] adding mean, variance and standard deviation to generic vector functions
Al
2017-04-02 23:29:15 -04:00
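A sketch of the mean/variance/standard deviation functions mentioned above, written for a plain double array; in libpostal these are generated by the generic vector macros, so the signatures here are only illustrative.

```c
#include <math.h>
#include <stddef.h>

static double double_array_mean(const double *values, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += values[i];
    return n > 0 ? sum / (double)n : 0.0;
}

/* Population variance: E[(x - mean)^2]. */
static double double_array_variance(const double *values, size_t n) {
    double mean = double_array_mean(values, n);
    double sum_sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = values[i] - mean;
        sum_sq += diff * diff;
    }
    return n > 0 ? sum_sq / (double)n : 0.0;
}

static double double_array_std(const double *values, size_t n) {
    return sqrt(double_array_variance(values, n));
}
```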
3cb513a8f2[utils] hash_get is no longer a string-only function and can be used for generic hashtables
Al
2017-04-02 23:28:14 -04:00

95e39ad91c[utils] removing default chunk size from address_parser_train
Al
2017-04-02 23:26:51 -04:00
a4431dbb27[classification] removing regularization update from gradient computation in logistic regression, as that's now handled by the optimizer
Al
2017-04-02 14:32:14 -04:00
64c049730a[classification] flexible logistic regression trainer that can handle either SGD (with either L1 or L2) or FTRL as optimizers
Al
2017-04-02 14:30:14 -04:00
cf88bc7f65[optimization] implemented Google's FTRL-Proximal, adapted for the multiclass/multinomial case. It is L1- and L2-regularized: the L1 penalty encourages sparsity while the L2 penalty keeps it robust to collinearity among features. Ref: https://research.google.com/pubs/archive/41159.pdf
Al
2017-04-02 14:28:25 -04:00
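For reference, a sketch of the per-coordinate FTRL-Proximal update from the referenced McMahan et al. paper, written for the single-weight case; libpostal's implementation adapts this to the multiclass setting, and the struct and function names here are illustrative.

```c
#include <math.h>
#include <stddef.h>

/* Per-coordinate FTRL-Proximal state: z (accumulated adjusted gradients)
   and n (accumulated squared gradients), as in McMahan et al. (2013). */
typedef struct {
    size_t d;          /* number of features */
    double alpha;      /* learning rate parameter */
    double beta;       /* learning rate smoothing */
    double lambda1;    /* L1 penalty */
    double lambda2;    /* L2 penalty */
    double *z;
    double *n;
} ftrl_t;

static inline double sign_double(double x) {
    return (double)((x > 0.0) - (x < 0.0));
}

/* Lazily compute the weight for feature i from the FTRL state.
   The L1 penalty drives any coordinate with |z_i| <= lambda1 to exactly zero. */
static double ftrl_weight(const ftrl_t *f, size_t i) {
    double z = f->z[i];
    if (fabs(z) <= f->lambda1) return 0.0;
    return -(z - sign_double(z) * f->lambda1) /
           ((f->beta + sqrt(f->n[i])) / f->alpha + f->lambda2);
}

/* Update feature i with gradient g. */
static void ftrl_update(ftrl_t *f, size_t i, double g) {
    double w = ftrl_weight(f, i);
    double sigma = (sqrt(f->n[i] + g * g) - sqrt(f->n[i])) / f->alpha;
    f->z[i] += g - sigma * w;
    f->n[i] += g * g;
}
```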
ed05aaabb1[utils] adding default chunk size to shuffle.h
Al
2017-04-02 13:51:45 -04:00
96e1ca5e89[utils] sparse_matrix_add_unique_columns_alias: adds the actual column indices to a hashtable/array and aliases them in the table from 1 to N (where N is the number of unique columns in this batch), so it's compatible with smaller batch weight matrices.
Al
2017-04-02 13:48:46 -04:00
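Roughly what the aliasing above does: collect the unique global column indices seen in a batch and map each to a small local index, so the batch's weight matrix only needs as many columns as the batch touches. The sketch below is illustrative only (0-based aliases, a linear scan in place of a real hashtable), not libpostal's actual sparse matrix API.

```c
#include <stdint.h>
#include <stddef.h>

/* Maps global column indices to dense local aliases.
   unique_columns must be preallocated to the batch's maximum nnz. */
typedef struct {
    uint32_t *unique_columns;  /* alias -> global column index */
    size_t num_unique;
} column_alias_map_t;

/* Return the alias for col, adding it if it hasn't been seen yet. */
static uint32_t column_alias(column_alias_map_t *map, uint32_t col) {
    for (size_t i = 0; i < map->num_unique; i++) {
        if (map->unique_columns[i] == col) return (uint32_t)i;
    }
    map->unique_columns[map->num_unique] = col;
    return (uint32_t)(map->num_unique++);
}

/* Rewrite a batch's sparse column indices in place to their aliases. */
static void alias_batch_columns(column_alias_map_t *map,
                                uint32_t *indices, size_t nnz) {
    for (size_t i = 0; i < nnz; i++) {
        indices[i] = column_alias(map, indices[i]);
    }
}
```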
a2563a4dcd[optimization] new sgd_trainer struct to manage weights in stochastic gradient descent. Allows L1 or L2 regularization, with cumulative penalties instead of exponential decay. SGD with L1 regularization encourages sparsity and can produce a sparse matrix after training rather than a dense one
Al
2017-04-02 13:44:40 -04:00
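The "cumulative penalties" above are presumably along the lines of Tsuruoka et al.'s (2009) cumulative L1 penalty for SGD; a minimal sketch of the clipping step, with illustrative struct and field names rather than the actual sgd_trainer layout.

```c
#include <math.h>
#include <stddef.h>

/* Cumulative L1 penalty for lazily-regularized SGD:
   u is the total L1 penalty each weight could have received so far
   (u += eta * lambda on every step), q[i] is the penalty actually applied
   to weight i. Clipping at zero is what produces sparse weights. */
typedef struct {
    size_t d;
    double *w;   /* weights */
    double *q;   /* cumulative penalty applied per weight */
    double u;    /* cumulative penalty rate */
} sgd_l1_t;

/* Apply the accumulated L1 penalty to weight i after its gradient step. */
static void apply_cumulative_l1(sgd_l1_t *t, size_t i) {
    double z = t->w[i];
    if (z > 0.0) {
        t->w[i] = fmax(0.0, z - (t->u + t->q[i]));
    } else if (z < 0.0) {
        t->w[i] = fmin(0.0, z + (t->u - t->q[i]));
    }
    t->q[i] += t->w[i] - z;
}
```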
19fe084974[utils] adding non-branching sign functions
Al
2017-04-02 13:41:57 -04:00
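The usual non-branching sign trick, as a sketch (the actual names and type variants added to utils may differ): comparisons evaluate to 0 or 1 in C, so their difference gives -1, 0, or 1 without any conditional jumps.

```c
#include <stdint.h>

static inline int sign_int32(int32_t x) {
    return (x > 0) - (x < 0);
}

static inline double signum_double(double x) {
    return (double)((x > 0.0) - (x < 0.0));
}
```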
74a281e332[dictionaries] more abbreviations for MLK
Al
2017-04-01 00:54:08 -04:00
7f30fb8e38[openaddresses] add OSM boundaries to King, NC
Al
2017-03-31 21:13:32 -04:00
b52f137b5d[openaddresses] adding units to Chelan County, WA, adding Island County, WA
Al
2017-03-31 18:08:43 -04:00
6ec4c1fdc9[openaddresses] adding units to city of Columbia, MO
Al
2017-03-31 17:44:04 -04:00
f349607412[openaddresses] adding units in Boone County, MO
Al
2017-03-31 17:27:35 -04:00
bd8de15886[openaddresses] OSM boundaries no longer needed in Alamance County, NC. Ignore city when it's {ALAMANCECOUNTY, COUNTY}
Al
2017-03-31 17:24:45 -04:00
267be6c05c[data] use a 12-worker pool in the data download instead of 10 so the new parser can be downloaded in one shot
Al
2017-03-31 15:52:14 -04:00
7f8c2f0ad3[fix] remove bloom.c from libpostal sources
Al
2017-03-31 15:22:48 -04:00
a64c81b45b[data/models] updating libpostal download script to download new models. The simple data files are stored by libpostal major version, whereas the models are stored by the version of the training data they used. A file called "latest" is stored in S3 to indicate the latest model version and is checked on make
Al
2017-03-31 13:35:07 -04:00
6d4c7984df[api] doing this now since we're bumping a major version. Using a libpostal prefix for all public header functions and definitions
Al
2017-03-31 03:35:51 -04:00
f8d7bdf364[build] defining libpostal .so version in configure.ac, removing dependency on mmap and sparkey
Al
2017-03-31 03:24:19 -04:00
f7b695c642[build] add /usr/local/include as default include path for test Makefile as well
Al
2017-03-30 15:57:17 -04:00
ace40bf0aa[rm] removing ax_blas.m4
Al
2017-03-30 15:53:53 -04:00
27426e90d0[build] wrap CBLAS check in a check for the cblas.h header
Al
2017-03-30 15:53:38 -04:00
db726d5ce1[openaddresses] OSM boundaries no longer needed in Allen County, IN, needed in Clark County, NV
Al
2017-03-30 02:10:12 -04:00
979f866678[openaddresses] Ignoring cities starting with UT in St Louis County, MN
Al
2017-03-30 02:01:06 -04:00
8980131f3f[openaddresses] adding Mexico countrywide, removing add_osm_boundaries from New Orleans
Al
2017-03-30 01:57:42 -04:00
65fadbeea3[fix] add CBLAS_LIBS in the test Makefile
Al
2017-03-29 21:38:54 -04:00
f7889bf138[fix] removing WIP
Al
2017-03-29 20:46:56 -04:00
d757baaf56[fix] HAVE_CBLAS in matrix.h, memcpy needs to use sizeof(type)
Al
2017-03-29 19:01:13 -04:00
a7c9b919e9[build] trying a CBLAS-specific macro that doesn't rope in Fortran
Al
2017-03-29 18:57:32 -04:00
9636ef6393[fix] typo
Al
2017-03-29 18:55:03 -04:00
3e051a30da[places] allowing training examples in the US and Canada with no city 5% of the time so the road=>{county,state} transition is more likely
Al
2017-03-29 16:43:08 -04:00
b90a703746[openaddresses] adding units to Denver
Al
2017-03-29 13:50:32 -04:00
f0d37cc56d[openaddresses] adding Piemonte
Al
2017-03-29 13:06:01 -04:00
0bd1bdb6f2[test] adding Brazil and Romania parses for demo
Al
2017-03-29 13:03:05 -04:00
03ceb18a41[test] adding US tests for parser demo queries
Al
2017-03-28 15:03:03 -04:00
22d97a0a35[openaddresses] adding Belmont County, OH
Al
2017-03-28 14:46:46 -04:00
5ac891c484[openaddresses] add McKinney, TX
Al
2017-03-28 13:03:42 -04:00
40f594e3be[dictionaries] adding Dep. as an abbreviation for departamento in Spanish
Al
2017-03-27 10:03:22 -04:00
c0bded412c[openaddresses] Sibley County, MN
Al
2017-03-27 09:59:24 -04:00
217de3a8a2[addresses] adding the ability to hyphenate the generated unit/floor numbers, either for ranges or simple hyphenated numbers, including hyphenated variants of the letter + number or number + letter forms. Implementing for English but something similar can be done in the other configs.
Al
2017-03-27 01:48:25 -04:00
56f00250c2[addresses] allowing number/ordinal spellout in the Trappa/Trappor Upp syntax in Swedish; this didn't make it into the release
Al
2017-03-26 20:56:43 -04:00
61d008f349[test] making some of the test cases simpler/easier so they don't fail. In general this should just be for examples that are/are going to be in the docs. Improving overall aggregate statistics like held-out accuracy over time is preferable to worrying about one individual test failure.
Al
2017-03-21 03:08:09 -04:00
81c59e116a[countries] use ISO 3166 country name 5% of the time for general addresses, 10% of the time for OpenAddresses. Gives the parser examples of names like "Korea, Republic of" in #168
Al
2017-03-25 19:41:59 -04:00
ecfa6855e7[openaddresses] adding Korea countrywide dataset
Al
2017-03-25 16:49:16 -04:00
8e4b909013[formatting] adding postcode before city insertion for former USSR countries
Al
2017-03-25 01:12:07 -04:00
9fccfa0997[places] increase state_district probability in India
Al
2017-03-25 01:01:15 -04:00
3aaa628b25[test] add LaSalle, Montréal tests
Al
2017-03-21 14:24:13 -04:00
1f1dbe25e1[test] adding a number of user-contributed test cases from Moz in #21. Almost all are working under the CRF parser trained on 10% of the data. There are a few problematic ones in the UK still that have been omitted here. We currently don't correctly format the training data for the locality + postal town pattern; both are considered "city" by libpostal, so one will usually get lumped in with the road or something like that. There may also be some utility in modelling comma usage (training data has commas, but they're ignored by the parser both at train and run time - it might be useful to train on them but drop them out randomly so the parser doesn't become too dependent on having them)
Al
2017-03-21 03:08:09 -04:00
7fe84e6247[matrix/utils] adding resize_fill_zeros
Al
2017-03-21 01:37:08 -04:00
2bda741fa9[openaddresses] adding Sicily statewide
Al
2017-03-20 21:22:49 -04:00
67805047f4[openaddresses] adding Novosibirsk Oblast, Russia
Al
2017-03-20 19:05:45 -04:00
b8a12e0517[test] adding parser test cases in 22 countries. These may change, and I'm generally against putting every obscure test case in the world in here. It's better to measure accuracy in aggregate statistics instead of individual test cases (i.e. if a particular change to the parser improves overall performance but fails one test case, should we accept the improvement?) The thought here is: these represent parses that are used in documentation/examples, as well as most of those that have been brought up in Github issues since the initial release, and we want these specific tests to work from build to build. If a model fails one of these test cases, it shouldn't get pushed to our users.
Al
2017-03-20 00:58:52 -04:00
7218ca1316[openaddresses] adding Chesterfield, SC
Al
2017-03-19 16:10:29 -04:00
3b9b43f1b5[fix] handle multiple separators (like parens used in https://www.openstreetmap.org/node/244081449). Creates bad trie entries otherwise, which affect more than just that toponym
Al
2017-03-18 06:06:56 -04:00
c67678087f[parser] using a bipartite graph (indptr + indices) to represent postal code<=>admin relationships instead of a set of 64-bit ints. Requires |V(postal codes)| + |E| 32 bit ints instead of |E| 64 bit ints. Saves several hundred MB in file size and even more space in memory because of the hashtable overhead
Al
2017-03-18 06:05:28 -04:00
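The indptr + indices layout above is essentially a CSR-style adjacency structure; a sketch of what lookup against it might look like, with illustrative struct and function names rather than libpostal's actual ones.

```c
#include <stdint.h>
#include <stdbool.h>

/* CSR-style bipartite graph: the admins for postal code p are the slice
   indices[indptr[p] ... indptr[p + 1]). Storage is
   (|postal codes| + 1 + |E|) 32-bit ints instead of |E| 64-bit ints. */
typedef struct {
    uint32_t num_postal_codes;
    uint32_t *indptr;    /* length num_postal_codes + 1 */
    uint32_t *indices;   /* admin ids, grouped by postal code */
} postal_code_admins_t;

/* Check whether admin_id is associated with postal_code_id. */
static bool postal_code_has_admin(const postal_code_admins_t *g,
                                  uint32_t postal_code_id,
                                  uint32_t admin_id) {
    if (postal_code_id >= g->num_postal_codes) return false;
    for (uint32_t i = g->indptr[postal_code_id];
         i < g->indptr[postal_code_id + 1]; i++) {
        if (g->indices[i] == admin_id) return true;
    }
    return false;
}
```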
cb112f0ea7[transliteration] regenerate transliteration data
Al
2017-03-17 18:28:41 -04:00
579425049b[fix] with the new CLDR transform format, reverse the lines rather than the nodes in reverse transliterators
Al
2017-03-17 18:28:15 -04:00
8e3c9d0269[test] adding test of new latin-ascii-simple transliterator which only handles things like HTML entities
Al
2017-03-17 18:27:18 -04:00
be07bfe35d[test] adding printfs on expansion test failure so it's more clear what's going on
Al
2017-03-17 17:46:22 -04:00
dfabd25e5d[phrases] set node data only when we're sure we have a correct match, otherwise the longer phrase may actually be matched
Al
2017-03-17 03:40:29 -04:00
f4a9e9d673[fix] don't compare a double to 0
Al
2017-03-15 14:59:33 -04:00
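The fix above replaces an exact comparison of a double against 0; a sketch of the usual pattern (the tolerance choice and helper names are illustrative).

```c
#include <math.h>
#include <float.h>
#include <stdbool.h>

/* Instead of (x == 0.0), compare against a small tolerance. */
static inline bool double_is_zero(double x) {
    return fabs(x) < DBL_EPSILON;
}

/* More generally, compare two doubles for approximate equality. */
static inline bool double_equals(double a, double b, double eps) {
    return fabs(a - b) < eps;
}
```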
266065f22f[fix] need to store stats for component phrases that have more than one component, otherwise only the first gets stored and everything is an "unambiguous" phrase, which is not true
Al
2017-03-15 14:11:59 -04:00
0b27eb3f74[parser] thought numeric boundary names had already been removed in the source data, but somehow they've made it into one of the data sets. Doing a final check in context_fill for valid boundary names (currently valid if there's at least one non-digit token)
Al
2017-03-15 13:07:21 -04:00
1b2696b3b5[utils] adding string_is_digit function, similar to Python's (i.e. counts if it's in the Nd unicode category)
Al
2017-03-15 13:04:39 -04:00
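A sketch of what such a check might look like, assuming utf8proc (which libpostal bundles) for decoding and category lookup; this is not the actual implementation, just the Nd-category test the commit describes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <utf8proc.h>

/* True iff the string is non-empty and every code point is in the
   Unicode Nd (decimal number) category, similar to Python's str.isdigit
   restricted to Nd. */
static bool string_is_digit(const char *str, size_t len) {
    if (len == 0) return false;

    size_t idx = 0;
    while (idx < len) {
        utf8proc_int32_t ch;
        utf8proc_ssize_t char_len = utf8proc_iterate(
            (const utf8proc_uint8_t *)str + idx, len - idx, &ch);
        if (char_len <= 0) return false;  /* invalid UTF-8 */

        if (utf8proc_category(ch) != UTF8PROC_CATEGORY_ND) return false;
        idx += (size_t)char_len;
    }
    return true;
}
```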
1a1f0a44d2[parser] parser only inserts spaces in the output if there were spaces (or other ignorable tokens) in the normalized input
Al
2017-03-15 03:34:59 -04:00
d43989cf1c[fix] log_sum_exp in SSE mode shouldn't modify the original array
Al
2017-03-15 00:22:17 -04:00
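For reference, a scalar sketch of log_sum_exp that leaves its input untouched, which is the behavior the fix above restores for the SSE path: shift by the max so the exponentials can't overflow, then add the max back.

```c
#include <math.h>
#include <stddef.h>

/* log(sum_i exp(x_i)) without modifying the input array. */
static double log_sum_exp(const double *x, size_t n) {
    if (n == 0) return -HUGE_VAL;

    double max_val = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] > max_val) max_val = x[i];
    }

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += exp(x[i] - max_val);
    }
    return max_val + log(sum);
}
```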
c201939f3a[openaddresses] adding some of the new counties in GA. Adding the simple unit regex to DeKalb county's ignore list as there are a few in there
Al
2017-03-13 16:08:37 -04:00
e0a9171c09[openaddresses] adding language-delineated files for South Tyrol
Al
2017-03-13 01:27:23 -04:00
6cf113b1df[fix] handle case of T = 0 in Viterbi decoding
Al
2017-03-12 22:55:48 -04:00
35ccb3ee62[fix] move
Al
2017-03-12 20:22:35 -04:00
d40a355d8b[fix] heap issues when cleaning up CRF
Al
2017-03-12 20:20:51 -04:00
1277f82f52[logging] some small logging changes to track vocab pre/post pruning
Al
2017-03-12 00:24:52 -05:00
7afba832e5[test] adding the new tests to the Makefile
Al
2017-03-11 14:34:27 -05:00
7562cf866b[crf] in averaged perceptron training for the CRF, need to update transition features when either guess != truth or prev_guess != prev_truth
Al
2017-03-11 05:58:11 -05:00
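The condition in the commit above, sketched as code: a transition feature spans the (prev, current) tag pair, so it is wrong whenever either end of the pair is wrong. The function and matrix layout below are illustrative, not the actual crf_averaged_perceptron API.

```c
#include <stdint.h>

/* Perceptron-style update of CRF transition weights at one position.
   trans is an L x L matrix of transition weights (L = number of labels). */
static void update_transition_weights(double *trans, uint32_t num_labels,
                                      uint32_t prev_truth, uint32_t truth,
                                      uint32_t prev_guess, uint32_t guess) {
    if (guess != truth || prev_guess != prev_truth) {
        trans[prev_truth * num_labels + truth] += 1.0;
        trans[prev_guess * num_labels + guess] -= 1.0;
    }
}
```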
a6eaf5ebc5[fix] had taken out a previous optimization while debugging. Don't need to repeatedly update the backpointer array in Viterbi to store an argmax when a stack variable will work. Because that's in the quadratic section of the algorithm (quadratic only in L, the number of labels, which is small), even this small change can make a pretty sizeable difference. CRF training speed is now roughly on par with the greedy model
Al
2017-03-11 02:31:52 -05:00
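A sketch of the optimization described above: track the running argmax in stack variables and write the backpointer array once per (t, j) cell, instead of writing it on every candidate previous label. Names and layout are illustrative.

```c
#include <stdint.h>
#include <float.h>

/* Inner loop of Viterbi for label j at one time step: prev_scores has
   length num_labels, trans_scores[i * num_labels + j] is the transition
   score from label i to label j. The emission score for j is added by
   the caller. */
static void viterbi_argmax(const double *prev_scores, const double *trans_scores,
                           uint32_t num_labels, uint32_t j,
                           double *best_score_out, uint32_t *backpointer_out) {
    double best_score = -DBL_MAX;
    uint32_t best_prev = 0;

    for (uint32_t i = 0; i < num_labels; i++) {
        double score = prev_scores[i] + trans_scores[i * num_labels + j];
        if (score > best_score) {
            best_score = score;
            best_prev = i;
        }
    }

    *best_score_out = best_score;
    *backpointer_out = best_prev;   /* single write to the backpointer array */
}
```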
647ddf171d[fix] formatting for print features in CRF model
Al
2017-03-11 01:18:35 -05:00
735fd7a6b7[openaddresses] adding Douglas County, OR
Al
2017-03-10 23:36:11 -05:00
d876beb386[fix] add CRF files to the main lib
Al
2017-03-10 19:40:15 -05:00
0ec590916b[build] adding necessary sources to address_parser client, address_parser_train and address_parser_test
Al
2017-03-10 19:33:31 -05:00
25649f2122[utils] new_fixed and resize_fixed in vector.h
Al
2017-03-10 19:31:34 -05:00
4e02a54a79[utils] adding file_exists to header
Al
2017-03-10 19:30:46 -05:00
5775e3d806[parser/cli] removing geodb loading from parser client
Al
2017-03-10 19:30:18 -05:00
8deb1716cb[parser] adding polymorphic (as much as C does polymorphism) model type for the parser to allow it to handle either the greedy averaged perceptron or a CRF. During training, saving, and loading, we use a different filename for a parser trained with a CRF, which is still backward-compatible with models previously trained in parser-data. Making necessary modifications to address_parser.c, address_parser_train.c, and address_parser_test.c. Also adding an option in address_parser_test to print individual errors in addition to the confusion matrix.
Al
2017-03-10 19:19:40 -05:00
1bd4689c5f[openaddresses] add Gillespie County, TX
Al
2017-03-10 15:53:49 -05:00
171aa77ea3[openaddresses] adding Fisher County, TX
Al
2017-03-10 15:46:46 -05:00
8e3bcbfc95[openaddresses] adding Coffey County, KS
Al
2017-03-10 15:44:59 -05:00
b85ed70674[utils] adding a function for checking whether a file exists (yay C), or at least the closest agreed-upon method for it (may return false if the user doesn't have permissions, but that's ok for our purposes here)
Al
2017-03-10 13:39:43 -05:00
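A minimal sketch of the existence check described above, using POSIX access() with F_OK (the actual implementation may use stat() instead); as the commit notes, it can return false when the caller lacks permission on a path component, which is acceptable here.

```c
#include <stdbool.h>
#include <unistd.h>

/* True if the path exists and is reachable by the calling user. */
static bool file_exists(const char *filename) {
    return access(filename, F_OK) == 0;
}
```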
3b33325c1a[cli] no longer need geodb setup in address parser client
Al
2017-03-10 13:11:32 -05:00
ef8768281b[parser/crf] adding runtime CRF tagger, which can be loaded/used once trained. Currently only does Viterbi inference, can add top-N and/or sequence probabilities later
Al
2017-03-10 02:06:45 -05:00
9afff5c9ed[parser/crf] adding an initial training algorithm for CRFs, the averaged perceptron (FTW!)
Al
2017-03-10 01:28:31 -05:00
5cac4a7585[parser/crf] adding crf_trainer, which can be thought of as a "base class" as much as that's possible in C, for creating trainers for the CRF. It doesn't deal with the weights or their representation, just provides an interface for keeping track of string features and label names, and holds the crf_context
Al
2017-03-10 01:25:20 -05:00
dd0bead63a[test/utils] also a good thing to sanity check (in C especially): string handling code
Al
2017-03-10 01:15:23 -05:00
adab8ab51a[test/crf] test for crf_context, adapted from crf1dc_debug_context in CRFsuite. Always a good idea to sanity check numerical code
Al
2017-03-10 01:13:40 -05:00
f9a9dc2224[parser/crf] adding the beginnings of a linear-chain Conditional Random Field implementation for the address parser.
Al
2017-03-09 23:13:16 -05:00
f9e60b13f5[parser] size the postcode context set appropriately when reading the parser, makes loading a large model much faster
Al
2017-03-09 14:31:12 -05:00
2400122162[fix] fixing up hash str to id template
Al
2017-03-09 00:54:31 -05:00
4c03e563e0[parser] for the min updates method to work, features that have not yet reached the min_updates threshold also need to be ignored when scoring; that way the model has to perform without those features, and should make more updates if they're relevant
Al
2017-03-08 15:40:12 -05:00
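Roughly what the gating above amounts to during scoring: features below the min_updates threshold contribute nothing, so the model is forced to score without them and will keep updating them if they turn out to matter. The names and layout below are illustrative, not the actual trainer structs.

```c
#include <stdint.h>
#include <stddef.h>

/* Score a single class over a sparse feature vector, skipping features
   whose update count is still below min_updates. */
static double score_class(const double *weights, const uint64_t *update_counts,
                          const uint32_t *feature_ids, const double *feature_values,
                          size_t num_features, uint64_t min_updates) {
    double score = 0.0;
    for (size_t i = 0; i < num_features; i++) {
        uint32_t feature_id = feature_ids[i];
        if (update_counts[feature_id] < min_updates) continue;
        score += weights[feature_id] * feature_values[i];
    }
    return score;
}
```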