Al
|
d757baaf56
|
[fix] HAVE_CBLAS in matrix.h, memcpy needs to use sizeof(type)
|
2017-03-29 19:01:13 -04:00 |
|
Al
|
a7c9b919e9
|
[build] trying a CBLAS-specific macro that doesn't rope in Fortran
|
2017-03-29 18:57:32 -04:00 |
|
Al
|
9636ef6393
|
[fix] typo
|
2017-03-29 18:55:03 -04:00 |
|
Al
|
3e051a30da
|
[places] allowing training examples in the US and Canada with no city 5% of the time so the road=>{county,state} transition is more likely
|
2017-03-29 16:43:08 -04:00 |
|
Al
|
b90a703746
|
[openaddresses] adding units to Denver
|
2017-03-29 13:50:32 -04:00 |
|
Al
|
f0d37cc56d
|
[openaddresses] adding Piemonte
|
2017-03-29 13:06:01 -04:00 |
|
Al
|
0bd1bdb6f2
|
[test] adding Brazil and Romania parses for demo
|
2017-03-29 13:03:05 -04:00 |
|
Al
|
03ceb18a41
|
[test] adding US tests for parser demo queries
|
2017-03-28 15:04:00 -04:00 |
|
Al
|
22d97a0a35
|
[openaddresses] adding Belmont County, OH
|
2017-03-28 14:46:46 -04:00 |
|
Al
|
5ac891c484
|
[openaddresses] add McKinney, TX
|
2017-03-28 13:03:42 -04:00 |
|
Al
|
40f594e3be
|
[dictionaries] adding Dep. as an abbreviation for departamento in Spanish
|
2017-03-27 10:03:22 -04:00 |
|
Al
|
c0bded412c
|
[openaddresses] Sibley County, MN
|
2017-03-27 09:59:24 -04:00 |
|
Al
|
217de3a8a2
|
[addresses] adding the ability to hyphenate the generated unit/floor numbers, either for ranges or simple hyphenated numbers, including hyphenated variants of the letter + number or number + letter forms. Implementing for English but something similar can be done in the other configs.
|
2017-03-27 01:48:32 -04:00 |
|
Al
|
56f00250c2
|
[addresses] allowing number/ordinal spellout in the Trappa/Trappor Upp syntax in Swedish, didn't make it into the release
|
2017-03-26 20:56:43 -04:00 |
|
Al
|
61d008f349
|
[test] making some of the test cases simpler/easier so they don't fail. In general this should just be for examples that are/are going to be in the docs. Improving overall aggregate statistics like held-out accuracy over time is preferable to worrying about one individual test failure.
|
2017-03-26 20:27:32 -04:00 |
|
Al
|
81c59e116a
|
[countries] use ISO 3166 country name 5% of the time for general addresses, 10% of the time for OpenAddresses. Gives the parser examples of names like "Korea, Republic of" in #168
|
2017-03-25 19:41:59 -04:00 |
|
Al
|
ecfa6855e7
|
[openaddresses] adding Korea countrywide dataset
|
2017-03-25 16:49:16 -04:00 |
|
Al
|
8e4b909013
|
[formatting] adding postcode before city insertion for former USSR countries
|
2017-03-25 01:12:07 -04:00 |
|
Al
|
9fccfa0997
|
[places] increase state_district probability in India
|
2017-03-25 01:01:15 -04:00 |
|
Al
|
3aaa628b25
|
[test] add LaSalle, Montréal tests
|
2017-03-21 14:24:13 -04:00 |
|
Al
|
1f1dbe25e1
|
[test] adding a number of user-contributed test cases from Moz in #21. Almost all are working under the CRF parser trained on 10% of the data. There are a few problematic ones in the UK still that have been omitted here. We currently don't correctly format the training data for the locality + postal town pattern, which are both considered "city" by libpostal and thus one will usually get lumped in with the road or something like that. There may also be some utility in modelling comma usage (training data has commas, but they're ignored by the parser both at train and run time - might be useful to train on them but drop out randomly so the parser doesn't become too dependent on having them)
|
2017-03-21 03:08:09 -04:00 |
|
Al
|
7fe84e6247
|
[matrix/utils] adding resize_fill_zeros
|
2017-03-21 01:37:08 -04:00 |
|
Al
|
2bda741fa9
|
[openaddresses] adding Sicily statewide
|
2017-03-20 21:22:49 -04:00 |
|
Al
|
67805047f4
|
[openaddresses] adding Novosibirsk Oblast, Russia
|
2017-03-20 19:05:45 -04:00 |
|
Al
|
b8a12e0517
|
[test] adding parser test cases in 22 countries. These may change, and I'm generally against putting every obscure test case in the world in here. It's better to measure accuracy in aggregate statistics instead of individual test cases (i.e. if a particular change to the parser improves overall performance but fails one test case, should we accept the improvement?) The thought here is: these represent parses that are used in documentation/examples, as well as most of those that have been brought up in GitHub issues from the initial release, and we want these specific tests to work from build to build. If a model fails one of these test cases, it shouldn't get pushed to our users.
|
2017-03-20 00:58:52 -04:00 |
|
Al
|
7218ca1316
|
[openaddresses] adding Chesterfield, SC
|
2017-03-19 16:10:29 -04:00 |
|
Al
|
3b9b43f1b5
|
[fix] handle multiple separators (like parens used in https://www.openstreetmap.org/node/244081449). Creates bad trie entries otherwise, which affect more than just that toponym
|
2017-03-18 06:09:52 -04:00 |
|
Al
|
c67678087f
|
[parser] using a bipartite graph (indptr + indices) to represent postal code<=>admin relationships instead of a set of 64-bit ints. Requires |V(postal codes)| + |E| 32 bit ints instead of |E| 64 bit ints. Saves several hundred MB in file size and even more space in memory because of the hashtable overhead
|
2017-03-18 06:05:28 -04:00 |
|
Al
|
cb112f0ea7
|
[transliteration] regenerate transliteration data
|
2017-03-17 18:28:41 -04:00 |
|
Al
|
579425049b
|
[fix] with the new CLDR transform format, reverse the lines rather than the nodes in reverse transliterators
|
2017-03-17 18:28:15 -04:00 |
|
Al
|
8e3c9d0269
|
[test] adding test of new latin-ascii-simple transliterator which only handles things like HTML entities
|
2017-03-17 18:27:18 -04:00 |
|
Al
|
be07bfe35d
|
[test] adding printfs on expansion test failure so it's more clear what's going on
|
2017-03-17 17:46:22 -04:00 |
|
Al
|
dfabd25e5d
|
[phrases] set node data only when we're sure we have a correct match, otherwise the longer phrase may actually be matched
|
2017-03-17 03:40:29 -04:00 |
|
Al
|
f4a9e9d673
|
[fix] don't compare a double to 0
|
2017-03-15 14:59:33 -04:00 |
|
Al
|
266065f22f
|
[fix] need to store stats for component phrases that have more than one component, otherwise only the first gets stored and everything is an "unambiguous" phrase, which is not true
|
2017-03-15 14:11:59 -04:00 |
|
Al
|
0b27eb3f74
|
[parser] thought numeric boundary names had already been removed in the source data, but somehow they've made it into one of the data sets. Doing a final check in context_fill for valid boundary names (currently valid if there's at least one non-digit token)
|
2017-03-15 13:07:21 -04:00 |
|
Al
|
1b2696b3b5
|
[utils] adding string_is_digit function, similar to Python's (i.e. counts if it's in the Nd Unicode category)
|
2017-03-15 13:04:39 -04:00 |
|
Al
|
1a1f0a44d2
|
[parser] parser only inserts spaces in the output if there were spaces (or other ignorable tokens) in the normalized input
|
2017-03-15 03:35:03 -04:00 |
|
Al
|
d43989cf1c
|
[fix] log_sum_exp in SSE mode shouldn't modify the original array
|
2017-03-15 00:22:17 -04:00 |
|
Al
|
c201939f3a
|
[openaddresses] adding some of the new counties in GA. Adding the simple unit regex to DeKalb county's ignore list as there are a few in there
|
2017-03-13 16:08:37 -04:00 |
|
Al
|
e0a9171c09
|
[openaddresses] adding language-delineated files for South Tyrol
|
2017-03-13 01:27:23 -04:00 |
|
Al
|
6cf113b1df
|
[fix] handle case of T = 0 in Viterbi decoding
|
2017-03-12 22:55:52 -04:00 |
|
Al
|
35ccb3ee62
|
[fix] move
|
2017-03-12 20:22:35 -04:00 |
|
Al
|
d40a355d8b
|
[fix] heap issues when cleaning up CRF
|
2017-03-12 20:20:51 -04:00 |
|
Al
|
1277f82f52
|
[logging] some small logging changes to track vocab pre/post pruning
|
2017-03-12 00:24:52 -05:00 |
|
Al
|
7afba832e5
|
[test] adding the new tests to the Makefile
|
2017-03-11 14:34:27 -05:00 |
|
Al
|
7562cf866b
|
[crf] in averaged perceptron training for the CRF, need to update transition features when either guess != truth or prev_guess != prev_truth
|
2017-03-11 05:58:23 -05:00 |
|
Al
|
a6eaf5ebc5
|
[fix] had taken out a previous optimization while debugging. Don't need to repeatedly update the backpointer array in Viterbi to store an argmax when a stack variable will work. Because that's in the quadratic (only in L, the number of labels, which is small) section of the algorithm, even this small change can make a pretty sizeable difference. CRF training speed is now roughly on par with the greedy model
|
2017-03-11 02:31:52 -05:00 |
|
Al
|
647ddf171d
|
[fix] formatting for print features in CRF model
|
2017-03-11 01:18:35 -05:00 |
|
Al
|
735fd7a6b7
|
[openaddresses] adding Douglas County, OR
|
2017-03-10 23:36:11 -05:00 |
|