4806 Commits

Author SHA1 Message Date
Al
8e3bcbfc95 [openaddresses] adding Coffey County, KS 2017-03-10 15:44:59 -05:00
Al
b85ed70674 [utils] adding a function for checking if a file exists (yay C), or at least the closest agreed-upon method for it (may return false if the user doesn't have permissions, but that's ok for our purposes here) 2017-03-10 13:39:52 -05:00
Al
3b33325c1a [cli] no longer need geodb setup in address parser client 2017-03-10 13:11:32 -05:00
Al
ef8768281b [parser/crf] adding runtime CRF tagger, which can be loaded/used once trained. Currently only does Viterbi inference, can add top-N and/or sequence probabilities later 2017-03-10 02:06:45 -05:00
Al
9afff5c9ed [parser/crf] adding an initial training algorithm for CRFs, the averaged
perceptron (FTW!)

Though it does not generate scores suitable for use as probabilities, and
might achieve slightly lower accuracy on some tasks than its
gradient-based counterparts like SGD (a possibility for libpostal)
or LBFGS (prohibitive on this much data), the averaged perceptron is
appealing for two reasons: speed and low memory usage i.e. we can still use
all the same tricks as in the greedy model like sparse construction of
the weight matrix. In this case we can go even sparser than in the
original because the state-transition features are separate from the
state features, and we need to be able to iterate over all of them
instead of simply creating new string keys in the feature space. The
solution to this is quite simple: we simply treat the weights for each
state-transition feature as if they have L * L output labels instead of
simply L. So instead of:

{
    "prev|road|word|DD": {1: 1.0, 2: -1.0}
    ...
}

We'd have:

{
    "word|DD": {(0, 1): 1.0, (0, 2): -1.0}
    ...
}
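
A minimal sketch of that idea, not the actual libpostal C code: each
state-transition feature gets a sparse row over L * L "classes", where the
pair (prev, cur) is flattened to prev * L + cur. The function names
(flat_index, update, state_trans_scores) are hypothetical and only used
for illustration.

```python
from collections import defaultdict


def flat_index(prev_label, label, num_labels):
    # Flatten a (prev, cur) label pair into one of L * L class ids.
    return prev_label * num_labels + label


def update(weights, feature, prev_label, label, num_labels, delta):
    # weights: feature string -> {flat class id -> weight}, stored sparsely.
    row = weights[feature]
    idx = flat_index(prev_label, label, num_labels)
    row[idx] = row.get(idx, 0.0) + delta


def state_trans_scores(weights, features, num_labels):
    # Sum the sparse rows of the observed features into an L x L score matrix.
    scores = [[0.0] * num_labels for _ in range(num_labels)]
    for feature in features:
        for idx, weight in weights.get(feature, {}).items():
            prev_label, label = divmod(idx, num_labels)
            scores[prev_label][label] += weight
    return scores


weights = defaultdict(dict)
update(weights, "word|DD", 0, 1, 3, 1.0)
update(weights, "word|DD", 0, 2, 3, -1.0)
scores = state_trans_scores(weights, ["word|DD", "unseen"], 3)
```

The feature string no longer encodes the previous tag, so one key like
"word|DD" covers every (prev, cur) pair that was actually updated.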

As usual we compress the features to a trie, and the weights to
compressed-sparse row (CSR) format sparse matrix after the weights have
been averaged. These representations are smaller, faster to load from
disk, and faster to use at runtime (contiguous arrays vs hashtables).
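
To illustrate the CSR part, here is a rough sketch assuming the averaged
weights are available as one {column: value} dict per feature id. The
names to_csr and score are made up for this example and are not
libpostal functions.

```python
def to_csr(rows):
    # rows: list of {column: value} dicts, one per feature id.
    # Returns the three contiguous CSR arrays: indptr, indices, data.
    indptr, indices, data = [0], [], []
    for row in rows:
        for col in sorted(row):
            value = row[col]
            if value != 0.0:  # drop explicit zeros
                indices.append(col)
                data.append(value)
        indptr.append(len(indices))
    return indptr, indices, data


def score(csr, feature_ids, num_cols):
    # Add up the rows for the observed feature ids using only array slices.
    indptr, indices, data = csr
    scores = [0.0] * num_cols
    for fid in feature_ids:
        for k in range(indptr[fid], indptr[fid + 1]):
            scores[indices[k]] += data[k]
    return scores


csr = to_csr([{1: 1.0, 2: -1.0}, {}, {0: 0.5}])
totals = score(csr, [0, 2], 3)
```

Row i lives in data[indptr[i]:indptr[i + 1]], so scoring a feature is a
scan over a contiguous slice rather than a hashtable lookup per weight.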

This also includes the min_updates variation from the greedy perceptron,
so features that participate in fewer than N updates are discarded at
the end (and also not used in scoring until they meet the threshold so
the model doesn't become dependent on features it doesn't really have).
This tends to discard irrelevant features, keeping the model small
without hurting accuracy much (within a tenth of a percent or so in my
tests on the greedy perceptron).
2017-03-10 01:28:31 -05:00
Al
5cac4a7585 [parser/crf] adding crf_trainer, which can be thought of as a "base class" as much as that's possible in C, for creating trainers for the CRF. It doesn't deal with the weights or their representation, just provides an interface for keeping track of string features and label names, and holds the crf_context 2017-03-10 01:25:20 -05:00
Al
dd0bead63a [test/utils] also a good thing to sanity check (in C especially): string handling code 2017-03-10 01:15:23 -05:00
Al
adab8ab51a [test/crf] test for crf_context, adapted from crf1dc_debug_context in CRFsuite. Always a good idea to sanity check numerical code 2017-03-10 01:13:40 -05:00
Al
f9a9dc2224 [parser/crf] adding the beginnings of a linear-chain Conditional Random Field
implementation for the address parser.

One of the main issues with the greedy averaged perceptron tagger used currently
in libpostal is that it predicts left-to-right and commits to its
answers i.e. doesn't revise its previous predictions. The model can use
its own previous predictions to classify the current word, but
effectively it makes the best local decision it can and never looks back
(the YOLO approach to parsing).

This can be problematic in a multilingual setting like libpostal,
since the order of address components is language/country dependent.
It would be preferable to have a model that scores whole
_sequences_ instead of individual tagging decisions.

That's exactly what a Conditional Random Field (CRF) does. Instead of modeling
P(y_i | x_i, y_{i-1}), we're modeling P(y|x) where y is the whole sequence of labels
and x is the whole sequence of features. They achieve state-of-the-art results
in many tasks (or are a component in the state-of-the-art model - LSTM-CRFs
have been an interesting direction along these lines).

The crf_context module is heavily borrowed from the version in CRFsuite
(https://github.com/chokkan/crfsuite) though using libpostal's data structures and
allowing for "state-transition features." CRFsuite has state features
like "word=the", and transition features e.g. "prev tag=house", but
no notion of a feature which incorporates both local and transition
information e.g. "word=the and prev tag=house". These types of features are useful
in our setting where there are many languages and it might not make as
much sense to simply have a weight for "house_number => road" because that
highly depends on the country. This implementation introduces a T x L^2 matrix for
those state-transition scores.

For linear-chain CRFs, the Viterbi algorithm is used for computing the
most probable sequence. There are versions of Viterbi for computing the
N most probable sequences as well, which may come in handy later. This
can also compute marginal probabilities of a sequence (though it would
need to wait until a gradient-based learning method that produces
well-calibrated probabilities is implemented).
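
For reference, a small sketch of Viterbi over the three kinds of scores
described above. This is not the crf_context code; it assumes plain
nested lists where state is T x L, trans is L x L, and state_trans is
T x L x L (the entry at t = 0 is ignored since there is no previous tag).

```python
def viterbi(state, trans, state_trans):
    # Returns (best label sequence, its total score).
    if not state:
        return [], 0.0
    num_labels = len(state[0])
    score = list(state[0])
    backpointers = []
    for t in range(1, len(state)):
        new_score = [0.0] * num_labels
        pointers = [0] * num_labels
        for j in range(num_labels):
            best_i, best = 0, None
            for i in range(num_labels):
                s = score[i] + trans[i][j] + state_trans[t][i][j]
                if best is None or s > best:
                    best_i, best = i, s
            new_score[j] = best + state[t][j]
            pointers[j] = best_i
        backpointers.append(pointers)
        score = new_score
    last = max(range(num_labels), key=lambda j: score[j])
    path = [last]
    # Walk the backpointers from the final position to the first.
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


state = [[1.0, 0.0], [0.0, 0.5], [0.0, 1.0]]
zeros = [[[0.0, 0.0], [0.0, 0.0]] for _ in state]
# Penalize the 0 -> 1 transition: the best local first choice (label 0)
# is no longer part of the best whole sequence.
path, total = viterbi(state, [[0.0, -5.0], [0.0, 0.0]], zeros)
```

In this toy case a left-to-right greedy tagger would commit to label 0
at the first position, while Viterbi returns the sequence that scores
highest as a whole.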

The cool thing architecturally about crf_context as a separate module is that the
weights can be learned through any method we want. As long as the state
scores, state-transition scores, and transition scores are populated on
the context struct, we have everything we need to run Viterbi inference,
etc. without really caring about which training algorithm was used to optimize
the weights, what the features are, how they're stored, etc.

So far the results have been very encouraging. While it is slower to
train a linear-chain CRF, and it will likely add several days to the
training process, it's still reasonably fast at runtime and not all that
slow at training time. In unscientific tests on a busy MacBook Pro, so far
training has been chunking through ~3k addresses / sec, which is only
about half the speed of the greedy tagger (haven't benchmarked the runtime
difference but anecdotally it's hardly noticeable). Libpostal training
runs considerably faster on Linux with gcc, so 3k might be a little low.
I'd also guess that re-computing features every iteration means there's
a limit on the performance of the greedy tagger. The differences might
be more pronounced if features were pre-computed (a possible optimization).
2017-03-10 01:10:22 -05:00
Al
f9e60b13f5 [parser] size the postcode context set appropriately when reading the parser, makes loading a large model much faster 2017-03-09 14:31:12 -05:00
Al
2400122162 [fix] fixing up hash str to id template 2017-03-09 00:54:31 -05:00
Al
4c03e563e0 [parser] for the min updates method to work, the features that have not yet reached the min_updates threshold also need to be ignored when scoring, that way the model has to perform without those features, and should make more updates if they're relevant 2017-03-08 15:40:12 -05:00
Al
a63c182e96 [parser] right context affixes need to use pre-normalized words as well 2017-03-08 13:51:36 -05:00
Al
ce9153d94d [parser] fixing some issues in address_parser_features. Prefix/suffix phrases use the word before token-level normalization (but after string-level normalization like lowercasing), needed to use the same string in the feature function as in address_parser_context_fill. Affects some German suffixes like "str." where the final "." would be deleted in token normalization, but the suffix length would include it. Also, three of the new arrays used in address_parser_context (suffix_phrases, prefix_phrases, and sub_tokens) weren't being cleared per call, which means computing the wrong features at best and a segfault at worst 2017-03-07 17:30:53 -05:00
Al
b6bf8da383 [utils] adding aligned malloc/free/realloc in vector.h and matrix.h, fixing bug in matrix_copy 2017-03-07 16:25:34 -05:00
Al
242b1364ae [parser] using new API in address_parser_test 2017-03-07 16:24:34 -05:00
Al
39f59e7ecf [openaddresses] adding Mayenne, FR 2017-03-07 15:41:33 -05:00
Al
c2b516c761 [openaddresses] adding Hernando County, FL 2017-03-07 15:11:53 -05:00
Al
749bb4907e [openaddresses] adding city of Carlsbad, NM 2017-03-07 10:55:09 -05:00
Al
154fd42299 [openaddresses] adding city of Amarillo, TX 2017-03-07 10:53:52 -05:00
Al
95015990ab [parser] learning a sparser averaged perceptron model for the parser using the following method:
- store a vector of update counts for each feature in the model
- when the model updates after making a mistake, increment the update
  counters for the observed features in that example
- after the model is finished training, keep only the features that
  participated in a minimum number of updates

This method is described in greater detail in this paper from Yoav
Goldberg: https://www.cs.bgu.ac.il/~yoavg/publications/acl2011sparse.pdf

The authors there report a 4x size reduction at only a trivial cost in
terms of accuracy. So far the trials on libpostal indicate roughly the
same, though at lower training set sizes the accuracy cost is greater.

This method is more effective than simple feature pruning as feature
pruning methods are usually based on the frequency of the feature
in the training set, and infrequent features can still be important.
However, the perceptron's early iterations make many updates on
irrelevant features simply because the weights for the more relevant
features aren't tuned yet. The number of updates a feature participates
in can be seen as a measure of its relevance to classifying examples.
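
A rough sketch of the update-count bookkeeping, leaving out weight
averaging for brevity. The class and method names are hypothetical, not
the ones in libpostal. It also skips features below the threshold when
scoring, which is an assumption of this sketch rather than something
stated in this commit message.

```python
from collections import defaultdict


class SparsePerceptron:
    def __init__(self, min_updates=5):
        self.min_updates = min_updates
        self.weights = defaultdict(lambda: defaultdict(float))
        self.update_counts = defaultdict(int)

    def active(self, feature):
        # A feature counts only once it has been in enough updates.
        return self.update_counts.get(feature, 0) >= self.min_updates

    def scores(self, features):
        totals = defaultdict(float)
        for feature in features:
            if not self.active(feature):
                continue
            for label, weight in self.weights[feature].items():
                totals[label] += weight
        return totals

    def update(self, features, truth, guess):
        # Standard perceptron update, made only after a mistake.
        if truth == guess:
            return
        for feature in features:
            self.update_counts[feature] += 1
            self.weights[feature][truth] += 1.0
            self.weights[feature][guess] -= 1.0

    def prune(self):
        # After training, keep only features that met the threshold.
        return {f: dict(w) for f, w in self.weights.items() if self.active(f)}


model = SparsePerceptron(min_updates=2)
model.update(["a", "b"], truth=1, guess=0)
before = dict(model.scores(["a", "b"]))
model.update(["a"], truth=1, guess=0)
after = dict(model.scores(["a", "b"]))
kept = model.prune()
```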

This commit introduces --min-features option to address_parser_train
(default=5), so it can effectively be turned off by using
"--min-features 0" or "--min-features 1".
2017-03-06 22:28:33 -05:00
Al
5c1c1ae0f2 [parser] moving tagger function pointer definition to a separate header so it can be used for other models 2017-03-06 21:42:06 -05:00
Al
cc58ec9db2 [parser] fix another valgrind error in parser training (cstring_array memory can get moved around when using string pointers obtained before adding to it, which can potentially cause a realloc), no longer using the dummy START tags as the feature function can choose to add features for those cases 2017-03-06 21:39:14 -05:00
Al
754f22c79a [parser] moving feature printing to averaged perceptron tagger, taking advantage of trie prefix-sharing in feature incorporating previous tags 2017-03-06 20:32:50 -05:00
Al
839a13577d [parser] fixing affix-related valgrind errors in address parser features 2017-03-06 20:28:42 -05:00
Al
c3581557a1 [parser] counting classes instead of keeping a set 2017-03-06 20:05:01 -05:00
Al
a5283cb313 [fix] trie_new_from_hash 2017-03-06 15:57:42 -05:00
Al
23ed916f09 [openaddresses] adding Hattiesburg, MS 2017-03-06 15:45:23 -05:00
Al
90cb4d904d [openaddresses] adding Longueuil, QC, Canada 2017-03-06 15:43:51 -05:00
Al
5113a1bc32 [utils] tracking keys added in trie construction from hash 2017-03-06 15:28:26 -05:00
Al
dd4f3eb84c [parser] simpler feature names for the state-transition features 2017-03-06 15:25:10 -05:00
Al
39fa8ff1a5 [parser] counting num classes in address parser init for models where it is needed a priori 2017-03-06 15:17:52 -05:00
Al
5f19e63cbe [parser] more logging in init 2017-03-06 15:11:39 -05:00
Al
4d2f77b3f3 [openaddresses] add city of Alexandria, LA 2017-03-06 14:30:25 -05:00
Al
bb922e4ce4 [parser] adding log message 2017-03-06 12:25:22 -05:00
Al
b97de96ab4 [parser] fixing chunked shuffle, making awk splitting work on Mac 2017-03-05 15:06:02 -05:00
Al
0e49fc580a [parser] uint64_t chunk size, no warning if gshuf is available 2017-03-05 14:50:47 -05:00
Al
d99f83b84a [openaddresses] add unit phrases in Cape Girardeau, MO 2017-03-05 04:00:41 -05:00
Al
d1bcced706 [openaddresses] adding some of the new Mississippi sources and city of Cape Girardeau, MO 2017-03-05 03:59:07 -05:00
Al
5d73aa1295 [fix] don't write formatted addresses in the ways-only data set unless the formatter returns non-None value 2017-03-05 03:50:00 -05:00
Al
b76b7b8527 [parser] adding chunked shuffle as a C function (writes each line to one of n random files, runs shuf on each file and concatenates the result). Adding a version which allows specifying a specific chunk size, and using a 2GB limit for address parser training. Allowing gshuf again for Mac as it seems the only problem there was not having enough memory when testing on a Mac laptop. The new limited-memory version should be fast enough. 2017-03-05 02:15:11 -05:00
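
The chunked shuffle idea can be sketched as follows. This is not the C
function from the commit: it shuffles each chunk in memory with
random.shuffle instead of running shuf, takes a hypothetical num_chunks
parameter rather than a byte limit, and assumes every input line ends
with a newline.

```python
import os
import random
import tempfile


def chunked_shuffle(in_path, out_path, num_chunks, seed=None):
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [os.path.join(tmpdir, f"chunk_{i}") for i in range(num_chunks)]
        chunks = [open(p, "w") for p in paths]
        try:
            # Pass 1: scatter each line into a randomly chosen chunk file.
            with open(in_path) as infile:
                for line in infile:
                    rng.choice(chunks).write(line)
        finally:
            for chunk in chunks:
                chunk.close()
        # Pass 2: shuffle one chunk at a time and concatenate the results,
        # so only a single chunk has to fit in memory.
        with open(out_path, "w") as out:
            for path in paths:
                with open(path) as chunk:
                    lines = chunk.readlines()
                rng.shuffle(lines)
                out.writelines(lines)
```

Peak memory is roughly the size of the largest chunk, which is why
picking enough chunks to stay under a fixed limit keeps this usable on
a laptop.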
Al
ba4052c9ba [openaddresses] add Muskogee, OK 2017-03-03 14:57:36 -05:00
Al
2704708f47 [openaddresses] add Orange County, NY 2017-03-03 14:27:05 -05:00
Al
da62fb62ba [openaddresses] adding Polk County, NC 2017-03-03 13:45:58 -05:00
Al
ce21635b00 [openaddresses] adding city of Salina, KS 2017-03-03 13:45:25 -05:00
Al
b4437848c4 [fix] override_country_dir 2017-03-02 14:31:53 -05:00
Al
69351cad98 [openaddresses] add Tippecanoe County, IN 2017-03-02 13:36:22 -05:00
Al
6b8b6982aa [addresses] more classmethods 2017-03-02 04:23:09 -05:00
Al
f7c8a63093 [addresses] making most of the methods on AddressComponents classmethods if possible so they can be accessed easily for sources not using OSM polygon lookup, etc. 2017-03-01 15:51:56 -05:00
Al
702901608b [openaddresses_uk] adding OpenAddresses UK as a data set. No lat/lons but it does have addresses, cities and postcodes 2017-03-01 15:44:25 -05:00