libpostal

tommy/libpostal

Commit Graph

Author	SHA1	Message	Date
Al	9afff5c9ed	[parser/crf] adding an initial training algorithm for CRFs, the averaged perceptron (FTW!)	2017-03-10 01:28:31 -05:00

[parser/crf] adding an initial training algorithm for CRFs, the averaged perceptron (FTW!)

Though it does not generate scores suitable for use as probabilities, and
might achieve slightly lower accuracy on some tasks than its
gradient-based counterparts like SGD (a possibility for libpostal)
or LBFGS (prohibitive on this much data), the averaged perceptron is
appealing for two reasons: speed and low memory usage, i.e. we can still
use all the same tricks as in the greedy model, like sparse construction
of the weight matrix. In this case we can go even sparser than in the
original because the state-transition features are separate from the
state features. We also need to be able to iterate over all of them
instead of simply creating new string keys in the feature space. The
solution is quite simple: we treat the weights for each state-transition
feature as if they have L * L output labels, one per
(previous label, current label) pair, instead of just L. So instead of:

{
    "prev|road|word|DD": {1: 1.0, 2: -1.0}
    ...
}

We'd have:

{
    "word|DD": {(0, 1): 1.0, (0, 2): -1.0}
    ...
}
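A minimal Python sketch of the L * L idea above, not the actual libpostal
code (which is C). The names transition_class, unflatten and
transition_scores, and the label count of 3, are hypothetical; the sketch
assumes the (0, 1) keys are (previous label, current label) pairs and
flattens each pair into a single class id:

```python
from collections import defaultdict

NUM_LABELS = 3  # hypothetical label-set size L


def transition_class(prev_label, label, num_labels=NUM_LABELS):
    """Flatten a (prev_label, label) pair into one of L * L class ids."""
    return prev_label * num_labels + label


def unflatten(class_id, num_labels=NUM_LABELS):
    """Inverse of transition_class: recover (prev_label, label)."""
    return divmod(class_id, num_labels)


# feature string -> {flattened (prev_label, label) class id: weight}
trans_weights = defaultdict(dict)
trans_weights["word|DD"][transition_class(0, 1)] = 1.0
trans_weights["word|DD"][transition_class(0, 2)] = -1.0


def transition_scores(features, weights, num_labels=NUM_LABELS):
    """Sum the sparse weights into a dense L x L table of transition scores."""
    scores = [[0.0] * num_labels for _ in range(num_labels)]
    for feat in features:
        for class_id, w in weights.get(feat, {}).items():
            prev_label, label = unflatten(class_id, num_labels)
            scores[prev_label][label] += w
    return scores
```

With this layout a single lookup of "word|DD" yields the weights for every
label pair at once, so there is no need to build one string key per
previous tag.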

As usual we compress the features to a trie, and the weights to a
compressed sparse row (CSR) matrix after the weights have been
averaged. These representations are smaller, faster to load from
disk, and faster to use at runtime (contiguous arrays vs hashtables).
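To illustrate the CSR layout, here is a rough pure-Python sketch under
stated assumptions: to_csr and row_lookup are hypothetical helpers, and a
plain dict of feature -> row id (assumed contiguous from 0) stands in for
the trie. It is not how libpostal implements it, only the general shape of
the three contiguous arrays:

```python
def to_csr(feature_weights, feature_ids):
    """Pack {feature: {class_id: weight}} into CSR arrays.

    feature_ids maps each feature string to a row index 0..n-1
    (a stand-in for the trie lookup; an assumption of this sketch).
    """
    rows = [[] for _ in range(len(feature_ids))]
    for feat, row_id in feature_ids.items():
        rows[row_id] = sorted(feature_weights.get(feat, {}).items())

    indptr, indices, data = [0], [], []
    for row in rows:
        for class_id, w in row:
            if w != 0.0:  # only non-zero weights are stored
                indices.append(class_id)
                data.append(w)
        indptr.append(len(indices))  # end offset of this row
    return indptr, indices, data


def row_lookup(csr, row_id):
    """Return the (class_id, weight) pairs stored for one feature row."""
    indptr, indices, data = csr
    start, end = indptr[row_id], indptr[row_id + 1]
    return list(zip(indices[start:end], data[start:end]))
```

Scoring a feature then reduces to slicing two flat arrays between
indptr[row] and indptr[row + 1] rather than walking a hashtable.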

This also includes the min_updates variation from the greedy perceptron,
so features that participate in fewer than N updates are discarded at
the end (and also not used in scoring until they meet the threshold so
the model doesn't become dependent on features it doesn't really have).
This tends to discard irrelevant features, keeping the model small
without hurting accuracy much (within a tenth of a percent or so in my
tests on the greedy perceptron).
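
The min_updates behavior could be sketched roughly as follows. This is a
simplified illustration, not the libpostal code: the SparseWeights class
and its method names are hypothetical, weight averaging is left out for
brevity, and counting one update per feature per mistake is an assumption:

```python
from collections import defaultdict


class SparseWeights:
    """Perceptron weights with a min_updates threshold (averaging omitted)."""

    def __init__(self, min_updates=2):
        self.min_updates = min_updates
        self.weights = defaultdict(lambda: defaultdict(float))
        self.update_counts = defaultdict(int)

    def update(self, features, truth, guess):
        """Standard perceptron update on a mistake; counts once per feature."""
        if truth == guess:
            return
        for f in features:
            self.weights[f][truth] += 1.0
            self.weights[f][guess] -= 1.0
            self.update_counts[f] += 1

    def active(self, feature):
        """A feature only counts once it has met the update threshold."""
        return self.update_counts.get(feature, 0) >= self.min_updates

    def score(self, features, class_id):
        """Score a class, ignoring features below the threshold."""
        total = 0.0
        for f in features:
            if self.active(f):
                total += self.weights[f].get(class_id, 0.0)
        return total

    def prune(self):
        """Drop features that never reached min_updates."""
        return {f: dict(w) for f, w in self.weights.items() if self.active(f)}
```

Because score() skips sub-threshold features during training as well, the
model never leans on weights that prune() would later throw away.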
