[docs] Fleshing out parser description, correcting city name in Russian address

Al
2015-12-28 15:46:49 -05:00
parent 45b5e2dd6f
commit d6362ba0fc

@@ -164,7 +164,7 @@ Result:
     "country": "france"
 }

-> Государственный Эрмитаж Дворцовая наб., 34 191186, Saint Petersburg, Russia
+> Государственный Эрмитаж Дворцовая наб., 34 191186, St. Petersburg, Russia

 Result:
@@ -173,8 +173,7 @@ Result:
     "road": "дворцовая наб.",
     "house_number": "34",
     "postcode": "191186",
-    "state_district": "saint",
-    "state": "petersburg",
+    "city": "st. petersburg",
     "country": "russia"
 }
 ```
@@ -550,17 +549,20 @@ like an F1 score or variants, mostly because there's a class bias problem (most
 tokens are non-entities, and a system that simply predicted non-entity for
 every token would actually do fairly well in terms of accuracy). That is not
 the case for address parsing. Every token has a label and there are millions
-of examples of each class in the training data, so accuracy
+of examples of each class in the training data, so accuracy is preferable as it's
+a clean, simple, and intuitive measure of performance.

-We prefer to evaluate on full parses (at the sentence level in NER nomenclature),
-so that means that 98.9% of the time, the address parser gets every single token
-in the address correct, which is quite good performance.
+Here we use full parse accuracy, meaning we only give the parser a "point" in
+the numerator if it gets every single token in the address correct. That should
+be a better measure than simply looking at whether each token was correct.

 Improving the address parser
 ----------------------------

-There are four primary ways the address parser can be improved even further
-(in order of difficulty):
+Though the current parser is quite good for most standard addresses, there
+is still room for improvement, particularly in making sure the training data
+we use is as close as possible to addresses in the wild. There are four primary
+ways the address parser can be improved even further (in order of difficulty):

 1. Contribute addresses to OSM. Anything with an addr:housenumber tag will be
    incorporated automatically into the parser next time it's trained.
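The distinction between per-token accuracy and full parse accuracy described above can be made concrete with a short sketch. This is plain Python over hypothetical label sequences, not libpostal's actual evaluation code:

```python
def token_accuracy(gold, pred):
    """Fraction of individual token labels that match (NER-style token eval)."""
    correct = sum(g == p
                  for gs, ps in zip(gold, pred)
                  for g, p in zip(gs, ps))
    total = sum(len(gs) for gs in gold)
    return correct / total

def full_parse_accuracy(gold, pred):
    """Fraction of addresses where every single token label is correct."""
    return sum(gs == ps for gs, ps in zip(gold, pred)) / len(gold)

# Hypothetical gold/predicted labels for two parsed addresses.
gold = [["house_number", "road"], ["road", "city"]]
pred = [["house_number", "road"], ["road", "state"]]  # one token wrong

print(token_accuracy(gold, pred))       # 0.75
print(full_parse_accuracy(gold, pred))  # 0.5
```

One wrong token costs a quarter of the token-level score but half of the full-parse score, which is why full parse accuracy is the stricter (and arguably more honest) number to report.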
@@ -571,16 +573,18 @@ There are four primary ways the address parser can be improved even further
    and in many other cases there are relatively simple tweaks we can make
    when creating the training data that will ensure the model is trained to
    handle your use case without you having to do any manual data entry.
-   If you see a pattern of obviously bad address parses, post an issue to
-   Github and we'll tr
+   If you see a pattern of obviously bad address parses, the best thing to
+   do is post an issue to Github.

-3. We currently don't have training data for things like flat numbers.
+3. We currently don't have training data for things like apartment/flat numbers.
    The tags are fairly uncommon in OSM and the address-formatting templates
    don't use floor, level, apartment/flat number, etc. This would be a slightly
-   more involved effort, but would be like to begin a discussion around it.
+   more involved effort, but would be worth starting a discussion.

-4. We use a greedy averaged perceptron for the parser model. Viterbi inference
-   using a linear-chain CRF may improve parser performance on certain classes
-   of input since the score is the argmax over the entire label sequence not
-   just the token. This may slow down training significantly.
+4. We use a greedy averaged perceptron for the parser model primarily for its
+   speed and relatively good performance compared to slower, fancier models.
+   Viterbi inference using a linear-chain CRF may improve parser performance
+   on certain classes of input since the score is the argmax over the entire
+   label sequence not just the token. This may slow down training significantly
+   although runtime performance would be relatively unaffected.

 Todos
 -----
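The greedy-vs-Viterbi trade-off mentioned in item 4 of that hunk can be illustrated with a toy sketch. This is plain Python with made-up emission scores and transition weights for two labels, not libpostal's perceptron or any CRF library:

```python
def greedy_decode(scores):
    """Label each token independently by its best emission score."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

def viterbi_decode(scores, trans):
    """Find the label sequence maximizing total emission + transition score."""
    n, k = len(scores), len(scores[0])
    best = [list(scores[0])]  # best[t][j]: best score of any path ending in label j at token t
    back = []                 # backpointers to recover the argmax path
    for t in range(1, n):
        row, ptrs = [], []
        for j in range(k):
            prev = max(range(k), key=lambda i: best[t - 1][i] + trans[i][j])
            row.append(best[t - 1][prev] + trans[prev][j] + scores[t][j])
            ptrs.append(prev)
        best.append(row)
        back.append(ptrs)
    j = max(range(k), key=best[-1].__getitem__)  # best final label
    path = [j]
    for ptrs in reversed(back):  # walk backpointers to the first token
        j = ptrs[j]
        path.append(j)
    return path[::-1]

# Toy numbers: the middle token's emissions barely favor label 1, but the
# transition weights reward staying on the same label.
scores = [[2.0, 1.0], [1.0, 1.1], [2.0, 1.0]]
trans = [[1.0, -2.0], [-2.0, 1.0]]

print(greedy_decode(scores))          # [0, 1, 0]
print(viterbi_decode(scores, trans))  # [0, 0, 0]
```

Greedy decoding flips to label 1 on the middle token because it scores each token in isolation, while Viterbi keeps the coherent sequence because the transition penalty outweighs the tiny emission gain, which is exactly the kind of input class where sequence-level inference could help the parser.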