[docs] Fleshing out parser description, correcting city name in Russian address
38
README.md
@@ -164,7 +164,7 @@ Result:
 "country": "france"
 }
 
-> Государственный Эрмитаж Дворцовая наб., 34 191186, Saint Petersburg, Russia
+> Государственный Эрмитаж Дворцовая наб., 34 191186, St. Petersburg, Russia
 
 Result:
 
@@ -173,8 +173,7 @@ Result:
 "road": "дворцовая наб.",
 "house_number": "34",
 "postcode": "191186",
-"state_district": "saint",
-"state": "petersburg",
+"city": "st. petersburg",
 "country": "russia"
 }
 ```
@@ -550,17 +549,20 @@ like an F1 score or variants, mostly because there's a class bias problem (most
 tokens are non-entities, and a system that simply predicted non-entity for
 every token would actually do fairly well in terms of accuracy). That is not
 the case for address parsing. Every token has a label and there are millions
-of examples of each class in the training data, so accuracy
+of examples of each class in the training data, so accuracy is preferable as it's
+a clean, simple and intuitive measure of performance.
 
-We prefer to evaluate on full parses (at the sentence level in NER nomenclature),
-so that means that 98.9% of the time, the address parser gets every single token
-in the address correct, which is quite good performance.
+Here we use full parse accuracy, meaning we only give the parser a "point" in
+the numerator if it gets every single token in the address correct. That should
+be a better measure than simply looking at whether each token was correct.
 
 Improving the address parser
 ----------------------------
 
-There are four primary ways the address parser can be improved even further
-(in order of difficulty):
+Though the current parser is quite good for most standard addresses, there
+is still room for improvement, particularly in making sure the training data
+we use is as close as possible to addresses in the wild. There are four primary
+ways the address parser can be improved even further (in order of difficulty):
 
 1. Contribute addresses to OSM. Anything with an addr:housenumber tag will be
 incorporated automatically into the parser next time it's trained.
@@ -571,16 +573,18 @@ There are four primary ways the address parser can be improved even further
 and in many other cases there are relatively simple tweaks we can make
 when creating the training data that will ensure the model is trained to
 handle your use case without you having to do any manual data entry.
-If you see a pattern of obviously bad address parses, post an issue to
-Github and we'll tr
+If you see a pattern of obviously bad address parses, the best thing to
+do is post an issue to Github.
-3. We currently don't have training data for things like flat numbers.
+3. We currently don't have training data for things like apartment/flat numbers.
 The tags are fairly uncommon in OSM and the address-formatting templates
 don't use floor, level, apartment/flat number, etc. This would be a slightly
-more involved effort, but would be like to begin a discussion around it.
+more involved effort, but would be worth starting a discussion.
-4. We use a greedy averaged perceptron for the parser model. Viterbi inference
-using a linear-chain CRF may improve parser performance on certain classes
-of input since the score is the argmax over the entire label sequence not
-just the token. This may slow down training significantly.
+4. We use a greedy averaged perceptron for the parser model primarily for its
+speed and relatively good performance compared to slower, fancier models.
+Viterbi inference using a linear-chain CRF may improve parser performance
+on certain classes of input since the score is the argmax over the entire
+label sequence not just the token. This may slow down training significantly
+although runtime performance would be relatively unaffected.
 
 Todos
 -----
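Note on the "full parse accuracy" wording added in this commit: the difference from per-token accuracy is easy to see on a toy example. The sketch below is not the project's evaluation code; the function names and the label sequences are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def token_accuracy(gold: Sequence[List[str]], pred: Sequence[List[str]]) -> float:
    """Fraction of individual tokens whose predicted label matches the gold label."""
    correct = 0
    total = 0
    for g, p in zip(gold, pred):
        if len(g) != len(p):
            raise ValueError("gold and predicted parses must have the same length")
        correct += sum(a == b for a, b in zip(g, p))
        total += len(g)
    return correct / total


def full_parse_accuracy(gold: Sequence[List[str]], pred: Sequence[List[str]]) -> float:
    """Fraction of addresses where every token is labeled correctly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical gold labels for two addresses (one label per token).
gold = [
    ["house_number", "road", "city"],
    ["road", "house_number", "postcode", "city", "country"],
]
# Hypothetical predictions: the second address has a single wrong label.
pred = [
    ["house_number", "road", "city"],
    ["road", "house_number", "postcode", "state", "country"],
]

print(token_accuracy(gold, pred))       # 0.875 (7 of 8 tokens correct)
print(full_parse_accuracy(gold, pred))  # 0.5   (1 of 2 addresses fully correct)
```

One wrong token costs the whole address its point under full parse accuracy, which is why that number is always less than or equal to the per-token figure.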
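Note on item 4: the distinction between greedy decoding and Viterbi inference ("the argmax over the entire label sequence not just the token") can be sketched as follows. This is a toy illustration under loud assumptions: the two labels, the integer scores, and the split into per-token `emissions` and label-to-label `transitions` are all invented here, and none of it reflects the parser's actual features or weights.

```python
from typing import Dict, List

Scores = Dict[str, int]


def greedy_decode(emissions: List[Scores], transitions: Dict[str, Scores]) -> List[str]:
    """Pick the best label for each token in turn, given only the previous choice."""
    path: List[str] = []
    for scores in emissions:
        prev = path[-1] if path else None

        def local(label: str) -> int:
            bonus = transitions[prev][label] if prev is not None else 0
            return scores[label] + bonus

        path.append(max(scores, key=local))
    return path


def viterbi_decode(emissions: List[Scores], transitions: Dict[str, Scores]) -> List[str]:
    """Find the label sequence with the highest total score."""
    labels = list(emissions[0])
    best = {label: emissions[0][label] for label in labels}
    backpointers: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        new_best: Dict[str, int] = {}
        pointers: Dict[str, str] = {}
        for label in labels:
            prev = max(labels, key=lambda p: best[p] + transitions[p][label])
            new_best[label] = best[prev] + transitions[prev][label] + scores[label]
            pointers[label] = prev
        best = new_best
        backpointers.append(pointers)
    path = [max(labels, key=lambda l: best[l])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def sequence_score(path: List[str], emissions: List[Scores],
                   transitions: Dict[str, Scores]) -> int:
    """Total score of a label sequence: emissions plus transitions."""
    total = 0
    prev = None
    for label, scores in zip(path, emissions):
        total += scores[label]
        if prev is not None:
            total += transitions[prev][label]
        prev = label
    return total


# Made-up scores for the tokens "saint" and "petersburg".
emissions = [{"city": 9, "state": 10}, {"city": 5, "state": 6}]
# Made-up transition bonus: a city token followed by another city token.
transitions = {"city": {"city": 20, "state": 0}, "state": {"city": 0, "state": 0}}

greedy_path = greedy_decode(emissions, transitions)
viterbi_path = viterbi_decode(emissions, transitions)
print(greedy_path, sequence_score(greedy_path, emissions, transitions))    # ['state', 'state'] 16
print(viterbi_path, sequence_score(viterbi_path, emissions, transitions))  # ['city', 'city'] 34
```

The greedy decoder commits to "state" for the first token because it scores slightly higher locally, and can never recover the sequence-level bonus. Viterbi decoding considers both tokens together. Training a CRF is a separate matter that this sketch does not cover.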