[docs] README updates, better explanations of normalization and parsing
README.md
@@ -38,35 +38,57 @@ It is written in C for maximum portability and performance.

Examples of normalization
-------------------------

Address normalization may sound trivial initially, especially when thinking
only about the US (if that's where you happen to reside), but it only takes
a few examples to realize how complicated natural language addresses are
internationally. Here's a short list of some less straightforward normalizations
in various languages. The left/right columns in this table are equivalent
strings under libpostal, the left column being user input and the right column
being the indexed (normalized) string.

| Input                               | Output (may be multiple in libpostal) |
| ----------------------------------- |---------------------------------------|
| One-hundred twenty E 96th St        | 120 east 96th street                  |
| C/ Ocho, P.I. 4                     | calle 8 polígono industrial 4         |
| V XX Settembre, 20                  | via 20 settembre 20                   |
| Quatre vingt douze R. de l'Église   | 92 rue de l' église                   |
| ул Каретный Ряд, д 4, строение 7    | улица каретныи ряд дом 4 строение 7   |
| ул Каретный Ряд, д 4, строение 7    | ulica karetnyj rad dom 4 stroenie 7   |
| Marktstrasse 14                     | markt straße 14                       |

libpostal currently supports these types of normalization in *over 60 languages*,
and you can add more (without having to write any C!).

Now, instead of trying to bake address-specific conventions into traditional
document search engines like Elasticsearch using giant synonyms files, scripting,
custom analyzers, tokenizers, and the like, geocoding can be as simple as:

1. Run the addresses in your index through libpostal's expand_address
2. Store the normalized string(s) in your favorite search engine, DB,
   hashtable, etc.
3. Run your user queries or fresh imports through libpostal and search
   the existing database using those strings

In this way, libpostal can perform fuzzy address matching in constant time.
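
For example, step 1 above might look like the following from C. This is a minimal
sketch assuming the names of the current public C API (`libpostal_setup`,
`libpostal_expand_address`, etc.); exact function and option names may differ
between versions, so treat it as illustrative rather than canonical:

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    // Load the data files needed for expansion (dictionaries, transliteration, etc.)
    if (!libpostal_setup() || !libpostal_setup_language_classifier()) {
        return 1;
    }

    libpostal_normalize_options_t options = libpostal_get_default_options();

    size_t num_expansions;
    char **expansions = libpostal_expand_address("One-hundred twenty E 96th St",
                                                 options, &num_expansions);

    // Each expansion is one normalized string you could store in your index
    for (size_t i = 0; i < num_expansions; i++) {
        printf("%s\n", expansions[i]);
    }

    libpostal_expansion_array_destroy(expansions, num_expansions);
    libpostal_teardown();
    libpostal_teardown_language_classifier();
    return 0;
}
```
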
For further reading and some bizarre address edge-cases, see:
[Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/).

Examples of parsing
-------------------

libpostal implements the first truly international statistical address parser,
trained on ~50 million addresses spanning over 100 countries and more than 60
languages. We use OpenStreetMap (anything with an addr:* tag) and the OpenCage
address format templates at https://github.com/OpenCageData/address-formatting
to construct the training data, supplementing with containing polygons and
perturbing the inputs in a number of ways to make the parser as robust as
possible to messy real-world input.

These example parses are taken from the interactive address_parser program
that builds with libpostal on make. Note that the parser doesn't care about
commas vs. no commas, casing, different orderings of components, or components
being left out (e.g. just a city or just a city/postcode).

```
> 781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA
@@ -153,6 +175,10 @@ Result:
}
```

The parser achieves very high accuracy on held-out data, currently 98.9%
correct full parses (meaning a 1 in the numerator for getting *every* token
in the address correct).
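
The same parses can be produced programmatically. Below is a minimal sketch in C,
assuming the current public parser API (`libpostal_setup_parser`,
`libpostal_parse_address`); as with the expansion example above, exact names may
vary by version:

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    // Load the base data files and the trained parser model
    if (!libpostal_setup() || !libpostal_setup_parser()) {
        return 1;
    }

    libpostal_address_parser_options_t options = libpostal_get_address_parser_default_options();
    libpostal_address_parser_response_t *parsed = libpostal_parse_address(
        "781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA", options);

    // Each token span comes back with a label such as house_number, road, city, etc.
    for (size_t i = 0; i < parsed->num_components; i++) {
        printf("%s: %s\n", parsed->labels[i], parsed->components[i]);
    }

    libpostal_address_parser_response_destroy(parsed);
    libpostal_teardown();
    libpostal_teardown_parser();
    return 0;
}
```
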

Installation
------------

@@ -228,10 +254,9 @@ After building libpostal:
cd src/

./address_parser

```

address_parser is an interactive shell. Just type addresses and libpostal will
parse them and print the result.

Data files
@@ -242,7 +267,6 @@ representations of the data structures necessary to perform expansion. For addre
parsing, since model training takes about a day, we publish the fully trained model
to S3 and will update it automatically as new addresses get added to OSM.

Data files are automatically downloaded when you run make. To check for and download
any new data files, run:
@@ -375,15 +399,10 @@ So it's not a geocoder?

If the above sounds a lot like geocoding, that's because it is, in a way.
In the OpenVenues case, however, we do it without a UI or a user to select the
correct address in an autocomplete. Given a database of source addresses
such as OpenAddresses or OpenStreetMap (or all of the above), libpostal
can be used to implement things like address deduping and server-side
batch geocoding in settings like MapReduce.
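
As a rough illustration of the deduping idea, the hypothetical helper below (not
part of libpostal itself; it simply builds on the expansion API sketched earlier)
treats two formatted strings as candidate duplicates when their sets of normalized
expansions share at least one string:

```c
#include <stdbool.h>
#include <string.h>
#include <libpostal/libpostal.h>

// Hypothetical dedupe check: true if two address strings share at least one
// normalized expansion. Assumes libpostal_setup() and
// libpostal_setup_language_classifier() have already been called.
static bool addresses_match(const char *a, const char *b) {
    libpostal_normalize_options_t options = libpostal_get_default_options();

    size_t num_a, num_b;
    char **exp_a = libpostal_expand_address((char *)a, options, &num_a);
    char **exp_b = libpostal_expand_address((char *)b, options, &num_b);

    bool match = false;
    for (size_t i = 0; i < num_a && !match; i++) {
        for (size_t j = 0; j < num_b && !match; j++) {
            if (strcmp(exp_a[i], exp_b[j]) == 0) {
                match = true;
            }
        }
    }

    libpostal_expansion_array_destroy(exp_a, num_a);
    libpostal_expansion_array_destroy(exp_b, num_b);
    return match;
}
```

In a batch setting (e.g. MapReduce), the same expansion step would be applied to
every record so that candidates can be grouped or joined on the normalized strings
rather than compared pairwise.
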
Why C?
------
@@ -542,20 +561,26 @@ There are four primary ways the address parser can be improved even further

1. Contribute addresses to OSM. Anything with an addr:housenumber tag will be
   incorporated automatically into the parser next time it's trained.
2. If the address parser isn't working well for a particular country, language
   or style of address, chances are that some name variations or places are being
   missed/mislabeled during training data creation. Sometimes the fix is to
   add more countries at https://github.com/OpenCageData/address-formatting,
   and in many other cases there are relatively simple tweaks we can make
   when creating the training data that will ensure the model is trained to
   handle your use case without you having to do any manual data entry.
   If you see a pattern of obviously bad address parses, post an issue to
   Github and we'll try to fix it.
3. We currently don't have training data for things like flat numbers.
   The tags are fairly uncommon in OSM and the address-formatting templates
   don't use floor, level, apartment/flat number, etc. This would be a slightly
   more involved effort, but we'd like to begin a discussion around it.
4. We use a greedy averaged perceptron for the parser model. Viterbi inference
   using a linear-chain CRF may improve parser performance on certain classes
   of input, since the score is the argmax over the entire label sequence, not
   just the token. This may slow down training significantly.

Todos
-----

- [ ] Port language classification from Python, train and publish model
- [ ] Publish tests (currently not on Github) and set up continuous integration
- [ ] Hosted documentation