[docs] Removing the coming soon label from language classification, cleaning up the README a bit
README.md
@@ -19,8 +19,8 @@ Stay tuned...
 
 :jp: :us: :gb: :ru: :fr: :kr: :it: :es: :cn: :de:
 
-libpostal is a fast, multilingual, all-i18n-everything NLP library for
-normalizing and parsing physical addresses.
+libpostal is a fast NLP library for parsing and normalizing street addresses
+anywhere in the world.
 
 Addresses and the geographic coordinates they represent are essential for any
 location-based application (map search, transportation, on-demand/delivery
@@ -42,36 +42,40 @@ Examples of normalization
 
 Address normalization may sound trivial initially, especially when thinking
 only about the US (if that's where you happen to reside), but it only takes
-a few examples to realize how complicated natural language addresses are
+a few examples to realize how complex natural language addresses can get
 internationally. Here's a short list of some less straightforward normalizations
 in various languages. The left/right columns in this table are equivalent
 strings under libpostal, the left column being user input and the right column
-being the indexed (normalized) string.
+being the indexed (normalized) string. Note that libpostal automatically
+detects the language(s) used in an address and applies the appropriate expansions.
+The only input needed is the raw address string:
 
 | Input | Output (may be multiple in libpostal) |
-| ----------------------------------- |---------------------------------------|
+| ----------------------------------- |-----------------------------------------|
 | One-hundred twenty E 96th St | 120 east 96th street |
 | C/ Ocho, P.I. 4 | calle 8 polígono industrial 4 |
 | V XX Settembre, 20 | via 20 settembre 20 |
 | Quatre vignt douze R. de l'Église | 92 rue de l' église |
 | ул Каретный Ряд, д 4, строение 7 | улица каретныи ряд дом 4 строение 7 |
-| ул Каретный Ряд, д 4, строение 7 | ulica karetnyj rad dom 4 stroenie 7 |
+| ул Каретный Ряд, д 4, строение 7 | ulitsa karetnyy ryad dom 4 stroyeniye 7 |
 | Marktstrasse 14 | markt straße 14 |
 
-libpostal currently supports these types of normalization in *over 60 languages*,
-and you can add more (without having to write any C!)
+libpostal currently supports these types of normalization in *60+ languages*,
+and you can add more (without having to write any C).
 
 
 Now, instead of trying to bake address-specific conventions into traditional
 document search engines like Elasticsearch using giant synonyms files, scripting,
-custom analyzers, tokenizers, and the like, geocoding can be as simple as:
+custom analyzers, tokenizers, and the like, geocoding can look like this:
 
-1. Run the addresses in your index through libpostal's expand_address
+1. Run the addresses in your database through libpostal's expand_address
 2. Store the normalized string(s) in your favorite search engine, DB,
 hashtable, etc.
 3. Run your user queries or fresh imports through libpostal and search
 the existing database using those strings
 
-In this way, libpostal can perform fuzzy address matching in constant time.
+In this way, libpostal can perform fuzzy address matching in constant time
+relative to the size of the data set.
 
 For further reading and some bizarre address edge-cases, see:
 [Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/).
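To make the three steps above concrete, here is a minimal C sketch of step 1. The option and cleanup function names follow libpostal's expansion API but may differ between versions, so treat this as illustrative and check `libpostal.h` for the current signatures:

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    /* Illustrative names: the default options expand the address in every
       detected language, lowercasing and transliterating as needed */
    normalize_options_t options = get_libpostal_default_options();

    size_t num_expansions;
    char **expansions = expand_address("Quatre vignt douze R. de l'Église",
                                       options, &num_expansions);

    /* Step 2: store every expansion in your search engine / DB / hashtable.
       Step 3: expand each user query the same way and look the strings up. */
    for (size_t i = 0; i < num_expansions; i++) {
        printf("%s\n", expansions[i]);
    }

    expansion_array_destroy(expansions, num_expansions);
    return 0;
}
```

Because matching then reduces to exact lookups on normalized strings, each query costs a fixed number of hash probes no matter how many addresses are indexed, which is where the constant-time claim comes from.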
@@ -255,7 +259,8 @@ Data files
 libpostal needs to download some data files from S3. The basic files are on-disk
 representations of the data structures necessary to perform expansion. For address
 parsing, since model training takes about a day, we publish the fully trained model
-to S3 and will update it automatically as new addresses get added to OSM.
+to S3 and will update it automatically as new addresses get added to OSM. Same goes for
+the language classifier model.
 
 Data files are automatically downloaded when you run make. To check for and download
 any new data files, run:
@@ -278,20 +283,23 @@ optionally be separated so Rosenstraße and Rosen Straße are equivalent.
 
 - **International address parsing**: sequence model which parses
 "123 Main Street New York New York" into {"house_number": 123, "road":
-"Main Street", "city": "New York", "state": "New York"}. Unlike the majority
-of parsers out there, it works for a wide variety of countries and languages,
-not just US/English. The model is trained on > 50M OSM addresses, using the
+"Main Street", "city": "New York", "state": "New York"}. The parser works
+for a wide variety of countries and languages, not just US/English.
+The model is trained on > 50M OSM addresses, using the
 templates in the [OpenCage address formatting repo](https://github.com/OpenCageData/address-formatting) to construct formatted,
-tagged training examples for most countries around the world.
+tagged training examples for most countries around the world. Many types of [normalizations](https://github.com/openvenues/libpostal/blob/master/scripts/geodata/osm/osm_address_training_data.py)
+are performed to make the training data resemble real messy geocoder input as closely as possible.
 
-- **Language classification (coming soon)**: multinomial logistic regression
+- **Language classification**: multinomial logistic regression
 trained on all of OpenStreetMap ways, addr:* tags, toponyms and formatted
 addresses. Labels are derived using point-in-polygon tests in Quattroshapes
 and official/regional languages for countries and admin 1 boundaries
 respectively. So, for example, Spanish is the default language in Spain but
-in different regions e.g. Catalunya, Galicia, the Basque region, regional
-languages are the default. Dictionary-based disambiguation is employed in
+in different regions e.g. Catalunya, Galicia, the Basque region, the respective
+regional languages are the default. Dictionary-based disambiguation is employed in
 cases where the regional language is non-default e.g. Welsh, Breton, Occitan.
+The dictionaries are also used to abbreviate canonical phrases like "Calle" => "C/"
+(performed on both the language classifier and the address parser training sets)
 
 - **Numeric expression parsing** ("twenty first" => 21st,
 "quatre-vignt-douze" => 92, again using data provided in CLDR), supports > 30
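A call sketch for the parsing bullet above, using the example from the text. The option and response type names here are assumptions (this diff doesn't show the parser's C API), so verify them against the header before relying on this:

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    /* Assumed API names -- not taken from this diff */
    address_parser_options_t options = get_libpostal_address_parser_default_options();

    address_parser_response_t *parsed =
        parse_address("123 Main Street New York New York", options);

    /* Per the bullet above, this should label the components roughly as:
       house_number=123, road=Main Street, city=New York, state=New York */
    for (size_t i = 0; i < parsed->num_components; i++) {
        printf("%s: %s\n", parsed->labels[i], parsed->components[i]);
    }

    address_parser_response_destroy(parsed);
    return 0;
}
```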
@@ -299,8 +307,8 @@ languages. Handles languages with concatenated expressions e.g.
 milleottocento => 1800. Optionally normalizes Roman numerals regardless of the
 language (IX => 9) which occur in the names of many monarchs, popes, etc.
 
-- **Geographic name aliasing**: New York, NYC and Nueva York alias to New York
-City. Uses the crowd-sourced GeoNames (geonames.org) database, so alternate
+- **Geographic name aliasing (coming soon)**: New York, NYC and Nueva York alias
+to New York City. Uses the crowd-sourced GeoNames (geonames.org) database, so alternate
 names added by contributors can automatically improve libpostal.
 
 - **Geographic disambiguation (coming soon)**: There are several equally
@@ -327,7 +335,7 @@ languages character by character instead of on whitespace.
 strips accent marks e.g. à => a and/or applies Latin-ASCII transliteration.
 
 - **Transliteration**: e.g. улица => ulica or ulitsa. Uses all
-[CLDR transforms](http://www.unicode.org/repos/cldr/trunk/common/transforms/), the exact same as used by ICU,
+[CLDR transforms](http://www.unicode.org/repos/cldr/trunk/common/transforms/), the exact same source data as used by [ICU](http://site.icu-project.org/),
 though libpostal doesn't require pulling in all of ICU (might conflict
 with your system's version). Note: some languages, particularly Hebrew, Arabic
 and Thai may not include vowels and thus will not often match a transliteration
@@ -349,14 +357,13 @@ Non-goals
 Raison d'être
 -------------
 
-libpostal was created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve
-the problem of venue deduping. In OpenVenues, we have a data set of millions of
+libpostal was created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve the problem of venue deduping. In OpenVenues, we have a data set of millions of
 places derived from terabytes of web pages from the [Common Crawl](http://commoncrawl.org/).
 The Common Crawl is published monthly, and so even merging the results of
 two crawls produces significant duplicates.
 
-Deduping is a relatively well-studied field, and for text documents like web
-pages, academic papers, etc. there exist pretty decent approximate
+Deduping is a relatively well-studied field, and for text documents
+like web pages, academic papers, etc. there exist pretty decent approximate
 similarity methods such as [MinHash](https://en.wikipedia.org/wiki/MinHash).
 
 However, for physical addresses, the frequent use of conventional abbreviations
@@ -388,11 +395,11 @@ So it's not a geocoder?
 -----------------------
 
 If the above sounds a lot like geocoding, that's because it is in a way,
-only in the OpenVenues case, we do it without a UI or a user to select the
-correct address in an autocomplete. Given a database of source addresses
-such as OpenAddresses or OpenStreetMap (or all of the above), libpostal
-can be used to implement things like address deduping and server-side
-batch geocoding in settings like MapReduce.
+only in the OpenVenues case, we have to geocode without a UI or a user
+to select the correct address in an autocomplete dropdown. Given a database
+of source addresses such as OpenAddresses or OpenStreetMap (or all of the above),
+libpostal can be used to implement things like address deduping and server-side
+batch geocoding in settings like MapReduce or stream processing.
 
 Why C?
 ------
@@ -445,12 +452,7 @@ libpostal is written in modern, legible, C99 and uses the following conventions:
 Python codebase
 ---------------
 
-There are actually two Python packages in libpostal.
-
-1. **geodata**: generates C files and data sets used in the C build
-2. **pypostal**: Python bindings for libpostal
-
-geodata is simply a confederation of scripts for preprocessing the various geo
+The [geodata](https://github.com/openvenues/libpostal/tree/master/scripts/geodata) package in the libpostal repo is a confederation of scripts for preprocessing the various geo
 data sets and building input files for the C lib to use during model training.
 Said scripts shouldn't be needed for most users unless you're rebuilding data
 files for the C lib.
@@ -516,7 +518,7 @@ Most of the dictionaries have been derived with the following process:
 
 1. Tokenize every street name in OSM for language x
 2. Count the most common N tokens
-3. Optionally use frequent item set techniques to exctract phrases
+3. Optionally use frequent item set techniques to extract phrases
 4. Run the most frequent words/phrases through Google Translate
 5. Add the ones that mean "street" to dictionaries
 6. Augment by researching addresses in countries speaking language x
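As a toy illustration of steps 1-2 (not code from this repo, and real street-name tokenization is Unicode-aware, whereas this sketch naively splits ASCII), a frequency counter over street names piped in one per line could look like:

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define TABLE_SIZE 65536   /* sketch only: no resizing or overflow handling */
#define MAX_TOKEN  64

typedef struct { char token[MAX_TOKEN]; size_t count; } entry_t;
static entry_t table[TABLE_SIZE];

static size_t hash_token(const char *s) {
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

static void count_token(const char *token) {
    size_t i = hash_token(token);
    while (table[i].count > 0 && strcmp(table[i].token, token) != 0)
        i = (i + 1) % TABLE_SIZE;                /* linear probing */
    strncpy(table[i].token, token, MAX_TOKEN - 1);
    table[i].count++;
}

int main(void) {
    char line[1024];
    while (fgets(line, sizeof(line), stdin)) {   /* one street name per line */
        for (char *tok = strtok(line, " \t\n.,-/()"); tok != NULL;
             tok = strtok(NULL, " \t\n.,-/()")) {
            for (char *p = tok; *p; p++)
                *p = (char)tolower((unsigned char)*p);
            if (strlen(tok) < MAX_TOKEN) count_token(tok);
        }
    }
    /* Step 2's output: the most frequent tokens are the candidates for
       steps 4-5, i.e. translate them and keep the ones meaning "street" */
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (table[i].count >= 100)
            printf("%zu\t%s\n", table[i].count, table[i].token);
    return 0;
}
```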
@@ -576,6 +578,5 @@ ways the address parser can be improved even further (in order of difficulty):
 Todos
 -----
 
-- [ ] Port language classification from Python, train and publish model
 - [ ] Publish tests (currently not on Github) and set up continuous integration
 - [ ] Hosted documentation