[docs] Removing the coming soon label from language classification, cleaning up the README a bit

Al
2016-01-27 14:44:48 -05:00
parent 95a7978131
commit 0bad3adf07


@@ -19,8 +19,8 @@ Stay tuned...
 :jp: :us: :gb: :ru: :fr: :kr: :it: :es: :cn: :de:

-libpostal is a fast, multilingual, all-i18n-everything NLP library for
-normalizing and parsing physical addresses.
+libpostal is a fast NLP library for parsing and normalizing street addresses
+anywhere in the world.

 Addresses and the geographic coordinates they represent are essential for any
 location-based application (map search, transportation, on-demand/delivery
@@ -42,36 +42,40 @@ Examples of normalization
 Address normalization may sound trivial initially, especially when thinking
 only about the US (if that's where you happen to reside), but it only takes
-a few examples to realize how complicated natural language addresses are
+a few examples to realize how complex natural language addresses can get
 internationally. Here's a short list of some less straightforward normalizations
 in various languages. The left/right columns in this table are equivalent
 strings under libpostal, the left column being user input and the right column
-being the indexed (normalized) string.
+being the indexed (normalized) string. Note that libpostal automatically
+detects the language(s) used in an address and applies the appropriate expansions.
+The only input needed is the raw address string:

 | Input | Output (may be multiple in libpostal) |
-| ----------------------------------- |---------------------------------------|
+| ----------------------------------- |-----------------------------------------|
 | One-hundred twenty E 96th St | 120 east 96th street |
 | C/ Ocho, P.I. 4 | calle 8 polígono industrial 4 |
 | V XX Settembre, 20 | via 20 settembre 20 |
 | Quatre vignt douze R. de l'Église | 92 rue de l' église |
 | ул Каретный Ряд, д 4, строение 7 | улица каретныи ряд дом 4 строение 7 |
-| ул Каретный Ряд, д 4, строение 7 | ulica karetnyj rad dom 4 stroenie 7 |
+| ул Каретный Ряд, д 4, строение 7 | ulitsa karetnyy ryad dom 4 stroyeniye 7 |
 | Marktstrasse 14 | markt straße 14 |

-libpostal currently supports these types of normalization in *over 60 languages*,
-and you can add more (without having to write any C!)
+libpostal currently supports these types of normalization in *60+ languages*,
+and you can add more (without having to write any C).

 Now, instead of trying to bake address-specific conventions into traditional
 document search engines like Elasticsearch using giant synonyms files, scripting,
-custom analyzers, tokenizers, and the like, geocoding can be as simple as:
+custom analyzers, tokenizers, and the like, geocoding can look like this:

-1. Run the addresses in your index through libpostal's expand_address
+1. Run the addresses in your database through libpostal's expand_address
 2. Store the normalized string(s) in your favorite search engine, DB,
    hashtable, etc.
 3. Run your user queries or fresh imports through libpostal and search
    the existing database using those strings

-In this way, libpostal can perform fuzzy address matching in constant time.
+In this way, libpostal can perform fuzzy address matching in constant time
+relative to the size of the data set.

 For further reading and some bizarre address edge-cases, see:
 [Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/).
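*Illustrative sketch, not part of this commit:* the three-step recipe above amounts to one library call per address. Here is roughly what step 1 looks like in C; the `libpostal_`-prefixed names follow the library's current public header and may differ in the 2016-era version this commit targets:

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    // Load the expansion data files downloaded at build time
    if (!libpostal_setup()) return 1;

    libpostal_normalize_options_t options = libpostal_get_default_options();

    // Step 1: expand a raw address into its normalized form(s)
    size_t num_expansions;
    char **expansions = libpostal_expand_address(
        "Quatre vignt douze R. de l'Église", options, &num_expansions);

    // Step 2 would store each expansion in your search engine/DB/hashtable
    for (size_t i = 0; i < num_expansions; i++) {
        printf("%s\n", expansions[i]);
    }

    libpostal_expansion_array_destroy(expansions, num_expansions);
    libpostal_teardown();
    return 0;
}
```

Step 3 then runs user queries through the same call, so lookups hit the same normalized keys as the indexed data.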
@@ -255,7 +259,8 @@ Data files
 libpostal needs to download some data files from S3. The basic files are on-disk
 representations of the data structures necessary to perform expansion. For address
 parsing, since model training takes about a day, we publish the fully trained model
-to S3 and will update it automatically as new addresses get added to OSM.
+to S3 and will update it automatically as new addresses get added to OSM. Same goes for
+the language classifier model.

 Data files are automatically downloaded when you run make. To check for and download
 any new data files, run:
@@ -278,20 +283,23 @@ optionally be separated so Rosenstraße and Rosen Straße are equivalent.
 - **International address parsing**: sequence model which parses
 "123 Main Street New York New York" into {"house_number": 123, "road":
-"Main Street", "city": "New York", "state": "New York"}. Unlike the majority
-of parsers out there, it works for a wide variety of countries and languages,
-not just US/English. The model is trained on > 50M OSM addresses, using the
+"Main Street", "city": "New York", "state": "New York"}. The parser works
+for a wide variety of countries and languages, not just US/English.
+The model is trained on > 50M OSM addresses, using the
 templates in the [OpenCage address formatting repo](https://github.com/OpenCageData/address-formatting) to construct formatted,
-tagged training examples for most countries around the world.
+tagged training examples for most countries around the world. Many types of [normalizations](https://github.com/openvenues/libpostal/blob/master/scripts/geodata/osm/osm_address_training_data.py)
+are performed to make the training data resemble real messy geocoder input as closely as possible.

-- **Language classification (coming soon)**: multinomial logistic regression
+- **Language classification**: multinomial logistic regression
 trained on all of OpenStreetMap ways, addr:* tags, toponyms and formatted
 addresses. Labels are derived using point-in-polygon tests in Quattroshapes
 and official/regional languages for countries and admin 1 boundaries
 respectively. So, for example, Spanish is the default language in Spain but
-in different regions e.g. Catalunya, Galicia, the Basque region, regional
-languages are the default. Dictionary-based disambiguation is employed in
+in different regions e.g. Catalunya, Galicia, the Basque region, the respective
+regional languages are the default. Dictionary-based disambiguation is employed in
 cases where the regional language is non-default e.g. Welsh, Breton, Occitan.
+The dictionaries are also used to abbreviate canonical phrases like "Calle" => "C/"
+(performed on both the language classifier and the address parser training sets).

 - **Numeric expression parsing** ("twenty first" => 21st,
 "quatre-vignt-douze" => 92, again using data provided in CLDR), supports > 30
@@ -299,8 +307,8 @@ languages. Handles languages with concatenated expressions e.g.
 milleottocento => 1800. Optionally normalizes Roman numerals regardless of the
 language (IX => 9) which occur in the names of many monarchs, popes, etc.

-- **Geographic name aliasing**: New York, NYC and Nueva York alias to New York
-City. Uses the crowd-sourced GeoNames (geonames.org) database, so alternate
+- **Geographic name aliasing (coming soon)**: New York, NYC and Nueva York alias
+to New York City. Uses the crowd-sourced GeoNames (geonames.org) database, so alternate
 names added by contributors can automatically improve libpostal.

 - **Geographic disambiguation (coming soon)**: There are several equally
@@ -327,7 +335,7 @@ languages character by character instead of on whitespace.
 strips accent marks e.g. à => a and/or applies Latin-ASCII transliteration.

 - **Transliteration**: e.g. улица => ulica or ulitsa. Uses all
-[CLDR transforms](http://www.unicode.org/repos/cldr/trunk/common/transforms/), the exact same as used by ICU,
+[CLDR transforms](http://www.unicode.org/repos/cldr/trunk/common/transforms/), the exact same source data as used by [ICU](http://site.icu-project.org/),
 though libpostal doesn't require pulling in all of ICU (might conflict
 with your system's version). Note: some languages, particularly Hebrew, Arabic
 and Thai may not include vowels and thus will not often match a transliteration
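*Illustrative sketch, not part of this commit:* accent stripping and transliteration are flags on the normalize options rather than separate calls. The field names below (`strip_accents`, `latin_ascii`) are taken from the current header and are an assumption for the version this commit targets; the defaults already enable most of them, so they are set explicitly here only for clarity:

```c
#include <stdio.h>
#include <stdbool.h>
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup()) return 1;

    libpostal_normalize_options_t options = libpostal_get_default_options();
    options.strip_accents = true;  // à => a
    options.latin_ascii = true;    // apply the CLDR Latin-ASCII transform, e.g. é => e

    // Non-Latin input is transliterated during expansion,
    // e.g. улица => ulica or ulitsa
    size_t n;
    char **expansions = libpostal_expand_address("ул Каретный Ряд", options, &n);
    for (size_t i = 0; i < n; i++) printf("%s\n", expansions[i]);

    libpostal_expansion_array_destroy(expansions, n);
    libpostal_teardown();
    return 0;
}
```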
@@ -349,14 +357,13 @@ Non-goals
 Raison d'être
 -------------

-libpostal was created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve
-the problem of venue deduping. In OpenVenues, we have a data set of millions of
+libpostal was created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve the problem of venue deduping. In OpenVenues, we have a data set of millions of
 places derived from terabytes of web pages from the [Common Crawl](http://commoncrawl.org/).
 The Common Crawl is published monthly, and so even merging the results of
 two crawls produces significant duplicates.

-Deduping is a relatively well-studied field, and for text documents like web
-pages, academic papers, etc. there exist pretty decent approximate
+Deduping is a relatively well-studied field, and for text documents
+like web pages, academic papers, etc. there exist pretty decent approximate
 similarity methods such as [MinHash](https://en.wikipedia.org/wiki/MinHash).
 However, for physical addresses, the frequent use of conventional abbreviations
@@ -388,11 +395,11 @@ So it's not a geocoder?
 -----------------------

 If the above sounds a lot like geocoding, that's because it is in a way,
-only in the OpenVenues case, we do it without a UI or a user to select the
-correct address in an autocomplete. Given a database of source addresses
-such as OpenAddresses or OpenStreetMap (or all of the above), libpostal
-can be used to implement things like address deduping and server-side
-batch geocoding in settings like MapReduce.
+only in the OpenVenues case, we have to geocode without a UI or a user
+to select the correct address in an autocomplete dropdown. Given a database
+of source addresses such as OpenAddresses or OpenStreetMap (or all of the above),
+libpostal can be used to implement things like address deduping and server-side
+batch geocoding in settings like MapReduce or stream processing.

 Why C?
 ------
@@ -445,12 +452,7 @@ libpostal is written in modern, legible, C99 and uses the following conventions:
 Python codebase
 ---------------

-There are actually two Python packages in libpostal.
-
-1. **geodata**: generates C files and data sets used in the C build
-2. **pypostal**: Python bindings for libpostal
-
-geodata is simply a confederation of scripts for preprocessing the various geo
+The [geodata](https://github.com/openvenues/libpostal/tree/master/scripts/geodata) package in the libpostal repo is a confederation of scripts for preprocessing the various geo
 data sets and building input files for the C lib to use during model training.
 Said scripts shouldn't be needed for most users unless you're rebuilding data
 files for the C lib.
@@ -516,7 +518,7 @@ Most of the dictionaries have been derived with the following process:
 1. Tokenize every street name in OSM for language x
 2. Count the most common N tokens
-3. Optionally use frequent item set techniques to exctract phrases
+3. Optionally use frequent item set techniques to extract phrases
 4. Run the most frequent words/phrases through Google Translate
 5. Add the ones that mean "street" to dictionaries
 6. Augment by researching addresses in countries speaking language x
@@ -576,6 +578,5 @@ ways the address parser can be improved even further (in order of difficulty):
 Todos
 -----

-- [ ] Port language classification from Python, train and publish model
 - [ ] Publish tests (currently not on Github) and set up continuous integration
 - [ ] Hosted documentation