From 4ab0a654e87c533ba7c6d97b8a78eb2d9a6d5fad Mon Sep 17 00:00:00 2001
From: Al
Date: Wed, 9 Mar 2016 10:56:47 -0500
Subject: [PATCH] [docs] Moving langauge dictionaries README to its own directory, adding note about address_languages repo for getting started

---
 README.md                        | 77 +++-----------------------------
 resources/dictionaries/README.md | 77 ++++++++++++++++++++++++++++++++
 2 files changed, 83 insertions(+), 71 deletions(-)
 create mode 100644 resources/dictionaries/README.md

diff --git a/README.md b/README.md
index aaf74264..65748014 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # libpostal: international street address NLP
 
-[![Build Status](https://travis-ci.org/openvenues/libpostal.svg?branch=master)](https://travis-ci.org/openvenues/libpostal) [![License](https://img.shields.io/github/license/openvenues/libpostal.svg)](https://github.com/openvenues/libpostal/LICENSE)
+[![Build Status](https://travis-ci.org/openvenues/libpostal.svg?branch=master)](https://travis-ci.org/openvenues/libpostal) [![License](https://img.shields.io/github/license/openvenues/libpostal.svg)](https://github.com/openvenues/libpostal/blob/master/LICENSE)
 
 :jp: :us: :gb: :ru: :fr: :kr: :it: :es: :cn: :de:
 
@@ -264,6 +264,11 @@ any new data files, run:
 libpostal_data download all $YOUR_DATA_DIR/libpostal
 ```
 
+Language dictionaries
+---------------------
+
+See [resources/dictionaries](https://github.com/openvenues/libpostal/tree/master/resources/dictionaries)
+
 And replace $YOUR_DATA_DIR with whatever you passed to configure during install.
 
 Features
@@ -468,76 +473,6 @@ data sets and building input files for the C lib to use during model training.
 Said scripts shouldn't be needed for most users unless you're rebuilding data
 files for the C lib.
 
-Language dictionaries
----------------------
-
-It's easy to add new languages/synonyms to libpostal by modifying a few text
-files. The format of each dictionary file roughly resembles a
-Lucene/Elasticsearch synonyms file:
-
-```
-drive|dr
-street|st|str
-road|rd
-```
-
-The leftmost string is treated as the canonical/normalized version. Synonyms
-if any, are appended to the right, delimited by the pipe character.
-
-The supported languages can be found in the [resources/dictionaries](https://github.com/openvenues/libpostal/tree/master/resources/dictionaries).
-
-Each language can define one or more dictionaries (sometimes called "gazetteers" in NLP) to help with address parsing, and normalizing abbreviations. The dictionary types are:
-
-- **academic_degrees.txt**: for post-nominal strings like "M.D.", "Ph.D.", etc.
-- **ambiguous_expansions.txt**: e.g. "E" could be expanded to "East" or could
-be "E Street", so if the string it encountered, it can either be left alone or expanded
-- **building_types.txt**: strings indicating a building/house
-- **company_types.txt**: company suffixes like "Inc" or "GmbH"
-- **concatenated_prefixes_separable.txt**: things like "Hinter..." which can
-be written either concatenated or as separate tokens
-- **concatenated_suffixes_inseparable.txt**: Things like "...bg." => "...burg"
-where the suffix cannot be separated from the main token, but either has an
-abbreviated equivalent or simply can help identify the token in parsing as,
-say, part of a street name
-- **concatenated_suffixes_separable.txt**: Things like "...straße" where the
-suffix can be either concatenated to the main token or separated
-- **directionals.txt**: strings indicating directions (cardinal and
-lower/central/upper, etc.)
-- **level_types.txt**: strings indicating a particular floor
-- **no_number.txt**: strings like "no fixed address"
-- **nulls.txt**: strings meaning "not applicable"
-- **personal_suffixes.txt**: post-nominal suffixes, usually generational
-like Jr/Sr
-- **personal_titles.txt**: civilian, royal and military titles
-- **place_names.txt**: strings found in names of places e.g. "theatre",
-"aquarium", "restaurant". See [Nominatim Special Phrases](http://wiki.openstreetmap.org/wiki/Nominatim/Special_Phrases)
-- **post_office.txt**: strings like "p.o. box"
-- **qualifiers.txt**: strings like "township"
-- **stopwords.txt**: prepositions and articles mostly, very common words
-which may be ignored in some contexts
-- **street_types.txt**: words like "street", "road", "drive" which indicate
-a thoroughfare and their respective abbreviations.
-- **synonyms.txt**: any miscellaneous synonyms/abbreviations e.g. "bros"
-expands to "brothers", etc. These have no special meaning and will essentially
-just be treated as string replacement.
-- **toponyms.txt**: abbreviations for certain abbreviations relating to
-toponyms like regions, places, etc. Note: GeoNames covers most of these.
-In most cases better to leave these alone
-- **unit_types.txt**: strings indicating an apartment or unit number
-
-Most of the dictionaries have been derived with the following process:
-
-1. Tokenize every street name in OSM for language x
-2. Count the most common N tokens
-3. Optionally use frequent item set techniques to extract phrases
-4. Run the most frequent words/phrases through Google Translate
-5. Add the ones that mean "street" to dictionaries
-6. Augment by researching addresses in countries speaking language x
-
-In the future it might be beneficial to move the dictionaries to a wiki
-so they can be crowdsourced by native speakers regardless of whether or not
-they use git.
-
 Address parser accuracy
 -----------------------
 
diff --git a/resources/dictionaries/README.md b/resources/dictionaries/README.md
new file mode 100644
index 00000000..5f0ec982
--- /dev/null
+++ b/resources/dictionaries/README.md
@@ -0,0 +1,77 @@
+
+Language dictionaries
+---------------------
+
+It's easy to add new languages/synonyms to libpostal by modifying a few text
+files. The format of each dictionary file roughly resembles a
+Lucene/Elasticsearch synonyms file:
+
+```
+drive|dr
+street|st|str
+road|rd
+```
+
+The leftmost string is treated as the canonical/normalized version. Synonyms
+if any, are appended to the right, delimited by the pipe character.
+
+The supported languages can be found in the [resources/dictionaries](https://github.com/openvenues/libpostal/tree/master/resources/dictionaries).
+
+Each language can define one or more dictionaries (sometimes called "gazetteers" in NLP) to help with address parsing, and normalizing abbreviations. The dictionary types are:
+
+- **academic_degrees.txt**: for post-nominal strings like "M.D.", "Ph.D.", etc.
+- **ambiguous_expansions.txt**: e.g. "E" could be expanded to "East" but could
+be "E Street", so if the string is encountered, it can either be left alone or expanded. In general, single-letter abbreviations in most languages should also be added to ambiguous_expansions.txt since single letters are also often initials
+- **building_types.txt**: strings indicating a building/house
+- **company_types.txt**: company suffixes like "Inc" or "GmbH"
+- **concatenated_prefixes_separable.txt**: things like "Hinter..." which can
+be written either concatenated or as separate tokens
+- **concatenated_suffixes_inseparable.txt**: Things like "...bg." => "...burg"
+where the suffix cannot be separated from the main token, but either has an
+abbreviated equivalent or simply can help identify the token in parsing as,
+say, part of a street name
+- **concatenated_suffixes_separable.txt**: Things like "...straße" where the
+suffix can be either concatenated to the main token or separated
+- **directionals.txt**: strings indicating directions (cardinal and
+lower/central/upper, etc.)
+- **level_types.txt**: strings indicating a particular floor
+- **no_number.txt**: strings like "no fixed address"
+- **nulls.txt**: strings meaning "not applicable"
+- **personal_suffixes.txt**: post-nominal suffixes, usually generational
+like Jr/Sr
+- **personal_titles.txt**: civilian, royal and military titles
+- **place_names.txt**: strings found in names of places e.g. "theatre",
+"aquarium", "restaurant". [Nominatim Special Phrases](http://wiki.openstreetmap.org/wiki/Nominatim/Special_Phrases) is a great resource for this.
+- **post_office.txt**: strings like "p.o. box"
+- **qualifiers.txt**: strings like "township"
+- **stopwords.txt**: prepositions and articles mostly, very common words
+which may be ignored in some contexts
+- **street_types.txt**: words like "street", "road", "drive" which indicate
+a thoroughfare and their respective abbreviations.
+- **synonyms.txt**: any miscellaneous synonyms/abbreviations e.g. "bros"
+expands to "brothers", etc. These have no special meaning and will essentially
+just be treated as string replacement.
+- **toponyms.txt**: abbreviations for certain abbreviations relating to
+toponyms like regions, places, etc. Note: GeoNames covers most of these.
+In most cases better to leave these alone
+- **unit_types.txt**: strings indicating an apartment or unit number
+
+Most of the dictionaries have been derived using the following process:
+
+1. Tokenize every street/venue name in OSM for language x using libpostal
+2. Count the most common tokens
+3. Use the [Apriori algorithm](https://en.wikipedia.org/wiki/Apriori_algorithm) to extract multi-word phrases
+4. Run the most frequent words/phrases through Google Translate
+5. Add the ones that mean "street" (or other relevant words) to dictionaries
+6. Augment by researching addresses in countries speaking language x
+
+Contributing
+============
+
+If you're a native speaker of one or more languages in libpostal, we'd love your contribution! It's as simple as editing the text files under this directory and submitting a pull request. Dictionaries are organized by [language code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), so feel free to find any language you speak and start editing! If you don't see your language, just add a directory - there's no additional configuration needed.
+
+To get started adding new language dictionaries or improving support for existing languages, check out the [address_languages](https://github.com/openvenues/address_languages) repo, where we've published lists of 1-5 word phrases found in street/venue names in every language in OSM.
+
+In the future it might be beneficial to move these dictionaries to a wiki
+so they can be crowdsourced by native speakers regardless of whether or not
+they use git.
\ No newline at end of file
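
Note on the dictionary format described in the new README: the pipe-delimited layout (leftmost string canonical, synonyms to the right) can be illustrated with a small reader. The following is only a sketch in Python, not libpostal's own loader; the function name `parse_dictionary` and the `sample` list are hypothetical and exist purely for illustration.

```python
from collections import defaultdict


def parse_dictionary(lines):
    """Map every phrase on a line (canonical form and synonyms) to its canonical form.

    Hypothetical helper for illustration only; it is not part of libpostal.
    """
    canonical_for = defaultdict(set)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the pipe character and drop empty fields.
        phrases = [p.strip() for p in line.split("|") if p.strip()]
        if not phrases:
            continue
        canonical = phrases[0]  # the leftmost string is the canonical form
        for phrase in phrases:
            # A set is used because the same abbreviation may appear
            # under more than one canonical form.
            canonical_for[phrase].add(canonical)
    return dict(canonical_for)


# The three example entries from the README.
sample = ["drive|dr", "street|st|str", "road|rd"]
lookup = parse_dictionary(sample)
print(sorted(lookup["str"]))
```

Mapping each phrase to a set of canonical forms, rather than a single string, is a design choice made here so that ambiguous abbreviations (the case ambiguous_expansions.txt is meant for) are not silently overwritten.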