[readme] Readme fixes and additions
47
README.md
@@ -185,8 +185,8 @@ challenges libpostal can handle:
|
|||||||
For further reading and some less intuitive examples of addresses, see
|
For further reading and some less intuitive examples of addresses, see
|
||||||
"[Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/)".
|
"[Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/)".
|
||||||
|
|
||||||
Why C (i.e. are you crazy)?
|
Why C?
|
||||||
---------------------------
|
------
|
||||||
|
|
||||||
libpostal is written in C for three reasons (in order of importance):
|
libpostal is written in C for three reasons (in order of importance):
|
||||||
|
|
||||||
@@ -218,22 +218,35 @@ isn't as important because everything's being done in parallel, but there are
|
|||||||
some streaming ingestion applications at Mapzen where this needs to
|
some streaming ingestion applications at Mapzen where this needs to
|
||||||
run in-process.
|
run in-process.
|
||||||
|
|
||||||
Design philosophy
|
C codebase
|
||||||
-----------------
|
----------
|
||||||
|
|
||||||
libpostal is written in modern, legible, C99.
|
libpostal is written in modern, legible, C99 and uses the following conventions:
|
||||||
|
|
||||||
- Keep it roughly object-oriented, as allowed by C
|
- Roughly object-oriented, as much as allowed by C
|
||||||
- Confine almost all mallocs to *name*_new and all frees to *name*_destroy
|
- Almost no pointer-based data structures, arrays all the way down
|
||||||
- Don't write custom hashtables, sorting algorithms, other undergrad CS stuff
|
- Uses dynamic character arrays (inspired by [sds](https://github.com/antirez/sds)) for safer string handling
|
||||||
- Use generic containers from klib where possible
|
- Confines almost all mallocs to *name*_new and all frees to *name*_destroy
|
||||||
- Take advantage of sparsity in all data structures
|
- Efficient existing implementations for simple things like hashtables
|
||||||
- Use char_array (inspired by [sds](https://github.com/antirez/sds)) when possible instead of C strings.
|
- Generic containers (via [klib](https://github.com/attractivechaos/klib)) whenever possible
|
||||||
- Throughly test for memory leaks before pushing
|
- Data structures take advantage of sparsity as much as possible
|
||||||
- Keep it reasonably cross-platform compatible, particularly for *nix
|
- Efficient double-array trie implementation for most string dictionaries
|
||||||
|
- Tries to stay cross-platform as much as possible, particularly for *nix
|
||||||
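To illustrate the `*name*_new` / `*name*_destroy` convention and the dynamic character array idea from the list above, here is a minimal sketch in C99. It is not libpostal's actual `char_array`: the `demo_char_array` type, its fields, and all three function names are hypothetical, chosen only to show the pattern. Growth by doubling via `realloc` is also an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable character array; NOT libpostal's char_array. */
typedef struct {
    size_t n;   /* bytes used, excluding the trailing NUL */
    size_t m;   /* bytes allocated */
    char *a;    /* contiguous storage, always NUL-terminated */
} demo_char_array;

/* The only place this type is allocated. */
demo_char_array *demo_char_array_new(size_t capacity) {
    demo_char_array *array = malloc(sizeof(*array));
    if (array == NULL) return NULL;
    array->n = 0;
    array->m = capacity > 0 ? capacity : 1;
    array->a = malloc(array->m);
    if (array->a == NULL) {
        free(array);
        return NULL;
    }
    array->a[0] = '\0';
    return array;
}

/* Append a C string, doubling the buffer as needed.
   Returns 0 on success, -1 if the buffer could not grow. */
int demo_char_array_append(demo_char_array *array, const char *str) {
    assert(array != NULL && str != NULL);
    size_t len = strlen(str);
    size_t needed = array->n + len + 1;
    if (needed > array->m) {
        size_t new_m = array->m;
        while (new_m < needed) new_m *= 2;
        char *new_a = realloc(array->a, new_m);
        if (new_a == NULL) return -1;
        array->a = new_a;
        array->m = new_m;
    }
    /* copy the string together with its terminating NUL */
    memcpy(array->a + array->n, str, len + 1);
    array->n += len;
    return 0;
}

/* The only place this type is freed. */
void demo_char_array_destroy(demo_char_array *array) {
    if (array == NULL) return;
    free(array->a);
    free(array);
}
```

A caller pairs every `demo_char_array_new` with exactly one `demo_char_array_destroy`, which keeps ownership easy to audit when checking for leaks. Because the length is tracked in the struct, appends never need to rescan the existing contents.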
|
|
||||||
|
Python codebase
|
||||||
|
---------------
|
||||||
|
|
||||||
|
There are actually two Python packages in libpostal.
|
||||||
|
|
||||||
|
1. **geodata**: generates C files and data sets used in the C build
|
||||||
|
2. **pypostal**: Python bindings for libpostal
|
||||||
|
|
||||||
|
geodata is simply a confederation of scripts which share some common code.
|
||||||
|
Said scripts shouldn't be needed for most users unless you're rebuilding data
|
||||||
|
files for the C lib.
|
||||||
|
|
||||||
Language dictionaries
|
Language dictionaries
|
||||||
----------------------
|
---------------------
|
||||||
|
|
||||||
It's easy to add new languages/synonyms to libpostal by modifying a few text
|
It's easy to add new languages/synonyms to libpostal by modifying a few text
|
||||||
files. The format of each dictionary file roughly resembles a
|
files. The format of each dictionary file roughly resembles a
|
||||||
@@ -291,9 +304,9 @@ In most cases better to leave these alone
|
|||||||
|
|
||||||
Most of the dictionaries have been derived with the following process:
|
Most of the dictionaries have been derived with the following process:
|
||||||
|
|
||||||
1. Tokenize all the streets in OSM for language x
|
1. Tokenize every street name in OSM for language x
|
||||||
2. Count the most common N tokens
|
2. Count the most common N tokens
|
||||||
3. Optionally use frequent item set mining to get frequent phrases
|
3. Optionally use frequent item set techniques to extract phrases
|
||||||
4. Run the most frequent words/phrases through Google Translate
|
4. Run the most frequent words/phrases through Google Translate
|
||||||
5. Add the ones that mean "street" to dictionaries
|
5. Add the ones that mean "street" to dictionaries
|
||||||
6. Augment by researching addresses in countries speaking language x
|
6. Augment by researching addresses in countries speaking language x
|
||||||
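Steps 1 and 2 of the process above can be sketched roughly as follows. This is only an illustration under loud assumptions: it splits on ASCII whitespace with `strtok`, whereas real street names need a Unicode-aware tokenizer, and `demo_token_count` and `demo_top_tokens` are hypothetical names that do not exist in libpostal. Counting is done by sorting and scanning runs of equal tokens rather than with a hashtable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token/count pair; the names are illustrative only. */
typedef struct {
    const char *token;
    size_t count;
} demo_token_count;

static int compare_strings(const void *a, const void *b) {
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* Descending by count, ties broken alphabetically. */
static int compare_counts(const void *a, const void *b) {
    const demo_token_count *x = a, *y = b;
    if (x->count != y->count) return x->count < y->count ? 1 : -1;
    return strcmp(x->token, y->token);
}

/* Split `text` in place on whitespace, count each distinct token, and
   write the `n` most frequent into `out`. Returns the number written.
   Tokens in `out` point into `text`, so `text` must outlive them. */
size_t demo_top_tokens(char *text, demo_token_count *out, size_t n) {
    assert(text != NULL && out != NULL);
    size_t num_tokens = 0, capacity = 16;
    const char **tokens = malloc(capacity * sizeof(*tokens));
    if (tokens == NULL) return 0;

    for (char *tok = strtok(text, " \t\n"); tok != NULL;
         tok = strtok(NULL, " \t\n")) {
        if (num_tokens == capacity) {
            capacity *= 2;
            const char **grown = realloc(tokens, capacity * sizeof(*tokens));
            if (grown == NULL) { free(tokens); return 0; }
            tokens = grown;
        }
        tokens[num_tokens++] = tok;
    }

    /* Sorting places equal tokens next to each other. */
    qsort(tokens, num_tokens, sizeof(*tokens), compare_strings);

    demo_token_count *counts =
        malloc((num_tokens ? num_tokens : 1) * sizeof(*counts));
    if (counts == NULL) { free(tokens); return 0; }

    /* Collapse each run of equal tokens into one counted entry. */
    size_t num_distinct = 0;
    for (size_t i = 0; i < num_tokens; i++) {
        if (num_distinct > 0 &&
            strcmp(counts[num_distinct - 1].token, tokens[i]) == 0) {
            counts[num_distinct - 1].count++;
        } else {
            counts[num_distinct].token = tokens[i];
            counts[num_distinct].count = 1;
            num_distinct++;
        }
    }
    qsort(counts, num_distinct, sizeof(*counts), compare_counts);

    size_t written = num_distinct < n ? num_distinct : n;
    memcpy(out, counts, written * sizeof(*counts));
    free(counts);
    free(tokens);
    return written;
}
```

The input buffer must be writable (a `char[]`, not a string literal) because `strtok` overwrites delimiters, and `strtok` itself is not thread-safe. The resulting top-N list is what would then be reviewed by hand or passed on to the translation step.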
@@ -305,7 +318,7 @@ they use git.
|
|||||||
Installation
|
Installation
|
||||||
------------
|
------------
|
||||||
|
|
||||||
For C users or those writing bindings (if you've written a languag
|
For C users or those writing bindings (if you've written a language
|
||||||
binding, please let us know!):
|
binding, please let us know!):
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|||||||