Provides fast and convenient geolocation bindings for Pandas DataFrames. Uses numpy ndarrays internally to speed things up compared to naively applying a lookup function to each value of a column. Based on maxminddb-rust.
- Supports both MMAP and in-memory implementations
- Supports parallelism (useful for very big datasets)
- Comes with pre-built wheels, so there is no need to install and maintain an external C library to get (better than) C performance
- Minimum supported Python is 3.8
```shell
pip install pandas_maxminddb
```
- The preferred way is to use a precompiled binary wheel, as this requires no toolchain and is the fastest option.
- If you want to build from source, any platform Rust has a target for is supported.
The wheels are built against the following `numpy` and `pandas` distributions:
- If you're on Windows / macOS / Linux there is no need to do anything extra.
- If you use ARMv7 (Raspberry Pi and such), use PiWheels (`--extra-index-url=https://www.piwheels.org/simple`) and install `libatlas-base-dev` for numpy.
- If you use a musl-based distro like Alpine, use Alpine-wheels (`--extra-index-url https://alpine-wheels.github.io/index`) and install `libstdc++` for pandas.
Refer to the build workflow for details.
Py | win x86 | win x64 | macOS x86_64 | macOS AArch64 | linux x86_64 | linux i686 | linux AArch64 | linux ARMv7 | musl linux x86_64 |
---|---|---|---|---|---|---|---|---|---|
3.8 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 | ✅ |
3.9 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 |
3.10 | 🚫 | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 | 🚫 | ✅ |
By importing `pandas_maxminddb` you add the Pandas `geo` extension, which allows you to add columns in-place. This example uses a context manager for the reader's lifetime:
```python
import pandas as pd
from pandas_maxminddb import open_database

ips = pd.DataFrame(data={
    'ip': ["75.63.106.74", "132.206.246.203", "94.226.237.31", "128.119.189.49", "2.30.253.245"]})

with open_database('./GeoLite.mmdb/GeoLite2-City.mmdb') as reader:
    ips.geo.geolocate('ip', reader, ['country', 'city', 'state', 'postcode'])

ips
```
| | ip | city | postcode | state | country |
---|---|---|---|---|---|
0 | 75.63.106.74 | Houston | 77070 | TX | US |
1 | 132.206.246.203 | Montreal | H3A | QC | CA |
2 | 94.226.237.31 | Kapellen | 2950 | VLG | BE |
3 | 128.119.189.49 | Northampton | 01060 | MA | US |
4 | 2.30.253.245 | London | SW15 | ENG | GB |
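Since `geolocate` writes plain DataFrame columns in place, the result composes with ordinary pandas operations. A minimal illustration (standard pandas, nothing library-specific):

```python
# Aggregate the geolocated frame with regular pandas: rows per resolved country.
counts = ips.groupby('country').size().sort_values(ascending=False)
print(counts)

# Or filter on the added columns like any other column.
us_ips = ips[ips['country'] == 'US']
```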
You can also instantiate the reader yourself, e.g.:
```python
import pandas as pd
from pandas_maxminddb import ReaderMem, ReaderMmap

reader = ReaderMem('./GeoLite.mmdb/GeoLite2-City.mmdb')

ips = pd.DataFrame(data={
    'ip': ["75.63.106.74", "132.206.246.203", "94.226.237.31", "128.119.189.49", "2.30.253.245"]})
ips.geo.geolocate('ip', reader, ['country', 'city', 'state', 'postcode'])

ips
```
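`ReaderMmap` is the memory-mapped counterpart imported above. A hedged sketch, assuming it accepts the same database path as `ReaderMem`, would swap it in directly:

```python
import pandas as pd
from pandas_maxminddb import ReaderMmap

# Assumption: ReaderMmap takes the database path just like ReaderMem,
# but memory-maps the file instead of loading it fully into memory.
reader = ReaderMmap('./GeoLite.mmdb/GeoLite2-City.mmdb')

ips = pd.DataFrame(data={
    'ip': ["75.63.106.74", "132.206.246.203", "94.226.237.31", "128.119.189.49", "2.30.253.245"]})
ips.geo.geolocate('ip', reader, ['country', 'city', 'state', 'postcode'])
```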
If the dataset is big enough and you have extra cores, you might benefit from using them. Currently only `ReaderMem` is supported for parallel lookups:
```python
import pandas as pd
from pandas_maxminddb import ReaderMem

reader = ReaderMem('./GeoLite.mmdb/GeoLite2-City.mmdb')

ips = pd.DataFrame(data={
    'ip': ["75.63.106.74", "132.206.246.203", "94.226.237.31", "128.119.189.49", "2.30.253.245"]})
ips.geo.geolocate('ip', reader, ['country', 'city', 'state', 'postcode'], parallel=True)

ips
```
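To check whether parallelism pays off for your data, one option is to time both modes on a larger frame. A rough sketch (the 100k rows here are just the sample IPs repeated, and timings will vary by machine):

```python
import time

import pandas as pd
from pandas_maxminddb import ReaderMem

reader = ReaderMem('./GeoLite.mmdb/GeoLite2-City.mmdb')

# Build a ~100k-row frame by repeating the sample IPs.
sample = ["75.63.106.74", "132.206.246.203", "94.226.237.31", "128.119.189.49", "2.30.253.245"]
big = pd.DataFrame(data={'ip': sample * 20_000})

for parallel in (False, True):
    df = big.copy()
    start = time.perf_counter()
    df.geo.geolocate('ip', reader, ['country', 'city', 'state', 'postcode'], parallel=parallel)
    print(f"parallel={parallel}: {time.perf_counter() - start:.3f}s")
```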
- Tested on an M1 Max with a chunk size of 1024 on a 100k-row dataset; refer to the benchmark for details
Name (time in ms) | Min | Max | Mean | StdDev | Median | IQR | Outliers | OPS | Rounds | Iterations |
---|---|---|---|---|---|---|---|---|---|---|
test_benchmark_pandas_parallel_mem_maxminddb | 52.7588 (1.0) | 57.4206 (1.0) | 54.0573 (1.0) | 1.1782 (1.15) | 53.8497 (1.0) | 1.4194 (1.09) | 4;1 | 18.4989 (1.0) | 20 | 1 |
test_benchmark_pandas_mmap_maxminddb | 240.0050 (4.55) | 244.3257 (4.26) | 242.2177 (4.48) | 1.9017 (1.85) | 243.1021 (4.51) | 3.2122 (2.46) | 2;0 | 4.1285 (0.22) | 5 | 1 |
test_benchmark_pandas_mem_maxminddb | 241.4630 (4.58) | 244.2553 (4.25) | 242.8391 (4.49) | 1.0288 (1.0) | 242.7672 (4.51) | 1.3064 (1.0) | 2;0 | 4.1180 (0.22) | 5 | 1 |
test_benchmark_c_maxminddb | 1,010.6569 (19.16) | 1,055.1080 (18.38) | 1,021.3691 (18.89) | 18.9273 (18.40) | 1,013.3819 (18.82) | 12.9544 (9.92) | 1;1 | 0.9791 (0.05) | 5 | 1 |
test_benchmark_python_maxminddb | 9,021.2686 (170.99) | 9,188.7629 (160.03) | 9,071.0055 (167.80) | 70.0512 (68.09) | 9,039.7811 (167.87) | 84.7766 (64.89) | 1;0 | 0.1102 (0.01) | 5 | 1 |
Due to DataFrame columns being flat arrays and geolocation data coming in a hierarchical format, you might need to provide extra mappings to serve your particular use-case. To do that, follow the Development section to set up your environment and then (see the example after this list):
- Add the column name to geo_column.rs
- Add the column mapping to geolocate.rs
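Once both sides are wired up, the new column is requested through the same `geolocate` call. For illustration only, using a hypothetical `latitude` mapping that does not ship with the library:

```python
import pandas as pd
from pandas_maxminddb import ReaderMem

reader = ReaderMem('./GeoLite.mmdb/GeoLite2-City.mmdb')
ips = pd.DataFrame(data={'ip': ["75.63.106.74"]})

# 'latitude' is a hypothetical column you would add via geo_column.rs and geolocate.rs.
ips.geo.geolocate('ip', reader, ['country', 'city', 'latitude'])
```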
```shell
git clone --recurse-submodules git@github.com:andrusha/pandas-maxminddb.git
PYTHON_CONFIGURE_OPTS="--enable-shared" asdf install
PYTHON_CONFIGURE_OPTS="--enable-shared" python -m venv .venv
source .venv/bin/activate
pip install nox
nox -s test
PYTHONPATH=.venv/lib/python3.8/site-packages cargo test --no-default-features
```
In order to run `nox -s bench` properly, you will need libmaxminddb installed as per the maxminddb instructions prior to installing the Python package, so that the C extension can be benchmarked properly.
On macOS this would require the following:
```shell
brew install libmaxminddb
PATH="/opt/homebrew/Cellar/libmaxminddb/1.7.1/bin:$PATH" LDFLAGS="-L/opt/homebrew/Cellar/libmaxminddb/1.7.1/lib" CPPFLAGS="-I/opt/homebrew/Cellar/libmaxminddb/1.7.1/include" pip install maxminddb --force-reinstall --verbose --no-cache-dir
```