20. DuckDB#

Open In Colab

20.1. Introduction#

This notebook shows how to import data into a DuckDB database. It uses the duckdb Python package to connect to a DuckDB database and import data from various formats, including CSV, JSON, DataFrame, parquet, GeoJSON, Shapefile, GeoParquet, and more.

20.2. Datasets#

The following datasets are used in this notebook. You don’t need to download them, they can be accessed directly from the notebook.

20.3. Installation#

Uncomment the following cell to install the required packages if needed.

# %pip install duckdb leafmap

20.4. Library Import#

import duckdb
import leafmap
import pandas as pd

20.5. Installing Extensions#

DuckDB’s Python API provides functions for installing and loading extensions, which perform the equivalent operations to running the INSTALL and LOAD SQL commands, respectively. An example that installs and loads the httpfs extension looks like follows:

con = duckdb.connect()
con.install_extension("httpfs")
con.load_extension("httpfs")
con.install_extension("spatial")
con.load_extension("spatial")

20.6. Downloading Sample Data#

url = "https://open.gishub.org/data/duckdb/cities.zip"
leafmap.download_file(url, unzip=True)

20.7. CSV Files#

CSV files can be read using the read_csv function, called either from within Python or directly from within SQL. By default, the read_csv function attempts to auto-detect the CSV settings by sampling from the provided file.

# read from a file using fully auto-detected settings
con.read_csv("cities.csv")
# specify options on how the CSV is formatted internally
con.read_csv("cities.csv", header=True, sep=",")
# use the (experimental) parallel CSV reader
con.read_csv("cities.csv", parallel=True)
# directly read a CSV file from within SQL
con.sql("SELECT * FROM 'cities.csv'")
# call read_csv from within SQL
con.sql("SELECT * FROM read_csv_auto('cities.csv')")

20.8. JSON Files#

JSON files can be read using the read_json function, called either from within Python or directly from within SQL. By default, the read_json function will automatically detect if a file contains newline-delimited JSON or regular JSON, and will detect the schema of the objects stored within the JSON file.

# read from a single JSON file
con.read_json("cities.json")
# directly read a JSON file from within SQL
con.sql("SELECT * FROM 'cities.json'")
# call read_json from within SQL
con.sql("SELECT * FROM read_json_auto('cities.json')")

20.9. DataFrames#

DuckDB is automatically able to query a Pandas DataFrame.

df = pd.read_csv("cities.csv")
df
con.sql("SELECT * FROM df").fetchall()

20.10. Parquet Files#

Parquet files can be read using the read_parquet function, called either from within Python or directly from within SQL.

# read from a single Parquet file
con.read_parquet("cities.parquet")
# directly read a Parquet file from within SQL
con.sql("SELECT * FROM 'cities.parquet'")
# call read_parquet from within SQL
con.sql("SELECT * FROM read_parquet('cities.parquet')")

20.11. GeoJSON Files#

con.sql("SELECT * FROM ST_Drivers()")
con.sql("SELECT * FROM ST_Read('cities.geojson')")
con.sql("FROM ST_Read('cities.geojson')")
con.sql("CREATE TABLE cities AS SELECT * FROM ST_Read('cities.geojson')")
con.table("cities")
con.sql("SELECT * FROM cities")

20.12. Shapefiles#

con.sql("SELECT * FROM ST_Read('cities.shp')")
con.sql("FROM ST_Read('cities.shp')")
con.sql(
    """
        CREATE TABLE IF NOT EXISTS cities2 AS 
        SELECT * FROM ST_Read('cities.shp')
        """
)
con.table("cities2")
con.sql("SELECT * FROM cities2")

20.13. GeoParquet Files#

con.sql("SELECT * FROM 'cities.parquet'")
con.sql(
    """
CREATE TABLE IF NOT EXISTS cities3 AS
SELECT * EXCLUDE geometry, ST_GeomFromWKB(geometry) 
AS geometry FROM 'cities.parquet'
"""
)
con.table("cities3")
con.sql(
    """
CREATE TABLE IF NOT EXISTS country AS
SELECT * EXCLUDE geometry, ST_GeomFromWKB(geometry) FROM
        's3://us-west-2.opendata.source.coop/google-research-open-buildings/v2/geoparquet-admin1/country=SSD/*.parquet'
"""
)
con.table("country")
con.sql("SELECT COUNT(*) FROM country")

20.14. References#