ZhangZhihui's Blog

公告

昵称： ZhangZhihuiAAA
园龄： 5年9个月
粉丝： 1
关注： 0
+加关注

日历

2025年3月

日

一

二

三

四

五

六

随笔档案 (761)

阅读排行榜

DuckDB - Study Notes 8

Requirements:

duckdb
emoji
ibis-framework
ibis-framework[duckdb]
jupysql
jupyter-lsp
jupyterlab
pandas
plotly
polars-lts-cpu
pyarrow
sqlparse

pi_relation = duckdb.sql("SELECT pi() AS pi")
type(pi_relation)

# duckdb.duckdb.DuckDBPyRelation

pi_relation.show()

┌───────────────────┐
│        pi         │
│      double       │
├───────────────────┤
│ 3.141592653589793 │
└───────────────────┘

pi_relation

┌───────────────────┐
│        pi         │
│      double       │
├───────────────────┤
│ 3.141592653589793 │
└───────────────────┘

duckdb.sql(
    """
    SELECT *
    FROM read_csv('data/C08/Seattle_Pet_Licenses.csv')
    """
)

As our call to duckdb.sql() is the only expression in this cell, it is evaluated, and the resulting relation object is displayed:

┌────────────────────┬────────────────┬───────────────┬─────────┬──────────────────────┬─────────────────┬──────────┐
│ License Issue Date │ License Number │ Animal's Name │ Species │    Primary Breed     │ Secondary Breed │ ZIP Code │
│      varchar       │    varchar     │    varchar    │ varchar │       varchar        │     varchar     │ varchar  │
├────────────────────┼────────────────┼───────────────┼─────────┼──────────────────────┼─────────────────┼──────────┤
│ December 18 2015   │ S107948        │ Zen           │ Cat     │ Domestic Longhair    │ Mix             │ 98117    │
│ June 14 2016       │ S116503        │ Misty         │ Cat     │ Siberian             │ NULL            │ 98117    │
│ August 04 2016     │ S119301        │ Lyra          │ Cat     │ Mix                  │ NULL            │ 98121    │
│ February 13 2019   │ 962273         │ Veronica      │ Cat     │ Domestic Longhair    │ NULL            │ 98107    │
│ August 10 2019     │ S133113        │ Spider        │ Cat     │ LaPerm               │ NULL            │ 98115    │
│ November 21 2019   │ 8002549        │ Maxx          │ Cat     │ American Shorthair   │ NULL            │ 98125    │
│ May 24 2020        │ S142869        │ Mickey        │ Cat     │ Domestic Longhair    │ NULL            │ 98126    │
│ July 03 2020       │ S112835        │ Diamond       │ Cat     │ Domestic Shorthair   │ Mix             │ 98103    │
│ July 21 2020       │ S131986        │ Nacho         │ Cat     │ Domestic Shorthair   │ Mix             │ 98126    │
│ August 18 2020     │ 8019541        │ Gracie        │ Cat     │ Domestic Medium Hair │ Mix             │ 98133    │
│       ·            │    ·           │   ·           │  ·      │         ·            │  ·              │   ·      │
│       ·            │    ·           │   ·           │  ·      │         ·            │  ·              │   ·      │
│       ·            │    ·           │   ·           │  ·      │         ·            │  ·              │   ·      │
│ November 22 2023   │ S148106        │ Gibbs         │ Cat     │ Domestic Shorthair   │ Mix             │ 98116    │
│ November 22 2023   │ 8042587        │ Barry         │ Cat     │ American Shorthair   │ Mix             │ 98109    │
│ November 22 2023   │ 8042588        │ Penny         │ Cat     │ American Shorthair   │ Mix             │ 98109    │
│ November 22 2023   │ 8050533        │ Sabine        │ Cat     │ Domestic Medium Hair │ York Chocolate  │ 98144    │
│ November 22 2023   │ 8050534        │ Milo          │ Cat     │ Domestic Medium Hair │ Mix             │ 98144    │
│ November 22 2023   │ S137987        │ Spike         │ Cat     │ Domestic Medium Hair │ NULL            │ 98117    │
│ November 22 2023   │ S137986        │ Scout         │ Cat     │ Domestic Shorthair   │ NULL            │ 98117    │
│ November 22 2023   │ 8013084        │ Honeybee      │ Cat     │ American Shorthair   │ Mix             │ 98106    │
│ November 22 2023   │ S157559        │ Bug           │ Cat     │ Ragdoll              │ Siamese         │ 98144    │
│ November 22 2023   │ 8042668        │ Beerus        │ Cat     │ Domestic Shorthair   │ Mix             │ 98125    │
├────────────────────┴────────────────┴───────────────┴─────────┴──────────────────────┴─────────────────┴──────────┤
│ ? rows (>9999 rows, 20 shown)                                                                           7 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

pets_csv_relation = duckdb.read_csv(
    "data/C08/Seattle_Pet_Licenses.csv"
)

pets_csv_relation.types

# [VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR]

pets_csv_relation = duckdb.read_csv(
    "data/C08/Seattle_Pet_Licenses.csv",
    dtype={"License Issue Date": "DATE"},
    date_format="%B %d %Y",
)
pets_csv_relation.types


# [DATE, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR]

pets_csv_relation.limit(5)

┌────────────────────┬────────────────┬───────────────┬─────────┬───────────────────┬─────────────────┬──────────┐
│ License Issue Date │ License Number │ Animal's Name │ Species │   Primary Breed   │ Secondary Breed │ ZIP Code │
│        date        │    varchar     │    varchar    │ varchar │      varchar      │     varchar     │ varchar  │
├────────────────────┼────────────────┼───────────────┼─────────┼───────────────────┼─────────────────┼──────────┤
│ 2015-12-18         │ S107948        │ Zen           │ Cat     │ Domestic Longhair │ Mix             │ 98117    │
│ 2016-06-14         │ S116503        │ Misty         │ Cat     │ Siberian          │ NULL            │ 98117    │
│ 2016-08-04         │ S119301        │ Lyra          │ Cat     │ Mix               │ NULL            │ 98121    │
│ 2019-02-13         │ 962273         │ Veronica      │ Cat     │ Domestic Longhair │ NULL            │ 98107    │
│ 2019-08-10         │ S133113        │ Spider        │ Cat     │ LaPerm            │ NULL            │ 98115    │
└────────────────────┴────────────────┴───────────────┴─────────┴───────────────────┴─────────────────┴──────────┘

print(pets_csv_relation.sql_query())

SELECT *
FROM read_csv_auto(
        ['data/C08/Seattle_Pet_Licenses.csv'],
        (auto_detect = false),
        (delim = ','),
        ("quote" = '"'),
        (null_padding = false),
        ("escape" = '"'),
        (dateformat = '%B %d %Y'),
        (all_varchar = false),
        (
            "columns" = { 'License Issue Date': 'DATE',
                'License Number': 'VARCHAR',
                'Animal' s Name ': ' VARCHAR ', 
                ' Species ': ' VARCHAR ', 
                ' Primary Breed ': ' VARCHAR ', 
                ' Secondary Breed ': ' VARCHAR ', 
                ' ZIP Code ': ' VARCHAR '
            }
        ), 
        (max_line_size = 2097152), 
        (normalize_names = false), 
        ("header" = true), 
        ("skip" = 0), 
        ("parallel" = true), 
        (maximum_line_size = 2097152)
)

pets_csv_relation.to_table("seattle_pets_dataset")

duckdb.sql("SHOW TABLES")

┌──────────────────────┐
│         name         │
│       varchar        │
├──────────────────────┤
│ seattle_pets_dataset │
└──────────────────────┘

Now, we can make a relation from this table we’ve just created using the DuckDBPyConnection.table() method:

pets_table_relation = duckdb.table("seattle_pets_dataset")

We can now use this relation to work with this table via the Relational API. For example, the DuckDBPyRelation.describe() method is equivalent to issuing a SQL DESCRIBE statement:

pets_table_relation.describe()

┌─────────┬────────────────────┬────────────────┬────────────────────┬─────────┬───────────────────┬───────────────────┬──────────┐
│  aggr   │ License Issue Date │ License Number │   Animal's Name    │ Species │   Primary Breed   │  Secondary Breed  │ ZIP Code │
│ varchar │      varchar       │    varchar     │      varchar       │ varchar │      varchar      │      varchar      │ varchar  │
├─────────┼────────────────────┼────────────────┼────────────────────┼─────────┼───────────────────┼───────────────────┼──────────┤
│ count   │ 42567              │ 42567          │ 42526              │ 42567   │ 42567             │ 28373             │ 42440    │
│ mean    │ NULL               │ NULL           │ NULL               │ NULL    │ NULL              │ NULL              │ NULL     │
│ stddev  │ NULL               │ NULL           │ NULL               │ NULL    │ NULL              │ NULL              │ NULL     │
│ min     │ 2015-10-21         │ 015352         │ !zzy               │ Cat     │ Abruzzese Mastiff │ Abruzzese Mastiff │ 22308    │
│ max     │ 2024-04-05         │ s124879        │ “Dr.” Noah Pickles │ Pig     │ Xoloitzcuintli    │ York Chocolate    │ V5J 1P8  │
│ median  │ NULL               │ NULL           │ NULL               │ NULL    │ NULL              │ NULL              │ NULL     │
└─────────┴────────────────────┴────────────────┴────────────────────┴─────────┴───────────────────┴───────────────────┴──────────┘

duckdb.sql("DESCRIBE TABLE seattle_pets_dataset")

┌────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│    column_name     │ column_type │  null   │   key   │ default │  extra  │
│      varchar       │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ License Issue Date │ DATE        │ YES     │ NULL    │ NULL    │ NULL    │
│ License Number     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ Animal's Name      │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ Species            │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ Primary Breed      │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ Secondary Breed    │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ ZIP Code           │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
└────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

pets_table_relation.count("*")

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        42567 │
└──────────────┘

duckdb.sql("SELECT count(*) FROM seattle_pets_dataset")

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        42567 │
└──────────────┘

duckdb.sql("DROP TABLE seattle_pets_dataset")

pets_cleaned_relation = duckdb.sql(
    """
    SELECT
    "License Issue Date" AS issue_date,
    "Animal's Name" AS pet_name,
    "Species" AS species,
    "Primary Breed" AS breed
    FROM pets_csv_relation
    """
)
pets_cleaned_relation.limit(5)

┌────────────┬──────────┬─────────┬───────────────────┐
│ issue_date │ pet_name │ species │       breed       │
│    date    │ varchar  │ varchar │      varchar      │
├────────────┼──────────┼─────────┼───────────────────┤
│ 2015-12-18 │ Zen      │ Cat     │ Domestic Longhair │
│ 2016-06-14 │ Misty    │ Cat     │ Siberian          │
│ 2016-08-04 │ Lyra     │ Cat     │ Mix               │
│ 2019-02-13 │ Veronica │ Cat     │ Domestic Longhair │
│ 2019-08-10 │ Spider   │ Cat     │ LaPerm            │
└────────────┴──────────┴─────────┴───────────────────┘

min_max_relation = duckdb.sql(
    """
    SELECT min(issue_date), max(issue_date)
    FROM pets_cleaned_relation
    """
)
min_max_relation

┌─────────────────┬─────────────────┐
│ min(issue_date) │ max(issue_date) │
│      date       │      date       │
├─────────────────┼─────────────────┤
│ 2015-10-21      │ 2024-04-05      │
└─────────────────┴─────────────────┘

print(min_max_relation.explain())

duckdb.sql(
    """
    SELECT *
    FROM pets_cleaned_relation
    WHERE species = 'Pig'
    """
)

pets_cleaned_relation.filter("species = 'Pig'")

Same result:

┌────────────┬──────────┬─────────┬─────────────┐
│ issue_date │ pet_name │ species │    breed    │
│    date    │ varchar  │ varchar │   varchar   │
├────────────┼──────────┼─────────┼─────────────┤
│ 2022-11-03 │ Millie   │ Pig     │ Pot-Bellied │
│ 2023-06-01 │ Calvin   │ Pig     │ Pot-Bellied │
└────────────┴──────────┴─────────┴─────────────┘

from duckdb import ColumnExpression, ConstantExpression

species_col = ColumnExpression("species")
pig_constant = ConstantExpression("Pig")
pets_cleaned_relation.filter(species_col == pig_constant)

Got the same result as above.

pets_cleaned_relation.filter(
    "species = 'Cat' AND pet_name = 'Leeloo'"
)

name_col = ColumnExpression("pet_name")
is_cat = species_col == ConstantExpression("Cat")
is_leeloo = name_col == ConstantExpression("Leeloo")
pets_cleaned_relation.filter(is_cat & is_leeloo)

Same result:

┌────────────┬──────────┬─────────┬────────────────────┐
│ issue_date │ pet_name │ species │       breed        │
│    date    │ varchar  │ varchar │      varchar       │
├────────────┼──────────┼─────────┼────────────────────┤
│ 2022-04-13 │ Leeloo   │ Cat     │ Domestic Shorthair │
│ 2022-08-15 │ Leeloo   │ Cat     │ Domestic Shorthair │
│ 2022-08-21 │ Leeloo   │ Cat     │ Domestic Shorthair │
│ 2022-10-20 │ Leeloo   │ Cat     │ American Shorthair │
└────────────┴──────────┴─────────┴────────────────────┘

pets_cleaned_relation.filter(is_cat).limit(5)

num_cat_names_rel = (
    pets_cleaned_relation
    .filter(is_cat)
    .select("pet_name")
    .distinct()
    .count("*")
)
num_cat_names_rel

alt_num_cat_names_rel = (
    pets_cleaned_relation
    .filter(is_cat)
    .unique("pet_name")
    .count("*")
)
alt_num_cat_names_rel

Same result:

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         6321 │
└──────────────┘

num_unique_cats_sql = alt_num_cat_names_rel.sql_query()
print(num_unique_cats_sql)

SELECT count_star() FROM (SELECT DISTINCT pet_name FROM (SELECT * FROM (WITH pets_csv_relation AS (SELECT * FROM (SELECT * FROM read_csv_auto(['data/C08/Seattle_Pet_Licenses.csv'], (auto_detect = false), (delim = ','), ("quote" = '"'), (null_padding = false), ("escape" = '"'), (dateformat = '%B %d %Y'), (all_varchar = false), ("columns" = {'License Issue Date': 'DATE', 'License Number': 'VARCHAR', 'Animal's Name': 'VARCHAR', 'Species': 'VARCHAR', 'Primary Breed': 'VARCHAR', 'Secondary Breed': 'VARCHAR', 'ZIP Code': 'VARCHAR'}), (max_line_size = 2097152), (normalize_names = false), ("header" = true), ("skip" = 0), ("parallel" = true), (maximum_line_size = 2097152))))SELECT "License Issue Date" AS issue_date, "Animal's Name" AS pet_name, Species AS species, "Primary Breed" AS breed FROM pets_csv_relation) AS unnamed_relation_a6705feef22a1dde WHERE (species = 'Cat')) AS unnamed_relation_a6705feef22a1dde) AS unnamed_relation_a6705feef22a1dde GROUP BY ALL

import sqlparse

formatted_sql = sqlparse.format(
    num_unique_cats_sql,
    reindent=True
)
print(formatted_sql)

SELECT count_star()
FROM (
        SELECT DISTINCT pet_name
        FROM (
                SELECT *
                FROM (
                        WITH pets_csv_relation AS (
                            SELECT *
                            FROM (
                                    SELECT *
                                    FROM read_csv_auto(
                                            ['data/C08/Seattle_Pet_Licenses.csv'],
                                            (auto_detect = false),
                                            (delim = ','),
                                            ("quote" = '"'),
                                            (null_padding = false),
                                            ("escape" = '"'),
                                            (dateformat = '%B %d %Y'),
                                            (all_varchar = false),
                                            (
                                                "columns" = { 'License Issue Date': 'DATE',
                                                'License Number': 'VARCHAR',
                                                'Animal' s Name ': ' VARCHAR ', ' Species ': ' VARCHAR ', ' Primary Breed ': ' VARCHAR ', ' Secondary Breed ': ' VARCHAR ', ' ZIP Code ': ' VARCHAR '}), (max_line_size = 2097152), (normalize_names = false), ("header" = true), ("skip" = 0), ("parallel" = true), (maximum_line_size = 2097152))))
                        SELECT "License Issue Date" AS issue_date, "Animal' s Name " AS pet_name, Species AS species, " Primary Breed " AS breed
                        FROM pets_csv_relation) AS unnamed_relation_a6705feef22a1dde
                      WHERE (species = 'Cat')) AS unnamed_relation_a6705feef22a1dde) AS unnamed_relation_a6705feef22a1dde
GROUP BY ALL

duckdb.sql(
    """
    SELECT *, length(pet_name) AS name_length
    FROM pets_cleaned_relation
    ORDER BY name_length DESC
    LIMIT 10
    """
)

from duckdb import ColumnExpression, FunctionExpression, StarExpression

star = StarExpression()
name_col = ColumnExpression("pet_name")
name_length_col = FunctionExpression("length", name_col).alias("name_length")
name_length_sort = ColumnExpression("name_length").desc()

longest_names_relation = (
    pets_cleaned_relation
    .select(star, name_length_col)
    .sort(name_length_sort)
    .limit(10)
)

longest_names_relation

Same result:

┌────────────┬────────────────────────────────────────────────────┬─────────┬────────────────────────────┬─────────────┐
│ issue_date │                      pet_name                      │ species │           breed            │ name_length │
│    date    │                      varchar                       │ varchar │          varchar           │    int64    │
├────────────┼────────────────────────────────────────────────────┼─────────┼────────────────────────────┼─────────────┤
│ 2023-08-04 │ Nuit Ahathoor Hecate Sappho Jezebel Lilith Crowley │ Dog     │ Chihuahua, Short Coat      │          50 │
│ 2024-02-10 │ Lady Kassandra Yu Countess of Wallingford DBE      │ Dog     │ Border Collie              │          45 │
│ 2024-02-01 │ KingKing SirBeastmodeEsquire Stella Jr Sr II       │ Dog     │ Terrier, American Pit Bull │          44 │
│ 2024-01-09 │ Alyeska Juniper Cocoa Luna Taber O'Kelley          │ Cat     │ Domestic Shorthair         │          41 │
│ 2024-01-22 │ WINTERDAWG PRINCESS BUTTERCUP FROSTY JANE          │ Dog     │ Alaskan Malamute           │          41 │
│ 2022-12-01 │ Cascade Mountain's Out of a Dream (BRIA)           │ Dog     │ Retriever, Golden          │          40 │
│ 2023-11-01 │ Houlene "Hula" Elizabeth Honeybee Jacobs           │ Dog     │ Coonhound, Bluetick        │          40 │
│ 2023-02-10 │ Agnes "Aggie" Snowball Pineapple Clark             │ Dog     │ Retriever, Labrador        │          38 │
│ 2024-02-01 │ Snapdragon Caylee Reid Gramila O'Keefe             │ Dog     │ Terrier, Jack Russell      │          38 │
│ 2023-02-07 │ Moose Donut Franswa Cuddlewuddle Post              │ Dog     │ Bulldog, French            │          37 │
├────────────┴────────────────────────────────────────────────────┴─────────┴────────────────────────────┴─────────────┤
│ 10 rows                                                                                                    5 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

pets_cleaned_relation.write_csv("seattle_pets.csv")
pets_cleaned_relation.write_parquet("seattle_pets.parquet")

duckdb.sql("COPY pets_cleaned_relation TO 'seattle_pets.csv'")
duckdb.sql("COPY pets_cleaned_relation TO 'seattle_pets.parquet'")

As we write this, there is no DuckDBPyRelation.write_json() Relational API method; however, you can export to JSON files using SQL:

duckdb.sql("COPY pets_cleaned_relation TO 'seattle_pets.json'")

conn = duckdb.connect("seattle_pets.db")
conn.read_parquet("seattle_pets.parquet").to_table("pets")
conn.sql("SHOW TABLES")

┌─────────┐
│  name   │
│ varchar │
├─────────┤
│ pets    │
└─────────┘

conn.table("pets").count("*")

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        42567 │
└──────────────┘

new_dog1 = (
    "2024-05-19",
    "Monty",
    "Dog",
    "Border Collie",
)
conn.table("pets").insert(new_dog1)

new_dog2 = (
    "2024-05-19",
    "Pixie",
    "Dog",
    "Australian Kelpie",
)
new_dog_rel = conn.values(new_dog2)
new_dog_rel.insert_into("pets")

conn.table("pets").filter("issue_date = '2024-05-19'")

┌────────────┬──────────┬─────────┬───────────────────┐
│ issue_date │ pet_name │ species │       breed       │
│    date    │ varchar  │ varchar │      varchar      │
├────────────┼──────────┼─────────┼───────────────────┤
│ 2024-05-19 │ Monty    │ Dog     │ Border Collie     │
│ 2024-05-19 │ Pixie    │ Dog     │ Australian Kelpie │
└────────────┴──────────┴─────────┴───────────────────┘

conn.close()

conn = duckdb.connect()

conn.execute(
    """
    CREATE OR REPLACE TABLE seattle_pets AS
    SELECT * FROM 'seattle_pets.parquet'
    """
)

conn.fetchone()
# (42567,)

conn.fetchmany()
# []

conn.fetchall()
# []

conn.execute("SELECT * FROM seattle_pets")

conn.fetchone()
# (datetime.date(2015, 12, 18), 'Zen', 'Cat', 'Domestic Longhair')

conn.fetchmany()
# [(datetime.date(2016, 6, 14), 'Misty', 'Cat', 'Siberian')]

conn.fetchall()
# [(datetime.date(2016, 8, 4), 'Lyra', 'Cat', 'Mix'),
#  (datetime.date(2019, 2, 13), 'Veronica', 'Cat', 'Domestic Longhair'),
#  (datetime.date(2019, 8, 10), 'Spider', 'Cat', 'LaPerm'),
#  (datetime.date(2019, 11, 21), 'Maxx', 'Cat', 'American Shorthair'),
#  (datetime.date(2020, 5, 24), 'Mickey', 'Cat', 'Domestic Longhair'),
# ......

conn.description

[('issue_date', 'Date', None, None, None, None, None),
 ('pet_name', 'STRING', None, None, None, None, None),
 ('species', 'STRING', None, None, None, None, None),
 ('breed', 'STRING', None, None, None, None, None)]

[conn.fetchone() for i in range(3)]

# [None, None, None]

conn.execute("SELECT * FROM seattle_pets")

[conn.fetchone() for i in range(3)]

[(datetime.date(2015, 12, 18), 'Zen', 'Cat', 'Domestic Longhair'),
 (datetime.date(2016, 6, 14), 'Misty', 'Cat', 'Siberian'),
 (datetime.date(2016, 8, 4), 'Lyra', 'Cat', 'Mix')]

conn.fetchmany(3)

[(datetime.date(2019, 2, 13), 'Veronica', 'Cat', 'Domestic Longhair'),
 (datetime.date(2019, 8, 10), 'Spider', 'Cat', 'LaPerm'),
 (datetime.date(2019, 11, 21), 'Maxx', 'Cat', 'American Shorthair')]

rest_rows = conn.fetchall()
len(rest_rows)
# 42561

import datetime

new_pet1 = (
    datetime.date.today(),
    "Ned",
    "Dog",
    "Border Collie",
)
conn.execute("INSERT INTO seattle_pets VALUES (?, ?, ?, ?)", parameters=new_pet1)

new_pet2 = {
    "name": "Simon",
    "species": "Cat",
    "breed": "Bombay",
    "issue_date": datetime.date.today(),
}
conn.execute(
    """
    INSERT INTO seattle_pets VALUES
    ($issue_date, $name, $species, $breed)
    """,
    new_pet2,
)

conn.execute(
    """
    SELECT *
    FROM seattle_pets
    WHERE issue_date = ?;
    """,
    [datetime.date.today()],
).fetchall()

[(datetime.date(2025, 1, 15), 'Ned', 'Dog', 'Border Collie'),
 (datetime.date(2025, 1, 15), 'Simon', 'Cat', 'Bombay')]

conn.execute("COPY seattle_pets TO 'seattle_pets_updates.csv'")
conn.execute("COPY seattle_pets TO 'seattle_pets_updates.parquet'")

conn.close()

conn = duckdb.connect()
new_conn = conn.cursor()

import pandas as pd

pets_df = pd.read_parquet("seattle_pets.parquet")
duckdb.sql("SELECT * FROM pets_df USING SAMPLE 1")

┌────────────┬──────────┬─────────┬─────────────────────┐
│ issue_date │ pet_name │ species │        breed        │
│    date    │ varchar  │ varchar │       varchar       │
├────────────┼──────────┼─────────┼─────────────────────┤
│ 2023-03-20 │ Winnie   │ Dog     │ Retriever, Labrador │
└────────────┴──────────┴─────────┴─────────────────────┘

pets_dict = {
    "seattle": pd.read_parquet("seattle_pets.parquet")
}
duckdb.register("pets_view", pets_dict["seattle"])
duckdb.sql("SELECT * FROM pets_view USING SAMPLE 1")

┌────────────┬──────────┬─────────┬─────────┐
│ issue_date │ pet_name │ species │  breed  │
│    date    │ varchar  │ varchar │ varchar │
├────────────┼──────────┼─────────┼─────────┤
│ 2022-09-03 │ Scooter  │ Dog     │ Pug     │
└────────────┴──────────┴─────────┴─────────┘

pets_df = pd.read_parquet("seattle_pets.parquet")
duckdb.sql(
    """
    CREATE OR REPLACE TABLE pets_table_from_df AS
    SELECT * FROM pets_df
    """
)
duckdb.sql("SELECT * FROM pets_table_from_df USING SAMPLE 1")

┌────────────┬──────────┬─────────┬─────────────────────┐
│ issue_date │ pet_name │ species │        breed        │
│    date    │ varchar  │ varchar │       varchar       │
├────────────┼──────────┼─────────┼─────────────────────┤
│ 2022-07-08 │ Cleo     │ Dog     │ Australian Shepherd │
└────────────┴──────────┴─────────┴─────────────────────┘

Note that this is not an exhaustive list; see the DuckDB documentation for reference (https://duckdb.org/docs/api/python/result_conversion).

conn = duckdb.connect()
seattle_pets = conn.from_parquet("seattle_pets.parquet")

pandas_df = seattle_pets.df()
pandas_df[
    pandas_df["species"] == "Dog"
].value_counts("breed")[:5]

breed
Retriever, Labrador      3034
Retriever, Golden        1488
Chihuahua, Short Coat    1443
German Shepherd           954
Poodle, Miniature         846
Name: count, dtype: int64

import polars as pl

polars_df = seattle_pets.pl()
polars_df.filter(
    pl.col("species") == "Cat"
)["breed"].value_counts(sort=True)[:5]

conn = duckdb.connect()
conn.execute("SELECT * FROM 'seattle_pets.parquet'")
pets_table = conn.arrow()
pets_table.schema

issue_date: date32[day]
pet_name: string
species: string
breed: string

varchar_type = duckdb.typing.VARCHAR
bigint_type = duckdb.typing.BIGINT

varchar_type = duckdb.typing.DuckDBPyType(str)
bigint_type = duckdb.typing.DuckDBPyType(int)

import datetime

duckdb.values(
    [
        10,
        1_000_000,
        0.95,
        "hello string",
        b"hello bytes",
        True,
        datetime.date.today(),
        None,
    ]
)

┌───────┬─────────┬────────┬──────────────┬─────────────┬─────────┬────────────┬───────┐
│ col0  │  col1   │  col2  │     col3     │    col4     │  col5   │    col6    │ col7  │
│ int32 │  int32  │ double │   varchar    │    blob     │ boolean │    date    │ int32 │
├───────┼─────────┼────────┼──────────────┼─────────────┼─────────┼────────────┼───────┤
│    10 │ 1000000 │   0.95 │ hello string │ hello bytes │ true    │ 2025-01-16 │  NULL │
└───────┴─────────┴────────┴──────────────┴─────────────┴─────────┴────────────┴───────┘

duckdb.values(
    [
        (1, 2), 
        ["hello", "world"], 
        {"key1": 10, "key2": "quack!"}
    ]
)

┌─────────┬────────────────┬────────────────────────────────────┐
│  col0   │      col1      │                col2                │
│ int32[] │   varchar[]    │ struct(key1 integer, key2 varchar) │
├─────────┼────────────────┼────────────────────────────────────┤
│ [1, 2]  │ [hello, world] │ {'key1': 10, 'key2': quack!}       │
└─────────┴────────────────┴────────────────────────────────────┘

import emoji

def emojify(species):
    """Converts a string into a single emoji."""
    emoji_str = emoji.emojize(f":{species.lower()}:")
    if emoji.is_emoji(emoji_str):
        return emoji_str
    return None

emojify("goat")

'🐐'

duckdb.create_function(
    "emojify",
    emojify,
    [duckdb.typing.VARCHAR],
    duckdb.typing.VARCHAR
)

/tmp/ipykernel_4851/4183955593.py:1: DeprecationWarning: numpy.core is deprecated and has been renamed to numpy._core. The numpy._core namespace contains private NumPy internals and its use is discouraged, as NumPy internals can change without warning in any release. In practice, most real-world usage of numpy.core is to access functionality in the public NumPy API. If that is the case, use the public NumPy API. If not, you are using NumPy internals. If you would still like to access an internal attribute, use numpy._core.multiarray.
  duckdb.create_function(

With the emoji function registered, we can now use it in a SQL query as if it were a DuckDB SQL function.

duckdb.sql(
    """
    SELECT *, emojify(species) AS emoji
    FROM 'seattle_pets_updates.parquet'
    USING SAMPLE 10
    """
)

┌────────────┬──────────┬─────────┬──────────────────────────────────────────────┬─────────┐
│ issue_date │ pet_name │ species │                    breed                     │  emoji  │
│    date    │ varchar  │ varchar │                   varchar                    │ varchar │
├────────────┼──────────┼─────────┼──────────────────────────────────────────────┼─────────┤
│ 2022-08-21 │ Bruce    │ Dog     │ Lhasa Apso                                   │ 🐕      │
│ 2022-07-01 │ Nico     │ Dog     │ Terrier, Fox, Wire                           │ 🐕      │
│ 2022-11-08 │ Heera    │ Cat     │ Mix                                          │ 🐈      │
│ 2023-07-24 │ Daisy    │ Dog     │ Mixed Breed, Large (over 44 lbs fully grown) │ 🐕      │
│ 2023-08-07 │ Ziva     │ Dog     │ Poodle, Standard                             │ 🐕      │
│ 2023-04-13 │ Winston  │ Dog     │ Welsh Corgi, Pembroke                        │ 🐕      │
│ 2023-01-12 │ Jamie    │ Cat     │ American Shorthair                           │ 🐈      │
│ 2022-04-27 │ Aurora   │ Cat     │ Domestic Shorthair                           │ 🐈      │
│ 2022-04-19 │ Josie    │ Cat     │ Domestic Longhair                            │ 🐈      │
│ 2023-02-12 │ Pepper   │ Dog     │ Terrier                                      │ 🐕      │
├────────────┴──────────┴─────────┴──────────────────────────────────────────────┴─────────┤
│ 10 rows                                                                        5 columns │
└──────────────────────────────────────────────────────────────────────────────────────────┘

In order to remove the UDF from the database, we just call the remove_function() method:

duckdb.remove_function("emojify")

duckdb.create_function("emojify", emojify, [str], str)

def emojify(species: str) -> str:
    """Converts a string into a single emoji."""
    emoji_str = emoji.emojize(f":{species.lower()}:")
    if emoji.is_emoji(emoji_str):
        return emoji_str
    return None

duckdb.remove_function("emojify")
duckdb.create_function("emojify", emojify)

from duckdb import ConversionException

try:
    duckdb.execute("SELECT '5,000'::INTEGER").fetchall()
except ConversionException as error:
    print(error)
    # handle exception...

Conversion Error: Could not convert string '5,000' to INT32
LINE 1: SELECT '5,000'::INTEGER
                      ^

from duckdb import CatalogException

try:
    duckdb.sql("SELECT * from imaginary_table")
except CatalogException as error:
    print(error)
    # handle exception...

Catalog Error: Table with name imaginary_table does not exist!
Did you mean "pragma_database_list"?

posted on 2025-01-15 20:52 ZhangZhihuiAAA 阅读(4) 评论(0) 编辑收藏举报

刷新页面返回顶部

登录后才能查看或发表评论，立即登录或者逛逛博客园首页

相关博文：

· DuckDB - Study Notes 7

· DuckDB - Study Notes 6

· MySQL的基本操作

· MySQL8-中文参考-三-

· Pandas-2-2-中文文档-十九-

阅读排行：
· 震惊！C++程序真的从main开始吗？99%的程序员都答错了
· 【硬核科普】Trae如何「偷看」你的代码？零基础破解AI编程运行原理
· 单元测试从入门到精通
· 上周热点回顾（3.3-3.9）
· winform 绘制太阳，地球，月球运作规律

历史上的今天：
2024-01-15 Redis - Serialize and Deserialize

导航

搜索

常用链接

我的标签

随笔档案 (761)

阅读排行榜


Copyright © 2025 ZhangZhihuiAAA Powered by .NET 9.0 on Kubernetes 博客园