ZhangZhihui's Blog  

 

 

 

Requirements:

复制代码
duckdb
emoji
ibis-framework
ibis-framework[duckdb]
jupysql
jupyter-lsp
jupyterlab
pandas
plotly
polars-lts-cpu
pyarrow
sqlparse
复制代码

 

 

 

pi_relation = duckdb.sql("SELECT pi() AS pi")
type(pi_relation)

# duckdb.duckdb.DuckDBPyRelation

 

pi_relation.show()

 

┌───────────────────┐
│        pi         │
│      double       │
├───────────────────┤
│ 3.141592653589793 │
└───────────────────┘

 

pi_relation

 

┌───────────────────┐
│        pi         │
│      double       │
├───────────────────┤
│ 3.141592653589793 │
└───────────────────┘

 

duckdb.sql(
    """
    SELECT *
    FROM read_csv('data/C08/Seattle_Pet_Licenses.csv')
    """
)

As our call to duckdb.sql() is the only expression in this cell, it is evaluated, and the resulting relation object is displayed:

复制代码
┌────────────────────┬────────────────┬───────────────┬─────────┬──────────────────────┬─────────────────┬──────────┐
│ License Issue Date │ License Number │ Animal's Name │ Species │    Primary Breed     │ Secondary Breed │ ZIP Code │
│      varchar       │    varchar     │    varchar    │ varchar │       varchar        │     varchar     │ varchar  │
├────────────────────┼────────────────┼───────────────┼─────────┼──────────────────────┼─────────────────┼──────────┤
│ December 18 2015   │ S107948        │ Zen           │ Cat     │ Domestic Longhair    │ Mix             │ 98117    │
│ June 14 2016       │ S116503        │ Misty         │ Cat     │ Siberian             │ NULL            │ 98117    │
│ August 04 2016     │ S119301        │ Lyra          │ Cat     │ Mix                  │ NULL            │ 98121    │
│ February 13 2019   │ 962273         │ Veronica      │ Cat     │ Domestic Longhair    │ NULL            │ 98107    │
│ August 10 2019     │ S133113        │ Spider        │ Cat     │ LaPerm               │ NULL            │ 98115    │
│ November 21 2019   │ 8002549        │ Maxx          │ Cat     │ American Shorthair   │ NULL            │ 98125    │
│ May 24 2020        │ S142869        │ Mickey        │ Cat     │ Domestic Longhair    │ NULL            │ 98126    │
│ July 03 2020       │ S112835        │ Diamond       │ Cat     │ Domestic Shorthair   │ Mix             │ 98103    │
│ July 21 2020       │ S131986        │ Nacho         │ Cat     │ Domestic Shorthair   │ Mix             │ 98126    │
│ August 18 2020     │ 8019541        │ Gracie        │ Cat     │ Domestic Medium Hair │ Mix             │ 98133    │
│       ·            │    ·           │   ·           │  ·      │         ·            │  ·              │   ·      │
│       ·            │    ·           │   ·           │  ·      │         ·            │  ·              │   ·      │
│       ·            │    ·           │   ·           │  ·      │         ·            │  ·              │   ·      │
│ November 22 2023   │ S148106        │ Gibbs         │ Cat     │ Domestic Shorthair   │ Mix             │ 98116    │
│ November 22 2023   │ 8042587        │ Barry         │ Cat     │ American Shorthair   │ Mix             │ 98109    │
│ November 22 2023   │ 8042588        │ Penny         │ Cat     │ American Shorthair   │ Mix             │ 98109    │
│ November 22 2023   │ 8050533        │ Sabine        │ Cat     │ Domestic Medium Hair │ York Chocolate  │ 98144    │
│ November 22 2023   │ 8050534        │ Milo          │ Cat     │ Domestic Medium Hair │ Mix             │ 98144    │
│ November 22 2023   │ S137987        │ Spike         │ Cat     │ Domestic Medium Hair │ NULL            │ 98117    │
│ November 22 2023   │ S137986        │ Scout         │ Cat     │ Domestic Shorthair   │ NULL            │ 98117    │
│ November 22 2023   │ 8013084        │ Honeybee      │ Cat     │ American Shorthair   │ Mix             │ 98106    │
│ November 22 2023   │ S157559        │ Bug           │ Cat     │ Ragdoll              │ Siamese         │ 98144    │
│ November 22 2023   │ 8042668        │ Beerus        │ Cat     │ Domestic Shorthair   │ Mix             │ 98125    │
├────────────────────┴────────────────┴───────────────┴─────────┴──────────────────────┴─────────────────┴──────────┤
│ ? rows (>9999 rows, 20 shown)                                                                           7 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
复制代码

 

 

pets_csv_relation = duckdb.read_csv(
    "data/C08/Seattle_Pet_Licenses.csv"
)

 

pets_csv_relation.types

# [VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR]

 

 

复制代码
pets_csv_relation = duckdb.read_csv(
    "data/C08/Seattle_Pet_Licenses.csv",
    dtype={"License Issue Date": "DATE"},
    date_format="%B %d %Y",
)
pets_csv_relation.types


# [DATE, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR]
复制代码

 

pets_csv_relation.limit(5)

 

复制代码
┌────────────────────┬────────────────┬───────────────┬─────────┬───────────────────┬─────────────────┬──────────┐
│ License Issue Date │ License Number │ Animal's Name │ Species │   Primary Breed   │ Secondary Breed │ ZIP Code │
│        date        │    varchar     │    varchar    │ varchar │      varchar      │     varchar     │ varchar  │
├────────────────────┼────────────────┼───────────────┼─────────┼───────────────────┼─────────────────┼──────────┤
│ 2015-12-18         │ S107948        │ Zen           │ Cat     │ Domestic Longhair │ Mix             │ 98117    │
│ 2016-06-14         │ S116503        │ Misty         │ Cat     │ Siberian          │ NULL            │ 98117    │
│ 2016-08-04         │ S119301        │ Lyra          │ Cat     │ Mix               │ NULL            │ 98121    │
│ 2019-02-13         │ 962273         │ Veronica      │ Cat     │ Domestic Longhair │ NULL            │ 98107    │
│ 2019-08-10         │ S133113        │ Spider        │ Cat     │ LaPerm            │ NULL            │ 98115    │
└────────────────────┴────────────────┴───────────────┴─────────┴───────────────────┴─────────────────┴──────────┘
复制代码

 

print(pets_csv_relation.sql_query())

 

复制代码
SELECT *
FROM read_csv_auto(
        ['data/C08/Seattle_Pet_Licenses.csv'],
        (auto_detect = false),
        (delim = ','),
        ("quote" = '"'),
        (null_padding = false),
        ("escape" = '"'),
        (dateformat = '%B %d %Y'),
        (all_varchar = false),
        (
            "columns" = { 'License Issue Date': 'DATE',
                'License Number': 'VARCHAR',
                'Animal' s Name ': ' VARCHAR ', 
                ' Species ': ' VARCHAR ', 
                ' Primary Breed ': ' VARCHAR ', 
                ' Secondary Breed ': ' VARCHAR ', 
                ' ZIP Code ': ' VARCHAR '
            }
        ), 
        (max_line_size = 2097152), 
        (normalize_names = false), 
        ("header" = true), 
        ("skip" = 0), 
        ("parallel" = true), 
        (maximum_line_size = 2097152)
)
复制代码

 

pets_csv_relation.to_table("seattle_pets_dataset")

 

duckdb.sql("SHOW TABLES")

 

┌──────────────────────┐
│         name         │
│       varchar        │
├──────────────────────┤
│ seattle_pets_dataset │
└──────────────────────┘

Now, we can make a relation from this table we’ve just created using the DuckDBPyConnection.table() method:

pets_table_relation = duckdb.table("seattle_pets_dataset")

We can now use this relation to work with this table via the Relational API. For example, the DuckDBPyRelation.describe() method is equivalent to issuing a SQL DESCRIBE statement:

pets_table_relation.describe()

 

复制代码
┌─────────┬────────────────────┬────────────────┬────────────────────┬─────────┬───────────────────┬───────────────────┬──────────┐
│  aggr   │ License Issue Date │ License Number │   Animal's Name    │ Species │   Primary Breed   │  Secondary Breed  │ ZIP Code │
│ varchar │      varchar       │    varchar     │      varchar       │ varchar │      varchar      │      varchar      │ varchar  │
├─────────┼────────────────────┼────────────────┼────────────────────┼─────────┼───────────────────┼───────────────────┼──────────┤
│ count   │ 42567              │ 42567          │ 42526              │ 42567   │ 42567             │ 28373             │ 42440    │
│ mean    │ NULL               │ NULL           │ NULL               │ NULL    │ NULL              │ NULL              │ NULL     │
│ stddev  │ NULL               │ NULL           │ NULL               │ NULL    │ NULL              │ NULL              │ NULL     │
│ min     │ 2015-10-21         │ 015352         │ !zzy               │ Cat     │ Abruzzese Mastiff │ Abruzzese Mastiff │ 22308    │
│ max     │ 2024-04-05         │ s124879        │ “Dr.” Noah Pickles │ Pig     │ Xoloitzcuintli    │ York Chocolate    │ V5J 1P8  │
│ median  │ NULL               │ NULL           │ NULL               │ NULL    │ NULL              │ NULL              │ NULL     │
└─────────┴────────────────────┴────────────────┴────────────────────┴─────────┴───────────────────┴───────────────────┴──────────┘
复制代码

 

duckdb.sql("DESCRIBE TABLE seattle_pets_dataset")

 

复制代码
┌────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│    column_name     │ column_type │  null   │   key   │ default │  extra  │
│      varchar       │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ License Issue Date │ DATE        │ YES     │ NULL    │ NULL    │ NULL    │
│ License Number     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ Animal's Name      │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ Species            │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ Primary Breed      │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ Secondary Breed    │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ ZIP Code           │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
└────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
复制代码

 

pets_table_relation.count("*")

 

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        42567 │
└──────────────┘

 

duckdb.sql("SELECT count(*) FROM seattle_pets_dataset")

 

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        42567 │
└──────────────┘

 

duckdb.sql("DROP TABLE seattle_pets_dataset")

 

复制代码
pets_cleaned_relation = duckdb.sql(
    """
    SELECT
    "License Issue Date" AS issue_date,
    "Animal's Name" AS pet_name,
    "Species" AS species,
    "Primary Breed" AS breed
    FROM pets_csv_relation
    """
)
pets_cleaned_relation.limit(5)
复制代码

 

复制代码
┌────────────┬──────────┬─────────┬───────────────────┐
│ issue_date │ pet_name │ species │       breed       │
│    date    │ varchar  │ varchar │      varchar      │
├────────────┼──────────┼─────────┼───────────────────┤
│ 2015-12-18 │ Zen      │ Cat     │ Domestic Longhair │
│ 2016-06-14 │ Misty    │ Cat     │ Siberian          │
│ 2016-08-04 │ Lyra     │ Cat     │ Mix               │
│ 2019-02-13 │ Veronica │ Cat     │ Domestic Longhair │
│ 2019-08-10 │ Spider   │ Cat     │ LaPerm            │
└────────────┴──────────┴─────────┴───────────────────┘
复制代码

 

min_max_relation = duckdb.sql(
    """
    SELECT min(issue_date), max(issue_date)
    FROM pets_cleaned_relation
    """
)
min_max_relation

 

┌─────────────────┬─────────────────┐
│ min(issue_date) │ max(issue_date) │
│      date       │      date       │
├─────────────────┼─────────────────┤
│ 2015-10-21      │ 2024-04-05      │
└─────────────────┴─────────────────┘

 

print(min_max_relation.explain())

 

 

 

duckdb.sql(
    """
    SELECT *
    FROM pets_cleaned_relation
    WHERE species = 'Pig'
    """
)

 

pets_cleaned_relation.filter("species = 'Pig'")

 

Same result:

┌────────────┬──────────┬─────────┬─────────────┐
│ issue_date │ pet_name │ species │    breed    │
│    date    │ varchar  │ varchar │   varchar   │
├────────────┼──────────┼─────────┼─────────────┤
│ 2022-11-03 │ Millie   │ Pig     │ Pot-Bellied │
│ 2023-06-01 │ Calvin   │ Pig     │ Pot-Bellied │
└────────────┴──────────┴─────────┴─────────────┘

 

from duckdb import ColumnExpression, ConstantExpression

species_col = ColumnExpression("species")
pig_constant = ConstantExpression("Pig")
pets_cleaned_relation.filter(species_col == pig_constant)

Got the same result as above.

 

pets_cleaned_relation.filter(
    "species = 'Cat' AND pet_name = 'Leeloo'"
)

 

name_col = ColumnExpression("pet_name")
is_cat = species_col == ConstantExpression("Cat")
is_leeloo = name_col == ConstantExpression("Leeloo")
pets_cleaned_relation.filter(is_cat & is_leeloo)

 

Same result:

复制代码
┌────────────┬──────────┬─────────┬────────────────────┐
│ issue_date │ pet_name │ species │       breed        │
│    date    │ varchar  │ varchar │      varchar       │
├────────────┼──────────┼─────────┼────────────────────┤
│ 2022-04-13 │ Leeloo   │ Cat     │ Domestic Shorthair │
│ 2022-08-15 │ Leeloo   │ Cat     │ Domestic Shorthair │
│ 2022-08-21 │ Leeloo   │ Cat     │ Domestic Shorthair │
│ 2022-10-20 │ Leeloo   │ Cat     │ American Shorthair │
└────────────┴──────────┴─────────┴────────────────────┘
复制代码

 

pets_cleaned_relation.filter(is_cat).limit(5)

 

num_cat_names_rel = (
    pets_cleaned_relation
    .filter(is_cat)
    .select("pet_name")
    .distinct()
    .count("*")
)
num_cat_names_rel

 

alt_num_cat_names_rel = (
    pets_cleaned_relation
    .filter(is_cat)
    .unique("pet_name")
    .count("*")
)
alt_num_cat_names_rel

 

Same result:

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         6321 │
└──────────────┘

 

num_unique_cats_sql = alt_num_cat_names_rel.sql_query()
print(num_unique_cats_sql)

 

SELECT count_star() FROM (SELECT DISTINCT pet_name FROM (SELECT * FROM (WITH pets_csv_relation AS (SELECT * FROM (SELECT * FROM read_csv_auto(['data/C08/Seattle_Pet_Licenses.csv'], (auto_detect = false), (delim = ','), ("quote" = '"'), (null_padding = false), ("escape" = '"'), (dateformat = '%B %d %Y'), (all_varchar = false), ("columns" = {'License Issue Date': 'DATE', 'License Number': 'VARCHAR', 'Animal's Name': 'VARCHAR', 'Species': 'VARCHAR', 'Primary Breed': 'VARCHAR', 'Secondary Breed': 'VARCHAR', 'ZIP Code': 'VARCHAR'}), (max_line_size = 2097152), (normalize_names = false), ("header" = true), ("skip" = 0), ("parallel" = true), (maximum_line_size = 2097152))))SELECT "License Issue Date" AS issue_date, "Animal's Name" AS pet_name, Species AS species, "Primary Breed" AS breed FROM pets_csv_relation) AS unnamed_relation_a6705feef22a1dde WHERE (species = 'Cat')) AS unnamed_relation_a6705feef22a1dde) AS unnamed_relation_a6705feef22a1dde GROUP BY ALL

 

import sqlparse

formatted_sql = sqlparse.format(
    num_unique_cats_sql,
    reindent=True
)
print(formatted_sql)

 

复制代码
SELECT count_star()
FROM (
        SELECT DISTINCT pet_name
        FROM (
                SELECT *
                FROM (
                        WITH pets_csv_relation AS (
                            SELECT *
                            FROM (
                                    SELECT *
                                    FROM read_csv_auto(
                                            ['data/C08/Seattle_Pet_Licenses.csv'],
                                            (auto_detect = false),
                                            (delim = ','),
                                            ("quote" = '"'),
                                            (null_padding = false),
                                            ("escape" = '"'),
                                            (dateformat = '%B %d %Y'),
                                            (all_varchar = false),
                                            (
                                                "columns" = { 'License Issue Date': 'DATE',
                                                'License Number': 'VARCHAR',
                                                'Animal' s Name ': ' VARCHAR ', ' Species ': ' VARCHAR ', ' Primary Breed ': ' VARCHAR ', ' Secondary Breed ': ' VARCHAR ', ' ZIP Code ': ' VARCHAR '}), (max_line_size = 2097152), (normalize_names = false), ("header" = true), ("skip" = 0), ("parallel" = true), (maximum_line_size = 2097152))))
                        SELECT "License Issue Date" AS issue_date, "Animal' s Name " AS pet_name, Species AS species, " Primary Breed " AS breed
                        FROM pets_csv_relation) AS unnamed_relation_a6705feef22a1dde
                      WHERE (species = 'Cat')) AS unnamed_relation_a6705feef22a1dde) AS unnamed_relation_a6705feef22a1dde
GROUP BY ALL
复制代码

 

duckdb.sql(
    """
    SELECT *, length(pet_name) AS name_length
    FROM pets_cleaned_relation
    ORDER BY name_length DESC
    LIMIT 10
    """
)

 

复制代码
from duckdb import ColumnExpression, FunctionExpression, StarExpression

star = StarExpression()
name_col = ColumnExpression("pet_name")
name_length_col = FunctionExpression("length", name_col).alias("name_length")
name_length_sort = ColumnExpression("name_length").desc()

longest_names_relation = (
    pets_cleaned_relation
    .select(star, name_length_col)
    .sort(name_length_sort)
    .limit(10)
)

longest_names_relation
复制代码

 

Same result:

复制代码
┌────────────┬────────────────────────────────────────────────────┬─────────┬────────────────────────────┬─────────────┐
│ issue_date │                      pet_name                      │ species │           breed            │ name_length │
│    date    │                      varchar                       │ varchar │          varchar           │    int64    │
├────────────┼────────────────────────────────────────────────────┼─────────┼────────────────────────────┼─────────────┤
│ 2023-08-04 │ Nuit Ahathoor Hecate Sappho Jezebel Lilith Crowley │ Dog     │ Chihuahua, Short Coat      │          50 │
│ 2024-02-10 │ Lady Kassandra Yu Countess of Wallingford DBE      │ Dog     │ Border Collie              │          45 │
│ 2024-02-01 │ KingKing SirBeastmodeEsquire Stella Jr Sr II       │ Dog     │ Terrier, American Pit Bull │          44 │
│ 2024-01-09 │ Alyeska Juniper Cocoa Luna Taber O'Kelley          │ Cat     │ Domestic Shorthair         │          41 │
│ 2024-01-22 │ WINTERDAWG PRINCESS BUTTERCUP FROSTY JANE          │ Dog     │ Alaskan Malamute           │          41 │
│ 2022-12-01 │ Cascade Mountain's Out of a Dream (BRIA)           │ Dog     │ Retriever, Golden          │          40 │
│ 2023-11-01 │ Houlene "Hula" Elizabeth Honeybee Jacobs           │ Dog     │ Coonhound, Bluetick        │          40 │
│ 2023-02-10 │ Agnes "Aggie" Snowball Pineapple Clark             │ Dog     │ Retriever, Labrador        │          38 │
│ 2024-02-01 │ Snapdragon Caylee Reid Gramila O'Keefe             │ Dog     │ Terrier, Jack Russell      │          38 │
│ 2023-02-07 │ Moose Donut Franswa Cuddlewuddle Post              │ Dog     │ Bulldog, French            │          37 │
├────────────┴────────────────────────────────────────────────────┴─────────┴────────────────────────────┴─────────────┤
│ 10 rows                                                                                                    5 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
复制代码

 

 

pets_cleaned_relation.write_csv("seattle_pets.csv")
pets_cleaned_relation.write_parquet("seattle_pets.parquet")

 

duckdb.sql("COPY pets_cleaned_relation TO 'seattle_pets.csv'")
duckdb.sql("COPY pets_cleaned_relation TO 'seattle_pets.parquet'")

As we write this, there is no DuckDBPyRelation.write_json() Relational API method; however, you can export to JSON files using SQL:

duckdb.sql("COPY pets_cleaned_relation TO 'seattle_pets.json'")

 

conn = duckdb.connect("seattle_pets.db")
conn.read_parquet("seattle_pets.parquet").to_table("pets")
conn.sql("SHOW TABLES")

 

┌─────────┐
│  name   │
│ varchar │
├─────────┤
│ pets    │
└─────────┘

 

conn.table("pets").count("*")

 

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        42567 │
└──────────────┘

 

new_dog1 = (
    "2024-05-19",
    "Monty",
    "Dog",
    "Border Collie",
)
conn.table("pets").insert(new_dog1)

 

new_dog2 = (
    "2024-05-19",
    "Pixie",
    "Dog",
    "Australian Kelpie",
)
new_dog_rel = conn.values(new_dog2)
new_dog_rel.insert_into("pets")

 

conn.table("pets").filter("issue_date = '2024-05-19'")

 

┌────────────┬──────────┬─────────┬───────────────────┐
│ issue_date │ pet_name │ species │       breed       │
│    date    │ varchar  │ varchar │      varchar      │
├────────────┼──────────┼─────────┼───────────────────┤
│ 2024-05-19 │ Monty    │ Dog     │ Border Collie     │
│ 2024-05-19 │ Pixie    │ Dog     │ Australian Kelpie │
└────────────┴──────────┴─────────┴───────────────────┘

 

conn.close()

 

conn = duckdb.connect()

 

 

conn.execute(
    """
    CREATE OR REPLACE TABLE seattle_pets AS
    SELECT * FROM 'seattle_pets.parquet'
    """
)

 

conn.fetchone()
# (42567,)

conn.fetchmany()
# []

conn.fetchall()
# []

 

conn.execute("SELECT * FROM seattle_pets")

 

复制代码
conn.fetchone()
# (datetime.date(2015, 12, 18), 'Zen', 'Cat', 'Domestic Longhair')

conn.fetchmany()
# [(datetime.date(2016, 6, 14), 'Misty', 'Cat', 'Siberian')]

conn.fetchall()
# [(datetime.date(2016, 8, 4), 'Lyra', 'Cat', 'Mix'),
#  (datetime.date(2019, 2, 13), 'Veronica', 'Cat', 'Domestic Longhair'),
#  (datetime.date(2019, 8, 10), 'Spider', 'Cat', 'LaPerm'),
#  (datetime.date(2019, 11, 21), 'Maxx', 'Cat', 'American Shorthair'),
#  (datetime.date(2020, 5, 24), 'Mickey', 'Cat', 'Domestic Longhair'),
# ......
复制代码

 

conn.description

 

[('issue_date', 'Date', None, None, None, None, None),
 ('pet_name', 'STRING', None, None, None, None, None),
 ('species', 'STRING', None, None, None, None, None),
 ('breed', 'STRING', None, None, None, None, None)]

 

[conn.fetchone() for i in range(3)]

# [None, None, None]

 

conn.execute("SELECT * FROM seattle_pets")

[conn.fetchone() for i in range(3)]

 

[(datetime.date(2015, 12, 18), 'Zen', 'Cat', 'Domestic Longhair'),
 (datetime.date(2016, 6, 14), 'Misty', 'Cat', 'Siberian'),
 (datetime.date(2016, 8, 4), 'Lyra', 'Cat', 'Mix')]

 

conn.fetchmany(3)

 

[(datetime.date(2019, 2, 13), 'Veronica', 'Cat', 'Domestic Longhair'),
 (datetime.date(2019, 8, 10), 'Spider', 'Cat', 'LaPerm'),
 (datetime.date(2019, 11, 21), 'Maxx', 'Cat', 'American Shorthair')]

 

rest_rows = conn.fetchall()
len(rest_rows)
# 42561

 

复制代码
import datetime

new_pet1 = (
    datetime.date.today(),
    "Ned",
    "Dog",
    "Border Collie",
)
conn.execute("INSERT INTO seattle_pets VALUES (?, ?, ?, ?)", parameters=new_pet1)
复制代码

 

复制代码
new_pet2 = {
    "name": "Simon",
    "species": "Cat",
    "breed": "Bombay",
    "issue_date": datetime.date.today(),
}
conn.execute(
    """
    INSERT INTO seattle_pets VALUES
    ($issue_date, $name, $species, $breed)
    """,
    new_pet2,
)
复制代码

 

conn.execute(
    """
    SELECT *
    FROM seattle_pets
    WHERE issue_date = ?;
    """,
    [datetime.date.today()],
).fetchall()

 

[(datetime.date(2025, 1, 15), 'Ned', 'Dog', 'Border Collie'),
 (datetime.date(2025, 1, 15), 'Simon', 'Cat', 'Bombay')]

 

conn.execute("COPY seattle_pets TO 'seattle_pets_updates.csv'")
conn.execute("COPY seattle_pets TO 'seattle_pets_updates.parquet'")

 

conn.close()

 

conn = duckdb.connect()
new_conn = conn.cursor()

 

 

import pandas as pd

pets_df = pd.read_parquet("seattle_pets.parquet")
duckdb.sql("SELECT * FROM pets_df USING SAMPLE 1")

 

┌────────────┬──────────┬─────────┬─────────────────────┐
│ issue_date │ pet_name │ species │        breed        │
│    date    │ varchar  │ varchar │       varchar       │
├────────────┼──────────┼─────────┼─────────────────────┤
│ 2023-03-20 │ Winnie   │ Dog     │ Retriever, Labrador │
└────────────┴──────────┴─────────┴─────────────────────┘

 

pets_dict = {
    "seattle": pd.read_parquet("seattle_pets.parquet")
}
duckdb.register("pets_view", pets_dict["seattle"])
duckdb.sql("SELECT * FROM pets_view USING SAMPLE 1")

 

┌────────────┬──────────┬─────────┬─────────┐
│ issue_date │ pet_name │ species │  breed  │
│    date    │ varchar  │ varchar │ varchar │
├────────────┼──────────┼─────────┼─────────┤
│ 2022-09-03 │ Scooter  │ Dog     │ Pug     │
└────────────┴──────────┴─────────┴─────────┘

 

pets_df = pd.read_parquet("seattle_pets.parquet")
duckdb.sql(
    """
    CREATE OR REPLACE TABLE pets_table_from_df AS
    SELECT * FROM pets_df
    """
)
duckdb.sql("SELECT * FROM pets_table_from_df USING SAMPLE 1")

 

┌────────────┬──────────┬─────────┬─────────────────────┐
│ issue_date │ pet_name │ species │        breed        │
│    date    │ varchar  │ varchar │       varchar       │
├────────────┼──────────┼─────────┼─────────────────────┤
│ 2022-07-08 │ Cleo     │ Dog     │ Australian Shepherd │
└────────────┴──────────┴─────────┴─────────────────────┘

 

Note that this is not an exhaustive list; see the DuckDB documentation for reference (https://duckdb.org/docs/api/python/result_conversion).

 

conn = duckdb.connect()
seattle_pets = conn.from_parquet("seattle_pets.parquet")

pandas_df = seattle_pets.df()
pandas_df[
    pandas_df["species"] == "Dog"
].value_counts("breed")[:5]

 

breed
Retriever, Labrador      3034
Retriever, Golden        1488
Chihuahua, Short Coat    1443
German Shepherd           954
Poodle, Miniature         846
Name: count, dtype: int64

 

import polars as pl

polars_df = seattle_pets.pl()
polars_df.filter(
    pl.col("species") == "Cat"
)["breed"].value_counts(sort=True)[:5]

 

 

conn = duckdb.connect()
conn.execute("SELECT * FROM 'seattle_pets.parquet'")
pets_table = conn.arrow()
pets_table.schema

 

issue_date: date32[day]
pet_name: string
species: string
breed: string

 

varchar_type = duckdb.typing.VARCHAR
bigint_type = duckdb.typing.BIGINT

 

varchar_type = duckdb.typing.DuckDBPyType(str)
bigint_type = duckdb.typing.DuckDBPyType(int)

 

复制代码
import datetime

duckdb.values(
    [
        10,
        1_000_000,
        0.95,
        "hello string",
        b"hello bytes",
        True,
        datetime.date.today(),
        None,
    ]
)
复制代码

 

┌───────┬─────────┬────────┬──────────────┬─────────────┬─────────┬────────────┬───────┐
│ col0  │  col1   │  col2  │     col3     │    col4     │  col5   │    col6    │ col7  │
│ int32 │  int32  │ double │   varchar    │    blob     │ boolean │    date    │ int32 │
├───────┼─────────┼────────┼──────────────┼─────────────┼─────────┼────────────┼───────┤
│    10 │ 1000000 │   0.95 │ hello string │ hello bytes │ true    │ 2025-01-16 │  NULL │
└───────┴─────────┴────────┴──────────────┴─────────────┴─────────┴────────────┴───────┘

 

duckdb.values(
    [
        (1, 2), 
        ["hello", "world"], 
        {"key1": 10, "key2": "quack!"}
    ]
)

 

┌─────────┬────────────────┬────────────────────────────────────┐
│  col0   │      col1      │                col2                │
│ int32[] │   varchar[]    │ struct(key1 integer, key2 varchar) │
├─────────┼────────────────┼────────────────────────────────────┤
│ [1, 2]  │ [hello, world] │ {'key1': 10, 'key2': quack!}       │
└─────────┴────────────────┴────────────────────────────────────┘

 

import emoji

def emojify(species):
    """Converts a string into a single emoji."""
    emoji_str = emoji.emojize(f":{species.lower()}:")
    if emoji.is_emoji(emoji_str):
        return emoji_str
    return None

 

emojify("goat")
'🐐'

duckdb.create_function(
    "emojify",
    emojify,
    [duckdb.typing.VARCHAR],
    duckdb.typing.VARCHAR
)

 

/tmp/ipykernel_4851/4183955593.py:1: DeprecationWarning: numpy.core is deprecated and has been renamed to numpy._core. The numpy._core namespace contains private NumPy internals and its use is discouraged, as NumPy internals can change without warning in any release. In practice, most real-world usage of numpy.core is to access functionality in the public NumPy API. If that is the case, use the public NumPy API. If not, you are using NumPy internals. If you would still like to access an internal attribute, use numpy._core.multiarray.
  duckdb.create_function(

 

With the emoji function registered, we can now use it in a SQL query as if it were a DuckDB SQL function.

duckdb.sql(
    """
    SELECT *, emojify(species) AS emoji
    FROM 'seattle_pets_updates.parquet'
    USING SAMPLE 10
    """
)

 

复制代码
┌────────────┬──────────┬─────────┬──────────────────────────────────────────────┬─────────┐
│ issue_date │ pet_name │ species │                    breed                     │  emoji  │
│    date    │ varchar  │ varchar │                   varchar                    │ varchar │
├────────────┼──────────┼─────────┼──────────────────────────────────────────────┼─────────┤
│ 2022-08-21 │ Bruce    │ Dog     │ Lhasa Apso                                   │ 🐕      │
│ 2022-07-01 │ Nico     │ Dog     │ Terrier, Fox, Wire                           │ 🐕      │
│ 2022-11-08 │ Heera    │ Cat     │ Mix                                          │ 🐈      │
│ 2023-07-24 │ Daisy    │ Dog     │ Mixed Breed, Large (over 44 lbs fully grown) │ 🐕      │
│ 2023-08-07 │ Ziva     │ Dog     │ Poodle, Standard                             │ 🐕      │
│ 2023-04-13 │ Winston  │ Dog     │ Welsh Corgi, Pembroke                        │ 🐕      │
│ 2023-01-12 │ Jamie    │ Cat     │ American Shorthair                           │ 🐈      │
│ 2022-04-27 │ Aurora   │ Cat     │ Domestic Shorthair                           │ 🐈      │
│ 2022-04-19 │ Josie    │ Cat     │ Domestic Longhair                            │ 🐈      │
│ 2023-02-12 │ Pepper   │ Dog     │ Terrier                                      │ 🐕      │
├────────────┴──────────┴─────────┴──────────────────────────────────────────────┴─────────┤
│ 10 rows                                                                        5 columns │
└──────────────────────────────────────────────────────────────────────────────────────────┘
复制代码

In order to remove the UDF from the database, we just call the remove_function() method:

duckdb.remove_function("emojify")

 

duckdb.create_function("emojify", emojify, [str], str)

 

复制代码
def emojify(species: str) -> str:
    """Converts a string into a single emoji."""
    emoji_str = emoji.emojize(f":{species.lower()}:")
    if emoji.is_emoji(emoji_str):
        return emoji_str
    return None

duckdb.remove_function("emojify")
duckdb.create_function("emojify", emojify)
复制代码

 

 

from duckdb import ConversionException

try:
    duckdb.execute("SELECT '5,000'::INTEGER").fetchall()
except ConversionException as error:
    print(error)
    # handle exception...

 

Conversion Error: Could not convert string '5,000' to INT32
LINE 1: SELECT '5,000'::INTEGER
                      ^

 

from duckdb import CatalogException

try:
    duckdb.sql("SELECT * from imaginary_table")
except CatalogException as error:
    print(error)
    # handle exception...

 

Catalog Error: Table with name imaginary_table does not exist!
Did you mean "pragma_database_list"?

 

posted on   ZhangZhihuiAAA  阅读(4)  评论(0编辑  收藏  举报
相关博文:
阅读排行:
· 震惊!C++程序真的从main开始吗?99%的程序员都答错了
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理
· 单元测试从入门到精通
· 上周热点回顾(3.3-3.9)
· winform 绘制太阳,地球,月球 运作规律
历史上的今天:
2024-01-15 Redis - Serialize and Deserialize
 
点击右上角即可分享
微信分享提示