Wrapping DuckDB Python API
When you import duckdb, there are two main concepts that you have to handle:
A DuckDBPyConnection represents a connection to a DuckDB database in a file or memory
DuckDB API has no concept of a Database, only of a connection that may envolve different attached databases (or catalogs) and schemas.
A Catalog is just a holder for schemas, and schemas hold catalog entries, like tables, views, functions, types, etc..
We will change this behaviour in a bit…
A DuckDBPyRelation represents a query. It is a table-like object that will be lazy executed and can be queried further.
Once it’s executed then yes, it contains the result set.
But when further projections are made on it, that result set is not used, the projections are just done on top of the original query as a subquery
duckdb.table('tbl').sql("select a")
Really becomes, in essence: select a from (select * from tbl)
db = duckdb.connect ()
db.sql(f"FROM duckdb_tables()" )
0 rows x 16 cols DuckDBPyRelation
type (db.sql(f"SELECT table_name FROM duckdb_tables()" ))
duckdb.duckdb.DuckDBPyRelation
Improving Connection
We want the developer to understand the connection as a database.
import duckdb
conn = duckdb.connect ()
becomes
from fastduck import database
db = database()
By acessing the same Python API using fastduck, the developer shall get some niceties.
PyCapsule.connect
*connect(database: str = ‘:memory:’, read_only: bool = False, config: dict = None) -> duckdb.DuckDBPyConnection
Create a DuckDB database instance. Can take a database file name to read/write persistent data and a read_only flag if no changes are desired*
Let’s start by simplifying the access to some information on the current catalog and schema in the connection.
db = database('../data/chinook.duckdb' , ['httpfs' , 'spatial' , 'sqlite' ])
db.sql(f"FROM duckdb_extensions()" ).filter ('loaded = false' )
Installing and loading extensions: 100%|██████████| 3/3 [00:00<00:00, 2610.56it/s, Loaded: sqlite (0.00s)]
arrow
False
False
A zero-copy data integration between Apache Arrow and DuckDB
[]
None
autocomplete
False
False
Adds support for autocomplete in the shell
[]
None
aws
False
False
Provides features that depend on the AWS SDK
[]
None
...
...
...
...
...
...
...
...
...
substrait
False
False
Adds support for the Substrait integration
[]
None
vss
False
False
Adds indexing support to accelerate Vector Similarity Search
[]
None
15 rows x 9 cols DuckDBPyRelation
DuckDBPyConnection.q
DuckDBPyConnection.q (*args, **kwargs)
Run a query and return the result as a list of records
DuckDBPyRelation.to_list
DuckDBPyRelation.to_list ()
The relation as a list
DuckDBPyRelation.to_recs
DuckDBPyRelation.to_recs ()
The relation as a list of records
DuckDBPyRelation.__getitem__
DuckDBPyRelation.__getitem__ (idxs)
idxs
Returns
DuckDBPyRelation
selecting by passing a list of column names
DuckDBPyRelation.name
DuckDBPyRelation.name ()
DuckDBPyConnection.schema
DuckDBPyConnection.schema ()
DuckDBPyConnection.catalog
DuckDBPyConnection.catalog ()
album = db.sql("select * from Album" )
album['AlbumId' , 'Title' ].limit(5 )
1
For Those About To Rock We Salute You
2
Balls to the Wall
3
Restless and Wild
4
Let There Be Rock
5
Big Ones
5 rows x 2 cols DuckDBPyRelation
db.table('Album' )['AlbumId' , 'Title' ].limit(2 ).to_recs()
[{'AlbumId': 1, 'Title': 'For Those About To Rock We Salute You'},
{'AlbumId': 2, 'Title': 'Balls to the Wall'}]
We also need a way to know the tables in the database.
DuckDBPyConnection.schemas
DuckDBPyConnection.schemas ()
Returns the schemas in the database
DuckDBPyConnection.base_tables
DuckDBPyConnection.base_tables ()
Returns the base tables in current schema
DuckDBPyConnection.views
DuckDBPyConnection.views ()
Returns the views in current schema
DuckDBPyConnection.tables
DuckDBPyConnection.tables ()
Returns the tables in the database
1 rows x 1 cols DuckDBPyRelation
chinook
main
a_view
VIEW
Just a view
chinook
main
temp_view
VIEW
None
2 rows x 5 cols DuckDBPyRelation
The functions bellow add some utilities that are useful for working with tables and views in a database.
DuckDBPyConnection.datamodel
DuckDBPyConnection.datamodel (table_name:str)
Returns the data model of a table or view. The columns names, types, nullable status, default value and primary key status.
[{'name': 'ArtistId',
'type': 'INTEGER',
'nullable': False,
'default': None,
'pk': True},
{'name': 'Name',
'type': 'VARCHAR',
'nullable': True,
'default': None,
'pk': False}]
DuckDBPyConnection.dataclass
DuckDBPyConnection.dataclass (table_name:str, pref='', suf='',
cls_name:str=None)
Creates a dataclass type from a table or view in the database.
table_name
str
table or view name
pref
str
prefix to add to the field names
suf
str
suffix to add to the field names
cls_name
str
None
defaults to table_name
Returns
type
convertTypes
convertTypes (s:str)
Convert DuckDB types to Python and Numpy types
artist_dc = db.dataclass('Artist' )
src = dataclass_src(artist_dc)
hl_md(src, 'python' ) # fix error in nbdev_prepare
@dataclass
class Artist:
ArtistId: int32 = None
Name: str | None = None
acdc = db.sql(f"select * from artist where artist.Name like 'AC/%'" ).df().to_dict(orient= 'records' )
acdc
[{'ArtistId': 1, 'Name': 'AC/DC'}]
acdc_object = artist_dc(** acdc[0 ])
acdc_object
Artist(ArtistId=1, Name='AC/DC')
Relation utilities
Once we know that a certain DuckDBPyRelation is a table (or view), we can also make it keep some valuable props.
DuckDBPyConnection.table
DuckDBPyConnection.table (name:str, schema:str=None, catalog:str=None)
DuckDBPyRelation.rel
DuckDBPyRelation.rel ()
DuckDBPyRelation.model
DuckDBPyRelation.model ()
DuckDBPyRelation.cls
DuckDBPyRelation.cls ()
create_prop
create_prop (c, name, f)
DuckDBPyRelation.__dir__
DuckDBPyRelation.__dir__ ()
Default dir() implementation.
create_patch_property
create_patch_property (name)
custom_dir
custom_dir (c, add)
DuckDBPyRelation.c
DuckDBPyRelation.c ()
Column autocomplete
DuckDBPyConnection (chinook_main)
Autocomplete
We want an easy access to table information.
like db.tables should work similar to the .table cli command.
DuckDBPyConnection.__repr__
DuckDBPyConnection.__repr__ ()
Return repr(self).
DuckDBPyConnection.shh
DuckDBPyConnection.shh ()
DuckDBPyConnection.fns
DuckDBPyConnection.fns ()
DuckDBPyConnection.v
DuckDBPyConnection.v ()
Autocomplete functonality for views
DuckDBPyConnection.t
DuckDBPyConnection.t ()
Autocomplete functonality for tables
DuckDBPyConnection.s
DuckDBPyConnection.s ()
Autocomplete functonality for schemas
DuckDBPyConnection.get_schema
DuckDBPyConnection.get_schema (catalog_schema:str)
DuckDBPyConnection.use
DuckDBPyConnection.use (catalog_schema:str=None, catalog:str=None,
schema=None)
noop
noop (*args, **kwargs)
:::
:::
a = db.t.Album.c['AlbumId' , 'Title' ].limit(4 )
a
1
For Those About To Rock We Salute You
2
Balls to the Wall
3
Restless and Wild
4
Let There Be Rock
4 rows x 2 cols DuckDBPyRelation chinook.main.Album
Let’s also improve the representation of Relations.
db.s.chinook_main.t.Album
1
For Those About To Rock We Salute You
1
2
Balls to the Wall
2
3
Restless and Wild
2
...
...
...
346
Mozart: Chamber Music
274
347
Koyaanisqatsi (Soundtrack from the Motion Picture)
275
347 rows x 3 cols DuckDBPyRelation BASE_TABLE chinook.main.Album
Album table
DuckDBPyRelation._repr_markdown_??
Signature: DuckDBPyRelation._repr_markdown_(self: 'DuckDBPyRelation')
Docstring: <no docstring>
Source:
@patch
def _repr_markdown_(self: DuckDBPyRelation):
markdown = f"#### {self.__repr__()} "
if self.meta and self.meta['comment']: markdown += f"> {self.meta['comment']}\n\n"
df = self.df()
if self.shape[0] > 5:
head = df.head(3)
tail = df.tail(2)
ellipsis = pd.DataFrame([["..."] * df.shape[1]], columns=df.columns)
df = pd.concat([head, ellipsis, tail])
markdown += df.to_markdown(index=False, tablefmt="pipe")
markdown += f"\n\n {self.shape[0]} rows x {self.shape[1]} cols "
return markdown
File: /var/folders/gm/fbvqv1z96zzb73768sgn_jp80000gn/T/ipykernel_70680/1718415001.py
Type: function
print (a.df().to_markdown(index= False ))
| AlbumId | Title |
|----------:|:--------------------------------------|
| 1 | For Those About To Rock We Salute You |
| 2 | Balls to the Wall |
| 3 | Restless and Wild |
| 4 | Let There Be Rock |
DuckDBPyRelation.__repr__
DuckDBPyRelation.__repr__ ()
Return repr(self).
DuckDBPyRelation.__str__
DuckDBPyRelation.__str__ ()
Return str(self).
┌─────────┬─────────┬───────────────┬────────────┬─────────────┐
│ catalog │ schema │ name │ type │ comment │
│ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────┼─────────┼───────────────┼────────────┼─────────────┤
│ chinook │ main │ Album │ BASE TABLE │ Album table │
│ chinook │ main │ Artist │ BASE TABLE │ NULL │
│ chinook │ main │ Customer │ BASE TABLE │ NULL │
│ chinook │ main │ Employee │ BASE TABLE │ NULL │
│ chinook │ main │ Genre │ BASE TABLE │ NULL │
│ chinook │ main │ Invoice │ BASE TABLE │ NULL │
│ chinook │ main │ InvoiceLine │ BASE TABLE │ NULL │
│ chinook │ main │ MediaType │ BASE TABLE │ NULL │
│ chinook │ main │ Playlist │ BASE TABLE │ NULL │
│ chinook │ main │ PlaylistTrack │ BASE TABLE │ NULL │
│ chinook │ main │ Track │ BASE TABLE │ NULL │
│ chinook │ main │ a_view │ VIEW │ Just a view │
│ chinook │ main │ temp_view │ VIEW │ NULL │
├─────────┴─────────┴───────────────┴────────────┴─────────────┤
│ 13 rows 5 columns │
└──────────────────────────────────────────────────────────────┘
1
For Those About To Rock (We Salute You)
1
1
1
Angus Young, Malcolm Young, Brian Johnson
343719
11170334
0.98
2
Balls to the Wall
2
2
1
None
342562
5510424
0.98
3
Fast As a Shark
3
2
1
F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman
230619
3990994
0.98
...
...
...
...
...
...
...
...
...
3502
Quintet for Horn, Violin, 2 Violas, and Cello in E Flat Major, K. 407/386c: III. Allegro
346
2
24
Wolfgang Amadeus Mozart
221331
3665114
0.98
3503
Koyaanisqatsi
347
2
10
Philip Glass
206005
3305164
0.98
3503 rows x 9 cols DuckDBPyRelation VIEW chinook.main.a_view
Just a view
db.sql('select * from Album' )
1
For Those About To Rock We Salute You
1
2
Balls to the Wall
2
3
Restless and Wild
2
...
...
...
346
Mozart: Chamber Music
274
347
Koyaanisqatsi (Soundtrack from the Motion Picture)
275
347 rows x 3 cols DuckDBPyRelation
1
Rock
2
Jazz
3
Metal
...
...
24
Classical
25
Opera
25 rows x 2 cols DuckDBPyRelation BASE_TABLE chinook.main.Genre
1
For Those About To Rock We Salute You
2
Balls to the Wall
3
Restless and Wild
4
Let There Be Rock
4 rows x 2 cols DuckDBPyRelation chinook.main.Album
Replacement Scans
You may be asking yourself why I am patching DuckDBPyRelation and DuckDBPyConnection instead of subclassing them. The problem is that these classes do not allow subclassing. They do not implement __init__.
We could have create our own classes like Database and Table and just wrap DuckDBPy objects. But then we would loose a very nice feature of the PyRelation objects…..
replacement scans .
db.sql("select * from a" )
1
For Those About To Rock We Salute You
1
2
Balls to the Wall
2
3
Restless and Wild
2
...
...
...
346
Mozart: Chamber Music
274
347
Koyaanisqatsi (Soundtrack from the Motion Picture)
275
347 rows x 3 cols DuckDBPyRelation
I did not had to use a f-string and pass the variable. DuckDBPy objects (as well as Pandas and Polars Dataframes, Arrow tables, and Datasets) are replaced in the query automagically.
db.sql(f"select * from { a} " )
1
For Those About To Rock We Salute You
1
2
Balls to the Wall
2
3
Restless and Wild
2
...
...
...
346
Mozart: Chamber Music
274
347
Koyaanisqatsi (Soundtrack from the Motion Picture)
275
347 rows x 3 cols DuckDBPyRelation
chinook
main
Album
BASE TABLE
Album table
chinook
main
Artist
BASE TABLE
None
chinook
main
Customer
BASE TABLE
None
...
...
...
...
...
chinook
main
a_view
VIEW
Just a view
chinook
main
temp_view
VIEW
None
13 rows x 5 cols DuckDBPyRelation
Database management
DuckDBPyConnection.attach
InvalidPathError
Common base class for all non-exit exceptions.
RemoteSqliteError
Common base class for all non-exit exceptions.
@patch
def detach(self : DuckDBPyConnection, catalog_name) -> None :
if catalog_name in self .schemas.to_list(): self .sql(f"DETACH { catalog_name. split('_' )[0 ]} " )
db.attach('../data/chinook.sqlite' , catalog_name= 'lite' )
2 rows x 1 cols DuckDBPyRelation
DuckDBPyConnection.drop
DuckDBPyConnection.drop (pattern:str)
Drop a table or view
DuckDBPyConnection.__contains__
DuckDBPyConnection.__contains__ (name:str)
find_matches
find_matches (pattern:str, items:List[str])
┌─────────┬─────────┬───────────────┬────────────┬─────────────┐
│ catalog │ schema │ name │ type │ comment │
│ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────┼─────────┼───────────────┼────────────┼─────────────┤
│ chinook │ main │ Album │ BASE TABLE │ Album table │
│ chinook │ main │ Artist │ BASE TABLE │ NULL │
│ chinook │ main │ Customer │ BASE TABLE │ NULL │
│ chinook │ main │ Employee │ BASE TABLE │ NULL │
│ chinook │ main │ Genre │ BASE TABLE │ NULL │
│ chinook │ main │ Invoice │ BASE TABLE │ NULL │
│ chinook │ main │ InvoiceLine │ BASE TABLE │ NULL │
│ chinook │ main │ MediaType │ BASE TABLE │ NULL │
│ chinook │ main │ Playlist │ BASE TABLE │ NULL │
│ chinook │ main │ PlaylistTrack │ BASE TABLE │ NULL │
│ chinook │ main │ Track │ BASE TABLE │ NULL │
│ chinook │ main │ a_view │ VIEW │ Just a view │
│ chinook │ main │ temp_view │ VIEW │ NULL │
├─────────┴─────────┴───────────────┴────────────┴─────────────┤
│ 13 rows 5 columns │
└──────────────────────────────────────────────────────────────┘
DuckDBPyConnection.create_view
DuckDBPyConnection.create_view (fileglob:str,
view_name:Optional[str]=None, filetype:Op
tional[Literal['csv','xlsx','json','parqu
et','sqlite']]=None, replace:bool=False,
as_name:Optional[str]=None, *args,
**kwargs)
Create a view from a file
fileglob
str
file path or glob
view_name
Optional[str]
None
view name
filetype
Optional[Literal[‘csv’, ‘xlsx’, ‘json’, ‘parquet’, ‘sqlite’]]
None
file type
replace
bool
False
replace existing view
as_name
Optional[str]
None
args
kwargs
DuckDBPyConnection.create_table
DuckDBPyConnection.create_table (fileglob:str,
table_name:Optional[str]=None, filetype:
Optional[Literal['csv','xlsx','json','pa
rquet','sqlite']]=None,
as_name:Optional[str]=None,
replace:bool=False, *args, **kwargs)
Create a table from a file
fileglob
str
file path or glob
table_name
Optional[str]
None
table name
filetype
Optional[Literal[‘csv’, ‘xlsx’, ‘json’, ‘parquet’, ‘sqlite’]]
None
file type
as_name
Optional[str]
None
replace
bool
False
replace existing table
args
kwargs
db.create_table('../data/username.latin1.csv' , 'latin' , replace= True )
db.tables
chinook
main
Album
BASE TABLE
Album table
chinook
main
Artist
BASE TABLE
None
chinook
main
Customer
BASE TABLE
None
...
...
...
...
...
chinook
main
a_view
VIEW
Just a view
chinook
main
temp_view
VIEW
None
14 rows x 5 cols DuckDBPyRelation
db.create_view('../data/username.latin1.csv' , 'latin1view' , replace= True )
db.tables.filter (f"name like '%latin%'" )
# db.sql("select distinct database_name, schema_name, table_name, column_name from duckdb_columns()")
# db.sql("attach database '../data/chinook.sqlite' as sqlite")
chinook
main
latin
BASE TABLE
None
chinook
main
latin1view
BASE TABLE
None
2 rows x 5 cols DuckDBPyRelation
db.drop('latin1view' )
db.drop('latin' )
# from fastcore.test import test_fail
# test_fail(db.drop, 'banana') # fix error in nbdev_prepare
# db.create_table('https://jsonplaceholder.typicode.com/todos/', 'todos', filetype='json', replace=True)
# db.create_table('https://huggingface.co/datasets/ibm/duorc/resolve/refs%2Fconvert%2Fparquet/ParaphraseRC/test/0000.parquet', 'hf_movies')
# db.t.hf_movies.limit(2)
db.create_table('../data/example.xlsx' )
sheet = db.t.example
sheet
1
1.3
a
2
1.3
bn
3
1.3
v
4
1.3
sgf
5
1.3
asd
5 rows x 3 cols DuckDBPyRelation BASE_TABLE chinook.main.example
db.create_table('../data/chinook.sqlite' , filetype= 'sqlite' , table_name= 'Customer' , as_name= 'sqlite_Customer' )
1
Luís
Gonçalves
Embraer - Empresa Brasileira de Aeronáutica S.A.
Av. Brigadeiro Faria Lima, 2170
São José dos Campos
SP
Brazil
12227-000
+55 (12) 3923-5555
+55 (12) 3923-5566
luisg@embraer.com.br
3
2
Leonie
Köhler
None
Theodor-Heuss-Straße 34
Stuttgart
None
Germany
70174
+49 0711 2842222
None
leonekohler@surfeu.de
5
3
François
Tremblay
None
1498 rue Bélanger
Montréal
QC
Canada
H2G 1A7
+1 (514) 721-4711
None
ftremblay@gmail.com
3
...
...
...
...
...
...
...
...
...
...
...
...
...
58
Manoj
Pareek
None
12,Community Centre
Delhi
None
India
110017
+91 0124 39883988
None
manoj.pareek@rediff.com
3
59
Puja
Srivastava
None
3,Raj Bhavan Road
Bangalore
None
India
560001
+91 080 22289999
None
puja_srivastava@yahoo.in
3
59 rows x 13 cols DuckDBPyRelation BASE_TABLE chinook.main.sqlite_Customer
db.drop('sqlite_Customer' )
DuckDBPyConnection.import_from
DuckDBPyConnection.import_from (filepath=None, pre='', suf='',
schema=None, replace=None)
┌─────────┬─────────┬───────────────┬────────────┬─────────────┐
│ catalog │ schema │ name │ type │ comment │
│ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────┼─────────┼───────────────┼────────────┼─────────────┤
│ chinook │ main │ Album │ BASE TABLE │ Album table │
│ chinook │ main │ Artist │ BASE TABLE │ NULL │
│ chinook │ main │ Customer │ BASE TABLE │ NULL │
│ chinook │ main │ Employee │ BASE TABLE │ NULL │
│ chinook │ main │ Genre │ BASE TABLE │ NULL │
│ chinook │ main │ Invoice │ BASE TABLE │ NULL │
│ chinook │ main │ InvoiceLine │ BASE TABLE │ NULL │
│ chinook │ main │ MediaType │ BASE TABLE │ NULL │
│ chinook │ main │ Playlist │ BASE TABLE │ NULL │
│ chinook │ main │ PlaylistTrack │ BASE TABLE │ NULL │
│ chinook │ main │ Track │ BASE TABLE │ NULL │
│ chinook │ main │ a_view │ VIEW │ Just a view │
│ chinook │ main │ temp_view │ VIEW │ NULL │
├─────────┴─────────┴───────────────┴────────────┴─────────────┤
│ 13 rows 5 columns │
└──────────────────────────────────────────────────────────────┘
db.import_from('../data/chinook.sqlite' , schema = 'cebola' , replace= True )
┌─────────┬─────────┬───────────────┬────────────┬─────────────┐
│ catalog │ schema │ name │ type │ comment │
│ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────┼─────────┼───────────────┼────────────┼─────────────┤
│ chinook │ cebola │ Album │ BASE TABLE │ NULL │
│ chinook │ main │ Album │ BASE TABLE │ Album table │
│ chinook │ cebola │ Artist │ BASE TABLE │ NULL │
│ chinook │ main │ Artist │ BASE TABLE │ NULL │
│ chinook │ main │ Customer │ BASE TABLE │ NULL │
│ chinook │ cebola │ Customer │ BASE TABLE │ NULL │
│ chinook │ main │ Employee │ BASE TABLE │ NULL │
│ chinook │ cebola │ Employee │ BASE TABLE │ NULL │
│ chinook │ cebola │ Genre │ BASE TABLE │ NULL │
│ chinook │ main │ Genre │ BASE TABLE │ NULL │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ chinook │ main │ MediaType │ BASE TABLE │ NULL │
│ chinook │ cebola │ MediaType │ BASE TABLE │ NULL │
│ chinook │ cebola │ Playlist │ BASE TABLE │ NULL │
│ chinook │ main │ Playlist │ BASE TABLE │ NULL │
│ chinook │ cebola │ PlaylistTrack │ BASE TABLE │ NULL │
│ chinook │ main │ PlaylistTrack │ BASE TABLE │ NULL │
│ chinook │ main │ Track │ BASE TABLE │ NULL │
│ chinook │ cebola │ Track │ BASE TABLE │ NULL │
│ chinook │ main │ a_view │ VIEW │ Just a view │
│ chinook │ main │ temp_view │ VIEW │ NULL │
├─────────┴─────────┴───────────────┴────────────┴─────────────┤
│ 24 rows (20 shown) 5 columns │
└──────────────────────────────────────────────────────────────┘
db.drop('cebola.*' )
db.detach('lite_main' )
chinook
main
Album
BASE TABLE
Album table
chinook
main
Artist
BASE TABLE
None
chinook
main
Customer
BASE TABLE
None
...
...
...
...
...
chinook
main
a_view
VIEW
Just a view
chinook
main
temp_view
VIEW
None
13 rows x 5 cols DuckDBPyRelation