Handling Larger-Than-Memory Flat Files with DuckDB and MotherDuck
Let's chat briefly about a few different ways to load flat files into DuckDB and how easy it is to move data cloud to cloud with MotherDuck, a managed cloud service built on DuckDB. We'll be using Backblaze B2, a simple and affordable alternative to AWS S3, along with MotherDuck (of course) and PyArrow.
Now, the painful way to load the data involves first downloading the flat files from blob storage, then loading them into Python with a library like pandas, and finally loading that DataFrame into DuckDB. However, that means the entire file needs to be able to fit:
1. in memory, and
2. on disk,
which in many cases is not possible due to factors like the size of the data or the cost of the available resources.
One advantage, as shown in the code below (specifically in the final conn.execute statement), is that DuckDB allows you to query a pandas DataFrame directly within a SQL statement. By comparison, in many other applications that combine SQL and Python in a single notebook environment, you typically need to create a temporary table from the DataFrame to make it accessible to the SQL engine. We'll see further down that you can do this with other objects as well.
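As a tiny, self-contained illustration of that behavior (a minimal sketch with a made-up DataFrame, not data from this pipeline), DuckDB's Python client will pick up a DataFrame in local scope by its variable name:

```python
import duckdb
import pandas as pd

# A small made-up DataFrame purely for illustration
df_touch = pd.DataFrame({"device": ["a", "b", "a"], "touches": [3, 5, 2]})

# DuckDB's replacement scan resolves the local variable df_touch by name,
# so no temporary table or explicit registration is needed
result = duckdb.sql("SELECT device, sum(touches) AS total FROM df_touch GROUP BY device").df()
print(result)
```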
```python
# Connect to Backblaze B2 (S3 API)
from b2sdk.v2 import *
import pandas as pd
import duckdb

info = InMemoryAccountInfo()
b2_api = B2Api(info, cache=AuthInfoCache(info))
application_key_id = 'yourid'
application_key = 'yourkey'
b2_api.authorize_account("production", application_key_id, application_key)

# Download files in a folder
bucket_name = 'bucket-name'
download_path = '/home/repl/workspace/b100'
bucket = b2_api.get_bucket_by_name(bucket_name)

for file_version, folder_name in bucket.ls(folder_to_list='', latest_only=True):
    print(file_version.file_name, file_version.upload_timestamp, folder_name)
    local_file_path = f"{download_path}/{file_version.file_name}"
    print(f"Downloading {file_version.file_name} to {local_file_path}...")
    downloaded_file = bucket.download_file_by_name(file_version.file_name)
    downloaded_file.save_to(local_file_path)

# Load into a pandas DataFrame
df_touch = pd.read_csv(local_file_path)

# Load into MotherDuck
md_token = 'yourtoken'

# Create connection string
connection_string = f"md:?token={md_token}"

# Connect to MotherDuck and create the table directly from the DataFrame
conn = duckdb.connect(connection_string)
conn.execute("CREATE TABLE mydb.main.t1 AS SELECT * FROM df_touch")
```
So how can we tackle these two constraints, namely 1. memory and 2. disk?
For 1. memory, we can simply use a library like PyArrow to create a lazy Dataset. This approach avoids immediately loading the entire file into memory. Instead, PyArrow constructs a metadata object representing the dataset, including information about the schema, file paths, and file formats. This lets it defer reading the data until it is actually needed (e.g., during filtering or transformation), conserving memory.
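To make that lazy behavior concrete, here is a minimal sketch (it reuses the same local file path as the snippet below, which is an assumption about your layout): building the Dataset only reads metadata, and rows are only materialized batch by batch when you scan.

```python
import pyarrow.dataset as ds

# Same local path as the snippet below; adjust to wherever your file lives
local_file_path = '/home/repl/workspace/b100/touch.txt'

# Creating the Dataset only inspects the file and infers a schema;
# no rows are read into memory at this point
lazy_dataset = ds.dataset(local_file_path, format="csv")
print(lazy_dataset.schema)

# Rows are read on demand in record batches, so peak memory is bounded by
# the batch size rather than the file size
total_rows = 0
for batch in lazy_dataset.to_batches():
    total_rows += batch.num_rows
print(f"Scanned {total_rows} rows without loading the whole file at once")
```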
In the snippet below, I then choose to connect to a local DuckDB instance and query the Dataset through it (this is optional) because it allows for greater flexibility with joining operations. Specifically, joining in PyArrow materializes the result immediately, so both tables being joined (and their joined result) need to fit in memory. DuckDB, on the other hand, has its own join implementation and so does not share this limitation. Therefore, if in the future we wanted to join this data to another large dataset, it would be trivial with DuckDB (see the join sketch after the snippet below). If you do not care about joining, then you can simply export the data to Parquet with PyArrow and load that into MotherDuck.
```python
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
import duckdb

local_file_path = '/home/repl/workspace/b100/touch.txt'

# Read the file as a PyArrow Dataset for lazy evaluation
aa_dataset = ds.dataset(local_file_path, format="csv")

# Connect to an in-process DuckDB database
local_ddb_con = duckdb.connect()

# Query the PyArrow Dataset and return the result as an Arrow table
results = local_ddb_con.execute("SELECT count(*) FROM aa_dataset").arrow()

# From here you can export to a Parquet file in chunks and load it into MotherDuck
local_ddb_con.execute("COPY (SELECT * FROM aa_dataset) TO 'output.parquet' (FORMAT PARQUET)")
```
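To illustrate the join advantage mentioned above, here is a hedged sketch: two PyArrow Datasets are picked up by variable name and joined inside DuckDB's engine. The second file (labels.csv) and the shared id and label columns are hypothetical placeholders, not files from this post.

```python
import duckdb
import pyarrow.dataset as ds

# The touch file from this post plus a hypothetical companion file that
# shares an id column; both are evaluated lazily
touch_ds = ds.dataset('/home/repl/workspace/b100/touch.txt', format="csv")
labels_ds = ds.dataset('/home/repl/workspace/b100/labels.csv', format="csv")

con = duckdb.connect()

# DuckDB streams record batches from both Arrow Datasets and performs the
# join in its own engine, rather than materializing everything via PyArrow
joined = con.execute("""
    SELECT t.*, l.label
    FROM touch_ds AS t
    JOIN labels_ds AS l USING (id)
""").arrow()
```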
BUT, you might be thinking, why not just use MotherDuck to load the data if you are using a local DuckDB anyway? 🧐 Exactly! 😅
It is actually better than that: MotherDuck can load the data directly from blob storage without going through PyArrow or Python, and without ever pulling the data down locally. Below you will see that it comes down to a single read_csv_auto query at the end of the snippet. Behind the scenes, MotherDuck uses a number of techniques to improve performance, like chunk-based processing. This approach solves both our disk and memory issues.
```sql
CREATE OR REPLACE SECRET <md-name> IN MOTHERDUCK (
    TYPE S3,
    KEY_ID 'yourkey',
    SECRET 'yoursecret',
    REGION 'us-east-00X',
    ENDPOINT 's3.us-east-00X.backblazeb2.com'
);

SELECT * FROM read_csv_auto('s3://bucket-name/touch.txt');
```
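If you want the result persisted as a MotherDuck table, and you prefer to drive it from Python like the first snippet, a minimal sketch might look like the following. It assumes the S3 secret above has already been created, and it reuses the placeholder token, database name (mydb), and table name (t1) from earlier; swap in your own.

```python
import duckdb

# Placeholder token, reused from the first snippet
md_token = 'yourtoken'
conn = duckdb.connect(f"md:?token={md_token}")

# Assuming the S3 secret above is already stored in MotherDuck, this single
# statement performs the load cloud to cloud: the data flows from Backblaze
# into MotherDuck without ever touching the local machine.
conn.execute("""
    CREATE OR REPLACE TABLE mydb.main.t1 AS
    SELECT * FROM read_csv_auto('s3://bucket-name/touch.txt')
""")
```

The same statement can of course be run straight from the MotherDuck UI, with no Python involved at all.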