BigQuery vectorized Python UDFs with Arrow

OraCore Editors

[TOOLS] June 21, 2026 | 6 min read | OraCore Editors

Enable vectorized Python UDFs in BigQuery with Apache Arrow RecordBatch for faster batch processing.

This guide is for BigQuery developers who want to use the vectorized Python UDF path, announced in the BigQuery release notes and the python-bigquery GitHub repo, to process data in batches instead of row by row.

After you follow the steps, you will have a working Python UDF that accepts Apache Arrow RecordBatch input, a query that calls it from SQL, and a simple way to verify that the function is running in batch mode.

Before you start

A Google Cloud project with BigQuery enabled
Billing enabled on the project
BigQuery Studio access or permission to run SQL jobs
A Cloud resource connection for BigQuery Python UDFs
Python 3.10+ if you plan to test logic locally
Apache Arrow 14+ in your local test environment
Google Cloud CLI 470+ if you want to authenticate from a terminal

Step 1: Confirm the vectorized UDF feature

Your goal is to make sure the release note you are targeting is the right one, and that your project is ready to use the GA Python UDF path with Arrow RecordBatch input.

Open the BigQuery release notes page and look for the entry that says Python UDFs are generally available, then note that the new vectorized path uses the Apache Arrow RecordBatch interface for improved performance.

Verify that you can access BigQuery and that your project can run SQL jobs. You should see the Python UDF GA note in the release notes and be able to open the BigQuery query editor.

Step 2: Create a Cloud resource connection

Your goal is to give BigQuery a secure way to reach external Python dependencies or services when your UDF needs them.

In BigQuery, create or reuse a Cloud resource connection in the same region as your dataset. Then grant the connection's service account the IAM roles required for the UDF to run.

-- Example only: create a connection in your target region using the BigQuery CLI or Console workflow.

Verify the connection exists and is usable. You should see the connection listed in BigQuery and its service account should have the expected IAM roles.
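
The create-and-verify workflow above can also be scripted. The sketch below only assembles the `bq` command lines as Python lists and prints them; it assumes the Google Cloud CLI is installed and authenticated, and the project, region, and connection names are hypothetical placeholders.

```python
import shlex


def connection_commands(project_id: str, location: str, connection_id: str) -> list[list[str]]:
    """Build the bq invocations that create and then inspect a Cloud resource connection."""
    create = [
        "bq", "mk", "--connection",
        f"--location={location}",
        f"--project_id={project_id}",
        "--connection_type=CLOUD_RESOURCE",
        connection_id,
    ]
    # `bq show --connection` prints the connection details, including the
    # service account that needs the IAM roles.
    show = ["bq", "show", "--connection", f"{project_id}.{location}.{connection_id}"]
    return [create, show]


# Hypothetical placeholder values; replace them with your own.
commands = connection_commands("my-project", "us", "my-udf-connection")
for argv in commands:
    print(shlex.join(argv))
```

Keeping each command as a list of arguments makes it easy to hand to `subprocess.run` later without worrying about shell quoting.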

Step 3: Write an Arrow RecordBatch Python UDF

Your goal is to define a UDF that processes input in batches rather than one row at a time, which is the core performance benefit called out in the release note.

Write your Python logic so it accepts an Apache Arrow RecordBatch, transforms the batch, and returns the result in the format BigQuery expects. Keep the function small at first so you can validate the data flow before adding third-party libraries.

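CREATE OR REPLACE FUNCTION `my_dataset.normalize_text_batch`(input STRING)
RETURNS STRING
LANGUAGE PYTHON
OPTIONS (
  runtime_version = 'python-3.11',
  entry_point = 'normalize_text_batch',
  packages = ['pyarrow']
)
AS r'''
import pyarrow as pa
import pyarrow.compute as pc

def normalize_text_batch(batch: pa.RecordBatch) -> pa.Array:
    # Assumption: the single SQL argument arrives as column 0 of the batch,
    # and BigQuery accepts one Arrow array holding one value per input row.
    # Check the current Python UDF documentation for the exact contract.
    values = batch.column(0)
    # Trim whitespace and lowercase the whole column in one vectorized pass.
    return pc.utf8_lower(pc.utf8_trim_whitespace(values))
''';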

Verify the function saves successfully. You should see the UDF appear in your dataset and the editor should accept the Python runtime and package options.

Step 4: Call the UDF from SQL

Your goal is to run the UDF against real table data so you can confirm that BigQuery is invoking the batch-oriented Python code inside a query.

Use a SELECT statement over a small sample table first, then expand to a larger dataset once the function returns the expected values.

SELECT
  my_dataset.normalize_text_batch(col) AS normalized_value
FROM my_dataset.sample_table
LIMIT 100;

Verify the query completes and returns transformed rows. You should see the normalized output in the result grid and no Python runtime errors in the job details.

Step 5: Validate batch behavior and performance

Your goal is to confirm that the vectorized path is actually helping, not just running successfully.

Compare the query job details for the UDF version against a row-by-row baseline. Watch for lower per-row overhead, fewer Python invocations, or reduced elapsed time on the same input size. If your logic is CPU-heavy, test with a larger sample to make the batch advantage easier to see.
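
One way to make this comparison repeatable is to reduce each job to a few numbers and look at the ratios. The sketch below assumes job objects that expose `started`, `ended`, and `slot_millis`, as query jobs from the google-cloud-bigquery client do. The stand-in objects and their values are placeholders that only demonstrate the arithmetic; they are not measurements.

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


def summarize_job(job) -> dict:
    """Reduce a finished query job to the numbers worth comparing."""
    return {
        "elapsed_seconds": (job.ended - job.started).total_seconds(),
        "slot_millis": job.slot_millis,
    }


def compare_jobs(baseline, vectorized) -> dict:
    """Return vectorized/baseline ratios; values below 1.0 favor the vectorized run."""
    base, vec = summarize_job(baseline), summarize_job(vectorized)
    return {key: vec[key] / base[key] for key in base}


# Placeholder stand-ins, NOT real measurements. With the google-cloud-bigquery
# client, real jobs would come from `job = bigquery.Client().query(sql)`
# followed by `job.result()` to wait for completion.
t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
baseline = SimpleNamespace(started=t0, ended=t0 + timedelta(seconds=40), slot_millis=80_000)
vectorized = SimpleNamespace(started=t0, ended=t0 + timedelta(seconds=10), slot_millis=20_000)
ratios = compare_jobs(baseline, vectorized)
print(ratios)
```

Run both queries on the same input several times before trusting a ratio, since a single run can be skewed by caching or slot availability.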

Verify the query still returns the same results as the baseline version. You should see the same output values, but with better execution characteristics when the batch path is used.
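
Because a query without ORDER BY does not guarantee row order, the two result sets are best compared as multisets rather than as lists. A minimal sketch, using made-up rows in place of fetched query results:

```python
from collections import Counter


def same_results(baseline_rows, vectorized_rows) -> bool:
    """Compare two result sets as multisets, ignoring row order."""
    return Counter(baseline_rows) == Counter(vectorized_rows)


def diff_results(baseline_rows, vectorized_rows) -> dict:
    """Show values that appear more often on one side than the other."""
    base, vec = Counter(baseline_rows), Counter(vectorized_rows)
    return {
        "only_in_baseline": dict(base - vec),
        "only_in_vectorized": dict(vec - base),
    }


# Made-up rows standing in for the two queries' outputs.
baseline_rows = ["hello", "world", None, "hello"]
vectorized_rows = [None, "hello", "hello", "world"]
matches = same_results(baseline_rows, vectorized_rows)
mismatch = diff_results(baseline_rows, ["hello", "world", None, "HELLO"])
print(matches)
print(mismatch)
```

When the check fails, `diff_results` points at the specific values that differ, which is usually enough to spot a null-handling or casing discrepancy between the two implementations.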

Metric	Before/Baseline	After/Result
Python execution style	Row-by-row UDF	Vectorized Arrow RecordBatch UDF
Data transfer unit	Single rows	Batches
Expected overhead	Higher per-row overhead	Lower per-row overhead

Common mistakes

Using the wrong region for the connection or dataset. Fix: create the UDF, dataset, and connection in the same location.
Installing packages that are not compatible with the chosen Python runtime. Fix: pin versions that work with the runtime you selected and test the import locally first.
Expecting speedups on tiny queries. Fix: benchmark on a larger table or a repeated workload so the batch-processing benefit is visible.

What's next

Once the basic UDF works, extend it with approved PyPI packages, add error handling, and compare it with a SQL-native transformation so you can choose the best path for each workload.
