BigQuery vectorized Python UDFs with Arrow
Enable vectorized Python UDFs in BigQuery with Apache Arrow RecordBatch for faster batch processing.

This guide is for BigQuery developers who want to use the new vectorized Python UDF path announced in the BigQuery release notes and the python-bigquery GitHub repo to process data in batches instead of row by row.
After you follow the steps, you will have a working Python UDF that accepts Apache Arrow RecordBatch input, a query that calls it from SQL, and a simple way to verify that the function is running in batch mode.
Before you start
- A Google Cloud project with BigQuery enabled
- Billing enabled on the project
- BigQuery Studio access or permission to run SQL jobs
- A Cloud resource connection for BigQuery Python UDFs
- Python 3.10+ if you plan to test logic locally
- Apache Arrow 14+ in your local test environment
- Google Cloud CLI 470+ if you want to authenticate from a terminal
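If you plan to test locally, a quick check of the environment can save time later. The sketch below compares the local interpreter and the installed `pyarrow` package against the minimums listed above. The helper names `parse_major` and `check_local_env` are hypothetical and exist only for this illustration.

```python
import sys
from importlib import metadata


def parse_major(version: str) -> int:
    """Return the major component of a dotted version string such as '14.0.2'."""
    return int(version.split(".")[0])


def check_local_env(min_python=(3, 10), min_pyarrow_major=14) -> list:
    """Collect human-readable problems with the local test environment."""
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    try:
        installed = metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        problems.append("pyarrow is not installed")
    else:
        if parse_major(installed) < min_pyarrow_major:
            problems.append(
                f"pyarrow {min_pyarrow_major}+ required, found {installed}"
            )
    return problems


if __name__ == "__main__":
    issues = check_local_env()
    print("Environment looks fine" if not issues else "\n".join(issues))
```

An empty list means the local prerequisites are met. This only covers the local test setup, not the project, billing, or connection requirements.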
Step 1: Confirm the vectorized UDF feature
Your goal is to confirm that you are working from the release note that announces general availability (GA) of Python UDFs, and that your project is ready to use the GA Python UDF path with Arrow RecordBatch input.

Open the BigQuery release notes page and look for the entry that says Python UDFs are generally available, then note that the new vectorized path uses the Apache Arrow RecordBatch interface for improved performance.
Verify that you can access BigQuery and that your project can run SQL jobs. You should see the Python UDF GA note in the release notes and be able to open the BigQuery query editor.
Step 2: Create a Cloud resource connection
Your goal is to give BigQuery a secure way to reach external Python dependencies or services when your UDF needs them.

In BigQuery, create or reuse a Cloud resource connection in the same region as your dataset. Then grant the connection the permissions required for the UDF to run.
-- Example only: create a connection in your target region using the BigQuery CLI or Console workflow.
Verify the connection exists and is usable. You should see the connection listed in BigQuery, and its service account should have the expected IAM roles.
Step 3: Write an Arrow RecordBatch Python UDF
Your goal is to define a UDF that processes input in batches rather than one row at a time, which is the core performance benefit called out in the release note.
Write your Python logic so it accepts an Apache Arrow RecordBatch, transforms the batch, and returns the result in the format BigQuery expects. Keep the function small at first so you can validate the data flow before adding third-party libraries.
CREATE OR REPLACE FUNCTION `my_dataset.normalize_text_batch`(input STRING)
RETURNS STRING
LANGUAGE PYTHON
OPTIONS (
  runtime_version = 'python-3.11',
  entry_point = 'normalize_text_batch',
  packages = ['pyarrow']
)
AS r'''
import pyarrow as pa

def normalize_text_batch(batch):
    # Batch-oriented logic goes here.
    # Placeholder: this returns the input batch unchanged.
    return batch
''';
Verify the function saves successfully. You should see the UDF appear in your dataset, and the editor should accept the Python runtime and package options.
Step 4: Call the UDF from SQL
Your goal is to run the UDF against real table data so you can confirm that BigQuery is invoking the batch-oriented Python code inside a query.
Use a SELECT statement over a small sample table first, then expand to a larger dataset once the function returns the expected values.
SELECT
  my_dataset.normalize_text_batch(col) AS normalized_value
FROM my_dataset.sample_table
LIMIT 100;
Verify the query completes and returns transformed rows. You should see the normalized output in the result grid and no Python runtime errors in the job details.
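If you prefer to run the same check from a script, the following sketch builds the sample query and submits it with the `google-cloud-bigquery` client. It assumes that library is installed and that application default credentials are configured. The helper names `build_sample_query` and `run_sample_query` are hypothetical, and the dataset, table, and column names are the placeholders used above.

```python
import re

# Conservative pattern for unquoted identifiers; anything else is rejected.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_sample_query(dataset, function, table, column, limit=100):
    """Build the sample SELECT that calls the UDF over a limited number of rows."""
    for name in (dataset, function, table, column):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"SELECT {dataset}.{function}({column}) AS normalized_value\n"
        f"FROM {dataset}.{table}\n"
        f"LIMIT {int(limit)}"
    )


def run_sample_query(sql):
    """Run the query and return the rows plus the job's elapsed time."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # relies on application default credentials
    job = client.query(sql)
    rows = list(job.result())  # raises if the job failed
    return rows, job.ended - job.started


sql = build_sample_query(
    "my_dataset", "normalize_text_batch", "sample_table", "col"
)
print(sql)
```

Calling `run_sample_query(sql)` submits a real job and may incur cost, so it is left out of the snippet above.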
Step 5: Validate batch behavior and performance
Your goal is to confirm that the vectorized path is actually helping, not just running successfully.
Compare the query job details for the UDF version against a row-by-row baseline. Watch for lower per-row overhead, fewer Python invocations, or reduced elapsed time on the same input size. If your logic is CPU-heavy, test with a larger sample to make the batch advantage easier to see.
Verify the query still returns the same results as the baseline version. You should see the same output values, but with better execution characteristics when the batch path is used.
| Metric | Before/Baseline | After/Result |
|---|---|---|
| Python execution style | Row-by-row UDF | Vectorized Arrow RecordBatch UDF |
| Data transfer unit | Single rows | Batches |
| Expected overhead | Higher per-row overhead | Lower per-row overhead |
Common mistakes
- Using the wrong region for the connection or dataset. Fix: create the UDF, dataset, and connection in the same location.
- Installing packages that are not compatible with the chosen Python runtime. Fix: pin versions that work with the runtime you selected and test the import locally first.
- Expecting speedups on tiny queries. Fix: benchmark on a larger table or a repeated workload so the batch-processing benefit is visible.
What's next
Once the basic UDF works, extend it with approved PyPI packages, add error handling, and compare it with a SQL-native transformation so you can choose the best path for each workload.