Beta Product: SQL Proxy is currently in beta. Features and APIs may change.
Overview
Jobs compute routes SQL queries to Databricks Jobs instead of SQL warehouses. This is ideal for long-running transformations, DDL operations, and workloads that benefit from dedicated compute resources.
Admin API
Configure global jobs compute settings via the admin API.
Get Configuration
GET /admin/jobs-compute
curl -X GET \
-H "Authorization: Bearer <admin-token>" \
https://sqlproxy.your-company.datafold.com/admin/jobs-compute
Update Configuration (Full Replace)
POST /admin/jobs-compute
curl -X POST \
-H "Authorization: Bearer <admin-token>" \
-H "Content-Type: application/json" \
-d '{
"enabled": true,
"max_wait_ms": 3600000,
"executor_script_path": "/dbfs/datafold/executor.py",
"default_spark_version": "14.3.x-scala2.12",
"default_node_type": "i3.xlarge",
"default_num_workers": 4
}' \
https://sqlproxy.your-company.datafold.com/admin/jobs-compute
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| enabled | boolean | Yes | Enable/disable jobs compute routing |
| max_wait_ms | integer | Yes | Maximum wait time for job completion (milliseconds) |
| executor_script_path | string | Yes | DBFS path to the executor script |
| default_spark_version | string | No | Default Databricks runtime version |
| default_node_type | string | No | Default EC2 instance type for workers |
| default_num_workers | integer | No | Default number of workers |
Update Configuration (Partial)
PUT /admin/jobs-compute
All fields are optional. Only provided fields are updated.
curl -X PUT \
-H "Authorization: Bearer <admin-token>" \
-H "Content-Type: application/json" \
-d '{"default_num_workers": 8}' \
https://sqlproxy.your-company.datafold.com/admin/jobs-compute
Routing Directives
Use @datafold:jobs_compute in SQL comments to route queries to jobs compute.
-- @datafold:jobs_compute [options]
CREATE TABLE result AS SELECT ...
Compute Types
Serverless (Default)
-- @datafold:jobs_compute
CREATE TABLE result AS SELECT * FROM source
Uses Databricks serverless compute. Simple to use, but billed by the DBU-hour.
Classic
-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge num_workers=4
CREATE TABLE result AS SELECT * FROM source
Uses a dedicated cluster, giving more control over instance types and pricing (for example, spot instances).
Directive Parameters
All parameter names match the official Databricks Jobs API new_cluster schema.
Core Configuration
| Parameter | Type | Description | Example |
|---|---|---|---|
| type | string | Compute type: serverless (default) or classic | type=classic |
Node Configuration
| Parameter | Type | Description | Example |
|---|---|---|---|
| node_type_id | string | EC2 instance type for workers | node_type_id=i3.xlarge |
| driver_node_type_id | string | EC2 instance type for driver (defaults to node_type_id) | driver_node_type_id=i3.2xlarge |
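The driver can be sized independently of the workers. A minimal sketch, with illustrative table names:

-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge driver_node_type_id=i3.2xlarge num_workers=4
CREATE TABLE wide_result AS SELECT * FROM source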
Worker Configuration
| Parameter | Type | Description | Example |
|---|---|---|---|
| num_workers | integer | Fixed number of workers | num_workers=4 |
| autoscale.min_workers | integer | Minimum workers for autoscaling | autoscale.min_workers=2 |
| autoscale.max_workers | integer | Maximum workers for autoscaling | autoscale.max_workers=10 |
Use either num_workers (fixed) or autoscale.min_workers/autoscale.max_workers (autoscaling), not both.
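A minimal autoscaling sketch, without a fixed worker count (table names are illustrative):

-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge autoscale.min_workers=2 autoscale.max_workers=10
CREATE TABLE result AS SELECT * FROM source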
Spark Configuration
| Parameter | Type | Description | Example |
|---|---|---|---|
| spark_version | string | Databricks Runtime version | spark_version=14.3.x-scala2.12 |
| spark_conf.<key> | string | Spark configuration property | spark_conf.spark.sql.shuffle.partitions=200 |
| spark_env_vars.<key> | string | Environment variable | spark_env_vars.MY_VAR=value |
| runtime_engine | string | STANDARD or PHOTON | runtime_engine=PHOTON |
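For example, a directive that pins the runtime version and sets a Spark property and an environment variable, reusing the values from the table above (table names are illustrative):

-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge num_workers=4 spark_version=14.3.x-scala2.12 spark_conf.spark.sql.shuffle.partitions=200 spark_env_vars.MY_VAR=value
CREATE TABLE result AS SELECT * FROM source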
Instance Pools
| Parameter | Type | Description | Example |
|---|---|---|---|
| instance_pool_id | string | Instance pool for workers | instance_pool_id=0123-456789-pool |
| driver_instance_pool_id | string | Instance pool for driver | driver_instance_pool_id=0123-456789-pool |
When using instance pools, node_type_id is not required as the pool defines instance types.
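A minimal sketch that draws both workers and driver from a pool; the pool ID and table names are illustrative, and node_type_id is omitted because the pool supplies it:

-- @datafold:jobs_compute type=classic instance_pool_id=0123-456789-pool driver_instance_pool_id=0123-456789-pool num_workers=4
CREATE TABLE result AS SELECT * FROM source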
Cluster Policies
| Parameter | Type | Description | Example |
|---|---|---|---|
| policy_id | string | Databricks cluster policy ID | policy_id=0123456789ABCDEF |
| apply_policy_default_values | boolean | Use policy defaults for omitted fields | apply_policy_default_values=true |
Storage
| Parameter | Type | Description | Example |
|---|---|---|---|
| enable_elastic_disk | boolean | Enable autoscaling local storage | enable_elastic_disk=true |
| enable_local_disk_encryption | boolean | Encrypt local disks | enable_local_disk_encryption=true |
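For example, enabling both storage options on a classic cluster (table names are illustrative):

-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge num_workers=4 enable_elastic_disk=true enable_local_disk_encryption=true
CREATE TABLE result AS SELECT * FROM source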
Security
| Parameter | Type | Description | Example |
|---|---|---|---|
| data_security_mode | string | NONE, SINGLE_USER, or USER_ISOLATION | data_security_mode=SINGLE_USER |
| single_user_name | string | Username for single-user mode | single_user_name=user@company.com |
Tags
| Parameter | Type | Description | Example |
|---|---|---|---|
| custom_tags.<key> | string | Resource tag key-value | custom_tags.cost_center=12345 |
Init Scripts
| Parameter | Type | Description | Example |
|---|---|---|---|
| init_scripts | string | Comma-separated paths (s3://, dbfs:/, or workspace) | init_scripts=s3://bucket/init.sh |
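A sketch that runs two init scripts before the job starts; the script paths and table names are illustrative:

-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge num_workers=2 init_scripts=s3://bucket/init.sh,dbfs:/scripts/setup.sh
CREATE TABLE result AS SELECT * FROM source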
Docker
| Parameter | Type | Description | Example |
|---|---|---|---|
| docker_image.url | string | Custom container image URL | docker_image.url=my-registry/image:tag |
| docker_image.basic_auth.username | string | Registry auth username | docker_image.basic_auth.username=user |
| docker_image.basic_auth.password | string | Registry auth password | docker_image.basic_auth.password=secret |
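For example, running the job on a custom container image; the registry URL and table names are illustrative, and registry credentials would be supplied via the basic_auth parameters above:

-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge num_workers=4 docker_image.url=my-registry/image:tag
CREATE TABLE result AS SELECT * FROM source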
AWS Attributes
All AWS-specific parameters use the aws_attributes. prefix.
| Parameter | Type | Description | Example |
|---|---|---|---|
| aws_attributes.availability | string | SPOT, ON_DEMAND, or SPOT_WITH_FALLBACK | aws_attributes.availability=SPOT |
| aws_attributes.zone_id | string | AWS availability zone | aws_attributes.zone_id=us-west-2a |
| aws_attributes.first_on_demand | integer | Number of on-demand nodes before spot | aws_attributes.first_on_demand=1 |
| aws_attributes.spot_bid_price_percent | integer | Spot bid as % of on-demand price | aws_attributes.spot_bid_price_percent=100 |
| aws_attributes.ebs_volume_type | string | EBS type: GENERAL_PURPOSE_SSD or THROUGHPUT_OPTIMIZED_HDD | aws_attributes.ebs_volume_type=GENERAL_PURPOSE_SSD |
| aws_attributes.ebs_volume_count | integer | Number of EBS volumes per node | aws_attributes.ebs_volume_count=1 |
| aws_attributes.ebs_volume_size | integer | EBS volume size in GB | aws_attributes.ebs_volume_size=100 |
| aws_attributes.ebs_volume_iops | integer | IOPS for gp3/io1/io2 volumes | aws_attributes.ebs_volume_iops=3000 |
| aws_attributes.ebs_volume_throughput | integer | Throughput in MB/s for gp3 volumes | aws_attributes.ebs_volume_throughput=125 |
| aws_attributes.instance_profile_arn | string | IAM instance profile ARN | aws_attributes.instance_profile_arn=arn:aws:iam::123:instance-profile/role |
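A sketch combining spot capacity with explicit EBS volumes, reusing the values from the table above (table names are illustrative):

-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge num_workers=4 aws_attributes.availability=SPOT_WITH_FALLBACK aws_attributes.first_on_demand=1 aws_attributes.ebs_volume_type=GENERAL_PURPOSE_SSD aws_attributes.ebs_volume_count=1 aws_attributes.ebs_volume_size=100
CREATE TABLE result AS SELECT * FROM source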
Cluster Logging
Configure Spark log delivery using the cluster_log_conf. prefix.
| Parameter | Type | Description | Example |
|---|---|---|---|
| cluster_log_conf.destination | string | S3 or DBFS path for logs | cluster_log_conf.destination=s3://bucket/logs |
| cluster_log_conf.region | string | AWS region for S3 destination | cluster_log_conf.region=us-west-2 |
| cluster_log_conf.enable_encryption | boolean | Enable S3 server-side encryption | cluster_log_conf.enable_encryption=true |
The destination type (S3 vs DBFS) is auto-detected from the path prefix (s3:// or dbfs:/).
Examples
Cost-Optimized Spot Cluster
-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge num_workers=4 aws_attributes.availability=SPOT_WITH_FALLBACK aws_attributes.spot_bid_price_percent=100
CREATE TABLE aggregated_metrics AS
SELECT date, SUM(value) as total
FROM raw_metrics
GROUP BY date
Autoscaling Cluster with Policy
-- @datafold:jobs_compute type=classic policy_id=0123456789ABCDEF apply_policy_default_values=true autoscale.min_workers=2 autoscale.max_workers=10
CREATE TABLE large_join AS
SELECT a.*, b.details
FROM table_a a
JOIN table_b b ON a.id = b.id
Photon-Enabled with Custom Spark Config
-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge num_workers=8 runtime_engine=PHOTON spark_conf.spark.sql.shuffle.partitions=400
CREATE TABLE optimized_result AS
SELECT * FROM complex_transformation
Tagged for Cost Tracking
-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge num_workers=4 custom_tags.team=analytics custom_tags.cost_center=12345 custom_tags.environment=production
CREATE TABLE team_report AS SELECT ...
Single-User Security Mode
-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge num_workers=2 data_security_mode=SINGLE_USER single_user_name=analyst@company.com
CREATE TABLE secure_output AS SELECT ...
With Cluster Logging
-- @datafold:jobs_compute type=classic node_type_id=i3.xlarge num_workers=4 cluster_log_conf.destination=s3://my-bucket/spark-logs cluster_log_conf.region=us-west-2
CREATE TABLE logged_job AS SELECT ...