Which data warehouses are supported?

BigQuery, Snowflake, Redshift, Databricks, S3, Google Cloud Storage, and Azure Blob Storage are first-class destinations with native streaming support. Any warehouse or object store reachable over HTTPS can be added via custom webhook delivery if a native connector does not yet exist.

How does delivery latency compare to batch ETL?

Events typically arrive in the warehouse within 1-10 seconds depending on destination. BigQuery streaming inserts are sub-second. Snowflake via Snowpipe Streaming is 1-2 seconds. Compared to nightly batch ETL or hourly Fivetran syncs, downstream models, dashboards, and ML pipelines run on data that is genuinely current rather than 3-24 hours stale.

How is the data structured in the warehouse?

Events land in a normalised events table with structured columns (event_id, user_id, timestamp, event_name, properties JSON, context JSON, consent flags, classification metadata). The schema is consistent across all sources (web, mobile, server) so downstream queries, dbt models, and ML feature pipelines do not need source-specific logic.
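
As a rough illustration of what a single row shape across sources can look like, the sketch below models the columns listed above as a Python dataclass. The column types, the ISO 8601 timestamp format, and the helper name `to_warehouse_record` are assumptions for illustration, not the product's actual table definition.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Dict, Optional


@dataclass
class EventRow:
    """One row of the events table; types are assumed, not documented."""
    event_id: str
    user_id: Optional[str]
    timestamp: str  # assumed ISO 8601 string
    event_name: str
    properties: Dict[str, Any] = field(default_factory=dict)
    context: Dict[str, Any] = field(default_factory=dict)
    consent: Dict[str, bool] = field(default_factory=dict)
    classification: Dict[str, Any] = field(default_factory=dict)


def to_warehouse_record(row: EventRow) -> Dict[str, Any]:
    """Keep scalar columns as-is and serialise nested fields to JSON text."""
    record = asdict(row)
    for key in ("properties", "context", "consent", "classification"):
        record[key] = json.dumps(record[key], sort_keys=True)
    return record


# A web event and a server event produce records with identical columns.
web = EventRow("evt-1", "u-42", "2024-01-01T00:00:00Z", "page_view",
               context={"source": "web"})
server = EventRow("evt-2", "u-42", "2024-01-01T00:00:05Z", "order_completed",
                  properties={"total": 99.5}, context={"source": "server"})
```

Because both records share the same keys, a downstream query can filter on `event_name` or `user_id` without branching on where the event came from.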

Are PII transformations applied before warehouse delivery?

Yes. The Org Data Layer's PII rules apply uniformly across every destination, including warehouses. By default, email addresses and phone numbers are SHA-256 hashed, IP addresses are masked, and payment card details, dates of birth, and postal addresses are stripped. You can override these rules per destination if a specific warehouse needs the raw values for compliance reasons (rare), but the secure-by-default posture means warehouse data is procurement-friendly without per-team configuration.
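
The default rules described above can be pictured with a minimal Python sketch. It is not the Org Data Layer's implementation: the field names, the trim-and-lowercase normalisation before hashing, and the /24 (IPv4) and /48 (IPv6) masking widths are all assumptions chosen for illustration.

```python
import hashlib
import ipaddress
from typing import Any, Dict

# Hypothetical field names; the real rule set is configured in the product.
HASHED_FIELDS = {"email", "phone"}
STRIPPED_FIELDS = {"card_number", "date_of_birth", "address"}


def sha256_hex(value: str) -> str:
    """Hash a value with SHA-256 after trimming and lowercasing (assumed)."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


def mask_ip(ip: str) -> str:
    """Zero the host bits: keep /24 for IPv4 and /48 for IPv6 (assumed)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def apply_pii_rules(traits: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with PII hashed, masked, or dropped."""
    out: Dict[str, Any] = {}
    for key, value in traits.items():
        if key in STRIPPED_FIELDS:
            continue  # stripped fields never reach the destination
        if key in HASHED_FIELDS:
            out[key] = sha256_hex(value)
        elif key == "ip":
            out[key] = mask_ip(value)
        else:
            out[key] = value
    return out
```

Hashing a normalised value keeps it usable as a join key across tables while the raw email or phone number stays out of the warehouse.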

Does Datafly Signal replace dbt or my data modelling layer?

No, it complements them. Datafly Signal handles the collect → enrich → deliver layer, with events landing in a clean, well-typed events table. Your dbt project then transforms that table into your business models, marts, and metrics. Most teams point dbt staging models directly at the Datafly-delivered events table without intermediate ELT.

Data Warehouses

Enriched event data in your warehouse in real time

Stream complete, governed, schema-validated event data to BigQuery, Snowflake, Redshift, Databricks, and object stores. The same enriched events that reach your marketing vendors also feed your data warehouse — no separate ETL pipeline required.

<50ms

Streaming delivery

Events arrive in near real time, not batch windows

100%

Event completeness

Every event — no sampling, no loss

ETL pipelines

No separate ingestion, transformation, or loading

30+

Enrichment fields

Geo, device, session, identity — pre-enriched

The problem with current data warehouse pipelines

Most analytics data reaches your warehouse through a chain of lossy, delayed steps: client-side tags capture a subset of events, a CDP or ETL tool batches them, and hours later a partial picture arrives in your tables.

Incomplete data in, incomplete insights out

If 20-40% of visitors are invisible to client-side tags because of ad blockers, your warehouse tables are missing the same 20-40%. Every dashboard, ML model, and report built on that data inherits the same blind spot.

Batch delays make data stale

Traditional ETL runs on hourly or daily schedules. By the time events land in your warehouse, the moment has passed. Real-time personalisation, fraud detection, and operational dashboards need data now — not data from six hours ago.

Duplicated pipelines, duplicated cost

One pipeline feeds your marketing vendors. A second feeds your warehouse. A third feeds your ML feature store. Each has its own ingestion, transformation, and delivery logic — with different schemas, different enrichments, and different data quality.

Supported destinations

Deliver enriched events to any combination of data warehouses, lakes, and object stores simultaneously.

Destination	Delivery Method	Latency
Google BigQuery	Streaming Insert API	Real-time
Snowflake	Snowpipe	Near real-time
Amazon Redshift	Data API / COPY	Near real-time
Databricks	Delta Lake / REST API	Near real-time
ClickHouse	HTTP Interface	Real-time
Amazon S3	PUT Object (Parquet/JSON)	Micro-batch
Google Cloud Storage	JSON / Parquet files	Micro-batch
Azure Blob Storage	Block Blob upload	Micro-batch
Amazon Kinesis	PutRecords	Real-time
Apache Kafka	Produce	Real-time

Pipeline configuration

The same pipeline that delivers events to GA4 and Meta also streams them to your warehouse. Configure which fields to include, how to transform them, and which events to send.

# bigquery-all-events.yaml
name: bigquery_all_events
integration: google-bigquery
trigger:
  event: "*"  # All event types

parameters:
  project_id: "your-gcp-project"
  dataset: "datafly_events"
  table: "events"

global:
  anonymous_id:
    source: anonymous_id
    mode: direct
  user_id:
    source: user_id
    mode: direct
  session_id:
    source: context.session.id
    mode: direct

events:
  "*":
    mappings:
      event_name:
        source: event
        mode: direct
      event_type:
        source: type
        mode: direct
      event_timestamp:
        source: timestamp
        mode: direct
      page_url:
        source: context.page.url
        mode: direct
      page_title:
        source: context.page.title
        mode: direct
      referrer:
        source: context.page.referrer
        mode: direct
      properties:
        source: properties
        mode: direct
      geo_country:
        source: context.geo.country
        mode: direct
      geo_city:
        source: context.geo.city
        mode: direct
      device_type:
        source: context.device.type
        mode: direct
      browser:
        source: context.device.browser
        mode: direct
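
One way to read the mappings above: each target column names a `source` path in the incoming event, and `mode: direct` copies the value unchanged. The Python sketch below works under that reading. The dotted-path lookup, the `None` result for missing paths, and the function names are assumptions, not the documented behaviour of the pipeline engine.

```python
from typing import Any, Dict, Optional

# A subset of the mappings from the YAML above, expressed as a dict.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "event_name": {"source": "event", "mode": "direct"},
    "session_id": {"source": "context.session.id", "mode": "direct"},
    "page_url": {"source": "context.page.url", "mode": "direct"},
    "geo_country": {"source": "context.geo.country", "mode": "direct"},
}


def resolve_path(event: Dict[str, Any], path: str) -> Optional[Any]:
    """Walk a dotted path through nested dicts; return None if absent."""
    current: Any = event
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def apply_mappings(event: Dict[str, Any],
                   mappings: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Build a flat row by copying each mapped source value to its column."""
    row: Dict[str, Any] = {}
    for column, rule in mappings.items():
        if rule["mode"] != "direct":
            raise ValueError(f"unsupported mode in this sketch: {rule['mode']}")
        row[column] = resolve_path(event, rule["source"])
    return row
```

For an event with `event` set to `page_view` and a nested `context.page.url`, the result is a flat row keyed by column name, with `None` for any path the event does not carry.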

Data warehouse delivery that works

Single source of truth

The same enriched events reach your warehouse and your marketing vendors. One pipeline, one schema, one version of the data.

Real-time streaming

Events stream to your warehouse as they happen. No batch windows, no hourly ETL jobs, no stale data.

Pre-governed data

PII handling, consent enforcement, and bot filtering are applied before events reach your warehouse. Clean data by default.

Schema-validated

Every event is validated against your defined schema before delivery. No malformed rows, no type mismatches, no missing fields.
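
As a minimal sketch of the idea, the Python below checks an event against a declared schema and reports missing fields and type mismatches. The schema contents and the shape of the error messages are hypothetical. Real schema definitions in the product may be richer than a name-to-type map.

```python
from typing import Any, Dict, List

# Hypothetical schema: field name -> required Python type.
SCHEMA: Dict[str, type] = {
    "event_id": str,
    "event_name": str,
    "timestamp": str,
    "properties": dict,
}


def validate_event(event: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the event is valid."""
    errors: List[str] = []
    for field_name, expected in schema.items():
        if field_name not in event:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(event[field_name], expected):
            actual = type(event[field_name]).__name__
            errors.append(
                f"type mismatch: {field_name} expected "
                f"{expected.__name__}, got {actual}"
            )
    return errors
```

An event that returns any errors would be held back rather than written, which is what keeps malformed rows out of the table.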

Guaranteed delivery

At-least-once delivery with exponential backoff retry. Failed events enter the dead letter queue for inspection and replay.
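
The retry pattern named above can be sketched in a few lines of Python. The attempt limit, the base delay, and the in-memory list standing in for the dead letter queue are assumptions for illustration. `send` is a hypothetical callable that raises on failure.

```python
import time
from typing import Any, Callable, Dict, List


def deliver_with_retry(
    event: Dict[str, Any],
    send: Callable[[Dict[str, Any]], None],
    dead_letter_queue: List[Dict[str, Any]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send an event, doubling the wait after each failure.

    Returns True on success. After the final failed attempt the event is
    appended to the dead letter queue and False is returned.
    """
    for attempt in range(max_attempts):
        try:
            send(event)
            return True
        except Exception:
            if attempt == max_attempts - 1:
                break  # no wait after the last attempt
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    dead_letter_queue.append(event)
    return False
```

At-least-once delivery means a retry can occasionally write the same event twice, for example when a destination accepts a write but the acknowledgement is lost. Downstream models can deduplicate on `event_id`.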

Pre-enriched

Events arrive with geolocation, device parsing, session data, and 30+ vendor IDs already attached. No post-load enrichment needed.

What teams build with warehouse data

BI and reporting

Build dashboards in Looker, Tableau, or Power BI on complete, real-time event data. No gaps from ad blockers, no sampling, no batch delays.

ML and AI models

Train recommendation engines, demand forecasters, and personalisation models on unsampled behavioural data with full user journeys and complete identity.

Attribution modelling

Run custom multi-touch attribution models on raw event data with 400-day identity resolution. See the full path to conversion, not just the last 7 days.

Customer 360

Combine web and app behavioural data with CRM and transaction data in your warehouse. Complete event history with consistent identity across all touchpoints.

Real-time activation

Stream events to Kinesis or Kafka for real-time feature stores, fraud detection, personalisation engines, and operational monitoring.

AI-Ready Data

Real-time event streams as ML / AI training data, with consistent identity and clean PII handling.

PII Handling

How email, phone, and address fields are transformed before reaching the warehouse.

Bot Filtering

Strip invalid traffic at the Signal Core so warehouse data reflects real users only.

CDP Integration

The same upstream pipeline feeds CDPs alongside warehouses, so identity stays consistent across both.

Complete data in your warehouse, in real time

See how Datafly Signal streams enriched, governed event data to your data warehouse alongside every marketing destination.