Data Warehouses
Enriched event data in your warehouse in real time
Stream complete, governed, schema-validated event data to BigQuery, Snowflake, Redshift, Databricks, and object stores. The same enriched events that reach your marketing vendors also feed your data warehouse — no separate ETL pipeline required.
- <50ms streaming delivery: events arrive in near real time, not batch windows
- 100% event completeness: every event, no sampling, no loss
- 0 ETL pipelines: no separate ingestion, transformation, or loading
- 30+ enrichment fields: geo, device, session, and identity, pre-enriched
The problem with current data warehouse pipelines
Most analytics data reaches your warehouse through a chain of lossy, delayed steps: client-side tags capture a subset of events, a CDP or ETL tool batches them, and hours later a partial picture arrives in your tables.
Incomplete data in, incomplete insights out
If 20-40% of visitors are invisible to client-side tags because of ad blockers, your warehouse tables are missing the same 20-40%. Every dashboard, ML model, and report built on that data inherits the same blind spot.
Batch delays make data stale
Traditional ETL runs on hourly or daily schedules. By the time events land in your warehouse, the moment has passed. Real-time personalisation, fraud detection, and operational dashboards need data now — not data from six hours ago.
Duplicated pipelines, duplicated cost
One pipeline feeds your marketing vendors. A second feeds your warehouse. A third feeds your ML feature store. Each has its own ingestion, transformation, and delivery logic — with different schemas, different enrichments, and different data quality.
Supported destinations
Deliver enriched events to any combination of data warehouses, lakes, and object stores simultaneously.
| Destination | Delivery Method | Latency |
|---|---|---|
| Google BigQuery | Streaming Insert API | Real-time |
| Snowflake | Snowpipe | Near real-time |
| Amazon Redshift | Data API / COPY | Near real-time |
| Databricks | Delta Lake / REST API | Near real-time |
| ClickHouse | HTTP Interface | Real-time |
| Amazon S3 | PUT Object (Parquet/JSON) | Micro-batch |
| Google Cloud Storage | JSON / Parquet files | Micro-batch |
| Azure Blob Storage | Block Blob upload | Micro-batch |
| Amazon Kinesis | PutRecords | Real-time |
| Apache Kafka | Produce | Real-time |
Pipeline configuration
The same pipeline that delivers events to GA4 and Meta also streams them to your warehouse. Configure which fields to include, how to transform them, and which events to send.
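To make the mapping semantics in the configuration that follows more concrete, here is a minimal Python sketch of how a `source` dot-path with `mode: direct` might be resolved against an incoming event. It is an illustration only, not Datafly Signal's implementation: the helper names (`resolve_path`, `apply_mappings`), the sample event, and the assumption that `direct` copies the value unchanged (yielding `None` when the path is absent) are all hypothetical.

```python
from typing import Any

# Hypothetical subset of the mappings from the pipeline configuration.
MAPPINGS = {
    "event_name": {"source": "event", "mode": "direct"},
    "page_url": {"source": "context.page.url", "mode": "direct"},
    "geo_country": {"source": "context.geo.country", "mode": "direct"},
}


def resolve_path(event: dict[str, Any], path: str) -> Any:
    """Walk a dot-separated path; return None if any segment is missing."""
    current: Any = event
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def apply_mappings(event: dict[str, Any], mappings: dict[str, dict[str, str]]) -> dict[str, Any]:
    """Build one warehouse row: each column takes the value found at its source path."""
    row: dict[str, Any] = {}
    for column, rule in mappings.items():
        # Assumption: "direct" means copy the source value without transformation.
        if rule["mode"] != "direct":
            raise ValueError(f"unsupported mode in this sketch: {rule['mode']}")
        row[column] = resolve_path(event, rule["source"])
    return row


# Made-up event with no geo context, to show how a missing path behaves.
sample_event = {
    "event": "page_view",
    "context": {"page": {"url": "https://example.com/pricing"}},
}
row = apply_mappings(sample_event, MAPPINGS)
```

In this sketch a missing source path produces a `None` column rather than an error; whether the real pipeline drops, nulls, or rejects such fields is not stated in this page.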
# bigquery-all-events.yaml
name: bigquery_all_events
integration: google-bigquery
trigger:
  event: "*" # All event types
parameters:
  project_id: "your-gcp-project"
  dataset: "datafly_events"
  table: "events"
global:
  anonymous_id:
    source: anonymous_id
    mode: direct
  user_id:
    source: user_id
    mode: direct
  session_id:
    source: context.session.id
    mode: direct
events:
  "*":
    mappings:
      event_name:
        source: event
        mode: direct
      event_type:
        source: type
        mode: direct
      event_timestamp:
        source: timestamp
        mode: direct
      page_url:
        source: context.page.url
        mode: direct
      page_title:
        source: context.page.title
        mode: direct
      referrer:
        source: context.page.referrer
        mode: direct
      properties:
        source: properties
        mode: direct
      geo_country:
        source: context.geo.country
        mode: direct
      geo_city:
        source: context.geo.city
        mode: direct
      device_type:
        source: context.device.type
        mode: direct
      browser:
        source: context.device.browser
        mode: direct
Data warehouse delivery that works
Single source of truth
The same enriched events reach your warehouse and your marketing vendors. One pipeline, one schema, one version of the data.
Real-time streaming
Events stream to your warehouse as they happen. No batch windows, no hourly ETL jobs, no stale data.
Pre-governed data
PII handling, consent enforcement, and bot filtering are applied before events reach your warehouse. Clean data by default.
Schema-validated
Every event is validated against your defined schema before delivery. No malformed rows, no type mismatches, no missing fields.
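As a rough illustration of pre-delivery validation, the following Python sketch checks a row for missing required fields and type mismatches. The schema format, field names, and `validate_row` helper are assumptions made for this example; the actual schema definition language used by the product is not described here.

```python
from typing import Any

# Hypothetical schema: column name -> (expected Python type, required?).
EVENT_SCHEMA: dict[str, tuple[type, bool]] = {
    "event_name": (str, True),
    "event_timestamp": (str, True),
    "user_id": (str, False),
    "properties": (dict, False),
}


def validate_row(row: dict[str, Any], schema: dict[str, tuple[type, bool]]) -> list[str]:
    """Return a list of validation errors; an empty list means the row is valid."""
    errors: list[str] = []
    for column, (expected_type, required) in schema.items():
        value = row.get(column)
        if value is None:
            # Absent optional fields are fine; absent required fields are not.
            if required:
                errors.append(f"missing required field: {column}")
            continue
        if not isinstance(value, expected_type):
            errors.append(
                f"{column}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return errors


good_row = {"event_name": "page_view", "event_timestamp": "2024-01-01T00:00:00Z"}
bad_row = {"event_name": 42, "properties": {}}

good_errors = validate_row(good_row, EVENT_SCHEMA)
bad_errors = validate_row(bad_row, EVENT_SCHEMA)
```

A row that produces any errors would be held back from delivery in this sketch, which is one way to keep malformed rows out of the destination table.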
Guaranteed delivery
At-least-once delivery with exponential backoff retry. Failed events enter the dead letter queue for inspection and replay.
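The retry behaviour described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the product's delivery code: the function name, the attempt limit, the base delay, and the use of a plain list as the dead letter queue are all hypothetical.

```python
import time
from typing import Any, Callable


def deliver_with_retry(
    event: dict[str, Any],
    send: Callable[[dict[str, Any]], None],
    dead_letter_queue: list[dict[str, Any]],
    max_attempts: int = 5,          # assumed limit, not a documented default
    base_delay: float = 0.5,        # assumed starting delay in seconds
    sleep: Callable[[float], Any] = time.sleep,
) -> bool:
    """Try to send an event, doubling the wait after each failure.

    Returns True on success. After max_attempts failures the event is
    appended to dead_letter_queue for later inspection and replay.
    """
    for attempt in range(max_attempts):
        try:
            send(event)
            return True
        except Exception:
            # Wait base_delay * 2^attempt before retrying; skip the wait
            # after the final attempt because no retry will follow.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    dead_letter_queue.append(event)
    return False
```

Because at-least-once delivery can produce duplicates when a send succeeds but its acknowledgement is lost, downstream queries may want to deduplicate on a stable key such as `event_id`.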
Pre-enriched
Events arrive with geolocation, device parsing, session data, and 30+ vendor IDs already attached. No post-load enrichment needed.
What teams build with warehouse data
BI and reporting
Build dashboards in Looker, Tableau, or Power BI on complete, real-time event data. No gaps from ad blockers, no sampling, no batch delays.
ML and AI models
Train recommendation engines, demand forecasters, and personalisation models on unsampled behavioural data with full user journeys and complete identity.
Attribution modelling
Run custom multi-touch attribution models on raw event data with 400-day identity resolution. See the full path to conversion, not just the last 7 days.
Customer 360
Combine web and app behavioural data with CRM and transaction data in your warehouse. Complete event history with consistent identity across all touchpoints.
Real-time activation
Stream events to Kinesis or Kafka for real-time feature stores, fraud detection, personalisation engines, and operational monitoring.
Frequently asked questions
- Which data warehouses are supported?
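- BigQuery, Snowflake, Redshift, Databricks, S3, Google Cloud Storage, and Azure Blob Storage are first-class destinations with native connectors. Warehouses receive events by streaming, while object stores receive micro-batched files. ClickHouse, Amazon Kinesis, and Apache Kafka are also supported (see Supported destinations). Any warehouse or object store reachable over HTTPS can be added via custom webhook delivery if a native connector does not yet exist.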
- How does delivery latency compare to batch ETL?
- Events typically arrive in the warehouse within 1-10 seconds depending on destination. BigQuery streaming inserts are sub-second. Snowflake via Snowpipe Streaming is 1-2 seconds. Compared to nightly batch ETL or hourly Fivetran syncs, downstream models, dashboards, and ML pipelines run on data that is genuinely current rather than 3-24 hours stale.
- How is the data structured in the warehouse?
- Events land in a normalised events table with structured columns (event_id, user_id, timestamp, event_name, properties JSON, context JSON, consent flags, classification metadata). The schema is consistent across all sources (web, mobile, server) so downstream queries, dbt models, and ML feature pipelines do not need source-specific logic.
- Are PII transformations applied before warehouse delivery?
- Yes. The Org Data Layer's PII rules apply uniformly across every destination, including warehouses. Email addresses and phone numbers are SHA-256 hashed by default, IP addresses are masked, and payment card numbers, dates of birth, and addresses are stripped. You can override these rules per destination in the rare case that a specific warehouse needs the raw values for compliance reasons. Because the defaults are secure, warehouse data is procurement-friendly without per-team configuration.
- Does Datafly Signal replace dbt or my data modelling layer?
- No, it complements them. Datafly Signal handles the collect → enrich → deliver layer, with events landing in a clean, well-typed events table. Your dbt project then transforms that table into your business models, marts, and metrics. Most teams point dbt staging models directly at the Datafly-delivered events table without intermediate ELT.
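The default PII handling described in the answers above (SHA-256 hashing for email and phone, IP masking, and stripping of payment card, date-of-birth, and address fields) could look roughly like the Python sketch below. The field names, the trim-and-lowercase normalisation before hashing, and the mask widths (/24 for IPv4, /48 for IPv6) are assumptions for illustration; the product's actual rules may differ.

```python
import hashlib
import ipaddress
from typing import Any

# Hypothetical field names; real events may use different keys.
HASHED_FIELDS = {"email", "phone"}
STRIPPED_FIELDS = {"card_number", "date_of_birth", "address"}


def sha256_hex(value: str) -> str:
    """Hash a value with SHA-256 after trimming and lowercasing (assumed normalisation)."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


def mask_ip(ip: str) -> str:
    """Zero the host part of an address: /24 for IPv4, /48 for IPv6 (assumed widths)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def apply_pii_rules(traits: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of traits with sensitive fields hashed, masked, or removed."""
    cleaned: dict[str, Any] = {}
    for key, value in traits.items():
        if key in STRIPPED_FIELDS:
            continue  # dropped entirely before delivery
        if key in HASHED_FIELDS:
            cleaned[key] = sha256_hex(value)
        elif key == "ip":
            cleaned[key] = mask_ip(value)
        else:
            cleaned[key] = value
    return cleaned


cleaned = apply_pii_rules({
    "email": " User@Example.com ",
    "ip": "203.0.113.42",
    "card_number": "4111111111111111",
    "plan": "pro",
})
```

Hashing after normalisation keeps the same address joinable across sources, which matters if warehouse queries need to match hashed identifiers from web, mobile, and server events.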
Related
AI-Ready Data
Real-time event streams as ML / AI training data, with consistent identity and clean PII handling.
PII Handling
How email, phone, and address fields are transformed before reaching the warehouse.
Bot Filtering
Strip invalid traffic at the Signal Core so warehouse data reflects real users only.
CDP Integration
Same upstream pipeline feeds CDPs alongside warehouses for consistent identity.
Complete data in your warehouse, in real time
See how Datafly Signal streams enriched, governed event data to your data warehouse alongside every marketing destination.