Data Collaboration Platform (DCP) - Prepare source data for ingestion

For:

Marketers

Last update: February 05, 2026 14:42

At a glance: Prepare your source data to maximize match rates between collaborator datasets and optimize collaboration outcomes.

About data preparation

The AppsFlyer Data Collaboration Platform (DCP) is designed to ingest and process large-scale transactional data, user-level data, or a combination of both from multiple collaborators. Once ingested, the DCP can match users across datasets using shared identifiers (for example, hashed email or hashed phone) and enable collaboration workflows based on that data.

This article outlines:

The kinds of data the DCP can process (transactional, user-level, or a combination)
How to structure data to maximize match rates between collaborators, and for optimal matching by activation partners
Which data structures are recommended and which aren't
How to handle multi-value identifiers when using CSV files

Understand data types

Most sources used in collaborations involve one or both of the following data types:

User-level (identity) data
- Goal: Represent each user and the identifiers used for matching
- Recommended structure: One row per user
Transactional (event) data
- Goal: Represent user actions (purchases, visits, transactions)
- Typical structure: Multiple rows per user (because a user can have many transactions)

Data structuring guidelines

This section outlines DCP data structuring guidelines and best practices.

Recommended data structures

To structure your source data for optimal ingestion and maximum outcomes in DCP:

Ensure it meets source data formatting requirements.
For best results, structure identity data as one row per user and include all available identifiers for that user. See Recommended identifier structures.
- If users have multiple email addresses or phone numbers, use array-based formats (such as Parquet, BigQuery, or Snowflake).
Check that hashing guidelines are followed.
For transactional datasets (and other data types where applicable), use the DCP Data View tool to create a business-oriented, privacy-safe subset of your data optimized for collaboration, and share it instead of raw tables. See Why Data Views are recommended.

While other structures can be used, they may not be ideally suited for getting the most out of DCP. See Data structures that aren't recommended.

Recommended identifier structures

To maximize match rates, send user-level identity data as a single row per user whenever possible, and include all available identifiers in a structure that accurately reflects how users are represented.

Option 1: One row per user, one value per identifier type

Simple and effective.

Use this when each user has at most one hashed email address, one hashed phone number, and so on.

Example

User ID	Hashed Email	Hashed Phone
123	5d41402	152d234b70
456	7d793037
789		0e6e5414

Why this is recommended

Simplest to prepare and ingest
Easy to validate and troubleshoot
Strong matching performance when identifiers are complete and consistent (for matching with collaborator datasets and activation media partner matching)

Best fit

Great for CSV-based uploads
Also works well in Parquet / BigQuery / Snowflake
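
As an illustration of Option 1, the following minimal Python sketch writes one row per user to CSV and leaves a cell empty when an identifier is missing. The column names (user_id, hashed_email, hashed_phone) are assumptions for this example, not a required DCP schema, and the short values are the truncated placeholders from the table above rather than full hashes.

```python
import csv
import io

# Hypothetical column names; use the names your own source schema defines.
FIELDS = ["user_id", "hashed_email", "hashed_phone"]


def to_csv(users):
    """Serialize one row per user; missing identifiers become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for user in users:
        writer.writerow({field: user.get(field, "") for field in FIELDS})
    return buffer.getvalue()


# Placeholder values taken from the example table (not real 64-character hashes).
users = [
    {"user_id": "123", "hashed_email": "5d41402", "hashed_phone": "152d234b70"},
    {"user_id": "456", "hashed_email": "7d793037"},
    {"user_id": "789", "hashed_phone": "0e6e5414"},
]
csv_text = to_csv(users)
```

Keeping a fixed column list means every row has the same number of cells, even when a user has only one identifier.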

Option 2: Arrays (lists) of identifiers for multi-value fields

Best for accuracy when users have multiple emails or phones.

Use this when a user can have multiple values of the same identifier type, such as:

Multiple hashed emails
Multiple hashed phones

Represent these multi-value fields as native arrays, not as delimited text.

Example

User ID	Hashed Emails	Hashed Phones	MAID
u_0001	[a1b2 · c3d4 · e5f6]	[p1q2 · r3s4]	m001
u_0002	[z9y8 · x7w6]		m002
u_0003		[t1u2 · v3w4 · x5y6]

Why this is preferred

Reflects how identity data is often stored internally
No need to guess how many “email columns” might be needed
Avoids ongoing schema changes
Maximizes matching potential (for collaborator datasets and for activation media partner matching) by allowing multiple identifiers per user

Best fit

Parquet (S3/GCS)
BigQuery
Snowflake

Practical checklist for data structuring

Identity (user-level) dataset checklist

One row represents one user.
Include all available identifiers (hashed email, hashed phone, MAID, and so on).
If multiple values exist (for example, multiple emails), use native arrays where possible.
Avoid free-text lists in cells, numbered columns such as “email_1/email_2/email_3”, and multi-row identity structures.
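
The identity checklist can be approximated with a small pre-upload check. This is a sketch only: the function name check_identity_rows is hypothetical, the field names follow the example schemas later in this article, and the DCP may apply different or additional validation during ingestion.

```python
from collections import Counter

# Field names follow this article's example schemas; adjust to your own source.
ID_FIELDS = ("hashed_emails", "hashed_phones", "maid")
ARRAY_FIELDS = ("hashed_emails", "hashed_phones")


def check_identity_rows(rows):
    """Return a list of human-readable problems found in identity rows."""
    problems = []
    # One row should represent one user.
    counts = Counter(row["user_id"] for row in rows)
    for user_id, n in counts.items():
        if n > 1:
            problems.append(f"user {user_id}: {n} rows (expected 1)")
    for row in rows:
        # Each user should carry at least one identifier.
        if not any(row.get(field) for field in ID_FIELDS):
            problems.append(f"user {row['user_id']}: no identifiers")
        # Multi-value fields should be real lists, not delimited text.
        for field in ARRAY_FIELDS:
            value = row.get(field)
            if value is not None and not isinstance(value, list):
                problems.append(f"user {row['user_id']}: {field} is not a list")
    return problems
```

An empty result means none of these three checks failed; it does not guarantee that the data meets every formatting or hashing requirement.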

Transactional (events) dataset checklist

Keep event/transaction rows as-is. See Transactional data examples.
Use Data Views to create collaboration-ready user-level summaries when appropriate.

Data Views checklist

Build business-oriented Data Views (audience-ready, summarized, standardized).
Share Data Views instead of raw sensitive tables whenever possible.
Minimize shared fields to only what’s needed (for privacy and business protection).
Consider aggregations to reduce access to sensitive data and improve efficiency.

Supported ingestion cloud services and formats

The DCP supports ingestion through multiple cloud services, such as:

AWS S3
GCS (Google Cloud Storage)
BigQuery
Snowflake

Format guidance

Use CSV (comma-separated) when each user has one value per identifier type (one email, one phone, and so on).
When users can have multiple values of the same identifier type (for example, multiple hashed emails), prefer array-capable formats:
- Parquet (common in big data)
- BigQuery (arrays are built-in)
- Snowflake (arrays are built-in)

Notes:

“Arrays” just means “a list of values.”
Parquet files and native arrays in BigQuery and Snowflake are standard features that most data teams already support.

See Connector-specific example assets (AWS S3, GCS, BigQuery, Snowflake).

Data structures that aren't recommended

Some data structures are well suited for ingestion into DCP, and others are not. The following structures are not recommended:

Avoid: Multiple columns for the same identifier type

Examples like “primary/secondary/third email” are discouraged.

Example to avoid

User ID	Email Hash 1	Email Hash 2	Email Hash 3
123	a1b2	c3d4
456	z9y8

Why to avoid

The number of values per user is unpredictable
Leads to column proliferation and schema maintenance issues
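
If your source already uses numbered columns, one possible way to convert them is to collapse the columns into a single list per user before export. The sketch below assumes column names that end in an integer suffix (for example, email_hash_1); the function name collapse_columns and the target field name are hypothetical.

```python
def collapse_columns(row, prefix, target):
    """Merge columns such as email_hash_1..N into one de-duplicated list."""
    # Order the numbered columns by their integer suffix (assumed to be present).
    numbered = sorted(
        (key for key in row if key.startswith(prefix)),
        key=lambda key: int(key[len(prefix):]),
    )
    values = []
    for key in numbered:
        value = row[key]
        if value and value not in values:  # skip blanks and duplicates
            values.append(value)
    # Keep every other column unchanged and add the new list field.
    result = {k: v for k, v in row.items() if not k.startswith(prefix)}
    result[target] = values
    return result
```

For example, a row with email_hash_1 = a1b2, email_hash_2 = c3d4, and an empty email_hash_3 becomes a row with hashed_emails = [a1b2, c3d4].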

Avoid: “Lists inside a single cell” (free text)

Even if it looks convenient, it’s error-prone.

Example to avoid

User ID	Hashed Emails (text)
123	a1b2 · c3d4 · e5f6

Why to avoid

Not a true array type
Formatting/escaping issues are common
Small inconsistencies can break ingestion
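
To show why free-text lists are fragile, and how they might be cleaned up before export, here is a minimal sketch that splits a cell into a real list. The separator set and the function name split_cell are assumptions; your data may use other delimiters, which is exactly the inconsistency that makes this structure risky.

```python
import re


def split_cell(cell, separators=",;|·"):
    """Split a free-text list cell into clean values, dropping brackets and blanks."""
    # Tolerate empty cells and optional surrounding brackets.
    text = (cell or "").strip().strip("[]")
    pattern = "[" + re.escape(separators) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]
```

After splitting, write the values to an array-capable format (Parquet, BigQuery, or Snowflake) instead of back into a text column.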

Avoid: Multiple rows per user for identifiers (when your goal is a user identity table)

Don’t split identity data across rows when you can keep one row per user.

Example to avoid

User ID	Identifier Type	Identifier Value
123	hashed_email	a1b2
123	hashed_email	c3d4

Why to avoid

Requires extra processing to reconstruct users
Higher risk of duplicates and inconsistencies
Less efficient at scale

Note: Multiple rows per user are normal for transactional tables, but not ideal for the identity (user-level) table used for matching.
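
The "extra processing" needed to reconstruct users from a multi-row identity table looks roughly like the sketch below, which groups (user ID, identifier type, identifier value) rows into one record per user with a list per identifier type. The function name pivot_identifiers is hypothetical, and this is an illustration rather than how the DCP processes such tables.

```python
from collections import defaultdict


def pivot_identifiers(long_rows):
    """Group (user_id, identifier_type, identifier_value) rows into one row per user."""
    users = defaultdict(lambda: defaultdict(list))
    for user_id, id_type, value in long_rows:
        # Skip blank values and avoid repeating an identifier for the same user.
        if value and value not in users[user_id][id_type]:
            users[user_id][id_type].append(value)
    return [
        {"user_id": user_id, **dict(identifiers)}
        for user_id, identifiers in users.items()
    ]
```

Running this once in your own pipeline, before ingestion, produces the one-row-per-user structure with arrays described in Option 2.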

Additional information

This section provides additional, in-depth information on relevant topics.

Transactional data examples

Transactional/event data may have multiple rows per user:

Transaction ID	User ID	Transaction Date	Amount	Category
t_001	123	2025-12-01	120.00	Grocery
t_002	123	2025-12-15	45.00	Pharmacy
t_003	456	2025-12-20	300.00	Electronics

Recommended

Ingest transactional data in its natural structure
Use the Data View tool to produce a collaboration-ready user-level view (for example, spend in last 30 days, category affinity, recency/frequency summaries)
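
To make the idea of a user-level summary concrete, the following sketch computes spend and transaction count within a window, plus days since the last transaction, from the example rows above. It only illustrates the kind of output a Data View can produce; the Data View tool itself is configured in the DCP, and the function and field names here are hypothetical.

```python
from datetime import date, timedelta

# Rows from the example table: (transaction_id, user_id, date, amount, category).
transactions = [
    ("t_001", "123", date(2025, 12, 1), 120.00, "Grocery"),
    ("t_002", "123", date(2025, 12, 15), 45.00, "Pharmacy"),
    ("t_003", "456", date(2025, 12, 20), 300.00, "Electronics"),
]


def summarize(transactions, as_of, window_days=30):
    """Per-user spend and count within the window, plus days since last transaction."""
    cutoff = as_of - timedelta(days=window_days)
    summary = {}
    for _txn_id, user_id, txn_date, amount, _category in transactions:
        row = summary.setdefault(user_id, {"spend": 0.0, "count": 0, "last": None})
        if cutoff < txn_date <= as_of:  # inside the look-back window
            row["spend"] += amount
            row["count"] += 1
        if txn_date <= as_of and (row["last"] is None or txn_date > row["last"]):
            row["last"] = txn_date
    return {
        user_id: {
            "spend_in_window": round(row["spend"], 2),
            "transactions_in_window": row["count"],
            "days_since_last": (as_of - row["last"]).days if row["last"] else None,
        }
        for user_id, row in summary.items()
    }


result = summarize(transactions, as_of=date(2025, 12, 25))
```

The result has one entry per user, which is the shape recommended for sharing in a collaboration instead of the raw transaction rows.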

Connector-specific example assets (AWS S3, GCS, BigQuery, Snowflake)

What does “array/list of identifiers” mean?

Sometimes a single user can have more than one email or phone (for example, a personal and a work email). Instead of forcing these into extra columns or messy text fields, we represent them as a list of values.

User ID	Hashed Emails (List)
u_0001	[a1b2 · c3d4 · e5f6]
u_0002	[z9y8 · x7w6]

1. AWS S3 and GCS connector (Parquet)

What the identity sample should contain (schema)

Field	Type	Notes
user_id	string	One row per user
hashed_emails	array of strings	Multi-value identifier (recommended)
hashed_phones	array of strings	Multi-value identifier (recommended)
maid	string	Optional additional identifier

Parquet supports lists natively, which is why it’s recommended when users have multiple emails/phones.

2. BigQuery connector (table preview and schema screenshot)

Example identity schema (BigQuery)

Field	Type	Mode
user_id	STRING	NULLABLE
hashed_emails	STRING	REPEATED (ARRAY)
hashed_phones	STRING	REPEATED (ARRAY)
maid	STRING	NULLABLE

Arrays are native in BigQuery (REPEATED fields). This is the recommended approach when multiple emails/phones exist.

3. Snowflake connector (table preview and schema screenshot)

Example identity schema (Snowflake)

Field	Type	Notes
USER_ID	VARCHAR	One row per user
HASHED_EMAILS	ARRAY	Multi-value identifier
HASHED_PHONES	ARRAY	Multi-value identifier
MAID	VARCHAR	Optional

Snowflake supports list-like fields (arrays). This is preferred over creating email_1/email_2/email_3 columns.

Why Data Views are recommended

Data Views are customized subsets of your existing data sources that you create without changing the original dataset. They let you modify, refine, and summarize data so that you share only what’s relevant to a collaboration, while all other data stays private and secure. Use Data Views in collaborations because they:

Are business-oriented and easier to collaborate with

Data Views let you present the data in a clean, business-friendly structure (for example, “Customer Identity View”, “Eligible Audience View”, “Purchase Summary View”) rather than raw logs or complex transactional schemas.
They help standardize definitions (for example, “active customer”, “total spend last 30 days”) so both sides are aligned and using the same logic.

Are more privacy-safe by design

Data Views support data minimization: you can share only the fields and rows needed for the collaboration, instead of exposing full raw datasets.
They reduce privacy risk by allowing you to:
- Exclude unnecessary attributes
- Limit granularity (for example, share aggregated metrics instead of line-level transactions)
- Share only the identifiers needed for matching (and not additional sensitive columns)

Protect sensitive business data

Raw transactional tables can contain highly sensitive business signals (for example, detailed purchase behavior, pricing patterns, merchant relationships, internal IDs).
Data Views let you share only what’s required for the collaboration:
- Example: share category-level aggregates instead of item-level details
- Example: share spend buckets or summary metrics instead of exact line items
This reduces the risk of unintentionally revealing competitive or proprietary information.
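
As a sketch of the two examples above, the following code rolls line-level amounts up to category-level totals per user and then replaces each total with a spend bucket. The bucket boundaries, labels, and function names are hypothetical; choose ranges that suit your collaboration.

```python
from collections import defaultdict

# Hypothetical bucket boundaries: (inclusive lower bound, exclusive upper bound, label).
BUCKETS = [(0, 50, "0-49"), (50, 200, "50-199"), (200, float("inf"), "200+")]


def spend_bucket(total):
    """Map an exact total to a coarse label so exact amounts are not shared."""
    for low, high, label in BUCKETS:
        if low <= total < high:
            return label
    return "unknown"


def category_view(transactions):
    """Aggregate (user_id, category, amount) rows to per-user, per-category buckets."""
    totals = defaultdict(float)
    for user_id, category, amount in transactions:
        totals[(user_id, category)] += amount
    return [
        {"user_id": user_id, "category": category, "spend_bucket": spend_bucket(total)}
        for (user_id, category), total in sorted(totals.items())
    ]
```

The output contains no item-level rows and no exact amounts, only the category and the bucket each user falls into.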

Can make your data more cost-effective and performant

Data Views can be structured to be lighter and more efficient, which often reduces:

Data scanned
Processing time
Compute costs

Pre-structuring a collaboration-ready view (especially aggregated views) may make collaboration faster and more operationally predictable.

Multiple IDs/keys upload and matching behavior by platform

DCP identifiers upload

Partner/Platform	Supported Identifier Types	Upload Mode	Upload Behavior	Identifier Priority (for single-type partners)
Meta	Email, Phone, IDFA, GAID, Android ID, First Name, Last Name, DOB, City, State, Zip, Country	All supported types	Uploads all supported identifier types	N/A - uploads all
Google	Email, Phone, IDFA, GAID, Android ID, Customer ID	Single type	Uploads one identifier type per audience	1. Email 2. Phone 3. Mobile IDs 4. Customer ID
TikTok	Email, Phone, IDFA, GAID, Android ID	All supported types	Uploads all supported identifier types	N/A - uploads all
Pinterest	Email, Phone, IDFA, GAID	Single type	Uploads one identifier type per audience	1. Email 2. Phone 3. Mobile IDs
Infillion	Email, Phone, IDFA, GAID, MM-UUID	All supported types	Uploads all supported types grouped into up to 3 files	N/A - uploads all
The Trade Desk	Email, Phone	Single type	Uploads one identifier type per audience	1. Email 2. Phone
AppsFlyer Data Locker	All available identifiers	All types combined	Delivers all identifiers in a single combined file	N/A - delivers all
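
One way to read the priority column for single-type partners is as a "first available type wins" rule. The sketch below encodes the priorities from the table above under that assumption; it is an interpretation for planning purposes, not the DCP's actual implementation, and the type labels (for example, mobile_ids) are hypothetical.

```python
# Priority orders copied from the table; "mobile_ids" stands for IDFA/GAID/Android ID.
PRIORITY = {
    "Google": ["email", "phone", "mobile_ids", "customer_id"],
    "Pinterest": ["email", "phone", "mobile_ids"],
    "The Trade Desk": ["email", "phone"],
}


def pick_identifier_type(partner, available_types):
    """Return the highest-priority identifier type present in the audience, or None."""
    for id_type in PRIORITY[partner]:
        if id_type in available_types:
            return id_type
    return None
```

Under this reading, an audience with only phone numbers and customer IDs would be uploaded to Google as phone numbers, and an audience with only mobile IDs would have nothing to upload to The Trade Desk.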

Platform	Can you send multiple IDs per user?	Can they match on any of them?	Can they combine multiple IDs?	Matching model (practical)	Notes
Meta	Yes	Yes	Yes	Deterministic, any-ID / combo	Email, phone, MAID, fbp/fbc all evaluated; strongest “any-of-them” behavior
Google	Yes	Limited	No	Deterministic, prioritized	One ID resolves the user; IDs are not combined to form a match
TikTok	Yes	Yes	Yes	Deterministic, any-ID / combo	Very similar to Meta; ttclid strong for events
Pinterest	Yes	Partial	Limited	Deterministic, ID-centric	Primarily email + MAID; limited cross-ID reinforcement
The Trade Desk (TTD)	Yes	Via ID graph	Yes (via graph)	Deterministic via identity graph	UID2 + MAID + cookies resolved through EUID/UID2 graph
Infillion	Yes	Via ID graph	Yes (via graph)	Deterministic via partner graph	Relies on OmniID / partner identity resolution

Takeaways for activation planning

Meta and TikTok
- Send everything you have
- True “any-ID” behavior
- Best platforms for identity-rich clean room activation
Google
- Send multiple IDs for coverage
- Do not expect cross-ID reinforcement
- Email and phone drive most matches
TTD and Infillion
- Matching quality depends more on identity graph participation
- UID2 / OmniID coverage matters more than raw identifier count
Pinterest
- Behaves closer to Google than Meta
- Limited upside beyond email + MAID

TL;DR

Only Meta and TikTok do “any-of-them” matching natively
Google is deterministic but single-ID resolved
Open web platforms match via identity graphs, not raw IDs

Hashed data formatting guidelines by media partner

Universal normalization rules:
Before hashing any data, almost all platforms require the following normalization steps to ensure matching accuracy:

Text: Convert to lowercase and trim all leading/trailing whitespace.
Phone: Convert to E.164 format (e.g., +15551234567).
Hashing: SHA-256 is the industry standard, though some platforms accept MD5 or SHA-1.
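
The universal rules above can be sketched with the Python standard library as follows. The phone handling is deliberately simplified and is an assumption: it strips non-digits, keeps an existing "+" country code, and otherwise drops leading zeros and prepends a default country code that you supply. Production data usually needs a dedicated phone-parsing library, and partner-specific rules in the table below take precedence.

```python
import hashlib
import re


def normalize_email(email):
    """Lowercase and trim leading/trailing whitespace."""
    return email.strip().lower()


def normalize_phone(phone, default_country_code):
    """Return an E.164-style string: '+' followed by digits only (simplified)."""
    raw = phone.strip()
    digits = re.sub(r"\D", "", raw)
    if raw.startswith("+"):
        return "+" + digits  # country code already present
    # Assumption: a leading zero is a national trunk prefix and can be dropped.
    return "+" + default_country_code + digits.lstrip("0")


def sha256_hex(value):
    """SHA-256 of the UTF-8 encoded value, as a 64-character lowercase hex string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


hashed_email = sha256_hex(normalize_email("  Jane.Doe@Example.COM "))
hashed_phone = sha256_hex(normalize_phone("(555) 123-4567", default_country_code="1"))
```

Always normalize before hashing: two spellings of the same email address produce completely different hashes, so they cannot match after the fact.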

Media Partner	Email Guidelines	Phone Number Guidelines	Mobile ID (MAID) Guidelines	Specific Platform Nuances & Matching Logic	Relevant Reference
Google Ads / DV360	Format: Lowercase, trim whitespace. Include domain. Gmail Rule: For @gmail.com / @googlemail.com, remove periods (.) before the "@" and suffixes after (+). Hashing: SHA-256 (Hex encoded).	Format: E.164 (e.g., +16505551212). Must include country code. Hashing: SHA-256.	Format: IDFA (iOS) or GAID (Android). Hashing: Do not hash.	Matching Priority: Deterministic and prioritized (1. Email, 2. Phone, 3. Mobile IDs). It does not combine multiple IDs to form a match (single-ID resolution). DV360 PAIR: DV360 also supports the PAIR protocol for secure publisher-advertiser reconciliation.	Get started with Customer Match (Google Ads API)
Meta (Facebook/ Instagram)	Format: Lowercase, trim whitespace. Hashing: SHA-256 required.	Format: Remove symbols, letters, and leading zeros. Include country code (e.g., 16505551212). Hashing: SHA-256.	Format: IDFA or GAID. Hashing: Do not hash. Lowercase, keep hyphens.	Matching Logic: Uses "any-of-them" logic; combines multiple IDs (Email, Phone, MAID, etc.) to resolve a user, providing strong cross-ID reinforcement. CAPI: Requires unhashed client_ip_address and client_user_agent for browser matching.	Customer Information Parameters - Conversions API
TikTok	Format: Lowercase, remove all spaces. Hashing: SHA-256.	Format: E.164 (+[country code][number]). Also accepts without +. Hashing: SHA-256.	Format: GAID (Android) or IDFA (iOS). Case: All upper or all lower. Hashing: SHA-256 or MD5.	Matching Logic: Similar to Meta, uses "any-of-them" deterministic matching logic where any single valid ID is sufficient. Audience Size: Requires a minimum of 1,000 matched users to activate a Custom Audience.	How to create a Custom Audience with a customer file
The Trade Desk (UID2)	Format: Same "Gmail Rule" as Google (remove periods and "+" suffixes). Hashing: SHA-256. Encoding: Base64-encode the raw bytes of the hash (result is 44 chars).	Format: E.164. Hashing: SHA-256. Encoding: Base64-encode the raw bytes.	Format: AAID (lowercase) / IDFA (uppercase). Hashing: SHA-256, MD5, or SHA-1.	Matching Logic: Deterministic via identity graph. Matching depends on graph participation rather than raw ID coverage. Critical Nuance: You must base64-encode the raw bytes of the hash. Encoding the hex string will fail.
Pinterest	Format: Must include "@". Hashing: SHA-256, MD5, or SHA-1.	Format: E.164. Remove special characters (spaces, hyphens). Hashing: SHA-256, MD5, or SHA-1.	Format: IDFA or GAID. Hashing: SHA-256, MD5, or SHA-1.	Matching Logic: "ID-Centric." Matching is primarily driven by Email and MAID. There is "limited upside" to providing other IDs as cross-ID reinforcement is weak compared to Meta. Upload Behavior: Uploads one identifier type per audience.	DCP Data Ingestion and Structuring Guidelines
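
Two partner-specific rules in the table are easy to get wrong: the "Gmail Rule" (Google and The Trade Desk rows) and base64-encoding the raw hash bytes (The Trade Desk row). The sketch below illustrates both under the rules as stated in the table; the function names are hypothetical, and each partner's own documentation remains the authority.

```python
import base64
import hashlib

GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def normalize_gmail_style(email):
    """Lowercase/trim; for Gmail domains drop periods and any '+suffix' in the local part."""
    local, _, domain = email.strip().lower().rpartition("@")
    if domain in GMAIL_DOMAINS:
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


def sha256_base64(value):
    """Base64-encode the raw 32-byte SHA-256 digest (not its hex string)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


email = normalize_gmail_style(" Jane.Doe+news@Gmail.com ")
encoded = sha256_base64(email)  # 44 characters

# The common mistake: base64-encoding the 64-character hex string gives 88 characters.
wrong = base64.b64encode(hashlib.sha256(email.encode("utf-8")).hexdigest().encode("ascii"))
```

Checking the length of the encoded value (44 characters for the correct form, 88 for the hex-string mistake) is a quick way to catch this error before upload.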