At a glance: Prepare your source data to maximize match rates between collaborator datasets and optimize collaboration outcomes.
About data preparation
The AppsFlyer Data Collaboration Platform (DCP) is designed to ingest and process large-scale transactional data, user-level data, or a combination of both from multiple collaborators. Once ingested, the DCP can match users across datasets using shared identifiers (for example, hashed email or hashed phone) and enable collaboration workflows based on that data.
This article outlines:
- The kinds of data the DCP can process (transactional, user-level, or a combination)
- How to structure data to maximize match rates between collaborators, and for optimal matching by activation partners
- Which data structures are recommended and which aren't
- How to handle multi-value identifiers when using CSV files
Understand data types
Most sources used in collaborations involve one or both of the following data types:
- User-level (identity) data
  - Goal: Represent each user and the identifiers used for matching
  - Recommended structure: One row per user
- Transactional (event) data
  - Goal: Represent user actions (purchases, visits, transactions)
  - Typical structure: Multiple rows per user (because a user can have many transactions)
Data structuring guidelines
This section outlines DCP data structuring guidelines and best practices.
Recommended data structures
To structure your source data for optimal ingestion and the best outcomes in the DCP:
- Ensure it meets source data formatting requirements.
- For best results, structure identity data as one row per user and include all available identifiers for that user. See Recommended identifier structures.
- If users have multiple email addresses or phone numbers, use array-based formats (such as Parquet, BigQuery, or Snowflake).
- Check that hashing guidelines are followed.
- For transactional datasets (and other data types where applicable), use the DCP Data View tool to create a business-oriented, privacy-safe subset of your data optimized for collaboration, and share it instead of raw tables. See Why Data Views are recommended.
While other structures can be used, they may not be ideally suited for getting the most out of DCP. See Data structures that aren't recommended.
Recommended identifier structures
To maximize match rates, it's recommended to send user-level identity data as a single row per user whenever possible and to include all available identifiers in a structure that accurately reflects how users are represented.
Option 1: One row per user, one value per identifier type
Simple and effective.
Use this when each user has at most one hashed email address, one hashed phone number, and so on.
Example
| User ID | Hashed Email | Hashed Phone |
|---|---|---|
| 123 | 5d41402 | 152d234b70 |
| 456 | 7d793037 | |
| 789 | 0e6e5414 | |
Why this is recommended
- Simplest to prepare and ingest
- Easy to validate and troubleshoot
- Strong matching performance when identifiers are complete and consistent (for matching with collaborator datasets and activation media partner matching)
Best fit
- Great for CSV-based uploads
- Also works well in Parquet / BigQuery / Snowflake
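As a minimal sketch of Option 1, the following writes a CSV with one row per user and one value per identifier type. The column names (`user_id`, `hashed_email`, `hashed_phone`) and the duplicate check are illustrative assumptions, not DCP requirements; the short hash values are placeholders like those in the example table.

```python
import csv
import io

# Hypothetical column names; use the names your own source configuration expects.
FIELDNAMES = ["user_id", "hashed_email", "hashed_phone"]

# Placeholder values standing in for full hashed identifiers.
users = [
    {"user_id": "123", "hashed_email": "5d41402", "hashed_phone": "152d234b70"},
    {"user_id": "456", "hashed_email": "7d793037", "hashed_phone": ""},
]


def write_identity_csv(rows, out):
    """Write one row per user and reject repeated user IDs."""
    seen = set()
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        if row["user_id"] in seen:
            raise ValueError(f"duplicate user_id: {row['user_id']}")
        seen.add(row["user_id"])
        writer.writerow(row)


buffer = io.StringIO()
write_identity_csv(users, buffer)
csv_text = buffer.getvalue()
```

A missing identifier is written as an empty cell, so every row keeps the same number of columns.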
Option 2: Arrays (lists) of identifiers for multi-value fields
Best for accuracy when users have multiple emails and phones.
Use this when a user can have multiple values of the same identifier type, such as:
- Multiple hashed emails
- Multiple hashed phones
This structure should be represented as native arrays.
Example
| User ID | Hashed Emails | Hashed Phones | MAID |
|---|---|---|---|
| u_0001 | [a1b2 · c3d4 · e5f6] | [p1q2 · r3s4] | m001 |
| u_0002 | [z9y8 · x7w6] | | m002 |
| u_0003 | [t1u2 · v3w4 · x5y6] | | |
Why this is preferred
- Reflects how identity data is often stored internally
- No need to guess how many “email columns” might be needed
- Avoids ongoing schema changes
- Maximizes matching potential (for collaborator datasets and for activation media partner matching) by allowing multiple identifiers per user
Best fit
- Parquet (S3/GCS)
- BigQuery
- Snowflake
Practical checklist for data structuring
Identity (user-level) dataset checklist
- One row represents one user.
- Include all available identifiers (hashed email, hashed phone, MAID, and so on).
- If multiple values exist (for example, multiple emails), use native arrays where possible.
- Avoid lists in cells, avoid “email_1/email_2/email_3,” and avoid multi-row identity structures.
Transactional (events) dataset checklist
- Keep event/transaction rows as-is. See Transactional data examples.
- Use Data Views to create collaboration-ready user-level summaries when appropriate.
Data Views checklist
- Build business-oriented Data Views (audience-ready, summarized, standardized).
- Share Data Views instead of raw sensitive tables whenever possible.
- Minimize shared fields to only what’s needed (for privacy and business protection).
- Consider aggregations to reduce access to sensitive data and improve efficiency.
Supported ingestion cloud services and formats
The DCP supports ingestion through multiple cloud services, such as:
- AWS S3
- GCS (Google Cloud Storage)
- BigQuery
- Snowflake
Format guidance
- Use CSV (comma-separated) when each user has one value per identifier type (one email, one phone, and so on).
- When users can have multiple values of the same identifier type (for example, multiple hashed emails), prefer array-capable formats:
  - Parquet (common in big data)
  - BigQuery (arrays are built-in)
  - Snowflake (arrays are built-in)
Notes:
- “Arrays” just means “a list of values.”
- Parquet and arrays in BigQuery/Snowflake are standard, and most data teams can support this naturally.
See Connector-specific example assets (AWS S3, GCS, BigQuery, Snowflake).
Data structures that aren't recommended
Some data structures are poorly suited for ingestion into the DCP. The following structures are not recommended:
Avoid: Multiple columns for the same identifier type
Examples like “primary/secondary/third email” are discouraged.
Example to avoid
| User ID | Email Hash 1 | Email Hash 2 | Email Hash 3 |
|---|---|---|---|
| 123 | a1b2 | c3d4 | |
| 456 | z9y8 | | |
Why to avoid
- The number of values per user is unpredictable
- Leads to column proliferation and schema maintenance issues
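If your data already uses numbered columns, a small transformation can collapse them into a single list per user before export to an array-capable format. This is a sketch only; the column prefix `email_hash_` and the output field name `hashed_emails` are hypothetical.

```python
def collapse_numbered_columns(row, prefix):
    """Collect non-empty values from columns like email_hash_1, email_hash_2 into one list."""
    numbered = sorted(
        (key for key in row if key.startswith(prefix)),
        key=lambda key: int(key[len(prefix):]),  # order by the numeric suffix
    )
    return [row[key] for key in numbered if row[key]]


# Hypothetical wide row mirroring the "example to avoid" table.
wide_row = {"user_id": "123", "email_hash_1": "a1b2", "email_hash_2": "c3d4", "email_hash_3": ""}

identity_row = {
    "user_id": wide_row["user_id"],
    "hashed_emails": collapse_numbered_columns(wide_row, "email_hash_"),
}
```

Empty columns are dropped, so the list length reflects how many values the user actually has.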
Avoid: “Lists inside a single cell” (free text)
Even if it looks convenient, it’s error-prone.
Example to avoid
| User ID | Hashed Emails (text) |
|---|---|
| 123 | a1b2 · c3d4 · e5f6 |
Why to avoid
- Not a true array type
- Formatting/escaping issues are common
- Small inconsistencies can break ingestion
Avoid: Multiple rows per user for identifiers (when your goal is a user identity table)
Don’t split identity data across rows when you can keep one row per user.
Example to avoid
| User ID | Identifier Type | Identifier Value |
|---|---|---|
| 123 | hashed_email | a1b2 |
| 123 | hashed_email | c3d4 |
Why to avoid
- Requires extra processing to reconstruct users
- Higher risk of duplicates and inconsistencies
- Less efficient at scale
Note: Multiple rows per user are normal for transactional tables, but not ideal for the identity (user-level) table used for matching.
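The "extra processing to reconstruct users" mentioned above can be sketched as a group-by that turns identifier rows into one row per user, with a list per identifier type. The tuple layout and output field names below are assumptions for illustration.

```python
from collections import defaultdict

# (user_id, identifier_type, identifier_value), as in the "example to avoid" table.
long_rows = [
    ("123", "hashed_email", "a1b2"),
    ("123", "hashed_email", "c3d4"),
    ("123", "hashed_phone", "p1q2"),
    ("456", "hashed_email", "z9y8"),
    ("123", "hashed_email", "a1b2"),  # duplicate row, dropped below
]


def to_one_row_per_user(rows):
    """Group identifier rows by user, keeping each distinct value once."""
    grouped = defaultdict(lambda: defaultdict(list))
    for user_id, id_type, value in rows:
        values = grouped[user_id][id_type]
        if value not in values:
            values.append(value)
    # Pluralize the type name for the list field, e.g. hashed_email -> hashed_emails.
    return [
        {"user_id": user_id, **{f"{id_type}s": values for id_type, values in ids.items()}}
        for user_id, ids in grouped.items()
    ]


identity_rows = to_one_row_per_user(long_rows)
```

Doing this once, before ingestion, also surfaces duplicate rows early.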
Additional information
This section provides additional, in-depth information on relevant topics.
Transactional data examples
Transactional/event data may have multiple rows per user:
| Transaction ID | User ID | Transaction Date | Amount | Category |
|---|---|---|---|---|
| t_001 | 123 | 2025-12-01 | 120.00 | Grocery |
| t_002 | 123 | 2025-12-15 | 45.00 | Pharmacy |
| t_003 | 456 | 2025-12-20 | 300.00 | Electronics |
Recommended
- Ingest transactional data in its natural structure
- Use the Data View tool to produce a collaboration-ready user-level view (for example, spend in last 30 days, category affinity, recency/frequency summaries)
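To illustrate what a user-level summary of transactional data might contain, the sketch below aggregates the example rows into spend, transaction count, and recency per user. It is plain Python for illustration only and does not represent how the Data View tool works; the output field names and the `as_of` date are assumptions.

```python
from collections import defaultdict
from datetime import date

transactions = [
    {"transaction_id": "t_001", "user_id": "123", "date": date(2025, 12, 1), "amount": 120.00},
    {"transaction_id": "t_002", "user_id": "123", "date": date(2025, 12, 15), "amount": 45.00},
    {"transaction_id": "t_003", "user_id": "456", "date": date(2025, 12, 20), "amount": 300.00},
]


def summarize_by_user(rows, as_of, window_days=30):
    """Aggregate event rows into one summary row per user for a lookback window."""
    totals = defaultdict(lambda: {"spend": 0.0, "count": 0, "last": None})
    for row in rows:
        age_days = (as_of - row["date"]).days
        if not 0 <= age_days < window_days:
            continue  # outside the lookback window
        entry = totals[row["user_id"]]
        entry["spend"] += row["amount"]
        entry["count"] += 1
        if entry["last"] is None or row["date"] > entry["last"]:
            entry["last"] = row["date"]
    return [
        {
            "user_id": user_id,
            "spend_in_window": round(entry["spend"], 2),
            "transactions_in_window": entry["count"],
            "days_since_last_transaction": (as_of - entry["last"]).days,
        }
        for user_id, entry in totals.items()
    ]


summary = summarize_by_user(transactions, as_of=date(2025, 12, 25))
```

The result has one row per user and no line-level transaction detail, which is the shape a collaboration-ready view typically aims for.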
Connector-specific example assets (AWS S3, GCS, BigQuery, Snowflake)
What does “array/list of identifiers” mean?
Sometimes a single user can have more than one email or phone (for example, a personal and a work email). Instead of forcing these into extra columns or messy text fields, we represent them as a list of values.
| User ID | Hashed Emails (List) |
|---|---|
| u_0001 | [a1b2 · c3d4 · e5f6] |
| u_0002 | [z9y8 · x7w6] |
1. AWS S3 and GCS connector (Parquet)
What the identity sample should contain (schema)
| Field | Type | Notes |
|---|---|---|
| user_id | string | One row per user |
| hashed_emails | array of strings | Multi-value identifier (recommended) |
| hashed_phones | array of strings | Multi-value identifier (recommended) |
| maid | string | Optional additional identifier |
Parquet supports lists natively, which is why it’s recommended when users have multiple emails/phones.
2. BigQuery connector (table preview and schema screenshot)
Example identity schema (BigQuery)
| Field | Type | Mode |
|---|---|---|
| user_id | STRING | NULLABLE |
| hashed_emails | STRING | REPEATED (ARRAY) |
| hashed_phones | STRING | REPEATED (ARRAY) |
| maid | STRING | NULLABLE |
Arrays are native in BigQuery (REPEATED fields). This is the recommended approach when multiple emails/phones exist.
3. Snowflake connector (table preview and schema screenshot)
Example identity schema (Snowflake)
| Field | Type | Notes |
|---|---|---|
| USER_ID | VARCHAR | One row per user |
| HASHED_EMAILS | ARRAY | Multi-value identifier |
| HASHED_PHONES | ARRAY | Multi-value identifier |
| MAID | VARCHAR | Optional |
Snowflake supports list-like fields (arrays). This is preferred over creating email_1/email_2/email_3 columns.
Why Data Views are recommended
Data Views are customized subsets of your existing data sources that you can create without changing the original dataset. They let you modify, refine, and summarize the data so that you share only what’s relevant to a collaboration: the shared data stays clear and focused, and all other data stays private and secure. Use Data Views in collaborations because they:
Are business-oriented and easier to collaborate with
- Data Views let you present the data in a clean, business-friendly structure (for example, “Customer Identity View”, “Eligible Audience View”, “Purchase Summary View”) rather than raw logs or complex transactional schemas.
- They help standardize definitions (for example, “active customer”, “total spend last 30 days”) so both sides are aligned and using the same logic.
Are more privacy-safe by design
- Data Views support data minimization: you can share only the fields and rows needed for the collaboration, instead of exposing full raw datasets.
- They reduce privacy risk by allowing you to:
  - Exclude unnecessary attributes
  - Limit granularity (for example, share aggregated metrics instead of line-level transactions)
  - Share only the identifiers needed for matching (and not additional sensitive columns)
Protect sensitive business data
- Raw transactional tables can contain highly sensitive business signals (for example, detailed purchase behavior, pricing patterns, merchant relationships, internal IDs).
- Data Views let you share only what’s required for the collaboration:
  - Example: share category-level aggregates instead of item-level details
  - Example: share spend buckets or summary metrics instead of exact line items
- This reduces the risk of unintentionally revealing competitive or proprietary information.
Can make your data more cost-effective and performant
Data Views can be structured to be lighter and more efficient, which often reduces:
- Data scanned
- Processing time
- Compute costs
Pre-structuring a collaboration-ready view (especially aggregated views) may make collaboration faster and more operationally predictable.
Multiple IDs/keys upload and matching behavior by platform
DCP identifiers upload
| Partner/Platform | Supported Identifier Types | Upload Mode | Upload Behavior | Identifier Priority (for single-type partners) |
|---|---|---|---|---|
| Meta | Email, Phone, IDFA, GAID, Android ID, First Name, Last Name, DOB, City, State, Zip, Country | All supported types | Uploads all supported identifier types | N/A - uploads all |
| | Email, Phone, IDFA, GAID, Android ID, Customer ID | Single type | Uploads one identifier type per audience | 1. Email 2. Phone 3. Mobile IDs 4. Customer ID |
| TikTok | Email, Phone, IDFA, GAID, Android ID | All supported types | Uploads all supported identifier types | N/A - uploads all |
| | Email, Phone, IDFA, GAID | Single type | Uploads one identifier type per audience | 1. Email 2. Phone 3. Mobile IDs |
| Infillion | Email, Phone, IDFA, GAID, MM-UUID | All supported types | Uploads all supported types grouped into up to 3 files | N/A - uploads all |
| The Trade Desk | Email, Phone | Single type | Uploads one identifier type per audience | 1. Email 2. Phone |
| AppsFlyer Data Locker | All available identifiers | All types combined | Delivers all identifiers in a single combined file | N/A - delivers all |
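For single-type partners, the priority column can be read as "use the highest-priority identifier type that is available." The sketch below illustrates that reading only; how the DCP applies the priority internally is not described here, and the type names and function are hypothetical.

```python
# Priority order as listed in the table for a single-type partner (illustrative labels).
PRIORITY = ["email", "phone", "mobile_id", "customer_id"]


def pick_identifier_type(available_types, priority=PRIORITY):
    """Return the highest-priority identifier type that is available, or None."""
    for id_type in priority:
        if id_type in available_types:
            return id_type
    return None


chosen = pick_identifier_type({"phone", "mobile_id"})
```

Under this reading, an audience that has phone numbers and mobile IDs but no emails would be uploaded by phone.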
Matching behavior by platform
| Platform | Can you send multiple IDs per user? | Can they match on any of them? | Can they combine multiple IDs? | Matching model (practical) | Notes |
|---|---|---|---|---|---|
| Meta | Yes | Yes | Yes | Deterministic, any-ID / combo | Email, phone, MAID, fbp/fbc all evaluated; strongest “any-of-them” behavior |
| Google | Yes | Limited | No | Deterministic, prioritized | One ID resolves the user; IDs are not combined to form a match |
| TikTok | Yes | Yes | Yes | Deterministic, any-ID / combo | Very similar to Meta; ttclid strong for events |
| Pinterest | Yes | Partial | Limited | Deterministic, ID-centric | Primarily email + MAID; limited cross-ID reinforcement |
| The Trade Desk (TTD) | Yes | Via ID graph | Yes (via graph) | Deterministic via identity graph | UID2 + MAID + cookies resolved through EUID/UID2 graph |
| Infillion | Yes | Via ID graph | Yes (via graph) | Deterministic via partner graph | Relies on OmniID / partner identity resolution |
Takeaways for activation planning
- Meta and TikTok
  - Send everything you have
  - True “any-ID” behavior
  - Best platforms for identity-rich clean room activation
- Google
  - Send multiple IDs for coverage
  - Do not expect cross-ID reinforcement
  - Email/phone dominance is real
- TTD and Infillion
  - Matching quality depends more on identity graph participation
  - UID2 / OmniID coverage matters more than raw identifier count
- Pinterest
  - Behaves closer to Google than Meta
  - Limited upside beyond email + MAID
TL;DR
- Only Meta and TikTok do “any-of-them” matching natively
- Google is deterministic but single-ID resolved
- Open web platforms match via identity graphs, not raw IDs
Hashed data formatting guidelines by media partner
Universal normalization rules:
Before hashing any data, almost all platforms require the following normalization steps to ensure matching accuracy:
- Text: Convert to lowercase and trim all leading/trailing whitespace.
- Phone: Convert to E.164 format (e.g., +15551234567).
- Hashing: SHA-256 is the industry standard, though some platforms accept MD5 or SHA-1.
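The normalization steps above can be sketched as follows. This is a minimal illustration, not a complete implementation: the helper names are hypothetical, and the phone handling assumes a known default country code and does no validation (production code would typically rely on a dedicated phone-number library).

```python
import hashlib
import re


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a normalized value as a lowercase hex string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def normalize_email(email: str) -> str:
    """Trim leading/trailing whitespace and convert to lowercase."""
    return email.strip().lower()


def normalize_phone(phone: str, default_country_code: str) -> str:
    """Rough E.164 conversion; assumes default_country_code applies when there is no leading '+'."""
    phone = phone.strip()
    digits = re.sub(r"\D", "", phone)  # keep digits only
    if phone.startswith("+"):
        return "+" + digits
    return "+" + default_country_code + digits.lstrip("0")


hashed_email = sha256_hex(normalize_email("  Jane.Doe@Example.COM "))
hashed_phone = sha256_hex(normalize_phone("(555) 123-4567", default_country_code="1"))
```

Normalizing before hashing matters because two spellings of the same email or phone otherwise produce different hashes and never match.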
| Media Partner | Email Guidelines | Phone Number Guidelines | Mobile ID (MAID) Guidelines | Specific Platform Nuances & Matching Logic | Relevant Reference |
|---|---|---|---|---|---|
| Google Ads / DV360 | Format: Lowercase, trim whitespace. Include domain. Gmail Rule: For @gmail.com / @googlemail.com, remove periods (.) before the "@" and suffixes after (+). Hashing: SHA-256 (Hex encoded). | Format: E.164 (e.g., +16505551212). Must include country code. Hashing: SHA-256. | Format: IDFA (iOS) or GAID (Android). Hashing: Do not hash. | Matching Priority: Deterministic and prioritized (1. Email, 2. Phone, 3. Mobile IDs). It does not combine multiple IDs to form a match (single-ID resolution). DV360 PAIR: DV360 also supports the PAIR protocol for secure publisher-advertiser reconciliation. | Get started with Customer Match (Google Ads API) |
| Meta (Facebook/Instagram) | Format: Lowercase, trim whitespace. Hashing: SHA-256 required. | Format: Remove symbols, letters, and leading zeros. Include country code (e.g., 16505551212). Hashing: SHA-256. | Format: IDFA or GAID; lowercase, keep hyphens. Hashing: Do not hash. | Matching Logic: Uses "any-of-them" logic; combines multiple IDs (Email, Phone, MAID, etc.) to resolve a user, providing strong cross-ID reinforcement. CAPI: Requires unhashed client_ip_address and client_user_agent for browser matching. | Customer Information Parameters - Conversions API |
| TikTok | Format: Lowercase, remove all spaces. Hashing: SHA-256. | Format: E.164 (+[country code][number]). Also accepts without +. Hashing: SHA-256. | Format: GAID (Android) or IDFA (iOS). Case: All upper or all lower. Hashing: SHA-256 or MD5. | Matching Logic: Similar to Meta, uses "any-of-them" deterministic matching logic where any single valid ID is sufficient. Audience Size: Requires a minimum of 1,000 matched users to activate a Custom Audience. | How to create a Custom Audience with a customer file |
| The Trade Desk (UID2) | Format: Same "Gmail Rule" as Google (remove dots/plus). Hashing: SHA-256. Encoding: Base64-encode the raw bytes (result is 44 chars). | Format: E.164. Hashing: SHA-256. Encoding: Base64-encode the raw bytes. | Format: AAID (lowercase) / IDFA (uppercase). Hashing: SHA-256, MD5, or SHA-1. | Matching Logic: Deterministic via identity graph. Matching depends on graph participation rather than raw ID coverage. Critical Nuance: You must base64-encode the raw bytes of the hash. Encoding the hex string will fail. | |
| Pinterest | Format: Must include "@". Hashing: SHA-256, MD5, or SHA-1. | Format: E.164. Remove special characters (spaces, hyphens). Hashing: SHA-256, MD5, or SHA-1. | Format: IDFA or GAID. Hashing: SHA-256, MD5, or SHA-1. | Matching Logic: "ID-Centric." Matching is primarily driven by Email and MAID. There is "limited upside" to providing other IDs as cross-ID reinforcement is weak compared to Meta. Upload Behavior: Uploads one identifier type per audience. | DCP Data Ingestion and Structuring Guidelines |
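Two nuances from the table are easy to get wrong: the Gmail rule and base64-encoding the raw hash bytes rather than the hex string. The sketch below illustrates both under stated assumptions: the function names are hypothetical, and the domain list is taken from the Google row.

```python
import base64
import hashlib

# Domains the Gmail rule applies to, per the Google row in the table.
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def normalize_gmail_style(email: str) -> str:
    """Lowercase and trim; for Gmail domains, drop any '+suffix' and remove periods before '@'."""
    email = email.strip().lower()
    local, _, domain = email.rpartition("@")
    if domain in GMAIL_DOMAINS:
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


def sha256_base64(value: str) -> str:
    """Base64-encode the 32 raw digest bytes (44 characters)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def sha256_hex_then_base64(value: str) -> str:
    """The mistake to avoid: base64-encoding the 64-character hex string instead."""
    hex_string = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return base64.b64encode(hex_string.encode("ascii")).decode("ascii")


normalized = normalize_gmail_style(" Jane.Doe+News@Gmail.com ")
correct = sha256_base64(normalized)
wrong = sha256_hex_then_base64(normalized)
```

A quick length check catches the mistake: the correct value is 44 characters long, while encoding the hex string yields 88 characters.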