Data Collaboration Platform (DCP) - Prepare source data for ingestion

At a glance: Prepare your source data to maximize match rates between collaborator datasets and optimize collaboration outcomes.

About data preparation

The AppsFlyer Data Collaboration Platform (DCP) is designed to ingest and process large-scale transactional data, user-level data, or a combination of both from multiple collaborators. Once ingested, the DCP can match users across datasets using shared identifiers (for example, hashed email or hashed phone) and enable collaboration workflows based on that data.

This article outlines:

  • The kinds of data the DCP can process (transactional, user-level, or a combination)
  • How to structure data to maximize match rates, both between collaborators and with activation partners
  • Which data structures are recommended and which aren't
  • How to handle multi-value identifiers when using CSV files

Understand data types

Most sources used in collaborations involve one or both of the following data types:

  • User-level (identity) data
    • Goal: Represent each user and the identifiers used for matching
    • Recommended structure: One row per user
  • Transactional (event) data
    • Goal: Represent user actions (purchases, visits, transactions)
    • Typical structure: Multiple rows per user (because a user can have many transactions)

Data structuring guidelines

This section outlines DCP data structuring guidelines and best practices.

Recommended data structures

To structure your source data for optimal ingestion and the best collaboration outcomes in the DCP:

  1. Ensure the data meets the source data formatting requirements.
  2. For best results, structure identity data as one row per user and include all available identifiers for that user. See Recommended identifier structures.
    • If users have multiple email addresses or phone numbers, use array-based formats (such as Parquet, BigQuery, or Snowflake).
  3. Check that hashing guidelines are followed.
  4. For transactional datasets (and other data types where applicable), use the DCP Data View tool to create a business-oriented, privacy-safe subset of your data optimized for collaboration, and share it instead of raw tables. See Why Data Views are recommended.

While other structures can be used, they may not be ideally suited for getting the most out of DCP. See Data structures that aren't recommended.

Recommended identifier structures

To maximize match rates, it's recommended to send user-level identity data as a single row per user whenever possible and to include all available identifiers in a structure that accurately reflects how users are represented.

Option 1: One row per user, one value per identifier type

Simple and effective.

Use this when each user has at most one hashed email address, one hashed phone number, and so on.

Example

User ID Hashed Email Hashed Phone
123 5d41402 152d234b70
456 7d793037
789 0e6e5414

Why this is recommended

  • Simplest to prepare and ingest
  • Easy to validate and troubleshoot
  • Strong matching performance when identifiers are complete and consistent, both against collaborator datasets and with activation media partners

Best fit

  • Great for CSV-based uploads
  • Also works well in Parquet / BigQuery / Snowflake
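
As a rough illustration of Option 1, the sketch below writes one row per user to CSV with Python's standard csv module and leaves a cell empty when an identifier is missing. The column names (user_id, hashed_email, hashed_phone) and the choice of which column each sample value belongs to are assumptions made for illustration, not a DCP-mandated schema; check the source data formatting requirements for the actual headers.

```python
import csv
import io

# Hypothetical column names; the headers the DCP expects may differ.
FIELDS = ["user_id", "hashed_email", "hashed_phone"]

# One dict per user. A missing identifier is simply left out of the dict.
# Values are the truncated placeholders from the example table, placed by assumption.
users = [
    {"user_id": "123", "hashed_email": "5d41402", "hashed_phone": "152d234b70"},
    {"user_id": "456", "hashed_email": "7d793037"},
    {"user_id": "789", "hashed_phone": "0e6e5414"},
]


def to_csv(rows):
    """Serialize one row per user, writing an empty cell for absent identifiers."""
    seen = set()
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Option 1 assumes exactly one row per user, so reject repeats early.
        if row["user_id"] in seen:
            raise ValueError(f"duplicate user_id: {row['user_id']}")
        seen.add(row["user_id"])
        writer.writerow(row)
    return buffer.getvalue()


csv_text = to_csv(users)
print(csv_text)
```

Rejecting duplicate user IDs at write time is a design choice of this sketch; it keeps the one-row-per-user guarantee easy to verify before upload.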

Option 2: Arrays (lists) of identifiers for multi-value fields

Best for accuracy when users have multiple emails or phones.

Use this when a user can have multiple values of the same identifier type, such as:

  • Multiple hashed emails
  • Multiple hashed phones

This structure should be represented as native arrays.

Example

User ID Hashed Emails Hashed Phones MAID
u_0001 [a1b2 · c3d4 · e5f6] [p1q2 · r3s4] m001
u_0002 [z9y8 · x7w6] m002
u_0003 [t1u2 · v3w4 · x5y6]

Why this is preferred

  • Reflects how identity data is often stored internally
  • No need to guess how many “email columns” might be needed
  • Avoids ongoing schema changes
  • Maximizes matching potential (for collaborator datasets and for activation media partner matching) by allowing multiple identifiers per user

Best fit

  • Parquet (S3/GCS)
  • BigQuery
  • Snowflake

Practical checklist for data structuring

Identity (user-level) dataset checklist 

  • One row represents one user. 
  • Include all available identifiers (hashed email, hashed phone, MAID, and so on). 
  • If multiple values exist (for example, multiple emails), use native arrays where possible. 
  • Avoid free-text lists in cells, numbered columns such as email_1/email_2/email_3, and multi-row identity structures.
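
The identity checklist can be turned into a quick pre-upload check. The following is a minimal sketch, assuming rows are already loaded as Python dictionaries; the field names (user_id, hashed_emails, hashed_phones, maid) are hypothetical and the check is not part of the DCP itself.

```python
# Hypothetical field names; adjust to your own schema.
ARRAY_FIELDS = ("hashed_emails", "hashed_phones")
SINGLE_VALUE_FIELDS = ("maid",)


def check_identity_rows(rows):
    """Return a list of human-readable problems found in identity rows."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        user_id = row.get("user_id")
        if not user_id:
            problems.append(f"row {i}: missing user_id")
        elif user_id in seen:
            problems.append(f"row {i}: user {user_id} appears more than once")
        else:
            seen.add(user_id)
        # Multi-value identifiers should be native lists, not delimited text.
        for field in ARRAY_FIELDS:
            value = row.get(field) or []
            if not isinstance(value, list):
                problems.append(f"row {i}: {field} is not a native list")
        # A row with no identifiers at all cannot be matched.
        if not any(row.get(f) for f in ARRAY_FIELDS + SINGLE_VALUE_FIELDS):
            problems.append(f"row {i}: no identifiers to match on")
    return problems


rows = [
    {"user_id": "u_0001", "hashed_emails": ["a1b2", "c3d4"], "maid": "m001"},
    {"user_id": "u_0001", "hashed_emails": ["z9y8"]},        # duplicate user
    {"user_id": "u_0002", "hashed_emails": "a1b2 · c3d4"},  # text list, not an array
    {"user_id": "u_0003"},                                   # nothing to match on
]
problems = check_identity_rows(rows)
for problem in problems:
    print(problem)
```

With the sample rows, the check reports one problem for each of the last three rows and none for the first.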

Transactional (events) dataset checklist 

Data Views checklist 

  • Build business-oriented Data Views (audience-ready, summarized, standardized). 
  • Share Data Views instead of raw sensitive tables whenever possible. 
  • Minimize shared fields to only what’s needed (for privacy and business protection). 
  • Consider aggregations to reduce access to sensitive data and improve efficiency.

Supported ingestion cloud services and formats

The DCP supports ingestion through multiple cloud services, such as:

  • AWS S3
  • GCS (Google Cloud Storage)
  • BigQuery
  • Snowflake

Format guidance

  • Use CSV (comma-separated) when each user has one value per identifier type (one email, one phone, and so on).
  • When users can have multiple values of the same identifier type (for example, multiple hashed emails), prefer array-capable formats:
    • Parquet (common in big data)
    • BigQuery (arrays are built-in)
    • Snowflake (arrays are built-in)

Notes: 

  • “Arrays” just means “a list of values.”
  • Parquet files and native arrays in BigQuery and Snowflake are standard, and most data teams already support them.

See Connector-specific example assets (AWS S3, GCS, BigQuery, Snowflake).

Data structures that aren't recommended

Some data structures are well suited for ingestion into the DCP, and others are not. The following structures are not recommended:

Avoid: Multiple columns for the same identifier type

Examples like “primary/secondary/third email” are discouraged.

Example to avoid

User ID Email Hash 1 Email Hash 2 Email Hash 3
123 a1b2 c3d4
456 z9y8

Why to avoid

  • The number of values per user is unpredictable
  • Leads to column proliferation and schema maintenance issues

Avoid: “Lists inside a single cell” (free text)

Even if it looks convenient, it’s error-prone.

Example to avoid

User ID Hashed Emails (text)
123 a1b2 · c3d4 · e5f6

Why to avoid

  • Not a true array type
  • Formatting/escaping issues are common
  • Small inconsistencies can break ingestion
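
To show why text lists are fragile, here is a minimal sketch of the parsing needed to recover a list from such a cell. The set of delimiters is an assumption based on common exports (middle dot, comma, semicolon, pipe, whitespace); a real file may use others, which is exactly the kind of inconsistency that native arrays avoid.

```python
import re

# Assumed delimiters; any delimiter not listed here would silently glue values together.
_DELIMITERS = re.compile(r"[·,;|\s]+")


def split_text_list(cell):
    """Split a free-text list into a de-duplicated list, preserving first-seen order."""
    stripped = cell.strip().strip("[]")
    values = [value for value in _DELIMITERS.split(stripped) if value]
    return list(dict.fromkeys(values))


print(split_text_list("a1b2 · c3d4 · e5f6"))
print(split_text_list("[a1b2,c3d4; a1b2]"))
```

Converting once, upstream, and then storing the result as a native array removes the need for every consumer to repeat this guesswork.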

Avoid: Multiple rows per user for identifiers (when your goal is a user identity table)

Don’t split identity data across rows when you can keep one row per user.

Example to avoid

User ID Identifier Type Identifier Value
123 hashed_email a1b2
123 hashed_email c3d4

Why to avoid

  • Requires extra processing to reconstruct users
  • Higher risk of duplicates and inconsistencies
  • Less efficient at scale

Note: Multiple rows per user are normal for transactional tables, but not ideal for the identity (user-level) table used for matching.
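
A multi-row identity table can be regrouped into one row per user before ingestion. The following is a sketch under stated assumptions: the input keys (user_id, identifier_type, identifier_value) mirror the example above, and the mapping from identifier type to output column is hypothetical.

```python
from collections import defaultdict

# Assumed mapping from identifier type to the array column it feeds.
COLUMN_FOR_TYPE = {"hashed_email": "hashed_emails", "hashed_phone": "hashed_phones"}


def pivot_to_one_row_per_user(long_rows):
    """Group (user_id, identifier_type, identifier_value) rows into one record per user."""
    users = defaultdict(lambda: {column: [] for column in COLUMN_FOR_TYPE.values()})
    for row in long_rows:
        column = COLUMN_FOR_TYPE.get(row["identifier_type"])
        if column is None:
            continue  # unknown identifier type: skipped in this sketch
        values = users[row["user_id"]][column]
        if row["identifier_value"] not in values:  # drop exact duplicates
            values.append(row["identifier_value"])
    return [{"user_id": user_id, **columns} for user_id, columns in users.items()]


long_rows = [
    {"user_id": "123", "identifier_type": "hashed_email", "identifier_value": "a1b2"},
    {"user_id": "123", "identifier_type": "hashed_email", "identifier_value": "c3d4"},
    {"user_id": "123", "identifier_type": "hashed_email", "identifier_value": "a1b2"},
]
identity_rows = pivot_to_one_row_per_user(long_rows)
print(identity_rows)
```

De-duplicating while grouping addresses the duplicate risk listed above in the same pass.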

Additional information

This section provides additional, in-depth information on relevant topics.

Transactional data examples

Transactional/event data may have multiple rows per user:

Transaction ID | User ID | Transaction Date | Amount | Category
t_001 | 123 | 2025-12-01 | 120.00 | Grocery
t_002 | 123 | 2025-12-15 | 45.00 | Pharmacy
t_003 | 456 | 2025-12-20 | 300.00 | Electronics

Recommended

  • Ingest transactional data in its natural structure
  • Use the Data View tool to produce a collaboration-ready user-level view (for example, spend in last 30 days, category affinity, recency/frequency summaries)
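
To make the idea of a collaboration-ready user-level view concrete, the sketch below rolls the sample transactions up to one summary row per user (spend, frequency, and recency within a window). It is plain Python for illustration only and says nothing about how the Data View tool is implemented; the function name, the 30-day default, and the as-of date are assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta

transactions = [
    {"transaction_id": "t_001", "user_id": "123", "date": date(2025, 12, 1), "amount": 120.00, "category": "Grocery"},
    {"transaction_id": "t_002", "user_id": "123", "date": date(2025, 12, 15), "amount": 45.00, "category": "Pharmacy"},
    {"transaction_id": "t_003", "user_id": "456", "date": date(2025, 12, 20), "amount": 300.00, "category": "Electronics"},
]


def summarize(rows, as_of, window_days=30):
    """Roll transactions up to one summary row per user (spend, frequency, recency)."""
    cutoff = as_of - timedelta(days=window_days)
    summary = defaultdict(lambda: {"spend": 0.0, "transactions": 0, "last_purchase": None})
    for row in rows:
        if not (cutoff < row["date"] <= as_of):
            continue  # outside the lookback window
        user = summary[row["user_id"]]
        user["spend"] += row["amount"]
        user["transactions"] += 1
        if user["last_purchase"] is None or row["date"] > user["last_purchase"]:
            user["last_purchase"] = row["date"]
    return {
        user_id: {**stats, "days_since_last_purchase": (as_of - stats["last_purchase"]).days}
        for user_id, stats in summary.items()
    }


user_view = summarize(transactions, as_of=date(2025, 12, 30))
for user_id, stats in user_view.items():
    print(user_id, stats)
```

With an as-of date of 2025-12-30, user 123 ends up with two transactions totaling 165.00 and user 456 with one transaction of 300.00, so the shared view carries three summary fields per user instead of individual purchases.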

Connector-specific example assets (AWS S3, GCS, BigQuery, Snowflake)

What does “array/list of identifiers” mean?

Sometimes a single user can have more than one email or phone (for example, a personal and a work email). Instead of forcing these into extra columns or messy text fields, we represent them as a list of values.

User ID Hashed Emails (List)
u_0001 [a1b2 · c3d4 · e5f6]
u_0002 [z9y8 · x7w6]

1. AWS S3 and GCS connector (Parquet)

What the identity sample should contain (schema)

Field | Type | Notes
user_id | string | One row per user
hashed_emails | array of strings | Multi-value identifier (recommended)
hashed_phones | array of strings | Multi-value identifier (recommended)
maid | string | Optional additional identifier

Parquet supports lists natively, which is why it’s recommended when users have multiple emails/phones.

2. BigQuery connector (table preview and schema screenshot)

Example identity schema (BigQuery)

Field | Type | Mode
user_id | STRING | NULLABLE
hashed_emails | STRING | REPEATED (ARRAY)
hashed_phones | STRING | REPEATED (ARRAY)
maid | STRING | NULLABLE

Arrays are native in BigQuery (REPEATED fields). This is the recommended approach when multiple emails/phones exist.

3. Snowflake connector (table preview and schema screenshot)

Example identity schema (Snowflake)

Field | Type | Notes
USER_ID | VARCHAR | One row per user
HASHED_EMAILS | ARRAY | Multi-value identifier
HASHED_PHONES | ARRAY | Multi-value identifier
MAID | VARCHAR | Optional

Snowflake supports list-like fields (arrays). This is preferred over creating email_1/email_2/email_3 columns.

Why Data Views are recommended

Data Views are customized subsets of your existing data sources that you can create without changing the original dataset. They let you modify, refine, and summarize the data so that you share only what's relevant to a collaboration: the shared data stays clear and focused, and everything else stays private and secure. Use Data Views in collaborations because they:

Are business-oriented and easier to collaborate with

  • Data Views let you present the data in a clean, business-friendly structure (for example, “Customer Identity View”, “Eligible Audience View”, “Purchase Summary View”) rather than raw logs or complex transactional schemas.
  • They help standardize definitions (for example, “active customer”, “total spend last 30 days”) so both sides are aligned and using the same logic.

Are more privacy-safe by design

  • Data Views support data minimization: you can share only the fields and rows needed for the collaboration, instead of exposing full raw datasets.
  • They reduce privacy risk by allowing you to:
    • Exclude unnecessary attributes
    • Limit granularity (for example, share aggregated metrics instead of line-level transactions)
    • Share only the identifiers needed for matching (and not additional sensitive columns)

Protect sensitive business data

  • Raw transactional tables can contain highly sensitive business signals (for example, detailed purchase behavior, pricing patterns, merchant relationships, internal IDs).
  • Data Views let you share only what’s required for the collaboration:
    • Example: share category-level aggregates instead of item-level details
    • Example: share spend buckets or summary metrics instead of exact line items
  • This reduces the risk of unintentionally revealing competitive or proprietary information.
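
The two examples above (category-level aggregates and spend buckets) can be sketched as follows. This is an illustration in plain Python, not the Data View tool itself; the bucket thresholds and labels are hypothetical and should be replaced with values that suit your business.

```python
from collections import Counter

# Hypothetical bucket lower bounds, in ascending order.
BUCKETS = [(0, "low"), (100, "medium"), (250, "high")]


def spend_bucket(total):
    """Return the label of the highest bucket whose lower bound is <= total."""
    label = BUCKETS[0][1]
    for lower_bound, name in BUCKETS:
        if total >= lower_bound:
            label = name
    return label


def shareable_view(line_items):
    """Reduce line items to per-user spend buckets and category-level totals."""
    per_user = Counter()
    per_category = Counter()
    for item in line_items:
        per_user[item["user_id"]] += item["amount"]
        per_category[item["category"]] += item["amount"]
    buckets = {user_id: spend_bucket(total) for user_id, total in per_user.items()}
    return buckets, dict(per_category)


line_items = [
    {"user_id": "123", "amount": 120.00, "category": "Grocery"},
    {"user_id": "123", "amount": 45.00, "category": "Pharmacy"},
    {"user_id": "456", "amount": 300.00, "category": "Electronics"},
]
buckets, category_totals = shareable_view(line_items)
print(buckets)
print(category_totals)
```

Neither output contains exact per-user line items: collaborators see a coarse bucket per user and totals per category.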

Can make your data more cost-effective and performant

Data Views can be structured to be lighter and more efficient, which often reduces:

  • Data scanned
  • Processing time
  • Compute costs

Pre-structuring a collaboration-ready view (especially aggregated views) may make collaboration faster and more operationally predictable.

Multiple IDs/keys upload and matching behavior by platform

DCP identifiers upload 

Partner/Platform | Supported Identifier Types | Upload Mode | Upload Behavior | Identifier Priority (for single-type partners)
Meta | Email, Phone, IDFA, GAID, Android ID, First Name, Last Name, DOB, City, State, Zip, Country | All supported types | Uploads all supported identifier types | N/A - uploads all
Google | Email, Phone, IDFA, GAID, Android ID, Customer ID | Single type | Uploads one identifier type per audience | 1. Email 2. Phone 3. Mobile IDs 4. Customer ID
TikTok | Email, Phone, IDFA, GAID, Android ID | All supported types | Uploads all supported identifier types | N/A - uploads all
Pinterest | Email, Phone, IDFA, GAID | Single type | Uploads one identifier type per audience | 1. Email 2. Phone 3. Mobile IDs
Infillion | Email, Phone, IDFA, GAID, MM-UUID | All supported types | Uploads all supported types grouped into up to 3 files | N/A - uploads all
The Trade Desk | Email, Phone | Single type | Uploads one identifier type per audience | 1. Email 2. Phone
AppsFlyer Data Locker | All available identifiers | All types combined | Delivers all identifiers in a single combined file | N/A - delivers all

Platform | Can you send multiple IDs per user? | Can they match on any of them? | Can they combine multiple IDs? | Matching model (practical) | Notes
Meta | Yes | Yes | Yes | Deterministic, any-ID / combo | Email, phone, MAID, fbp/fbc all evaluated; strongest “any-of-them” behavior
Google | Yes | Limited | No | Deterministic, prioritized | One ID resolves the user; IDs are not combined to form a match
TikTok | Yes | Yes | Yes | Deterministic, any-ID / combo | Very similar to Meta; ttclid strong for events
Pinterest | Yes | Partial | Limited | Deterministic, ID-centric | Primarily email + MAID; limited cross-ID reinforcement
The Trade Desk (TTD) | Yes | Via ID graph | Yes (via graph) | Deterministic via identity graph | UID2 + MAID + cookies resolved through EUID/UID2 graph
Infillion | Yes | Via ID graph | Yes (via graph) | Deterministic via partner graph | Relies on OmniID / partner identity resolution
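
For single-type partners, the Identifier Priority column can be read as an ordered lookup. The sketch below assumes that the type uploaded is the highest-priority type present in the audience; the article does not spell out the exact selection rule, so treat that as an assumption. The lowercase type names and the function name are hypothetical.

```python
# Priority orders taken from the Identifier Priority column for single-type partners.
PRIORITY = {
    "Google": ["email", "phone", "mobile_id", "customer_id"],
    "Pinterest": ["email", "phone", "mobile_id"],
    "The Trade Desk": ["email", "phone"],
}


def pick_identifier_type(partner, available_types):
    """Return the highest-priority identifier type the audience contains, or None."""
    for id_type in PRIORITY[partner]:
        if id_type in available_types:
            return id_type
    return None


print(pick_identifier_type("Google", {"phone", "customer_id"}))
print(pick_identifier_type("The Trade Desk", {"mobile_id"}))
```

Under this reading, an audience that contains only mobile IDs would have nothing to send to a partner whose list stops at phone, which is one reason to include every identifier you have.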

Takeaways for activation planning

  • Meta and TikTok
    • Send everything you have
    • True “any-ID” behavior
    • Best platforms for identity-rich clean room activation
  • Google
    • Send multiple IDs for coverage
    • Do not expect cross-ID reinforcement
    • Email and phone take priority over other identifier types
  • TTD and Infillion
    • Matching quality depends more on identity graph participation
    • UID2 / OmniID coverage matters more than raw identifier count
  • Pinterest
    • Behaves closer to Google than Meta
    • Limited upside beyond email + MAID

TL;DR

  • Only Meta and TikTok do “any-of-them” matching natively
  • Google is deterministic but single-ID resolved
  • Open web platforms match via identity graphs, not raw IDs

Hashed data formatting guidelines by media partner

Universal normalization rules:
Before hashing any data, almost all platforms require the following normalization steps to ensure matching accuracy:

  • Text: Convert to lowercase and trim all leading/trailing whitespace.
  • Phone: Convert to E.164 format (e.g., +15551234567)
  • Hashing: SHA-256 is the industry standard, though some platforms accept MD5 or SHA-1.
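
The universal rules can be sketched with Python's standard library as follows. This is a simplified illustration: the phone helper only strips non-digits and prepends a country code, which is an assumption that will not handle every national format (a dedicated phone-number library is the safer choice), and the default_country_code parameter is hypothetical.

```python
import hashlib
import re


def normalize_email(email):
    """Lowercase and trim surrounding whitespace."""
    return email.strip().lower()


def normalize_phone(phone, default_country_code):
    """Best-effort E.164: keep digits, add the country code if no leading '+' was given."""
    digits = re.sub(r"\D", "", phone)
    if phone.strip().startswith("+"):
        return "+" + digits
    # Assumption: a leading 0 is a national trunk prefix and can be dropped.
    return "+" + default_country_code + digits.lstrip("0")


def sha256_hex(value):
    """SHA-256 of the UTF-8 bytes, as a 64-character lowercase hex string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


hashed_email = sha256_hex(normalize_email("  Jane.Doe@Example.COM "))
hashed_phone = sha256_hex(normalize_phone("(555) 123-4567", default_country_code="1"))
print(hashed_email)
print(hashed_phone)
```

Normalizing before hashing matters because a hash changes completely with any difference in case or whitespace, so two collaborators who skip this step will not match on the same address.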

Google Ads / DV360

  • Email Guidelines
    • Format: Lowercase, trim whitespace. Include domain.
    • Gmail Rule: For @gmail.com / @googlemail.com, remove periods (.) before the "@" and suffixes after (+).
    • Hashing: SHA-256 (Hex encoded).
  • Phone Number Guidelines
    • Format: E.164 (e.g., +16505551212). Must include country code.
    • Hashing: SHA-256.
  • Mobile ID (MAID) Guidelines
    • Format: IDFA (iOS) or GAID (Android).
    • Hashing: Do not hash.
  • Specific Platform Nuances & Matching Logic
    • Matching Priority: Deterministic and prioritized (1. Email, 2. Phone, 3. Mobile IDs). It does not combine multiple IDs to form a match (single-ID resolution).
    • DV360 PAIR: DV360 also supports the PAIR protocol for secure publisher-advertiser reconciliation.
  • Relevant Reference: Get started with Customer Match (Google Ads API)

Meta (Facebook/Instagram)

  • Email Guidelines
    • Format: Lowercase, trim whitespace.
    • Hashing: SHA-256 required.
  • Phone Number Guidelines
    • Format: Remove symbols, letters, and leading zeros. Include country code (e.g., 16505551212).
    • Hashing: SHA-256.
  • Mobile ID (MAID) Guidelines
    • Format: IDFA or GAID.
    • Hashing: Do not hash. Lowercase, keep hyphens.
  • Specific Platform Nuances & Matching Logic
    • Matching Logic: Uses "any-of-them" logic; combines multiple IDs (Email, Phone, MAID, etc.) to resolve a user, providing strong cross-ID reinforcement.
    • CAPI: Requires unhashed client_ip_address and client_user_agent for browser matching.
  • Relevant Reference: Customer Information Parameters - Conversions API

TikTok

  • Email Guidelines
    • Format: Lowercase, remove all spaces.
    • Hashing: SHA-256.
  • Phone Number Guidelines
    • Format: E.164 (+[country code][number]). Also accepts without +.
    • Hashing: SHA-256.
  • Mobile ID (MAID) Guidelines
    • Format: GAID (Android) or IDFA (iOS).
    • Case: All upper or all lower.
    • Hashing: SHA-256 or MD5.
  • Specific Platform Nuances & Matching Logic
    • Matching Logic: Similar to Meta, uses "any-of-them" deterministic matching logic where any single valid ID is sufficient.
    • Audience Size: Requires a minimum of 1,000 matched users to activate a Custom Audience.
  • Relevant Reference: How to create a Custom Audience with a customer file

The Trade Desk (UID2)

  • Email Guidelines
    • Format: Same "Gmail Rule" as Google (remove dots/plus).
    • Encoding: Base64-encode the raw bytes (result is 44 chars).
  • Phone Number Guidelines
    • Format: E.164.
    • Hashing: SHA-256.
    • Encoding: Base64-encode the raw bytes.
  • Mobile ID (MAID) Guidelines
    • Format: AAID (lowercase) / IDFA (uppercase).
    • Hashing: SHA-256, MD5, or SHA-1.
  • Specific Platform Nuances & Matching Logic
    • Matching Logic: Deterministic via identity graph. Matching depends on graph participation rather than raw ID coverage.
    • Critical Nuance: You must base64-encode the raw bytes of the hash. Encoding the hex string will fail.

Pinterest

  • Email Guidelines
    • Format: Must include "@".
    • Hashing: SHA-256, MD5, or SHA-1.
  • Phone Number Guidelines
    • Format: E.164. Remove special characters (spaces, hyphens).
    • Hashing: SHA-256, MD5, or SHA-1.
  • Mobile ID (MAID) Guidelines
    • Format: IDFA or GAID.
    • Hashing: SHA-256, MD5, or SHA-1.
  • Specific Platform Nuances & Matching Logic
    • Matching Logic: "ID-Centric." Matching is primarily driven by Email and MAID. There is "limited upside" to providing other IDs as cross-ID reinforcement is weak compared to Meta.
    • Upload Behavior: Uploads one identifier type per audience.
  • Relevant Reference: DCP Data Ingestion and Structuring Guidelines
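
Two of the partner-specific rules are easy to get wrong: the Gmail rule and the base64 encoding of the raw hash bytes required for The Trade Desk (UID2). The sketch below illustrates both with Python's standard library. The function names are hypothetical, and the code reflects only the rules as stated in this article; check each partner's own documentation before relying on it.

```python
import base64
import hashlib

GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def normalize_gmail_style(email):
    """Lowercase and trim; for Gmail domains, drop dots and any '+suffix' from the local part."""
    local, _, domain = email.strip().lower().rpartition("@")
    if domain in GMAIL_DOMAINS:
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


def sha256_base64(value):
    """Base64 of the raw 32-byte SHA-256 digest (44 characters)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def sha256_hex_then_base64(value):
    """The mistake to avoid: base64 of the 64-character hex string (88 characters)."""
    hex_string = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return base64.b64encode(hex_string.encode("ascii")).decode("ascii")


email = normalize_gmail_style(" Jane.Doe+News@Gmail.com ")
print(email)
print(len(sha256_base64(email)), len(sha256_hex_then_base64(email)))
```

Checking the encoded length is a cheap sanity test: a correctly encoded SHA-256 value is always 44 characters, while the hex-then-base64 mistake produces 88.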