Data Clean Room—Working with sources

Premium

At a glance: Set up the custom data sources you share with the Data Clean Room (DCR) for matching with attribution data and creating DCR reports.

Introduction

Many DCR reports are designed to match your attribution data with data from your custom sources. This article contains everything you need to know about working with your custom sources, including how to create them, update them to trigger report processing, and edit or delete existing sources.

Before you begin

Before you create custom sources, you should first:

  • [Required] Set up the cloud services from which the DCR will retrieve the data. Two types of cloud services are supported:
    • Data warehouses: BigQuery and Snowflake
    • Cloud storage buckets: Amazon S3 (AWS) and GCS
  • [Optional] Create inbound connections in the AppsFlyer platform to connect these cloud services to the DCR.
    • If these connections were not set up previously, you will be prompted to set them up during source creation. 

Source data requirements

Sources must meet these requirements in order to prevent errors in source creation and report processing.

Data format (relevant to all sources)

Data within sources must meet these requirements:

  • Date and time:
    • Format: yyyy-MMM-dd hh:mm:ss (for example, 2023-APR-18 15:30:35)
    • Time zone: UTC
  • Numbers: maximum 2 digits following the decimal point
  • String length: maximum of 256 characters
  • Character limitations:
    • For field names (column headers): no spaces or special characters
    • All other data: no limitations (all characters are valid)
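As an illustration, the data-format rules above can be expressed as simple checks. The following is a minimal Python sketch; the helper names are hypothetical, and the DCR performs its own validation, so treat this only as a way to pre-check your data before upload. The assumption that underscores are allowed in field names is ours, not stated in the requirements.

```python
import re
from datetime import datetime

# Minimal, illustrative checks for the DCR data-format rules above.
# Helper names are hypothetical; the DCR performs its own validation.

def valid_timestamp(s: str) -> bool:
    """yyyy-MMM-dd hh:mm:ss, e.g. 2023-APR-18 15:30:35 (values must be UTC)."""
    try:
        datetime.strptime(s, "%Y-%b-%d %H:%M:%S")  # %b matches APR/Apr/apr
        return True
    except ValueError:
        return False

def valid_number(s: str) -> bool:
    """Maximum 2 digits following the decimal point."""
    return re.fullmatch(r"-?\d+(\.\d{1,2})?", s) is not None

def valid_string(s: str) -> bool:
    """Maximum of 256 characters."""
    return len(s) <= 256

def valid_header(name: str) -> bool:
    """Field names: no spaces or special characters (underscore assumed allowed)."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None
```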

Table columns (relevant only to sources in data warehouses)

In addition to data shared for processing, source tables in BigQuery or Snowflake must include 2 additional columns – one for date and one for version:

  • Date:
    • Column header: dt
    • Column type: date
    • Data format: yyyy-mm-dd (for example, 2023-04-18)
    • Additional: BigQuery tables must be partitioned by this column
  • Version:
    • Column header: v
    • Column type: string
    • Data format: number (for example, 1, 2, 3, 10)
    • Important! A new version of a report is triggered each time the DCR detects a new value in this column. To ensure the completeness of your report, be sure to populate the source table with a complete set of data whenever the column value is changed.
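To illustrate, values for the two required columns can be produced as follows. This is a hypothetical sketch (the helper names are not part of any DCR API); it only demonstrates the documented formats for dt and v.

```python
from datetime import date

# Illustrative values for the two required warehouse columns (dt and v).

def dt_value(d: date) -> str:
    """dt column: yyyy-mm-dd, e.g. 2023-04-18."""
    return d.isoformat()

def v_value(n: int) -> str:
    """v column: a string holding a number; a new value here
    triggers a new report version."""
    return str(n)

row = {"dt": dt_value(date(2023, 4, 18)), "v": v_value(2)}
```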

File name and format (relevant only to sources in cloud storage buckets)

Source files stored in Amazon S3 or GCS must meet these file name and format requirements:

  • File name must comply with DCR naming requirements
  • CSV or GZIP format
    • The file underlying GZIP compression must be a CSV file.
  • Number of data source files per data folder:
    • CSV: Maximum of 1
    • GZIP: Maximum of 1 single-part file. Multi-part GZIP files are supported when named as follows: filename_part01.gzip, filename_part02.gzip, etc.
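A hypothetical helper for generating compliant multi-part GZIP file names might look like this (the two-digit zero padding is inferred from the documented examples):

```python
def part_name(base: str, part: int) -> str:
    """Multi-part GZIP naming: filename_part01.gzip, filename_part02.gzip, ...
    Two-digit zero padding is assumed from the documented examples."""
    return f"{base}_part{part:02d}.gzip"
```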

Creating a source

The process of creating a source consists of all the steps described below. They are separated into tabs simply for ease of reading.

Follow these steps to create a source:

#1: Name the source

  1. Go to the Sources tab of the Data Clean Room.
  2. Click the + New source button.
    The New source page opens.
  3. Enter the name of the source in the upper-left corner.
    • This can be any unique name that will help you identify the source in the DCR platform. For cloud integrations, the name doesn't need to match the file name.
    • Important! Ensure that the source name is different from all other sources in your account or you won't be able to save the source.
    • Source name requirements:
      • Length: 2-80 characters
      • Valid characters:
        • letters (A-Z, a-z)
        • numbers (0-9), though a number cannot be the first character of the name
      • Invalid characters:
        • spaces
        • all other symbols or special characters
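The naming rules above can be captured in a single regular expression. The sketch below is illustrative only; the function name is hypothetical, and the DCR enforces its own validation when you save the source.

```python
import re

# 2-80 characters, letters and numbers only, must not start with a number
# (so the first character must be a letter).
SOURCE_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9]{1,79}$")

def is_valid_source_name(name: str) -> bool:
    return SOURCE_NAME_RE.fullmatch(name) is not None
```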

#2: Specify the source location

To specify the source location:

  1. Select the connection in which the source will be (or has been) created.
    • If there are no connections defined in your account, the New connection dialog will open, prompting you to create one. Follow these instructions to create it.
    • If you have existing connections but want to use a new one, click the dcr_new_connection_button.png button to open the New connection dialog. Follow these instructions to create it.
  2. Continue with the relevant instructions below based on where the data for your source is located.

Source locations in BigQuery

To complete specifying the source location for BigQuery sources:

  1. Select the dataset in which the source table is located.
  2. Select the table in which the source data is located.

The lists from which you make these selections contain the available datasets and tables, respectively, in the BigQuery project you specified when creating the connection.

Source locations in Snowflake

To complete specifying the source location for Snowflake sources:

  1. Select the share containing the source data.
  2. Select the schema in which the source table is located.
  3. Select the table in which the source data is located.

The lists from which you make these selections contain the shares, schemas, and tables, respectively, in the Snowflake account you specified when creating the connection.

Source locations in cloud storage buckets

Source locations in Amazon S3 or GCS consist of the cloud storage bucket specified by the connection and the underlying folder path from which the DCR reads the source file each time it is updated. 

Once you have specified the connection, AppsFlyer can automatically generate the required underlying folder path as part of the source creation process.

  • Allowing AppsFlyer to generate the folders makes the process easy. However, you can choose to manually create them instead, according to the instructions detailed here.

If AppsFlyer generates the folders, the only additional information required is the name you want to give the source folder. (This is the top-level folder in which you update the source each time you want to use it for running a new report version.) You can also indicate whether you want the source folder to be created underneath a parent folder (often named input).

To complete specifying a source location in a cloud storage bucket, enter the source folder name.

  • By default, the displayed source folder name:
    • Is based on the name you gave the source. You can change the folder name to meet your needs, so long as it complies with the DCR naming requirements.
    • Indicates that it will be generated within a parent folder named input. This folder serves as the parent folder for all sources you upload to the DCR.
      • The input folder is not required, and you can remove it or name it something different, so long as it complies with the DCR naming requirements.
      • Although this folder is not required, having an input folder (or an equivalent folder with a different name) is considered best practice. It is especially recommended when you use the same cloud storage bucket both for uploading data files (input) and receiving reports (output).

 Important!

If you manually created the folder path, make sure the connection and path you enter in the Source location section match the path you manually created.

#3: Define the source structure

For all sources that you share with the DCR for processing, AppsFlyer needs to know how each data field should be used in order to create reports. Defining the source structure consists of:

  • Loading the source fields
  • Categorizing each field (column) as one of the following types:
    • Identifier: Field that identifies a unique app user (examples might include CUID, AppsFlyer ID, etc.)
      • The primary purpose of identifiers in the context of the DCR is to join data sources so that corresponding user-level data can be matched.
    • Dimension: An attribute by which you categorize app users (examples might include geo, install date, campaign, etc.)
    • Metric: Numeric data you have collected with respect to an app user (examples might include revenue, number of app opens, LTV, etc.)
      • A data field identified as a metric can contain only numeric values.

Loading source fields

Load source fields using the relevant instructions below:

Data warehouse sources

To load fields from a source located in a data warehouse (BigQuery or Snowflake), click the  dcr_load_fields_from_source.png button.

 Important!

If the selected source table does not include the required date and version columns, you will receive an error.

Cloud storage bucket sources

To load fields from a source located in a cloud storage bucket (Amazon S3 or GCS), you must upload a prototype source file.

For purposes of defining the source structure: 

  • You can upload a prototype version of the source from a local file.
    • If you select this option, AppsFlyer always creates the source folder path automatically.

                                                                - or -

  • You can upload a prototype version of the source file directly from its connection.
    • If you select this option, there's one additional choice to make:
      • Allow AppsFlyer to automatically create the source folder structure; or
      • Create the source folder structure manually

To upload your prototype source file, follow the instructions in the relevant tab below:

Local file | Connection (automatic creation) | Connection (manual creation)

To upload from a local file:
  1. In the Source structure section, click the DCR_load_fields_from_file.png button.
  2. In the window that opens, select Upload a local file.
  3. Specify the CSV or GZIP file you want to upload, then click OK.

Categorize fields

After you load the fields, AppsFlyer analyzes the file, and a list of all data fields (columns) is displayed in the Available fields list.

To categorize the fields:

  1. Select one or more fields in the Available fields list on the left and use the buttons in the middle of the screen to categorize them as identifiers, dimensions, or metrics.
    • Once you categorize a field, it is displayed in the relevant category list on the right side of the screen.
    • You can use the search bar to search for fields in the lists.
    • To remove a field from a category it's been assigned to, select it in the relevant category list and use the Remove button to return it to the Available fields list.
  2. Repeat this process until you have categorized each field you want to include in DCR reports.
    • There is no requirement to categorize every field in the Available fields list. However, a field must be categorized in order to use it later in a report.
  3. If you edit the source data before saving the source and want to use fields from the edited data, click the Reload fields link at the bottom of the Available fields list.
    • Note that reloading the source will overwrite the field names in the Available fields list. Any fields that you previously categorized will remain in the Identifiers, Dimensions, or Metrics lists.
    • If a previously categorized field is not found in the reloaded source data, it will still display in the relevant category list, but it will be marked with an error icon.

 Note

If you decide to use additional fields from this source after saving it, you can do so by editing the source structure.

#4: Save the source

To save the source:
  1. [Optional] Click DCR_test_source.png to check for errors in the format or validity of the source fields.
  2. Click Save to save the source.

    The source is created and a confirmation message is displayed.

    • If you uploaded the source from a local file, saving the source triggers the automatic creation of the folder structure, and the displayed confirmation message includes a link to the source folder.

    The new source is displayed in the list of all existing sources in the Sources tab of the Data Clean Room.

Updating sources to trigger report processing

Each time you want AppsFlyer to process a data source file and run a report based on it, you upload a new version of the file to the source folder, within a series of nested subfolders indicating the date and version number (plus one extra subfolder to let AppsFlyer know where the data is).

AppsFlyer continually scans for new versions of source files for the current date and 2 days prior. A new version of a report is triggered each time new versions of the source files are found (including _SUCCESS files, as further detailed below).
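For illustration, the scanning window described above can be computed as follows. This is a hypothetical helper (not part of any DCR API) that assumes the dt=yyyy-mm-dd date-folder format used by the DCR:

```python
from datetime import date, timedelta

def scanned_date_folders(today: date) -> list:
    """Date folders the DCR scans: the current date plus the 2 days prior."""
    return [f"dt={(today - timedelta(days=n)).isoformat()}/" for n in range(3)]
```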

Nested subfolders for each date and version

The structure of nested subfolders is as follows:

  • Within the source folder --> 1 subfolder for each date ("date folder")
    • Format: dt=yyyy-mm-dd/
    • Example: dt=2022-12-15/
  • Within each date folder --> 1 subfolder for each version on that date ("version folder")
    • Format: v=n/
    • Example: v=1/
    • Note: The version folder is required even if you only upload the file one time per day.
  • Within each version folder --> 1 subfolder to indicate the location of the data ("data folder")
    • Format: data/
    • The data folder is the location to which the source file is uploaded.

In most cases, you use API calls or other available programmatic means to create the date/version/data folders automatically each time the data source file is uploaded. For additional information, see the API reference for your cloud service: AWS, GCS.
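As a minimal sketch of what such a script might compute, the date/version/data path for each upload can be composed programmatically (the function name is hypothetical):

```python
from datetime import date

def upload_prefix(source_folder: str, d: date, version: int) -> str:
    """Compose <source folder>/dt=yyyy-mm-dd/v=n/data/ for a new upload."""
    return f"{source_folder}/dt={d.isoformat()}/v={version}/data/"
```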

_SUCCESS files

Once the upload of a source file to the data folder is complete, an empty file named _SUCCESS should be uploaded to the version folder. This alerts AppsFlyer that a new file is available to be processed. In most cases, you use an API script to automatically generate and upload this file.

Important! The _SUCCESS file must be uploaded to the version folder, outside the data folder.

The filename for the _SUCCESS file:

  • Must be in ALL CAPS
  • Must be preceded by an underscore (_)
  • Should not have a file extension

For multi-part GZIP files:

  • Only one _SUCCESS file should be uploaded for all file parts.
  • The _SUCCESS file should be uploaded only after all file part uploads are complete.
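To illustrate correct placement, the following sketch builds the object keys for the data file and its _SUCCESS marker (the helper names are hypothetical; the point is that _SUCCESS sits in the version folder, not under data/):

```python
def data_key(source_folder: str, dt: str, version: int, filename: str) -> str:
    """The source file itself goes inside the data/ folder."""
    return f"{source_folder}/dt={dt}/v={version}/data/{filename}"

def success_key(source_folder: str, dt: str, version: int) -> str:
    """_SUCCESS belongs in the version folder, alongside (not inside) data/."""
    return f"{source_folder}/dt={dt}/v={version}/_SUCCESS"
```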

Example (after uploading files)

After uploading source files on 2 days (and programmatically creating date/version/data folders and _SUCCESS files), your bucket/folder structure might look something like this:

dcr_file_structure_after_uploads.png

Working with existing sources

There are several ways in which you might want to work with existing sources. You initiate these processes from the Sources tab of the Data Clean Room:

Editing the source name

To edit the source name:

  1. Go to the Sources tab of the Data Clean Room.
  2. In the list of sources, hover over the row of the source you want to edit.
  3. Click the edit button edit_button.png that displays on the right side of the row.
  4. On the Edit source page, edit the name of the source.
  5. Click the Save button to save the source with the new name or Cancel to undo your changes.

Editing the source structure

To edit the source structure:

  1. Go to the Sources tab of the Data Clean Room.
  2. In the list of sources, hover over the row of the source you want to edit.
  3. Click the edit button edit_button.png that displays on the right side of the row.
  4. On the Edit source page, the fields that were previously categorized as identifiers, dimensions, or metrics will display in the relevant category lists on the right side of the screen.
  5. You can move a previously categorized field to a different category without reloading fields from the source file. To do this:
    1. First, select it in the relevant category list and use the Remove button to return it to the Available fields list.
    2. Next, select it in the Available fields list and use the buttons in the middle of the screen to categorize it as an identifier, dimension, or metric.
  6. To work with fields in the source file that have not yet been categorized, reload them from the source location or from a local file by clicking the Reload fields link at the bottom of the Available fields list.
  7. AppsFlyer analyzes the file, and a list of all previously uncategorized data fields (columns) is displayed in the Available fields list.
    • Fields that were previously categorized as identifiers, dimensions, or metrics will still display in the relevant category lists on the right side of the screen.
    • If a previously categorized field is not found in the reloaded source file, it will still display in the relevant category list, but it will be marked with an error icon.
  8. Select one or more fields in the Available fields list on the left and use the buttons in the middle of the screen to categorize them as identifiers, dimensions, or metrics.
  9. Once you have made all necessary changes, click the Save button to save the source with the updated structure or Cancel to undo your changes.

 Important!

Don't forget to make corresponding changes reflecting the new source structure in any reports for which this source is used:

  • Fields that were removed, uncategorized, or changed from their previous categories are automatically removed from any reports in which they are used.
  • Newly added or categorized fields are not automatically included in existing reports until you edit report definitions to include them.

Deleting a source

  1. Go to the Sources tab of the Data Clean Room.
  2. In the list of sources, hover over the row of the source you want to delete.
  3. Click the delete button delete_button.png that displays on the right side of the row.
  4. In the dialog, confirm that you want to delete the source.
    • You cannot delete a source that is being used by a report. If this is the case, a message will list the reports in which the source is being used. In order to delete the source, you can either:
      • Delete the reports in which it is being used; or
      • Remove the source fields from the definitions of the reports in which they are used.

Reference

Manually creating a storage bucket folder structure (relevant only if you choose to do so)

In general, it's easiest to allow AppsFlyer to automatically generate the required folder structure as part of the source creation process. However, if you wish to create these folders manually, you can do so as follows.

Create a DCR key folder

To ensure maximum security, the folder directly beneath the bucket (the "DCR key folder") must be named with the 8-character, alphanumeric DCR key assigned to your account (for example, 01bcc5fb). Note that this is different from any other password or key associated with your AppsFlyer account.

The DCR key folder is generally created manually using the interface of your selected cloud service.

To get your account's DCR key, click the DCR key button at the top of the main DCR page.

dcr_key_button.png

After creating the DCR key folder, your bucket/folder structure would look something like this:

dcr_file_structure_dcr_key_folder.png

Top-level input folder

Though it is not required, best practice is to create a top-level input folder directly beneath the DCR key folder. This folder will be dedicated to files you upload to the DCR.

The top-level input folder is generally created manually using the interface of your selected cloud service.

  • This practice is especially recommended when you use the same bucket both for uploading data files (input) and receiving reports (output).
  • You can name this folder anything you want, so long as it complies with the DCR naming requirements. For ease of identification, it is usually named input/.

After creating the top-level input folder, your bucket/folder structure might look something like this:

dcr_file_structure_input_folder.png

Second-level folder for each data source

You can regularly upload different data source files to the DCR for processing. Each of these data sources must be assigned a separate folder ("data source folders").

For example, if you plan to upload 2 files to the DCR for processing every day (BI-data.csv and CRM-data.gzip), you would assign each of these data sources its own folder. You could choose to call these folders BI-data/ and CRM-data/.

The data source folders are generally created manually using the interface of your selected cloud service.

After creating 2 data source folders, your bucket/folder structure might look something like this:

dcr_file_structure_source_folders.png

Under each data source folder, nested subfolders by date and version must be created each time the source is updated.
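Putting the manual pieces together, the full path beneath the bucket can be sketched as follows. This is a hypothetical helper; the DCR key shown in the test is the example value from above, and the input folder is optional, per the best-practice note.

```python
def full_prefix(dcr_key: str, input_folder: str, source_folder: str,
                dt: str, version: int) -> str:
    """Full manual path beneath the bucket:
    <DCR key folder>/<input folder (optional)>/<source folder>/dt=.../v=.../data/
    Pass input_folder="" to omit the optional input folder."""
    parts = [dcr_key, input_folder, source_folder, f"dt={dt}", f"v={version}", "data"]
    return "/".join(p for p in parts if p) + "/"
```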