Inside Meta's Data Flow Discovery
Discover How Meta Tracks Data Journeys to Safeguard User Privacy at Scale
TL;DR
Situation
Meta handles vast amounts of user data across its platforms, requiring strong privacy controls to protect sensitive information. A critical component of this effort is data lineage, which helps trace how data moves across different systems, ensuring compliance with privacy policies like purpose limitation.
Task
Meta needed a scalable and automated way to track data lineage across millions of assets, including databases, web services, and AI systems. This required moving beyond manual data flow documentation to a more robust, automated discovery process.
Action
Data Flow Collection – Used static code analysis, runtime instrumentation, and input/output matching to track data across stacks (Hack, C++, Python, SQL).
Privacy Probes – Captured real-time runtime signals, identifying how and where sensitive data is logged, stored, or transformed.
Automated Lineage Graphs – Created scalable data flow visualizations to streamline privacy control implementation.
AI & Data Warehouse Integration – Ensured end-to-end traceability across AI models, databases, and batch-processing systems.
Iterative Filtering Tool – Allowed developers to refine lineage graphs, isolating relevant data flows and removing noise.
Result
Meta’s data lineage system reduced engineering time, improved compliance accuracy, and automated privacy enforcement. It enabled developers to quickly identify and secure sensitive data flows while ensuring continuous monitoring at scale. These innovations enhanced user data protection across Meta’s ecosystem.
Use Cases
Privacy Enforcement, Compliance Monitoring, Data Lineage
Tech Stack/Framework
Python, SQL, C++, PyTorch, Presto, Spark
Explained Further
Meta's Privacy Aware Infrastructure (PAI) is designed to embed privacy controls within its systems, ensuring user data is handled responsibly. A foundational element of PAI is data lineage, which traces the journey of data across various platforms, providing a comprehensive view of its flow from collection to processing and storage. This capability is crucial for implementing privacy measures like purpose limitation, which restricts data usage to specific, intended purposes.
Understanding Data Lineage at Meta
Data lineage involves mapping out how data moves through Meta's vast ecosystem, connecting source assets (e.g., database tables where data originates) to sink assets (e.g., tables or systems where data is stored or processed).
This mapping is essential for:
Scalable Data Flow Discovery: Creating an end-to-end graph that visualizes data movement, aiding in identifying where data resides and how it traverses the system.
Efficient Rollout of Privacy Controls: Pinpointing optimal integration points for privacy measures, streamlining their implementation.
Continuous Compliance Verification: Monitoring data flows to ensure ongoing adherence to privacy requirements.
Traditional methods of tracking data flow, such as manual code inspections and data flow diagrams, are inadequate for Meta's scale, which encompasses billions of lines of code. To address this, Meta has developed a scalable lineage solution that combines static code analysis with runtime signals.
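At its core, a lineage graph is just a directed graph whose nodes are data assets and whose edges are observed flows. A minimal sketch of that idea (asset names are hypothetical, borrowed from the walkthrough below):

```python
from collections import defaultdict

class LineageGraph:
    """Directed graph: nodes are data assets, edges are observed data flows."""

    def __init__(self):
        self.edges = defaultdict(set)  # source asset -> set of sink assets

    def add_flow(self, source, sink):
        self.edges[source].add(sink)

    def downstream(self, source):
        """All assets reachable from `source` (graph traversal)."""
        seen, frontier = set(), [source]
        while frontier:
            node = frontier.pop()
            for sink in self.edges[node]:
                if sink not in seen:
                    seen.add(sink)
                    frontier.append(sink)
        return seen

graph = LineageGraph()
graph.add_flow("dating_endpoint", "dating_log_tbl")
graph.add_flow("dating_log_tbl", "dating_training_tbl")
print(graph.downstream("dating_endpoint"))  # every asset downstream of the endpoint
```

A query like `downstream(...)` is what makes the "where does this data end up?" question answerable at scale.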
Implementing Data Lineage: A Walkthrough with Religion Data
To illustrate the implementation of data lineage, consider how Meta handles users' religious information within the Facebook Dating app. This data is sensitive and subject to strict purpose limitation requirements, ensuring it's used solely for enhancing the dating experience and not for other purposes.
The implementation involves two key stages:
Collecting Data Flow Signals: Capturing signals from various processing activities across different systems to create a comprehensive lineage graph.
Identifying Relevant Data Flows: Isolating the specific subset of data flows that pertain to religion from the broader lineage graph.
These stages span multiple systems, including web services, data warehouses, and AI platforms.
Collecting Data Flow Signals
Tracking data lineage at scale requires a systematic approach to capturing how data moves through different systems. At Meta, this process involves collecting data flow signals across web systems, data warehouses, and AI platforms.
1. Web Systems: Capturing Data from User Input to Storage
When a user enters their religious views on Facebook Dating, this data is stored and used to identify compatible matches. However, to comply with purpose limitation requirements, the data should not be used outside of the dating feature.
How Web Data Flow is Logged
The data journey starts when a user submits their religious information via their mobile device. This information is then transmitted to a web endpoint, logged into a logging table, and stored in a database. The process can be visualized as follows:
User Input: The user enters their religion preference (e.g., "Christian").
Web Endpoint Processing: The backend receives and processes the request.
Logging: The data is recorded for debugging or tracking purposes.
Database Storage: The information is stored for future use in the matching algorithm.
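The four stages above can be sketched as a single request handler. Everything here (function name, schema, storage) is a hypothetical stand-in, not Meta's actual endpoint code:

```python
LOG_TABLE = []   # stand-in for the logging table (debugging/tracking)
DATABASE = {}    # stand-in for the product database (matching algorithm)

def handle_religion_update(user_id, religion):
    """Web endpoint: receives user input, logs it, and stores it."""
    # Logging: the raw value is recorded for debugging or tracking purposes.
    LOG_TABLE.append({"user_id": user_id, "religion": religion})
    # Database storage: kept for future use in the matching algorithm.
    DATABASE[user_id] = {"religion": religion}

handle_religion_update(42, "Christian")
```

Both writes are data flows from the same source, which is exactly what lineage collection needs to capture.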
Tracking Data Flow with Static and Runtime Analysis
To track how religion data moves through these stages, Meta uses both static and runtime analysis tools:
Static Code Analysis: Simulates code execution to determine how data is expected to flow. This method provides useful quality signals (indicating confidence in the detected flow). However, since it doesn't run the actual code, it may generate false positives (detecting flows that don’t actually happen).
Runtime Instrumentation (Privacy Probes): Gathers real-time data flow signals by actively monitoring how data moves during execution. Privacy Probes observe data as it flows through loggers, databases, and services, capturing actual movements rather than just predicted ones.
By combining static analysis (expected flows) and runtime instrumentation (actual flows), Meta ensures a high-confidence tracking system that accurately maps data movements.
How Privacy Probes Work
Privacy Probes are a core component of Meta’s lineage technology and function as automated data flow discovery agents. Their operation can be broken down into three steps:
Capturing Payloads: Privacy Probes sample source and sink payloads in memory, capturing metadata like timestamps, asset identifiers, and stack traces. This metadata serves as evidence of a data flow.
Comparing Payloads: The system compares input and output values to check if they match (directly or via transformation).
Categorizing Matches:
Exact Match (High Confidence): The source and sink have identical values.
Substring Match (High Confidence): The sink contains a recognizable portion of the source data.
Transformed Data (Low Confidence): The source undergoes transformation before logging (e.g., instead of storing "Christian," the system logs "Count: 1").
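The three match categories boil down to a payload comparison. A toy version of that step (real payload matching is considerably more involved):

```python
def categorize_flow(source_value, sink_value):
    """Compare a sampled source payload against a sink payload."""
    if source_value == sink_value:
        return ("exact", "high")       # identical values
    if source_value in sink_value:
        return ("substring", "high")   # sink contains a recognizable portion
    # No direct relationship found: the data may have been transformed
    # before logging (e.g., "Christian" logged as "Count: 1").
    return ("transformed", "low")

print(categorize_flow("Christian", "Christian"))
print(categorize_flow("Christian", "religion=Christian"))
print(categorize_flow("Christian", "Count: 1"))
```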
Example: Privacy Probes in Action
Consider an example where religious views are logged in different ways (the sink values are illustrative):

Source payload | Sink payload | Match category
"Christian" | "Christian" | Exact match (high confidence)
"Christian" | "religion=Christian" | Substring match (high confidence)
"Christian" | "Count: 1" | Transformed (low confidence)

The first two rows are high-confidence matches, indicating clear data lineage. The last row, where the data is transformed, is categorized as low-confidence but still monitored.
2. Data Warehouse: Tracking Offline Processing
Once religious data is logged in the web system, it often propagates to the data warehouse for offline analysis. Unlike web data, which flows through endpoints and databases, data warehouse tracking requires analyzing SQL queries and job execution logs.
Tracking Data Movement in SQL Queries
In the data warehouse, data flow tracking is achieved by:
Logging SQL queries executed by Presto and Spark.
Performing static analysis on logged SQL queries and job configurations.
For example, consider a query that copies dating data into a training table:

Source table: dating_log_tbl
Destination table: dating_training_tbl
Tracked columns: user_id, religion

Meta’s SQL Analyzer automatically extracts lineage signals between tables. In real applications, tracking is done at a granular level (e.g., column mappings like religion → target_religion).
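A toy version of what a SQL analyzer does: parse an INSERT ... SELECT statement and emit a lineage edge between the tables involved. This regex sketch only handles the simplest query shape; a production analyzer uses a full SQL parser:

```python
import re

def extract_lineage(sql):
    """Extract (source, destination, columns) from a simple INSERT ... SELECT."""
    pattern = re.compile(
        r"INSERT\s+INTO\s+(\w+)\s+SELECT\s+(.+?)\s+FROM\s+(\w+)",
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(sql)
    if not match:
        return None  # query shape not supported by this toy parser
    destination, columns, source = match.groups()
    return {
        "source": source,
        "destination": destination,
        "columns": [c.strip() for c in columns.split(",")],
    }

query = "INSERT INTO dating_training_tbl SELECT user_id, religion FROM dating_log_tbl"
print(extract_lineage(query))
```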
Handling Partial Data Flow Signals
In some cases, data is not fully processed via SQL queries, leading to missing lineage connections. To address this, Meta:
Collects runtime metadata (e.g., execution environments, job IDs).
Uses trace IDs to connect isolated read/write operations into a complete lineage graph.
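The trace-ID stitching described above amounts to a join over runtime events: group read and write operations that share a trace ID, and emit a lineage edge for each read/write pair. A sketch with a hypothetical event shape:

```python
from collections import defaultdict

def stitch_lineage(events):
    """Join isolated read/write operations on their shared trace ID."""
    by_trace = defaultdict(lambda: {"read": [], "write": []})
    for event in events:
        by_trace[event["trace_id"]][event["op"]].append(event["table"])
    edges = []
    for ops in by_trace.values():
        for src in ops["read"]:          # every table read within the trace...
            for dst in ops["write"]:     # ...flows to every table written
                edges.append((src, dst))
    return edges

events = [
    {"trace_id": "job-7", "op": "read", "table": "dating_log_tbl"},
    {"trace_id": "job-7", "op": "write", "table": "dating_training_tbl"},
]
print(stitch_lineage(events))  # [('dating_log_tbl', 'dating_training_tbl')]
```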
3. AI Systems: Tracking Data Flow in Machine Learning Models
AI models often use sensitive data to improve recommendations. For instance, Facebook Dating may train a model using religious data to refine match suggestions.
To track data lineage in AI systems, Meta monitors asset relationships between:
Input Datasets (e.g., training tables).
Features (e.g., derived scores).
Models (e.g., trained ranking models).
Workflows & Inferences (e.g., real-time scoring).
Tracking AI Data Flow via Configurations
Consider a training configuration that ties together several asset identifiers:
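The original post shows a concrete configuration; the hypothetical stand-in below illustrates the identifiers involved (every field name and value here is invented for illustration):

```python
# Hypothetical training configuration: field names and values are illustrative.
training_config = {
    "dataset_id": "dating_training_tbl",       # data pulled from the warehouse
    "feature_ids": ["religion_match_score"],   # derived, religion-based scores
    "model_id": "dating_ranking_model_v3",     # links to downstream inferences
}
```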
The dataset ID traces data from the warehouse into the AI training pipeline.
The feature ID tracks how religion-based scores are generated.
The model ID links the trained model to downstream inferences.
Instrumenting AI Workflows
To enhance tracking, Meta instruments AI systems at key stages:
Data-loading layers (e.g., PyTorch, TensorFlow).
Workflow engines (e.g., FBLearner Flow).
Inference services (e.g., real-time model APIs).
By matching input/output assets, Meta constructs a comprehensive lineage graph, ensuring that AI models do not unintentionally misuse sensitive data.
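One way to instrument these layers is to wrap each stage so it records which asset it consumed and which it produced; chaining the recorded pairs yields the lineage path. A hypothetical sketch using a decorator:

```python
LINEAGE_EDGES = []  # (stage, input_asset, output_asset) records

def instrumented(stage):
    """Decorator: record the input/output asset pair for a workflow stage."""
    def decorator(fn):
        def wrapper(input_asset):
            output_asset = fn(input_asset)
            LINEAGE_EDGES.append((stage, input_asset, output_asset))
            return output_asset
        return wrapper
    return decorator

@instrumented("data_loading")
def load_dataset(table):
    return f"features:{table}"        # stand-in for a feature-extraction step

@instrumented("training")
def train_model(features):
    return f"model:{features}"        # stand-in for producing a trained model

train_model(load_dataset("dating_training_tbl"))
print(LINEAGE_EDGES)
```

Because every stage appends to the same record, matching output assets of one stage to input assets of the next reconstructs the dataset → features → model chain.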
Identifying Relevant Data Flows
Once the comprehensive lineage graph is constructed, the next step is to isolate the data flows pertinent to religion. Meta has developed an iterative analysis tool that assists developers in this process. The tool facilitates:
Discovery: Identifying data flows from source assets and highlighting downstream assets with low-confidence flows.
Exclusion and Inclusion: Allowing developers to exclude assets that don't handle religion data and include those that do, refining the focus of the lineage analysis.
Iteration: Repeating the discovery process with newly identified assets until all relevant data flows are mapped.
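The discover/exclude/iterate loop can be sketched as a guided traversal over the full lineage graph, where a developer-supplied exclusion set prunes irrelevant branches (all asset names are hypothetical):

```python
def discover_relevant_flows(edges, sources, excluded):
    """Iteratively expand from source assets, skipping excluded ones."""
    relevant, frontier = set(sources), list(sources)
    while frontier:  # Iteration: repeat discovery with newly included assets
        node = frontier.pop()
        for src, dst in edges:
            if src == node and dst not in excluded and dst not in relevant:
                relevant.add(dst)       # Inclusion: asset handles the data
                frontier.append(dst)
    return relevant

edges = [
    ("dating_endpoint", "dating_log_tbl"),
    ("dating_log_tbl", "dating_training_tbl"),
    ("dating_log_tbl", "generic_metrics_tbl"),  # excluded by a developer
]
flows = discover_relevant_flows(edges, ["dating_endpoint"], {"generic_metrics_tbl"})
print(sorted(flows))
```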
This systematic approach enables developers to efficiently locate and manage sensitive data flows, ensuring that appropriate privacy controls are applied throughout Meta's complex systems.
Lessons Learned
Throughout the development and implementation of data lineage within PAI, Meta has gleaned several key insights:
Early Investment in Lineage: Prioritizing data lineage early in the development process accelerates the implementation of privacy controls and uncovers new opportunities for technology application.
Tooling for Efficiency: Developing intuitive tools for consuming lineage data significantly reduces engineering effort and complexity.
System Integration: Integrating lineage collection with existing systems and libraries enhances coverage and scalability.
Continuous Measurement: Implementing metrics to monitor lineage coverage ensures adaptability to the evolving data landscape.
The Full Scoop
To learn more, check out Meta's Engineering Blog post on this topic.