Data Engineering · 8 min read

Data Federation and the Modern Connected Enterprise

An exploration of data federation architecture, its benefits, challenges, and real-world applications in modern data systems.

Jack Hasselbring

November 2025

Data Federation, Data Architecture, Distributed Systems

Introduction

Modern enterprises generate and maintain massive amounts of data distributed across many systems. A well-functioning organization requires these systems to exchange and combine information to derive business insights. Studies have reported the global market will be larger than 150B USD in 2025 (source), and another study found that IT teams report spending over 30% of their budget on data storage, backup, and recovery (source). This need will only grow as data-hungry deep neural networks play an increasing role in business. Unnecessarily copying data at this scale can result in massive costs and lead to competing sources of truth, which can undermine business insights.

Lift and Shift

Figure 1: Lift and Shift

In the past, a common solution was to move all the data to a single location, sometimes referred to as "Lift and Shift". The advantage is that the data sits in one place where business analysts can work with it immediately. The drawback is that the process can be slow, duplicates data, and continuously consumes compute resources to move data from one location to another, all of which drive up costs. The central store may also lag behind the source systems, since information has to be copied or streamed in periodically.

A secondary pitfall of moving everything to a central data lake is Vendor Lock-In: an enterprise becomes so dependent on a single platform that it presents a major security, financial, or technical challenge to move to another solution.

All these aspects have driven interest in Federated Data Architectures, which allow a system to interact with data where it lives without migrating it to a new location.

What is a Federated Data Architecture?

Data Federation provides a unified view of many data sources in a single location without the step of moving any data. This eliminates data duplication and expensive data transfer pipelines. A federation engine can connect to multiple systems in an enterprise and query multiple data sources in real time, resulting in a seamless experience where users can access all relevant information without switching between systems.

This style of architecture reduces the need for costly data copies between different systems. Data federation is even more important in today's world of multiple competing cloud platforms where data is spread out (source).

Additionally, properly implemented data federation can help to avoid vendor lock-in. Since it allows a company to query, govern, and manage data without relocating or reformatting it, enterprises don't have to spend resources migrating data before they're consumed by new analytical tools. A well-designed federated architecture can sit above any individual vendor. If you switch vendors, the query interface and security remain the same. Figure 2 shows how a company could swap a vendor's application out without having to move any data.

Figure 2: Vendor Lock-In

Figure 2 illustrates a case where a company wants to switch its CRM vendor. The federated data architecture sitting above the vendor allows it to do so without migrating any data. This makes the switch between vendors simpler and avoids the costs and risks associated with large-scale data migration.
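A minimal sketch of why the query interface can stay stable across a vendor swap. Everything here is hypothetical: `VendorA`, `VendorB`, `FederationLayer`, and the field names are illustrative stand-ins, not any vendor's actual API. Each adapter maps its vendor's native fields to one shared shape, so code written against the federation layer does not change when the vendor does.

```python
from typing import Protocol


class CrmSource(Protocol):
    """The one interface the federation layer depends on (hypothetical)."""

    def fetch_contacts(self) -> list[dict]: ...


class VendorA:
    def fetch_contacts(self) -> list[dict]:
        # Stand-in for a call to vendor A, which uses its own field names.
        raw = [{"FullName": "Ada Lovelace", "Mail": "ada@example.com"}]
        return [{"name": r["FullName"], "email": r["Mail"]} for r in raw]


class VendorB:
    def fetch_contacts(self) -> list[dict]:
        # Stand-in for a call to vendor B, with different native field names.
        raw = [{"display_name": "Ada Lovelace", "email_address": "ada@example.com"}]
        return [{"name": r["display_name"], "email": r["email_address"]} for r in raw]


class FederationLayer:
    """Sits above the vendor; consumers only ever call this class."""

    def __init__(self, crm: CrmSource):
        self._crm = crm

    def contact_emails(self) -> list[str]:
        return [contact["email"] for contact in self._crm.fetch_contacts()]


# Swapping the vendor is a one-line change; the consumer-facing call is identical.
before = FederationLayer(VendorA()).contact_emails()
after = FederationLayer(VendorB()).contact_emails()
print(before == after)
```

The design choice is that vendor-specific translation lives in the adapter, which is the only piece that has to be rewritten during a switch.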

Virtual Tables

Figure 3: Virtual Table

A key innovation enabling this architecture is the Virtual Table. A Virtual Table is a logical representation of data that behaves like a table but doesn't store any data. The data lives in another source, which the virtual table accesses through a logical view. It handles connecting to the true data source, translating your query into a language the source can understand, and returning the result to the user. You might think of it as projecting data onto a screen, as seen in Figure 3. The key takeaway is that a virtual table doesn't store data, but serves as an interface between backing systems and the user.

A single query engine aggregates data from one or many data sources, then projects that information to the user without moving any of the stored data. From the user's perspective, it looks like any other table.
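The idea can be sketched in a few lines of Python. This is an illustrative toy, not how any particular product implements virtual tables: the `VirtualTable` class is hypothetical, and an in-memory SQLite database stands in for the remote backing system. The point it shows is that the object holds only a connection and a table name, never rows.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class VirtualTable:
    """Hypothetical virtual table: a pointer to a source, holding no rows itself."""

    source: sqlite3.Connection  # stand-in for a remote backing system
    remote_table: str           # table name inside the backing system

    def select(self, columns, where=None, params=()):
        # Translate the request into SQL the source understands, run it at
        # the source, and hand the rows back without persisting them.
        # Column and table names are trusted here; this is only a sketch.
        sql = f"SELECT {', '.join(columns)} FROM {self.remote_table}"
        if where:
            sql += f" WHERE {where}"
        return self.source.execute(sql, params).fetchall()


# Stand-in backing system with a small customers table.
backing = sqlite3.connect(":memory:")
backing.execute("CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)")
backing.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "EU"), (2, "Globex", "US"), (3, "Initech", "EU")],
)

customers = VirtualTable(source=backing, remote_table="customers")
eu_rows = customers.select(["id", "name"], where="region = ?", params=("EU",))
print(eu_rows)
```

Because nothing is copied, a row inserted into the backing system shows up in the very next `select` call, with no refresh job in between.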

When you run a query, the federation engine builds a logical plan for fetching the data from the different sources (such as Snowflake, Postgres, or other APIs). A key point is that the engine is highly flexible in the systems it can connect to. Because it can query each source directly, expensive computation can be performed where it's most efficient. The combined result is returned to the user through the virtual table.
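A toy version of such a plan, assuming two in-memory SQLite databases as stand-ins for a CRM and a billing system (the table names, columns, and the `revenue_by_account` function are all made up for illustration). The filter and the aggregation are each sent to the source that owns the data, a technique usually called pushdown, and only the small result sets are joined in the engine.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory SQLite database standing in for one remote system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Two independent "systems" (both hypothetical).
crm = make_source(
    "CREATE TABLE accounts (account_id INTEGER, name TEXT, region TEXT)",
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "Acme", "EU"), (2, "Globex", "US"), (3, "Initech", "EU")],
)
billing = make_source(
    "CREATE TABLE invoices (account_id INTEGER, amount REAL)",
    "INSERT INTO invoices VALUES (?, ?)",
    [(1, 100.0), (1, 250.0), (2, 75.0), (3, 40.0)],
)


def revenue_by_account(region):
    """Toy federated query: total invoiced amount per account in one region."""
    # Step 1: push the region filter down to the CRM so only matching rows travel.
    accounts = crm.execute(
        "SELECT account_id, name FROM accounts WHERE region = ?", (region,)
    ).fetchall()
    if not accounts:
        return {}

    # Step 2: push the aggregation down to billing, restricted to those accounts.
    ids = [account_id for account_id, _ in accounts]
    placeholders = ", ".join("?" for _ in ids)
    totals = dict(
        billing.execute(
            f"SELECT account_id, SUM(amount) FROM invoices "
            f"WHERE account_id IN ({placeholders}) GROUP BY account_id",
            ids,
        ).fetchall()
    )

    # Step 3: join the two small result sets inside the engine.
    return {name: totals.get(account_id, 0.0) for account_id, name in accounts}


result = revenue_by_account("EU")
print(result)
```

A real engine would derive these steps from the query automatically and choose between plans based on cost; here the plan is hard-coded to keep the sketch short.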

Iceberg Tables

Iceberg is a table format specification created at Netflix and donated to the Apache Software Foundation in 2018. It has since become a popular format supported by enterprise platforms such as Snowflake, AWS, and Foundry.

This table format introduced several technical innovations, but for simplicity, it can be thought of as a standardized wall outlet. Just as any appliance can plug into a standard outlet, a table that conforms to a single shared design can be easily consumed by many applications (source).

Metadata Catalog

How does the federation engine know what's available and where to look? A Metadata Catalog tells the query engine what data exists, where it lives, and how to interpret it. It does not store any of the actual data. It's like a shopping list the federation engine has access to and can pull from. Real-world examples include Unity Catalog (Databricks), AWS Glue Data Catalog, and Hive Metastore (Hadoop) (source).
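As a rough sketch, a catalog can be modeled as a lookup from a logical table name to three things: which system holds the data, where inside that system, and what the columns mean. The `CatalogEntry` and `MetadataCatalog` classes, the table names, and the locations below are all hypothetical; real catalogs such as those named above track far more (partitions, permissions, statistics).

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    system: str             # which backing system holds the data
    location: str           # where to find it inside that system
    schema: dict[str, str]  # column name -> type, i.e. how to interpret it


class MetadataCatalog:
    """Hypothetical catalog: stores descriptions of tables, never their rows."""

    def __init__(self):
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, name: str, entry: CatalogEntry) -> None:
        self._entries[name] = entry

    def resolve(self, name: str) -> CatalogEntry:
        # The federation engine calls this before planning a query.
        try:
            return self._entries[name]
        except KeyError:
            raise LookupError(f"unknown table: {name}") from None

    def tables(self) -> list[str]:
        return sorted(self._entries)


catalog = MetadataCatalog()
catalog.register(
    "sales.orders",
    CatalogEntry(
        system="object-store",
        location="s3://example-bucket/orders/",  # made-up location
        schema={"order_id": "bigint", "amount": "double"},
    ),
)
catalog.register(
    "crm.accounts",
    CatalogEntry(
        system="postgres",
        location="crm_db.public.accounts",  # made-up location
        schema={"account_id": "bigint", "name": "text"},
    ),
)

entry = catalog.resolve("sales.orders")
print(entry.system, entry.location)
```

Keeping this lookup separate from the data is what lets the engine answer "what exists and where" without touching any source system.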

Real World Implementations of Data Federation

Salesforce

Salesforce refers to zero-copy data federation, meaning you can access all data without moving it. Salesforce highlights the importance of "data fluidity", the seamless movement of data between multiple data sources. Their concept of "File Federation" makes an external table available to a data specialist, who can modify it from within the Salesforce platform without creating redundant copies or saving any extra data outside of the source.

Snowflake

Snowflake has a capability called "External Tables". Instead of storing the data itself, Snowflake queries the data in external storage and presents it to the user as if the table lived in Snowflake. A user of this feature incurs no Snowflake storage costs for that data. Snowflake also supports Iceberg tables, which read from external storage locations like Amazon S3, Google Cloud Storage, or Azure Storage. A limitation is that external tables are read-only (source, Iceberg source).

Palantir

Palantir refers to its version of this architecture as the "Multi Modal Data Plane" (source). This creates a data layer that spans any storage and compute environment. The most important part: data remains in existing systems while analytics, models, or any other business intelligence tools run where they are best suited. An analyst may interact with a single virtual data layer, unaware that the data is actually coming from multiple systems.

Additionally, Palantir has its own implementation of virtual tables, where users can query data in supported platforms without first having to store it in Foundry. As in the other implementations, the configuration and controls of the source system are abstracted away from users, who only need to work with the Foundry platform. Palantir Foundry also supports virtual tables as outputs, which means the results of transforms performed in Foundry can be written back to the external system.

Conclusion

As enterprises scale, the amount of data scattered across systems continues to grow. Copying or moving this data around is expensive, slow, and introduces lag between when data is generated and when it's available for analysis. Federated Data Architectures solve this by letting teams query and work with data where it already lives, without duplicating it.

Technologies like Virtual Tables, Iceberg Tables, and centralized metadata catalogs make this possible. They allow queries to reach into different systems, combine results, and return them as if they came from a single source. This approach reduces infrastructure costs, simplifies data governance, and helps avoid vendor lock-in.

In practice, platforms like Palantir, Snowflake, and Salesforce have already built these capabilities into their ecosystems, showing how zero-copy federation can make enterprise data more fluid and accessible.

Federated architectures shift the focus from moving data to connecting it, which helps organizations get value from their information faster, with less overhead and greater flexibility as they evolve.