Audience with Fivetran CEO George Fraser and VP of Product for Databases and Destinations Dan Lynn
Dan Lynn on why there is a renaissance in data lakes
Changing definition – “Started off as dumping raw files into Hadoop, evolved into object storage, and people kept dumping raw files and added a little more structure over time, and the need for structure and transactions really led to the more modern table formats”
Key enabling concepts:
- Centralise on an interoperable data platform, enabling customers to reduce the copies of their data across the data lifecycle
- Use different warehousing and query-processing technologies to accommodate specialised workloads, saving a lot of cost in the process
- Future-proof their architecture
George Fraser emphasises that they are building support for Iceberg and open table formats
Open table formats are key to Fivetran being able to support data integration – “Fivetran delivers data at a very high level of abstraction. When you replicate data with Fivetran, you don’t get files, certainly not changelogs; you get a replica of the source, whether Salesforce or an Oracle database, in a normalised representation, with a set of tables that reference each other.
It involves a lot of complexity, but it means that for data lakes, we need a way to offer that level of abstraction in an object storage environment, and for that what we needed was an open table format. Now, in Iceberg, we have that.”
Dan Lynn
Open Table format – what support is available for customers?
In all technology transformations, you find there’s an inflection point in the market. You find competing standards, especially in open source. Eventually one or two technologies tend to dominate.
“We’re working with our partners, and with our customers, to land the data in the way that’s most accessible to them, so that the data tables are widely adopted in the ecosystem.
Fivetran’s role in that is really to get the data accessible to our customers in the destination of their preference.”
George Fraser
Delivering all your information in one place, in a nice relational format, is something a data lake environment doesn’t make easy.
A lot of work went into building, first of all, an ingest service capable of doing what we want to do with databases.
When we get, for example, notification of an update, we don’t want to just append it to a data table.
“We want to go find that record and update it. And we have a way of delivering historical data that shows every version of every row that’s ever existed.
It took a huge effort from some really great engineers at Fivetran to build an ingest service capable of delivering data to a data lake the way that we want to, so open table formats were a key enabling technology. But there’s still development to be done on those open table formats.”
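The update-in-place behaviour described above can be sketched in a few lines. This is a hypothetical illustration, not Fivetran's actual ingest service: on each change notification it rewrites the current view of the record while an append-only history keeps every version of every row.

```python
from dataclasses import dataclass, field

@dataclass
class IngestTable:
    """Toy model of an ingest target: a 'current' view plus full row history."""
    current: dict = field(default_factory=dict)   # primary key -> latest row
    history: list = field(default_factory=list)   # every version ever seen

    def apply_change(self, key, row, version):
        # Record the version in the immutable history first...
        self.history.append({"key": key, "version": version, **row})
        # ...then update the queryable 'replica of the source' in place.
        if row.get("_deleted"):
            self.current.pop(key, None)
        else:
            self.current[key] = row

t = IngestTable()
t.apply_change(1, {"name": "Ada"}, version=1)
t.apply_change(1, {"name": "Ada Lovelace"}, version=2)
# t.current holds one row (the latest); t.history holds both versions
```

The point of the sketch is the split: the destination table always looks like the source, while the history table supports the "every version that's ever existed" delivery mode.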
Dan Lynn
“A lot of people talked about data lakes as append-only: a bucket that you just dumped your changes into and then sorted out later. Reconstructing ‘what’s the latest version of this record’ puts a lot of burden downstream, on the downstream query engines, and it makes it feel less like an analytical data warehouse that you’re working with.
Open table formats enable ACID transactions, and something has to do that work: it has to compute what the out-of-date version of a record was, and what it has today, and make that resolution in a way that makes the data usable for the customer.
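The downstream burden Dan describes can be made concrete. A minimal sketch, with illustrative field names: when the lake is just an append-only changelog, every consumer has to scan all changes and resolve the latest version per key itself.

```python
def latest_versions(changelog):
    """Resolve 'what's the latest version of this record' from raw changes.

    Assumes the changelog is ordered oldest -> newest; later entries win.
    """
    latest = {}
    for change in changelog:
        if change["op"] == "delete":
            latest.pop(change["key"], None)
        else:
            latest[change["key"]] = change["row"]
    return latest

changes = [
    {"key": 1, "op": "insert", "row": {"status": "open"}},
    {"key": 1, "op": "update", "row": {"status": "closed"}},
    {"key": 2, "op": "insert", "row": {"status": "open"}},
    {"key": 2, "op": "delete"},
]
latest_versions(changes)  # -> {1: {"status": "closed"}}
```

With an open table format providing transactional upserts, this resolution happens once at write time instead of in every downstream query.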
George Fraser
Dealing with data changes downstream is an idea that has a lot of appeal to people, but it means the data is totally raw, and unworkable.
There’s a sweet spot in how much processing you want to do: it’s not nothing, it’s not a full dimensional schema, it’s not master data management; it’s something in between. I want to present the table as it was in the source, and that’s the table I want to present in the destination.
Fivetran’s Managed Data Lake Service manages that change tracking and keeps records updated to their most recent version; the result is just a queryable table in your data lake.
Customers can bring a variety of different compute options to that data, in some cases multi-cloud as well as single-cloud, allowing you to choose the right compute for the table.
You don’t have to use the same compute for data transformation activity as you would for deep analytics use cases, where you’re crunching much more deeply… putting that together gives users flexibility that opens a lot of benefits, especially in terms of cost savings and interoperability…
If you do an acquisition, you might bring in a whole other data warehouse that wasn’t part of your stack before, and (our offering) lets you get up and running with that new stack much more quickly.“
George Fraser
Catalogues enable multiple query engines to work against the same tables in a data lake.
- Rich business metadata catalogues, used for full business governance
- A technical catalogue indexing your tables’ storage updates
What are the views and permissions for all of that?
“There’s too much choice; people would be better off using the Iceberg REST API-based catalogue, which is the best option right now because, as with all things in Iceberg, it’s minimalist. And that makes it easier for multiple vendors to adopt. … I’m hopeful that the ecosystem will centralise around that solution.
And we do host an implementation of the REST API catalogue, but we also connect to other catalogues and write data to them.”
“We’re in the incidental complexity business”
Six catalogues supported currently; trying to consolidate
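The "technical catalogue" role discussed above is small but crucial: a shared mapping from table name to the table's current metadata pointer, with an atomic swap on commit. The class below is a conceptual sketch with hypothetical names, not the Iceberg REST catalog specification; the paths are illustrative.

```python
class MinimalCatalog:
    """Toy technical catalogue: table name -> current metadata location."""

    def __init__(self):
        self._tables = {}  # "namespace.table" -> metadata file location

    def register(self, name, metadata_location):
        self._tables[name] = metadata_location

    def commit(self, name, new_metadata_location, expected_current):
        # The atomic compare-and-swap is the catalogue's core job: it lets
        # multiple engines write to the same table without clobbering each other.
        if self._tables.get(name) != expected_current:
            raise RuntimeError("concurrent commit detected; retry")
        self._tables[name] = new_metadata_location

    def load(self, name):
        # Any query engine resolves the table through the same pointer.
        return self._tables[name]

cat = MinimalCatalog()
cat.register("sales.orders", "s3://lake/orders/metadata/v1.json")
cat.commit("sales.orders", "s3://lake/orders/metadata/v2.json",
           expected_current="s3://lake/orders/metadata/v1.json")
```

Because every engine goes through this one pointer swap, the catalogue is what makes "six catalogues, trying to consolidate" a real interoperability question: each vendor's catalogue implements the same basic contract.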
What are the benefits of interoperability in the data stack? Of applying this data lake architecture?
Implicit vs explicit – difference between “storage” and “compute”
- Avoid having multiple copies of the data
- Customers have multiple analysis compute engines and if they don’t have a data lake, they have to store multiple copies of the same data to make it accessible for multiple engines – significant cost savings
- For smaller customers it’s largely about future-proofing. The benefit really depends on vendors like Fivetran making data lakes as easy to adopt as tightly integrated storage and compute engines that are designed to work together.
- With Fivetran you save ingest costs, which are “huge, the unnoticed cost of data warehousing”, representing 20-30% of the total data warehouse compute cost for the typical customer.
“If you look at a sample of customer queries that have been published by Snowflake and Redshift, that is the number that you see.”
“Fivetran internalises the ingest cost; we absorb that, and because it’s designed to work with Fivetran data pipelines, it’s incredibly efficient. Even as a very small company you will experience considerable savings.”
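The cost claims above can be put into back-of-envelope form. All dollar figures here are hypothetical, and the 25% ingest share is just the midpoint of the 20-30% range quoted:

```python
def ingest_savings(total_compute_cost, ingest_share=0.25):
    """Compute saved if ingest (20-30% of warehouse compute, per the quote)
    is absorbed by the pipeline instead of billed as warehouse queries."""
    return total_compute_cost * ingest_share

def storage_savings(copy_cost, n_engines):
    """Cost avoided by keeping one shared lake copy instead of one
    per-engine copy for each of n_engines analytics engines."""
    return copy_cost * (n_engines - 1)

# Hypothetical example: $100k/yr warehouse compute, three analytics engines,
# $10k/yr per extra data copy.
ingest_savings(100_000)        # -> 25000.0
storage_savings(10_000, 3)     # -> 20000
```

These are sketch numbers only; the source's verifiable claim is the 20-30% ingest share observed in published Snowflake and Redshift query samples.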
Dan Lynn
“A decoupled storage layer giving clients that access means you can integrate with different query engines and those ‘bleeding-edge’ new features; it’s future-proofing – if something new comes along, you don’t have to create a whole new set of copies or rehouse your replication somewhere… it really is a key enabler for businesses.”