Iceberg v3, Two Talks, One Lakehouse
About two tech talks I presented recently
1,244 words, ~ 6 min read
perspective
tech talk
legal note: all information in this post is public information (available to anyone online) and I do not write on behalf of or represent any other entity. there is nothing confidential or proprietary here.
2025 has been off to an interesting start. In the past 2 months, I’ve presented two tech talks:
- TampaDevs January Meetup, 1/7, Fishing for AI-Powered Insights: Lakehouse Technologies
- Apache Iceberg SF Meetup, 2/27, The Invisible Ink of Iceberg Deletions
I want to do two things in this post: describe the general subject matter of each talk and share my reflections on how each event went.
The Lakehouse
This is a very high-level description of the subject matter of the talks. I plan to write a more detailed follow-up, since I need something that explains what I work on without too much jargon anyway.
So, what’s the one-line answer to what you work on?
I work on storage.
Storage? Isn’t that a solved problem? What exactly do you work on?
Specifically, very large scale storage. The device you’re reading this on probably measures its RAM and hard drive/SSD in GB (gigabytes). Some devices even hit the TB (terabyte) scale for the drive, which is 1,000x bigger than a GB. The scale I work at is typically in the PB (petabyte) range, which is 1,000x bigger than a TB. Sometimes, it’s even in the EB (exabyte) range, which is 1,000x bigger than a PB.
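To make those jumps concrete, here’s a quick back-of-the-envelope sketch (using decimal units, where each step is a factor of 1,000):

```python
# Back-of-the-envelope: each unit is 1,000x the previous (decimal/SI units).
GB = 10**9    # gigabyte
TB = 10**12   # terabyte
PB = 10**15   # petabyte
EB = 10**18   # exabyte

laptop_drive = 1 * TB  # a roomy laptop SSD

print(PB // laptop_drive)  # 1000 -> a petabyte is ~1,000 laptop drives
print(EB // PB)            # 1000 -> an exabyte is 1,000 petabytes
```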
Sure, this data is big. But why is that a problem? Can’t you just scale up?
Typically, data is stored in databases (very often SQL databases like PostgreSQL). However, at this very large scale, compute (what executes a query to retrieve data) and storage (where the data actually lives) are split. Your computer can be thought of as having both “compute” (the CPU executing instructions) and “storage” (the disk drive), but in a single machine the two are intertwined.
At this scale, we have to store data cheaply using cloud object storage (like Amazon S3, which you might see hosting PDFs, videos, or images). There are also typically many machines accessing the data, which requires scalability. Behind the scenes, these object storage systems are actually complex distributed systems. They’re just designed so well that we can treat them as simply as a file upload. These systems are known as data lakes.
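To give a sense of how much complexity object storage hides, here’s a minimal sketch using boto3 with Amazon S3; the bucket and key names are made up for illustration:

```python
# Minimal sketch: to the user, cloud object storage is "put a file, get a file,"
# even though a large distributed system runs behind the scenes.
# Bucket and key names below are hypothetical.
import boto3

s3 = boto3.client("s3")

# Upload ("put") an object into the data lake.
with open("part-0000.parquet", "rb") as f:
    s3.put_object(Bucket="my-data-lake",
                  Key="events/2025/01/07/part-0000.parquet",
                  Body=f)

# Download ("get") it back.
obj = s3.get_object(Bucket="my-data-lake",
                    Key="events/2025/01/07/part-0000.parquet")
data = obj["Body"].read()
```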
But if you have a table and you’re adding just a single row, it’s challenging to deal with the many limitations that come with storing plain files in cloud object storage. To provide the illusion of a table, we use table formats and file formats. Apache Iceberg and Delta Lake are two such table formats, and both use metadata layers on top of Apache Parquet as the file format. The metadata tracks what’s happening and the current state of a table, though the exact representations differ slightly across the formats. These are all open source, so data stored in these formats can be accessed by any service. Using these formats transforms the data lake into the lakehouse (a portmanteau of data lake and data warehouse).
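As a rough sketch of that layering, here’s what reading an Iceberg table looks like with pyiceberg (the catalog and table names are placeholders). The metadata layer resolves which Parquet files make up the table’s current state, and any engine that speaks the open format can do the same:

```python
# Sketch: reading an Iceberg table with pyiceberg. The table format's metadata
# layer decides which Parquet data files represent the current table state.
# Catalog and table names are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("my_catalog")            # catalog connection configured elsewhere
table = catalog.load_table("analytics.events")

# A scan walks snapshot -> manifests -> Parquet data files, then reads the rows.
rows = table.scan().to_arrow()
print(rows.num_rows)
```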
I work at Databricks, which created Delta Lake. In Summer 2024, it acquired Tabular, which was founded by the original creators of Apache Iceberg. I work on the storage team across both formats. The exact features and products we work on are announced regularly on the Databricks blog and at the Data + AI Summit.
So what are the talks about?
The first talk goes into detail from the ground up on how these systems work and is a good (but long) introduction. It doesn’t require any prior technical knowledge.
The second talk goes into detail about a specific limitation around row-level deletes with this file-based system. This does assume familiarity with Apache Iceberg, since it was at an Iceberg-specific meetup.
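For a flavor of the general area that second talk digs into, here’s a hedged sketch of a row-level delete against an Iceberg table from PySpark, assuming a Spark session already configured with an Iceberg catalog; the catalog, table, and column names are placeholders. With merge-on-read, the delete is recorded in separate delete files rather than by rewriting the data files.

```python
# Sketch: a row-level delete on an Iceberg table from PySpark.
# Assumes a SparkSession already configured with an Iceberg catalog named
# "my_catalog"; table and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With merge-on-read, Iceberg writes delete files that mark rows as removed
# instead of rewriting the underlying Parquet data files.
spark.sql("""
    DELETE FROM my_catalog.db.events
    WHERE event_date < DATE '2024-01-01'
""")
```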
TampaDevs January Meetup, 1/7

TampaDevs is a developer community that meets monthly to host developer talks (and other developer related events such as networking).
This talk was challenging because it needed to explain the details of the lakehouse architecture to a relatively nontechnical audience. From my research on past talks, many of them dived straight into the end result without going into the why or the patterns that mattered. My goal was to build up to the final data architecture by walking through various problems to set the stage.
The actual talk went pretty well. It was only supposed to run for an hour, but there were a lot of questions, so it ended up going an hour and a half. I distinctly remember an interjection from one of the organizers about including an example relating to banks for ACID transactions. If you watch the recording, you can see me visibly reboot and clear my mind as I depart from what I rehearsed to work in the bank example.
What surprised me the most was how easily some of my friends understood the topics. One of my friends is pursuing a career in public health and asked me why I didn’t emphasize the double ETL cost on one of the slides.
Apache Iceberg SF Meetup, 2/27

Apache Iceberg is an open-source project (described above), with meetups happening around the world (SF, Seattle, Palo Alto, Amsterdam, and more) to discuss topics related to the project.
This talk was challenging for the opposite reason: it was to a very technical audience that knew quite a bit. I spoke with folks at Databricks who contributed to the actual improvements to row-level deletes. I knew how to structure the talk, but my main concern was the questions that would come from the audience. For Apache projects, there’s a Project Management Committee (PMC). PMC members are responsible for guiding the project forward, and typically know a lot about a project’s present, past, and future. On the other hand, I had maybe ~6 months of experience with data engineering as a whole.
To my surprise, that was enough. I’ve been operating with the mentality of defining myself by what I can do, not by what people expect me to be able to do given my experience. This really came from a tweet I saw about new-grad advice: a lack of experience isn’t a sufficient excuse for anything. I don’t think I’m perfect at this by any means. But standing up there and seeing folks who have been contributing to Iceberg for many years nodding throughout the talk gives me a great deal of confidence that I do know a thing or two.
Closing Thoughts
I’ve always been interested in giving technical talks. I always thought it’d be a few years down the line when I had more experience building and shipping, though. No idea when the next talk will be, but super grateful to have these opportunities.
In general, I’ve been adding fewer posts than I used to. This blog began because I didn’t write a lot in college, but I felt I had interesting ideas I was thinking about. These days, I write a lot at work. I also can’t write about most of the things I’m thinking about, since it’s internal strategy information that can’t be publicly revealed. I’ll still try to write about other things, but it might be at a lower frequency for now. Stay tuned.
Found this interesting? Subscribe to get email updates for new posts.