

Like so many tech projects, Apache Iceberg grew out of frustration. Ryan Blue experienced it while working on data formats at Cloudera. When he moved to Netflix, “the problems were still there, only 10 times worse,” he said. Those problems included the inability to reliably write to Hive tables, correctness issues and not being able to trust the results from its massively parallel processing database. “We kept seeing problems that were not really at the file level that people were trying to solve at the file level, or, you know, basically trying to work around limitations,” he said.

Apache Iceberg Patch

“At Netflix, I spent a couple of years working around those problems and trying to basically patch them or deal with the underlying format. … I describe it as putting Band-Aids over these problems, very different problems here, there. And we finally just said, ‘You know, we know what the problem is here. It’s that we’re tracking data in our tables the wrong way. We need to fix that and go back to a design that would definitely work.’”

The outgrowth of that frustration is Iceberg, an open table format for huge analytic datasets. With Hive, he explained, the idea was to keep data in directories and be able to prune out the directories you don’t need. That allows Hive tables to have fast queries on really large amounts of data. The problem, though, is that what they were doing was trying to keep track of these directories. So they ended up adding a database of those directories. And then you would go find out what files are in those directories when you needed to query the data.

Iceberg, by contrast, is based on an all-or-nothing approach: an operation should complete entirely and commit at one point in time, or it should fail and make no changes to the table. Anything in between leaves a lot of clean-up work.
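The all-or-nothing commit described in this piece can be sketched with a common implementation pattern: write the complete new table state to a temporary file, then atomically swap it into place, so readers see either the old state or the new one and never a half-written table. This is my own minimal illustration of the idea, not Iceberg's actual code; the file names and JSON layout are assumptions.

```python
import json
import os
import tempfile

def commit(table_dir, new_state):
    """All-or-nothing commit: either the new state fully replaces the old
    one at a single point in time, or the table is left unchanged."""
    target = os.path.join(table_dir, "current.json")  # hypothetical pointer file
    fd, tmp = tempfile.mkstemp(dir=table_dir)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(new_state, f)  # write the entire new snapshot first
        os.replace(tmp, target)      # atomic rename: this is the commit point
    except Exception:
        if os.path.exists(tmp):
            os.remove(tmp)           # a failed commit makes no changes
        raise

def read_state(table_dir):
    """Readers always see a complete state, old or new, never in-between."""
    with open(os.path.join(table_dir, "current.json")) as f:
        return json.load(f)

# Demonstrate two successive commits on a throwaway table directory.
root = tempfile.mkdtemp()
commit(root, {"snapshot": 1, "files": ["a.parquet"]})
commit(root, {"snapshot": 2, "files": ["a.parquet", "b.parquet"]})
state = read_state(root)
print(state["snapshot"])
```

The key design choice is that the rename is the only visible side effect: any work before it can fail and be cleaned up without leaving the table in a partial state.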

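The directory-based Hive layout described above can be sketched as follows. This is an illustrative toy, not Hive's implementation: directory names encode a partition value (e.g. `date=2023-01-02`), and a query only lists files inside the directories it needs, pruning the rest without ever touching them. The partition key and file names here are assumptions for the example.

```python
import os
import tempfile

def list_files_with_pruning(table_root, wanted_dates):
    """Return data file paths, skipping partition directories we don't need."""
    files = []
    for entry in sorted(os.listdir(table_root)):
        # Each partition directory encodes its value, e.g. "date=2023-01-01".
        key, _, value = entry.partition("=")
        if key == "date" and value not in wanted_dates:
            continue  # prune the whole directory: its files are never listed
        part_dir = os.path.join(table_root, entry)
        for name in sorted(os.listdir(part_dir)):
            files.append(os.path.join(part_dir, name))
    return files

# Build a toy partitioned table in a temp directory.
root = tempfile.mkdtemp()
for date in ("2023-01-01", "2023-01-02", "2023-01-03"):
    part = os.path.join(root, f"date={date}")
    os.makedirs(part)
    open(os.path.join(part, "part-0.txt"), "w").close()

# Query for one date: only that directory's files are returned.
pruned = list_files_with_pruning(root, {"2023-01-02"})
print(pruned)
```

This is what makes queries fast on very large tables, and also what makes the design fragile: the list of files is whatever happens to be in the directory at query time, which is the tracking problem Iceberg set out to fix.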