Managed data lake provider Onehouse announced today that it has raised $25 million in Series A funding to help further its go-to-market and technical efforts based on the open source Apache Hudi project.
A year ago, in February 2022, Onehouse emerged as the first commercial vendor to provide support and services for Apache Hudi. Hudi, an acronym for Hadoop Upserts Deletes and Incrementals, can trace its roots back to Uber in 2016, when it was originally developed as a technology to help sort the massive amounts of data stored in data lakes.
Hudi technology provides a data lake table format and services that facilitate clustering, archiving, and data replication. Hudi competes with several other open source data lake table technologies, including Apache Iceberg and Databricks Delta Lake.
Onehouse’s goal is to create a cloud-hosted service that helps organizations benefit from a managed data lakehouse. Along with the new funding, Onehouse also announced its Onetable initiative, which aims to enable users of Iceberg and Delta Lake to interoperate with Hudi. With Onetable, organizations can use Hudi to ingest data into a data lake while still benefiting from query engine technologies that run on Iceberg, including Snowflake, and on Databricks’ Delta Lake.
“We’re really trying to build a new way of thinking about data architecture,” Onehouse founder and CEO Vinoth Chandar told VentureBeat. “We strongly believe that people should start with an interoperable lakehouse.”
Understanding Data Lakehouse Trends
Data lakehouse is a term originally coined by Databricks.
The goal of a data lakehouse is to combine the best aspects of data lakes, which provide massive data storage, and data warehouses, which provide structured data services for query and data analysis. A 2022 report from Databricks identified a number of key benefits of a data lakehouse approach, including improved data quality, increased productivity, and better data collaboration.
A key component of the data lakehouse model is the ability to apply structure to the data lake, which is where open source data lake table formats including Hudi, Delta Lake, and Iceberg come in. Multiple vendors are now building complete platforms using these table formats as a foundation.
Among the many supporters of Apache Iceberg, Cloudera launched its data lakehouse service in August 2022. Dremio is another strong Iceberg supporter, using it as part of its data lakehouse platform. Even Snowflake, one of the pioneers of the cloud data warehouse concept, now supports Iceberg.
Onetable is not another data lake table format
At their core, today’s major data lake table formats, including Hudi, Delta Lake, and Iceberg, organize files that organizations expect to be able to use for analytics, business intelligence, or operations.
One challenge that has emerged, however, is that vendor technologies are increasingly vertically integrated, combining data storage and query engines. Kyle Weller, director of product at Onehouse, explained that he’s seeing organizations get confused about which vendor to choose based on the data lake table format it supports. The Onetable approach aims to abstract away the differences between data lake table formats to create an interoperability layer.
“Onehouse’s goal and mission is to decouple the data processing data query engine from how the core data infrastructure operates,” Weller told VentureBeat.
Weller added that the foundation of many data lakes today is files stored in the Apache Parquet data storage format. Onetable essentially provides a metadata layer on top of Parquet that makes it easy to convert from one table format to another.
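The idea of a metadata layer over shared Parquet files can be illustrated with a toy sketch. Everything in it is hypothetical: the `TableSnapshot` class, the two made-up formats "A" and "B", and the file paths are assumptions for illustration only, and none of it reflects Onetable's actual API or the real metadata layouts of Hudi, Delta Lake, or Iceberg. It only shows the general principle that converting between formats can mean rewriting metadata while the data files stay untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSnapshot:
    """Hypothetical format-neutral description of a table."""
    name: str
    data_files: tuple  # paths to Parquet files; these are never rewritten


def to_format_a(snapshot):
    """Render metadata for a made-up 'format A' (flat list of file paths)."""
    return {
        "format": "A",
        "table": snapshot.name,
        "files": list(snapshot.data_files),
    }


def to_format_b(snapshot):
    """Render metadata for a made-up 'format B' (manifest of entries)."""
    return {
        "format": "B",
        "table": snapshot.name,
        "manifest": [{"path": path} for path in snapshot.data_files],
    }


def referenced_files(metadata):
    """Return the set of data files that a metadata document points at."""
    if metadata["format"] == "A":
        return set(metadata["files"])
    return {entry["path"] for entry in metadata["manifest"]}


# Hypothetical table with two Parquet files.
snapshot = TableSnapshot(
    name="trips",
    data_files=("lake/trips/part-0.parquet", "lake/trips/part-1.parquet"),
)
meta_a = to_format_a(snapshot)
meta_b = to_format_b(snapshot)

# Both metadata views describe exactly the same underlying files.
same_files = referenced_files(meta_a) == referenced_files(meta_b)
```

In this sketch the two metadata documents differ in shape but resolve to the same set of files, which is the property an interoperability layer of this kind would rely on.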
Where Onetable fits the data lakehouse use case
Chandar noted that Hudi offers advantages over other formats, such as transactional replication and fast data ingestion.
One potential use case where he sees Onetable’s capabilities as a good fit is for organizations that use Hudi for high-volume data ingestion, but want to be able to use the data with other query engines or technologies, such as a Snowflake data cloud deployment, for some type of analysis.
Many companies whose data is stored in data warehouses are increasingly deciding to build a data lake, either because of cost considerations or because they want to start a new data science team, Chandar said. The first thing these organizations do is data ingestion, bringing all their transactional data into the lake, which is where Chandar said the Hudi and Onehouse services excel.
Now, taking advantage of Onetable technology, the same organization that brings data into Onehouse can also query and analyze the data using other technologies such as Snowflake and Databricks.
Looking ahead for the Hudi project and the Onehouse platform, Chandar emphasized that further improving organizations’ ability to put data to use quickly will remain a key theme.
“We’ve announced in the Hudi project that we hope to add a caching layer at some point,” he said. “We’re thinking about anything around the data and how we can really optimize it.”