TileDB secures $34M to reimagine databases, not just collect GitHub stars

System aims to clean mess of high-performance analytics cluttering the modern data stack

Flush from securing a $34 million VC investment for his fledgling database company, TileDB CEO Stavros Papadopoulos is not planning on returning to the well any time soon.

A former colleague of database pioneer Michael Stonebraker at MIT, Papadopoulos is optimistic that revenue from database systems designed around multi-dimensional arrays will outpace costs sufficiently to avoid going cap in hand to VCs again.

"I'm a very conservative CEO," he told The Register. "The previous [$15 million] round lasted for three years, although it was supposed to last 18 months. The economic environment right now is horrible and investors are more conservative than they were.

"The funding may last indefinitely because we have revenue: we're not raising money on GitHub stars, we're raising money on actual numbers. We have a lot of revenue coming in based on our projections. If we were cautious, we can become profitable very, very quickly. I will first get to profitability, and then make this decision whether we want to deploy more aggressively or we want to organically grow."

The last couple of decades have seen a number of concerted efforts to reinvent the database and move on from omnipresent relational systems. Object-oriented, wide-column, document, graph, and key-value systems have all vied to find markets where the RDBMS doesn't play. Papadopoulos's notion of a system with a multi-dimensional array as its first-class data structure is aimed squarely at analytical problems.

The advantage of the array approach is that it represents a general system from which relational or vector systems, for example, become special cases, he said. TileDB hopes to provide a mathematical proof showing that the array model is a generalization of the relational model; in effect, that the array model subsumes the relational model.

For example, document databases, such as systems from MongoDB and Couchbase, have become popular with developers owing to their schema-less or schema-lite approach, making it easier to get systems up and running. But there is a cost when it comes to analytics, Papadopoulos argues.

"You may be able to store an image in a document database like MongoDB but you store it as a blob; you're not going to store each pixel separately," he said. "So that image is not analysis-ready. In an object store, you can't slice it. You can't create these multi-resolution images, to be able to zoom in, zoom out, and do that interactively with the cloud.

"The images that we're handling are in the terabyte scale. In a document database, you would have to download the whole file locally, but you may not have enough memory and enough storage to do this. TileDB stores it in a structured way, which is tiled and indexed, so you can slice any portion and you can do analytics in a distributed way – you don't need tons of memory to do this."
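The tiled-and-indexed idea Papadopoulos describes can be illustrated with a toy sketch. It is not TileDB's storage format or API: the tile size, the `build_tiles` and `read_slice` helpers, and the in-memory dictionary standing in for cloud storage are all assumptions made for illustration. The point is only that a small window of a large image can be answered by touching the few tiles that overlap it.

```python
# Illustrative only: a toy tiled layout, not TileDB's actual storage format.
TILE = 4  # hypothetical tile edge length, in pixels


def build_tiles(image):
    """Split a 2D list of pixels into TILE x TILE blocks keyed by tile coordinates."""
    tiles = {}
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            key = (r // TILE, c // TILE)
            tiles.setdefault(key, {})[(r % TILE, c % TILE)] = value
    return tiles


def read_slice(tiles, r0, r1, c0, c1):
    """Return rows r0..r1-1 and columns c0..c1-1, plus the set of tiles read."""
    touched = set()
    out = []
    for r in range(r0, r1):
        row = []
        for c in range(c0, c1):
            key = (r // TILE, c // TILE)
            touched.add(key)
            row.append(tiles[key][(r % TILE, c % TILE)])
        out.append(row)
    return out, touched


# A 16 x 16 "image" whose pixel value encodes its position.
image = [[r * 16 + c for c in range(16)] for r in range(16)]
tiles = build_tiles(image)

# Ask for a 4 x 2 window; only the tiles overlapping it are consulted.
window, touched = read_slice(tiles, 2, 6, 5, 7)
print(len(tiles), "tiles stored;", len(touched), "tiles read")
```

In this sketch the window spans two of the 16 tiles, so the other 14 never need to be fetched, which is the property a blob stored whole in a document database lacks.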

TileDB was born out of Papadopoulos's time as a research scientist at Intel Labs and MIT, working on supporting scientific research. The main focus remains life sciences, where the multitude of X-rays, CAT scans, genomic data, and transcripts play to TileDB's strengths, but there are also opportunities in engineering diagnostics and financial services, he said.

"The way people are solving these problems today is that they're either putting together 10 different tools that are completely different to each other: a relational database, a key value database, bespoke files and formats.

"And then they're hiring big teams of data engineers, and they're building catalogs on top and access control layers and logging layers. Effectively, they're reinventing the database, but to manage other databases, and that's what they call the modern data stack. There are different flavors of the same thing, but they conceal a problem: instead of going back to the roots, and fixing this problem at its core, they're hacking it."

TileDB comes in an open source and a commercial offering. Unlike so-called cloud-native data warehouse systems that mushroomed in popularity over the last decade – including Snowflake and AWS Redshift – TileDB charges a flat license fee based on seats and data volume.

Papadopoulos argued that the pay-as-you-go consumption model for data analytics could create a conflict between sales teams, who want to see consumption go up, and engineering teams trying to make the system more efficient and, as a result, potentially reduce consumption.

Andy Pavlo, associate professor of databaseology at Carnegie Mellon University, said the conceptual foundation of TileDB has some merit. "Multi-dimensional arrays are the only data model that you do not want to store in a native relational DBMS. A row-store scans data 'horizontally,' a column-store scans data 'vertically.'

"But some array query access patterns do arbitrary traversals across different dimensions. Therefore, you want a specialized engine – like TileDB – to handle them. But no major cloud provider offers a hosted array DBMS service, meaning they do not see a sizable market."

Pavlo pointed out that SQL:2023 – the ninth edition of the ubiquitous ISO query language – added support for multi-dimensional arrays (SQL/MDA). TileDB supports SQL.

However, Pavlo said array databases are not necessary for vector analytics, which has come into vogue thanks to escalating interest in large language models in machine learning.

"Vectors are just single-dimension arrays. There is nothing special about them; relational DBMSes have supported them for decades. The vector DBs have added indexes to do fast (approximate) nearest neighbor search," said Pavlo, who is also CEO of database performance management company OtterTune. ®
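Pavlo's point about vectors can be made concrete with a minimal sketch. It treats each vector as a plain one-dimensional list and finds the exact nearest neighbor by scanning every stored vector; the `nearest` helper and the sample data are hypothetical. The approximate indexes he mentions exist to avoid this full scan on large collections, and are not shown here.

```python
import math


def nearest(query, vectors):
    """Exact nearest neighbor by brute force: compare the query with every vector."""
    best_id, best_dist = None, math.inf
    for vid, vec in vectors.items():
        d = math.dist(query, vec)  # Euclidean distance between two 1-D arrays
        if d < best_dist:
            best_id, best_dist = vid, d
    return best_id, best_dist


# Hypothetical sample data: three two-element vectors keyed by an ID.
vectors = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
match, distance = nearest([0.9, 1.2], vectors)
print(match, round(distance, 4))
```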

 
