Transcript
Craig Dunham [00:00:00]: 90% of the data that organizations have just goes unused. And so in addition to this mission that we're on around making data analytics just really more cost efficient, more affordable, it's also awakening companies and organizations a bit to the art of the possible. And we want to make sure that people know that they can unlock this and there's tons of insights that they can get.
Daniel Darling [00:00:25]: Welcome to the Five Year Frontier Podcast, a preview of the future through the eyes of the innovators shaping our world. Through short, insight-packed discussions, I seek to bring you a glimpse of what a key industry could look like five years out. I'm your host Daniel Darling, a venture capitalist at Focal, where I spend my days with founders at the very start of their journey to transform an industry. The best have a distinct vision of what's to come, a guiding North Star they're building towards, and that's what I'm here to share with you. Today's episode is about the future of data infrastructure. We cover the explosion in compute demand, the petabytes of untapped enterprise data, energy-efficient GPUs, DeepSeek, the $500 billion Stargate project, and how AI is transforming data processing. Guiding us will be Craig Dunham, CEO of Voltron Data, a company at the forefront of accelerating data processing for AI analytics and enterprise-scale workloads. Voltron provides the infrastructure necessary to handle enormous amounts of data, transforming bottlenecks into breakthroughs. They are championing open source frameworks like Apache Arrow, and Voltron is building the connective tissue that allows businesses to process data with orders-of-magnitude gains in speed and efficiency, reshaping industries from finance to healthcare to national security, and partnering with the likes of Snowflake and Meta. Voltron have established themselves as a key part of the AI infrastructure stack and have raised a total of $110 million from the likes of Coatue, Lightspeed, Google Ventures, and BlackRock. With a deep background in scaling data infrastructure, Craig is Voltron's CEO. Before Voltron, Craig was the CEO of Lumar, a leading SaaS technical SEO platform. Prior to that, he held significant roles including General Manager at Guild Education and at Seismic, where he led the integration of Seismic's acquisition of the Savo Group and drove go-to-market strategies in the financial services sector. Craig began his career in investment banking with Citi and Lehman Brothers before transitioning into technology. He holds an MBA from Northwestern's Kellogg School of Management. Craig, so nice to see you. Thanks for coming on to talk with me today.
Craig Dunham [00:02:34]: Oh, thanks for having me. I appreciate it. I love doing things like this.
Daniel Darling [00:02:38]: Talk about data, because processing data is more important than ever. We have an increasing amount of it, and it's obvious just how valuable that data has become. But for those outside of the industry, data processing seems to be a complex and invisible part of the tech ecosystem. Can you help us appreciate the big picture of how your industry works and how Voltron fits in?
Craig Dunham [00:03:01]: Data processing is essentially the backbone of almost every business decision. You can think about it in terms of pricing strategies, risk management, fraud, or anomaly detection. Companies are gathering data from countless sources: transactions, sensors, web traffic, you name it. This data needs to move through three main stages: storing the data; computing and transforming the data, where it gets filtered, sorted, and analyzed; and then delivery, where it can power insights or business decisions. The challenge really comes from the fact that as data continues to grow and gets larger, the process of doing that gets really slow and expensive. And then this AI and ML craze that's taken over the world has actually exacerbated the problem by requiring more data to train and more data to make predictions. Inefficient processing of data can then create bottlenecks and limit how much of this data can actually be used.
Daniel Darling [00:04:01]: Can you explain how this life cycle has really evolved in this current modern age of AI?
Craig Dunham [00:04:07]: So first you have the collection of the data, which brings in raw data from those various sources. And if you want to make this really practical, think about your phone, various sensors on an automobile, social media, business transactions, logs that come from using specific systems. And that data can come in a number of formats. It could be structured data, as in tables and rows, similar to Excel, or really unstructured, think images, videos, just raw text. And so the storage layer is the foundational piece, where companies use platforms in the market today, like AWS or Oracle or Google Cloud, to hold and store that data. But the analogy I sometimes like to use is that raw data is a bit like uncooked ingredients, right? It's not useful until it's been processed; then you can make a meal out of it. And that's where things, again, my complete bias, get really interesting: the processing stage. That's what Voltron Data does. Data processing really involves transforming that raw data into a format that can then be used for analysis. And so this includes things like joining together data sources, aggregating, running mathematical computations on that data, filtering out irrelevant records, and sorting the data to make it easy to read and analyze. The stage after that is, okay, we've now processed that data, we then hand it off to some other technology for insight generation. And today that handoff is often to a machine learning or a deep learning system, things like PyTorch or XGBoost or TensorFlow, or even something like Power BI or Tableau. And these systems use that cleansed data to then train models or make predictions. And those insights that are then derived from those tools are what drives those decisions.
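To ground the processing stage Craig walks through, here is a minimal sketch in Python of that join, filter, aggregate, sort flow and the handoff to a downstream ML library. The file names, column names, and the choice of pandas and XGBoost are illustrative assumptions for this example, not a description of Voltron Data's engine.

```python
# A minimal sketch of the processing stage described above: ingest raw data,
# join it, filter out irrelevant records, aggregate, sort, then hand the
# cleansed result to an ML library. All names here are hypothetical.
import pandas as pd
import xgboost as xgb

# Collection/storage: raw data arriving from two hypothetical sources.
transactions = pd.read_parquet("transactions.parquet")  # structured rows
customers = pd.read_csv("customers.csv")

# Processing: join sources, drop bad records, aggregate, and sort.
df = transactions.merge(customers, on="customer_id", how="inner")
df = df[df["amount"] > 0]
features = (
    df.groupby("customer_id")
      .agg(total_spend=("amount", "sum"), n_orders=("amount", "count"))
      .sort_values("total_spend", ascending=False)
      .reset_index()
)

# Handoff: a downstream system (here XGBoost) trains on the cleansed data.
labels = (features["total_spend"] > features["total_spend"].median()).astype(int)
model = xgb.XGBClassifier(n_estimators=50, eval_metric="logloss")
model.fit(features[["n_orders"]], labels)
```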
Daniel Darling [00:05:56]: It's so broadly applicable, this technology. So I'm guessing you serve a whole array of customers, and maybe you can tell us: where do you see the most demand from a customer perspective for this type of solution? What are the pain points that you're looking to address?
Craig Dunham [00:06:10]: So at Voltron Data, we solve really that data processing, that central middle piece of the flow, and we accelerate data processing at scale. And so we leverage GPUs as a technology. GPUs are really beneficial because they enable parallel tasks and can speed up those transformations by meaningful orders of magnitude. And the types of customers that we serve: financial services is one of the largest. They rely on data for things like fraud detection, credit risk scoring, portfolio optimization. Cybersecurity, so think about anomaly and threat detection in that area, where companies need to sort through billions of records to identify attacks or breaches. Think large retailers who have to predict and manage inventory across thousands of stores and hundreds of thousands of products and SKUs; if they can't predict demand accurately, it can lead to stockouts or overstocking, which then become costly. Automotive, where it's really critical for supply chain visibility. There's the federal government, if you think about geospatial analytics or some sort of threat that we need to protect against. Just massive volumes of data that we can serve quite well.
Daniel Darling [00:07:20]: Speaking of these massive volumes, can you give me an idea of how much data sits within the kinds of financial services organizations that you described, and how much that has changed in recent years in terms of its growth?
Craig Dunham [00:07:35]: A lot of our customers have petabyte-scale data sets, and I'm not sure if the audience can conceptualize what that means. But if we use the example of a financial institution, it's basically one big data engine where they're processing transaction data around the clock. So think billions of payments that are being made, trades that are being made, loans that are happening, all at the same time. If you think about, again, a large retailer where you've got thousands of products across thousands of stores, every single purchase, every movement in the supply chain, every warehouse update generates a stream of data way beyond what an Excel spreadsheet could manage. It's literally billions of rows and columns of data that need to be thoughtfully processed to enable someone to make really smart insights.
Daniel Darling [00:08:22]: If you look at this moment in time that we're in, you have enterprises that are really hungry to use AI and are sitting on ever-increasing amounts of data. But I guess your role is really making sure that that data can be unlocked with the most value and unified for them, which is probably an ever-increasing role in the industry. And a key way you're doing that is through open source frameworks, which is obviously a big topic of discussion, like Apache Arrow. How does that enhance your ability to handle data efficiently?
Craig Dunham [00:08:54]: They make it really easy for data to move and transform efficiently. If you think about data systems today, across the vast number of organizations, they're really diverse: folks are using different tools, different platforms, different programming languages, and they all need to seamlessly interact with one another. Open source standards like Apache Arrow really act as this glue that enables smooth data integration across these systems without costly, overly complex, or custom solutions. And then there's the community element of it, which we also love and subscribe to, where we make sure that these frameworks will remain relevant, future-proofing in a way, because as they evolve they'll improve, and they're backed by this incredible global community of developers and users.
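As a small illustration of the "glue" role Craig describes for Apache Arrow, the sketch below builds one in-memory Arrow table from pandas and writes it in Arrow's IPC (Feather v2) format, which other languages and engines can read without lossy re-serialization. The data and file name are made up; this shows one common Arrow usage pattern, not a Voltron-specific workflow.

```python
# A sketch of Apache Arrow as the interchange layer between tools: the same
# columnar table can move between pandas and Arrow, and be persisted in a
# format other languages (C++, Rust, Java, ...) read directly.
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

pdf = pd.DataFrame({"user_id": [1, 2, 3], "spend": [9.5, 42.0, 7.25]})

# Convert to an Arrow table: a language-agnostic, columnar in-memory layout.
table = pa.Table.from_pandas(pdf)

# Persist in Arrow IPC (Feather v2) so another process or engine can pick it
# up without a custom connector or costly conversion step.
feather.write_feather(table, "spend.arrow")

# A different tool in the pipeline reads the same bytes straight back.
roundtrip = feather.read_table("spend.arrow")
print(roundtrip.schema)       # schema preserved exactly
print(roundtrip.to_pandas())  # and back to pandas when a tool needs it
```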
Daniel Darling [00:09:41]: One of the big changes that's happened over the last couple of years, especially with the rise of Nvidia, is on the amount of compute going from CPUs to GPUs and the unlock that that has enabled for data processing and for AI. So what have you seen with that kind of evolution and where do you see it going?
Craig Dunham [00:10:02]: So GPUs were originally designed for gaming, and they've now been reapplied and have really transformed data processing by enabling these large-scale tasks to run much faster and much more efficiently through this idea of parallel processing. So think about processing multiple things at the exact same time; it makes them really ideal for these complex workloads: AI, data analytics, machine learning. And Nvidia of course has led this shift with innovations like CUDA and RAPIDS, which accelerate traditional workflows, data transformation, ETL as it's sometimes called, all on GPU hardware. And again, it's about speed to insights, the ability to extract insights from large volumes of data. And so this demand for GPUs goes well beyond all the buzz and all the hype today around AI. There are a bunch of industries, some of the ones that we've talked about, that have been relying on GPUs for a number of years for large-scale analytics and large-scale data processing. And again, it's all about this parallelism, where you can process multiple things all at the same time. It's just a central and core part of modern data infrastructure.
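Below is a brief, hedged sketch of the GPU-accelerated ETL Craig alludes to, using RAPIDS cuDF, which mirrors the pandas API but executes on an NVIDIA GPU. It assumes a CUDA-capable GPU and an installed RAPIDS environment; the file and column names are hypothetical.

```python
# GPU-accelerated ETL with RAPIDS cuDF: the same filter/groupby/sort steps
# as a CPU pandas pipeline, parallelized across GPU cores.
import cudf

# Load a large (hypothetical) transaction table directly into GPU memory.
gdf = cudf.read_parquet("transactions.parquet")

# Transform: filter out bad records, aggregate per store and day, sort.
gdf = gdf[gdf["amount"] > 0]
daily = (
    gdf.groupby(["store_id", "date"])
       .agg({"amount": "sum", "order_id": "count"})
       .reset_index()
       .sort_values("amount", ascending=False)
)

# Move results back to the CPU (pandas) only when a downstream tool needs it.
print(daily.head(10).to_pandas())
```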
Daniel Darling [00:11:18]: And do you find that the enterprise is increasingly aware of the value of the data that they're sitting on, and that we're still in the early days of them processing it and wanting to unlock that data? Or do you find that it's quite mature and they're quite far along in that process?
Craig Dunham [00:11:35]: Companies are not as far along as we would hope. You know, we often talk about the fact that a lot of these organizations are sitting on petabyte and even exabyte-sized data sets and they're not even using that data. I saw a stat somewhere recently that says 90% of the data that organizations have just goes unused. And so, in addition to this mission that we're on around making data analytics really more cost efficient, more affordable, it's also awakening companies and organizations a bit to the art of the possible. And we want to make sure that people know that they can unlock this and that there are tons of insights they can get. And that's part of our mission and part of our vision as well.
Daniel Darling [00:12:20]: Absolutely. And it's a really exciting macro story for economic progress and growth. The counter to a lot of this value is the cost to do some of this processing and to run this kind of analysis. You're talking about such a huge volume of data, but it's quite remarkable. Looking at Voltron, you claim that you can allow your customers to go from 200 servers to two and spend $2 instead of $200 for a data processing job run. So that's two orders of magnitude of improvement on cost. How are you able to do that?
Craig Dunham [00:12:53]: It's a bold claim and an accurate one. There's always an "it depends," and we won't get too much into that today; that really is our secret sauce. Some of the things that we've discussed earlier, it really does start with building what we call an accelerator-native architecture, which is designed to fully optimize GPUs, which are, again, hardware built for parallel processing and really high-speed performance. But it doesn't stop there. We've invested years into studying and solving the bottlenecks that can arise in data systems, things like networking delays, memory constraints, and inefficiencies in how data gets moved around. And we've built our query engine, our data processing engine called Theseus, with a really deep understanding of how hardware interacts with data. We've eliminated a lot of these performance roadblocks, again allowing our customers to drastically reduce the number of servers that they have to use. The other thing that I'll speak to is how our system is built with flexibility and integration very much in mind. It's about how we can seamlessly work alongside other data engines and with existing architectures, which allows companies to choose the right tool for the right workload. We don't claim to be the right engine for everything. We are a large-scale data processing engine, and there are some smaller-scale engines that are better suited for those jobs. And so for us, we always go in with a, hey, give us your biggest, scariest, most time-consuming workloads. Let us take those and you'll find tons of efficiencies throughout the entire pipeline.
Daniel Darling [00:14:28]: And it's a really great achievement to drop those costs. Is there more juice to squeeze on that curve or how can people over the years ahead get more performance at low cost out of their GPUs, out of their data centers? And where do you see that going?
Craig Dunham [00:14:43]: Within a server, within a computer, there are a number of components that you can pull on, that you can leverage, that will help you get better speed, better performance. Our mission, our vision, is really around how we find every single one of those components that today goes unutilized and help you unlock that.
Daniel Darling [00:15:08]: As part of that conversation of getting more performance, one of the big topical points around AI and data processing in 2025 so far has been DeepSeek and the innovation on the model side. I'd love to get your take on DeepSeek and how it's really impacting the work that you do.
Craig Dunham [00:15:25]: It is really hard to ignore the hype, to ignore the excitement. It has for certain made some headlines and for a lot of good reasons. One of the really impressive things about DeepSeek and their release is how they've shifted away from, call it the traditional supervised learning approach and they've instead opted for this unsupervised learning. And by doing so they have really opened up a path that we think could lead to more efficient, more powerful reasoning models. I expect a bunch of folks in the industry are going to take note and will start to incorporate a lot of these techniques, which should then result in better models. Another interesting move was they optimized how they worked with the hardware, essentially programming the GPUs in a way that allowed them to get much better performance. And by taking advantage of the hardware's capabilities, they have achieved some pretty impressive results. And so it's a nice example of what happens when you optimize for the hardware that you have, which is something that we really resonate with at Voltron Data. It's a core principle of how we build our analytics engine.
Daniel Darling [00:16:31]: Does it influence your own thinking around the need for the US to build out far larger data centers, most notably this $500 billion Stargate program? Does that sort of run counter to that narrative? Or do you think that the sheer volume of data that needs to be processed and computed on makes it necessary to build out far more infrastructure?
Craig Dunham [00:16:54]: There are going to be constraints. That's energy constraints, that's land constraints, that's hardware constraints. One of the things that we talk a lot about and focus a lot on is how do you do as much, how do you do more, how do you process more, using a lot less? Because eventually we are just going to run out of energy and we're going to run out of space for all these massive, massive data centers. And so I think the thing that a lot of the DeepSeek news has triggered is this idea of we just have to do things more efficiently. I don't think the volume of data is going to go down, actually. We know it's not. How do we then just be more efficient with what we do with that data?
Daniel Darling [00:17:34]: What about this shift in the AI industry from pre-training, where a lot of this work has been done training these large language models, to inference and the huge demand for compute there? You hear Jensen from Nvidia talking about it being an order of magnitude more demand thanks to this shift to inference, or post-training, of these AI models. What do you see as that future as it unfolds?
Craig Dunham [00:18:02]: I think we are moving beyond foundational models to focus more on inference, particularly with the rise of AI agents. And as these agents become more widespread, the demand for data and analytics will surge alongside it. They'll request and process information from large-scale data sources at a rate far exceeding that of human analysts, and I think that puts a significant amount of pressure on enterprise data pipelines to keep up with those insights. That's the way I predict it will play itself out. As data centers consume more energy and require larger physical footprints, we're going to need smarter, more efficient infrastructure that optimizes how those resources are used. How do we make sure we optimize for the amount of energy that we have? There's just going to be a big focus on doing things faster, better, cheaper, more energy efficient.
Daniel Darling [00:18:55]: Focusing on energy, because you are hearing about how much that is really a cornerstone of being able to power this whole industry, what are you seeing in terms of the innovation?
Craig Dunham [00:19:03]: A bit of a newer cliché is, you know, the math doesn't math, and we know that the current infrastructure just can't keep up without some really significant changes. And so we're already seeing some of these constraints pop up, where energy isn't as available in certain places. We talked a little bit about the rising infrastructure costs. I think this does shift a bit how the industry operates. Some of the data centers that are being constructed today are focusing more on energy efficiency: how do we power them with more renewable power sources?
Daniel Darling [00:19:32]: And does it feel to you that we have the capacity in the renewables and clean energy sector to meet this growing demand? Or do you think the growing demand for energy for compute will just outstrip the US's ability to meet that with renewables, and we'll have to start to power it in other ways?
Craig Dunham [00:19:49]: That is a good question. I would say I think the demand is going to force innovation in such a way that we will be able to do the things that we need to do. But it's got to be a combination of figuring out how to do things more efficiently in conjunction with finding better newer renewable sources.
Daniel Darling [00:20:06]: One of the things that people also talk about is has all the good data already been computed on by these foundational models and are we going to run out of data? You sit right at the intersection of that. What's your view?
Craig Dunham [00:20:24]: No. Can I just say no? We have so much more data than we actually realize. And I mentioned this a little bit earlier, but a lot of the customers and folks that we talk to are sitting on exabytes, I mean massive, massive amounts of data. And the problem really becomes that because the cost of using that data is so high, the data just sits and goes unused. I gave the stat about 90% of it going unused. It's completely locked up and untapped. And so it's the data that's not being used, and it's also the compute capacity that we touched on a little bit earlier that's also not being used. And so that's our mission: fully unlock your data, fully unlock the compute power that you probably already have sitting in your data systems and are just not using. And, again, selfishly for Voltron Data, how can we help you unlock those things?
Daniel Darling [00:21:14]: How about synthetic data? How much are you processing those kinds of data sources?
Craig Dunham [00:21:21]: It's there, but the answer is not a lot. There's so much real data, and that real data has an authenticity to it; trying to create data that matches or mimics that, I think, is really hard. I would rather see us turn to a space where we figure out how to unlock the data that we already have.
Daniel Darling [00:21:41]: And another area of frontier data generation that you hear about is the rise of AI agents and the huge amount of interactions they'll be having amongst themselves, hypothesis testing, reasoning, trains of thought, et cetera, which will create a huge amount of data too. Are we set up to be able to capture some of that data, leverage it, and put it into our pipelines? Or is that just going to swamp us, potentially, given the amount of it?
Craig Dunham [00:22:10]: No, I don't think, I don't think it swamps us. We're trying to help organizations prepare and future proof against this exact thing that you're describing. It's more and more and other data sources that are just coming in and just creating and creating and it's like exponentially growing in orders of magnitude or multiples of 10 every two to three years. And if you just do the math on that, it's going to require organizations or companies or agencies or whomever it happens to be to just be prepared.
Daniel Darling [00:22:39]: A big part of the value of the data for your customers, after they've done the processing, is the analytics piece. But we're also moving from just absorbing things retrospectively to now taking advantage of being more predictive about our decision making and leveraging of data. How far will we start to see the predictive capabilities of analytics manifest over the next couple of years?
Craig Dunham [00:23:02]: Analytics, whether batch or real time, is becoming faster and more scalable as it moves through data pipelines, you know, with some of this high-performance hardware that we've talked about and some of the open source frameworks, like Apache Arrow, that we talked about. These innovations really reduce latency; they allow for faster, more immediate insights. AI models are also improving in their ability to not only react to this data, but then predict outcomes based on patterns in large historical data sets. We have been using historical data to predict what's going to happen in the future for years and years, and now we just have more of that data to do so. And certain industries, like finance, healthcare, and, as we talked about, supply chain, are starting to leverage a lot of these predictive capabilities to forecast risks, optimize operations, and personalize customer experiences. And so as this infrastructure and AI evolve together, I think we'll see predictive analytics become more accurate and a greater part of how businesses operate across sectors. It's just an evolution, and I imagine as the technology gets better and faster, so will the predictive capabilities of these things.
Daniel Darling [00:24:14]: We're going to come up on time here, but I wanted to get your opinion on being at the intersection at the infrastructure layer and at the open source layer. You're part of this wider ecosystem of partners and contributors to the data infrastructure stack. What areas of the industry do you see poised for major breakthroughs in the coming years? Are there any startups or innovators that are shaping the future of the industry that you want to highlight?
Craig Dunham [00:24:37]: Open source and infrastructure in general, it's about collaboration, and collaboration across ecosystems. And when you look at where the innovation is happening, I think it's really clear that we're on the brink of some really interesting breakthroughs. One area that's gaining really good traction, and I'll mention a couple of companies as well, is this unification of data lake and data warehouse architectures, the lakehouse. There's a really cool startup called Onehouse that is pioneering this by making it easy for companies to manage both structured and unstructured data under a single system, improving performance and flexibility. This trend is reshaping how businesses store and access these massive data sets, which I think paves the way for really efficient analytics and AI workloads before it feeds into something like a Voltron Data. On the infrastructure side, there are companies like Weka and VAST Data that I think are making great strides in high-performance storage solutions. As data centers begin to grapple with this increasing amount of data, these companies are offering ways to dramatically reduce the long-term energy and physical space footprint as well. And we talked a lot about the sustainability concerns that are rising, so I think these innovations are really important. One other one I'll mention that I'm excited about: we work a bit with Lambda Labs, and they've become a great partner for us. They provide some of the GPU infrastructure that we need in the cloud. Again, AI and analytics demand massive amounts of compute power, and Lambda is really focused on optimized, high-performance infrastructure; it really aligns with some of the needs that we have. And so across these areas, architecture, storage, compute, startups are really helping to shape the vision of infrastructure that's more efficient, more scalable, and more sustainable.
Daniel Darling [00:26:30]: Such an exciting place that you've cemented yourself in and what a time to be building a company like Voltron. So thank you for sharing your perspective from that vantage point. Congratulations on all the success to date and all that's to come in the future and I appreciate you coming on to chat with me today.
Craig Dunham [00:26:46]: Daniel, this has been incredibly fun. Thank you so much for having me.
Daniel Darling [00:26:48]: Fantastic to sit down with Craig and run through the incredibly important life cycle of data, the atomic unit of how intelligence is derived in our modern economy, and one that is undergoing massive transformation and enablement. Clearly we're just at the tip of the iceberg in terms of the enterprise taking advantage of the data available to them and the industry's ability to help them do so at a cost and speed that makes it increasingly economically viable. It's clear, from China to Craig, that the biggest gains in AI still lie ahead of all of us. To follow Craig and Voltron, head over to their account on X at Voltron Data. I hope you enjoyed today's episode, so please subscribe to the podcast to listen to more coming down the path. And while you're there, please drop us a rating; it helps us out a lot. Until next time, thanks for listening. Have a great rest of your day.
