I really love Oxide to an unhealthy amount (it's become a bit of a meme among my colleagues), but sometimes I do wonder whether they went about their go-to-market the right way. They really tried to do everything at once - custom servers, custom router, custom rack, everything. Their accomplishments are technologically impressive, but, as somebody who is in a position to make purchasing decisions, not economically attractive. They're 3x more expensive than our existing hardware, two generations behind (I'm aware they're on track for a refresh) and don't have any GPUs. E.g. what I would have loved to see is just an after-market BMC/NIC/firmware solution using their stack. Plug it into a cheap Gigabyte system (their BMC is pluggable and NIC is OCP) and just have the control plane manage it as a whole box. I'd have easily paid several thousand $ per server just for that. All the rack scale integration, virtualization, migration, network storage, etc. stuff is cool, but not everyone needs it. Get your foot in the door at customers, build up some volume for better deals with AMD, and then start building the custom rack stuff ... Of course it's easy to be a critic from the sidelines. As I said, I do really love what the Oxide folks are doing, I just really hope it'll become possible for me to buy their gear at some point.
First, thanks for the love -- it's deeply appreciated! Our go-to-market is not an accident: we spent a ton of time (too much time?) looking at how every company had endeavored (and failed) in this space, and then considering a bunch of other options besides. Plugging into a "cheap Gigabyte" system wouldn't actually allow us to build what we've built, and we know this viscerally: before we had our system built, we had to have hardware to build our software on -- which was... a bunch of cheap Gigabyte systems. We had the special pain of relearning all of the reasons why we took the approach we've taken: these systems are a non-starter with respect to foundation.
You may very well not need the system that we have built, but lots of people do -- and the price point versus the alternatives (public cloud or on-prem commodity HW + pretty pricey SW) has proven to be pretty compelling. I don't know if we'll ever have a product that hits your price point (which sounds like... the cost of Gigabyte plus a few thousand bucks?), but at least the software is all open source!
Please forgive my tergiversation. I fully trust that you know your path and I know how annoying it is to be why-dont-they-just'd. As I said, I'm rooting for you.
> The meaning of TERGIVERSATION is evasion of straightforward action or clear-cut statement : equivocation
There are two dictionary definitions of tergiversate. One is the one you quoted, the other is one of desertion. Both meanings of the word are pejorative in the sense that the word comes with a connotation of betrayal of a cause. What I wanted to express was an acknowledgement that I understood the feeling that you get when someone who's clearly a fan of your work nevertheless does not provide a clear endorsement. It's easy emotionally to dismiss people who "just don't get it". But when someone does get it but chooses to equivocate, that can feel like an emotional betrayal. So I was looking for a word that covered that with the right connotation. I originally used apostasy, but it didn't feel quite right, because I wasn't really renouncing, more failing to fully endorse, so tergiversation it was. Of course having to write an entire paragraph to explain your word choice kind of defeats the purpose of choosing a single well-fitting word over just writing a sentence of simple words that explains what you mean. But hey, I write enough technical writing, documentation, reports, grants, etc. all day where clarity is paramount that I feel like I get to have a little vocabulary treat in my personal writing ;).
We are definitely very much building a business! We have the iconoclastic belief that you can build a business by having a terrific team building a great product that customers love. And we're getting there![0]
[0] https://www.theregister.com/2024/11/18/llnl_oxide_compute/
Oxide are doing great work. Hoping they can probe the market a bit more for us out on the sidelines preparing to drop in and compete with some similar tech.
I also wish I could get to play around with a cheaper version of their tech, but they probably have enough customers that really want a large-scale solution that is completely customizable.
> When we started Oxide, the DC bus bar stood as one of the most glaring differences between the rack-scale machines at the hyperscalers and the rack-and-stack servers that the rest of the market was stuck with. That a relatively simple piece of copper was unavailable to commercial buyers
It seems that 0xide was founded in 2019 and the Open Compute Project had been specifying DC bus bars for 6 years at that point. People could purchase racks if they wanted, but it seems like, by and large, people didn't care enough to go whole hog on it.
Wonder if the economics have changed or if it's still just neat but won't move the needle.
Things like -48VDC bus bars in the 'telco' world significantly predate the OCP, all the way back to like 1952 in the Bell system.
In general, the telco world concept hasn't changed much. You have AC grid power coming from your local utility into some BIG ASS RECTIFIERS which create -48VDC (and are responsible for charging your BIG ASS BATTERY BANK to float voltage), then various DC fuses/breakers going to distribution of -48VDC bus bars powering the equipment in a CO.
Re: Open Compute, the general concept of what they did was go to a bunch of 1U/2U server power supply manufacturers and get them to make a series of 48VDC-to-12VDC power supplies (which can be 92%+ efficient), and cut out the need for legacy 5VDC feed from power supply into ATX-derived-design x86-64 motherboards.
Part of the issue is that you simply can't buy OCP hardware, not new anyway. What you're going to find is "OCP Inspired" hardware that has some overlap with the full OCP specification but is almost always meant to run on 240VAC in 19in racks, because nobody wants to invest the money in something that can't be bought from CDW.
I remember the one time I had OCP hardware in a data center, and how it was essentially rumoured that it was better not to ask too much about how it got there - not at the level of "fell off a truck", but there was some possibility it was ex-(big tech) equipment acquired through favours, or some really insistent negotiating with Quanta until "to be sold to (big tech)" racks ended up with us.
It's normally incredibly difficult for employees to disrupt at massive companies that would be the type which runs a data center. Disruption usually enters the corp in a sales deck, much like the one Oxide would have.
Yes. I think as an engineer at this level you need to also have the patience to deal with the bean counters.
But as I’ve grown in my career I’ve actually found that line of thinking refreshing. Can you quantify benefit? If it requires too many assumptions it’s probably not worth it.
But then again there’s always the VP or the SVP who wants to “showcase his tower’s innovative spirit” and then there goes money that could be used for better things. The innovative spirit of the day is random LLM apps.
Once the accountants are convinced the entire company is about them, there’s not much the engineers can do. They just starve you out by refusing to buy anything. It’s a big reason why open source is as successful as it is. It’s free so they can’t stop you with the checkbook.
OCP hardware is only really accessible to hyperscalers. You can't go out and just buy a rack or two, the Taiwanese OEMs don't do direct deals that small. Even if they did, no integration is done for you. You would have to integrate the compute hardware from one company, the network fabric from another company, and then the OS and everything else from yet another. That's a lot of risk, a lot of engineering resources, a lot of procurement overhead, and a lot of different vendors pointing fingers at each other when something doesn't work.
If you're Amazon or Google, you can do this stuff yourself. If you're a normal company, you probably won't have the inhouse expertise.
On the other hand, Oxide sells a turnkey IaaS platform that you can just roll off the pallet, plug in and start using immediately. You only need to pay one company, and you have one company to yell at if something goes wrong.
You can buy a rack of 1-2U machines from Dell, HPE or Cisco with VMware or some other HCI platform, but you don't get that power efficiency or the really nice control plane Oxide have on their platform.
But isn’t it a little surprising (I’m not an expert) that Dell or Supermicro or some firm like that hasn’t already started offering approachable access to either OCP gear or a proprietary knockoff of it? Presumably that may still happen if Oxide is seen to have proven the market.
Azure tried this, not with their hyperscaler stuff, but with Azure Operator Nexus.
Basically an "opinionated" combination of Dell, Arista, and Pure storage with a special Azure AKS running on top and a metric ton of management and orchestration smarts. The target customer base was telcos who needed local capabilities in their data centers and who might otherwise have gone to OCP.
As far as I can surmise, it's dead, but not EOLed. Microsoft nuked the operator business unit earlier in the year, and judging by recent job postings from contract shops, AT&T might be the only customer.
These companies are locked into their way of doing things. Also, they would be competing with themselves. It would also require more work on their side than they do now.
I think the whole 'existing company is not doing something, therefore it's a bad idea' is a really dangerous take.
Oxide is also not exactly OCP; they share some aspects, but Oxide racks are optimized for the typical DC of large organizations. Maybe there is a balance there that matters.
HP BladeSystem p-series chassis were all DC bus bar powered back in the mid 2000s. You had a power enclosure which provided DC output to one or more chassis in a rack over the bus bar. We were glad to be rid of those blades but it wasn't because of their power configuration.
They do have a good point here. If you do the total power budget on a typical 1U (discrete chassis, not blade) server which is packed full of a wall of 40mm fans pushing air, the highest speed screaming 40mm 12VDC fans can be 20W electrical load each. It's easy to "spend" at least 120W at maximum heat from the CPUs, in a dual socket system, just on the fans to pull air from the front/cold side of the server through to the rear heat exhaust.
Just going up to 60mm or 80mm standard size DC fans can be a huge efficiency increase in watt-hours spent per cubic meter of air moved per hour.
I am extremely skeptical of the "12x" but using larger fans is more efficient.
from the URL linked:
> Bigger fans = bigger efficiency gains
Oxide server sleds are designed to a custom form factor to accommodate larger fans than legacy servers typically use. These fans can move more air more efficiently, cooling the systems using 12x less energy than legacy servers, which each contain as many as 7 fans, which must work much harder to move air over system components.
FWIW, we had to have the idle speed of our fans lowered because the usual idle of around 5k RPM was WAY too much cooling. We generally run our fans at around 2.5kRPM (barely above idle). This is due to not only the larger fans, but also the fact that we optimized and prioritized as little restriction on airflow as possible. If you’ve taken apart a current gen 1U/2U server and then compare that to how little our airflow is restricted and how little our fans have to work, the 12X reduction becomes a bit clearer.
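One way to sanity-check that: fan power scales roughly with the cube of fan speed (the fan affinity laws), so the RPM reduction alone accounts for a big factor before you even count the larger fan diameter. A minimal sketch in Python, where the per-fan wattage and fan count are the illustrative numbers from the comments above, not measured data:

```python
# Rough sanity check on fan power, using figures quoted in this thread.
legacy_fans_per_server = 7      # "as many as 7 fans" in a legacy 1U/2U server
legacy_fan_watts = 20.0         # screaming 40mm fans can draw ~20 W each
print(f"Legacy fan budget: ~{legacy_fans_per_server * legacy_fan_watts:.0f} W per server")

# Fan affinity law: for a given fan, power scales roughly with speed cubed,
# so dropping from a ~5k RPM idle to ~2.5k RPM costs about (1/2)^3 of the power.
speed_ratio = 2500 / 5000
print(f"Power at half speed: ~{speed_ratio ** 3:.3f}x of full-speed fan power")
# Combine that with fewer, larger fans that move more air per watt and the
# claimed order-of-magnitude reduction becomes at least plausible.
```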
I kinda feel we need minicomputers back in this age of computing. Instead of making one giant rack that doesn’t fit through doorways, they should make a 4 ft tall unit that stacks. At least once they’re established enough that they can manage doing small installs instead of full data centers. I’ve looked around and there are tiny forklifts they could use to install 2 at once.
Just the power demands for their full rack exceed capacity for most office spaces.
That and someone needs to make a rack that has a port to plug a glycol line directly into. Doesn’t have to be Oxide, but someone should.
That is quiet, indeed! Have you done any decibel measurements by any chance? I wonder how loud it would be when compared to just ambient residential noise level.
It's quiet enough that one customer is putting one just straight-up on their office floor, rather than in a colo somewhere. I've stood next to one in our office (which is a big garage, no soundproofing, so sound bounces around a lot) and had conversations easily.
It is, but if you're running on different hardware than us, you'd have to do a bunch of porting. Buying a solution would be a lot simpler, as we'd have already done the porting.
Have you thought of building an affordable small-scale product for home labs and maybe SMBs? Even if that line didn't turn a profit, it could function as a loss leader in getting engineers and consultants familiar with Oxide, and an opportunity to experiment with (and ultimately evangelize) your tech stack without needing to already have an enterprise-scale use case.
In general, we love the love we get from homelab folks, but the issue is that the current thesis of our designs is "take advantage of the scale of building at the full-rack level."
We really can't afford to do loss leaders before we have more of a business. It's already difficult enough to build a company like this, and that's with making money off of sales. I fully agree that in general, this idea completely makes sense, but you can only really employ it once you have a business to be able to absorb those losses. Right now, building and selling the current product takes up 110% of our time.
I respect that, and I hope you get to that point! As a tech leader in an organization that currently falls short of the scale we’d need to justify Oxide products, I’m hoping that day comes soon.
We're getting to the point where people are building large clusters of Raspberry Pis and the like for hobbyist projects, so I hope that within a few years, the concept of "full-rack level" can encompass hardware with hundreds of nodes small and cheap enough to be packed into a "rack" that still fits under a desk and sells for a couple grand.
In the meantime, I guess I'll have to settle for exploring your code and listening to your podcast!
What I don't get is why tie to such an ancient platform. AMD Milan is my home lab. The new 9004 Epycs are so much better on power efficiency. I'm sure they've done their market research and the gains must be so significant. We used to have a few petabytes and tens of thousands of cores almost ten years ago and it's crazy how much higher data and compute density you can get with modern 30 TiB disks and Epyc 9654s. 100 such nodes and you have 10k cores and really fast data. I can't see myself running a 7003-series datacenter anymore unless the Oxide gains are that big.
They've built this a while ago. A hardware refresh takes time. The good news is that they may be able to upgrade the existing equipment with newer sleds.
my understanding is that they had to build not only the entire hardware platform from scratch, but also the software.
in one of his talks Bryan Cantrill talks about how AMD CPUs were meant to be booted via UEFI firmware, and AMD themselves told them as much... until they kinda reverse engineered the AGESA thingy and made the CPU boot without BIOS/UEFI.
I guess that's the kind of thing that takes a lot of time... the first time. In the future they'll likely be iterating faster.
EDIT: i wrote the comment above to the best of my knowledge, somebody from Oxide might chime in and maybe add some more details :)
I believe the telcos did DC power for years, so I don’t think this is anything new. Any old hands out there want to school us on how it was done in the old days?
Every old telco technician had a story about dropping a wrench on a busbar or other bare piece of high powered transmission equipment and having to shut that center down, get out the heavy equipment, and cut it off because the wrench had been welded to the bus bars.
Note that the rack doesn't accept DC input, like lots of (e.g., NEBS certified) telco equipment. There's a bus bar, but it's enclosed within the rack itself. The rack takes single- or three-phase AC inputs to power the rectifiers, which are then attached to the internal bus bar.
- huge gauge copper cables going around a central office (google "telcoflex IV")
- big DC breaker/fuse panels
- specialized DC fuse panels for power distribution at the top of racks, using little tiny fuses
- 100% overhead steel ladder rack type cable trays, since your typical telco CO was never a raised floor type environment (UNLIKE legacy 1960s/1970s mainframe computer rooms), so all the power was kept accessible by a team of people working on stepladders.
The same general thing continues today in serious telco/ISP operations, with tech features to bring it into the modern era. The rectifiers are modular now, and there's also rectiverters. Monitoring is much better. People are moving rapidly away from wet cell 2V lead acid battery banks and AGM sealed lead acid stuff to LiFePo4 battery systems.
- DC fuse panels can come with network-based monitoring, ability to turn on/off devices remotely.
- equipment is a whole lot less power hungry now, a telco CO that has decommed a 5ESS will find itself with a ton of empty thermal and power budget.
when I say serious telco stuff is a lot less power hungry, it's by huge margins. A randomly chosen example: radio transport equipment. Back in the day a powerful, very expensive point to point microwave radio system might be a full 42U rack, an 800W load, with waveguide going out to antennas on a roof. It would carry one, two or three DS3s' worth of capacity (45 Mbps each).
now, that same telco might have a radio on its CO roof in the same microwave bands that is 1.3 Gbps FDD capacity, pure ethernet with a SFP+ fiber interface built into it, and the whole radio is a 40W electrical load. The radio is mounted directly on the antenna with some UV/IR resistant weatherproof 16 gauge DC power cable running down into the CO and plugged into a fuse panel.
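To put rough numbers on that comparison, using only the figures given above (a quick sketch, not vendor data):

```python
# Throughput per watt: legacy indoor microwave rack vs. a modern all-outdoor radio.
old_capacity_mbps = 3 * 45    # three DS3s at ~45 Mbps each
old_load_watts = 800          # full 42U rack system
new_capacity_mbps = 1300      # modern 1.3 Gbps radio
new_load_watts = 40

old_eff = old_capacity_mbps / old_load_watts
new_eff = new_capacity_mbps / new_load_watts
print(f"Old: {old_eff:.2f} Mbps/W, new: {new_eff:.1f} Mbps/W, ~{new_eff / old_eff:.0f}x improvement")
```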
Can you give me a link to this 1.3 gbps radio product? I have some Alcatel radios with waveguides on a licensed band that only do 50 megabit that I would upgrade if there was something that could get more bits out of the same bandwidth and towers.
Ceragon is one brand name. If you need to keep an entirely indoor unit radio in a rack with the existing waveguide it'll cost a little more, since that's a more rare configuration for new 4096QAM modulation radios.
The 1.3 Gbps full duplex capacity assumes dual linear H&V polarization simultaneously, and assumes an 80 MHz wide FDD channel split such as in the 11 GHz high/low band plan. Depending on which FCC Part 101 regulatory band you're in, what frequency your existing radios use, and your existing path, you might not have that capacity. You could have an existing 40 MHz wide channel, which will give half the capacity.
If you have a 50 Mbps radio product it's also very likely you're in a single polarity so you would need to recoordinate the path (around $1500) entirely to get the same MHz in the opposite polarity.
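For a rough sense of where a figure like 1.3 Gbps comes from: net capacity is roughly symbol rate × bits per symbol × polarizations × coding efficiency. The 80 MHz channel, 4096-QAM and dual polarization are from the comments above; the roll-off and FEC/framing factors below are illustrative assumptions.

```python
# Back-of-the-envelope capacity estimate for a licensed microwave link.
channel_hz = 80e6          # 80 MHz FDD channel (per direction)
rolloff = 0.85             # assumed usable symbol rate vs. channel width
bits_per_symbol = 12       # 4096-QAM
polarizations = 2          # dual H&V polarization
coding_efficiency = 0.80   # assumed net throughput after FEC/framing overhead

capacity_bps = channel_hz * rolloff * bits_per_symbol * polarizations * coding_efficiency
print(f"Estimated net capacity: ~{capacity_bps / 1e9:.2f} Gbps per direction")
# A 40 MHz channel or a single polarization halves this accordingly.
```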
I don't have a link handy (on my phone), but I was involved in installs of licensed Cambium 18Ghz radios last year that were pushing >1Gbps. PTP-800 was the model number, if memory serves.
The first large scale app I did, we got offices in a building that used to have telco equipment in it. There wasn’t enough power or cooling to run about a rack’s worth of equipment split across several racks. It basically had a mini-split for AC. We had to bring in new wiring and run a glycol line to a condenser on the roof, and the smallest unit we were willing to pay for was too big, so we had to knock out a wall to tack a reasonably sized office onto the end to get the volume large enough. So much wasted space for the amount of equipment in there.
Their tech may be more than adequate today. Bigger businesses may not buy from a small startup company. They expect a lot more. Illumos is a less popular OS. It wouldn't be the first choice for the OS I'd rely on. Who writes the security mitigations for speculative execution bugs? Who patches CVEs in the shipped software which doesn't use Rust?
The answer to "who does X" is Oxide. That's the point. You're not going to Dell who's integrating multiple vendors in the same box in a way that "should" work. You're getting a rack where everything is designed to work together from top to bottom.
The goal is that you can email Oxide and they'll be able to fix it regardless of where it is in the stack, even down to the processor ROM.
If you want on prem infra in exactly the shape and form Oxide delivers*
I've read and understood from Joyent and SmartOS that they believe fault tolerant block devices / filesystems are the wrong abstraction; your software should handle losing storage.
We do not put the onus on customers to tolerate data loss. Our storage is redundant and spread through the rack so that if you lose drives or even an entire computer, your data is still safe.
https://oxide.computer/product/storage
And a big enough customer will evaluate Oxide's resources and consider for themselves whether they think Oxide can provide a quick enough turnaround for everything. That's what GP is talking about.
> Who writes the security mitigations for speculative execution bugs? Who patches CVEs in the shipped software which doesn't use Rust?
Oxide.
This is all a pre-canned solution: just use the API like you would an off-prem cloud. Do you worry about AWS patching stuff? And how many people purchasing 'traditional' servers from Dell/HPe/Lenovo worry about patching things like the LOM?
Further, all of Oxide's stuff is on Github, so you're in better shape for old stuff, whereas if the traditional server vendors EO(S)L something firmware-wise you have no recourse.
How much did Shopify buy? Sounds like from what the CEO is saying they bought 1 unit.
>We learned that Oxide has so far shipped “under 20 racks,” which illustrates the selective markets its powerful systems are aimed at.
>B&F understands most of those systems were deployed as single units at customer sites. Therefore, Oxide hopes these and new customers will scale up their operations in response to positive outcomes.
Yikes. If they sold 20 racks in July, how many are they up to now?
We write the security mitigations. We patch the CVEs. Oxide employs many, perhaps most, of the currently active illumos maintainers --- although I don't work on the illumos kernel personally, I talk to those folks every day.
A big part of what we're offering our customers is the promise that there's one vendor who's responsible for everything in the rack. We want to be the responsible party for all the software we ship, whether it's firmware, the host operating system, the hypervisor, and everything else. Arguably, the promise that there's one vendor you can yell at for everything is a more important differentiator for us than any particular technical aspect of our hardware or software.
Because they use such esoteric software that you'll forever be reliant on Oxide.
I'd rather they use more standardized open source software like Linux, Talos, k8s, Ceph, KubeVirt. Instead of rolling it all themselves on an OS that has a very small niche ecosystem.
Oxide is providing an x86 platform to run VMs/containers on. That's a commoditized market.
The value they're offering is that the rack-level consumption and management is improved over the competition, but you should be able to run whatever you want on the actual compute, k8s or whatnot.
This also means you'd not be forever reliant on Oxide.
I’m rooting for solutions like this as an alternative to the public cloud.
I do see that an org would rely on one company that theoretically can do a ‘Broadcom VMware’ on them but I don’t get this vibe from 0x1d3 at all.
But they target large orgs, I wish a solution like this would be accessible for smaller companies.
I wish I could throw their stack on my second-hand COTS hardware, rent a few U’s in two colos for geo redundancy, and cry tears of happiness each month realizing how much money we save on public cloud costs while still having cloud capabilities/benefits.
> Here’s a sobering thought: today, data centers already consume 1-2% of the world’s power, and that percentage will likely rise to 3-4% by the end of the decade.
I don't get this marketing angle. I've made arguments here before that the cost of compute from an energy perspective is often negligible. If Google Maps, for example, can save you 1 mile due to better routing, then that is several orders of magnitude more efficient [1].
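A rough version of that arithmetic (the per-request figure is a commonly cited ballpark for a web search-sized request, not a measured number for Maps; the fuel figures are standard):

```python
# Energy to drive one extra mile vs. energy to serve one routing request.
kwh_per_gallon = 33.7       # energy content of a gallon of gasoline
miles_per_gallon = 28       # typical passenger car
wh_per_mile = kwh_per_gallon / miles_per_gallon * 1000

wh_per_request = 0.3        # assumed: oft-cited ~0.3 Wh per search-sized request

print(f"Driving one mile: ~{wh_per_mile:.0f} Wh")
print(f"Serving one request: ~{wh_per_request} Wh")
print(f"Ratio: ~{wh_per_mile / wh_per_request:,.0f}x, i.e. 3-4 orders of magnitude")
```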
If it uses less resources, it uses less resources. Everybody (businesses and individuals) loves that.
I'm amazed Apple don't have a rack mount version of their M series chips yet.
Even for their own internal use in their data centers they'd have to save an absolute boat load on power and cooling given their performance per watt compared to legacy stuff.
Oxide is not touching DLC systems in their post even with a 100ft barge pole.
Lenovo's DLC systems use 45 degrees C water to directly cool the power supplies and the servers themselves (water goes through them) for > 97% heat transfer to water. In cooler climates, you can just pump this to your drycoolers, and in winter you can freecool them with just air convection.
Yes, the TDP doesn't go down, but cooling costs go down and efficiency shoots up considerably, reducing PUE to 1.03 levels. You can put a tremendous amount of compute or GPU power in one rack, and cool it efficiently.
Every chassis handles its own power, but IIRC, all the chassis electricity is DC, and the PSUs are extremely efficient.
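For context, PUE is just total facility power divided by IT power, so the 1.03 figure is easy to translate into overhead watts. A small sketch, where the 1 MW IT load and the 1.5 comparison PUE are assumptions for illustration:

```python
# What a PUE of 1.03 means in cooling/distribution overhead vs. a typical facility.
it_load_kw = 1000    # assumed: 1 MW of IT load
for pue in (1.03, 1.5):
    overhead_kw = (pue - 1) * it_load_kw
    print(f"PUE {pue}: ~{overhead_kw:.0f} kW of overhead per {it_load_kw} kW of compute")
```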
Companies buying massive cloud scale server hardware want to be able to choose from a dozen different Taiwanese motherboard manufacturers. Apple is in no way motivated to release or sell the M3/M4 CPUs as a product that major east asia motherboard manufacturers can design their own platform for. Apple is highly invested in tightly integrated ecosystems where everything is soldered down together in one package as a consumer product (take a look at a macbook air or pro motherboard for instance).
…Apple has made rack-mounted computers in recent history. They don’t sell chips, they sell complete boxes with rack mount hardware, motherboard and all.
An extremely niche product for things like video editing studios, not something you can deploy at scale in colocation/datacenter environments. I've literally never seen rack-mounted Apple hardware in a serious datacenter since the Apple Xserve 20 to 22 years ago.
I don't think they'd admit much about it even if they had one internally, both because Apple isn't known for their openness about many things, and because they already exited the dedicated server hardware business years ago, so I think they're likely averse to re-entering it without very strong evidence that it would be beneficial for more than a brief period.
In particular, while I'd enjoy such a device, Apple's whole thing is their whole-system integration and charging a premium because of it, and I'm not sure the markets that want to sell people access to Apple CPUs will pay a premium for a 1U over shoving multiple Mac Minis in the same 1U footprint, especially if they've already been doing that for years at this point...
...I might also speculate that if they did this, they'd have a serious problem, because if they're buying exclusive access to all TSMC's newest fab for extended intervals to meet demand on their existing products, they'd have issues finding sources to meet a potentially substantial demand in people wanting their machines for dense compute. (They could always opt to lag the server platforms behind on a previous fab that's not as competed with, of course, but that feels like self-sabotage if they're already competing with people shoving Mac Minis in a rack, and now the Mac Minis get to be a generation ahead, too?)
I will add that consumer macOS is a piss-poor server OS.
At one point, for many years, it would just sometimes fail to `exec()` a process. This would manifest as a random failure on our build farm about once or twice a month, showing up as "/bin/sh: fail to exec binary file": the error type from the kernel would have libc fall back to trying to run the binary as a script, as is normal for a Unix, but it isn't a script.
This is likely stemming from their exiting the server business years ago, and focusing on consumer appeal more than robustness (see various terrible releases, security- and stability-wise).
(I'll grant that macOS has many features that would make it a great server OS, but it's just not polished enough in that direction)
As I recall, Apple advertised macOS as a Unix without such certification, got sued, and then scrambled to implement the required features to get certification as a result. Here's the story as told by the lead engineer of the project:
This comes up rather often, and on the last significant post about it I saw on HN someone pointed out that the certification is kind of meaningless[1]. macOS poll(2) is not Unix-compliant, hasn't been since forever, yet every new version of macOS gets certified regardless.
I wouldn't run a Windows server, but at least it can manage a SYN flood, whereas MacOS doesn't have syncookies or similar (their version of pf has the syncookie keyword, but it seems like it only works for traffic that transits the host, not for traffic that is terminated by the host). Windows also has some pretty nice stuff for servers like receive side scaling (afaik, Microsoft brought that to market, or at least was very early).
That's designed for the broadcast market, where they rack mount everything in the studio environment. It's not really a server, it has no out of band management, redundant power etc.
There are third party rack mounts available for the Mac Mini and Mac Studio also.
Maybe it becomes a big enough profit center to matter. Maybe. At the risk of taking focus away, splitting attention from the mission they're on today: building end user systems.
Maybe they build them for themselves. For what upside? Maybe somewhat better compute efficiency, but I think if you have big workloads the huge AMD Turin super-chips are going to be incredibly hard to beat.
It's hard to overstate just how efficient AMD is, with 192 very high performance cores on a 350-500W chip.
> Maybe they build them for themselves. For what upside?
They do build it for themselves. From their security blog:
"The root of trust for Private Cloud Compute is our compute node: custom-built server hardware that brings the power and security of Apple silicon to the data center, with the same hardware security technologies used in iPhone, including the Secure Enclave and Secure Boot. We paired this hardware with a new operating system: a hardened subset of the foundations of iOS and macOS tailored to support Large Language Model (LLM) inference workloads while presenting an extremely narrow attack surface. This allows us to take advantage of iOS security technologies such as Code Signing and sandboxing."
This is such a narrow, tiny corner of computing needs, one that has such a serious need for ownership, no matter the cost. And it has extremely, fantastically chill overall computing needs; it's about as un-performance-sensitive as it gets.
I could not be less convinced by this information that this is a useful indicator for the other 99.999999999% of computing needs.
> > The power shelf distributes DC power up and down the rack via a bus bar. This eliminates the 70 total AC power supplies found in an equivalent legacy server rack within 32 servers, two top-of-rack switches, and one out-of-band switch, each with two AC power supplies
This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff. In general, the cost savings advertised by cloud infrastructure should be more holistic.
>This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff.
I'll happily take a single high quality power supply (which may have internal redundancy FWIW) over 70 much more cheaply made power supplies that stress other parts of my datacenter via sheer inefficiency, and also cost more in aggregate. Nobody drives down the highway with 10 spare tires for their SUV.
A DC busbar can propagate a short circuit across the rack, and DC circuit protection is harder than AC. So of course each server now needs its own current limiter, or a cheap fuse.
But I’m not debating the merits of this engineering tradeoff - which seems fine, and pretty widely adopted - just its advertisement. The healthcare industry understands the importance of assessing clinical endpoints (like mortality) rather than surrogate measures (like lab results). Whenever we replace “legacy” with “cloud”, it’d be nice to estimate the change in TCO.
Let's say your high quality supply's yearly failure rate is 100 times less than the cheap ones', and call the cheap supplies' failure rate r.
The probability of at least a single failure among the 70 cheap supplies is 1-(1-r)^70. This is quite high even without considering the higher quality of the one supply. The probability of all 70 going down is r^70, which is absurdly low.
Let's say r = 0.05, or one failed supply in every 20 per year. Then:
1-(1-r)^70 = 97%
r^70 < 1E-91
The high quality supply has r = 0.0005, in between no failure and all failing. If your code can handle node failure, very many cheaper supplies appear to be more robust.
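Those numbers are easy to reproduce; a minimal sketch using the same illustrative rates as above:

```python
# Failure-probability comparison from the argument above: 70 cheap PSUs at a
# yearly failure rate r versus one high-quality PSU at r/100 (illustrative rates).
n = 70
r_cheap = 0.05            # assumed: one failure per 20 supplies per year
r_good = r_cheap / 100    # assumed: 100x lower failure rate

p_any_cheap = 1 - (1 - r_cheap) ** n   # at least one of the 70 fails
p_all_cheap = r_cheap ** n             # all 70 fail

print(f"P(at least one of {n} cheap PSUs fails): {p_any_cheap:.1%}")
print(f"P(all {n} cheap PSUs fail):              {p_all_cheap:.1e}")
print(f"P(the single high-quality PSU fails):    {r_good:.2%}")
```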
Yeah but the failure rate of an analog piece of copper is pretty low, it'll keep being copper unless you do stupid things. You'll have multiple power supplies provide power on the same piece of copper
The big piece of copper is fed by redundant rectifiers. Each power shelf has six independent rectifiers which are 5+1 redundant if the rack is fully loaded with compute sleds, or 3+3 redundant if the rack is half-populated. Customers who want more redundancy can also have a second power shelf with six more rectifiers.
The bus bar itself is an SPoF, but it's also just dumb copper. That doesn't mean that nothing can go wrong, but it's pretty far into the tail of the failure distribution.
The power shelf that keeps the busbar fed will have multiple rectifiers, often with at least N+1 redundancy so that you can have a rectifier fail and swap it without the rack itself failing. Similar things apply to the battery shelves.
It's also plausible to have multiple power supplies feeding the same bus bar in parallel (if they're designed to support this) e.g. one at each end of a row.
This is how our rack works (Oxide employee). In each power shelf, there are 6 power supplies and only 5 need to be functional to run at full load. If you want even more redundancy, you can use both power shelves with independent power feeds to each so even if you lose a feed, the rack still has 5+1 redundant power supplies.
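Carrying the earlier thought experiment over to a 5+1 shelf: the shelf only drops below full-load capability if two or more of its six rectifiers are down at the same time. A rough sketch, reusing the illustrative 5% yearly failure rate from the comment upthread and treating the two shelves as simply independent (which oversimplifies the real feed arrangement):

```python
from math import comb

# Probability that a 5+1 redundant power shelf loses more rectifiers than it
# can spare, assuming independent failures at an illustrative yearly rate.
n, spare = 6, 1
r = 0.05  # assumed per-rectifier yearly failure rate (illustrative, from upthread)

p_shelf_short = sum(comb(n, k) * r**k * (1 - r)**(n - k) for k in range(spare + 1, n + 1))
print(f"P(more than {spare} of {n} rectifiers failed): {p_shelf_short:.2%}")
print(f"P(both independently fed shelves short at once): {p_shelf_short ** 2:.3%}")
```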
The whole thing with eliminating 70 discrete 1U server size AC-to-DC power supplies is nothing new. It's the same general concept as the power distribution unit in the center of an open compute platform rack design from 10+ years ago.
Everyone who's doing serious datacenter stuff at scale knows that one of the absolute least efficient, labor intensive and cabling intensive/annoying ways of powering stuff is to have something like a 42U cabinet with 36 servers in it, each of them with dual power supplies, with power leads going to a pair of 208V 30A vertical PDUs in the rear of the cabinet. It gets ugly fast in terms of efficiency.
The single point of failure isn't really a problem as long as the software is architected to be tolerant of the disappearance of an entire node (mapping to a single motherboard that is a single or dual cpu socket config with a ton of DDR4 on it).
This isn't even remotely close. Unless all 32 servers have redundant AC power feeds present, you've traded one single point of failure for another single point of failure.
In the event that all 32 servers had redundant AC power feeds, you could just install a pair of redundant DC power feeds.
It's highly dependent on the individual server model and quite often how you spec it too. Most 1U Dell machines I worked with in the past only had a single slot for a PSU, whereas the beefier 2U (and above) machines generally came with 2 PSUs.
Rack servers have two PSUs because enterprise buyers are gullible and will buy anything. Generally what happens in case of a single PSU failure is the other PSU also fails or it asserts PROCHOT which means instead of a clean hard down server you have a slow server derping along at 400MHz which is worse in every possible way.
That's true if you think the market is SaaS upstarts like Bluesky and maybe less true if you think of the market in terms of who buys hardware. I remember early on at Matasano working for a house account, a major US corp that isn't a household name, and being shocked 2 years in when I finally had to do something in their data center (a FCIP appliance assessment) and seeing how much they'd spent on it. Look at everyone who runs (and wishes they weren't) z/OS today, or Oracle. There's more of them than I think a lot of HN people think.
Not Oxide or Bluesky, but firstly I'd suggest that asking the company about their customers is unlikely to get a response, most companies don't disclose their customers. Secondly, Bluesky have been growing quickly, I can only assume their hardware is too, and that means long lead time products like an Oxide rack aren't going to work, especially when you can have an off the shelf machine from Dell delivered in a few days.
Oxide is very open, we are happy to talk about customers that allow us to talk about them. Some don’t want to, others are very happy to be mentioned, just like any other company.
> we are happy to talk about customers that allow us to talk about them
This is what I meant by "don't disclose", I didn't mean that Oxide was in any way secretive, but that usually this stuff doesn't get agreed, and that it would make more sense to ask the customer rather than the company selling as Oxide won't want to disclose unless there's already an agreement in place (formal or otherwise).
In my head I'm imagining an average landing page. They slap their customers on there like stickers. I doubt Bluesky would stay secretive about using Oxide if they did.
Those customers listed on the front page of companies are there as part of an agreement. Usually something like a discount. Certainly they are not listed without permission. 10x that if it is a case study.
I think they often are listed without permission unfortunately, and often literally based on the email addresses of people signing up for a trial. I see my company's logo on the landing page of many products that we don't use or may even have a policy preventing our use of.
> Replacing low-efficiency AC power supplies with a high-efficiency DC Bus Bar
The part after it about better cooling fans, meh; there are more efficient liquid-cooling methods, including immersion cooling, which already exist in real implementations, albeit niche ones.
We don’t currently have GPUs in the product. The closed-ness of the GPU space is a bit of a cultural difference, but we’ll surely have something eventually. As a small company, we have to focus on our strengths, and there’s plenty of folks who don’t need GPUs right now.
For sure. It’s not just GPUs; given that we have one product with three SKUs, there’s a variety of workloads we won’t be appropriate for just yet. Just takes time to diversify the offering.
"If only they used DC from the wall socket, all those H100s would be green" is, not, I think, the hill you want to die on.
But, yeah, my three 18MW/y racks agree that more power efficiency would be nice, it's just that Rewrite It In (Safe) Rust is unlikely to help with that...
It’s significantly more than that, but it’s also true that we include stuff in other languages where appropriate. CockroachDB is in Go, and illumos is in C, as two examples. But almost all new code we write is in Rust. That is the stuff you’re talking about, but also like, our control plane.
I think it's hard to call it a reason. It is a tool which fits in with the philosophy of the company in terms of how to achieve its goals, but I think it would still exist if Rust didn't. I would describe the goal as making a hyperscaling system that can be sold as a product; the philosophy of how to make this is an aggressive focus on integration, openness, and quality, and Rust is a language that works well with the last two of those goals.
It's also not really a case of "rewriting in Rust" anyway, it's more just "writing it in Rust" since most of the stuff the Oxide team has built is greenfield work.
Pretty much everything Oxide publishes on GitHub is either in Rust or is an SDK to a service in Rust. Well, the web panel isn't in Rust, so negative points for that; true evangelists would have used WASM.
But Oxide's reason to exist is to keep the memory of cool racks from Sun running Solaris alive forever.
(And for that matter, Oracle's proprietary Solaris seems better maintained than I ever expected, though in this context I think the open source fork is the relevant thing to look at.)
> How can organizations reduce power consumption and corresponding carbon emissions?
Stop running so much useless stuff.
Also maybe ARM over x86_64 and similar power-efficiency-oriented hardware.
Rack-level system design, or at least power & cooling design, is certainly also a reasonable thing to do. But standardization is probably important here, rather than some bespoke solution which only one provider/supplier offers.
> How can organizations keep pace with AI innovation as existing data centers run out of available power?
Current ARM servers actually generally offer "on par" (varies by workload) perf/Watt with generally worse absolute performance (varies by workload), i.e. they require more other overhead to achieve the same total perf despite "on par" perf/Watt.
Need either Apple to get into the general market server business or someone to start designing CPUs as well as Apple (based on the comparison between different ARM cores I'm not sure it really matters if they do so using a specific architecture or not).
It's more a case of selection of optimization parameters and the corresponding economics. It's not so much that Apple towers over others in design (though they are absolutely no slouches and have wins there), but their design team is in a position to coordinate with product directly and as such isn't as limited by "but will it sell in high enough numbers for the Excel sheet on an investor's desk?"
The real show stopper for years is that ARM servers are just not prepared to be a proper platform. U-Boot with grudgingly included FDT (after getting kicked out of the Linux kernel) does not make a proper platform, and often there's also no BMC, plus unique approaches to various parts that make the server that one annoying weirdo in the data center, etc.
Cloud providers can spend the effort to backfill necessary features with custom parts, but doing so on your own on-prem is hard
Not sure what you mean wrt Apple's uniqueness. AMD/Mediatek/Intel/Qualcomm/Samsung only make margin on how well they invest in their designs vs their competitors, and they'd all love to be outshipping each other and Apple in any market. All of them, including Apple, also rely on the same manufacturer for their top products, and the ones with alternatives (Intel/Samsung) have not been able to use that as an advantage for top performing products. Sure, Apple can work directly with their own product... but at the end of the day the goal and the available customer pool to fight over are the same, and they still ship fewer units than the others.
I'm not hands-on familiar with other serious ARM server market players but for several years now Ampere ARM server CPUs at least are nothing like you describe. Phoronix says it best in https://www.phoronix.com/review/linux-os-ampereone
> All the Linux distributions I attempted worked out effortlessly on this Supermicro AmpereOne server. Like with Ampere Altra and Ampere eMAG before that, it's a seamless AArch64 Linux experience. Thanks to supporting open standards like UEFI, Arm SBSA/SBBR and ACPI and not having to rely on DeviceTrees or other nuisances, installing an AArch64 Linux distribution on Ampere hardware is as easy as in the x86_64 space.
Ampere is a bright spot in all of this, indeed. Just considerably late. I remember being bombarded by "ARM servers are going to eat the world" in 2013, but ARM couldn't deliver SBSA in a shape that would make it possible, and to this day I am left with serious doubts about whether any ARM board will work out right (there are bright spots though).
As for Apple "uniqueness", I met a lot of people who think that Apple "just" has so much better design team, when it's similar to what you say and the unique part is them being able to properly narrow their design space instead of chasing cost-conscious manufacturers.
+1 for use of tergiversation
So my question: any Arm-based system or GPU-based system on the horizon?
You just described why commodity servers won over engineered systems that came before Oxide (like Nutanix, Sun / Oracle Exa*, VCE etc).
So I totally agree with your go-to-market comment, because it’s also a bet against cloud.
I wish them luck though.
And yet, none of the hyperscalers use commodity servers. They are buying parts from the OCP but those are hardly 'commodity' servers. So did they win?
I kinda feel that their focus is more on building a great technology (& culture?) than a great business.
Not necessarily a bad choice; after all, for what shall it profit a man, if he shall gain the whole world, and lose his own soul?
I'm curious what their burn rate is.
I remember seeing an old telephone switching system from the 20's and I think it was 48vdc. Uncertain though.
Yeah, would have been 48 vdc for line operations, 60 and up AC for the ring.
It's stupid, but that's why we all have jobs.
I think engineers should be more forceful in leading their own visions instead of being led by accountants and lawyers.
After all, engineers have the power of implementation and de-implementation. They need to step into dirty politics and bend other people's views.
It's either theirs or ours. Win-win is a fallacy.
Being able to navigate this is what differentiates a very senior IC (principal, distinguished, etc) and random employees.
Let me know how that works out for you!
Once the accountants are convinced the entire company is about them, there’s not much the engineers can do. They just starve you out by refusing to buy anything. It’s a big reason why open source is as successful as it is. It’s free so they can’t stop you with the checkbook.
OCP hardware is only really accessible to hyperscalers. You can't go out and just buy a rack or two, the Taiwanese OEMs don't do direct deals that small. Even if they did, no integration is done for you. You would have to integrate the compute hardware from one company, the network fabric from another company, and then the OS and everything else from yet another. That's a lot of risk, a lot of engineering resources, a lot of procurement overhead, and a lot of different vendors pointing fingers at each other when something doesn't work.
If you're Amazon or Google, you can do this stuff yourself. If you're a normal company, you probably won't have the inhouse expertise.
On the other hand, Oxide sells a turnkey IaaS platform that you can just roll off the pallet, plug in and start using immediately. You only need to pay one company, and you have one company to yell at if something goes wrong.
You can buy a rack of 1-2U machines from Dell, HPE or Cisco with VMware or some other HCI platform, but you don't get that power efficiency or the really nice control plane Oxide have on their platform.
But isn’t it a little surprising (I’m not an expert) that Dell or Supermicro or somefirm like that hadn’t already started offering an approachable access to either OCP gear or a proprietary knockoff of it? Presumably that may still happen if Oxide is seen to have proven the market.
Azure tried this, not with their hyperscaler stuff, but with Azure Operator Nexus.
Basically an "opinionated" combination of Dell, Arista, and Pure storage with a special Azure AKS running on top and a metric ton of management and orchestration smarts. The target customer base was telcos who needed local capabilities in their data centers and who might otherwise have gone to OCP.
As far as I can surmise, it's dead, but not EOLed. Microsoft nuked the operator business unit earlier in the year, and judging by recent job postings from contract shops, AT&T might be the only customer.
These companies are looked into their way of doing things. Also, they would be competing with themselves. It would also require more work on their side then they do now.
I think the whole 'existing company is not doing something, therefore its a bad idea' is a really dangerous take.
Oxide also isn't exactly OCP; they share some aspects, but Oxide racks are optimized for the typical DC of a large organization. Maybe there is a balance there that matters.
Supermicro does sell OCP racks.
https://www.supermicro.com/solutions/Solution-Brief-Supermic...
I recall them offering older versions of the specs but can't easily find a reference, so I might be wrong about how accessible they were.
HP BladeSystem p-series chassis were all DC bus bar powered back in the mid 2000s. You had a power enclosure which provided DC output to one or more chassis in a rack over the bus bar. We were glad to be rid of those blades but it wasn't because of their power configuration.
One is the specs and the other is an actual implementation, what am I missing?
They do have a good point here. If you do the total power budget on a typical 1U server (discrete chassis, not blade) packed with a wall of 40mm fans pushing air, the highest-speed screaming 40mm 12VDC fans can be a 20W electrical load each. In a dual-socket system at maximum CPU heat output, it's easy to "spend" at least 120W just on the fans pulling air from the front/cold side of the server through to the rear heat exhaust.
Just going up to standard 60mm or 80mm DC fans can be a huge efficiency increase in watts spent per cubic meter of air moved per hour.
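For a feel of the scale, here's a back-of-the-envelope sketch. The CFM and wattage figures are assumptions chosen for illustration, not measurements from any particular server (or from Oxide):

    # Back-of-the-envelope fan comparison with *assumed* illustrative numbers.
    target_airflow_cfm = 120  # rough airflow needed through one chassis

    fans = {
        # name: (airflow per fan in CFM, electrical power per fan in W)
        "40mm screamer": (20, 18),
        "80mm fan":      (60, 6),
    }

    for name, (cfm, watts) in fans.items():
        count = -(-target_airflow_cfm // cfm)   # ceiling division: fans needed
        total_w = count * watts
        print(f"{name:14s}: {count} fans, ~{total_w} W total, ~{cfm / watts:.1f} CFM/W")

With those made-up numbers the larger fans come out roughly an order of magnitude ahead per unit of air moved; the exact multiple depends heavily on how restricted the airflow path is.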
I am extremely skeptical of the "12x" but using larger fans is more efficient.
from the URL linked:
> Bigger fans = bigger efficiency gains Oxide server sleds are designed to a custom form factor to accommodate larger fans than legacy servers typically use. These fans can move more air more efficiently, cooling the systems using 12x less energy than legacy servers, which each contain as many as 7 fans, which must work much harder to move air over system components.
FWIW, we had to have the idle speed of our fans lowered because the usual idle of around 5k RPM was WAY too much cooling. We generally run our fans at around 2.5kRPM (barely above idle). This is due to not only the larger fans, but also the fact that we optimized and prioritized as little restriction on airflow as possible. If you’ve taken apart a current gen 1U/2U server and then compare that to how little our airflow is restricted and how little our fans have to work, the 12X reduction becomes a bit clearer.
> the usual idle of around 5k RPM was WAY too much cooling.
What does this mean? Can one actually get too much cooling? Do you get like condensation and stuff, that kind of "too much cooling" ?
I'm not being snarky, i actually don't know.
It must mean cooling significantly below the target temperature, and thus wasting power to do it
I see, thank you!
I really wish Oxide had homelab/prosumer grade stuff. I'd be sending them so much money.
I kinda feel we need minicomputers back in this age of computing. Instead of making one giant rack that doesn’t fit through doorways, they should make a 4 ft tall unit that stacks. At least once they’re established enough that they can manage doing small installs instead of full data centers. I’ve looked around and there are tiny forklifts they could use to install 2 at once.
Just the power demands for their full rack exceed capacity for most office spaces.
That and someone needs to make a rack that has a port to plug a glycol line directly into. Doesn’t have to be Oxide, but someone should.
A ~20U rack working off residential 15/20A would have been so cool.
Though given how it's designed for the datacenters, I'd expect the thing to be pretty darn loud.
> Though given how it's designed for the datacenters, I'd expect the thing to be pretty darn loud.
It's actually very much the opposite: the rack is very, very quiet. You can hear for yourself: https://www.youtube.com/watch?v=bYcgPRIWf6I
That is quiet, indeed! Have you done any decibel measurements by any chance? I wonder how loud it would be when compared to just ambient residential noise level.
I don't remember off the top of my head.
It's quiet enough that one customer is putting one just straight-up on their office floor, rather than in a colo somewhere. I've stood next to one in our office (which is a big garage, no soundproofing, so sound bounces around a lot) and had conversations easily.
Thanks for the info, it would definitely pass the WAF gate from that perspective :)
Isn't most of their stuff open source?
It is, but if you're running on different hardware than us, you'd have to do a bunch of porting. Buying a solution would be a lot simpler, as we'd have already done the porting.
Have you thought of building an affordable small-scale product for home labs and maybe SMBs? Even if that line didn't turn a profit, it could function as a loss leader in getting engineers and consultants familiar with Oxide, and an opportunity to experiment with (and ultimately evangelize) your tech stack without needing to already have an enterprise-scale use case.
In general, we love the love we get from homelab folks, but the issue is that the current thesis of our designs is "take advantage of the scale of building at the full-rack level."
We really can't afford to do loss leaders before we have more of a business. It's already difficult enough to build a company like this, and that's with making money off of sales. I fully agree that in general, this idea completely makes sense, but you can only really employ it once you have a business to be able to absorb those losses. Right now, building and selling the current product takes up 110% of our time.
I respect that, and I hope you get to that point! As a tech leader in an organization that currently falls short of the scale we'd need to justify Oxide products, I'm hoping that day comes soon.
We're getting to the point where people are building large clusters of Raspberry Pis and the like for hobbyist projects, so I hope that within a few years, the concept of "full-rack level" can encompass hardware with hundreds of nodes small and cheap enough to be packed into a "rack" that still fits under a desk and sells for a couple grand.
In the meantime, I guess I'll have to settle for exploring your code and listening to your podcast!
What I don't get is why tie to such an ancient platform. AMD Milan is my home lab. The new 9004 Epycs are so much better on power efficiency. I'm sure they've done their market research and the gains must be so significant. We used to have a few petabytes and tens of thousands of cores almost ten years ago and it's crazy how much higher data and compute density you can get with modern 30 TiB disks and Epyc 9654s. 100 such nodes and you have 10k cores and really fast data. I can't see myself running a 7003-series datacenter anymore unless the Oxide gains are that big.
They've built this a while ago. A hardware refresh takes time. The good news is that they may be able to upgrade the existing equipment with newer sleds.
Yes we're definitely building the next generation of equipment to fit into the existing racks!
My understanding is that they had to build not only the entire hardware platform from scratch, but also the software.
In one of his talks Bryan Cantrill describes how AMD CPUs were meant to be booted off UEFI firmware, and AMD themselves told them as much... until they kind of reverse engineered the AGESA thingy and made the CPU boot without BIOS/UEFI.
I guess that's the kind of thing that takes a lot of time... the first time. In the future they're likely to iterate faster.
EDIT: I wrote the comment above to the best of my knowledge; somebody from Oxide might chime in and add some more details :)
I believe the telcos did DC power for years so I don’t think this is anything new. Any old hands out there want to school us on how it was done in the old days?
Every old telco technician had a story about dropping a wrench on a busbar or other bare piece of high powered transmission equipment and having to shut that center down, get out the heavy equipment, and cut it off because the wrench had been welded to the bus bars.
Note that the rack doesn't accept DC input, like lots of (e.g., NEBS certified) telco equipment. There's a bus bar, but it's enclosed within the rack itself. The rack takes single- or three-phase AC inputs to power the rectifiers, which are then attached to the internal bus bar.
big ass rectifiers
big ass solid copper busbars
huge gauge copper cables going around a central office (google "telcoflex IV")
big DC breaker/fuse panels
specialized dc fuse panels for power distribution at the top of racks, using little tiny fuses
100% overhead steel ladder rack type cable trays, since your typical telco CO was never a raised floor type environment (UNLIKE legacy 1960s/1970s mainframe computer rooms), so all the power was kept accessible by a team of people working on stepladders.
The same general thing continues today in serious telco/ISP operations, with tech features to bring it into the modern era. The rectifiers are modular now, and there's also rectiverters. Monitoring is much better. People are moving rapidly away from wet cell 2V lead acid battery banks and AGM sealed lead acid stuff to LiFePo4 battery systems.
DC fuse panels can come with network-based monitoring, ability to turn on/off devices remotely.
equipment is a whole lot less power hungry now, a telco CO that has decommed a 5ESS will find itself with a ton of empty thermal and power budget.
When I say serious telco stuff is a lot less power hungry, it's by huge margins. A randomly chosen example from radio transport equipment: back in the day a powerful, very expensive point-to-point microwave radio system might be a full 42U rack, an 800W load, with waveguide going out to antennas on a roof. It would carry one, two or three DS3s' equivalent of capacity (45 Mbps each).
now, that same telco might have a radio on its CO roof in the same microwave bands that is 1.3 Gbps FDD capacity, pure ethernet with a SFP+ fiber interface built into it, and the whole radio is a 40W electrical load. The radio is mounted directly on the antenna with some UV/IR resistant weatherproof 16 gauge DC power cable running down into the CO and plugged into a fuse panel.
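Running the throughput-per-watt numbers from the two examples above (my arithmetic, using only the figures quoted):

    # Old vs new microwave transport, figures taken from the comment above.
    old_capacity_mbps = 3 * 45   # three DS3s
    old_power_w = 800
    new_capacity_mbps = 1300     # 1.3 Gbps
    new_power_w = 40

    old_eff = old_capacity_mbps / old_power_w   # ~0.17 Mbps/W
    new_eff = new_capacity_mbps / new_power_w   # ~32.5 Mbps/W
    print(f"~{new_eff / old_eff:.0f}x more throughput per watt")   # roughly 190x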
Can you give me a link to this 1.3 gbps radio product? I have some Alcatel radios with waveguides on a licensed band that only do 50 megabit that I would upgrade if there was something that could get more bits out of the same bandwidth and towers.
Ceragon is one brand name. If you need to keep an entirely indoor unit radio in a rack with the existing waveguide it'll cost a little more, since that's a more rare configuration for new 4096QAM modulation radios.
The 1.3 Gbps full duplex capacity assumes dual linear H&V polarization simultaneously, and assumes an 80 MHz wide FDD channel split such as in the 11 GHz high/low band plan. Depending on which FCC Part 101 regulatory band you're in, what frequency your existing radios use, and your existing path, you might not have that capacity. You could have an existing 40 MHz wide channel, which will give half the capacity.
If you have a 50 Mbps radio product it's also very likely you're in a single polarity so you would need to recoordinate the path (around $1500) entirely to get the same MHz in the opposite polarity.
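As a rough sanity check on the capacity figures above (a sketch, not a link-budget calculation): capacity scales with channel width and polarization count at a given net spectral efficiency, and ~8 bits/s/Hz is roughly what 1.3 Gbps over a dual-polarized 80 MHz channel implies.

    channel_mhz = 80
    polarizations = 2
    net_bits_per_hz = 8   # implied by the 1.3 Gbps figure above; assumed constant below

    print(channel_mhz * net_bits_per_hz * polarizations)   # ~1280 Mbps: 80 MHz, dual pol
    print(40 * net_bits_per_hz * 1)                        # ~320 Mbps: 40 MHz, single pol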
I don't have a link handy (on my phone), but I was involved in installs of licensed Cambium 18Ghz radios last year that were pushing >1Gbps. PTP-800 was the model number, if memory serves.
The first large scale app I did, we got offices in a building that used to have telco equipment in it. There wasn’t enough power or cooling to run about a rack’s worth of equipment split across several racks. It basically had a mini-split for AC. We had to bring in new wiring and run a glycol line to a condenser on the roof, and the smallest unit we were willing to pay for was too big, so we had to knock out a wall and tack a reasonably sized office onto the end to get the volume large enough. So much wasted space for the amount of equipment in there.
Their tech may be more than adequate today. Bigger businesses may not buy from a small startup company. They expect a lot more. Illumos is a less popular OS. It wouldn't be the first choice for the OS I'd rely on. Who writes the security mitigations for speculative execution bugs? Who patches CVEs in the shipped software which doesn't use Rust?
The answer to "who does X" is Oxide. That's the point. You're not going to Dell who's integrating multiple vendors in the same box in a way that "should" work. You're getting a rack where everything is designed to work together from top to bottom.
The goal is that you can email Oxide and they'll be able fix it regardless of where it is in the stack, even down to the processor ROM.
This. If you want on prem cloud infra without having to roll it yourself, Oxide is the solution.
(no affiliation, just a fan)
If you want on prem infra in exactly the shape and form Oxide delivers*
I've read and understood from Joyent and SmartOS that they believe fault-tolerant block devices / filesystems are the wrong abstraction; your software should handle losing storage.
We do not put the onus on customers to tolerate data loss. Our storage is redundant and spread through the rack so that if you lose drives or even an entire computer, your data is still safe. https://oxide.computer/product/storage
They have partly changed their position on that. You can listen to their podcast on their distributed block storage solution.
And a big enough customer will evaluate Oxide's resources and consider for themselves whether they think Oxide can provide a quick enough turnaround for everything. That's what GP is talking about.
> Bigger businesses may not buy from a small startup company.
What would you classify Shopify as?
> One existing Oxide user is e-commerce giant Shopify, which indicates the growth potential for the systems available.
* https://blocksandfiles.com/2024/07/04/oxide-ships-first-clou...
Their CEO has tweeted about it:
* https://twitter.com/tobi/status/1793798092212367669
> Who writes the security mitigations for speculative execution bugs? Who patches CVEs in the shipped software which doesn't use Rust?
Oxide.
This is all a pre-canned solution: just use the API like you would an off-prem cloud. Do you worry about AWS patching stuff? And how many people purchasing 'traditional' servers from Dell/HPE/Lenovo worry about patching things like the LOM?
Further, all of Oxide's stuff is on Github, so you're in better shape for old stuff, whereas if the traditional server vendors EO(S)L something firmware-wise you have no recourse.
How much did Shopify buy? Sounds like from what the CEO is saying they bought 1 unit.
>We learned that Oxide has so far shipped “under 20 racks,” which illustrates the selective markets its powerful systems are aimed at.
>B&F understands most of those systems were deployed as single units at customer sites. Therefore, Oxide hopes these and new customers will scale up their operations in response to positive outcomes.
Yikes. If they'd shipped under 20 racks as of July, how many are they up to now?
Illumos is the OS for the hypervisor and core services, they don't expect their customers to run their code directly on that OS, but inside VMs.
> Bigger businesses may not buy from a small startup company.
Our early customers include government, finance, and places like Shopify.
You’re not wrong that some places may prefer older companies but that doesn’t mean they all do.
Illumos is not really directly relevant to the customer, it’s a non user facing implementation detail.
We provide security updates.
The illumos bare-metal OS is not directly visible to customers.
We write the security mitigations. We patch the CVEs. Oxide employs many, perhaps most, of the currently active illumos maintainers --- although I don't work on the illumos kernel personally, I talk to those folks every day.
A big part of what we're offering our customers is the promise that there's one vendor who's responsible for everything in the rack. We want to be the responsible party for all the software we ship, whether it's firmware, the host operating system, the hypervisor, and everything else. Arguably, the promise that there's one vendor you can yell at for everything is a more important differentiator for us than any particular technical aspect of our hardware or software.
See perhaps "Oxide Cloud Computer Tour - Rear":
* https://www.youtube.com/watch?v=lJmw9OICH-4
How long before a VPS pops up running Oxide racks? Or, why wouldn't a VPS build on top of Oxide if they offer better efficiency and server management?
Someone could if they wanted to! We’ll see if anyone does.
Because they use such esoteric software that you'll forever be reliant on Oxide.
I'd rather they use more standardized open source software like Linux, Talos, k8s, Ceph, KubeVirt. Instead of rolling it all themselves on an OS that has a very small niche ecosystem.
Oxide is providing an x86 platform to run VMs/containers on. That's a commoditized market.
The value they're offering is that the rack-level consumption and management is improved over the competition, but you should be able to run whatever you want on the actual compute, k8s or whatnot.
This also means you'd not be forever reliant on Oxide.
I’m rooting for solutions like this as an alternative to the public cloud. I do see that an org would rely on one company that theoretically can do a ‘Broadcom VMware’ on them but I don’t get this vibe from 0x1d3 at all.
But they target large orgs, I wish a solution like this would be accessible for smaller companies.
I wish I could throw their stack on my second-hand COTS hardware, rent a few U’s in two colos for geo redundancy, and cry tears of happiness each month realizing how much money we save on public cloud costs while still having cloud capabilities/benefits.
> Here’s a sobering thought: today, data centers already consume 1-2% of the world’s power, and that percentage will likely rise to 3-4% by the end of the decade.
I don't get this marketing angle. I've made arguments here before that the cost of compute from an energy perspective is often negligible. If Google Maps, for example, can save you 1 mile due to better routing, then that is several orders of magnitude more efficient [1].
If it uses less resources, it uses less resources. Everybody (businesses and individuals) loves that.
[1]: https://news.ycombinator.com/threads?id=huijzer&next=4206549...
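To make the "orders of magnitude" comparison concrete, here's a rough sketch. The energy content of gasoline is the standard ~33.7 kWh/gallon figure; the mpg and especially the per-request energy numbers are pure assumptions for illustration:

    kwh_per_gallon_gasoline = 33.7                 # standard energy-content figure
    mpg = 28                                       # assumed average car
    wh_per_mile_driven = kwh_per_gallon_gasoline / mpg * 1000   # ~1200 Wh

    wh_per_routing_request = 0.3                   # assumed, order-of-magnitude guess

    print(f"~{wh_per_mile_driven / wh_per_routing_request:.0f}x")   # ~4000x

Under those assumptions, driving the extra mile costs three to four orders of magnitude more energy than the routing request that avoided it.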
Both are true. Using computers to reduce emissions is good, and reducing computer emissions is good.
I'm amazed Apple don't have a rack mount version of their M series chips yet.
Even for their own internal use in their data centers they'd have to save an absolute boat load on power and cooling given their performance per watt compared to legacy stuff.
Oxide is not touching DLC systems in their post even with a 100ft barge pole.
Lenovo's DLC systems use 45 degrees C water to directly cool the power supplies and the servers themselves (water goes through them) for > 97% heat transfer to water. In cooler climates, you can just pump this to your drycoolers, and in winter you can freecool them with just air convection.
Yes, the TDP doesn't go down, but cooling efficiency shoots up considerably, reducing PUE to 1.03 levels. You can put a tremendous amount of compute or GPU power in one rack and cool it efficiently.
Every chassis handles its own power, but IIRC all the chassis electricity is DC, and the PSUs are extremely efficient.
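For context on that PUE figure, a small sketch of what different PUEs mean in overhead per megawatt of IT load (the 1.5 and 1.2 comparison points are assumed typical air-cooled values, not measurements):

    # PUE = total facility power / IT power, so overhead = IT load * (PUE - 1).
    it_load_kw = 1000   # 1 MW of IT load
    for pue in (1.5, 1.2, 1.03):
        print(f"PUE {pue}: ~{it_load_kw * (pue - 1):.0f} kW of cooling/distribution overhead")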
The in-case PSUs I’ve seen them gesturing to in videos don’t even seem to have cooling fins on them.
Companies buying massive cloud scale server hardware want to be able to choose from a dozen different Taiwanese motherboard manufacturers. Apple is in no way motivated to release or sell the M3/M4 CPUs as a product that major east asia motherboard manufacturers can design their own platform for. Apple is highly invested in tightly integrated ecosystems where everything is soldered down together in one package as a consumer product (take a look at a macbook air or pro motherboard for instance).
…Apple has made rack-mounted computers in recent history. They don’t sell chips, they sell complete boxes with rack mount hardware, motherboard and all.
https://www.apple.com/shop/product/G1720LL/A/Refurbished-Mac...
An extremely niche product for things like video editing studios, not something you can deploy at scale in colocation/datacenter environments. Literally never seen rackmounted apple hardware in a serious datacenter since the apple xserve 20 to 22 years ago.
I don't think they'd admit much about it even if they had one internally, both because Apple isn't known for their openness about many things, and because they already exited the dedicated server hardware business years ago, so I think they're likely averse to re-entering it without very strong evidence that it would be beneficial for more than a brief period.
In particular, while I'd enjoy such a device, Apple's whole thing is their whole-system integration and charging a premium because of it, and I'm not sure the markets that want to sell people access to Apple CPUs will pay a premium for a 1U over shoving multiple Mac Minis in the same 1U footprint, especially if they've already been doing that for years at this point...
...I might also speculate that if they did this, they'd have a serious problem, because if they're buying exclusive access to all TSMC's newest fab for extended intervals to meet demand on their existing products, they'd have issues finding sources to meet a potentially substantial demand in people wanting their machines for dense compute. (They could always opt to lag the server platforms behind on a previous fab that's not as competed with, of course, but that feels like self-sabotage if they're already competing with people shoving Mac Minis in a rack, and now the Mac Minis get to be a generation ahead, too?)
I will add that consumer macOS is a piss-poor server OS.
At one point, for many years, it would just sometimes fail to `exec()` a process. This showed up as a random failure on our build farm about once or twice a month, manifesting as "/bin/sh: fail to exec binary file": the error type from the kernel made libc fall back to trying to run the binary as a script, as is normal for a Unix, but it isn't a script.
This is likely stemming from their exiting the server business years ago, and focusing on consumer appeal more than robustness (see various terrible releases, security- and stability-wise).
(I'll grant that macOS has many features that would make it a great server OS, but it's just not polished enough in that direction)
> as normal for a Unix
veering offtopic, did you know macOS is a certified Unix?
https://www.opengroup.org/openbrand/register/brand3581.htm
As I recall, Apple advertised macOS as a Unix without such certification, got sued, and then scrambled to implement the required features to get certification as a result. Here's the story as told by the lead engineer of the project:
https://www.quora.com/What-goes-into-making-an-OS-to-be-Unix...
This comes up rather often, and on the last significant post about it I saw on HN someone pointed out that the certification is kind of meaningless[1]. macOS poll(2) is not Unix-compliant, hasn't been since forever, yet every new version of macOS gets certified regardless.
[1]: https://news.ycombinator.com/item?id=41823078
lovely, i favorited that comment!
And Windows used to be certified for POSIX, but none of that matters these days if it's not bug-compatible with Linux.
Did that ever get fixed? That...seems like a pretty critical problem.
Yes, it quietly stopped happening a few years ago, sometime since 2020.
> I will add that consumer macOS is a piss-poor server OS.
Windows is also abysmal but it hasn't stopped people from using it.
But yes, it is too much of a desktop OS.
I wouldn't run a Windows server, but at least it can manage a SYN flood, whereas MacOS doesn't have syncookies or similar (their version of pf has the syncookie keyword, but it seems like it only works for traffic that transits the host, not for traffic that is terminated by the host). Windows also has some pretty nice stuff for servers like receive side scaling (afaik, Microsoft brought that to market, or at least was very early).
There is a rack mount version of the Mac Pro you can buy
That's designed for the broadcast market, where they rack mount everything in the studio environment. It's not really a server, it has no out of band management, redundant power etc.
There are third party rack mounts available for the Mac Mini and Mac Studio also.
Rack mount models have LOM over MDM.
For who? How would this help their core mission?
Maybe it becomes a big enough profit center to matter. Maybe. At the risk of taking focus away, splitting attention from the mission they're on today: building end user systems.
Maybe they build them for themselves. For what upside? Somewhat better compute efficiency, maybe, but I think if you have big workloads the huge AMD Turin super-chips are going to be incredibly hard to beat.
It's hard to overstate just how efficient AMD is, with 192 very high performance cores in a 350-500W chip.
> Maybe they build them for themselves. For what upside?
They do build it for themselves. From their security blog:
"The root of trust for Private Cloud Compute is our compute node: custom-built server hardware that brings the power and security of Apple silicon to the data center, with the same hardware security technologies used in iPhone, including the Secure Enclave and Secure Boot. We paired this hardware with a new operating system: a hardened subset of the foundations of iOS and macOS tailored to support Large Language Model (LLM) inference workloads while presenting an extremely narrow attack surface. This allows us to take advantage of iOS security technologies such as Code Signing and sandboxing."
<https://security.apple.com/blog/private-cloud-compute/>
This is such a narrow, tiny corner of computing needs: one with a serious need for ownership no matter the cost, and with extremely chill overall computing demands, about as un-performance-sensitive as it gets.
I could not be less convinced by this information that this is a useful indicator for the other 99.999999999% of computing needs.
Good, because you can’t have one.
(some of?) their servers do run apple silicon: https://security.apple.com/blog/private-cloud-compute/
> > The power shelf distributes DC power up and down the rack via a bus bar. This eliminates the 70 total AC power supplies found in an equivalent legacy server rack within 32 servers, two top-of-rack switches, and one out-of-band switch, each with two AC power supplies
This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff. In general, the cost savings advertised by cloud infrastructure should be more holistic.
>This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff.
I'll happily take a single high quality power supply (which may have internal redundancy, FWIW) over 70 much more cheaply made power supplies that stress other parts of my datacenter via sheer inefficiency, and also cost more in aggregate. Nobody drives down the highway with 10 spare tires for their SUV.
A DC busbar can propagate a short circuit across the rack, and DC circuit protection is harder than AC. So of course each server now needs its own current limiter, or a cheap fuse.
But I’m not debating the merits of this engineering tradeoff - which seems fine, and pretty widely adopted - just its advertisement. The healthcare industry understands the importance of assessing clinical endpoints (like mortality) rather than surrogate measures (like lab results). Whenever we replace “legacy” with “cloud”, it’d be nice to estimate the change in TCO.
DC circuit protection is absolutely not harder than AC. DC has the advantage in current flowing in only one direction, not two
Which makes it much harder to break the circuit vs AC
At 48 volts arcing shorts aren't the concern.
No one drives down the highway with one tire either.
Careful, unicyclists are an unforgiving bunch.
Let's say your high quality supply's yearly failure rate is 100 times lower than the cheap ones'.
The probability of at least one of the 70 cheap supplies failing in a year is 1-(1-r)^70, which is quite high even without considering the higher quality of the single supply. The probability of all 70 going down is r^70, which is absurdly low.
Let's say r = 0.05, i.e. one failed supply per 20 in a year. Then 1-(1-r)^70 ≈ 97% and r^70 < 1e-91.
The high quality supply has r = 0.0005, which lands between "none of the cheap ones fail" and "all of them fail". If your code can handle node failure, very many cheaper supplies appear to be more robust.
(Assuming uncorrelated events. YMMV)
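The same arithmetic as a quick script, using the assumed failure rates above:

    # Uncorrelated failures assumed throughout; rates are the assumed examples above.
    r_cheap = 0.05            # yearly failure rate of one cheap PSU
    r_good = r_cheap / 100    # "100 times lower"
    n = 70

    p_any_cheap_fails = 1 - (1 - r_cheap) ** n   # ~0.97: some node loses its supply
    p_all_cheap_fail = r_cheap ** n              # ~8.5e-92: every node loses power at once
    p_good_fails = r_good                        # 0.0005: the single shared supply fails

    print(f"{p_any_cheap_fails:.2f}  {p_all_cheap_fail:.1e}  {p_good_fails}")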
Yeah, but the failure rate of an analog piece of copper is pretty low; it'll keep being copper unless you do stupid things. You'll have multiple power supplies providing power onto the same piece of copper.
TL;DR, isn't there a single, shared DC supply that feeds said piece of copper? Presumably connected to mains?
Or are they running on SOFCs?
The big piece of copper is fed by redundant rectifiers. Each power shelf has six independent rectifiers which are 5+1 redundant if the rack is fully loaded with compute sleds, or 3+3 redundant if the rack is half-populated. Customers who want more redundancy can also have a second power shelf with six more rectifiers.
I'm going to assume this is on 3 phase power, but how is the ripple filtered?
Inductors and capacitors usually
Look very carefully at the picture of the rack at https://oxide.computer/ :) there are two power shelves in the middle, not one.
We're absolutely aware of the tradeoffs here and have made quite considered decisions!
The bus bar itself is an SPoF, but it's also just dumb copper. That doesn't mean that nothing can go wrong, but it's pretty far into the tail of the failure distribution.
The power shelf that keeps the busbar fed will have multiple rectifiers, often with at least N+1 redundancy so that you can have a rectifier fail and swap it without the rack itself failing. Similar things apply to the battery shelves.
It's also plausible to have multiple power supplies feeding the same bus bar in parallel (if they're designed to support this) e.g. one at each end of a row.
This is how our rack works (Oxide employee). In each power shelf, there are 6 power supplies and only 5 need to be functional to run at full load. If you want even more redundancy, you can use both power shelves with independent power feeds to each so even if you lose a feed, the rack still has 5+1 redundant power supplies.
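A minimal sketch of how that 5+1 arrangement plays out, assuming independent rectifier failures with a purely illustrative per-interval failure probability (not a real field number): the shelf only drops below full-load capacity if two or more of its six rectifiers are down at once.

    from math import comb

    r, n, spares = 0.05, 6, 1   # r is assumed, for illustration only

    # Probability that more rectifiers fail than the shelf has spares for.
    p_below_capacity = sum(comb(n, k) * r**k * (1 - r) ** (n - k)
                           for k in range(spares + 1, n + 1))
    print(f"P(shelf below full-load capacity) ~ {p_below_capacity:.3f}")   # ~0.033

And with a second power shelf on an independent feed, both shelves would have to degrade at once, which shrinks that number further.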
The whole thing with eliminating 70 discrete 1U server size AC-to-DC power supplies is nothing new. It's the same general concept as the power distribution unit in the center of an open compute platform rack design from 10+ years ago.
Everyone who's doing serious datacenter stuff at scale knows that one of the absolute least efficient, labor intensive and cabling intensive/annoying ways of powering stuff is to have something like a 42U cabinet with 36 servers in it, each of them with dual power supplies, with power leads going to a pair of 208V 30A vertical PDUs in the rear of the cabinet. It gets ugly fast in terms of efficiency.
The single point of failure isn't really a problem as long as the software is architected to be tolerant of the disappearance of an entire node (mapping to a single motherboard that is a single or dual cpu socket config with a ton of DDR4 on it).
That’s one reason why 2U4N systems are kinda popular. 1/4 the cabling in legacy infrastructure.
PDUs are also very failure-prone and not worth the trouble.
This isn't even remotely close. Unless all 32 servers have redundant AC power feeds present, you've traded one single point of failure for another single point of failure.
In the event that all 32 servers had redundant AC power feeds, you could just install a pair of redundant DC power feeds.
>Unless all 32 servers have redundant AC power feeds present, you've traded one single point of failure for another single point of failure.
Is this not standard? I vaguely remember that rack servers typically have two PSUs for this reason.
It's highly dependent on the individual server model and quite often how you spec it too. Most 1U Dell machines I worked with in the past only had a single slot for a PSU, whereas the beefier 2U (and above) machines generally came with 2 PSUs.
But 2 PSUs plugged into the same AC supply still have a single point of failure.
Which is why you have two separate PDUs in the rack which are fed by different power feeds and you connect the server's 2 PSUs to opposing PDUs.
This works brilliantly, right up to the point where your A side fails, and every single server suddenly doubles their demand on B.
Better have good capacity management so you don't go over 100% on B when that happens! (I've seen it happen and take a DC out).
Rack servers have two PSUs because enterprise buyers are gullible and will buy anything. Generally what happens in case of a single PSU failure is the other PSU also fails or it asserts PROCHOT which means instead of a clean hard down server you have a slow server derping along at 400MHz which is worse in every possible way.
you could have 15 PSUs in a server. It doesn't mean they have redundant power feeds
> This creates a single point of failure,
Who told you there is only one PSU in the power shelf?
From the title, I was expecting to read about how oxidation (aka rust) reduces power throughput capacity
If any Oxide staff are here, I'm just curious, is BlueSky a customer? Seems like it would fit well with their on-prem setup.
Nope, but many of us (Oxide staff) are big fans of what Bluesky is doing!
One of the Bluesky team members posted about their requirements earlier this month, and why Oxide isn't a great fit for them at the moment:
https://bsky.app/profile/jaz.bsky.social/post/3laha2upw3k2z
Appreciate the reply! Been following Oxide for a few years now and really enjoy the technical blogs :)
> Also prices don't make sense for us.
Oof.
Why is that "oof"? They're using commodity servers today. Oxide does not offer commodity servers.
Just that it highlights the challenge that Oxide faces, that they're effectively offering a "luxury" product in a deeply commoditized space.
That's true if you think the market is SaaS upstarts like Bluesky and maybe less true if you think of the market in terms of who buys hardware. I remember early on at Matasano working for a house account, a major US corp that isn't a household name, and being shocked 2 years in when I finally had to do something in their data center (a FCIP appliance assessment) and seeing how much they'd spent on it. Look at everyone who runs (and wishes they weren't) z/OS today, or Oracle. There's more of them than I think a lot of HN people think.
Good on 0x1d5 to bring back the era of expensive, proprietary hardware that everybody loved so much.
Not Oxide or Bluesky, but firstly I'd suggest that asking the company about their customers is unlikely to get a response, most companies don't disclose their customers. Secondly, Bluesky have been growing quickly, I can only assume their hardware is too, and that means long lead time products like an Oxide rack aren't going to work, especially when you can have an off the shelf machine from Dell delivered in a few days.
Oxide is very open, we are happy to talk about customers that allow us to talk about them. Some don’t want to, others are very happy to be mentioned, just like any other company.
> we are happy to talk about customers that allow us to talk about them
This is what I meant by "don't disclose", I didn't mean that Oxide was in any way secretive, but that usually this stuff doesn't get agreed, and that it would make more sense to ask the customer rather than the company selling as Oxide won't want to disclose unless there's already an agreement in place (formal or otherwise).
Gotcha. That totally makes sense; I wouldn't have thought about it that way.
> most companies dont disclose their customers
In my head I'm imagining an average landing page. They slap their customers on there like stickers. I doubt Bluesky would stay secretive about using Oxide if they did.
Those customers listed on the front page of companies are there as part of an agreement. Usually something like a discount. Certainly they are not listed without permission. 10x that if it is a case study.
I think they often are listed without permission, unfortunately, and often literally based on the email addresses of people signing up for a trial. I see my company's logo on the landing pages of many products that we don't use or may even have a policy preventing our use of.
events.bsky appears to be hosted on OVH. Single-product SaaS companies less than a few years old are unlikely to be a major customer cohort for Oxide.
Is this really the main reason?
> Replacing low-efficiency AC power supplies with a high-efficiency DC Bus Bar
The part after it about better cooling fans: meh, there are more efficient liquid-cooling methods, including immersion cooling, which already exist in deployment, albeit niche.
Where is the GPU?
We don’t currently have GPUs in the product. The closed-ness of the GPU space is a bit of a cultural difference, but we’ll surely have something eventually. As a small company, we have to focus on our strengths, and there’s plenty of folks who don’t need GPUs right now.
That's fine, just awkward because the GS report shows the TAM (or the problem, depending on your perspective) being accelerated computing.
For sure. It’s not just GPUs; given that we have one product with three SKUs, there’s a variety of workloads we won’t be appropriate for just yet. Just takes time to diversify the offering.
maybe the real GPU was the friends we made along the way
"If only they used DC from the wall socket, all those H100s would be green" is, not, I think, the hill you want to die on.
But, yeah, my three 18MW/y racks agree that more power efficiency would be nice, it's just that Rewrite It In (Safe) Rust is unlikely to help with that...
> it's just that Rewrite It In (Safe) Rust is unlikely to help with that...
I didn't see any mention of Rust in the article?
They wrote their own BMC and various other bits and pieces in Rust. That's an extremely tiny part of the whole picture.
It’s significantly more than that, but it’s also true that we include stuff in other languages where appropriate. CockroachDB is in Go, and illumos is in C, as two examples. But almost all new code we write is in Rust. That is the stuff you’re talking about, but also like, our control plane.
Oh and we write a lot of Typescript too.
I think it's hard to call it a reason. It is a tool which fits in with the philosophy of the company in terms of how to achieve its goals, but I think it would still exist if Rust didn't. I would describe the goal as making a hyperscaling system that can be sold as a product; the philosophy for making this is an aggressive focus on integration, openness, and quality; and Rust is a language that works well with the last two of those goals.
It's also not really a case of "rewriting in Rust" anyway, it's more just "writing it in Rust" since most of the stuff the Oxide team has built is greenfield work.
We also sell computers... :)
OSS Rust in Rack trenchcoat.
That's an interesting take. What's your reasoning? Whats your evidence?
Pretty much everything Oxide publishes on GitHub is either in Rust or is an SDK to a service written in Rust. Well, the web panel isn't in Rust, so negative points for that; true evangelists would have used WASM.
But Oxide's reason to exist is to keep the memory of cool Sun racks running Solaris alive forever.
The raison d'être of Oxide isn't Rust, it's continuing to pretend that the bloated corpse of Solaris still has some signs of life.
https://github.com/illumos/illumos-gate/commits/master/ looks alive to me.
(And for that matter, Oracle's proprietary Solaris seems better maintained than I ever expected, though in this context I think the open source fork is the relevant thing to look at.)
18MW/year is not a real unit of measurement; did you mean MWh?
> How can organizations reduce power consumption and corresponding carbon emissions?
Stop running so much useless stuff.
Also maybe ARM over x86_64 and similar power-efficiency-oriented hardware.
Rack-level system design, or at least power & cooling design, is certainly also a reasonable thing to do. But standardization is probably important here, rather than some bespoke solution which only one provider/supplier offers.
> How can organizations keep pace with AI innovation as existing data centers run out of available power?
Waste less energy on LLM chatbots?
Current ARM servers actually generally offer "on par" perf/Watt (varies by workload) with generally worse absolute performance (also varies by workload), i.e. they require more overhead elsewhere to achieve the same total perf despite the "on par" perf/Watt.
We need either Apple to get into the general-market server business or someone else to start designing CPUs as well as Apple does (based on the comparison between different ARM cores, I'm not sure it really matters which specific architecture they do it with).
It's more a case of the choice of optimization parameters and the corresponding economics. It's not so much that Apple towers over others in design (though they are absolutely no slouches and have wins there), but their design team is in a position to coordinate with product directly and as such isn't as limited by "but will it sell in high enough numbers for the Excel sheet on an investor's desk?"
The real showstopper for years is that ARM servers are just not prepared to be a proper platform. U-Boot with grudgingly included FDT (after getting kicked out of the Linux kernel) does not make a proper platform, and often there's also no BMC, plus unique approaches to various parts that make the server that one annoying weirdo in the data center, etc.
Cloud providers can spend the effort to backfill necessary features with custom parts, but doing so on your own on-prem is hard
Not sure what you mean wrt Apple's uniqueness. AMD/Mediatek/Intel/Qualcomm/Samsung only make margin based on how well they invest in their designs vs their competitors, and they'd all love to be outshipping each other and Apple in any market. All of them, including Apple, also rely on the same manufacturer for their top products, and the ones with alternatives (Intel/Samsung) have not been able to use that as an advantage for top-performing products. Sure, Apple can work directly with their own product... but at the end of the day the goal and the available customer pool to fight over are the same, and they still ship fewer units than the others.
I'm not hands-on familiar with other serious ARM server market players but for several years now Ampere ARM server CPUs at least are nothing like you describe. Phoronix says it best in https://www.phoronix.com/review/linux-os-ampereone
> All the Linux distributions I attempted worked out effortlessly on this Supermicro AmpereOne server. Like with Ampere Altra and Ampere eMAG before that, it's a seamless AArch64 Linux experience. Thanks to supporting open standards like UEFI, Arm SBSA/SBBR and ACPI and not having to rely on DeviceTrees or other nuisances, installing an AArch64 Linux distribution on Ampere hardware is as easy as in the x86_64 space.
Ampere is a bright spot in all of this, indeed. Just considerably late. I remember being bombarded by "ARM servers are going to eat the world" in 2013, but ARM couldn't deliver SBSA in a shape that would make it possible, and to this day I'm left with serious doubts about whether any given ARM board will work out right (there are bright spots, though).
As for Apple's "uniqueness", I've met a lot of people who think Apple "just" has a much better design team, when it's similar to what you say: the unique part is them being able to properly narrow their design space instead of chasing cost-conscious manufacturers.