Filecoin, StorJ and the problem with decentralized storage (part II)

[continued from part I]

Quantifying reliability

Smart-contracts can not magically prevent hardware failures or compel a service provider at gunpoint to perform the advertised services. At best blockchains can facilitate contractual arrangements with a fairness criterion: the service provider gets paid if and only if they deliver the goods. Proofs-of-storage verified by a decentralized storage chain are an example of that model. It keeps service providers honest by making their revenue contingent on living up to the stated promise of storing customer data. But as the saying goes, past performance is no guarantee of future results. A service provider can produce the requisite proofs 99 times and then report all data as lost when it is time for the next one. This can happen because of an “honest” mistake or, more troubling, because it is more profitable for decentralized providers to break existing contracts.

When it comes to honest mistakes—random hardware failures resulting in unrecoverable data loss— the guarantees that can be provided by decentralized storage alone are weaker. This follows from a limitation of existing decentralized designs: their inability to express the reliability of storage systems, except in the most rudimentary ways. All storage systems are subject to risks of hardware failure and data loss. That goes for AWS and Google. For all the sophistication of their custom-designed hardware they are still subject to the laws of physics. There is still a mean-time-to-failure associated with every component. It follows there must be a design in place to cope with those failures across the board, ranging from making regular backups to having diesel generators ready to kick in when the grid power fails. We take for granted the existence of this massive infrastructure behind the scenes when dealing with the likes of Amazon. There is no such guarantee for a random counterparty on the blockchain.

Filecoin uses a proof-of-replication intended to show that not only does the storage provider have the data but they have multiple copies. (Ironically that involves introducing even more work on the storage provider to format data for storage— otherwise they can fool the test by re-encrypting one copy into multiple replicas when necessary— further pushing the economics away from the allegedly zero marginal cost.) That may seem comparable to the redundancy of AWS but it is not. Five disks sitting in the same basement hooked up to the same PC can claim “5-way replication.” But it is not meaningful redundancy because all five copies are subject to correlated risk, one lightning-strike or ransomware infection away from total data loss. By comparison Google operates data-centers around the world and can afford to put each of those five copies in a different facility. Each one of those facilities still has a non-zero chance of burning to the ground or losing power during a natural disaster. As long as the locations are far enough from each other, those risks are largely uncorrelated. That key distinction is lost in the primitive notion of “replication” expressed by smart-contracts.
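To put numbers on that distinction, consider a toy model where each facility has some chance of catastrophic loss in a given year. The per-facility probability below is made up for illustration, not a vendor statistic:

```python
# Sketch: why replica placement matters more than replica count.
# The per-facility loss probability is an illustrative assumption.

p_facility = 0.01   # chance a single facility loses its data in a year

# Five copies in five independent facilities: all must fail to lose data.
p_loss_independent = p_facility ** 5

# Five copies in one basement: a single shared event (fire, ransomware)
# destroys every replica at once, so the loss probability is just the
# probability of that one event.
p_loss_correlated = p_facility

print(f"independent replicas: {p_loss_independent:.0e}")  # 1e-10
print(f"correlated replicas:  {p_loss_correlated:.0e}")   # 1e-02
```

Same nominal "5-way replication," eight orders of magnitude apart in actual durability.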

Unreliable by design

Reliability questions aside, there is a more troubling problem with the economics of decentralized storage. It may well be the most rational— read: profitable— strategy to operate an unreliable service deliberately designed to lose customer data. Here are two hypothetical examples to demonstrate the notion that on a blockchain, there is no success like failure.

Consider a storage system designed to store data and publish regular proofs of storage as promised, but with one catch: it would never return that data if the customer actually requested it. (From the customer perspective: you have backups but unbeknownst to you, they are unrecoverable.) Why would this design be more profitable? Because streaming a terabyte back to the customer is dominated by an entirely different type of operational expense than storing that terabyte in the first place: network bandwidth. It may well be profitable to set up a data storage operation in the middle-of-nowhere with cheap real-estate, abundant power but expensive bandwidth. Keeping data in storage while publishing the occasional proof involves very little bandwidth, because proof-of-storage protocols are very efficient in space. The only problem comes up if the customer actually wants their entire data streamed back. At that point a different cost structure involving network bandwidth comes into play and it may well be more profitable to walk away.

To make this more concrete: at the time of writing AWS charges ~1¢ per gigabyte per month for “infrequently accessed” data but 9¢ per gigabyte of data outbound over the network. Conveniently inbound traffic is free; uploading data to AWS costs nothing. As long as the prevailing Filecoin market price is higher than S3 prices, one can operate a Filecoin storage miner on AWS to arbitrage the difference— this is exactly what Dropbox used to do before figuring out how to operate its own datacenter. The only problem with this model is if the customer comes calling for their data too early or too often. In that case the bandwidth costs may well disrupt the profitability equation. If streaming the data back would lose money overall on that contract, the rational choice is to walk away.
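A back-of-the-envelope sketch using the prices above (the contract rate of 1.2¢ per gigabyte-month is a hypothetical, since actual market prices fluctuate) shows how serving the data back flips the sign on the whole arrangement:

```python
# Walk-away arithmetic for a provider backing a storage contract with S3.
# Prices follow the figures quoted above: ~$0.01/GB-month storage,
# $0.09/GB egress, free ingress. The contract revenue rate is made up.

def contract_profit(gb, months, revenue_per_gb_month, served_back):
    """Net profit in dollars; served_back models the customer retrieval."""
    storage_cost = gb * months * 0.01
    egress_cost = gb * 0.09 if served_back else 0.0
    revenue = gb * months * revenue_per_gb_month
    return revenue - storage_cost - egress_cost

# 1 TB stored for 3 months at a hypothetical 1.2 cents per GB-month:
print(f"{contract_profit(1000, 3, 0.012, served_back=False):.2f}")  # 6.00
print(f"{contract_profit(1000, 3, 0.012, served_back=True):.2f}")   # -84.00
```

The contract is modestly profitable as long as the data sits untouched; one full retrieval turns it into a loss far larger than the margin.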

Walking away from the contract for profit

Recall that storage providers are paid in funny money, namely the utility token associated with the blockchain. That currency is unlikely to work for purchasing anything in the real world and must be converted into dollars, euros or some other unit of measure accepted by the utility company to keep the datacenter lights on. That conversion in turn hinges on a volatile exchange rate. While there are reasonably mature markets in major cryptocurrencies such as Bitcoin and Ethereum, the tail-end of the ICO landscape is characterized by thin order-books and highly speculative trading. Against the backdrop of what amounts to an extreme version of FX risk, the service provider enters into a contract to store data for an extended period of time, effectively betting that the economics will work out. It need not be profitable today but perhaps it is projected to become profitable in the near future based on rosy forecasts of prices going to the moon. What happens if that bet proves incorrect? Again the rational choice is to walk away from the contract and drop all customer data.

For that matter, what happens when a better opportunity comes along? Suppose the exchange rate is stable or those risks are managed using a stablecoin, while the market value of storage increases. Buyers are willing to pay more of the native currency per byte of data stashed away. Or another blockchain comes along, promising more profitable utilization of spare disk capacity. That may seem like great news for the storage provider except for one problem: they are stuck with existing customers paying lower rates negotiated earlier. The optimal choice is to renege on those commitments: delete existing customer data and reallocate the scarce space to higher-paying customers.

It is not clear if blockchain incentives can be tweaked to discourage this without creating unfavorable dynamics for honest service providers. Suppose we impose penalties on providers for abandoning storage contracts midway. These penalties can not be clawback provisions for past payments; the provider may well have already spent that money to cover operational expenses. For the same reason, it is not feasible to withhold payment until the very end of the contract period without creating the risk that the buyer may walk away. Another option is requiring service providers to put up a surety bond. Before they are allowed to participate in the ecosystem, they must set aside a lump sum on the blockchain held in escrow. These funds would be used to compensate any customers harmed by failure to honor storage contracts. But this has the effect of creating additional barriers to entry and locking away capital in a very unproductive way. Similarly the idea of taking monetary damages out of future earnings does not work. It seems plausible that if a service provider screws over Alice because Bob offered a better price, recurring fees paid by Bob should be diverted to compensate Alice. But the service provider can trivially circumvent that penalty while still doing business with Bob: just start over with a new identity completely unlinkable to that “other” provider who screwed over Alice. To paraphrase the New Yorker cartoon on identity: on the blockchain nobody knows you are a crook.

Reputation revisited

Readers may object: surely such an operation will go out of business once the market recognizes their modus operandi and no one is willing to entrust them with storing data? Aside from the fact that the lack of identity on a blockchain renders “going out of business” meaningless, this posits there is such a thing as “reputation” that buyers take into account when making decisions. The whole point of operating a storage market on chain is to allow customers to select the lowest bidder while relying on the smart-contract logic to guarantee that both sides hold up their side of the bargain. But if we are to invoke some fuzzy notion of reputation as the selection criterion for service providers, why bother with a blockchain? Amazon, MSFT and Google have stellar reputations for delivering high-reliability, low-cost storage with no horror stories of customers randomly getting ripped-off because Google decided one day it would be more profitable to drop all of their files. Not to mention, legions of plaintiffs’ attorneys would be having a field day with any US company that reneged on contracts in such cavalier fashion, assuming a newly awakened FTC does not get in on the action first. There is no reason to accept the inefficiencies of a blockchain or invoke elaborate cryptographic proofs if reputation is a good enough proxy.

CP

 

Filecoin, StorJ and the problem with decentralized storage (part I)

Blockchains for everything

Decentralized storage services such as Filecoin and StorJ seek to disrupt the data-storage industry, using blockchain tokens to create a competitive marketplace that can offer more space at lower cost. They also promise to bring a veneer of legitimacy to the Initial Coin Offering (ICO) space. At a time when ICOs were being mass-produced as thinly-veiled, speculative investment vehicles likely to run afoul of the Howey test as unregistered securities, file-storage looked like a shining example of an actual utility token, for having some utility. Instead of betting on the “greater fool” theory of offloading the token on the next person willing to pay a higher price, these tokens are good for a useful service: paying someone else to store your backups. This blog post looks at some caveats and overlooked problems in the design.

Red-herring: privacy

A good starting point is to dispel the alleged privacy advantage. Decentralized storage systems often tout their privacy advantage: data is stored encrypted by its owner, such that the storage provider can not read it even if they wanted to. That may seem like an improvement over the current low bar, which relies on service providers swearing on a stack of pre-IPO shares that, pinky-promise, they will never dip into customer data for business advantage, a promise more often honored in the breach as the examples of Facebook and Google repeatedly demonstrate. But there is no reason to fundamentally alter the data-storage model to achieve E2E security against rogue providers. While far from being the path of least resistance, there is a long history of alternative remote backup services such as tarsnap for privacy-conscious users. (All 17 of them.) Previous blog posts here have demonstrated that it is possible to implement bring-your-own-encryption with vanilla cloud storage services such as AWS, such that the cloud service is a glorified remote drive storing random noise it can not make sense of. These models are far more flexible than the arbitrary, one-size-fits-all encryption model hard-coded into protocols such as StorJ. Users are free to adopt their preferred scheme, compatible with their existing key management model. For example with AWS Storage Gateway, Linux users can treat cloud storage as an iSCSI volume with LUKS encryption while those on Windows can apply Bitlocker-To-Go to protect that volume exactly as they would encrypt a USB thumb-drive. Backing up rarely accessed data in an enterprise is even easier: nothing fancier than scripts to GPG-sign and encrypt backups before uploading them to AWS/Azure/GCP is necessary.

Facing the competition

Once we accept the premise that privacy alone can not be a differentiator for backup services—users can already solve that problem without depending on the service provider—the competitive landscape reverts to that of a commodity service. Roughly speaking, providers compete on three dimensions: reliability, cost and speed.

  • Cost is the price paid for storing each gigabyte of data for a given period of time.
  • Speed refers to how quickly that data can be downloaded when necessary and to a lesser extent, how quickly it can be uploaded during the backup process.
  • Reliability is the probability of being able to get all of your data back whenever you need it. A company that retains 99.999% of customer data while irreversibly losing the remaining 0.001% will not stay in business long. Even a 100% retention rate is not great if the service only operates from 9AM-4PM.
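To put that retention figure in perspective, a quick bit of arithmetic with a hypothetical fleet of a million customer files:

```python
# Even "five nines" of retention means permanent losses at scale.
# The million-file fleet is a made-up illustration.
files = 1_000_000
retention = 0.99999
lost = files * (1 - retention)
print(round(lost))   # 10 files gone for good
```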

The economic argument against decentralized storage can be stated this way: it is very unlikely that a decentralized storage market can offer an alternative that can compete against centralized providers— AWS, Google, Azure— when measured on any of these dimensions. (Of course nothing prevents Amazon or MSFT from participating in the decentralized marketplace to sell storage, but this would be another example of doing with increased friction something on a blockchain that can be done much more efficiently via existing channels.)

Among the three criteria, cost is the easiest one to forecast. Here is the pitch from the StorJ website:

“Have unused hard drive capacity and bandwidth?
Storj pays you for your unused hard drive capacity and bandwidth in STORJ tokens!”

Cloud services are ruled by ruthless economies of scale. This is where Amazon, Google, MSFT and a host of other cloud providers shine, reaping the benefits of investment in data-centers and petabytes of storage capacity. Even if we ignore the question of reliability, it is very unlikely that a hobbyist with a few spare drives sitting in their basement can achieve a lower per-gigabyte cost.

The standard response to this criticism is pointing out that decentralized storage can unlock spare, unused capacity at zero marginal cost. Returning to our hypothetical hobbyist: he need not add new capacity to compete with AWS. Let us assume he already owns excess storage, already paid for, that sits underutilized; there is only so much space you can take up with vacation pictures. Disks consume about the same energy whether they are 99% or 1% full. Since the user is currently getting paid exactly $0 for that spare capacity, any value above zero is a good deal, according to this logic. In that case, any non-zero price point is achievable, including one that undercuts even the most cost-effective cloud provider. Our hobbyist can temporarily boot up those ancient PCs, stash away data someone on the other side of the world is willing to pay to safeguard and shut down the computer once the backups are written. The equipment remains unplugged from the wall until such time as the buyer comes calling for their data.

Proof-of-storage and cost of storage

The problem with this model is that decentralized storage demands much more than mere inert storage of bits. Providers must achieve reliability in the absence of the usual contractual relationship, namely, someone you can sue for damages if the data disappears. Instead the blockchain itself must enforce fairness in the transaction: the service provider gets paid only if they are actually storing the data entrusted for safeguarding. Otherwise the provider could pocket the payment, discard uploaded data and put that precious disk space to some other use. Solving this problem requires a cryptographic technique called proofs-of-data-possession (PDP), or alternatively proofs-of-storage. Providers periodically run a specific computation over the data they promised to store— a computation that is only possible if they still have 100% of that data— and publish the results on the blockchain, which in turn facilitates payment conditional on periodic proofs. Because the data-owner can observe these proofs, they are assured their precious data is still around. The key property is that the owner does not need access to the original file to check correctness: only a small “fingerprint” of the uploaded data is retained. That in a nutshell is the point of proof-of-storage; if the owner needed access to the entire dataset to verify the calculation, it would defeat the purpose of outsourcing storage.
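The challenge-response idea can be illustrated with nothing more than a keyed hash. This toy sketch is not the construction Filecoin uses (real PDP schemes support an unbounded number of challenges using homomorphic tags) but it captures the core property: the owner keeps only small fingerprints, yet answering a challenge correctly requires the full data.

```python
import hashlib
import os

# Toy proof-of-storage. Before uploading, the owner precomputes the
# answers to a handful of random challenges, then keeps only the
# (nonce, digest) pairs -- a small "fingerprint" -- and discards the file.

def precompute_challenges(data: bytes, count: int):
    challenges = []
    for _ in range(count):
        nonce = os.urandom(16)
        digest = hashlib.sha256(nonce + data).hexdigest()
        challenges.append((nonce, digest))
    return challenges

def prove(stored_data: bytes, nonce: bytes) -> str:
    # The provider can only answer correctly by hashing the full data.
    return hashlib.sha256(nonce + stored_data).hexdigest()

data = b"terabyte of backups (abridged)"
fingerprints = precompute_challenges(data, count=3)

nonce, expected = fingerprints[0]
assert prove(data, nonce) == expected        # honest provider passes
assert prove(data[:-1], nonce) != expected   # any missing byte fails
```

Each fingerprint is a one-time challenge here; the limitation motivating the fancier cryptography is precisely that the owner wants to keep challenging the provider indefinitely without storing an unbounded list of answers.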

While proofs of storage may keep service providers honest, they break one of the assumptions underlying the claimed economic advantage: leveraging idle capacity. Once we demand periodically going over the bits and running cryptographic calculations, the storage architecture can not be an ancient PC unplugged from the wall. There is a non-zero marginal cost to implementing proof-of-storage. In fact there is an inverse relationship between latency and price. Tape archives sitting on a shelf have a much lower cost per gigabyte than spinning disks attached to a server. These tradeoffs are even reflected in Amazon's pricing model: AWS offers a storage tier called Glacier which is considerably cheaper than S3 but comes with significant latency— on the order of hours— for accessing data. Requiring periodic proof-of-storage undermines precisely the one model— offline media gathering dust in a vault— that has the best chance of undercutting large-scale centralized providers.

Beyond the economics, there is a more subtle problem with proof-of-storage: knowing your data is there does not mean that you can get it back when needed. This is the subject of the next blog post.

[continued]

CP

Off-by-one: the curious case of 2047-bit RSA keys

This is the story of an implementation bug discovered while operating an enterprise public-key infrastructure system. It is common in high-security scenarios for private keys to be stored on dedicated cryptographic hardware rather than managed as ordinary files on the commodity operating system. Smart-cards, hardware tokens and TPMs are examples of popular form factors. In this deployment, every employee was issued a USB token designed to connect to old-fashioned USB-A ports. USB tokens have a usability advantage over smart-cards in situations when most employees are using laptops: there is no separate card reader required, eliminating one extra piece of hardware to lug around. The gadget presents itself to the operating system as a combined card reader with a card always present. Cards on the other hand have an edge for “converged access” scenarios involving both logical and physical access. Dual-interface cards with NFC can also be tapped against badge readers to open doors. (While it is possible to shoe-horn NFC into compact gadgets and this has been done, physical constraints on antenna size all but guarantee poor RF performance. Not to mention one decidedly low-tech but crucial aspect of an identity badge is having enough real-estate for the obligatory photograph and the name of the employee spelled out in a legible font.)

The first indication of something awry with the type of token used came from a simple utility rejecting the RSA public-key for a team member. That public-key had been part of a pair generated on the token, in keeping with the usual provisioning process that guarantees keys live on the token throughout their entire lifecycle. To recap that sequence:

  • Generate a key-pair of the desired characteristics, in this case 2048-bit RSA. This can be surprisingly slow for RSA, on the order of 30 seconds, considering the hardware in question is powered by relatively modest SoCs under the hood.
  • Sign a certificate signing-request (CSR) containing the public-key. This is commonly done as a single operation at the time of key-generation, due to an implementation quirk: most card standards such as PIV require a certificate to be present before clients can use the card, because they do not have a way to identify private-keys in isolation.
  • Submit that CSR to the enterprise certificate authority to obtain a certificate. In principle certificates can be issued out of thin air. In reality most CA software can only accept a CSR containing the public-key of the subject, signed with the corresponding private key— and they will verify that criterion.
  • Load issued certificate on the token. At this point the token is ready for use in any scenario demanding PKI credentials, be it VPN, TLS client authentication in a web-browser, login to the operating system or disk-encryption.

On this particular token, that sequence resulted in a 2047-bit RSA key, one bit off the mark and falling short of the NIST recommendation to boot. A quick glance showed the provisioning process was not at fault. Key generation was executed on Windows using the tried-and-true certreq utility. (Provisioning new credentials is commonly under-specified compared to steady-state usage of existing credentials, and vendors often deign to publish software for Windows only.) That utility takes an INF file as configuration specifying the type of key to generate. A look at the INF file showed the number 2048 had not bit-rotted into 2047.

Something else lower in the stack was ignoring those instructions or failing to generate keys according to the specifications. Looking through other public-keys in the system showed that this was not in fact an isolated case. The culprit appeared to be the key-generation logic on the card itself.

Recall that when we speak of a “2048-bit RSA key” the count refers to the size of the modulus. An RSA modulus is the product of two large primes of comparable size. Generating a 2048-bit RSA key then is done by generating two random primes of half that size, 1024 bits each, and multiplying those two values together.

There is one catch with this logic: the product of two numbers N bits in size each is not guaranteed to be 2·N bits. That intuitive-sounding 2·N figure is an upper-bound: the actual product can be 2·N or 2·N-1 bits. Here is an example involving tractable numbers and the more familiar decimal notation. It takes two digits to express the prime numbers 19 and 29. But multiplying them we get 19 * 29 = 551, a number spanning three digits instead of four. By contrast the product of the two-digit primes 37 and 59 is 2183, which is four digits as expected.
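The same effect in binary, using Python's arbitrary-precision integers:

```python
# Both factors in each pair below are 5-bit primes, yet one product
# spans the full 10 bits while the other falls one bit short.
small = 17 * 19          # two "small" 5-bit primes -> 323
large = 29 * 31          # two "large" 5-bit primes -> 899
print(small.bit_length())   # 9
print(large.bit_length())   # 10
```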

Informally, we can say not all N-bit numbers are alike. Some are “small,” meaning they are close to the lower bound of 2^(N-1), the smallest possible N-bit number. At the other end of the spectrum are “large” N-bit numbers, closer to the high end of the permissible range at 2^N. Multiplying large N-bit numbers produces the expected 2N-bit product, while multiplying small ones can fall short of the goal.

RSA implementations commonly correct for this by setting some of the leading bits of each prime to 1, forcing every generated prime to be “large.” In other words, the random primes are not selected from the full interval [2^(N-1), 2^N - 1] but from a tighter interval that excludes “small” primes. (Why not roll the dice on the full interval and check the product after the fact? See the earlier point about the time-consuming nature of RSA key generation. Starting over from scratch is expensive.) This is effectively an extension of logic already present in prime generation, namely setting the most significant bit to one. Otherwise the naive way to “choose a random N-bit prime” by considering the entire interval [0, 2^N - 1] can result in a much shorter prime, one that begins with an unfortunate run of leading zeroes. That guarantees failure: if one of the factors is strictly less than N bits, the final modulus can never hit the target of 2N bits.
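Here is a sketch of that interval-tightening in Python. Primality testing is omitted entirely; the point is only the bit-length guarantee that comes from pinning the two leading bits:

```python
import random

# Force the top two bits of each candidate to 1, so every value lands
# in the top quarter of the N-bit range. Then p, q >= 1.5 * 2^(N-1),
# so p*q >= 2.25 * 2^(2N-2) > 2^(2N-1): the product always reaches the
# full 2N bits. (No primality testing here; bit-length argument only.)

def random_top_quarter(n_bits: int) -> int:
    candidate = random.getrandbits(n_bits)
    candidate |= (0b11 << (n_bits - 2))   # set the two leading bits
    return candidate

for _ in range(1000):
    p = random_top_quarter(1024)
    q = random_top_quarter(1024)
    assert (p * q).bit_length() == 2048   # never falls to 2047
```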

So we know this token has a design flaw in prime generation that occasionally outputs a 2047-bit modulus when asked for 2048. How occasional? If it were only setting the MSB to one and otherwise picking primes uniformly in the permissible interval, the error rate can be approximated by the probability that two random variables X and Y, selected independently at random from the range [1, 2], have a product less than 2. Visualized geometrically, this is the area under the curve xy < 2 in a square region defined by sides in the same interval. That is a standard calculus problem that can be solved by integration. It predicts about 40% of RSA moduli falling short by one bit. That fraction is not consistent with the observed frequency, which is closer to 1 in 10, an unlikely outcome from a Bayesian perspective if that were an accurate model for what the token is doing. (Note that if the two leading bits were forced to one on both primes, the error case would be completely eliminated. From that perspective, the manufacturer was “off-by-one” according to more than one meaning of the phrase.)
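That integral is easy to sanity-check numerically:

```python
import math
import random

# The integral above: for X, Y uniform on [1, 2], the area under
# xy < 2 is the integral of (2/x - 1) from 1 to 2, i.e. 2*ln(2) - 1.
exact = 2 * math.log(2) - 1
print(f"{exact:.4f}")   # 0.3863

# Monte Carlo check of the same quantity.
random.seed(1)
trials = 200_000
short = sum(random.uniform(1, 2) * random.uniform(1, 2) < 2
            for _ in range(trials))
print(f"{short / trials:.3f}")   # close to 0.386
```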

So how damaging is this particular quirk to the security of RSA keys? It is certainly embarrassing by virtue of how easy it should have been to catch during testing. That does not reflect well on the quality assurance. Yet the difficulty of factoring 2047-bit keys is only marginally lower than that of the full 2048 bits— which is to say far outside the range of currently known algorithms and computing power available to anyone outside the NSA. (Looked at another way, forcing another bit to 1 when generating the primes also reduces the entropy.) Assuming this is the only bug in RSA key generation, there is no reason to throw away these tokens or lash out at the vendor. Also to be clear: this particular token was not susceptible to the ROCA vulnerability that affected all hardware using Infineon chips. In contrast to shaving one bit off the generated key, the Infineon RSA library produced full-size keys that were in fact much weaker than they appeared due to the special structure of the primes. That wreaked havoc on many large-scale systems, including the latest generation Yubicrap (after a dubious switch from NXP to Infineon hardware) and the Estonian government electronic ID card system. In fact the irony of ROCA is proving that key length is far from being the only criterion for security. Due to the variable strategies used by Infineon to generate primes at different lengths, customers were better off using shorter RSA keys on the vulnerable hardware:

(Figure 1, taken from the paper “The Return of Coppersmith’s Attack: Practical Factorization of Widely Used RSA Moduli”)

The counter-intuitive nature of ROCA is that the estimated worst-case factorization time (marked by the blue crosses above) does not increase in an orderly manner with key length. Instead there are sharp drops around 1000 and 2000 bits, creating a sweet spot for the attack where the cost of recovering keys is drastically lower. Meanwhile the regions shaded yellow and orange correspond to key lengths where the attack is not feasible. To pick one example from the graph, 1920-bit keys would not have been vulnerable to the factorization scheme described in the paper. Even 1800-bit keys would have been a better choice than the NIST-anointed 2048. While 1800-bit keys were still susceptible to the attack, it would have required too much computing power—note the Y-axis has a logarithmic scale— while 2048-bit keys were well within range of factoring with commodity hardware that can be leased from AWS.

It turns out that sometimes it is better missing the mark by one bit than hitting the target dead-on with the wrong algorithm.

CP