Second Breakfast
Web 2 → Web 3
April 17, 2021, 1:48 p.m.
In my last post I discussed some of the key milestones that led to the development of the Internet and the modern Web as we know it today. One of the fundamental design choices I covered was Tim Berners-Lee's decision to make unidirectional links, rather than bidirectional ones, the default for the Web. This choice has had severe (yet unintentional) consequences and is a primary reason the modern Web has trended toward centralization. In this post I'll further explain some of the problems associated with the modern Web, discuss the benefits of a decentralized Web, and explore the concept of [content-addressing](https://docs.ipfs.io/concepts/content-addressing/ "content-addressing"), an alternative way to identify and retrieve data developed by the [InterPlanetary File System (IPFS)](https://ipfs.io/ "InterPlanetary File System (IPFS)").

## **The Centralized Web**

In the early years of the Internet - from the 1980s through the early 2000s - internet services were built on open protocols designed and developed by communities of users who were (generally) ideologically aligned in their pursuit of providing reliable, censorship-resistant access to information for everyone. This era of the internet is commonly referred to as Web 1, and these open protocols were maintained by working groups or non-profit organizations that relied on the alignment of interests within the Internet community to gain adoption. From the mid-2000s to the present, however, for-profit tech companies - most notably Google, Apple, Facebook, Microsoft, and Amazon - have built a plethora of web applications and services that have rapidly outpaced the capabilities of the open protocols. This trend was accelerated by the explosive growth of smartphones, as mobile apps came to account for the majority of internet use. Eventually, users migrated from applications and services built on open protocols to these more sophisticated, centralized services.
Even when users still accessed open protocols like the Web, they typically did so through applications and services built by those same for-profit tech companies. This era is commonly referred to as Web 2. Before going further, I would be remiss to ignore the advancements and benefits that the web applications and services developed by big tech companies have provided to society. The technologies these companies have built give billions of people access to information on a scale and scope previously incomprehensible, and many of their services are free to use. Big Tech has played a strong supporting role in the advancement of our collective knowledge in almost every domain, and it would be absurd to write off the benefits of their applications and services entirely.

The standard model for web applications and services on the modern Web is the [client/server architecture](https://en.wikipedia.org/wiki/Client%E2%80%93server_model "client/server architecture"), in which many clients (remote processes) request and receive service from a centralized server. A major advantage of this architecture is that a single server can host the database of usernames and passwords and manage the level of access individual users and computers have on a given network; another is its ability to scale in a cost-effective and relatively secure manner. Under this model, however, the relationships Big Tech companies have with network participants are predicated on extractive business models that prey on user data, and those relationships gradually shift from positive- to zero-sum as use of their applications and services grows.
Furthermore, as expertly [explained by Chris Dixon](https://onezero.medium.com/why-decentralization-matters-5e3f79f7638e "explained by Chris Dixon"), these centralized services of the modern Web have “*also created broad societal tensions, which we see playing out in debates over subjects like fake news, state sponsored bots, EU privacy laws, and algorithmic biases.*” Unless something is done to reorient the way information and data are generated and consumed, it is no stretch to posit that these debates will only intensify. One common response to this centralization is to impose government regulation on big tech companies. However, this response is inadequate, because it assumes the Internet is similar to other communication networks like the phone, radio, and TV networks. The hardware-based networks of the past are fundamentally different from the Internet, which is a software-based network. Once hardware-based networks are built, they are very difficult to rearchitect. Software-based networks, by contrast, are far more dynamic and fluid, and can be rearchitected through entrepreneurial innovation and market forces.

## **The Decentralized Web**

An alternative approach to reorienting the generation and consumption of user data on the Web is to build applications and services that are politically and architecturally decentralized. To explain what I mean by politically and architecturally decentralized, I refer to [Vitalik Buterin's definitions](https://medium.com/VitalikButerin/the-meaning-of-decentralization-a0c92b76a274 "Vitalik Buterin's definitions") of these networks:

- **Politically decentralized** — how many individuals or organizations ultimately control the computers that make up the system?
- **Architecturally decentralized** — how many physical computers is the system made up of? How many of those computers can it tolerate breaking down at any one time?
By pursuing this approach, users stand a much higher chance of retaining ownership of the data they create, and of receiving appropriate attribution for the use of that data. Web applications and services that are politically and architecturally decentralized thereby give individual users the means to take control of how their data is utilized, and to monetize its use as they see fit. This era is commonly referred to as Web 3, and it is just beginning.

One of the most important differences between the architecture of the centralized Web (Web 2) and that of the decentralized Web (Web 3) is the way we identify and retrieve data. URLs (Uniform Resource Locators) are the primary addresses we give each other for data on the centralized Web. They are useful because they let us make links and connect data; however, URLs are based on the location where data is stored, not on the contents of the resources stored there. This is called location addressing, and it presents a plethora of problems. Through the domain name, a URL indicates which authority we should go to for the data. Under the standard client/server architecture of Web 2, the links referencing the data are location-based, and the data itself is centralized on a server owned by an authority. This centralization makes finding and retrieving the data we seek rather trivial, but ultimately the contents of a file hosted on a centralized server have no direct relationship with their location-based address. Location-based addressing creates a confusing mess of data saved multiple times at different URLs, making it difficult to attribute ownership, or to tell which items are duplicates and which are originals. Conversely, on the decentralized Web we can all host each other's data, with a different kind of linking that is more secure, making it easier to trust our peers.
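To make the duplication problem concrete, here is a hypothetical sketch of a location-addressed store; the URLs and file contents are invented for illustration. Two different addresses holding identical bytes look like two unrelated resources, and nothing in the addresses themselves reveals the duplication:

```python
# Hypothetical sketch: a location-addressed store in the Web 2 style.
# Keys are locations (URLs); values are the bytes stored there.
store = {
    "https://site-a.example/report.pdf": b"annual report contents",
    "https://site-b.example/files/copy.pdf": b"annual report contents",
    "https://site-a.example/notes.txt": b"meeting notes",
}

# The only way to discover duplicates is to compare the contents
# themselves -- the location-based links give us no help at all.
items = list(store.items())
duplicates = [
    (url_a, url_b)
    for i, (url_a, content_a) in enumerate(items)
    for url_b, content_b in items[i + 1:]
    if content_a == content_b
]

# The identical report stored at two URLs shows up as a duplicate pair,
# even though the addresses alone told us nothing.
print(duplicates)
```

The same comparison also shows why attribution is hard under location addressing: neither URL is any more "original" than the other.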
This new form of linking, called [Content Addressing](https://docs.ipfs.io/concepts/content-addressing/ "Content Addressing"), is facilitated by [cryptographic hashing](https://en.wikipedia.org/wiki/Cryptographic_hash_function "cryptographic hashing"), and it liberates us from reliance on central authorities. A cryptographic hash function takes data of any size and type and returns a single, fixed-size "hash" that represents it. A hash is a string of characters that you can think of as a unique name for the data. These hashes, and the practice of storing data via content addressing, are much more secure for a number of reasons. For example:

- Cryptographic hashes are derived from the content of the data itself, meaning that anyone running the same algorithm on the same data will arrive at the same hash.
- Cryptographic hashes are effectively unique: finding two different pieces of data that produce the same hash is computationally infeasible.

These core attributes enable us to reframe our relationship with data and the Web. On the centralized Web, we've learned to trust certain authorities and not others. We do our best with the clues we get from URLs, but some malicious actors exploit the shortcomings of location addressing to trick us. Conversely, on the decentralized Web we all pitch in and host each other's data, and content addressing enables us to trust the information that's shared. We may not know much about the peers who host the data, but hashes prevent malicious actors from deceiving us about the contents of files. This ability to share and receive data in a trustless manner is what makes cryptographic hashing so important to the decentralized Web. Furthermore, since we use hashes to request data on the decentralized Web, we can think of a hash as a link, not just a name. It's worth noting that content addressing can be used on all different types of files and data, from JSON objects to research papers to video files.
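These properties are easy to see with the standard library's SHA-256 implementation (the hash function IPFS uses by default); the input strings below are arbitrary examples:

```python
import hashlib

# Any data, of any size, hashes to a fixed-size digest.
small = b"hello"
large = b"x" * 10_000_000

h_small = hashlib.sha256(small).hexdigest()
h_large = hashlib.sha256(large).hexdigest()
assert len(h_small) == len(h_large) == 64  # 256 bits = 64 hex characters

# Hashing is deterministic: anyone running the same algorithm on the
# same data gets the same hash, so the hash can serve as a shared name.
assert hashlib.sha256(b"hello").hexdigest() == h_small

# Change even one byte and the hash changes, which is what lets a peer
# verify that the data it received matches the name it asked for.
assert hashlib.sha256(b"hellp").hexdigest() != h_small
```

That last check is the heart of trustless retrieval: a peer that hands you tampered bytes cannot make them match the hash you requested.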
It's also noteworthy that content addressing (via cryptographic hashing) is not new. Tools like Git and protocols like Ethereum and Bitcoin already utilize content addressing, but each of them differs in meaningful ways; namely, in how it interprets the data and in which cryptographic hash function it uses. From this we can discern that for content addressing to work effectively, we need to know two things: (1) what data format we're working with, and (2) what hashing algorithm we intend to use.

## **Content Identifiers (CIDs) & IPFS**

One particular form of content addressing used on Web 3 is the [content identifier, or CID](https://github.com/multiformats/cid "content identifier, or CID"). A CID is a format for referencing content in distributed information systems. It leverages content addressing, cryptographic hashing, and [self-describing formats](https://github.com/multiformats/multiformats "self-describing formats"): it carries a version prefix and a [multicodec](https://github.com/multiformats/multicodec "multicodec") indicating how the content is encoded, making it fully self-describing. The CID specification is the core identifier used by IPFS and [IPLD](https://docs.ipld.io/ "IPLD"), and it supports a broad range of projects including [Libp2p](https://libp2p.io/ "Libp2p"), [OpenBazaar](https://openbazaar.org/ "OpenBazaar"), [Parity](https://www.parity.io/ "Parity"), and [FileCoin](https://filecoin.io/ "FileCoin"). The number of characters in a CID depends on the cryptographic hash of the underlying content, rather than on the size of the content itself. Most content in IPFS is hashed using the [SHA2-256 hashing algorithm](https://en.wikipedia.org/wiki/SHA-2 "SHA2-256 hashing algorithm"), so most CIDs you encounter there will be the same size (256 bits, or 32 bytes). To support multiple hashing algorithms, IPFS uses [Multihash](https://github.com/multiformats/multihash#multihash "Multihash").
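As a rough sketch of the idea behind a self-describing hash: prefix the digest with a code naming the algorithm and a byte giving the digest length. This is a simplification of the real multihash format, which uses varint-encoded codes from a registered table, though 0x12 is indeed the registered code for sha2-256:

```python
import hashlib

SHA2_256 = 0x12  # the registered multihash code for sha2-256

def multihash_sha256(data: bytes) -> bytes:
    """Illustrative only: <algorithm code> <digest length> <digest>."""
    digest = hashlib.sha256(data).digest()
    return bytes([SHA2_256, len(digest)]) + digest

mh = multihash_sha256(b"hello ipfs")
assert mh[0] == 0x12      # a reader can tell which algorithm was used...
assert mh[1] == 32        # ...and how long the digest is (32 bytes)
assert len(mh) == 2 + 32  # prefix plus the sha2-256 digest itself
```

Because the hash announces its own algorithm and length, a system built on it can adopt a new hash function later without breaking old identifiers.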
Multihash is a protocol for differentiating the outputs of various well-established cryptographic hash functions, addressing size and encoding considerations. In short, a multihash is a self-describing hash that contains metadata describing both its length and the cryptographic algorithm that generated it. CIDs that utilize [Multiformats](https://github.com/multiformats/multiformats#background "Multiformats") are therefore future-proof, because they use multihash to support multiple hashing algorithms rather than relying on a specific one. As an example, if we stored an image of Biden signing some awesome Executive Order on the IPFS network, its CID would look like this:

`Qme2rBAgyKanLBiPuHgNLPJkwY4eW8VzrUscxhConVJ1uu`

A CID doesn't indicate where the content is stored; rather, it forms an address based on the content itself. In this way, CIDs function like fingerprints for data, consisting primarily of a cryptographic hash of the data itself. We can therefore use this "fingerprint" as a unique and succinct name that points to the data. Because the name is unique, we can use it as a link, replacing location-based identifiers like URLs with identifiers based on the content of the data itself. In sum, *CIDs let us reference data securely, verifiably, and without coordination in a distributed network*.

## **Content Addressing + Data Structures**

A CID functions like a fingerprint for a blob of data, and consists primarily of a cryptographic hash of the data itself. This allows us to create a universal identifier for any file system, and because CIDs replace location-based identifiers with ones based on the content of the data itself, we can utilize them as fundamental tools for representing information in a variety of formats. When interacting with these various formats of information, at a certain level of detail it becomes necessary to formally describe the properties of that information (or data).
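To make the idea of linking data by content concrete, here is a minimal sketch of a content-addressed block store, where a "link" is just the hash of another block. The real thing (IPFS with CIDs and IPLD encodings) is far richer; plain SHA-256 hex digests stand in for CIDs here, and the stored blobs are invented examples:

```python
import hashlib
import json

# A block store keyed by the hash of each block's bytes.
blocks: dict[str, bytes] = {}

def put(data: bytes) -> str:
    """Store a block; its hash -- derived from the content -- is its address."""
    name = hashlib.sha256(data).hexdigest()
    blocks[name] = data
    return name

def get(name: str) -> bytes:
    """Fetch a block and verify that the bytes really match their name."""
    data = blocks[name]
    assert hashlib.sha256(data).hexdigest() == name
    return data

# A structured object can link to other data by hash rather than location.
chapter = put(b"chapter one text")
cover = put(b"cover image bytes")
book = put(json.dumps({"title": "demo", "links": [chapter, cover]}).encode())

# Following a link means fetching by hash and verifying on arrival --
# it doesn't matter which peer actually served the bytes.
manifest = json.loads(get(book))
assert get(manifest["links"][0]) == b"chapter one text"
```

Note that the integrity check in `get` is what lets untrusted peers host the blocks: a peer can refuse to serve data, but it cannot serve the wrong data undetected.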
On the decentralized Web we access data directly from our peers, rather than from a central authority. Within an isolated environment, such as your own laptop, you can place a great degree of trust in the data structures you work with in memory or on disk. In a decentralized system, however, you have less trust among peers, possibly none. To fit this environment, we need an efficient way to link data structures together while still preserving our ability to verify their integrity (a crucial property of CIDs). This is where data structures like blockchains and Merkle DAGs come into play. These structures provide the foundation for a trustworthy, distributed web of interlinked data. While there are key advantages to properly structuring data, it's worth noting that there is no single best way to structure one's data; every choice comes with significant tradeoffs. Structure gives our data meaning and organization. Structure can serve as an index for data, affecting the speed with which we can locate and retrieve specific information. Structure can also add semantics to data by enabling us to group related objects.

## **TL;DR**

The standard model for web applications and services in the era of Web 2 is the client/server architecture, which provides a host of benefits, such as the ability to scale web applications and services in a cost-effective and relatively secure manner. However, this model is predicated on extractive business practices that prey on user data, and the relationships it creates gradually shift from positive- to zero-sum as use of applications and services grows. This centralized model is a primary contributor to several broad societal tensions we see around the globe today, and unless the way we generate and consume information and data is reoriented, these tensions will continue to grow.
An alternative approach to developing web applications and services that facilitate the generation and consumption of user data is to build them in a politically and architecturally decentralized manner. By pursuing this approach, users stand a much higher chance of retaining ownership of the data they create, and of receiving appropriate attribution for its use. A promising way to facilitate this re-architecting is to utilize content-based addressing in place of location-based addressing. A CID is a format for referencing content in distributed information systems; it leverages content addressing, cryptographic hashing, and self-describing formats. A CID doesn't indicate where the content is stored; rather, it forms an address based on the content itself. In this way, CIDs function like "fingerprints" for data, which can be used as unique and succinct names that point to the data. Because these names are unique, we can use them as links, and we can utilize them to represent various formats and properties of information (or data). In the era of Web 3, we will see more and more applications and services built in a manner that enables access to data directly from peers, rather than from a central authority. To fit this environment, we need an efficient way to link data structures together while still preserving our ability to verify their integrity. This is where data structures like blockchains and Merkle DAGs come into play. These structures provide the foundation for a trustworthy, distributed web of interlinked data, and that is exactly where I'll pick up in my next post.