Internet Archive – Why preserving the internet matters & what can we learn from it? P1
How much internet is there? And what happens to it when it’s gone? Today we’re exploring the fascinating history of the Internet Archive and doing our best to learn from it, on both the software side and the infrastructure side. How can one organization hold the collective data of the entire internet (or as much of it as it can), and what can it teach us?
Study the past if you would define the future! – Confucius
If you want to understand today, you must search yesterday. – Pearl Buck
These two wonderful quotes drive us to go deeper into the weeds of history, to learn and improve, as people and as an organization! Especially so if you consider another famous quote:
And though some might find this a gloomy quote, others will see the potential!
Remember: History repeats itself!
So, by exploring the past, we can get a clearer picture of what the future holds, and what new services or products await us! We made a bigger argument for this when we explored the past of social media. Later we even made our case for why it’s probably the best time to get a social media platform of your own!
What is the Internet Archive?
What is the Internet Archive exactly? A “small” non-profit organization made up of 168 full-time employees and countless volunteers, founded in 1996 by Brewster Kahle, an American computer engineer and activist. He is also credited with founding Alexa Internet (Alexa.com), one of the leading SEO analysis software companies.
Initially Mr. Kahle’s goal was simple: archive the World Wide Web, BBSes, and any other publicly available software and webpages. Well, not actually simple, especially once the scope grew in 1999 to include noteworthy archive collections such as NASA’s image archive and the Prelinger Archives, among others.
However, at that time the archive was not yet publicly available. That only changed in 2001, when the “Wayback Machine” launched.
The Wayback Machine is the publicly available window into the archived web, and archive.org as a whole is a non-profit library holding millions of books, videos, music, software, websites, and other digital media. Since the launch in 2001, the archive.org library has remained the last place of existence for a lot of information. There you can see what websites used to look like, browse ones that no longer exist, or even grab source code that is no longer available anywhere else.
And it’s hard to go bigger than 45 petabytes of webpage data as of the drafting of this article! Add almost the same amount of other digital media (documentaries, films, music, etc.) stored on their servers, and that makes a total of 90 petabytes or more of raw data. Data that can be used if you know how…
For reference: 1 petabyte = 1 million gigabytes.
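For the curious, here is a minimal sketch of how you could ask the Wayback Machine what a site looked like around a given date, using the public availability endpoint at archive.org/wayback/available. The helper names (`build_availability_url`, `closest_snapshot_url`, `lookup`) are our own, and `SAMPLE_RESPONSE` is an illustrative payload in the shape the endpoint returns, not real captured output.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"

# Illustrative payload only: it mirrors the shape of a real response.
SAMPLE_RESPONSE = {
    "url": "example.com",
    "archived_snapshots": {
        "closest": {
            "status": "200",
            "available": True,
            "url": "http://web.archive.org/web/20060101000000/http://example.com/",
            "timestamp": "20060101000000",
        }
    },
}


def build_availability_url(page_url: str, timestamp: Optional[str] = None) -> str:
    """Build the query URL; timestamp is a YYYYMMDD[hhmmss] string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot_url(payload: dict) -> Optional[str]:
    """Return the archived copy closest to the requested date, if any."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Perform the actual network request (not called on import)."""
    with urlopen(build_availability_url(page_url, timestamp), timeout=10) as response:
        return closest_snapshot_url(json.load(response))
```

Calling `lookup("example.com", "20060101")` would go out to the network and return the address of the nearest snapshot, or `None` when nothing has been archived.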
Why does it matter?
Well, culture of course! Historic preservation is a collective task of all humanity: to protect and preserve buildings, structures, objects, art, and so much more. So why not websites as well?
The internet is a massive milestone in humanity’s progress, a landmark that will be studied in schools and universities much like the history of ancient Rome or the dinosaurs! With the help of the World Wide Web, humanity has skyrocketed its development, and that is worth celebrating!
If that is not enough, then consider this: according to the aforementioned Alexa, archive.org receives approximately 3 million unique visits per month.
The value of that traffic is immeasurable when it comes to human behavioural analysis by social scientists, which in turn helps us in marketing and entrepreneurship to develop better and more effective campaigns based on actionable data!
As a matter of fact, the initial set of Apache big data projects (Hadoop and Spark) is said to have started at the Internet Archive, while Alexa Internet, a start-up later sold to Amazon, formed the basis of the Alexa Web Information Service (AWIS, one of the first AWS services).
However, the main goal of the Internet Archive and its core mission is: “Universal Access to All Knowledge!”
How do they do it? Server Infrastructure
But how do they manage to store all that data? Surely they must be using some sort of advanced infrastructure model, some cross-cloud framework, or even a form of edge computing? The answer is… complicated.
A typical monolithic server structure is relatively simple. Internet Archive servers are both simple and incredibly complex. At the base level, though, it all comes down to server racks, and the servers in them are mostly made up of dozens of hard drives, some dating back to 2012.
The full scope of the architecture is massive – at present the rough numbers include:
– 750 servers
– 1,300 VMs running
– 30,000 storage devices
– More than 20,000 spinning disks (in paired storage)
And the bandwidth capacity as of January 2021 is 64 Gb per second.
The current raw storage capacity is almost 200 petabytes. By the estimates of the IA core infrastructure team, it grows by roughly 25% per year (or 10-12 petabytes of raw capacity per quarter).
With 16-terabyte disks it would take about 15 racks to store the archive’s current capacity. There are currently 75 racks in total, some of which are still running 4-terabyte disks.
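The numbers above are easy to sanity-check. What follows is a back-of-the-envelope sketch, not the IA team’s own calculation: it assumes decimal units (1 petabyte = 1,000 terabytes), rounds “almost 200 petabytes” up to 200, and spreads the 25% yearly growth evenly over four quarters. The function names are ours.

```python
import math

TB_PER_PB = 1000          # assumption: decimal units
RAW_CAPACITY_PB = 200     # "almost 200 petabytes", rounded up
RACKS_FOR_16TB = 15       # rack count quoted in the article
ANNUAL_GROWTH = 0.25      # roughly 25% per year


def disks_needed(capacity_pb: float, disk_tb: float) -> int:
    """Number of disks of a given size needed to hold the raw capacity."""
    return math.ceil(capacity_pb * TB_PER_PB / disk_tb)


def added_per_quarter_pb(capacity_pb: float, annual_growth: float) -> float:
    """Raw capacity added each quarter if yearly growth is spread evenly."""
    return capacity_pb * annual_growth / 4


disks_16tb = disks_needed(RAW_CAPACITY_PB, 16)        # 12500 disks
disks_4tb = disks_needed(RAW_CAPACITY_PB, 4)          # 50000 disks
disks_per_rack = math.ceil(disks_16tb / RACKS_FOR_16TB)  # 834 disks per rack
quarterly_pb = added_per_quarter_pb(RAW_CAPACITY_PB, ANNUAL_GROWTH)  # 12.5 PB
```

The quarterly figure lands at about 12.5 petabytes, in the same ballpark as the 10-12 petabytes quoted, and the jump from 12,500 to 50,000 disks shows why the older 4-terabyte drives eat up so many racks.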
Why so many? Paired Storage Paradigm or “Drive Mirroring”.
Paired Storage and Principles of Data storage
After storing massive quantities of data for so many years, the team at the Internet Archive understands one thing: all disk drives eventually fail. So what happens to the data in jeopardy?
When a disk does eventually fail, all the data on the drive is made “read only” and the operations team is alerted. Every piece of data is also stored on secondary and tertiary drives for safekeeping, so there are multiple copies of the same information across the data nodes. Once the failed drive is replaced, the data is copied onto it from the mirror drive and the status is set back to “read/write”.
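The failure-handling flow described above can be modelled in a few lines. This is a toy sketch under our own assumptions: the `Disk` and `MirroredPair` classes are hypothetical, disks are plain in-memory dictionaries, and the real system of course works on physical drives with an operations team in the loop.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Disk:
    name: str
    items: Dict[str, bytes] = field(default_factory=dict)
    healthy: bool = True


class MirroredPair:
    """Two disks holding identical copies of every item."""

    def __init__(self, primary: Disk, mirror: Disk) -> None:
        self.disks: List[Disk] = [primary, mirror]
        self.read_only = False

    def write(self, item_id: str, data: bytes) -> None:
        if self.read_only:
            raise PermissionError("pair is read-only until the failed disk is replaced")
        for disk in self.disks:
            disk.items[item_id] = data

    def read(self, item_id: str) -> bytes:
        # Any healthy copy can serve the read.
        for disk in self.disks:
            if disk.healthy and item_id in disk.items:
                return disk.items[item_id]
        raise KeyError(item_id)

    def report_failure(self, index: int) -> None:
        # Freeze writes; in real life this is also where ops gets alerted.
        self.disks[index].healthy = False
        self.read_only = True

    def replace_disk(self, index: int, new_disk: Disk) -> None:
        survivor = self.disks[1 - index]
        if not survivor.healthy:
            raise RuntimeError("no healthy mirror left to copy from")
        new_disk.items = dict(survivor.items)  # copy everything from the mirror
        self.disks[index] = new_disk
        self.read_only = False
```

Reads keep working the whole time because the surviving mirror holds a complete copy. Only writes are paused until the replacement disk has been filled.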
And though there are alternatives to drive mirroring for large storage systems (RAID arrays, Ceph, Hadoop, etc.), the Internet Archive chooses simplicity over everything else, primarily to ensure the transparency of data on a per-drive basis.
Another problem to consider is the potential for catastrophic failure in ECC (error-correcting code, i.e. erasure coding) approaches: falling below the minimum threshold of healthy disks means the total loss of all data in the array. That is simply inexcusable when it comes to the rare digital information stored at IA.
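To see why falling below the threshold is so dangerous, here is a minimal sketch using the simplest parity scheme there is: three data chunks plus one XOR parity chunk. It is a stand-in chosen for brevity, not the scheme any particular storage system uses, and the function names are our own.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_stripe(chunks: List[bytes]) -> List[bytes]:
    """Append one parity chunk to equal-length data chunks."""
    return chunks + [reduce(xor_bytes, chunks)]


def recover(stripe: List[Optional[bytes]]) -> List[Optional[bytes]]:
    """Rebuild the data chunks; None marks a chunk on a lost disk."""
    missing = [i for i, chunk in enumerate(stripe) if chunk is None]
    if len(missing) > 1:
        # Below the threshold: nothing in the stripe can be rebuilt.
        raise ValueError("too many disks lost: stripe is unrecoverable")
    repaired = list(stripe)
    if missing:
        survivors = [chunk for chunk in stripe if chunk is not None]
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired[:-1]  # drop the parity chunk, keep the data
```

Lose one chunk and the data comes back. Lose two and the missing chunks cannot be rebuilt, while the survivors are only fragments of a larger whole. With mirrored drives, every surviving disk is still a complete, readable copy on its own.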
Core principles of Drive Mirroring:
Transparency: All items* in the archive are directories on the disks.
Simplicity: The fundamental unit of storage is the disk. If there’s a problem, there are few levels of abstraction to dig through. This makes supporting and replacing failing disk drives far easier.
Durability: Disks are replicated across all datacentres, providing failsafes upon failsafes.
Performance: Content is served from all copies simultaneously, lessening the overall load on any one datacentre.
Longevity: Formats evolve and are replaced as needed. As disks age out or fail, it’s simply a matter of replacing the old with the new, slowly and steadily.
*Items are the Internet Archive’s unit of data storage, which they explain in this quick article.
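The transparency principle is worth a tiny demonstration. In the sketch below, “disks” are just temporary folders and an item is an ordinary directory copied to each of them. The layout, the item name, and the helper functions are all hypothetical, chosen only to show that nothing fancier than the filesystem is needed to read an item back from whichever copy survives.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


def store_item(disks: List[Path], item_id: str, files: Dict[str, bytes]) -> None:
    """Write the item as a plain directory on every disk in the pair."""
    for disk in disks:
        item_dir = disk / item_id
        item_dir.mkdir(parents=True, exist_ok=True)
        for name, content in files.items():
            (item_dir / name).write_bytes(content)


def read_file(disks: List[Path], item_id: str, name: str) -> bytes:
    """Serve the file from the first disk that still has it."""
    for disk in disks:
        candidate = disk / item_id / name
        if candidate.is_file():
            return candidate.read_bytes()
    raise FileNotFoundError(f"{item_id}/{name}")


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as root:
        disks = [Path(root) / "disk_a", Path(root) / "disk_b"]
        store_item(disks, "example-item", {"page.html": b"<html>hi</html>"})
        shutil.rmtree(disks[0])  # simulate losing one disk entirely
        return read_file(disks, "example-item", "page.html")
```

Because each copy is a normal directory, a surviving disk can be inspected with everyday tools, which is exactly the kind of per-drive transparency the list above describes.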
Disaster looming – The Splinternet
And now comes the grim reality, and part of the reason why we’ve decided to dedicate our time to the Internet Archive.
The core mission of IA is: Universal Knowledge for Everyone! But as Mr. Kahle can attest, with his growing number of appearances and generally grim view of the archive’s prospects, the future is uncertain.
The reason for that is something called the “Balkanization of the Internet”. The world has changed in the past decade, and the internet has grown (and keeps growing) several walled gardens, otherwise known as firewalled regions, such as Russia, China, India, Turkey, etc.
Where Internet Archive crawlers once had relatively free access across the web, recent changes all point towards the possible end of the archive’s golden age. Nations are paying increasingly more attention to internet regulation and, some might say, propaganda, but that’s just the tip of the iceberg.
According to Kahle and Bailey (Co-Creator), corporations are just as capable of fracturing the World Wide Web and creating these informational bubbles, with strict access to information on a need-to-know basis. The dreaded “Splinternet” is sadly becoming a reality.
This all leads to an uncomfortable situation where companies could potentially create walls or restrictions to information simply based on what products people pay for.
Mr. Kahle’s outlook is becoming increasingly pessimistic:
But not all is lost! Organizations such as IFEX and the ACLU are attempting to push back and heavily advocate for freedom of information on the internet. And while that is going on, the Internet Archive is preparing for its 25th anniversary, planned for October 29th, 2021.
If we’ve managed to pique your interest, make sure to check them out, or better yet, consider donating to the organization.
The past 25 years have been wild, and the next 25 are promising to be wilder still. As a non-profit organization, much like Wikipedia, they must rely on donations to survive.
A little goes a long way: for a mere $20, the team at IA can acquire, digitize, and preserve a book forever. The core message of the organization stays the same, however: free knowledge for everyone, and if that is not a noble intention, we don’t know what is!
Next time we’ll continue exploring the infrastructure of the Internet Archive: we’ll try to peek into their website, the potential future of Machine Learning/AI applications, and of course possible new business models associated with IA.
And don’t forget our previous forays into the past: one time we explored the first computer and proto-programming languages! Another time we examined the history and the bright future of SEO!
Before you go, however, tell us: What was the most fun thing you’ve found on the Wayback Machine?
Stay classy business and tech nerds!