Parting Thoughts

PartingThoughts.net

Creating a Lasting Website as a Personal Archive

Posted 20 April 2016

I created this website as a way to publish my blog posts, essays, photographs, and a few other projects.

My focus is on creating something that will live longer than I will — hopefully a lot longer. It turns out that this is not so easy to do!

I have picked 50 years as a goal. Longer would be better; I’d like 250 years! But it is hard to think clearly about even 50 years in the future.

There’s a big stack of issues to consider, from the domain name and hosting to the content management system (CMS) and software maintenance. I’m going to survey most of them in this article, and then dive into a couple of big issues in future articles.

Third-Party Dependencies

With longevity of the site as a key goal, it is important that the website avoid third-party services. For example, current websites often load JavaScript libraries and font files from Google or other content delivery networks. Using a third-party server to supply a file has some performance and simplicity advantages, but it makes the site vulnerable to any change in the third-party service. My top goal is maximum long-term stability, so third-party services are out.
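As a rough illustration of how such dependencies could be spotted, the sketch below scans a page's HTML for scripts, stylesheets, images, and frames served from another host. It uses only Python's standard library. The function name, the class name, and the sample markup are hypothetical, not taken from this site.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalResourceFinder(HTMLParser):
    """Collects the hosts of automatically loaded resources not served by our own domain."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host
        self.external_hosts = set()

    def handle_starttag(self, tag, attrs):
        # Only look at tags that make the browser fetch something on page load.
        if tag not in ("script", "link", "img", "iframe"):
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).netloc
                # Relative URLs have an empty host and are served by our own site.
                if host and host != self.own_host:
                    self.external_hosts.add(host)


def find_external_hosts(html, own_host):
    """Return a sorted list of third-party hosts referenced by the page."""
    finder = ExternalResourceFinder(own_host)
    finder.feed(html)
    return sorted(finder.external_hosts)


# Hypothetical sample page, for illustration only.
sample = """
<html><head>
<link rel="stylesheet" href="/css/site.css">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Lato">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.0/jquery.min.js"></script>
</head><body><img src="https://partingthoughts.net/images/photo.jpg"></body></html>
"""
print(find_external_hosts(sample, "partingthoughts.net"))
```

Any host this reports is a service the page silently depends on. Ordinary links to other sites are deliberately ignored, since a dead link does not stop the page itself from rendering.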

The current site does use some third-party services, such as AddThis for sharing, for ease of development. As long as these are optional services that don’t otherwise affect the site, they are an acceptable risk. When it comes to dealing with social networks, stability and avoidance of third-party services just aren’t possible.

Third-party software is fine, though, as long as it is considered part of the website’s software. For the CMS, I chose to use WordPress, knowing that it is more widely used than any other system on the web and is therefore likely to be around for a long time.

Normally, an unattended WordPress installation is a dangerous thing, considering the number of hack attempts aimed at WordPress sites and the frequency of software updates.

There’s a whole host of issues here that I’ll address in a future article. For now, let’s put this big messy issue aside, and consider the WordPress software and the PHP/MySQL stack that underlies it as a given.

Offline Files, Online Website

A website does not need to be connected to the Internet; it can exist, in a static form, as a set of bits on any storage device or service.

An offline website is good only as a “deep freeze” backup, however. My guess is that most sites that make it to this state are never seen again.

I am putting together both an offline set of files and a website. The offline files will be my complete data set in raw form. This makes it easy to assemble the content, since no user interface is needed, and it can be as inclusive as I want without a lot of effort.

While I like the idea of having this big pile of bits that will capture a big piece of my life, it is far more important to me to have the most interesting of this content easily available to anyone. And for that, I need a website.

Long-Term Hosting?

To be on the web, the site needs to be hosted by someone. That is simple enough to do, of course, for immediate use; there are hundreds of hosting companies to choose from.

But how do I set up hosting that will run indefinitely? Servers cost money to run and need maintenance and support. Any company that is offering “lifetime” hosting can only be referring to that company’s lifetime. The few I found offering this kind of service did not appear to be big enough that I would bet on their lifetime.

The best approach, putting aside cost for the moment, seems to be to host the site on two different services, with automatic synchronization between the two. This adds cost and complexity that I don’t want to deal with right now, but I’d like to add it in the future.
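To make the synchronization idea concrete, here is a minimal sketch that compares two copies of a site by content hash and lists the files the second copy is missing or holds in a different version. It assumes both copies are reachable as local directories, which is a simplification. A real setup would compare against a remote host, and the function names here are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file's path (relative to root) to the SHA-256 digest of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def files_to_copy(primary_root, mirror_root):
    """Return relative paths that are missing from the mirror or differ from the primary."""
    primary = file_digests(primary_root)
    mirror = file_digests(mirror_root)
    return sorted(p for p, digest in primary.items() if mirror.get(p) != digest)
```

Calling `files_to_copy("site", "mirror")` returns the list of paths a sync job would need to transfer. The sketch treats the first copy as authoritative and does not detect files that exist only on the mirror.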

So it comes down to betting on one company to be the initial host. My guess is that Amazon’s AWS services are likely to be among the last hosting services standing. Their scale, in terms of the number of companies that depend on their services and the range of services they provide, as well as their own businesses that use the same infrastructure, is a huge strength.

Amazon’s billing, however, is usage-based, so you get a bill at the end of every month and the amount varies depending on traffic. This is not a good match for our “set it and forget it” goal.

Amazon will sell you a “reserved instance” paid up to 3 years in advance, but they don’t offer longer periods than that. The service is about $200/year, when paid 3 years in advance, for what Amazon calls a t2.medium instance. So even if you could pay for 50 years, you’d be looking at a $10,000 bill. (I imagine Amazon might offer a discount for 50 years — or they may not want to commit in any way to that long a period.)
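The arithmetic behind that estimate can be sketched in a few lines. The figures are the ones quoted above. The sketch assumes, unrealistically, that the price stays flat for 50 years and that the 3-year term limit never changes.

```python
import math

ANNUAL_COST_USD = 200   # approximate t2.medium reserved-instance price quoted above
MAX_PREPAY_YEARS = 3    # longest prepaid term mentioned above
TARGET_YEARS = 50       # the longevity goal for the site

# Assumption: flat pricing, no discounts, no inflation.
total_cost = ANNUAL_COST_USD * TARGET_YEARS

# Number of separate prepaid terms someone would have to purchase over the period.
terms_needed = math.ceil(TARGET_YEARS / MAX_PREPAY_YEARS)

print(f"Total over {TARGET_YEARS} years: ${total_cost:,}")
print(f"Prepaid terms to purchase: {terms_needed}")
```

Under these assumptions, someone has to show up and pay 17 times over the life of the site, which makes the number of renewals as much of a concern as the total cost.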

Currently, I am hosting this site at WP Engine, because of their strong support for WordPress security. They might be a great long-term host, but they don’t want prepayment for any longer than one year, and they are too small for me to depend on as a 50-year solution.

One backstop is the Internet Archive’s Wayback Machine, which periodically captures snapshots of much of the web. This is a wonderful service, and I am delighted that it is there, but it doesn’t do a comprehensive enough job to be the primary site, or even the primary backup. For example, it doesn’t include all the images and other ancillary files.

So let’s assume we have hosting set up somewhere prepaid for a few years, and that the best we can do for the long term is to have a couple of people lined up to pay the bills, maintain the software, and find the next-generation team when they are ready to retire. Perhaps some sort of escrow account can be set up that will pay the bills in future years.

Finding the Website

If the website is going to have its own domain name, the name needs to be registered at one of the domain registrars. GoDaddy will sell you 10 years prepaid, and Network Solutions offers 25 years. So someone is going to have to pay to renew this, but not for a while.

In addition to domain name registration, someone needs to provide domain name services (DNS). Typically, this is done by the domain registrar, and they do it for free. So if the registration problem is solved, the DNS problem should be solved too.

Another option is to use a subdomain with an existing service — for example, this site could be partingthoughts.wordpress.com, instead of partingthoughts.net.

If you take this approach, however, the site is now tied to a particular hosting provider, which is an unnecessary risk. It might still be useful for a backup version of the site.

Storing Lots of Content

I have about 100,000 photos that I have taken during my lifetime. Only a tiny fraction of them will make it up onto the web, but they will still add up to a substantial amount of disk space. When I start adding in videos, the amount of space needed will skyrocket.

Fortunately, disk space, and the services that deliver it, have continued to fall in price and increase in capacity. Cost and availability of storage is no longer a limiting factor for this sort of website unless you have vast amounts of video or other big data sets.

In the AWS world, storage is the S3 service. It gives us arbitrary amounts of space for a low price. Some redundancy is built in, and you can pay extra for more redundancy if you want even higher reliability.

Of particular interest for archival storage is Amazon’s Glacier service, which keeps data in cold storage that is not instantly accessible. It is significantly less expensive than online storage, making it ideal for backups.

Character, Page, and Image Encoding

It is easy to lose sight of the fact that everything that we store, process, and deliver from our web servers is just ones and zeros — there are no alphabetic characters, no images, no audio.

All of these media types are possible because of some defined encoding of the real-world entity — such as a photo — into bits. For these files to be readable, future software needs to be able to decode them. This is typically done by the browser, so it is dependent on what future browsers support.

Choosing encodings is something we need to do once, and do it right, so that our content will have as long a life as possible.

For characters, ASCII remains the heart of standard encodings. UTF-8, an ASCII-compatible encoding of Unicode, predominates on the web today. It stores each ASCII character in a single byte and uses two to four bytes for everything else, so it covers a much larger character set without spending a fixed 16 bits on every character. I’m betting that browsers will be able to display UTF-8 text for a very long time to come.
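A short sketch shows the variable-length behavior of UTF-8 using Python's built-in string encoding. The sample characters are arbitrary choices for illustration.

```python
def utf8_length(character):
    """Return how many bytes UTF-8 uses to store a single character."""
    return len(character.encode("utf-8"))


# One example from each length class: A, e-acute, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} takes {len(encoded)} byte(s): {encoded.hex()}")

# Plain ASCII text produces exactly the same bytes in both encodings.
text = "Parting Thoughts"
print(text.encode("utf-8") == text.encode("ascii"))
```

The last line is why the bet on UTF-8 looks safe: any software that can read ASCII can already read the ASCII portion of a UTF-8 file.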

Images are more complex, because it is not just a matter of encoding but also of resolution and compression. I am storing at least two versions of my photos, if they are on the web: a screen resolution version for the web, and the original file for the offline archive.

The file size difference is huge. Images on the website are typically 100 Kbytes to 1 Mbyte, while raw files from my camera are around 25 Mbytes.
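For a sense of scale, the numbers above can be combined into a back-of-the-envelope estimate. This sketch assumes every one of the roughly 100,000 photos mentioned earlier is a 25 Mbyte raw file, which overstates the total if older photos are smaller, and it uses decimal units.

```python
PHOTO_COUNT = 100_000   # lifetime photo count mentioned earlier
RAW_KB = 25_000         # about 25 Mbytes per raw file
WEB_KB_MIN = 100        # smallest typical web image
WEB_KB_MAX = 1_000      # largest typical web image

# Upper-bound size of the offline archive (1 TB = 1,000,000,000 KB in decimal units).
archive_tb = PHOTO_COUNT * RAW_KB / 1_000_000_000

# How many times larger a raw file is than its web version.
ratio_low = RAW_KB / WEB_KB_MAX
ratio_high = RAW_KB / WEB_KB_MIN

print(f"Offline archive: about {archive_tb} TB")
print(f"A raw file is {ratio_low:.0f}x to {ratio_high:.0f}x larger than its web version")
```

So the full raw archive comes to roughly 2.5 TB at most, while a web version of any given photo needs between 1/25 and 1/250 of the space of its original.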

For offline storage, where I want the full photo quality to be preserved, I’m saving the Canon raw files. It would be more robust, perhaps, to convert the files to Adobe’s openly documented DNG (Digital Negative) format, but this is a project I suspect I won’t get to.

For the web, I’m betting that JPEG is going to be around for a very, very long time. The number of JPEG images in the universe today is already massive, and it is increasing at an ever-faster rate. Cameras and other capture devices may move on to new formats, but display software, such as browsers, isn’t likely to drop legacy JPEG support for as far in the future as I can glimpse.

One way to reduce the dependency on JPEG would be to use an uncompressed format, but this would dramatically increase the amount of storage required and decrease the performance of the website. I’m going to roll the dice with JPEG.

Initial Conclusions

There does not seem to be any realistic way to create a website that has a high likelihood of persisting unaided for more than a few years. But with a little help from friends, or perhaps from an escrow account, a 50-year goal is achievable with diligent effort.

You can prepay everything for 3 years and it should run unaided, as long as there are no security breaches or truly essential software upgrades. After a few years, however, someone is going to have to start paying the bills for hosting, and once every 10–25 years someone will need to renew the domain name.

If you are intrigued by the idea of creating a personal archive, I encourage you to begin by ignoring most of this article so you can focus on your content. What do you want to publish, and how do you want to present it? Then you can move on to all the issues discussed here.