AdulauWikiDiary: Archiving Paper

Paper proposal for the 4th International Web Archiving Workshop (IWAW04) (http://www.iwaw.net/04/callforpapers.html)

Topic: Case studies → Web Archiving Project
Title: The Free Archive Project: Protecting the Free Digital Heritage
Authors: Dulaunoy, A., Poirrier, J.E.

Abstract
Introduction
1. The challenges of long term archiving
2. Where the Free (Software) Community can help the Archiving Community
Material and methods
Current status of the project
Bibliography
References

Abstract

This paper introduces the Free Archive project (http://www.freearchive.org). The project aims to create a long-term archive system for the free digital heritage, storing all works that are free and can be stored digitally. Its activities cover collection and preservation, as well as the creation of a free software infrastructure to handle the complete archiving system. This paper outlines the challenges of archiving free works, presents the legal position and collection policy, and describes the technical approach that has been evaluated as well as the current status of this long-term project.

Introduction

Since the advent of computers and the Internet, our cultures and societies have become more and more digital (from the Internet to art, via Open Access scientific publications). We should therefore care about preserving our digital culture, just as we preserve our physical culture.

Besides simple loss (by destruction or inability to open proprietary formats), the risks of not protecting the digital heritage include the possible rewriting of history, the loss of material for prior-art searches in computer science (think about software patents) and the loss of free software and free digital works.

Unfortunately, many technical problems (e.g. the lifetime of a URL on the Internet, the complexity of the digital world) as well as political ones (e.g. legal issues, digital rights management systems) prevent us from keeping freedom in our digital society.

The challenges of long term archiving

Building libraries or archives is not easy: it is time-consuming, legal issues can arise and technology is a serious challenge in the long term. The field of digital libraries is new, but some research exists and presents the various issues of the process (ref).

Before designing a long-term archiving system, one must address many challenges in:

Acquisition: what will be archived, who will submit the works and how (manually? automatically?), legal issues (copyright and/or authors' rights), incentives for submission;
Archiving and long-term storage/preservation: physical lifetime of storage;
Migration and data conversion: periodic transfer to reliable storage, data conversion (file and data format issues), emulation of software (but is it viable?);
Accessibility/visibility and distribution: interfaces (from the Web to other archives, problem of speed) and (re)distribution of the archived data.

Where the Free (Software) Community can help the Archiving Community

The Free Software Community already produces digital works (from free software to free documentation), in particular through the GNU Project. But free software is only a small part of free digital culture: just consider all free works (from art to books) and all public domain works (from museums, official archives, numerical data, …). Who is protecting this heritage? The GNU Project initiated a new wave; it is time to protect its heritage.

With this in mind, we conceived an Internet repository for free digital works: the Andria project. This project, also known as the Free Archive project (http://www.freearchive.org), aims to create a long-term, secure archive system for the free digital heritage. Its purpose is to store all works that are free and can be stored digitally (free includes public domain and free works). The Free Archive project also creates free software to handle the archiving system. A software/hardware architecture in charge of protecting the digital heritage must be independent of, or secured against, any external issue (such as software vendor issues, legal issues…). Free software is therefore the logical choice: we could not protect the free digital heritage with proprietary software.

The Andria Project is a community project now handled by the ASBL/NGO Association Electronique Libre (http://www.ael.be). An international legal structure will be created to host the project and secure the archive itself (on the technical side but also on the legal side). The legal structure that will handle the project is still being defined; it must be compatible with all aspects (technical, economic, social and legal) of a digital library/archive.

Material and methods

Purpose

The goal of the Free Archive project is to create a digital library for free digital works. For that purpose, the project had to find:

a legal solution: we have to support a legal infrastructure to securely host the archived free works;
a technical solution: we have to create several pieces of free software to handle this task;
a functional solution: we have to build an infrastructure to host the digital library.

Legal position and collection policy

The project accepts any work that can be digitized and is free (see the submission guidelines on the project website). We define free works using the four freedoms described by the FSF, with variations depending on the type of work (freedom to run the program, for any purpose; freedom to study how the program works, and adapt it to your needs; freedom to redistribute copies; freedom to improve the program, and release your improvements to the public, so that the whole community benefits).

Functional works (like free software and free documentation) must follow the four freedoms in order to be free. Some variations apply if the work is an essay or a philosophical publication. A list of accepted licenses is available on the project website. Among them, one can find the GNU GPL, the GNU FDL, the Open Publication License and some Creative Commons licenses.

Submissions can be made via a web interface, by e-mail or by regular mail. There is no size limit for submitted documents, but they must be in a free, non-proprietary format (such formats include, e.g., plain text, HTML, XML (including the OpenOffice.org format), TeX, etc.). We insist on this aspect because it is essential for the long-term durability of the archive: for example, the archive system will regenerate a new format from the old one if this becomes necessary at some point. Any work may be submitted (even a work of which the submitter is not the author), as long as its license is free.

Short description of available services

The Andria project is more than a "simple" archiving system: it has unique identification (ID), time stamping services, an interface to existing archiving services and a mirroring interface.

A unique identification (ID) is given to every submitted work. This ID allows the system to find the work in a specific tree on the archive server. The time stamping service allows the document to be located precisely in time. The interface to existing archiving services and the mirroring interface serve the purposes of collaboration with other systems and of security.

Note that not all of these services are implemented yet.

Technical approach

The technical approach has two goals: follow the KISS method ("Keep It Simple, Stupid", since it tends to shorten time and reduce cost) and be as transparent as possible. All operations inside the Free Archive must be asynchronous in order to avoid locking and bottlenecks. Asynchrony also permits a building-block approach in which each component can be developed easily without being tied to the other system components. With those aspects in mind, one can summarize the process as follows (this is a general process overview that could evolve in the future):

[Image from http://www.foo.be/andria/images/andria-basic-process.dia ? - Maybe I'll try to make a new simple diagram with Inkscape. This one seems too complex and contains early design error.]

Image:Andria-overview.png

Submissions can be made via a web interface (HTTP), e-mail (SMTP), XML-RPC, regular mail or some custom interface. For the moment, only the web interface and the regular mail systems are working in the evaluation phase (footnote: in the evaluation phase, we are evaluating the various solutions with a small amount of data). At submission time, the submitter is asked a number of questions, mainly for identification and license verification purposes. An automatic submission, bypassing the validation system, is possible when license verification is not required (as for Gna! or Savannah).

The Andria submission interface is a server acting as a "waiting" gateway (via regular polling) for submissions. The submission format is a simple session directory containing all the sub-files and the work(s) submitted. The submission system then proposes the work to a validation system.
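As an illustration of the "waiting" gateway, here is a minimal Python sketch of one polling pass over a directory of session directories. It is not the Andria implementation: the `READY` marker file, the function name `poll_once` and the directory layout are assumptions introduced only for this example.

```python
import tempfile
from pathlib import Path

# Hypothetical convention: the front end drops this empty file into a session
# directory once every file of the submission has been written.
READY_MARKER = "READY"


def poll_once(incoming, seen):
    """Return session directories that are complete and not yet handled."""
    ready = []
    for session in sorted(incoming.iterdir()):
        if not session.is_dir() or session.name in seen:
            continue
        if not (session / READY_MARKER).exists():
            continue  # submission still being written; retry on the next poll
        seen.add(session.name)
        ready.append(session)
    return ready


def demo():
    """Simulate two submissions, only one of which is complete at first."""
    with tempfile.TemporaryDirectory() as tmp:
        incoming = Path(tmp)
        for name in ("session-a", "session-b"):
            (incoming / name).mkdir()
            (incoming / name / "work.txt").write_text("a free work")
        (incoming / "session-a" / READY_MARKER).touch()

        seen = set()
        first = [s.name for s in poll_once(incoming, seen)]
        (incoming / "session-b" / READY_MARKER).touch()
        second = [s.name for s in poll_once(incoming, seen)]
        return first, second
```

A gateway built this way would call `poll_once` at a regular interval and propose each returned session directory to the validation system; because the front end and the gateway only share a directory, neither has to wait for the other.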

A (basic) timestamping service is ready for the initial submission: it gets the information from the "waiting" gateway and pushes the timestamped submission to the "waiting" gateway validation server. The unique identification (ID) has the following format: [ARCHIVENAME]-[TYPE]-[YEARMMDDHHMMSS]-[RANDOM]. IDs are stored in a specific tree on the archive server; the tree is based on the submission date of the works. An extended timestamping service exists for validated submissions. After validation, the work is submitted to the archiving and mirror system(s).
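The ID format and the date-based tree can be sketched in Python as follows. Only the [ARCHIVENAME]-[TYPE]-[YEARMMDDHHMMSS]-[RANDOM] pattern comes from the text above; the use of UTC, the hexadecimal random part, the year/month/day directory layout and the function names are assumptions made for the sake of the example.

```python
import secrets
from datetime import datetime, timezone
from pathlib import PurePosixPath


def make_work_id(archive_name, work_type, when=None, random_part=None):
    """Build an ID of the form ARCHIVENAME-TYPE-YEARMMDDHHMMSS-RANDOM."""
    when = when or datetime.now(timezone.utc)          # assumption: UTC clock
    random_part = random_part or secrets.token_hex(4)  # assumption: 8 hex digits
    return f"{archive_name}-{work_type}-{when:%Y%m%d%H%M%S}-{random_part}"


def tree_path(work_id):
    """Map an ID to a hypothetical year/month/day location in the archive tree."""
    # The timestamp is the second field from the right, so hyphens inside the
    # archive name or the type do not disturb the parsing.
    stamp = work_id.rsplit("-", 2)[1]
    return PurePosixPath(stamp[0:4], stamp[4:6], stamp[6:8], work_id)


example_id = make_work_id(
    "FREEARCHIVE", "TEXT",
    when=datetime(2004, 6, 1, 12, 30, 45, tzinfo=timezone.utc),
    random_part="a1b2c3d4",
)
example_path = tree_path(example_id)
```

With the fixed inputs above, `example_id` is `FREEARCHIVE-TEXT-20040601123045-a1b2c3d4` and `example_path` is `2004/06/01/FREEARCHIVE-TEXT-20040601123045-a1b2c3d4`. The random suffix keeps two works submitted within the same second from colliding.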

The archiving system is built around a small layer used to push data to, or retrieve data from, the system. This layer hides the real storage strategy behind the system: we want to be sure that the storage strategy can be changed easily without changing the simple front-side API. During the evaluation process, we tested and integrated various storage strategies. Two methods seem quite appropriate: storage using a tree-based structure relying on the underlying filesystem with some indexes (footnote: QDBM files for part of the tree), and a distributed storage solution à la Google File System (ref). Both were early prototypes that could not be used in production due to the fragility of the software implementation. For the distributed storage solution, DRBS (Distributed Replicated Blob System) seems a promising Google File System-like approach.
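The idea of a thin push/retrieve layer over interchangeable storage strategies can be sketched as below. This is an illustrative Python sketch, not the project's code: the class names, the method signatures and the in-memory backend (standing in for any alternative strategy) are all assumptions.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Hypothetical interface that every storage strategy implements."""

    @abstractmethod
    def push(self, work_id, data): ...

    @abstractmethod
    def retrieve(self, work_id): ...


class FilesystemBackend(StorageBackend):
    """Tree-based storage on the underlying filesystem, keyed by submission date."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, work_id):
        stamp = work_id.rsplit("-", 2)[1]  # the YEARMMDDHHMMSS field of the ID
        return self.root / stamp[0:4] / stamp[4:6] / stamp[6:8] / work_id

    def push(self, work_id, data):
        path = self._path(work_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def retrieve(self, work_id):
        return self._path(work_id).read_bytes()


class MemoryBackend(StorageBackend):
    """Stand-in for any other strategy (for example a distributed store)."""

    def __init__(self):
        self._blobs = {}

    def push(self, work_id, data):
        self._blobs[work_id] = data

    def retrieve(self, work_id):
        return self._blobs[work_id]


class Archive:
    """The small front-side layer: callers never see the storage strategy."""

    def __init__(self, backend):
        self._backend = backend

    def push(self, work_id, data):
        self._backend.push(work_id, data)

    def retrieve(self, work_id):
        return self._backend.retrieve(work_id)


def demo():
    """Run the same calls against two backends to show they are swappable."""
    work_id = "FREEARCHIVE-TEXT-20040601123045-a1b2c3d4"
    with tempfile.TemporaryDirectory() as tmp:
        results = []
        for backend in (FilesystemBackend(tmp), MemoryBackend()):
            archive = Archive(backend)
            archive.push(work_id, b"a free work")
            results.append(archive.retrieve(work_id))
        return results
```

The calling code in `demo` is identical for both backends, which is the property wanted from the front-side API: replacing the storage solution should not require touching the submission or mirroring components.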

Current status of the project

The project started at the end of 2003. The end of 2004 will see the official announcement and manifesto for protecting your free digital heritage. A website (collaborative wiki) is running at http://www.freearchive.org, as well as a mailing list. The web interface for submission is ready (at http://www.freearchive.org/andria/submit.pl). We are still only in the software design evaluation phase. For example, work is in progress on the replacement of the existing storage solution. The alpha version of the free archive system used will soon be released under the GNU General Public License.

Since we are still in an alpha process, any help is welcome:

by giving time to help build the technical solution,
by giving time to create documentation and information about the project,
by giving resources to handle manual submissions,
by giving financial help to the Andria project,
and, most importantly, by submitting free works to the archive system!

Bibliography

Digital Libraries, Arms W., MIT Press 2000.
The Google File System, Ghemawat S., Gobioff H., Leung S.-T., Google (SOSP'03). [To ref: evaluation of DRBS]
Long-Term Preservation of Digital Material - Building an Archive to Preserve Digital Cultural Heritage from the Internet, Aschenbrenner A., Diplomarbeit 2001 Wien.
Modern Information Retrieval, Baeza-Yates R., Ribeiro-Neto B., ed. ACM - Addison Wesley. [To ref: indexing and access + digital libraries]
Part of Our Culture is Born Digital - On Efforts to Preserve it for Future Generations, Rauber A., Aschenbrenner A., Vienna University of Technology. [To ref: introduction]
Practical Digital Libraries - Books, Bytes and Bucks, Lesk M., ed. Morgan Kaufmann 1997.
Guidelines for the Preservation of Digital Heritage http://portal.unesco.org/ci/ev.php?URL_ID=8967&URL_DO=DO_TOPIC

References

Free Archive project (http://www.freearchive.org)
QDBM (http://qdbm.sf.net)
