Library of Alexandria (LoA for short) is a project that aims to collect and archive documents from the internet.

In our modern age, new text documents are born in the blink of an eye and then (often just as quickly) disappear from the internet. We find it a noble task to save these documents for future generations.

This project aims to support this noble goal in a scalable way. We want to make archiving streamlined and easy to do even at a huge (terabyte/petabyte) scale. This way we hope that more and more people can start their own collections and help the archiving effort.

Prerequisites

Special emphasis was placed on keeping the prerequisites of the project to a minimum. This is because one of the main goals of the project is to make it easy to start archiving even with limited technical knowledge and resources.

List of the required software to start archiving:

List of the required software to be able to search the archived documents:

System Requirements

It’s important to mention the hardware requirements of collecting and handling documents. If you only want to gather a few million documents, that can be done on a home PC. However, if you want to collect and store a massive amount, then purpose-built hardware is necessary. Also, if you want to collect 24/7, a machine that doesn’t consume a lot of power is recommended where electricity prices are high.

Collecting and indexing have quite different requirements as well, so we split the hardware requirements section into two separate chapters.

The requirements mentioned here could be built from used/commodity hardware if necessary to keep the prices low.

Collecting

Collecting will most likely be limited by the bandwidth of your connection. The CPU requirements can also be quite high because the documents are verified and validated to avoid corrupt documents littering the archive.

Collecting small amounts (1 - 25 million)
  • CPU: AMD Ryzen 5 3600 or similar

  • Memory: 32 GB DDR4

  • Disc space: ~1 - 25 TB

  • Network: 100 Mbps

Collecting medium amounts (25 - 100 million)
  • CPU: AMD Ryzen 7 3800X or similar

  • Memory: 48 GB DDR4

  • Disc space: ~25 - 100 TB

  • Network: 500 Mbps

Collecting massive amounts (100 - 500+ million)
  • CPU: AMD Ryzen 9 3950X or similar

  • Memory: 64 GB DDR4

  • Disc space: ~100 - 500 TB

  • Network: 500 - 1000 Mbps

Indexing

Indexing should either be run on a different machine than collecting or the memory and disc requirements should be added together.

You can do the indexing on any machine that has sufficient memory, but without a proper CPU, your indexing speed could be slow.

Indexing small amounts (1 - 25 million)
Indexing medium amounts (25 - 100 million)
Indexing massive amounts (100 - 500+ million)

Applications

The Library of Alexandria project consists of several (usually scalable) applications. Not all of them are mandatory for the archiving effort; some of them were created for administrative or maintenance purposes.

Architectural overview

The collection of documents starts with the Generator Application. The Generator Application is responsible for creating document locations (known URLs where documents could be available). These locations are passed to the Queue Application, where they are stored until a Downloader Application picks them up and visits the location to get new documents. The downloaded document is then sent to a new queue in the Queue Application. After a while a Vault Application will pick up the document and archive it. If indexing is being done, then every archived document will be indexed by the Indexer Application and made searchable by the Web Application.

architecture
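
The flow described above can be pictured with a small in-memory sketch. This is only an illustration of the hand-offs between the applications, not the project's code: the two deques stand in for the queues of the Queue Application, and `run_pipeline` and `fetch` are hypothetical names introduced here.

```python
from collections import deque


def run_pipeline(urls, fetch):
    """Push URLs through generator -> downloader -> vault and return the vault."""
    location_queue = deque()  # document locations waiting for a downloader
    document_queue = deque()  # downloaded documents waiting for a vault
    vault = {}                # archived documents, keyed by their location

    # Generator Application: publish every known document location.
    location_queue.extend(urls)

    # Downloader Application: visit each location and queue what was found.
    while location_queue:
        url = location_queue.popleft()
        content = fetch(url)
        if content is not None:  # None stands for "nothing to download here"
            document_queue.append((url, content))

    # Vault Application: pick up the downloaded documents and archive them.
    while document_queue:
        url, content = document_queue.popleft()
        vault[url] = content

    return vault
```

In the real system each stage is a separate process and the queues live in the Queue Application, so the stages can be scaled independently.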

Administrator Application

This application’s goal is to provide basic database administration tasks (querying statistics, initiating the re-crawling of failed tasks, etc.).

Table 1. Parameters
Parameter Description

loa.conductor.host

The location (ip address or domain) of the Conductor Application. (Default value: localhost)

loa.conductor.port

The port where the Conductor Application is listening for new connections. (Default value: 8092)

loa.conductor.application-type

Sets the type of the application. If this property is set to WEB_APPLICATION then this application will be registered as a Web Application. This property is pre-configured for each service. Don’t change it!

loa.conductor.application-port

Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it!

loa.indexer.database.host

The host of the Elasticsearch database.

loa.indexer.database.port

The port of the Elasticsearch database.

loa.stage.location

Required only for the recollect-corrupt-documents command. The location where the document files are first downloaded. After some validation and occasional modification (compression etc.) they are moved to the vault. (Default value: temp folder)

loa.queue.producer-pool-size

The number of producer connections to the Queue Application. (Default value: 10)

loa.queue.consumer-pool-size

The number of consumer connections to the Queue Application. (Default value: 10)

Table 2. Tasks
Name Description

reindex

This task will reset every document’s status to DOWNLOADED.

silent-compressor

This command will go through every document in the database and ask the Vault Application to recompress them when the compression is not the same as provided.

cleanup

Removes every document with status CORRUPT. The documents will be removed from the vault and the database too.

recollect-corrupt-documents

Tries to recollect every document with status CORRUPT. The document’s sourceLocations variable stores the ids of the document locations where the document was collected from. This command will try to re-download the document from these source locations. If any of the downloads is successful, then the document’s content will be replaced with non-corrupt data (effectively reconstructing the document). If all of the downloads fail, then the CORRUPT status will be kept.
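
The recollect-corrupt-documents logic can be sketched roughly as follows. Only the sourceLocations field and the CORRUPT status come from the description above; the `content` and `status` keys, the `download` and `is_valid` callbacks, and the DOWNLOADED status set after a successful repair are assumptions made for the sake of the example.

```python
def recollect(document, source_locations, download, is_valid):
    """Try to rebuild a CORRUPT document from the locations it came from.

    document         -- dict with "sourceLocations", "content" and "status"
    source_locations -- maps a location id to its URL
    download         -- callable returning the bytes at a URL, or None
    is_valid         -- callable deciding whether the bytes are not corrupt
    """
    for location_id in document["sourceLocations"]:
        content = download(source_locations[location_id])
        if content is not None and is_valid(content):
            # The first successful download replaces the corrupt content.
            document["content"] = content
            document["status"] = "DOWNLOADED"  # assumed status after repair
            return True
    # Every download failed, so the CORRUPT status is kept.
    return False
```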

Conductor Application

Workflow

Every application that is part of the Library of Alexandria suite uses this application for service discovery purposes.

Table 3. Parameters
Parameter Description

loa.database.host

The host location of the MongoDB database server. (Default value: localhost)

loa.database.port

The port open for the MongoDB database server. (Default value: 27017)

loa.database.uri

If present and not empty, it overrides the host and port parameters. Lets the user inject a MongoDB Connection String directly. Should be used to define the credentials and other custom connection parameters. (Default value: "")

loa.database.no-cursor-timeout

Whether cursor timeouts are disabled for the cursor objects created by the application. Ideally you would set up the timeout on your MongoDB server (see: cursorTimeoutMillis), but because not everybody is a MongoDB expert, we disable timeouts by default. If the application crashes and its cursors are not closed correctly, this can leave a few cursors open on the MongoDB server (using extra resources). If you enable timeouts and set the cursor timeout too low, the application will crash when it is not able to process a batch of items within the provided timeout. (Default value: true)

loa.indexer.database.host

The host of the Elasticsearch database.

loa.indexer.database.port

The port of the Elasticsearch database.
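
The way loa.database.uri takes precedence over loa.database.host and loa.database.port can be illustrated with a short sketch. The function name is hypothetical and this is not the application's code; it only uses the standard `mongodb://host:port` connection string form.

```python
def mongo_connection_string(host="localhost", port=27017, uri=""):
    """Return the connection string that the three parameters resolve to."""
    if uri:
        # A non-empty URI wins, so credentials and options can be supplied.
        return uri
    return f"mongodb://{host}:{port}"
```

With the defaults this resolves to a local server on port 27017; passing a full connection string is the place to add credentials.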

Vault Application

Responsible for storing documents and making them available via web endpoints.

Workflow

The Vault Application connects to the Queue Application and asks for new documents that should be archived (these were previously inserted into the queue by the Downloader Application).

Once a new document is acquired, it is moved into a staging area. A checksum is generated from the staged copy and checked for duplicates using the data available in the MongoDB database.

If the document is not a duplicate, then the staging file’s contents will be moved to the archive, and a new entry will be created for the document in the database.
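
A minimal sketch of the staging, checksum and duplicate check steps is shown below. It makes several assumptions that are not taken from the project: a Python set stands in for the MongoDB lookup, archived files are named after their checksum, and duplicates are simply deleted from the staging area.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_staged_file(staged_path, vault_dir, known_checksums):
    """Move a staged file into the vault unless its checksum is already known.

    Returns the path inside the vault, or None when the file was a duplicate.
    """
    staged_path = Path(staged_path)
    checksum = sha256_of(staged_path)
    if checksum in known_checksums:
        staged_path.unlink()  # assumption: duplicates are dropped from staging
        return None
    target = Path(vault_dir) / checksum  # assumption: files are named by checksum
    shutil.move(str(staged_path), str(target))
    known_checksums.add(checksum)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        stage, vault = Path(root, "stage"), Path(root, "vault")
        stage.mkdir()
        vault.mkdir()
        known = set()
        for name in ("first.pdf", "copy.pdf"):
            (stage / name).write_bytes(b"same content")
            print(name, "->", archive_staged_file(stage / name, vault, known))
```

Reading the file in chunks keeps memory usage flat even for very large documents.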

Table 4. Parameters
Parameter Description

loa.conductor.host

The location (ip address or domain) of the Conductor Application. (Default value: localhost)

loa.conductor.port

The port where the Conductor Application is listening for new connections. (Default value: 8092)

loa.conductor.application-type

Sets the type of the application. If this property is set to WEB_APPLICATION then this application will be registered as a Web Application. This property is pre-configured for each service. Don’t change it!

loa.conductor.application-port

Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it!

loa.vault.location.type

Describes the type of the vault’s location. At the moment only file is supported, but in the future we plan to support more location types. When it’s set to file, the vault will be located on the local filesystem. (Default value: file)

loa.vault.location.file.path

Used only when loa.vault.location.type is set to file. Path to the location of the vault on the filesystem.

loa.stage.location

The location where the document files are first moved. After the checksum calculation and duplicate detection, they are moved to the vault.

server.port

The port where the vault should publish its web endpoints.

loa.vault.name

The unique name of the vault. This name will be saved when archiving new documents and will be used to identify the vault that holds a given document. Do not change this after creating the vault, otherwise the documents previously archived by this vault will no longer be accessible.

loa.vault.modification-enabled

Whether this vault should be able to modify documents (e.g. remove them) after they are archived. If your vault is available publicly on the internet then set this to false! (Default value: true)

loa.vault.archiving

Whether this vault should archive new documents. (Default value: true)

loa.vault.version-number

This version number will be saved to the database for every document that has been archived. It will be used later on if it’s necessary to run cleanup or fixing tasks that are specific to a given version. This way it will be easier to fix bugs or database inconsistencies introduced by a specific application version. Please do not change this, otherwise the official migration/fixer utilities are not going to be usable.

loa.compression.algorithm

This property describes which compression algorithm should be used when saving documents to the vault. The available values are lzma, gzip, and none. LZMA has the best compression ratio but is quite CPU-intensive. GZIP compresses a little worse than LZMA but has a minimal CPU footprint. With none, the documents are saved without compression. (Default value: none)

loa.checksum.type

The type of the hashing algorithm used to create the document’s checksum. At the moment only SHA-256 is available for this purpose. (Default value: sha-256)

loa.queue.consumer-pool-size

The number of consumer connections to the Queue Application. (Default value: 10)
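
To get a feel for the trade-off behind loa.compression.algorithm, the three values can be tried on your own files with a small sketch like the one below. It uses Python's standard gzip and lzma modules, whose container formats are not necessarily the ones the Vault Application writes, so treat the sizes as a rough indication only. The function names are hypothetical.

```python
import gzip
import lzma


def compress(data: bytes, algorithm: str = "none") -> bytes:
    """Compress data with one of the values accepted by the parameter."""
    if algorithm == "lzma":
        return lzma.compress(data)
    if algorithm == "gzip":
        return gzip.compress(data)
    if algorithm == "none":
        return data
    raise ValueError(f"unsupported compression algorithm: {algorithm}")


def decompress(data: bytes, algorithm: str = "none") -> bytes:
    """Reverse compress() for the same algorithm value."""
    if algorithm == "lzma":
        return lzma.decompress(data)
    if algorithm == "gzip":
        return gzip.decompress(data)
    if algorithm == "none":
        return data
    raise ValueError(f"unsupported compression algorithm: {algorithm}")


if __name__ == "__main__":
    sample = b"Library of Alexandria " * 1000
    for name in ("none", "gzip", "lzma"):
        print(name, len(compress(sample, name)), "bytes")
```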

Queue Application

This application’s goal is to provide an abstraction layer between the applications.

Workflow

The Queue Application is a simple glue between the Vault Application, Downloader Application, and Generator Application. It exists to keep these applications loosely coupled and scalable.

Table 5. Parameters
Parameter Description

loa.conductor.host

The location (ip address or domain) of the Conductor Application. (Default value: localhost)

loa.conductor.port

The port where the Conductor Application is listening for new connections. (Default value: 8092)

loa.conductor.application-type

Sets the type of the application. If this property is set to WEB_APPLICATION then this application will be registered as a Web Application. This property is pre-configured for each service. Don’t change it!

loa.conductor.application-port

Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it!

loa.queue.port

The port where the application should listen. (Default value: 61616)

loa.queue.data-directory

The location where the queue should save its contents. It should be a folder on the filesystem.

Generator Application

This application’s goal is to fill the Queue Application with downloadable document locations (links).

Workflow

The Generator Application connects to the Queue Application and sends URLs to be checked and downloaded by the Downloader Application.

These URLs can come from various sources; the most common is a local file.

If you do not have a lot of URLs ready at hand, take a look at the url-collector project or the document-location-database. These projects are also curated by the Bottomless Archive Project and were created specifically to give LoA users easy access to a large number of possible document locations.

Table 6. Parameters
Parameter Description

loa.conductor.host

The location (ip address or domain) of the Conductor Application. (Default value: localhost)

loa.conductor.port

The port where the Conductor Application is listening for new connections. (Default value: 8092)

loa.conductor.application-type

Sets the type of the application. If this property is set to WEB_APPLICATION then this application will be registered as a Web Application. This property is pre-configured for each service. Don’t change it!

loa.conductor.application-port

Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it!

loa.source.name

The name of the source location. This name will be saved to the database for every crawled document. It could be helpful for statistics collection or in identifying bugs and errors with specific sources. (Default value: unknown)

loa.source.type

Describes the type of the source. The only one supported at the moment is file. If the value is file then the location data will be loaded from a local file. (Default value: file)

loa.source.file.location

Used only when loa.source.type is set to file. The location of the source file on the disk. It’s not a problem if it contains non-pdf files.

loa.source.file.encoding

Used only when loa.source.type is set to file. It can be set to none or gzip. If it’s set to none then the file will be read as a non-compressed file. If it’s set to gzip then it will be read as a gzipped file, being unzipped on the fly. (Default value: none)

loa.source.file.skip-lines

The number of lines to skip before starting to process the document locations in the file. Can be used to quickly get back to the last processed line if the application is restarted for any reason. (Default value: 0)

loa.queue.producer-pool-size

The number of producer connections to the Queue Application. (Default value: 10)
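
How loa.source.file.encoding and loa.source.file.skip-lines interact can be sketched as follows. This is an illustration rather than the Generator Application's implementation; the function name, the one-location-per-line layout and the UTF-8 text encoding are assumptions.

```python
import gzip
import os
import tempfile
from itertools import islice


def read_locations(path, file_encoding="none", skip_lines=0):
    """Yield document locations from a source file, one per line."""
    if file_encoding == "gzip":
        opener = gzip.open  # decompressed on the fly while reading
    elif file_encoding == "none":
        opener = open
    else:
        raise ValueError(f"unsupported encoding: {file_encoding}")
    with opener(path, "rt", encoding="utf-8") as handle:
        # Skip the lines that were already processed before a restart.
        for line in islice(handle, skip_lines, None):
            location = line.strip()
            if location:
                yield location


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "locations.txt.gz")
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            handle.write("http://example.com/a.pdf\nhttp://example.com/b.pdf\n")
        print(list(read_locations(path, file_encoding="gzip", skip_lines=1)))
```

Because the file is read lazily, even a very large source file does not have to fit in memory.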

Downloader Application

Workflow

This application is responsible for reading the document locations from the Queue Application and downloading their contents.

Table 7. Parameters
Parameter Description

loa.conductor.host

The location (ip address or domain) of the Conductor Application. (Default value: localhost)

loa.conductor.port

The port where the Conductor Application is listening for new connections. (Default value: 8092)

loa.conductor.application-type

Sets the type of the application. If this property is set to WEB_APPLICATION then this application will be registered as a Web Application. This property is pre-configured for each service. Don’t change it!

loa.conductor.application-port

Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it!

loa.stage.location

The location where the document files are first downloaded. After some validation and occasional modification (compression etc.) they are moved to the vault. (Default value: temp folder)

loa.downloader.version-number

The version number that should be saved as the downloader version to the database when a new document is inserted. Can be used for debugging, cleanups and so on. (Default value: 2)

loa.downloader.maximum-archive-size

If a document’s size in bytes is bigger than the provided value, then the archiving step will be skipped. Overly large documents can use up far too much space compared to how useful they are. This parameter can be used to defend against this problem. (Default value: 8589934592 bytes, i.e. 8 GB)

loa.downloader.source

Two types of document sources are supported by the downloader application. One of these is the QUEUE source, where the application gets the possible locations of documents (URLs) from the queue, downloads them, then sends them for archiving. The other one is FOLDER, where the application loads document files from a provided folder on the filesystem and sends them for archiving. (Default value: QUEUE)

loa.downloader.source.folder.location

The location on the filesystem where the downloader should load the files from in case of the source set to FOLDER.

loa.downloader.source.folder.should-remove

When the source is set to FOLDER, this controls whether the application should remove the files from the folder after processing them. Useful when there are a lot of entries in the directory and they can’t be processed in one go. (Default value: false)

loa.source.name

The name of the source location. This name will be saved to the database for every crawled document. It could be helpful for statistics collection or in identifying bugs and errors with specific sources. Only used when the loa.downloader.source is set to FOLDER. When QUEUE is specified, then the value that is saved to the database will be sent by the Generator Application. (Default value: unknown)

loa.queue.producer-pool-size

The number of producer connections to the Queue Application. (Default value: 10)

loa.queue.consumer-pool-size

The number of consumer connections to the Queue Application. (Default value: 10)
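
The size check behind loa.downloader.maximum-archive-size amounts to a single comparison. The sketch below is only an illustration with hypothetical names; it also shows where the default of 8589934592 bytes comes from.

```python
# 8 * 1024^3 bytes, the default value listed in the table above.
DEFAULT_MAXIMUM_ARCHIVE_SIZE = 8 * 1024 ** 3


def should_archive(size_in_bytes, maximum_archive_size=DEFAULT_MAXIMUM_ARCHIVE_SIZE):
    """Archiving is skipped only when the document is bigger than the limit."""
    return size_in_bytes <= maximum_archive_size
```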

Indexer Application

This application makes documents searchable by inserting them into an Elasticsearch cluster.

Workflow

The application looks for documents that have the DOWNLOADED status and indexes them by sending them to the Elasticsearch cluster.
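
The selection step can be sketched as below. A plain dict stands in for the Elasticsearch cluster, and the `id` and `text` keys as well as the INDEXED status written after indexing are assumptions made for the example; only the DOWNLOADED status comes from the description above.

```python
def index_downloaded(documents, index):
    """Index every document with the DOWNLOADED status and count them."""
    indexed = 0
    for document in documents:
        if document["status"] != "DOWNLOADED":
            continue  # other statuses are left untouched
        index[document["id"]] = document["text"]
        document["status"] = "INDEXED"  # assumed status after indexing
        indexed += 1
    return indexed
```

Since already indexed documents no longer match the filter, running the function again does not index them a second time.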

Table 8. Parameters
Parameter Description

loa.conductor.host

The location (ip address or domain) of the Conductor Application. (Default value: localhost)

loa.conductor.port

The port where the Conductor Application is listening for new connections. (Default value: 8092)

loa.conductor.application-type

Sets the type of the application. If this property is set to WEB_APPLICATION then this application will be registered as a Web Application. This property is pre-configured for each service. Don’t change it!

loa.conductor.application-port

Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it!

Web Application

The Web Application gives users access to the indexed documents.

Workflow

The application runs queries on the Elasticsearch cluster and displays the results. If the user requests a document, it will reach out to the Vault Application to download the requested document. The dashboard screen also uses the Queue Application to show how many messages are waiting for processing by the various applications.

Table 9. Parameters
Parameter Description

loa.conductor.host

The location (ip address or domain) of the Conductor Application. (Default value: localhost)

loa.conductor.port

The port where the Conductor Application is listening for new connections. (Default value: 8092)

loa.conductor.application-type

Sets the type of the application. If this property is set to WEB_APPLICATION then this application will be registered as a Web Application. This property is pre-configured for each service. Don’t change it!

loa.conductor.application-port

Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it!

loa.indexer.database.enabled

If the Elasticsearch connection should be enabled. Sometimes you want to run the application without connecting to Elasticsearch (just to view the dashboard etc). (Default value: true)

API endpoints

The Web Application is a bit special compared to the others, because it exposes web endpoints as well. These endpoints are mostly used by the UI, but can be used as an API as well if necessary.

General endpoints

…​

Debug endpoints
Table 10. APIs
API Description

/document/{documentId}/debug

Returns all the data that is available about the document in the database. This endpoint is created mainly because it is hard to search documents by id in MongoDB when the id is represented in binary.
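
Calling the debug endpoint from a script could look like the sketch below. The base URL (host and port) is whatever your Web Application listens on, and treating the response as JSON is an assumption; only the /document/{documentId}/debug path comes from the table above.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def debug_url(base_url, document_id):
    """Build the debug endpoint URL for a document id."""
    return f"{base_url.rstrip('/')}/document/{quote(str(document_id), safe='')}/debug"


def fetch_debug_info(base_url, document_id):
    """Request the debug data; assumes the endpoint answers with JSON."""
    with urlopen(debug_url(base_url, document_id)) as response:
        return json.load(response)
```

For example, `debug_url("http://localhost:8080", "abc-123")` builds the address for a hypothetical local instance without sending a request.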

Installation

Installing the applications and the prerequisite software is quite straightforward. At this time we provide a guide for Windows-based systems. Installing LoA on Linux systems is supported as well but requires more technical knowledge. An ideal deployment runs the apps in separate VMs or Docker containers, but for the sake of simplicity we are not doing that in this guide. In the future, we will create a more advanced guide.

Installing Java

First, you need to download the Java 17 Runtime Environment. It’s available here. After the download is complete, run the installer and follow the directions it provides until the installation is complete.

Once it’s done, if you open a command line (type cmd into the Start menu’s search bar) you will be able to use the java command. Try running java -version. You should get something similar to the following (the version and build numbers will match the release you installed):

java version "15" 2020-09-15
Java(TM) SE Runtime Environment (build 15+36-1562)
Java HotSpot(TM) 64-Bit Server VM (build 15+36-1562, mixed mode, sharing)

Installing MongoDB

Download MongoDB 5.0 from here. After the download is complete run the installer and follow the directions it provides. If it’s possible, install the MongoDB Compass tool as well because you will need it later for administrative tasks.

Running the Queue Application

You can download the Queue Application files at our release page. Please take care to choose a non "pre-release" version!

After the download is complete run the application via the following command:

java -jar loa-queue-application-{release-number}.jar ...

In the place of the …​ you should write the various parameters. For the available parameters check the parameter list under the Queue Application.

Running the Vault Application

You can download the Vault Application files at our release page. Please take care to choose a non "pre-release" version!

After the download is complete run the application via the following command:

java -jar loa-vault-application-{release-number}.jar ...

In the place of the …​ you should write the various parameters. For the available parameters check the parameter list under the Vault Application.

Running the Generator Application

You can download the Generator Application files at our release page. Please take care to choose a non "pre-release" version!

After the download is complete run the application via the following command:

java -jar loa-generator-application-{release-number}.jar ...

In the place of the …​ you should write the various parameters. For the available parameters check the parameter list under the Generator Application.

Running the Downloader Application

You can download the Downloader Application files at our release page. Please take care to choose a non "pre-release" version!

After the download is complete run the application via the following command:

java -jar loa-downloader-application-{release-number}.jar ...

In the place of the …​ you should write the various parameters. For the available parameters check the parameter list under the Downloader Application.

Installing Elasticsearch

Elasticsearch is only necessary for indexing! If you only want to collect the PDF documents then it’s unnecessary for you!

You can download Elasticsearch 7.13.0 here.

After the download completes, unzip it. You also need to make some slight adjustments to the default configuration.

Go to the elasticsearch/config folder. Open the jvm.options file in a text editor and edit the following parameters to adjust the memory. We suggest giving Elasticsearch at least 4 GB of memory, even if you don’t have a lot of files.

-Xms1g
-Xmx1g

For example:

-Xms4g
-Xmx4g

If you want to change the data directory, open the config/elasticsearch.yml file, uncomment the path.data setting, and set it to the desired data path.

For example:

path.data: C:\loa\indexer\data

After this is done you are ready to run Elasticsearch by going into the elasticsearch folder and writing .\bin\elasticsearch to the console.

Running the Indexer Application

You can download the Indexer Application files at our release page. Please take care to choose a non "pre-release" version!

After the download is complete run the application via the following command:

java -jar loa-indexer-application-{release-number}.jar ...

In the place of the …​ you should write the various parameters. For the available parameters check the parameter list under the Indexer Application.

Running the Web Application

You can download the Web Application files at our release page. Please take care to choose a non "pre-release" version!

After the download is complete run the application via the following command:

java -jar loa-web-application-{release-number}.jar ...

In the place of the …​ you should write the various parameters. For the available parameters check the parameter list under the Web Application.

Domain Language

Table 11. Language elements
Name Description

Vault

The location where the collected documents are archived.

Document

A document collected from the internet.

Staging area

A temporary location where the collected documents are placed for post-processing before going to the archive.

Source

A source of document locations. At the moment, it can only be a file.

Failure rate

How many documents fail to download compared to the successfully downloaded ones.

Archiving

Moving a document from the staging area to the vault.
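
As a small worked example of the failure rate entry above, the definition can be read as the ratio of failed downloads to successful ones. This reading, and the function name, are assumptions; the project may calculate the figure differently.

```python
def failure_rate(failed, successful):
    """Failed downloads compared to the successfully downloaded documents."""
    if successful == 0:
        raise ValueError("no successful downloads to compare against")
    return failed / successful
```

With 25 failed and 100 successful downloads this gives 0.25, meaning one failure for every four successful downloads.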