
Library of Alexandria (LoA in short) is a project that aims to collect and archive documents from the internet.

In our modern age new text documents are born in a blink of an eye then (often just as quickly) disappear from the internet. We find it a noble task to save these documents for future generations.

This project aims to support this noble goal in a scalable way. We want to make the archival activity streamlined and easy to do even in a huge (Terabyte / Petabyte) scale. This way we hope that more and more people can start their own collection helping the archiving effort.

Prerequisites

Special emphasis was placed on keeping the prerequisites of the project to a minimum, because one of the main goals of the project is to make it easy to start archiving even with limited technical knowledge and resources.

List of the required software to start archiving:

  • Java 13 Runtime Environment

  • MongoDB 4.2

List of the required software to be able to search the archived documents:

  • Elasticsearch 7.5.0

System Requirements

It’s important to mention the hardware requirements of collecting and handling documents. If you only want to gather a few million documents, that can be done on a home PC. However, if you want to collect and store a massive amount, purpose-built hardware is necessary. This hardware is not necessarily expensive. Also, if you want to collect 24/7, a machine with low power consumption is recommended where electricity prices are high.

Collecting and indexing have quite different requirements, so we split the hardware requirements section into two separate chapters.

The requirements mentioned here could be built from used/commodity hardware if necessary.

Collecting

Collecting is less memory-intensive than indexing but doesn’t provide any search capability. The CPU requirements are quite high, however, because the documents are verified and validated to avoid corrupt documents littering the archive.

Collecting small amounts (1 - 25 million)
  • CPU: AMD Ryzen 5 3600 or similar

  • Memory: 16 GB DDR4

  • Disk space: ~1 - 25 TB

  • Network: 100 Mbps

Collecting medium amounts (25 - 100 million)
  • CPU: AMD Ryzen 7 3800X or similar

  • Memory: 32 GB DDR4

  • Disk space: ~25 - 100 TB

  • Network: 500 Mbps

Collecting massive amounts (100 - 500+ million)
  • CPU: AMD Ryzen 9 3950X or similar

  • Memory: 32 GB DDR4

  • Disk space: ~100 - 500 TB

  • Network: 500 - 1000 Mbps

Indexing

Indexing is much more memory-intensive than collecting but requires a lot less disk space. Indexing should either run on a different machine than collecting, or the memory and disk requirements should be added together.

You can do the indexing on any machine that has sufficient memory, but without a proper CPU your indexing speed could be slow.

Indexing small amounts (1 - 25 million)
  • CPU: AMD Ryzen 5 3600 or similar

  • Memory: 64 GB

  • Disk space: 256-512 GB (SSD preferred)

Indexing medium amounts (25 - 100 million)
Indexing massive amounts (100 - 500+ million)

Applications

The Library of Alexandria project consists of multiple (usually scalable) applications. Not all of them are mandatory for the archiving effort; some exist for administration or maintenance purposes.

Administrator Application

This application’s goal is to provide basic database administration tasks (initializing the databases, querying statistics, initiating the re-crawling of failed tasks, etc.).

Table 1. Parameters
Parameter Description

loa.database.host

The host location of the MongoDB database server. (Default value: localhost)

loa.database.port

The port open for the MongoDB database server. (Default value: 27017)

loa.indexer.database.host

The host of the Elasticsearch database.

loa.indexer.database.port

The port of the Elasticsearch database.

Table 2. Tasks
Name Description

initialize-indexer

This task initializes the indexes in the Elasticsearch database. Elasticsearch must be running while this task executes.

Vault Application

Responsible for storing documents and making them available via web endpoints.

Workflow

The Vault Application connects to the Queue Application and asks for new documents that should be archived (these were previously inserted into the queue by the Downloader Application).

Once a new document is acquired, it is moved into a staging area. A checksum is generated from the staging copy and checked for duplicates using the data available in the MongoDB database.

If the document is not a duplicate, the staging file’s contents are moved to the archive and a new entry is created for the document in the database.
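The checksum-based duplicate check described above can be sketched as a Unix-style shell session. This is only an illustration: the file contents and the known-checksum list below are made up, and the real application looks checksums up in MongoDB rather than in a shell variable.

```shell
# Sketch of the staging-area duplicate check (illustrative only).

# 1. A document arrives in the staging area.
STAGING_FILE=$(mktemp)
printf 'example document body' > "$STAGING_FILE"

# 2. Compute a SHA-256 checksum of the staging copy.
CHECKSUM=$(sha256sum "$STAGING_FILE" | cut -d' ' -f1)

# 3. The real application queries MongoDB for this checksum;
#    here we compare against a hard-coded list of known checksums.
KNOWN_CHECKSUMS="0000000000000000000000000000000000000000000000000000000000000000"
if printf '%s\n' "$KNOWN_CHECKSUMS" | grep -q "^$CHECKSUM$"; then
  echo "duplicate - discarding staging copy"
else
  echo "new document - moving to vault"
fi

rm -f "$STAGING_FILE"
```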

Table 3. Parameters
Parameter Description

loa.vault.location.type

Describes the type of the vault’s location. At the moment it can only be file, but we plan to support more location types in the future. When it’s set to file, the vault is located on the local filesystem. (Default value: file)

loa.vault.location.file.path

Used only when loa.vault.location.type is set to file. Path to the location of the vault on the filesystem.

loa.database.host

The host location of the MongoDB database server. (Default value: localhost)

loa.database.port

The port open for the MongoDB database server. (Default value: 27017)

loa.queue.host

The IP address of the Queue Application.

loa.queue.port

The port where the Queue Application is listening for new connections. (Default value: 61616)

loa.stage.location

The location where the document files are first moved. After the checksum calculation and duplicate detection, they are moved to the vault.

server.port

The port where the vault should publish its web endpoints.

loa.vault.version-number

This version number is saved to the database for every archived document. If it later becomes necessary to run cleanup or fixing tasks that are specific to a given application version, the affected documents can be selected by this number. This way it is easier to fix bugs or database inconsistencies introduced by a specific application version. Please do not change this value; otherwise the official migration/fixer utilities will not be usable.

loa.compression.algorithm

This property describes which compression algorithm should be used while saving documents to the vault. The available values are lzma, gzip, and none. [LZMA](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) has the best compression ratio but is quite CPU-intensive. [GZIP](https://en.wikipedia.org/wiki/Gzip) compresses a little worse than LZMA but has a minimal CPU footprint, while none saves the documents without compression. (Default value: none)

loa.checksum.type

The type of the hashing algorithm used to create the document’s checksum. At the moment only SHA-256 is available for this purpose. (Default value: sha-256)
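Putting the parameters above together, a Vault invocation might look like the following. The command follows the document’s own {release-number} placeholder convention; the paths and the choice of gzip compression are illustrative, not required values.

```shell
java -jar loa-vault-application-{release-number}.jar --loa.vault.location.file.path=C:\loa\vault --loa.stage.location=C:\loa\stage --loa.queue.host=localhost --loa.compression.algorithm=gzip
```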

Queue Application

This application’s goal is to provide an abstraction layer between the applications.

Workflow

The Queue Application is a simple glue between the Vault Application, Downloader Application, and Generator Application. It exists to keep these applications loosely coupled and scalable.

Table 4. Parameters
Parameter Description

loa.queue.port

The port where the application should listen. (Default value: 61616)

loa.queue.data-directory

The location where the queue should save its contents. It should be a folder on the filesystem.
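For example, a Queue Application started with an explicit port and data directory might look like this (the data directory path is illustrative):

```shell
java -jar loa-queue-application-{release-number}.jar --loa.queue.port=61616 --loa.queue.data-directory=C:\loa\queue
```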

Generator Application

This application’s goal is to fill the Queue Application with downloadable document locations (links).

Workflow

The Generator Application connects to the Queue Application and sends URLs there to be checked and downloaded.

These URLs could come from various sources, like a local file or the Common Crawl corpus.

Table 5. Parameters
Parameter Description

loa.queue.host

The host of the queue where the application should connect and the document locations should be sent. (Default value: localhost)

loa.queue.port

The port of the queue where the application should connect and the document locations should be sent. (Default value: 61616)

loa.source.name

The name of the source location. This name will be saved to the database for every crawled document. It could be helpful for statistics collection or in identifying bugs and errors with specific sources. (Default value: unknown)

loa.source.type

Describes the type of the source. Could be either file or commoncrawl. If the value is file, the location data is loaded from a local file. If it is commoncrawl, the parser starts loading and parsing URLs from the Common Crawl corpus. (Default value: file)

loa.source.commoncrawl.crawl-id

Used only when loa.source.type is set to commoncrawl. The id of the Common Crawl crawl sequence. For example, the January 2019 id is CC-MAIN-2019-04. You can acquire the crawl id for each month’s corpus here.

loa.source.commoncrawl.warc-id

Used only when loa.source.type is set to commoncrawl. Every month’s Common Crawl crawl sequence is built from multiple WARC files (usually around 64,000). This is the id of the WARC file that the parsing should start from. When you first start crawling a month’s crawl sequence, this should be 1. When the downloader opens a new WARC file, it prints its id to the console. If you need to stop the crawler and want to restart it from where it stopped, write down the last crawled WARC id and set this parameter to it.

loa.source.file.location

Used only when loa.source.type is set to file. The location of the source file on disk. It’s not a problem if it contains non-PDF files.

loa.source.file.encoding

Used only when loa.source.type is set to file. It can be set to none or gzip. If it’s set to none, the file is read as an uncompressed file. If it’s set to gzip, it is read as a gzipped file and unzipped on the fly. (Default value: none)
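As an example, a Generator Application feeding URLs from the January 2019 Common Crawl corpus could be started like this. The crawl id comes from the parameter description above; the source name and starting WARC id are illustrative choices.

```shell
java -jar loa-generator-application-{release-number}.jar --loa.queue.host=localhost --loa.source.name=commoncrawl-2019-04 --loa.source.type=commoncrawl --loa.source.commoncrawl.crawl-id=CC-MAIN-2019-04 --loa.source.commoncrawl.warc-id=1
```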

Downloader Application

Workflow

This application is responsible for reading document locations from the Queue Application and downloading their contents.

Table 6. Parameters
Parameter Description

loa.queue.port

The port of the queue where the application should connect to receive document locations. (Default value: 61616)

loa.queue.host

The host of the queue where the application should connect to receive document locations. (Default value: localhost)

loa.downloader.executor.thread-count

How many downloads should run simultaneously. Usually, this number should be set according to the speed of your network or storage device (HDD/SSD). If you want to tune the collecting speed, increase this value until one of them is fully saturated. Be careful, however: if you have a subpar router or networking infrastructure, many simultaneous requests can cause timeouts and overheating routers, making the tuning of this parameter counter-intuitive. (Default value: 250)

loa.downloader.executor.queue-length

How many locations should be pre-fetched. This queue is fed by the source subsystem with URL locations to crawl and is consumed by the downloader subsystem. If you set this parameter too high, it could cause out-of-memory errors, while if it is set too low, most of the download threads could sit idle. The suggested value is between 1000 and 50000, depending on the available memory. (Default value: 1000)

loa.stage.location

The location where the document files are first downloaded. After some validation and occasional modification (compression, etc.), they are moved to the vault.

loa.database.host

The host location of the MongoDB database server. (Default value: localhost)

loa.database.port

The port open for the MongoDB database server. (Default value: 27017)
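Combining the parameters above, a Downloader Application invocation might look like this (the staging path is illustrative, and the thread count shown is simply the default):

```shell
java -jar loa-downloader-application-{release-number}.jar --loa.queue.host=localhost --loa.stage.location=C:\loa\stage --loa.downloader.executor.thread-count=250
```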

Indexer Application

This application makes documents searchable by inserting them into an Elasticsearch cluster.

Workflow

The application looks for documents that have the DOWNLOADED status and indexes them by sending them to the Elasticsearch cluster.

Table 7. Parameters
Parameter Description

spring.data.mongodb.host

The host location of the MongoDB database server. (Default value: localhost)

spring.data.mongodb.port

The port open for the MongoDB database server. (Default value: 27017)

spring.data.mongodb.username

The username to access the database named loa on the MongoDB database server.

spring.data.mongodb.password

The password to access the database named loa on the MongoDB database server.

loa.vault.client.host

The IP address of the Vault Application to grab the documents from.

loa.vault.client.port

The port of the Vault Application to grab the documents from.

loa.indexer.database.host

The host of the Elasticsearch database.

loa.indexer.database.port

The port of the Elasticsearch database.
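An Indexer Application invocation combining the parameters above might look like this. The Elasticsearch port 9200 is that database’s default; the Vault port shown is an illustrative assumption, so use whatever you configured via the Vault’s server.port parameter.

```shell
java -jar loa-indexer-application-{release-number}.jar --loa.vault.client.host=localhost --loa.vault.client.port=8080 --loa.indexer.database.host=localhost --loa.indexer.database.port=9200
```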

Web Application

The Web Application provides users with access to the indexed documents.

Workflow

The application runs queries on the Elasticsearch cluster and displays the results. If a user wants to download a document, the application reaches out to the Vault Application to download the requested document.

Table 8. Parameters
Parameter Description

spring.data.mongodb.host

The host location of the MongoDB database server. (Default value: localhost)

spring.data.mongodb.port

The port open for the MongoDB database server. (Default value: 27017)

spring.data.mongodb.username

The username to access the database named loa on the MongoDB database server.

spring.data.mongodb.password

The password to access the database named loa on the MongoDB database server.

loa.vault.client.host

The IP address of the Vault Application to grab the documents for displaying.

loa.vault.client.port

The port of the Vault Application to grab the documents from for displaying.

loa.indexer.database.host

The host of the Elasticsearch database.

loa.indexer.database.port

The port of the Elasticsearch database.
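A Web Application invocation might look like the following. As with the Indexer, the Vault port shown is an illustrative assumption that must match the Vault’s server.port setting.

```shell
java -jar loa-web-application-{release-number}.jar --loa.vault.client.host=localhost --loa.vault.client.port=8080 --loa.indexer.database.host=localhost --loa.indexer.database.port=9200
```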

Installation

Installing the applications and the prerequisite software is quite straightforward. At this time we provide a guide for Windows-based systems. Installing LoA on Linux systems is supported as well but requires more technical knowledge. An ideal deployment runs the apps in separate VMs or Docker containers, but for the sake of simplicity we are not doing that in this guide. In the future, we will create a more advanced guide.

Installing Java

First, you need to download the Java 13 Runtime Environment. It’s available here. After the download is complete, run the installer and follow its directions until the installation is complete.

Once it’s done, if you open a command line (type cmd into the Start menu’s search bar) you will be able to use the java command. Try running java -version. You should get something similar to:

java version "13" 2019-09-17
Java(TM) SE Runtime Environment (build 13+33)
Java HotSpot(TM) 64-Bit Server VM (build 13+33, mixed mode, sharing)

Installing MongoDB

Download MongoDB 4.2 from here. After the download is complete, run the installer and follow its directions. If possible, install the MongoDB Compass tool as well, because you will need it later for administrative tasks.

Running the Queue Application

You can download the Queue Application files at our release page. Please take care to choose a non "pre-release" version!

After the download is complete run the application via the following command:

java -jar loa-queue-application-{release-number}.jar ...

In place of the ... you should write the various parameters. For the available parameters, check the parameter list under the Queue Application.

Running the Vault Application

You can download the Vault Application files at our release page. Please take care to choose a non "pre-release" version!

After the download is complete run the application via the following command:

java -jar loa-vault-application-{release-number}.jar ...

In place of the ... you should write the various parameters. For the available parameters, check the parameter list under the Vault Application.

Running the Generator Application

You can download the Generator Application files at our release page. Please take care to choose a non "pre-release" version!

After the download is complete run the application via the following command:

java -jar loa-generator-application-{release-number}.jar ...

In place of the ... you should write the various parameters. For the available parameters, check the parameter list under the Generator Application.

Running the Downloader Application

You can download the Downloader Application files at our release page. Please take care to choose a non "pre-release" version!

After the download is complete run the application via the following command:

java -jar loa-downloader-application-{release-number}.jar ...

In place of the ... you should write the various parameters. For the available parameters, check the parameter list under the Downloader Application.

Installing Elasticsearch

Elasticsearch is only necessary for indexing! If you only want to collect PDF documents, you don’t need it!

You can download Elasticsearch 7.5.0 here.

After the download completes, unzip it. Open a command-line client in the folder where you unzipped it and run the following command:

.\bin\elasticsearch-plugin install ingest-attachment

This will install the Ingest attachment plugin.

You also need to do some slight adjustments to the default configuration.

Go to the elasticsearch/config folder. Open the jvm.options file in a text editor and edit the following parameters to adjust the memory. We suggest allocating at least 12 GB of memory to Elasticsearch, even if you don’t have a lot of files.

-Xms1g
-Xmx1g

For example:

-Xms12g
-Xmx12g

If you want to change the data directory then open the config/elasticsearch.yml file, uncomment the path.data and write the expected data path to it.

For example:

path.data: C:\loa\indexer\data

After this is done you are ready to run Elasticsearch by going into the elasticsearch folder and writing .\bin\elasticsearch to the console.
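To verify that Elasticsearch started correctly, you can query its root endpoint (9200 is Elasticsearch’s default HTTP port; adjust it if you changed the configuration). It should respond with a small JSON document describing the cluster.

```shell
curl http://localhost:9200
```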

Initializing the Elasticsearch index

Without proper initialization, Elasticsearch is not able to index documents.

To initialize the document mapping you need to run an initialization command available in the Administrator Application.

java -jar loa-database-application-{release-number}.jar --initialize-indexer ...

You need to provide the Elasticsearch connection parameters as well.

For example:

java -jar loa-database-application-{release-number}.jar --initialize-indexer --loa.indexer.database.host=127.0.0.1 --loa.indexer.database.port=9200

After the script has finished running, Elasticsearch is ready for indexing.

Running the Indexer Application

You can download the Indexer Application files at our release page. Please take care to choose a non "pre-release" version!

After the download is complete run the application via the following command:

java -jar loa-indexer-application-{release-number}.jar ...

In place of the ... you should write the various parameters. For the available parameters, check the parameter list under the Indexer Application.

Running the Web Application

You can download the Web Application files at our release page. Please take care to choose a non "pre-release" version!

After the download is complete run the application via the following command:

java -jar loa-web-application-{release-number}.jar ...

In place of the ... you should write the various parameters. For the available parameters, check the parameter list under the Web Application.

Domain Language

Table 9. Language elements
Name Description

Vault

The location where the collected documents are archived.

Document

A document collected from the internet.

Staging area

A temporary location where the collected documents are placed for post-processing before going to the archive.

Source

A source of document locations. Could be a file or the Common Crawl corpus.

Failure rate

The ratio of documents that fail to download compared to the successfully downloaded ones.

Archiving

Moving a document from the staging area to the vault.