
Library of Alexandria (LoA for short) is a project that aims to collect and archive documents from the internet.
In our modern age, new text documents are born in the blink of an eye and then (often just as quickly) disappear from the internet. We find it a noble task to save these documents for future generations.
This project aims to support this goal in a scalable way. We want to make archiving streamlined and easy to do even at a huge (terabyte/petabyte) scale. This way, we hope that more people can start their own collections and help the archiving effort.
Prerequisites
Special emphasis was placed on keeping the prerequisites of the project to a minimum. This is because one of the main goals of the project is to make it easy to start archiving even with limited technical knowledge and resources.
List of the required software to start archiving:
- Java 17
- MongoDB 5.0
List of the required software to be able to search the archived documents:
- Elasticsearch 8.3.1
System Requirements
It’s important to mention the hardware requirements of collecting and handling documents. If you only want to gather a few million documents then that could be done on a home PC. However, if you want to collect & store a massive amount, then purpose-built hardware is necessary. Also, if you want to collect 24/7 then having a machine that doesn’t consume a lot of power is recommended where electricity prices are high.
Collecting and indexing have quite different requirements as well, so we split the hardware requirements section into two separate chapters.
The requirements mentioned here could be built from used/commodity hardware if necessary to keep the prices low.
Collecting
Collecting will most likely be limited by the bandwidth of your connection. The CPU requirements can also be quite high, because the documents are verified and validated to keep corrupt documents from littering the archive.
Recommended hardware
Collecting small amounts (1 - 25 million)
- CPU: AMD Ryzen 5 3600 or similar
- Memory: 32 GB DDR4
- Disc space: ~1 - 25 TB
- Network: 100 Mbps
Collecting medium amounts (25 - 100 million)
- CPU: AMD Ryzen 7 3800X or similar
- Memory: 48 GB DDR4
- Disc space: ~25 - 100 TB
- Network: 500 Mbps
Collecting massive amounts (100 - 500+ million)
- CPU: AMD Ryzen 9 3950X or similar
- Memory: 64 GB DDR4
- Disc space: ~100 - 500 TB
- Network: 500 - 1000 Mbps
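The three tiers above all work out to roughly 1 TB of disk per million collected documents. The following is a minimal sketch of that arithmetic, not an official sizing formula; the function names and the fixed ratio are assumptions made for illustration.

```python
# Assumption: about 1 TB per million documents, the ratio implied by the
# three tiers above. Real archives will vary with document size.
TB_PER_MILLION_DOCUMENTS = 1.0


def estimate_collection_disk_tb(document_count):
    """Return an approximate disk requirement in terabytes."""
    if document_count < 0:
        raise ValueError("document_count must not be negative")
    return document_count / 1_000_000 * TB_PER_MILLION_DOCUMENTS


def recommended_tier(document_count):
    """Map a document count to the small/medium/massive tier listed above."""
    millions = document_count / 1_000_000
    if millions <= 25:
        return "small"
    if millions <= 100:
        return "medium"
    return "massive"
```

For example, a planned collection of 40 million documents falls into the medium tier and needs about 40 TB by this estimate.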
Indexing
Recommended hardware
Indexing should either be run on a different machine than collecting or the memory and disc requirements should be added together.
You can do the indexing on any machine that has sufficient memory, but without a proper CPU, your indexing speed could be slow.
Indexing small amounts (1 - 25 million)
- CPU: AMD Ryzen 5 3600 or similar
- Memory: 64 GB
- Disc space: 256-1024 GB SSD
Indexing medium amounts (25 - 100 million)
- CPU: AMD Ryzen 7 3800X or similar
- Memory: 64 GB
- Disc space: 1-4 TB SSD
Indexing massive amounts (100 - 500+ million)
- CPU: AMD Ryzen 9 3950X or similar
- Memory: 16 GB for each 100 million documents, plus 32 GB or more for the Indexer Application. The exact value depends on the parallelism level being used.
- Disc space: 4 TB SSD for each 100 million documents
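The per-100-million figures for massive collections can be turned into a quick estimate. This is a sketch under stated assumptions: the function name is hypothetical, partial blocks of 100 million documents are rounded up, and the Indexer Application is given the 32 GB minimum mentioned above.

```python
import math


def estimate_indexing_hardware(document_count, indexer_memory_gb=32):
    """Estimate memory (GB) and SSD space (TB) for indexing.

    Uses 16 GB of memory and 4 TB of SSD per started block of
    100 million documents, plus memory for the Indexer Application.
    Rounding partial blocks up is an assumption of this sketch.
    """
    if document_count < 0:
        raise ValueError("document_count must not be negative")
    blocks = math.ceil(document_count / 100_000_000)
    return {
        "memory_gb": blocks * 16 + indexer_memory_gb,
        "ssd_tb": blocks * 4,
    }
```

For 300 million documents this gives 80 GB of memory and 12 TB of SSD space. Higher parallelism levels may require raising `indexer_memory_gb`.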
Applications
The Library of Alexandria project consists of several (usually scalable) applications. Not all of them are mandatory for the archiving effort. Some of them were created for administrative or maintenance purposes.
Architectural overview
The collection of documents starts with the Generator Application, which is responsible for creating document locations (known URLs where documents could be available). These locations are passed to the Queue Application, where they are stored until a Downloader Application picks them up and visits each location to get new documents. The downloaded document is then sent to a new queue in the Queue Application. After a while, a Vault Application picks up the document and archives it. If indexing is enabled, every archived document is indexed by the Indexer Application and made searchable by the Web Application. The connection information these applications need to reach each other is made available by the Conductor Application using service discovery.
Conductor Application
Workflow
Every application that is part of the Library of Alexandria suite uses this application for service discovery.
Parameter | Description |
---|---|
loa.database.host | The host location of the MongoDB database server. (Default value: localhost) |
loa.database.port | The port open for the MongoDB database server. (Default value: 27017) |
loa.database.uri | If present and not empty, it overrides the host and port parameters and lets the user inject a MongoDB connection string directly. It should be used to define the credentials and other custom connection parameters. (Default value: "") |
loa.database.no-cursor-timeout | Whether the cursor timeout should be disabled for the cursor objects created by the application. Ideally, you would set up the timeout on your MongoDB server (see: cursorTimeoutMillis), but because not everybody is a MongoDB expert, we disable timeouts by default. This could leave a couple of open cursors (and thus extra resource usage) on the MongoDB server if the application crashes and the cursors are not closed correctly. If you enable timeouts and set the cursor timeout too low, the application will crash when it cannot process a batch of items within the provided timeout. (Default value: true) |
loa.indexer.database.host | The host of the Elasticsearch database. |
loa.indexer.database.port | The port of the Elasticsearch database. |
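
The precedence between loa.database.uri and the host/port pair can be sketched as follows. This is an illustration of the rule described in the table, not the application's actual code; the function name is hypothetical, and the fallback uses the standard mongodb://host:port connection string form.

```python
def resolve_mongodb_connection_string(host="localhost", port=27017, uri=""):
    """Pick the connection string the way the parameter table describes.

    A non-empty uri wins; otherwise the string is built from host and port.
    """
    if uri:
        return uri
    return f"mongodb://{host}:{port}"
```

With the defaults this yields `mongodb://localhost:27017`. Passing a full connection string as `uri` returns it unchanged, which is where credentials and other custom options would go.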
Queue Application
This application’s goal is to provide an abstraction layer between the applications.
Workflow
The Queue Application is the glue between the Vault Application, the Downloader Application, and the Generator Application. It exists to keep these applications decoupled and scalable by acting as a message queue provider between them. It has two queues:
- The loa-document-location queue contains the URLs that should be visited for document downloading. It connects the Generator Application and the Downloader Application.
- The loa-document-archiving queue contains the metadata for the downloaded documents that are in the staging area. It connects the Downloader Application and the Vault Application.
Both of these queues are persistent, so if the Queue Application is stopped, the messages are not lost.
The queues contain only basic text (like URLs) and metadata (like content length, file type, etc.). All of this data is sent in a binary format, so it is fairly compact. Because of this, the whole application doesn’t require a lot of storage space. A big deployment might need around 100 GB, while a small-ish deployment can fit in around 10 GB. The data access pattern consists of a lot of small I/O operations, so an SSD is recommended for storing the application’s data.
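The role of the two queues can be illustrated with a small in-memory model. This is only a conceptual sketch: it uses Python's standard `queue.Queue` instead of a persistent message broker, and all function names and the metadata fields are hypothetical.

```python
from queue import Queue


def generate(location_queue, urls):
    """Generator side: put document locations on loa-document-location."""
    for url in urls:
        location_queue.put(url)


def download_one(location_queue, archiving_queue, fetch):
    """Downloader side: take one location, fetch it, publish its metadata."""
    url = location_queue.get()  # blocks until a location is available
    content = fetch(url)
    archiving_queue.put({"location": url, "content_length": len(content)})


def archive_one(archiving_queue, vault):
    """Vault side: take one metadata entry from loa-document-archiving."""
    metadata = archiving_queue.get()  # blocks until metadata is available
    vault.append(metadata)
    return metadata
```

Because each stage only talks to a queue, any of them can be stopped, restarted, or run in several instances without the others needing to know.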
Parameter | Description |
---|---|
loa.conductor.host | The location (ip address or domain) of the Conductor Application. (Default value: localhost) |
loa.conductor.port | The port where the Conductor Application is listening for new connections. (Default value: 8092) |
loa.conductor.application-type | Sets the type of the application. If this property is set to |
loa.conductor.application-port | Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it! |
loa.queue.port | The port where the application should listen. (Default value: 61616) |
loa.queue.data-directory | The location where the queue should save its contents. It should be a folder on the filesystem. |
Vault Application
Responsible for storing documents and making them available via web endpoints.
Workflow
The Vault Application connects to the Queue Application and asks for the metadata of new documents that should be archived (these were previously inserted into the queue by the Downloader Application).
When a metadata entry is acquired, the application checks whether the document is a duplicate. If it is, the application updates the document’s entry in the database to contain the new source for the document, then asks the Staging Application to remove the document’s content from the staging area. If it is not a duplicate, the application saves the new document entity to the database, then downloads the document’s content from the Staging Application into a vault location that is available either on local disk or in an AWS S3 compatible storage.
The Vault Application is scalable, so more than one instance can run at the same time. This is necessary to support more than one storage machine. Each instance of the application has a unique name. When a document is stored, the name of the vault holding the document’s content is also saved to the document’s metadata in the database. This way it is easy to track which vault instance has which document.
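The duplicate check and the vault-name bookkeeping described above can be sketched like this. It is a simplified model, not the application's implementation: a plain dictionary stands in for the database, duplicates are assumed to be detected by a SHA-256 checksum of the content, and the function and field names are hypothetical.

```python
import hashlib


def archive_document(database, vault_name, content, source_name):
    """Archive content unless an identical document is already stored.

    Returns "duplicate" or "archived". In the real workflow the staged
    content would also be removed from the Staging Application afterwards.
    """
    # Assumption: SHA-256 over the raw bytes identifies a document.
    checksum = hashlib.sha256(content).hexdigest()
    existing = database.get(checksum)
    if existing is not None:
        # Duplicate: only record the additional source.
        existing["sources"].add(source_name)
        return "duplicate"
    database[checksum] = {
        "vault": vault_name,  # remembers which vault instance holds the content
        "sources": {source_name},
        "size": len(content),
    }
    return "archived"
```

Storing the vault name with each entry is what lets several vault instances share one database while still being able to locate a document's content later.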
Parameter | Description |
---|---|
loa.conductor.host | The location (ip address or domain) of the Conductor Application. (Default value: localhost) |
loa.conductor.port | The port where the Conductor Application is listening for new connections. (Default value: 8092) |
loa.conductor.application-type | Sets the type of the application. If this property is set to |
loa.conductor.application-port | Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it! |
loa.vault.location.type | Describes the type of the vault’s location. At the moment it could be |
loa.vault.location.file.path | Used only when |
loa.stage.location | The location where the document files are first moved. After the checksum calculation and duplicate detection, they are moved to the vault. |
server.port | The port where the vault should publish its web endpoints. |
loa.vault.name | The unique name of the vault. This name will be saved when archiving new documents and will be used to identify the vault that holds a given document. Do not change this after creating the vault, because the documents previously archived by this vault will no longer be accessible. |
loa.vault.parallelism | The number of documents that should be saved in parallel. If the value is set to 4, then the application tries to archive 4 documents at the same time. (Default value: 20) |
loa.vault.modification-enabled | Whether this vault should be able to modify documents (e.g. remove them) after they are archived. If your vault is publicly available on the internet, set this to false! (Default value: true) |
loa.vault.archiving | Whether this vault should archive new documents. (Default value: true) |
loa.vault.version-number | This version number will be saved to the database for every document that has been archived. It will be used later on if it is necessary to run cleanup or fixing tasks that are specific to a given version. This way it will be easier to fix bugs or database inconsistencies introduced by a specific application version. Please do not change this, otherwise the official migration/fixer utilities are not going to be usable. (Default value: 6) |
loa.queue.consumer-pool-size | The number of consumer connections to the Queue Application. (Default value: 10) |
Generator Application
This application’s goal is to fill the Queue Application with downloadable document locations (links/urls).
Workflow
The Generator Application connects to the Queue Application and sends URLs to be checked and downloaded by the Downloader Application.
These URLs can come from a file that is either compressed (with GZIP) or plain text. The file is read line by line. Each line should represent a document location.
The URLs are validated by the Generator Application. If a URL doesn’t end with one of the supported file types (e.g. .pdf, .doc, .docx, etc.), then it is not added to the document location queue. Other illegal or invalid URLs are filtered out too.
If you do not have a lot of URLs ready at hand, take a look at the url-collector project or the document-location-database. These projects are also curated by the Bottomless Archive Project and were created specifically to give LoA users easy access to a lot of possible document locations.
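Reading a location file line by line, skipping already processed lines, and filtering by file type could look roughly like the sketch below. The function names, the `gzipped` flag, and the three-item extension list are assumptions for illustration; the application's real list of supported file types is longer and its validation rules may differ.

```python
import gzip
from urllib.parse import urlparse

# Assumption: only the extensions named in the text above.
SUPPORTED_EXTENSIONS = (".pdf", ".doc", ".docx")


def is_valid_location(url):
    """Accept http(s) URLs that end with a supported file type."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    return url.lower().endswith(SUPPORTED_EXTENSIONS)


def read_locations(path, gzipped=False, skip_lines=0):
    """Yield valid document locations from a plain or GZIP-compressed file."""
    opener = gzip.open if gzipped else open
    with opener(path, "rt", encoding="utf-8") as source:
        for line_number, line in enumerate(source):
            if line_number < skip_lines:
                continue  # resume after the last processed line
            url = line.strip()
            if is_valid_location(url):
                yield url
```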
Parameter | Description |
---|---|
loa.conductor.host | The location (ip address or domain) of the Conductor Application. (Default value: localhost) |
loa.conductor.port | The port where the Conductor Application is listening for new connections. (Default value: 8092) |
loa.conductor.application-type | Sets the type of the application. If this property is set to |
loa.conductor.application-port | Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it! |
loa.source.name | The name of the source location. This name will be saved to the database for every crawled document. It could be helpful for statistics collection or in identifying bugs and errors with specific sources. (Default value: unknown) |
loa.source.type | Describes the type of the source. The only one supported at the moment is |
loa.source.file.location | Used only when |
loa.source.file.encoding | Used only when |
loa.source.file.skip-lines | The number of lines to skip before starting to process the document locations in the file. Can be used to quickly get back to the last processed line if the application is restarted for any reason. (Default value: 0) |
loa.queue.producer-pool-size | The number of producer connections to the Queue Application. (Default value: 10) |
Staging Application
Workflow
This application is responsible for holding documents before they are moved to their final place in one of the Vault Application instances.
Parameter | Description |
---|---|
loa.staging.path | The location where the Staging Application should put the files it acquires for staging. |
Downloader Application
Workflow
This application is responsible for reading the document locations from the Queue Application and downloading their contents.
Parameter | Description |
---|---|
loa.conductor.host | The location (ip address or domain) of the Conductor Application. (Default value: localhost) |
loa.conductor.port | The port where the Conductor Application is listening for new connections. (Default value: 8092) |
loa.conductor.application-type | Sets the type of the application. If this property is set to |
loa.conductor.application-port | Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it! |
loa.stage.location | The location where the document files are first downloaded. After some validation and occasionally modification (compression etc.), they are moved to the vault. (Default value: temp folder) |
loa.downloader.version-number | The version number that should be saved as the downloader version to the database when a new document is inserted. Can be used for debugging, cleanups and so on. (Default value: 6) |
loa.downloader.maximum-archive-size | If a document’s size in bytes is bigger than the provided value, then the archiving step will be skipped. Documents that are too big can use up far too much space compared to how useful they are. This parameter can be used to defend against this problem. (Default value: 8589934592 bytes aka 8 GB) |
loa.downloader.source | Two types of document sources are supported by the downloader application. One of these is the |
loa.downloader.parallelism | How many connections should be open at any given time to download documents in parallel. If the value is set to 4, then the application tries to download 4 documents at the same time. If your internet speed allows more parallel downloads, increase this value. (Default value: 3) |
loa.downloader.source.folder.location | The location on the filesystem where the downloader should load the files from in case of the source set to |
loa.downloader.source.folder.should-remove | When the source is set to |
loa.source.name | The name of the source location. This name will be saved to the database for every crawled document. It could be helpful for statistics collection or in identifying bugs and errors with specific sources. Only used when the |
loa.queue.producer-pool-size | The number of producer connections to the Queue Application. (Default value: 10) |
loa.queue.consumer-pool-size | The number of consumer connections to the Queue Application. (Default value: 10) |
loa.compression.algorithm | This property describes what compression algorithm should be used while saving documents to the vault. The available values are |
loa.checksum.type | The type of the hashing algorithm used to create the document’s checksum. At the moment only |
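
How loa.downloader.parallelism and loa.downloader.maximum-archive-size interact can be pictured with a short sketch. It is an illustration only: the function names are hypothetical, a thread pool stands in for the application's own download mechanism, and `fetch` is whatever callable returns a document's bytes.

```python
from concurrent.futures import ThreadPoolExecutor

# Documented default: 8589934592 bytes (8 GB).
MAXIMUM_ARCHIVE_SIZE = 8 * 1024 ** 3


def should_archive(size_in_bytes, maximum_archive_size=MAXIMUM_ARCHIVE_SIZE):
    """Documents bigger than the limit skip the archiving step."""
    return size_in_bytes <= maximum_archive_size


def download_all(urls, fetch, parallelism=3, maximum_archive_size=MAXIMUM_ARCHIVE_SIZE):
    """Fetch urls with at most `parallelism` concurrent downloads.

    Returns (url, content) pairs for the documents small enough to archive.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        contents = list(pool.map(fetch, urls))  # results keep the input order
    return [
        (url, content)
        for url, content in zip(urls, contents)
        if should_archive(len(content), maximum_archive_size)
    ]
```

Raising `parallelism` only helps while the connection still has spare bandwidth; beyond that point the extra workers simply wait on the network.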
Indexer Application
This application makes documents searchable by inserting them into an Elasticsearch cluster.
Workflow
The application looks for documents that have the DOWNLOADED status and indexes them by sending them to the Elasticsearch cluster.
Parameter | Description |
---|---|
loa.downloader.parallelism | How many documents should be indexed in parallel at any given time. If the value is set to 4, then the application tries to index 4 documents at the same time. (Default value: 3) |
loa.downloader.batch-size | The number of documents to index that the application queries with each round-trip to the database. If the cursor timeout error happens frequently, this parameter should be set to a lower value. The application will then call the database more frequently, lowering the chance of a timeout. (Default value: 10) |
loa.conductor.host | The location (ip address or domain) of the Conductor Application. (Default value: localhost) |
loa.conductor.port | The port where the Conductor Application is listening for new connections. (Default value: 8092) |
loa.conductor.application-type | Sets the type of the application. If this property is set to |
loa.conductor.application-port | Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it! |
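
The effect of the batch size can be seen in a small batching helper. This is a generic sketch with a hypothetical function name, not the Indexer Application's code: each yielded list corresponds to one round-trip to the database, so a smaller batch means less work between consecutive calls and less time for a cursor to sit idle.

```python
from itertools import islice


def in_batches(documents, batch_size=10):
    """Yield lists of at most batch_size documents from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With 25 documents and the default batch size of 10, this produces three batches of 10, 10, and 5 documents.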
Web Application
The Web Application gives users access to the indexed documents.
Workflow
The application runs queries on the Elasticsearch cluster and displays the results. If the user requests a document, the application reaches out to the Vault Application to download it. The dashboard screen also uses the Queue Application to show how many messages are waiting to be processed by the various applications.
Parameter | Description |
---|---|
loa.conductor.host | The location (ip address or domain) of the Conductor Application. (Default value: localhost) |
loa.conductor.port | The port where the Conductor Application is listening for new connections. (Default value: 8092) |
loa.conductor.application-type | Sets the type of the application. If this property is set to |
loa.conductor.application-port | Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it! |
loa.indexer.database.enabled | Whether the Elasticsearch connection should be enabled. Sometimes you want to run the application without connecting to Elasticsearch (just to view the dashboard etc.). (Default value: true) |
API endpoints
The Web Application is a bit special compared to the others, because it exposes web endpoints as well. These endpoints are mostly used by the UI, but can be used as an API as well if necessary.
General endpoints
…
Debug endpoints
API | Description |
---|---|
/document/{documentId}/debug | Returns all the data that is available about the document in the database. This endpoint is created mainly because it is hard to search documents by id in MongoDB when the id is represented in binary. |
Administrator Application
This application’s goal is to provide basic database administration tasks (querying statistics, initiating the re-crawling of failed tasks, etc.).
Parameter | Description |
---|---|
loa.conductor.host | The location (ip address or domain) of the Conductor Application. (Default value: localhost) |
loa.conductor.port | The port where the Conductor Application is listening for new connections. (Default value: 8092) |
loa.conductor.application-type | Sets the type of the application. If this property is set to |
loa.conductor.application-port | Sets the port that this application reports as the listening port. This property is pre-configured for each service. Don’t change it! |
loa.indexer.database.host | The host of the Elasticsearch database. |
loa.indexer.database.port | The port of the Elasticsearch database. |
loa.stage.location | Required only for the |
loa.queue.producer-pool-size | The number of producer connections to the Queue Application. (Default value: 10) |
loa.queue.consumer-pool-size | The number of consumer connections to the Queue Application. (Default value: 10) |
Name | Description |
---|---|
reindex | This task will reset every document’s status to |
silent-compressor | This command will go through every document in the database and ask the Vault Application to recompress them when the compression is not the same as provided. |
cleanup | Removes every document with status |
recollect-corrupt-documents | Tries to recollect every document with status |
Installation
Installing the applications and the prerequisite software is quite straightforward. At this time, we provide a guide for Windows-based systems. Installing the project suite on Linux systems is supported as well but requires more technical knowledge. An ideal deployment runs the apps in separate VMs or Docker containers, but for the sake of simplicity we are not doing that in this guide. In the future, we will create a more advanced guide.
Installing Java
First, you need to download the Java 17 Runtime Environment. It’s available here. After the download is complete you should run the installer and follow the directions it provides until the installation is complete.
Once it’s done, open a command line (type cmd into the Start menu’s search bar) and you will be able to use the java command. Try running java -version. You should get something similar (the version reported will match the release you installed):
java version "15" 2020-09-15
Java(TM) SE Runtime Environment (build 15+36-1562)
Java HotSpot(TM) 64-Bit Server VM (build 15+36-1562, mixed mode, sharing)
Installing MongoDB
Download MongoDB 5.0 from here. After the download is complete run the installer and follow the directions it provides. If it’s possible, install the MongoDB Compass tool as well because you will need it later for administrative tasks.
Running the Conductor Application
You can download the Conductor Application files at our release page. Please take care to choose a non "pre-release" version!
After the download is complete run the application via the following command:
java -jar loa-conductor-application-{release-number}.jar ...
Replace the ... with the parameters you want to set. For the available parameters, check the parameter list under the Conductor Application.
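If you start the applications from a script, the command can be assembled programmatically. The sketch below is an assumption-heavy illustration: the `--name=value` argument style is the common Spring Boot convention and is not confirmed by this guide, the function name is hypothetical, and the `{release-number}` placeholder is left for you to fill in.

```python
def build_command(jar_path, parameters):
    """Build an argument list for starting an application with parameters.

    Assumption: parameters are passed as --name=value arguments.
    Check the release notes of your version before relying on this.
    """
    command = ["java", "-jar", jar_path]
    for name, value in parameters.items():
        command.append(f"--{name}={value}")
    return command


conductor_command = build_command(
    "loa-conductor-application-{release-number}.jar",
    {"loa.database.host": "localhost", "loa.database.port": 27017},
)
```

The resulting list can be handed to `subprocess.run` once the placeholder in the file name has been replaced with a real release number.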
Running the Staging Application
You can download the Staging Application files at our release page. Please take care to choose a non "pre-release" version!
After the download is complete run the application via the following command:
java -jar loa-staging-application-{release-number}.jar ...
Replace the ... with the parameters you want to set. For the available parameters, check the parameter list under the Staging Application.
Running the Queue Application
You can download the Queue Application files at our release page. Please take care to choose a non "pre-release" version!
After the download is complete run the application via the following command:
java -jar loa-queue-application-{release-number}.jar ...
Replace the ... with the parameters you want to set. For the available parameters, check the parameter list under the Queue Application.
Running the Vault Application
You can download the Vault Application files at our release page. Please take care to choose a non "pre-release" version!
After the download is complete run the application via the following command:
java -jar loa-vault-application-{release-number}.jar ...
Replace the ... with the parameters you want to set. For the available parameters, check the parameter list under the Vault Application.
Running the Generator Application
You can download the Generator Application files at our release page. Please take care to choose a non "pre-release" version!
After the download is complete run the application via the following command:
java -jar loa-generator-application-{release-number}.jar ...
Replace the ... with the parameters you want to set. For the available parameters, check the parameter list under the Generator Application.
Running the Downloader Application
You can download the Downloader Application files at our release page. Please take care to choose a non "pre-release" version!
After the download is complete run the application via the following command:
java -jar loa-downloader-application-{release-number}.jar ...
Replace the ... with the parameters you want to set. For the available parameters, check the parameter list under the Downloader Application.
Installing Elasticsearch
Elasticsearch is only necessary for indexing! If you only want to collect the PDF documents then it’s unnecessary for you!
You can download Elasticsearch 8.3.1 here.
After the download completes, unzip it. You also need to make some slight adjustments to the default configuration.
Go to the elasticsearch/config folder. Open the jvm.options file in a text editor and edit the following parameters to adjust the memory. We suggest giving Elasticsearch at least 4 GB of memory, even if you don’t have a lot of files.
-Xms1g
-Xmx1g
For example:
-Xms4g
-Xmx4g
If you want to change the data directory then open the config/elasticsearch.yml file, uncomment the path.data and write the expected data path to it.
For example:
path.data: C:\loa\indexer\data
After this is done, you are ready to run Elasticsearch by going into the elasticsearch folder and writing .\bin\elasticsearch to the console.
Running the Indexer Application
You can download the Indexer Application files at our release page. Please take care to choose a non "pre-release" version!
After the download is complete run the application via the following command:
java -jar loa-indexer-application-{release-number}.jar ...
Replace the ... with the parameters you want to set. For the available parameters, check the parameter list under the Indexer Application.
Running the Web Application
You can download the Web Application files at our release page. Please take care to choose a non "pre-release" version!
After the download is complete run the application via the following command:
java -jar loa-web-application-{release-number}.jar ...
Replace the ... with the parameters you want to set. For the available parameters, check the parameter list under the Web Application.
Domain Language
Name | Description |
---|---|
Vault | The location where the collected documents are archived. |
Document | A document collected from the internet. |
Staging area | A temporary location where the collected documents are placed for post-processing before going to the archive. |
Source | A source of document locations. At the moment, it can only be a file. |
Failure rate | How many documents fail to download compared to the successfully downloaded ones. |
Archiving | Saving a document into the vault. |