Full text search in Nextcloud

Nextcloud supports search in the web interface and in the clients. By default, however, only file names are matched; the contents of files are not searched. As an option, a full-text search can be set up.

Installing Elasticsearch in Ubuntu/Debian

Java runtime environment

Nextcloud’s full-text search is based on Elasticsearch, which needs to be installed independently. Elasticsearch is a Java-based search engine, so the first step is to ensure that Java is available. You can check if Java is already installed with the following command:

java -version

As a result, you should get output similar to this:

openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.18.04)
OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)

If you get a message like Command 'java' not found instead, you have to install a Java runtime environment, e.g. on Ubuntu Linux as follows:

apt install openjdk-11-jre
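The check and the installation hint can also be combined in a small script. This is only a sketch: the helper name have_cmd is made up for this example, and the script only prints what to do instead of installing anything.

```shell
#!/bin/sh
# Succeeds if the named command is found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd java; then
    echo "Java found: $(command -v java)"
else
    echo "Java not found; install it with: apt install openjdk-11-jre"
fi
```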

Install Elasticsearch

After that, add the repository for Elasticsearch and install Elasticsearch as follows:

apt install apt-transport-https ca-certificates
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add -
echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | tee -a /etc/apt/sources.list.d/elasticsearch7.list
apt update
apt install elasticsearch -y

Before you start Elasticsearch, you should definitely adjust the configuration to limit the size of the heap – otherwise Elasticsearch can use all the free memory and other services can be affected.

For this I added the following entry in the file /etc/default/elasticsearch to limit the heap size to 4 GB:

ES_JAVA_OPTS="-Xms4g -Xmx4g"

Note: the server used has about 32 GB RAM. You may need to choose a lower value if you have less memory available. The more memory Elasticsearch can use, the more effectively it can work, since less data has to be reloaded during operation.
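The 4 GB used here is one eighth of that server's 32 GB. As a rough starting point only, the following sketch applies the same ratio to the memory of the current machine and prints a matching entry. The ratio and the function name heap_opts_for_kb are assumptions for illustration, not an official Elasticsearch recommendation.

```shell
#!/bin/sh
# Print an ES_JAVA_OPTS line with a heap of one eighth of the given RAM.
# $1 = total RAM in kB. The 1/8 ratio only mirrors "4 GB on a 32 GB server"
# from this article; it is an assumption, adjust it to your needs.
heap_opts_for_kb() {
    kb=$1
    heap_mb=$(( kb / 1024 / 8 ))
    printf 'ES_JAVA_OPTS="-Xms%dm -Xmx%dm"\n' "$heap_mb" "$heap_mb"
}

# On Linux, MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    heap_opts_for_kb "$total_kb"
fi
```

The printed line can be compared with the entry in /etc/default/elasticsearch before deciding on a value.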

Some instructions also point out that the IP address for incoming connections should be set to 127.0.0.1 in the /etc/elasticsearch/elasticsearch.yml file. However, this is not necessary for Elasticsearch 7, since Elasticsearch only accepts local connections as long as no IP address is specified. To be on the safe side, you should at least check the setting and, if necessary, comment out network.host or set it to 127.0.0.1:

# ---------------------------------- Network -----------------------------------
#
# By default Elasticsearch is only accessible on localhost. Set a different
# address here to expose this node on the network:
#
#network.host: 192.168.0.1
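To check the setting without opening the file in an editor, a sketch like the following lists any uncommented network.host entry. The function name active_network_host is made up for this example.

```shell
#!/bin/sh
# Print every uncommented network.host line in the given YAML file.
# Exit status is 0 if at least one such line exists, 1 if there is none.
active_network_host() {
    grep -E '^[[:space:]]*network\.host:' "$1"
}

conf=/etc/elasticsearch/elasticsearch.yml
if [ -r "$conf" ]; then
    active_network_host "$conf" || echo "network.host is not set; the default applies"
fi
```

If the script prints an address other than 127.0.0.1 or localhost, the node may be reachable from the network.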

To be able to search the content of PDF documents as well, you need to install an additional plugin:

/usr/share/elasticsearch/bin/elasticsearch-plugin install ingest-attachment
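Whether the plugin is present can be checked with the list subcommand of the same tool. The following is a minimal sketch; the helper name has_plugin is an assumption for this example, and it expects one plugin name per line on standard input.

```shell
#!/bin/sh
# Reads a plugin list (one name per line) on stdin and succeeds
# if the named plugin appears as an exact, whole line.
has_plugin() {
    grep -qxF "$1"
}

plugin_tool=/usr/share/elasticsearch/bin/elasticsearch-plugin
if [ -x "$plugin_tool" ]; then
    if "$plugin_tool" list | has_plugin ingest-attachment; then
        echo "ingest-attachment is installed"
    else
        echo "ingest-attachment is missing"
    fi
fi
```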

Tesseract for OCR

In addition to text documents, images can also be searched for readable text. The “Tesseract” tool is required for this, which can be installed as follows with support for German and English:

apt install tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng

Important: OCR is a time-consuming process. If you have a lot of pictures in your Nextcloud, the first build of the search index will take a long time!

Setting up Elasticsearch as a service

After all preparations are completed, you can activate Elasticsearch as a service as follows:

systemctl daemon-reload
systemctl enable elasticsearch
systemctl start elasticsearch

Starting the service may take a while; in my case it was about 20 seconds.
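Instead of guessing how long the start takes, a script can poll until Elasticsearch answers. The following is a generic retry helper as a sketch; the name wait_for and the retry values are assumptions, and http://localhost:9200 is the default local address that is also used in the Nextcloud configuration.

```shell
#!/bin/sh
# Retry a command until it succeeds.
# $1 = maximum number of attempts, $2 = seconds between attempts,
# remaining arguments = the command to run.
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$(( i + 1 ))
        sleep "$delay"
    done
    return 1
}

# Example call (not run here): up to 30 tries, 2 seconds apart.
# wait_for 30 2 curl -fsS http://localhost:9200 && echo "Elasticsearch is up"
```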

Now Elasticsearch is available for full-text search in text documents including PDF and Office documents like Word and LibreOffice/OpenOffice.

Setting up full-text search in Nextcloud

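After Elasticsearch and, if applicable, Tesseract are installed, install the following apps in Nextcloud:

  • Full text search
  • Full text search - Elasticsearch Platform
  • Full text search - Files

And the following app if you want to use Tesseract to search for text in images:

  • Full text search - Files - Tesseract OCR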

Configuring the full text search

The configuration can be reached via the administration settings. The following information must be entered there.

In the section “Elasticsearch”

  • Address of the servlet: http://localhost:9200
  • Index: Name of the index, for example the domain name of your Nextcloud
  • Analyzer tokenizer: standard

Changing the tokenizer is usually not needed. To see which tokenizers are available and how they work, see the Elasticsearch documentation.

In the section “Files”

Here you can activate the inclusion of PDF and Office documents in the index and, if necessary, adjust the maximum file size up to which documents are included in the index.

In the section “Files – Tesseract OCR”

If you also installed Tesseract, you can activate OCR here. For the languages, enter all the languages that you have installed for Tesseract, separated by commas – e.g. eng,deu.
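The installed languages can be read from tesseract --list-langs. The following sketch turns that output into a comma-separated list; the function name langs_csv is made up for this example, and dropping the "osd" entry (orientation and script detection data, not a language) is an assumption about what belongs in the setting.

```shell
#!/bin/sh
# Turn the output of `tesseract --list-langs` (read from stdin) into a
# comma-separated list. The header line (it contains a colon) and the
# "osd" entry are dropped.
langs_csv() {
    grep -v -e ':' -e '^osd$' | paste -sd, -
}

if command -v tesseract >/dev/null 2>&1; then
    # Some Tesseract versions print the list on stderr, so merge both streams.
    tesseract --list-langs 2>&1 | langs_csv
fi
```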

Exclude folders from the search

To exclude folders from the search, just add a file named .noindex to these folders. Also see https://help.nextcloud.com/t/how-to-exclude-a-folder-from-indexing/35318/2.

Generating the search index

The initial build of the search index is started on the console with the following command in the main directory of Nextcloud:

php occ fulltextsearch:index

Depending on the amount of data available, this process can take several hours. Therefore, when accessing the server via SSH, it makes sense to use tools like screen or tmux so that you can run the command in the background without having to keep the connection to the server open all the time.
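If neither screen nor tmux is available, nohup is a simple alternative (note that this is a different technique from the two tools named above). The following is a sketch: the function name run_detached, the log path and the web server user www-data in the example call are assumptions.

```shell
#!/bin/sh
# Start a command detached from the terminal and write its output to a log.
# $1 = log file, remaining arguments = the command to run.
run_detached() {
    log=$1
    shift
    nohup "$@" >"$log" 2>&1 &
}

# Example call (not run here), from the Nextcloud main directory:
# run_detached /tmp/fulltextsearch-index.log sudo -u www-data php occ fulltextsearch:index
```

The progress can then be followed with tail -f on the log file, even after logging in again.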

Activating Cron in Nextcloud

When the process is complete, new files and file changes will be added to the index automatically as part of Nextcloud’s background jobs. For this to work, however, the background jobs must actually run via cron; AJAX or Webcron is not sufficient!

See the Nextcloud documentation for how to set up cron.
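As a rough sketch of what such a setup usually looks like, the following prints a crontab line that runs Nextcloud's cron.php every five minutes. The function name nextcloud_cron_line and the install path /var/www/nextcloud are assumptions; use the path of your own installation.

```shell
#!/bin/sh
# Print a crontab line that runs Nextcloud's cron.php every five minutes.
# $1 = Nextcloud main directory.
nextcloud_cron_line() {
    printf '*/5 * * * * php -f %s/cron.php\n' "$1"
}

# /var/www/nextcloud is an assumed install path.
nextcloud_cron_line /var/www/nextcloud
```

The printed line belongs in the crontab of the web server user (on Debian/Ubuntu typically www-data, editable with crontab -u www-data -e). The background job mode can be switched to cron in the administration settings or with php occ background:cron.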

Testing the search

After setting up and building the search index, you can test the search by clicking the search icon in the web interface and entering a term that you know appears in your documents.

One or more entries under “Full-text search” should then appear in the list of results:

[Image: Nextcloud, full text search]

The Android client then supports full-text search as well:

[Image: Nextcloud, full text search in Android]

Update the PDF plugin when updating Elasticsearch

Elasticsearch is also updated as part of regular updates with apt update and apt upgrade. It can happen that the service can no longer be used after the update because the plugin for the PDF import no longer matches the server version.

In this case you need to remove and install the plugin again:

/usr/share/elasticsearch/bin/elasticsearch-plugin remove ingest-attachment
/usr/share/elasticsearch/bin/elasticsearch-plugin install ingest-attachment

After that you can restart Elasticsearch:

systemctl restart elasticsearch
