How To Index PDF Content Within Your Umbraco 8 Search

Published On Sat 29 August, 2020

In this video, you will learn how to index PDF files within your Umbraco V8 search index. If you want to index any PDFs that are stored within the Media library, you can use a free plug-in called UmbracoExamine.PDF. It can be installed from NuGet, for example by running Install-Package UmbracoExamine.PDF in the Package Manager Console.

After you have installed the plug-in, restart the site with an IISRESET. When the website next loads, a new index called 'PDFIndex' will be automatically created for you. You can then query this index through Examine in the same way as any other index.
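As a rough illustration of querying the new index, here is a minimal sketch. It assumes the Examine 1.x API that ships with Umbraco 8 and an IExamineManager supplied by dependency injection. The class name PdfSearchService is hypothetical and only there to keep the example self-contained.

```csharp
using System.Collections.Generic;
using System.Linq;
using Examine;

// Hypothetical helper class; the name and shape are illustrative only.
public class PdfSearchService
{
    private readonly IExamineManager _examineManager;

    // Umbraco 8 registers IExamineManager, so it can be constructor-injected.
    public PdfSearchService(IExamineManager examineManager)
    {
        _examineManager = examineManager;
    }

    public IEnumerable<ISearchResult> Search(string term)
    {
        // 'PDFIndex' is the index name the plug-in creates on startup.
        if (string.IsNullOrWhiteSpace(term)
            || !_examineManager.TryGetIndex("PDFIndex", out var index))
        {
            return Enumerable.Empty<ISearchResult>();
        }

        // 'fileTextContent' is the field that holds the extracted PDF text.
        return index.GetSearcher()
            .CreateQuery()
            .Field("fileTextContent", term)
            .Execute();
    }
}
```

Each result exposes the media item's Id, a Score, and the stored field Values, so the Id can be used to look the PDF up in the Media library.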

Without a custom searcher, the index will not automatically be visible or searchable within the Umbraco backend. Depending on your requirements, you will likely bump into another challenge as well.

This solution is great but it has two limitations:

PDFs will not be automatically visible or searchable within the Umbraco backend search. This can be fixed by building a custom searcher.
You will not be able to search content and PDFs at the same time. All the indexed Umbraco data will be in one index and all the indexed PDF content will be in another.

To solve both of these issues, something called a 'multi-index searcher' can be created. The multi-index searcher combines the external index (the index that contains all the Umbraco data) and the PDF index into a single searcher, which is registered in code when the site starts up.
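The sketch below shows roughly what that registration can look like. It assumes Examine 1.x (where MultiIndexSearcher lives in Examine.LuceneEngine.Providers) and that ExaminePdfComposer is exposed from the UmbracoExamine.PDF namespace. The class names and the searcher name "MultiSearcher" are illustrative choices, so treat it as an illustration rather than a verified copy of any upstream listing.

```csharp
using Examine;
using Examine.LuceneEngine.Providers;
using Umbraco.Core;
using Umbraco.Core.Composing;
using UmbracoExamine.PDF; // assumed namespace for ExaminePdfComposer

// Run after the plug-in's own composer so that 'PDFIndex' already exists.
[ComposeAfter(typeof(ExaminePdfComposer))]
public class MultiSearcherComposer : ComponentComposer<MultiSearcherComponent>, IUserComposer
{
}

public class MultiSearcherComponent : IComponent
{
    private readonly IExamineManager _examineManager;

    public MultiSearcherComponent(IExamineManager examineManager)
    {
        _examineManager = examineManager;
    }

    public void Initialize()
    {
        // Only register the searcher when both indexes are available.
        if (_examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out var externalIndex)
            && _examineManager.TryGetIndex("PDFIndex", out var pdfIndex))
        {
            // "MultiSearcher" is an arbitrary name picked for this sketch.
            var multiSearcher = new MultiIndexSearcher("MultiSearcher", new[] { externalIndex, pdfIndex });
            _examineManager.AddSearcher(multiSearcher);
        }
    }

    public void Terminate()
    {
    }
}
```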

The code for this searcher is taken directly from the UmbracoExamine.PDF GitHub page, and I make no claim to it. I can, however, confirm that it works. One important thing to note is the ComposeAfter attribute, which sets the order in which the composer is triggered. In this situation, it will trigger after the ExaminePdfComposer composer. To use the newly created searcher in code, resolve it from the Examine manager by the name it was registered with.
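A minimal sketch of resolving and using the searcher follows. It assumes the searcher was registered under the name "MultiSearcher" and that page content should be matched on the nodeName field; both are assumptions made for the example, and SiteSearchService is a hypothetical class name.

```csharp
using System.Collections.Generic;
using System.Linq;
using Examine;

// Hypothetical helper class; the name and shape are illustrative only.
public class SiteSearchService
{
    private readonly IExamineManager _examineManager;

    public SiteSearchService(IExamineManager examineManager)
    {
        _examineManager = examineManager;
    }

    public IEnumerable<ISearchResult> Search(string term)
    {
        // "MultiSearcher" must match the name used when the searcher was registered.
        if (string.IsNullOrWhiteSpace(term)
            || !_examineManager.TryGetSearcher("MultiSearcher", out var searcher))
        {
            return Enumerable.Empty<ISearchResult>();
        }

        // Match the term against a content field (assumed: nodeName)
        // or against the extracted PDF text (fileTextContent).
        return searcher.CreateQuery()
            .GroupedOr(new[] { "nodeName", "fileTextContent" }, term)
            .Execute();
    }
}
```

In a real site you would likely add more content fields to the GroupedOr list, depending on which properties your document types expose.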

Note that the PDF indexer uses a field called fileTextContent to store the indexed PDF content. Also, do not be surprised if you see the same PDF appear twice within your search results. If the filename of a PDF is referenced on a page (within the external index) and the content of that same PDF contains the same phrase, two results can be returned. To prevent this from happening, you can try filtering out those records whose __IndexType contains the term media.
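One way to apply that filter is after the query has run. The sketch below is a hypothetical extension method, assuming Examine 1.x where each ISearchResult exposes its stored fields through Values. It drops any result whose __IndexType contains "media".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Examine;

// Hypothetical extension class; the name is illustrative only.
public static class SearchResultFilters
{
    public static IEnumerable<ISearchResult> ExcludeMedia(this IEnumerable<ISearchResult> results)
    {
        // Keep a result when it has no __IndexType value,
        // or when that value does not contain "media" (case-insensitive).
        return results.Where(r =>
            !r.Values.TryGetValue("__IndexType", out var indexType)
            || indexType == null
            || indexType.IndexOf("media", StringComparison.OrdinalIgnoreCase) < 0);
    }
}
```

Because this runs after the search, any total counts or paging taken from the raw results will still include the removed records, so compute those from the filtered sequence instead.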