If you want to index PDFs stored within the Media library, you can use a free plug-in called UmbracoExamine.PDF. It can be installed from NuGet, for example by running Install-Package UmbracoExamine.PDF in the Package Manager Console.
After you have installed the plug-in and restarted the site, a new index called 'PDFIndex' will be created for you automatically. It is possible to query this index directly.
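A minimal sketch of such a direct query, assuming Umbraco 8 with Examine 1.x. The class name PdfSearchService is hypothetical, and the sketch assumes IExamineManager is supplied through dependency injection. The index name and the fileTextContent field are the ones this post mentions.

```csharp
using Examine;

// Hypothetical helper class, not part of UmbracoExamine.PDF.
public class PdfSearchService
{
    private readonly IExamineManager _examineManager;

    public PdfSearchService(IExamineManager examineManager)
    {
        _examineManager = examineManager;
    }

    public ISearchResults SearchPdfs(string term)
    {
        // "PDFIndex" is the index the plug-in creates after installation.
        if (!_examineManager.TryGetIndex("PDFIndex", out var index))
        {
            return null; // index not registered (plug-in missing or site not restarted)
        }

        // Query only the field that holds the extracted PDF text.
        return index.GetSearcher()
            .CreateQuery()
            .Field("fileTextContent", term)
            .Execute();
    }
}
```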
Without creating a custom searcher, the index will not automatically be visible or searchable within the Umbraco backend. Depending on your requirements, you will likely bump into another challenge...
Most people will want a search that queries both page content and PDF content. With all the page content in one index and all the indexed PDF content in another, this type of functionality is not available by default.
To solve both of these issues, something called a 'multi-index searcher' can be created. The multi-index searcher combines the external index and the PDF index, and it is registered at start-up from a composer and component.
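A sketch of that registration, along the lines of the example on the UmbracoExamine.PDF GitHub page and assuming Umbraco 8 with Examine 1.x. The class names MultiSearcherComposer and MultiSearcherComponent and the searcher name "MultiSearcher" are illustrative choices. Check the type names against the package version you have installed.

```csharp
using Examine;
using Examine.LuceneEngine.Providers;
using Umbraco.Core;
using Umbraco.Core.Composing;
using UmbracoExamine.PDF;

// Must run after the plug-in's own composer so that the PDF index already exists.
[ComposeAfter(typeof(ExaminePdfComposer))]
public class MultiSearcherComposer : ComponentComposer<MultiSearcherComponent>, IUserComposer
{
}

public class MultiSearcherComponent : IComponent
{
    private readonly IExamineManager _examineManager;

    public MultiSearcherComponent(IExamineManager examineManager)
    {
        _examineManager = examineManager;
    }

    public void Initialize()
    {
        // Look up both the external index and the PDF index.
        if (_examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out var externalIndex)
            && _examineManager.TryGetIndex(PdfIndexConstants.PdfIndexName, out var pdfIndex))
        {
            // Register one searcher that spans both indexes.
            var multiSearcher = new MultiIndexSearcher(
                "MultiSearcher",
                new IIndex[] { externalIndex, pdfIndex });
            _examineManager.AddSearcher(multiSearcher);
        }
    }

    public void Terminate()
    {
    }
}
```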
The approach is taken directly from the UmbracoExamine.PDF GitHub page, so I take no credit for it. I can, however, confirm that it works. One important thing to note is the ComposeAfter attribute, which tells Umbraco to run the composer after the ExaminePdfComposer composer, so the PDF index already exists when the searcher is registered.
To use the searcher in code, retrieve it by name from the Examine manager.
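A minimal sketch of retrieving and using the searcher, assuming Umbraco 8 with Examine 1.x. It assumes a multi-index searcher has been registered under the illustrative name "MultiSearcher". The class name SiteSearchService is hypothetical.

```csharp
using Examine;

// Hypothetical helper class showing how the registered searcher could be used.
public class SiteSearchService
{
    private readonly IExamineManager _examineManager;

    public SiteSearchService(IExamineManager examineManager)
    {
        _examineManager = examineManager;
    }

    public ISearchResults Search(string term)
    {
        // The name must match the one used when the searcher was registered.
        if (!_examineManager.TryGetSearcher("MultiSearcher", out var searcher))
        {
            return null; // searcher was never registered
        }

        // ManagedQuery runs the term against the indexes behind the searcher.
        return searcher.CreateQuery().ManagedQuery(term).Execute();
    }
}
```

Each result exposes its stored fields through the Values dictionary, which is where a field such as __IndexType can be inspected.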
Note that the PDF indexer uses the field called fileTextContent to store the indexed PDF content. Also, do not be surprised if you see the same PDF appear twice within your search results. If the filename of a PDF in the external index and some content within the same PDF both match the search phrase, two results can be returned. These duplicates can be removed by filtering out those records whose __IndexType contains the term