When looking for a way to add search functionality to my blog, I found a few hosted search providers and some existing software, but none that matched my taste. I had used regain before, but ran into too many problems.
So I had to do it all myself, again.
I already had a list of must-have features in my head:
- Crawl + index URLs like a "real" search engine (as opposed to knowing the WordPress database structure and searching in that)
- Support for multiple domains
- Indexing of all URLs that are linked from my own pages
- Author extraction (meta tags, microformats)
- Image EXIF + PDF text parsing
- Must be fast
- Rank matches in title higher than in headline, headline higher than text
- Excerpt display with search term highlighting
- Sort by relevance and date
- Facets to drill down: File type, language, domain, tag/keyword
- Time filter: Modification date before/after/at a given date
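The facet and time-filter features map quite directly onto Elasticsearch: facets are terms aggregations, and the date filter is a range filter. A minimal sketch of the kind of query body involved - the field names (`filetype`, `language`, `domain`, `tags`, `modified`) are illustrative, not necessarily phinde's actual schema:

```python
# Sketch of an Elasticsearch query body with facets (terms aggregations)
# and a modification-date range filter. Field names are illustrative.
def build_filtered_query(terms, modified_after=None):
    query = {
        "query": {
            "bool": {
                "must": [{"match": {"text": terms}}],
                "filter": [],
            }
        },
        # Facets: one bucket list per file type / language / domain / tag
        "aggs": {
            "filetype": {"terms": {"field": "filetype"}},
            "language": {"terms": {"field": "language"}},
            "domain":   {"terms": {"field": "domain"}},
            "tags":     {"terms": {"field": "tags"}},
        },
    }
    if modified_after is not None:
        query["query"]["bool"]["filter"].append(
            {"range": {"modified": {"gte": modified_after}}}
        )
    return query
```

Because the range clause sits in the bool query's `filter` section, it narrows the result set without affecting relevance scoring.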
As of version 0.2.1, phinde consists of 1800 lines of PHP code and 400 lines of HTML/Twig.
Because I wanted to rank headlines higher than normal text, MySQL's full-text search was not an option. From phorkie's development I knew that Elasticsearch supports field boosting, so I settled on that.
I created a schema with individual fields for the title, each headline level (h1-h6), the body text and tags/keywords. Each field got a different boost that determines its weight in the search result ranking.
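Such a schema could look roughly like the following sketch. The boost values are made up for illustration - phinde's actual numbers may differ - and note that newer Elasticsearch versions deprecate index-time boosts in mappings in favor of query-time boosting (`"title^8"` in the query's field list):

```python
# Sketch of an index mapping with per-field boosts: title ranks highest,
# then tags, then h1..h6 in decreasing order, then body text.
# Boost values are illustrative, not phinde's actual configuration.
def build_mapping():
    properties = {
        "title": {"type": "text", "boost": 10},
        "tags":  {"type": "text", "boost": 8},
        "text":  {"type": "text", "boost": 0.5},
    }
    # h1 gets the highest headline boost, h6 the lowest
    for level in range(1, 7):
        properties["h%d" % level] = {"type": "text", "boost": 7 - level}
    return {"mappings": {"properties": properties}}
```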
My blog+website index contains 3,600 documents and takes 34 MiB (mostly "normal" HTML pages). The IndieWeb chat search instance indexes 900,000 documents with a size of 550 MiB (tiny documents, each a single chat log line).
Elasticsearch works well except when the schema changes, which happens often during development. I found it easier to throw away all data after a schema change, because migrating a schema is too much work. This might be different when you have a couple of million documents in ES - but for me it's easier to let half a dozen worker processes re-crawl everything than to implement schema migration scripts.
Crawling the web is a prime candidate for parallelization.
When a URL is fetched, the script extracts all linked URLs and determines whether they should be followed. Each URL is then put into the job queue, together with flags that say whether it shall be crawled and/or indexed.
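The crawl/index decision can be sketched like this. The allow-list and flag names are illustrative assumptions, not phinde's actual code; the idea matches the feature list above - own domains get crawled and indexed, while external pages linked from my own pages are indexed but not followed:

```python
# Sketch of enqueuing extracted URLs with crawl/index flags.
# ALLOWED_HOSTS and the job dict layout are illustrative assumptions.
from urllib.parse import urljoin, urlparse

ALLOWED_HOSTS = {"cweiske.de", "www.cweiske.de"}  # assumed allow-list

def make_jobs(base_url, hrefs):
    jobs = []
    for href in hrefs:
        url = urljoin(base_url, href)
        host = urlparse(url).hostname
        if host in ALLOWED_HOSTS:
            # own domain: follow the link and index the page
            jobs.append({"url": url, "crawl": True, "index": True})
        else:
            # external link: index it so it appears in results,
            # but do not crawl further from there
            jobs.append({"url": url, "crawl": False, "index": True})
    return jobs
```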
The phinde-worker script is tiny; it only listens for incoming jobs and then starts a process script that does the actual work. This frees me from complicated error and exception handling, lets me update the processor without restarting the worker, and makes development easy because I can run the processor from the command line, just as the worker does.
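The worker pattern boils down to spawning the processor as a child process, so crashes and code updates stay contained there. A minimal sketch (phinde itself is PHP; the command and script path here are illustrative):

```python
# Sketch of the worker pattern: the worker only receives jobs and runs
# a separate process script for the real work. A crash or non-zero exit
# in the processor never takes down the worker itself.
import subprocess

def handle_job(url, command=("php", "bin/process.php")):
    # Run the processor as a child process and report success/failure.
    # The command tuple is an assumption, not phinde's actual invocation.
    result = subprocess.run(
        list(command) + [url], capture_output=True, text=True
    )
    return result.returncode == 0
```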
At first I had two separate job queues: one for crawling and one for indexing. Bugs in the crawler script would not affect the indexer, and vice versa.
This allowed me to crawl many URLs quickly without the indexing overhead, but also meant I had to fetch each URL twice.
Splitting crawling and indexing also meant the code had to handle crawled-but-not-indexed and indexed-but-not-crawled cases. I originally did not handle these, which broke data integrity a couple of times.
Now the process script handles both crawling and indexing. This means only one HTTP request per URL, and less code because I don't have to track different processing states when updating the Elasticsearch documents.
The process script then extracts and stores headings, page title, text, keywords/tags and author information.
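The title and heading extraction can be sketched with nothing but a standard-library HTML parser; phinde itself is PHP and uses different tooling, so this is only a model of the idea:

```python
# Sketch of extracting the page title and h1..h6 headings from fetched
# HTML, using only the Python standard library for illustration.
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    FIELDS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.data = {}       # field name -> list of text fragments
        self.current = None  # tag currently being collected

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELDS:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, text):
        if self.current:
            self.data.setdefault(self.current, []).append(text.strip())

def extract(html):
    """Return a dict of field name -> extracted text."""
    parser = PageExtractor()
    parser.feed(html)
    return {field: " ".join(parts) for field, parts in parser.data.items()}
```

Each extracted field then maps straight onto one of the boosted fields in the Elasticsearch schema.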
The home page has little more than a search box; the search result page shows each result document's title, an excerpt of the content containing the search terms, and the author.
Despite the chat log corpus's size of 900k documents, querying Elasticsearch takes only milliseconds.
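A search request covering the ranking and excerpt features could look roughly like this - field names and boost numbers are illustrative, and the highlight section uses Elasticsearch's built-in highlighter, which returns matching fragments with the search terms wrapped in `<em>` tags by default:

```python
# Sketch of a search request body with query-time field boosting,
# highlighted excerpts, and optional sort-by-date. Names and boost
# values are illustrative, not phinde's actual configuration.
def build_search(terms, sort_by="relevance"):
    body = {
        "query": {
            "multi_match": {
                "query": terms,
                # title outranks headlines, headlines outrank body text
                "fields": ["title^8", "h1^6", "h2^5", "h3^4", "text^1"],
            }
        },
        # ask Elasticsearch for excerpt fragments around the matches
        "highlight": {"fields": {"text": {}}},
    }
    if sort_by == "date":
        body["sort"] = [{"modified": {"order": "desc"}}]
    return body
```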
I took special care with the pager and will publish a separate blog post with the full details of the design considerations.
Whenever a blog post is published, the search engine needs to index it. At first I triggered indexing manually; later a cronjob checked my blog's Atom feed every hour. Neither was ideal.
Luckily we have WebSub (formerly known as PubSubHubbub), which defines a protocol for notifications on the web. My blog already sends out notifications to interested parties via my hub at phubb.cweiske.de, which means that feed readers with WebSub support already get instantly notified about new posts.
I decided to build WebSub subscriptions into phinde, and today blog posts get indexed immediately when my blog sends out update notifications.
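The subscription side of WebSub is a single form-encoded POST to the hub, which then verifies the callback URL and afterwards pushes content notifications to it. A sketch of building such a request - the URLs are examples, not phinde's actual endpoints:

```python
# Sketch of a WebSub subscription request: the subscriber POSTs
# hub.mode/hub.topic/hub.callback to the hub as form data.
# The URLs passed in are illustrative examples.
from urllib.parse import urlencode

def build_subscription(hub, topic, callback):
    return {
        "url": hub,
        "headers": {"Content-Type": "application/x-www-form-urlencoded"},
        "body": urlencode({
            "hub.mode": "subscribe",
            "hub.topic": topic,       # the feed to get notified about
            "hub.callback": callback, # where the hub delivers updates
        }),
    }
```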
Since 2016-02, every page on my blog has a small search box. It takes you to search.cweiske.de with the site-specific filter set to cweiske.de/tagebuch/. It provides a button to remove the site-specific filter, which then queries all indexed pages.
I'm very pleased with the result, given how little effort the implementation took - thanks to the awesome libraries and to Elasticsearch, which does all the hard storage + search work.