Updating LexBlog.com’s aggregation engine was no small feat. Scott Fennell and I spent months testing all of the various components of the new aggregation engine that powers the vast majority of the site, but something that was hard to prepare for was the sheer scale of the site. Now that it’s up and running, we’re learning a lot about how to manage a site like this, and what sorts of features are necessary for it to be a successful publication from the perspective of an editor or reader.

One thing that I’ve recently keyed in on is search. Normally, I would tell a client that on-site search is not important. Most visitors are coming to a site from a much better search engine (Google), and are more apt to click around the site once there. LexBlog has layered in some nice features to the standard WordPress search, but most of those are around making sure that readers can search by an author’s name when they’re on a blog or website. This seems like a thing WordPress should do by default, but the generic WordPress search is “dumb” in the sense that it only looks to the post content and post title when running a search. Authors are not in either, so some work had to be done to support searching an author’s name and getting their posts.

In any case, the on-site search is “good enough” for most readers, and most sites aren’t the size of LexBlog.com. However, LexBlog.com is big. Very big. There are nearly 400,000 posts and 20,000 users on the site. WordPress returns its results using what is essentially a LIKE %query% SQL statement, which does a disservice to anyone who waits around for the page to load (a search on LexBlog.com right now can take 10 to 15 seconds to return a page).
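To make the LIKE %query% point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for WordPress's MySQL database. The table layout, index name, and sample rows are all assumptions made up for illustration; this is not WordPress's actual schema or query. It shows two things: a substring search over title and content never sees the author, and a pattern with a leading wildcard forces the database to scan the table rather than use an index.

```python
import sqlite3

# Hypothetical, simplified stand-in for a posts table (not WordPress's real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, content TEXT, author TEXT)"
)
conn.execute("CREATE INDEX idx_title ON posts (title)")
conn.executemany(
    "INSERT INTO posts (title, content, author) VALUES (?, ?, ?)",
    [
        ("Kavanaugh hearing recap", "Notes on the confirmation hearing.", "Jane Doe"),
        ("Privacy law update", "What the Kavanaugh vote means for privacy.", "John Roe"),
        ("Trademark basics", "An introduction to trademark law.", "Jane Doe"),
    ],
)


def like_search(term):
    """Substring search over title and content only; the author column is ignored."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT id FROM posts WHERE title LIKE ? OR content LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return [row[0] for row in rows]


def query_plan(term):
    """Return SQLite's description of how it would execute the same search."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT id FROM posts WHERE title LIKE ? OR content LIKE ?",
        (pattern, pattern),
    )
    # The last column of each row is the human-readable plan detail.
    return " ".join(row[3] for row in rows)


print(like_search("Kavanaugh"))  # posts whose title or content contain the term
print(like_search("Jane Doe"))   # empty: author names are not searched
print(query_plan("Kavanaugh"))   # reports a scan of the posts table
```

With three rows the scan is instant; with hundreds of thousands of posts, checking every row's full content on every search is where the seconds go.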

While a very small percentage of visitors to LexBlog.com use the on-site search feature (only about 2% of all page views are to a search results page), the relationships that we’ve layered into each post and user make search a potentially very attractive feature on the site. We could support advanced searches by organization, site, author, and date as opposed to a generic text search against all of the content. Moreover, the speed issues alone make me long for a better solution on the site.
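As a rough idea of what an advanced search could look like, here is a sketch that builds an Elasticsearch query body as a plain Python dictionary. The bool, multi_match, term, and range clauses are standard parts of Elasticsearch's query DSL, but the function name and every field name (post_title, post_content, organization, site, author, post_date) are assumptions for illustration, not LexBlog's or ElasticPress's actual mapping.

```python
from datetime import date


def build_search_body(text, organization=None, site=None, author=None,
                      published_after=None):
    """Build a query body: full-text match on the text, exact filters on the rest.

    All field names here are hypothetical placeholders.
    """
    filters = []
    if organization:
        filters.append({"term": {"organization": organization}})
    if site:
        filters.append({"term": {"site": site}})
    if author:
        filters.append({"term": {"author": author}})
    if published_after:
        filters.append({"range": {"post_date": {"gte": published_after.isoformat()}}})

    return {
        "query": {
            "bool": {
                # Scored full-text match against the searchable text fields.
                "must": [
                    {"multi_match": {"query": text,
                                     "fields": ["post_title", "post_content"]}}
                ],
                # Filters narrow the result set without affecting relevance scores.
                "filter": filters,
            }
        }
    }


body = build_search_body("Kavanaugh", author="Jane Doe",
                         published_after=date(2018, 9, 1))
print(body)
```

Keeping the structured criteria in the filter clause means they narrow the results without changing how the text match is ranked.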

Enter Elasticsearch. Elasticsearch is a product from Elastic.co; if you haven’t heard their story before, it’s worth the read.

I have looked at a variety of alternative search technologies for WordPress before this, including:

But I continue to come back to Elasticsearch for a number of reasons:

  • It’s cost effective
  • It’s easy to scale and manage
  • It has a wonderful, developer-friendly WordPress plugin, ElasticPress, built by a team of people (10up) that I trust to continue to maintain and iterate on said plugin

This weekend, I took Elasticsearch/ElasticPress for a spin on LexBlog.com’s staging environment, and the results were surprising. Most keyword searches were 3-5 times faster (again, that speed difference alone is enough for me to favor Elasticsearch over WordPress’s search). Here’s a view of how long it takes for someone to search for “Kavanaugh” on LexBlog.com’s production environment:

Over 13 seconds!

and here’s what it looks like in staging:

Just over 4 seconds – much better 🙂

More than speed, though, Elasticsearch’s queries are optimized for searches in a way that WordPress is not. As I mentioned, WordPress searches post content and titles, but Elasticsearch/ElasticPress expands that to include taxonomies (tags, categories, and custom taxonomies) and bylines.

WordPress also has a very weak algorithm for the keyword search itself. Without going into too much detail, it performs a relatively exact search of the query so that misspellings or typos that may occur when you’re on a mobile device (or if you’re like me, whenever you’ve been staring at a screen for more than 10 hours) are treated like you meant to search for that exact phrase. Elasticsearch performs “fuzzy matching”, which looks for variations on the keyword that you’ve searched. For example, if you’re interested in the Stop Online Piracy Act – SOPA – you might search “SOPA’s enforcement.” However, maybe you’re feeling lazy that day and don’t want to type in the apostrophe, so you search “SOPAs enforcement.” Elasticsearch is smart enough to return results for the Stop Online Piracy Act/SOPA whereas WordPress returns only results where the text was literally “SOPAs enforcement”; so only instances where the author made the same “typo” that you did!
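For a feel of how fuzzy matching treats that example, here is a small sketch of the underlying idea: count the single-character edits needed to turn one term into another, and accept terms within a small budget. Elasticsearch's fuzziness is based on edit distance, and the thresholds in allowed_edits follow its documented AUTO setting (exact match for 1 to 2 characters, one edit for 3 to 5, two edits beyond that). The function names are made up for illustration, and this is not how Elasticsearch implements the search internally.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def allowed_edits(term):
    """Edit budget by term length, following the documented AUTO thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(query_term, indexed_term):
    """True if the indexed term is within the query term's edit budget."""
    distance = edit_distance(query_term.lower(), indexed_term.lower())
    return distance <= allowed_edits(query_term)


print(fuzzy_match("SOPAs", "SOPA's"))  # one inserted apostrophe apart
print(fuzzy_match("SOPAs", "soap"))    # two edits apart, over the budget
```

An exact substring search, by contrast, only matches when the distance is zero.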

While it seems like Elasticsearch may be winning the day at LexBlog, it’s still something for us to explore in more depth. As with all updates to a site, many people have a voice (including the readers) and we’re still waiting to see how they (and we) value search.