Another week of reading, lectures, and confused Googling and skimming through videos is in the bag. This week was all about NP-complete. It has been fun to experience an “aha!” moment in each course I’ve taken so far, and this was an especially fun one. In large part, because seeking answers to these questions is seemingly beyond the reach of computers (in a reasonable time).
We’re accustomed to computers being incredibly fast. So accustomed that we forget just how fast they are. They’re really fast. I’m writing this on a 4-year old computer. It has a 2.5 GHz Intel Core i7 processor. That number equates to how many cycles the system clock of this computer runs in a second. So 2,500,000,000 cycles in one second. The version of the CPU running on this machine is quite powerful. It should execute around 9 instructions per cycle for a single core in the processor and there are 4 total cores running.
This all adds up to a lot of numbers and those numbers represent commands that we expect the computer to execute for us so that we can….. I dunno….. watch videos of cute cats.
This is to say, you should think, when you hear a computer can’t find an answer to a problem, that it’s a big deal. These are the sorts of problems that a reasonably programmed (or so we think) computer with enough processors to require fans that sound like this:
just can’t solve if the domain is large enough.
The full list of these problems is here and I suppose there are two ways to think of this. One is very abstract. The other is with a concrete example. I think it’s important to understand the core essence of these problems, so I’ll begin with the abstraction.
Imagine that you have a problem whose possible solutions represent all of the possible groupings of all the variables in the problem. Let’s say you have 2 possible solutions and n possible groupings. That’s 2^n possible solutions. Generally, to find each solutions, you have to do a bit of work to verify it as well. Let’s call that m. Now the total time to find the solution to the problem is proportional to 2^n * m. If n is 5 and m is 10 then 2^(5) * 10 = 320. Let’s say that represents seconds, so we’re at 5 minutes, 20 seconds. If n is 10 and m is 100, then we’re at 102,400, or 1,706 minutes (or 71ish days).
So numbers cause big shifts in our ability to get a problem done in a time that makes sense. All essentially because the problem’s answer could be this or this or this or this. If we had a computer that could simply calculate all of the this’s in parallel with how many or‘s there were (i.e, for each possible solution, have a single processor dedicated to exploring a single solution set) then we could do this in polynomial time. But you can see just how quickly that becomes a problem. Too many processors equals fires (or really loud fans).
The other way to think of this is through a concrete problem. One of the easier ones (in my opinion) to understand is the traveling salesperson problem:
The problem is essentially a question of permutations. Given a list of cities and the distances between those cities, what is the shortest possible route that visits each city and returns to the origin city. Again, to get to the answer we have to look at one path or another path or another path or another path. There is no way to know which path is the shortest without first computing the lengths of all the paths. You can imagine that with more cities, there are more paths and with a large enough set, the problem is so complex that we cannot reasonably answer it.
In short, these are the problems that baffle us with or without computers. That they all share some key characteristics says something about the nature of these problems and about problems in general. In fact, one of the key notions of NP-complete problems is that if you can solve one in polynomial time (i.e., efficiently, or non-exponentially) then you can solve all such problems in polynomial time. But perhaps some problems are just too complex for a person or machine to do.
I should note that I continue to remain positive about the direction of the Gutenberg project (the new WordPress editor, coming to you as of WordPress 5.0). My feelings on this are numerous and expansive, but the long and short of it is that I believe WordPress core needs a major shakeup to help the community (re)develop focus and draw in engaged and effective technical participants. Gutenberg represents a wonderful opportunity to do that as it brings a new paradigm to the core editor (and likely elsewhere as the foundational technology expands into other areas of site management) and has the potential to draw in a new wave of web developers.
That said, the introduction of Gutenberg into core has been an interesting thing to watch. From afar, the concerns of the Accessibility Team seem to clearly show the divisions between WordPress as an open source project (WordPress.org) and as a commercial one (WordPress.com):
Having just finished my midterm in Analysis of Algorithms (yes, the class is as dry as it sounds), my brain is still sharp on a few topics; one of them being dynamic programming, which I mentioned in my last post. In that post, wherein I tried to find motivation for forcing myself to relearn calculus, I used the classic example of trying to calculate the nth term of the Fibonacci sequence.
This is a question that is plaguing me at the moment as I force myself to relearn calculus for Analysis of Algorithms at Oregon State University. In moments like this, where the concepts are abstract and I need to learn even more abstract concepts so I can understand the first class of abstract ideas, motivation is key.
Why does this matter? I’ve been working in the industry for over five years and never needed this knowledge before, why now?
Well, the short answer is that you don’t need this body of knowledge to develop a wide range of applications and features to applications. In my world, many of the concerns that common sorting, searching, and general optimization algorithms address are not real concerns because they’ve been abstracted to parts of the language or framework. I’m able to do my job because someone else has figured out how to do other parts of my job that normally would need to be created from scratch. So while learning merge sort and analyzing its complexity is a fun exercise, I’ll not be writing it from scratch anytime soon.
Updating LexBlog.com’s aggregation engine was no small feat. Scott Fennell and I spent months testing all of the various components of our new aggregation engine that powers the vast majority of the site, but something that was hard to prepare for was the shear scale of the site. Now that it’s up an running, we’re learning a lot about how to manage a site like this, and what sorts of features are necessary for it to be a successful publication from the perspective of an editor or reader.
One thing that I’ve recently keyed in on is search. Normally, I would tell a client that on-site search is not important. Most visitors are coming to a site from a much better search engine (Google), and are more apt to click around the site once there. LexBlog has layered in some nice features to the standard WordPress search, but most of those are around making sure that readers can search by an author’s name when they’re on a blog or website. This seems like a thing WordPress should do by default, but the generic WordPress search is “dumb” in the sense that it only looks to the post content and post title when running a search. Authors are not in either, so some work had to be done to support searching an author’s name and getting their posts.
In any case, the on-site search is “good enough” for most readers, and most sites aren’t the size of LexBlog.com. However, LexBlog.com is big. Very big. There are nearly 400,000 posts and 20,000 users on the site. The results that are returned by WordPress with what is essentially a
LIKE %query% SQL statement does a disservice to anyone that waits around for the page to load (a search on LexBlog.com right now can take anywhere from 10-15 seconds to return a page).
While a very small percentage of visitors to LexBlog.com use the on-site search feature (only about 2% of all page views are to a search results page), the relationships that we’ve layered in to each post and user make search a potentially very attractive feature on the site. We could support advanced searches by organization, site, author, and date as opposed to a generic text search against all of the content. Moreover, the speed issues alone make me long for a better solution on the site.
Enter Elasticsearch. Elasticsearch is a product from Elastic.co; if you haven’t heard their story before, it’s worth the read.
I have looked at a variety of alternative search technologies for WordPress before this, including:
But I continue to come back to Elasticsearch for a number of reasons:
- It’s cost effective
- Easy to scale and manage
- Has a wonderful, developer-friendly WordPress plugin – ElasticPress – built by a team of people – 10up – that I trust to continue to maintain and iterate said plugin
This weekend, I took Elasticsearch/ElasticPress for a spin on LexBlog.com‘s staging environment, and the results were surprising. Most keyword searches were 3-5 times faster (again, that speed difference alone is enough for me to favor Elasticsearch over WordPress’s search). Here’s a view of how long it takes for someone to search for “Kavanaugh” on LexBlog.com’s production environment:
and here’s what it looks like in staging:
More than speed, though, Elasticsearch’s queries are optimized for searches in a way that WordPress is not. As I mentioned, WordPress searches post content and titles, but Elasticsearch/ElasticPress expands that to include taxonomies (tags, categories, and custom taxonomies) and bylines.
WordPress also has a very weak algorithm for the keyword search itself. Without going into too much detail, it performs a relatively exact search of the query so that misspellings or typos that may occur when you’re on a mobile device (or if you’re like me, whenever you’ve been staring at a screen for more than 10 hours) are treated like you meant to search for that exact phrase. Elasticsearch performs “fuzzy matching”, which looks for variations on the keyword that you’ve searched. For example, if you’re interested in the Stop Online Piracy Act – SOPA – you might search “SOPA’s enforcement.” However, maybe you’re feeling lazy that day and don’t want to type in the apostrophe, so you search “SOPAs enforcement.” Elasticsearch is smart enough to return results for the Stop Online Piracy Act/SOPA whereas WordPress returns only results where the text was literally “SOPAs enforcement”; so only instances where the author made the same “typo” that you did!
While it seems like Elasticsearch may be winning the day at LexBlog, it’s still something for us to explore in more depth. As with all updates to a site, many people have a voice (including the readers) and we’re still waiting to see how they (and we) value search.
This was one of the most eventful summers in my life both personally and professionally. In July, Garry (LexBlog’s COO) and I had a chance to go to Chicago and spend some time talking about LexBlog’s future product line and general opportunities for integrating with our platform. It’s not often that I get an opportunity to do face-to-face meetings of these sort, and it was nice to get back in the saddle. It was also my first time visiting Chicago, and Garry seemed more than happy to drag me around.