Playing Wordle the Hard Way

The hard way to cheat in Wordle is to write code! (The easy way is an anonymous browser window.) After many hours and many nights of computing, it turns out that Wolver, the Wordle Solver, is rarely better than teenage girls. Still, it was a lot of fun. Part of the fun was learning a lot of new words.

Let’s look at the word POINT.

ARTEL, TONIC, POINT

Solve POINT in hard mode

Wolver does two things. First, it creates a list of guesses rated by how many words they rule out. The top guess on that list is used as the first guess. Second, Wolver looks at the Wordle results for that guess, drops all solutions that are no longer possible, and again creates a list of rated guesses. The output then looks like

ARTEL -> TONIC

Processing ARTEL and Wordle results (Word of the day was POINT)

Wolver looks at ARTEL. The black square stands for yellow. Wordle told us there is a letter T in the solution. Wolver knows about 107 words that have a T, starting with STINK. 44 seconds later, Wolver suggests using TONIC to narrow the group down to 2 possible solutions. Let’s do that, and we get

ARTEL, TONIC -> POINT

Processing ARTEL TONIC and Wordle results (Word of the day was POINT)

After the Wordle hints for ARTEL and TONIC, Wolver only knows about 2 words that fit, POINT and JOINT. At this point Wolver just picks one, and this time it was lucky. For comparison, the two test teenagers both solved today’s problem in 3.

For POINT, anything below 3 is pure luck of the guess. There are many ways to get to only POINT/JOINT left after 2 guesses, and then it’s a 50% chance. 3 and 4 are equally good. 5 tries would be wasteful.
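The guess-and-filter loop described above can be sketched in a few lines. This is a minimal illustration, not Wolver’s actual code: the function names, the tiny word list, and the choice to rate guesses by a size-weighted average of the words left are all assumptions made for the example.

```python
from collections import Counter

def feedback(guess, answer):
    """Wordle's response: 'g' green, 'y' yellow, '-' gray."""
    result = ["-"] * len(guess)
    unmatched = Counter()  # answer letters not used up by a green
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            unmatched[a] += 1
    for i, g in enumerate(guess):
        if result[i] != "g" and unmatched[g] > 0:
            result[i] = "y"
            unmatched[g] -= 1
    return "".join(result)

def weighted_average(guess, candidates):
    """Average number of candidates left, weighted by group size (assumed metric)."""
    groups = Counter(feedback(guess, c) for c in candidates)
    return sum(n * n for n in groups.values()) / len(candidates)

def solve(answer, guesses, candidates):
    """Repeat: rate all guesses, play the best one, drop impossible solutions."""
    history = []
    while True:
        # On ties, prefer a word that could itself be the solution.
        guess = min(guesses,
                    key=lambda g: (weighted_average(g, candidates), g not in candidates))
        history.append(guess)
        if guess == answer:
            return history
        pattern = feedback(guess, answer)
        candidates = [c for c in candidates if feedback(guess, c) == pattern]

# Toy list for illustration only; the real lists are far longer.
words = ["POINT", "JOINT", "STINK", "TONIC", "EMAIL", "PROXY", "ARTEL"]
print(solve("POINT", words, words))
```

With a list this small the ranking is not meaningful; the point is only the shape of the loop. The sketch assumes the word of the day is in the candidate list.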

Wolver can go deeper if needed, e.g. if after two words there are still 12 or more solutions left. ALOES will get it down to 12 or fewer for sure, but it’s the only word that can do that. If you happen to start with JEEZE, you may end up with 94 solutions even if you make the best possible choice in the second guess.

First Let’s Look at Words

There are many thousand 5-letter words in the English language, including SOARE, ORIEL, EPHAS and SAIRS. But most people would never think of them as solutions. I am not even sure what SAIRS means, and Google couldn’t help. All four are valid guesses in Wordle, but luckily, none of them will be chosen as answers. Wordle would just be too hard if the solution could be TYNED. The words of the day are limited to more common words like PROXY, DILLY, and POINT.

The only way to be sure about what is a valid guess and what is a valid answer is to look at the Wordle code. As of the start of this coding project, there were 12,972 words allowed as guesses, and 2,315 possible answers.

In comparison, the game Mastermind has only six colors, but any combination is allowed. There are 15,625 combinations of six colors. Mastermind allows all combinations as guesses and as solutions. Wordle allows 26 letters, but limits the guesses to ~13,000, and solutions to ~2300. So in some ways, it’s easier.

So we have two lists, the allowed guesses and the potential solutions.

Ranking Guesses 1 – Quick and Dirty

First we rank the guesses. Each guess rules out a certain number of words. How many depends on the word of the day.

So for each combination of guess and solution, we compute the result (green, yellow, gray) that Wordle would give us. Then we check each solution against that result, and count how many are left. The fewer the better. Scoring a word against itself leaves only one solution; that is the lucky case. What we care about is how many we get in the worst case, or on average.
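As a sketch of that step, the snippet below computes the result for one guess and one solution, then keeps only the solutions that fit. The function names and the small word list are illustrative assumptions, not taken from Wolver. Greens are marked first so that a repeated letter in the guess is only marked yellow while the solution still has unused copies of it.

```python
from collections import Counter

def feedback(guess, answer):
    """Wordle's response as a string: 'g' green, 'y' yellow, '-' gray."""
    result = ["-"] * len(guess)
    unmatched = Counter()  # answer letters not used up by a green
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            unmatched[a] += 1
    # Second pass: hand out yellows while unused copies remain.
    for i, g in enumerate(guess):
        if result[i] != "g" and unmatched[g] > 0:
            result[i] = "y"
            unmatched[g] -= 1
    return "".join(result)

def remaining(guess, pattern, candidates):
    """Keep only the candidates that would have produced this pattern."""
    return [word for word in candidates if feedback(guess, word) == pattern]

# Toy list for illustration only.
candidates = ["POINT", "JOINT", "STINK", "EMAIL", "TONIC"]
print(feedback("ARTEL", "POINT"))               # --y--
print(remaining("ARTEL", "--y--", candidates))  # ['POINT', 'JOINT', 'STINK', 'TONIC']
```

Counting the length of the list returned by `remaining` gives the "how many are left" number for that guess and result.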

For example, for ARISE and the Wordle result of all grays, we get 168 solutions left. If we average over all the possible results, from all gray to all green, we get an average of 67 words left (we use a weighted average here). That is pretty good. ROATE’s worst case is 195 solutions left, but the average is 63. So ROATE is a good first guess.
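Both numbers can be read off the sizes of the groups that a guess splits the solutions into. The sketch below assumes that "weighted average" means weighting each group by its own size (a group of n words is hit by n of the N possible solutions and leaves n words), which gives the sum of n² divided by N. That reading, the function names, and the toy word list are assumptions for illustration.

```python
from collections import Counter

def feedback(guess, answer):
    """Wordle's response: 'g' green, 'y' yellow, '-' gray."""
    result = ["-"] * len(guess)
    unmatched = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            unmatched[a] += 1
    for i, g in enumerate(guess):
        if result[i] != "g" and unmatched[g] > 0:
            result[i] = "y"
            unmatched[g] -= 1
    return "".join(result)

def score(guess, answers):
    """Return (worst case, size-weighted average) of solutions left after guess."""
    groups = Counter(feedback(guess, a) for a in answers)
    worst = max(groups.values())
    average = sum(n * n for n in groups.values()) / len(answers)
    return worst, average

# Toy list for illustration only.
answers = ["POINT", "JOINT", "STINK", "TONIC", "EMAIL", "PROXY"]
print(score("ARTEL", answers))  # (4, 3.0)
```

Here ARTEL splits the six words into groups of 4, 1 and 1, so the worst case is 4 and the weighted average is (16 + 1 + 1) / 6 = 3.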

The top 10 guesses with this metric are

ROATE, RAISE, RAILE, SOARE, ARISE,
IRATE, ORATE, ARIEL, AROSE, RAINE

Really bad guesses are GYPPY, JUGUM, JUJUS, QAJAQ, IMMIX, with IMMIX’s average a massive 970. The worst case for IMMIX is 1,429. Still, if the solution today is EMAIL, the result will be [gray, green, gray, green, gray] and EMAIL will be the only solution left. There are 9 such cases, including BUXOM, PIXIE and AXIOM. But that’s 9 out of 2,315, about 0.4%. Don’t start with IMMIX!

Ranking Guesses 2 – Going Deeper

Of course we can do better than this if we throw a little bit more compute power at it. After we compute every Wordle result for every combination of guess and solution, we basically have a simplified Wordle problem, one with fewer answers. (If we play in hard mode, we can also throw out a large number of guesses at this time.) We can now just use the same code to again look at the remaining answers for each guess and then average the averages. In computer science, we call this recursion.
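A rough sketch of that recursion follows. It is not Wolver’s code: the names, the size-weighted averaging, and the hard-mode check (greens stay in place, yellow letters must be reused) are simplifying assumptions for illustration.

```python
from collections import Counter, defaultdict

def feedback(guess, answer):
    """Wordle's response: 'g' green, 'y' yellow, '-' gray."""
    result = ["-"] * len(guess)
    unmatched = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            unmatched[a] += 1
    for i, g in enumerate(guess):
        if result[i] != "g" and unmatched[g] > 0:
            result[i] = "y"
            unmatched[g] -= 1
    return "".join(result)

def hard_mode_ok(candidate, guess, pattern):
    """Approximate hard-mode rule: keep greens in place, reuse hinted letters."""
    needed = Counter()
    for c, g, p in zip(candidate, guess, pattern):
        if p == "g" and c != g:
            return False
        if p != "-":
            needed[g] += 1
    have = Counter(candidate)
    return all(have[letter] >= n for letter, n in needed.items())

def average_remaining(guess, answers, guesses, depth, hard_mode=False):
    """Weighted average of answers left after `depth` guesses, starting with guess."""
    groups = defaultdict(list)
    for answer in answers:
        groups[feedback(guess, answer)].append(answer)
    total = 0.0
    for pattern, group in groups.items():
        value = len(group)
        if depth > 1 and len(group) > 1:
            next_guesses = guesses
            if hard_mode:  # throw out guesses that ignore the hints so far
                next_guesses = [g for g in guesses if hard_mode_ok(g, guess, pattern)]
            # Same code again on the smaller problem, using the best next guess.
            value = min(
                (average_remaining(g, group, next_guesses, depth - 1, hard_mode)
                 for g in next_guesses),
                default=value,
            )
        total += len(group) / len(answers) * value
    return total

# Toy lists for illustration only.
answers = ["POINT", "JOINT", "STINK", "TONIC", "EMAIL", "PROXY"]
guesses = answers + ["ARTEL"]
print(average_remaining("ARTEL", answers, guesses, depth=1))
print(average_remaining("ARTEL", answers, guesses, depth=2))
print(average_remaining("ARTEL", answers, guesses, depth=2, hard_mode=True))
```

On this toy list ARTEL averages about 3 words left after one guess and about 1 after two. Passing the filtered guess list down the recursion is what makes hard mode cheaper at deeper levels.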

For example, TOILE was ranked #38 previously. After two guesses, however, its average number of remaining words is down to 6.09, making it the new #1, while ROATE is also at 6.09. The difference is mostly immaterial. TRONE, however, moves up from #983 to #5: it has an average of 81 after one level, but 6.10 after two.

The top 10 guesses with this metric (in easy mode) are:

RAILE, IRATE, TOILE, ROATE, TRONE,
ALER, ARTEL, SALET, ORATE, SLATE

If you play hard mode, they are:

ARTEL, TRONE, IRATE, ROATE, REAST,
TOILE, TALER, CRATE, TENOR, RAILE

Going all in

You see where it’s going. You can keep going deeper. If you go three levels down, it may look like this (in hard mode):

RATEL, ARTEL, ROATE, COATE, TALER,
STOAE, ORATE, LATER, IRATE, REALS

My code has been running for a few days (on one core) and only got about 90 words in (out of 13,000), so we won’t get a final level 3 list anytime soon. REALS was #59, and only made 10th place, so it’s unlikely that I’ll find a better word later.

Still, those run times don’t bode well for a full solve. Short of spending lots of money on a Google Cloud job, I will never know the perfect strategy. To do a full analysis deeper than two levels, I will have to start implementing more pruning strategies, to do less work overall. As an example, if I want to compute the worst case, I can process the large result groups first and skip any smaller group whose size is already below the worst case found for the large ones.
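That pruning idea can be sketched as follows for a two-level worst case. A group can never have more words left than it contains, so once the running worst case is at least as large as the next group, that group and every smaller one can be skipped. The names and the toy word list are assumptions for illustration, not Wolver’s code.

```python
from collections import Counter, defaultdict

def feedback(guess, answer):
    """Wordle's response: 'g' green, 'y' yellow, '-' gray."""
    result = ["-"] * len(guess)
    unmatched = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            unmatched[a] += 1
    for i, g in enumerate(guess):
        if result[i] != "g" and unmatched[g] > 0:
            result[i] = "y"
            unmatched[g] -= 1
    return "".join(result)

def worst_case(guess, answers):
    """Largest group of answers that one guess cannot tell apart."""
    return max(Counter(feedback(guess, a) for a in answers).values())

def worst_after_two(guess, answers, guesses):
    """Worst case after a best second guess; returns (worst, groups evaluated)."""
    groups = defaultdict(list)
    for answer in answers:
        groups[feedback(guess, answer)].append(answer)
    worst, evaluated = 0, 0
    # Large groups first: they are the likely source of the worst case.
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(group) <= worst:
            break  # this group and all smaller ones cannot raise the worst case
        evaluated += 1
        best = min(worst_case(g, group) for g in guesses)
        worst = max(worst, best)
    return worst, evaluated

# Toy lists for illustration only.
answers = ["POINT", "JOINT", "STINK", "TONIC", "EMAIL", "PROXY"]
guesses = answers + ["ARTEL"]
print(worst_after_two("ARTEL", answers, guesses))  # (1, 1)
```

In this toy case ARTEL produces three groups, but only the largest one has to be evaluated.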

Conclusions

Foremost, teenagers are as good as Wolver most days. Sometimes they even win with luck. This elderly immigrant with English as a third language, however, is outperformed by the algorithm.

The improvements from going a level deeper are minimal, but require significant CPU. Wolver doesn’t really solve; maybe I should call it Wuesser?

Hard Mode makes it easier for the algorithm. It will take more guesses on average because the best guesses may be ruled out. But on the positive side, the hard-mode rules vastly reduce the number of guesses to test at higher levels, allowing the algorithm to go deeper.

I have spent enough time on this. I will spend more every day as I improve things here and there, and some more to clean up and update GitHub, but this obsession has to stop now!
