Update: Much of this post is irrelevant now that the new version of Wykki is operational. See this post for more information.
Besides my internship and an awesome family vacation to Maine and Massachusetts, I also spent some time on a new project over the summer: Wykki.
In the most general terms, Wykki is a lot like a search engine that, instead of showing a list of web results, directly answers the question you’re asking.
Wykki answering the question “Who played severus snape?”
Existing search engines (such as Google and Bing) have also recently started to develop the ability to answer similar questions–so how is Wykki different?
- Wykki is slowly absorbing all of Wikipedia
- Wykki learns how to answer new question types as more people use it
Wikipedia is a very useful source of empirical information. Although the body paragraphs of its articles are sometimes changed in controversial ways (but not as much as you’d think), there is a place where its most reliable information is available in an easily digestible format: the Infotables.
Infotables are where Wikipedia stores, well, info, and you can find them at the top right of most articles. They’re all categorized and nicely formatted, which makes it easy to extract data from them. The one on the right, about Severus Snape, is an example of an Infotable.
For my AP Computer Science final project at the end of Junior year, I made a Java program that takes the (~40 GB) monthly backup of Wikipedia, which is available for free online, and scrapes it for Infotables. It then organizes this information into one master Vocabulary file and thousands of sub-files for every article.
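To give a rough idea of what the scraping step involves, here is a minimal Python sketch (the original program was written in Java, and this is not its code). In Wikipedia's markup these tables are written as `{{Infobox ...}}` templates with one `| field = value` pair per line. The function name `parse_infobox` and the sample text are made up for illustration, and the sketch assumes a simple table with no nested templates, which real articles often break.

```python
import re

def parse_infobox(wikitext):
    """Return a dict of field -> value from the first {{Infobox ...}} block.

    Simplified on purpose: assumes one field per line and no nested
    templates inside the table.
    """
    start = wikitext.find("{{Infobox")
    if start == -1:
        return {}
    end = wikitext.find("}}", start)
    if end == -1:
        return {}
    fields = {}
    for line in wikitext[start:end].splitlines():
        # Lines look like "| portrayer = Alan Rickman".
        match = re.match(r"\s*\|\s*([^=|]+?)\s*=\s*(.*?)\s*$", line)
        if match and match.group(2):
            fields[match.group(1).lower()] = match.group(2)
    return fields

# Hypothetical sample markup, shortened for illustration.
SAMPLE = """{{Infobox character
| name = Severus Snape
| portrayer = Alan Rickman
| creator = J. K. Rowling
}}
Article text follows here."""

info = parse_infobox(SAMPLE)
print(info["portrayer"])
```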
Wykki takes this data and imports it into a Google App Engine database, which is a lot more efficient than constantly reading and writing text files on a PC as the Java program did.
Using Python, Wykki is then able to match information to questions–if you were to type in “Severus Snape portrayer” for example, it’d match that to the “Severus Snape” Entity and its “portrayer” Property, yielding the proper result as seen above.
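The matching step could look something like the sketch below. It is only an illustration under a few assumptions: the `ENTITIES` dictionary stands in for the App Engine database, and `answer_keyword_query` is a hypothetical name, not a function from Wykki's code. The idea is to try every way of splitting the query into an Entity name followed by a Property name, longest Entity name first.

```python
# Hypothetical in-memory stand-in for the real database.
ENTITIES = {
    "severus snape": {"portrayer": "Alan Rickman", "creator": "J. K. Rowling"},
    "albus dumbledore": {"creator": "J. K. Rowling"},
}

def answer_keyword_query(query, entities=ENTITIES):
    """Split a query into a known Entity name plus one of its Properties."""
    words = query.lower().split()
    # Try the longest candidate Entity name first, then shorter ones.
    for split in range(len(words) - 1, 0, -1):
        entity_name = " ".join(words[:split])
        property_name = " ".join(words[split:])
        properties = entities.get(entity_name)
        if properties and property_name in properties:
            return properties[property_name]
    return None  # no Entity/Property pair matched

print(answer_keyword_query("Severus Snape portrayer"))
```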
But how do you get from ambiguous commands like “Severus Snape portrayer” to answering natural language questions like “Who played Severus Snape?” That’s where the learning comes in.
Learning New Question Types
Every Property–be it age, nationality, height, or anything else in the left column of an Infotable–is assigned a unique 9-character identifier. “Portrayer,” for example, is assigned to P37616415.
That way, if Wykki is looking at the “Severus Snape” Entity in its database, it can find the Property marked P37616415, and know what value to return to the user.
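As a small sketch of that lookup: the two identifiers below are the ones mentioned in this post, but the dictionary layout and the `get_value` helper are assumptions made for illustration rather than Wykki's actual data model.

```python
# Property name -> identifier (both IDs are taken from this post).
PROPERTY_IDS = {
    "portrayer": "P37616415",
    "creator": "P65196977",
}

# Hypothetical Entity record: values are stored under Property IDs.
SNAPE = {
    "name": "Severus Snape",
    "values": {
        "P37616415": "Alan Rickman",
        "P65196977": "J. K. Rowling",
    },
}

def get_value(entity, property_name):
    """Resolve a Property name to its ID, then read the value off the Entity."""
    property_id = PROPERTY_IDS.get(property_name.lower())
    if property_id is None:
        return None  # unknown Property
    return entity["values"].get(property_id)

print(get_value(SNAPE, "Portrayer"))
```

Storing values under IDs rather than names means any number of phrasings can point at the same Property, which is what the learning step relies on.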
But how does this enable learning?
Sometimes, Wykki encounters a question that it can match to an Entity, but not to a Property. An example of this would be asking “Who created Severus Snape?”–Wykki would be able to find the “Severus Snape” Entity but not be able to match “Who created [Entity]?” to a Property.
To solve this, Wykki asks for extra input from the user:
Wykki asking for more information to answer “Who created Severus Snape?”
Once the user clicks the intended Property–“creator” in the example, which links to the Property P65196977–Wykki sends through the right answer:
Wykki answering “Who created Severus Snape?”
This is where the magic happens–now that a single user has told Wykki that “Who created [Entity]?” is related to the Property P65196977, Wykki can use this in the future.
So if a different user then asks “Who created Albus Dumbledore?” Wykki knows that they’re referring to P65196977, and is able to answer right away:
Wykki answering “Who created Albus Dumbledore?”
This is a very powerful approach, because it means that Wykki does not have to learn what questions like “Who created Severus Snape?” and “Who created Albus Dumbledore?” mean individually, but can learn a more general version of the question once and apply it to many current and future Entities.
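Here is a minimal sketch of that idea, assuming questions are generalized by swapping the Entity name for an `[entity]` placeholder. The names `to_template`, `learn`, and `answer` and the in-memory dictionaries are hypothetical stand-ins, not Wykki's actual code.

```python
# Hypothetical stand-in for the database: Entity name -> {Property ID: value}.
KNOWN_ENTITIES = {
    "severus snape": {"P65196977": "J. K. Rowling"},
    "albus dumbledore": {"P65196977": "J. K. Rowling"},
}

# Learned question templates -> Property ID.
learned_templates = {}

def to_template(question, entities=KNOWN_ENTITIES):
    """Replace a known Entity name in the question with '[entity]'."""
    text = question.lower().strip()
    # Check longer names first so the most specific Entity wins.
    for name in sorted(entities, key=len, reverse=True):
        if name in text:
            return text.replace(name, "[entity]", 1), name
    return None, None

def learn(question, property_id):
    """Remember which Property a user picked for this kind of question."""
    template, _ = to_template(question)
    if template is not None:
        learned_templates[template] = property_id

def answer(question):
    """Answer using a learned template, or return None if none applies."""
    template, name = to_template(question)
    property_id = learned_templates.get(template)
    if property_id is None:
        return None  # this is where Wykki would ask the user for help
    return KNOWN_ENTITIES[name].get(property_id)

first_try = answer("Who created Albus Dumbledore?")  # nothing learned yet
learn("Who created Severus Snape?", "P65196977")     # one user picks "creator"
second_try = answer("Who created Albus Dumbledore?")
print(first_try, second_try)
```

The first call fails because no template exists yet. After a single user links the Snape question to the "creator" Property, the same question about Dumbledore is answered without any extra input.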
Here’s a couple of questions Wykki can already answer (they’ll open in a new window):
The site itself consists of a fairly simple white box on a gray background, with a header and footer wrapped around it.
The box consists of four parts:
- The top message (small, 1em size)
- The middle message (large, 1.7em size)
- The bottom message (small, 1em size)
- The input box
Wykki can populate all four of those from the server, or choose not to populate one or two of the messages, which makes them disappear without breaking the layout.
The site is also adaptive, changing easily from a layout for a 27″ desktop monitor to one for a 5″ smartphone screen. (Try going to http://wykki.com/ and resizing your window!)
Under the Hood
Some of this has come up before, but here’s a quick rundown on the different things that power Wykki:
The main code is about 600 lines of Python (excluding comments and spacing), the HTML is about 35, and the CSS is about 100.
As you can see from the site and screenshots, Wykki is in Alpha right now. Why is that? Mostly because I’ve only imported about 10,000 Entities from Wikipedia so far, which means Wykki can answer “How many episodes of Friends are there?” but not “How many episodes of Modern Family are there?”
The problem here is that importing data from Wikipedia is fairly slow, and I’m limited by how much I can upload to Google App Engine every day. Plus, Wikipedia data is constantly being updated, so my imported copy would fall out of sync.
What I’ll probably end up doing is rewriting most of the code to be a layer on top of Freebase (a free database of information that knows hundreds of millions of facts), where I won’t have to import the actual information, but just link the different Properties to my dictionary of questions. That way my database would only have to contain natural language data (as in, what questions relate to what Properties), while Freebase can be changed, updated, and grown without me having to account for the changes.
Another system, for which the groundwork is already in place but which still needs an interface, is the scoring system. It evaluates the “relatedness” between learned questions and their Properties over time, which is needed to prevent (accidental or intentional) links between unrelated Properties and questions.
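One simple way such a score could work is plain vote counting: every time a user picks a Property for a question template, that pairing gets a vote, and a link is only trusted once it has enough of them. The sketch below is purely illustrative. The threshold `MIN_VOTES`, the function names, and the counting scheme itself are assumptions, since the actual scoring formula isn't described here.

```python
from collections import Counter, defaultdict

MIN_VOTES = 2  # hypothetical threshold before a link is trusted

# Question template -> Counter of Property ID -> number of user picks.
votes = defaultdict(Counter)

def record_choice(template, property_id):
    """Count one user's pick of a Property for a question template."""
    votes[template][property_id] += 1

def best_property(template, min_votes=MIN_VOTES):
    """Return the most-picked Property, if it has enough votes to trust."""
    counts = votes.get(template)
    if not counts:
        return None
    property_id, count = counts.most_common(1)[0]
    return property_id if count >= min_votes else None

# Two users pick "creator"; one picks "portrayer" by mistake.
record_choice("who created [entity]?", "P65196977")
record_choice("who created [entity]?", "P65196977")
record_choice("who created [entity]?", "P37616415")

print(best_property("who created [entity]?"))
```

With a scheme like this, a single stray or malicious click can't redirect a question on its own, because it never outvotes the Property most users chose.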
So yeah, that’s Wykki. It’s been a really fun project so far, and I’ve learned a ton (shout out to Codecademy for teaching me Python and to all the Stack Overflow users who have answered my many questions).
I’m hoping to keep working on Wykki and implementing the stuff above ASAP (but I just started Senior year, so it might take a while).
So, what are you waiting for? Head over to http://wykki.com/ and give it a try!
For questions regarding Wykki, email firstname.lastname@example.org, or tweet @askwykki.