Mastery of Human Languages and the Next Big Thing in AI:
Interview with NLP Expert Dima Korolev
MD: Can you provide some background about your upbringing?
I got my first computer at 6 or 7, and all my life I have enjoyed writing programs more than playing games. Algorithms have always fascinated me. My real career started at Google in Zurich and continued at Microsoft in Bellevue. My dream has been to build startups, and that is what I am doing right now.
MD: What advice would you give to yourself after you graduated from the Moscow Engineering Physics Institute?
As long as you are not pushing yourself toward some predetermined career goals, and are instead trying to explore while you are still young, you will be fine. Thankfully, if you find yourself privileged enough, you do not have to make the choices that will make you more money in the immediate future, and instead you can choose something that is more fruitful and exciting. Betting on the compounding returns of investing in oneself generally pays off, and doubly so these days when it comes to building software and data products.
MD: Why did you choose to pursue research in data engineering and machine learning?
I like research and academia, but it quickly became apparent that I like building things more. While doing research you look at the numbers, and, if you get lucky, your paper gets published. As an engineer, what you are working on goes out into the world and gets results. Data is a good field because you have to understand the math: it is not just data engineering, it is a combination of engineering and research.
MD: Which paper or published work that you have written so far have you been most proud of, or believe is most significant? Why?
I have not published any works. The big companies I was employed by like filing patents, and I have a few with my name on them.
Some of the patents I filed with my teammates are kind of interesting, but not really groundbreaking if you ask me. I knew before starting in the industry that when you teach people how to write software and it clicks with them, it is rewarding and enjoyable. It always gives me more gratification when my colleagues, or the people I am coaching, get this excitement of broadening their knowledge, and I was lucky enough to be surrounded by grateful people often enough not to have the urge to do research and publish it for peer review. In addition to this, contributing to open source projects, at least in my experience, has had higher returns, both in terms of keeping me busy with something useful and when it comes to connecting with interesting people to later work for or with.
MD: Can you talk more about your May 2013 patent about Construction of Text Classifiers?
Basically, the idea for that patent is that you do not want to show adult pages to non-adult viewers; accidental exposure is wrong. To make it happen, there is an underlying machine learning model which can tell adult content from non-adult content, and you have to prove that it is high-precision before releasing it. Before I came to Google, my job had to do with text, and semantic analysis was just emerging. During my first year at Google we literally used to build seed models semi-manually, which usually took a bunch of refinement steps and spanned quarters. As our team grew and we got more experience, it became apparent that several manual steps were largely redundant, as relatively straightforward techniques could automate them away to a high degree. Specifically, this particular patent is about mining text-level features out of thin air, with no prior language knowledge whatsoever, by using the combination of several disparate non-textual signals, none of which is particularly strong on its own.
MD: What has been your favorite project at FriendlyData?
FriendlyData is an interesting beast. A lot of people think I am a Natural Language Processing (NLP) expert, but it is not true. When it comes to computers understanding human language, you need to analyze sentiment, and you have to use some fancy techniques like recurrent neural nets. At the same time, if you want to translate English queries into the programming language that a database can understand, and every single stage is as good as 90% accurate, the probabilities have to be multiplied, and, with complex inputs, the overall accuracy quickly goes way below 50%, rendering the project useless.
The error rate of an approach similar to the one taken by Google Translate would be too high, while FriendlyData had to guarantee that the database requests we built were precise. So the approach we took was the opposite of modern trends. Together with another NLP expert whom I admire greatly, we came up with a means of defining the grammar and applying it quickly at query time.
The most interesting project that emerged from this, which I view as one of the secret sauces behind our exit, was how we could adjust the grammar definition to allow for a suggestion engine. I would love to talk about this piece in more detail, but I am under a non-competition agreement until the fall of 2020.
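The multiplication mentioned above can be made concrete with a small sketch. It assumes that every stage must succeed for the final query to be correct and that stage errors are independent; the function names `pipeline_accuracy` and `stages_until_below` are hypothetical and are not taken from FriendlyData.

```python
def pipeline_accuracy(stage_accuracy: float, stages: int) -> float:
    """End-to-end accuracy when all stages must succeed.

    Assumption: stage errors are independent, so accuracies multiply.
    """
    if not 0.0 < stage_accuracy < 1.0:
        raise ValueError("stage_accuracy must be strictly between 0 and 1")
    if stages < 1:
        raise ValueError("stages must be at least 1")
    return stage_accuracy ** stages


def stages_until_below(stage_accuracy: float, threshold: float) -> int:
    """Smallest number of stages at which end-to-end accuracy drops below threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    stages = 1
    while pipeline_accuracy(stage_accuracy, stages) >= threshold:
        stages += 1
    return stages


if __name__ == "__main__":
    for n in (1, 3, 5, 7):
        print(f"{n} stages at 90% each: {pipeline_accuracy(0.9, n):.3f}")
    print("Stages needed to fall below 50%:", stages_until_below(0.9, 0.5))
```

Under these assumptions, seven stages at 90% each are already enough to push the end-to-end accuracy below 50%.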
MD: What is the next big thing for data science or machine learning? What risks lie ahead with machine learning?
My view of what is big may not be what the world thinks is big. We will make tremendous progress on small tasks. For example, you can argue that AI will have a profound impact on, or even take away, certain jobs. But I do not consider this groundbreaking, because it just seems like the natural progression for AI.
As I view the world, the next big thing might be cyber-counseling, or an AI that can connect with people and help them feel better. This AI would be a system that works with one’s self-perception, as one closed loop. One of the biggest social changes is that we are more concerned about human identity these days, and your or my well-being has a lot to do with our subjective self-perception.
In today’s world, we are putting increasing emphasis on helping people feel better, as opposed to merely being fed and healthy, and this is largely uncharted territory with respect to technology. I sense several new major markets emerging from this line of thought in the next decade or so.
MD: You mention in your blog post “On Ethics of Applying Machine Learning” that a good practice is to have sufficient human interaction for any problem you want to use machine learning to solve. When will ML be efficient enough such that human interaction is unnecessary?
That blog post was about how we have to be smart with technology. Systems, institutions, and groups of people can often be remarkably unwise. These groups might not realize that a lot of simple technological ideas can help them build better programs. However, you cannot use tech blindly; we need stronger human brains.
If you want to optimize something, you cannot just use technology to find a better strategy, and then throw this strategy away because it violates some seemingly intuitive constraints introduced by us humans in the first place. Instead, we have to slowly factor in the human strengths until we end up with results that we are confident are better than what humans alone could come up with. The post was inspired by an idea that uses AI technology to come up with the best strategy for a real-life problem and then bluntly throws away half of this strategy so that what’s left is culturally acceptable. I am arguing, in that blog post, that this is a textbook-perfect counter-example: it is exactly how not to apply AI to our real-life problems.
MD: Do you have any ideas for projects you can develop for coronavirus relief?
The more I read about COVID, the more I am convinced it is mostly about information warfare these days. The virus does highlight the deficiencies of our healthcare system, but it does so even more when it comes to our information hygiene. Dealing with information in the 21st century was supposed to be about finding the best strategies to get the right information out to people, whereas, sadly, what we are seeing is largely the opposite.
A project that could help with coronavirus relief would, I think, begin with exposing the data, diligently and openly, out there in the air, so that we would have not only the dashboards showing big red circles with the numbers of deaths next to them, but something more like a Jupyter-notebook-style shared workspace.
I know I am daydreaming here, of course, but we do have all the tech available to make something like this happen. It’s only about the critical mass of people who would want to look this way. And it’s not something Bill Gates or Mark Zuckerberg could help put together; the times have changed. I hope that Peter Thiel or Elon Musk is looking in this exact direction as we speak.
Written by Michael Ding & Edited by Alexander Fleiss