What is the Best Programming Language for Machine Learning Tasks?
This is an all too common question. Given the popularity and importance of artificial intelligence (AI) today, machine learning, the means of building an intelligent agent, has become a constant subject of technological discussion. With such a diverse variety of programming languages available, it is natural to ask which one fits the task best.
Or is there really just one?
When we look at the languages most often requested in machine learning and data science job postings on indeed.com, a clear trend emerges: demand concentrates on a small set of programming languages.
Even without reading this article, many readers will already have a rough idea that Python is the most popular language for machine learning tasks. Its comparatively simple syntax, the vast community support that comes with its open-source nature, and its capabilities as a general-purpose programming language are the main contributors to its preeminence.
On average, a Java program takes 3 to 5 times more lines of code than its Python equivalent, and a C++ program is often 5 to 10 times longer. (Source: https://www.python.org/doc/essays/comparisons/) This succinctness makes Python an extremely beginner-friendly language, which expands its user base and prepares newcomers for machine learning work.
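To give a sense of that brevity, here is a minimal sketch (the function name and sample sentence are made up for illustration). Counting word frequencies takes a handful of lines in Python, whereas an equivalent Java or C++ program would also need a class or main function and explicit type declarations.

```python
from collections import Counter


def word_counts(text):
    """Return a Counter mapping each lowercase word to how often it appears."""
    return Counter(text.lower().split())


counts = word_counts("the quick brown fox jumps over the lazy dog the end")
print(counts.most_common(1))  # [('the', 3)]
```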
Some of the most commonly used packages for machine learning and deep learning are built for Python: scikit-learn, Keras and TensorFlow, to name a few. Aside from these packages, which wrap up models and make machine learning production more convenient, data analysis packages such as NumPy, pandas and SciPy ensure that Python is just as capable as R at tasks such as data wrangling and cleansing. Because their performance-critical parts are largely implemented in C, users get much of the speed of C while coding in the straightforward syntax of Python, which offsets (to a certain extent) Python’s speed disadvantage as a high-level language.
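The principle can be seen without installing anything. In the sketch below (an illustration only, not a benchmark), a loop written in Python is compared with the built-in `sum`, which in the standard CPython interpreter is implemented in C. NumPy applies the same idea to whole arrays. The helper name `python_loop_sum` and the data size are arbitrary choices, and the actual timings will vary by machine.

```python
import timeit


def python_loop_sum(values):
    """Add the values one at a time in interpreted Python code."""
    total = 0
    for v in values:
        total += v
    return total


data = list(range(200_000))

# Both approaches compute the same result; only the execution path differs.
loop_seconds = timeit.timeit(lambda: python_loop_sum(data), number=3)
builtin_seconds = timeit.timeit(lambda: sum(data), number=3)

print(f"Python loop:  {loop_seconds:.4f} s")
print(f"Built-in sum: {builtin_seconds:.4f} s")
```

On a typical CPython installation the built-in version should finish noticeably faster, because the per-element work happens in compiled code rather than in the interpreter loop.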
It is important to realize that there are multiple production stages in any machine learning project, since the machine learning models themselves often must be integrated with a larger application that utilizes the results/predictions of the models. Python holds a significant edge over more task-specific languages like R or Octave in this aspect, as it allows users to build a wide variety of applications of different sizes, online or offline.
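As a rough sketch of what that integration can look like, the snippet below wraps a model inside a small application function that accepts a JSON request and returns JSON predictions. Everything here is hypothetical: `ThresholdModel` is a stand-in for a real trained model, and `handle_request` stands in for whatever web or batch layer the larger application uses.

```python
import json


class ThresholdModel:
    """Hypothetical stand-in for a trained model: flags values above a cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def predict(self, features):
        return [1 if x > self.cutoff else 0 for x in features]


def handle_request(model, request_body):
    """Application layer: parse a JSON request, call the model, return JSON."""
    payload = json.loads(request_body)
    predictions = model.predict(payload["features"])
    return json.dumps({"predictions": predictions})


model = ThresholdModel(cutoff=0.5)
print(handle_request(model, '{"features": [0.2, 0.7, 0.9]}'))
# {"predictions": [0, 1, 1]}
```

Because the model and the surrounding application live in the same general-purpose language, no translation layer is needed between them.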
In all, Python has become the clear leader of the pack and does not show any signs of slowing down.
Although Python has many statistical and data science packages under its belt, it is a full-fledged general-purpose language, whereas R is specifically a statistical programming language. For this reason, R tends to be favored in academia and classroom settings, as it lets students and researchers focus on the statistical concepts at hand.
Like Python, R has an impressive suite of packages dedicated to tasks up and down the machine learning pipeline: data acquisition, wrangling and cleansing, model building, prediction, etc. Its only real disadvantage compared to Python is perhaps, as mentioned above, that it is poorly suited to tasks beyond this pipeline because of the nature of the language.
R’s niche lies in its roots. Born as a statistical language used primarily in academia and research, R is among the most mature tools for managing data and is often the most intuitive for people with a background in statistics. For exploration or one-off deep dives into data, R is often more convenient, whereas in Python one must write a fair amount of generic code to solve a rather specific problem. Essentially, it comes down to the trade-off between the domain-specific R and the general-purpose Python.
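As a small illustration of the "generic code" point: R's built-in `summary()` prints the minimum, quartiles, median, mean and maximum of a numeric vector in a single call. With only Python's standard library, the same summary has to be assembled by hand, roughly as sketched below. The function name `summarize` is made up for this example, the `"inclusive"` quantile method is assumed here to be the closest match to R's default, and third-party packages such as pandas narrow this gap considerably.

```python
import statistics


def summarize(values):
    """Assemble a six-number summary similar in spirit to R's summary()."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


for name, value in summarize([1, 2, 3, 4, 5]).items():
    print(f"{name}: {value}")
```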
Despite its high hurdle for new programmers, C++ is still widely used by people who work in machine learning, simply because of its outstanding performance. C++, as opposed to languages like Python and Java (which we will discuss next), is considered a lower-level language: it is compiled ahead of time into machine code and gives the programmer direct control over memory, which puts it closer to the hardware (and makes it correspondingly harder for humans to read). As a result, a well-written C++ program typically executes faster. In fact, popular machine learning packages for Python such as PyTorch and TensorFlow are largely implemented in C++ under the hood for optimized performance.
Given the complexity of machine learning and deep learning algorithms, there are two reasons why C++ is advantageous. First, for those who are learning the algorithms, one of the most effective approaches is to implement one yourself in a language you are familiar with. If C++ is that language, the fact that it requires a thorough, inside-out understanding of the program at hand will deepen your comprehension of the algorithm. Second, training and prediction for a sophisticated ensemble model such as a Random Forest or a Gradient Boosting Machine can be done in much less time in C++ than in the other popular languages on this list.
However, C++ has far fewer widely used machine learning production packages than Python or R; OpenNN and CNTK are a couple worthy of mention. While the language remains a popular choice among more experienced engineers striving for better performance, that same barrier to entry means its popularity is unlikely to increase. The need to manage memory and pointers manually certainly does not help.
The uniqueness of Java in comparison with other programming languages is well elaborated by Mohan Pawar here: https://www.quora.com/What-feature-made-Java-different-from-all-other-programming-languages.
As for Java in machine learning, its advantage is similar to Python’s: it scales to larger systems and applications because it is a general-purpose language built for cross-platform development. Java’s shortcoming, however, is that it is not the best in any single category. It is much more verbose than Python, but significantly less so than C++. In terms of speed, it outperforms Python on average yet seems sluggish next to C++. And it is not a statistics-specific language like R, despite having its own set of machine learning libraries such as OpenNLP, Java-ML, etc.
It is worth noting, however, that Java remains one of the most popular programming languages in the world, which of course spills over to the narrower domain of machine learning. In particular, many of the big data technology stacks are written in Java (think Apache Hadoop, etc.). Given the importance of storing and managing immense amounts of data in machine learning and artificial intelligence projects, Java naturally has an edge in being able to integrate the data stack with statistical models and other applications.
After this look at four popular languages for machine learning, you can probably tell that each serves a slightly different purpose and has its own advantages. Python is simple in syntax, has easy-to-use libraries for production, and can do a bit of everything; R is for the statistician and is convenient for data exploration; C++ provides unmatched speed and is perhaps also a good language in which to learn machine learning; Java can do many things just like Python, but it integrates with big data stacks particularly well and is one of the most commonly used languages in the developer community, enjoying much support.
The debate over the best programming language for machine learning may never be settled, but with everything said, there is perhaps a best language for each developer and the specific tasks at hand.
Written by Yicheng Shen & Edited by Alexander Fleiss