In the last few years, R has steadily been increasing in popularity as the main language used for data mining and analytics. There are multiple reasons for this popularity. Below we will first show just how popular R has become, looking at some recent surveys. After that we will look at some pros and cons of using R for big data analysis. Finally, we will speculate about what to expect in the future, especially given the inclusion of R directly into the new 2016 SQL Server release.
Rexer Analytics Survey
Rexer Analytics conducts a large survey on the views, practices and preferences of data mining and analytics professionals every two years (it was annual until 2011). It is the largest poll of this kind. The last survey was from 2013, and the next one will be published in late September. The survey is freely available upon request and a summary of the results is online. In the 2013 survey, 1,259 analytic professionals from 75 countries participated. One of the main highlights of the 2013 survey was the increasing prevalence of R, which seems to be moving from a language mainly used in research and academia to one that is also widely used in commercial enterprises.
The 2013 survey shows that the usage of R has been steadily increasing, with more than two thirds of data miners using R and almost a fourth using it as their main software tool for analysis. This is quite significant, given that the average data miner reports using five different tools. Moreover, users are quite satisfied with their R experience: more than 85% of R users are “satisfied” or “extremely satisfied”.
The following chart (from the Revolution Analytics blog) shows the popularity of the top data mining tools based on this survey.
KDnuggets Poll
The KDnuggets poll is an online poll about the software tools most used by the data mining and analytics community (as opposed to the Rexer Analytics survey, which covers a broader range of topics, not just software). About 2,800 voters participated in the 2015 poll and chose from a record number of 93 different tools.
This poll shows that R is the most popular language used by data miners. However, Python is growing at a faster rate. Because R is a language designed specifically for statistics while Python is a general-purpose language, Python is becoming a serious contender to R, though it will probably not replace R altogether. Many are already using a combination of the two languages.
Another interesting result of the poll is that many respondents use combinations of tools. In particular, the diagram below categorizes these results based on whether the tools are commercial or open source. While 91% of respondents use at least some sort of commercial software and 73% use some open source software, almost two thirds of all respondents (64%) use a combination of both kinds of tools. What is interesting is that this trend among professionals is also being picked up by major vendors; the incorporation of R into SQL Server 2016 is a good example.
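The three percentages quoted above are consistent with one another, which a quick inclusion-exclusion check makes visible. The sketch below uses only the figures quoted from the poll. The function name `usage_breakdown` is ours, chosen purely for illustration, and Python is used here simply as a convenient notation.

```python
def usage_breakdown(commercial: int, open_source: int, both: int) -> dict:
    """Split overlapping usage shares (in percent) into disjoint groups."""
    return {
        # Respondents who use commercial tools but no open source tools.
        "commercial_only": commercial - both,
        # Respondents who use open source tools but no commercial tools.
        "open_source_only": open_source - both,
        # Inclusion-exclusion: everyone who uses at least one kind of tool.
        "at_least_one": commercial + open_source - both,
    }


# Figures quoted from the 2015 KDnuggets poll: 91%, 73% and 64%.
print(usage_breakdown(commercial=91, open_source=73, both=64))
```

With the poll's figures this gives 27% using only commercial tools, 9% using only open source tools, and 100% using at least one kind of tool, assuming the published percentages are rounded.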
Pros and Cons of Using R for Big Data
Pros:
- Most commonly used language for statistical computation and data analytics; the “lingua franca of statistics”
- Large, comprehensive library of cutting-edge algorithms and great visualization tools; complicated tasks can be done with a few lines of code
- All computations are done in memory, so it is fast on small data sets
- Open source with an active user community, so it is continuously updated, has community support, and allows existing algorithms to be modified
- Will be accessible directly from within SQL Server 2016
Cons:
- Designed for use as a standalone program on a desktop, so not easily adaptable to parallel and network computing
- Everything is loaded into memory, so it can be quite slow when scaling to big data applications
- Open source, which can lead to problems in a production environment: questionable reliability, lack of backward compatibility, security issues, etc.
- Steep learning curve; it can take a while to learn R and become familiar with the existing packages
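The in-memory trade-off that appears on both sides of this list can be illustrated with a small, language-neutral sketch. It is written in Python rather than R, and it is not a description of how R works internally. The function names `mean_in_memory` and `mean_streaming` are hypothetical and exist only for this example. The first function materializes the whole data set before computing, so its memory use grows with the data. The second keeps only a running total and a count.

```python
from typing import Iterable


def mean_in_memory(values: Iterable[float]) -> float:
    """Load every value into a list first, then compute the mean."""
    data = list(values)  # the whole data set is held in memory at once
    if not data:
        raise ValueError("no values supplied")
    return sum(data) / len(data)


def mean_streaming(values: Iterable[float]) -> float:
    """Compute the mean one value at a time, keeping only two numbers."""
    total = 0.0
    count = 0
    for value in values:
        total += value
        count += 1
    if count == 0:
        raise ValueError("no values supplied")
    return total / count


# Both approaches agree on a small data set.
print(mean_in_memory([1.0, 2.0, 3.0, 4.0]))
print(mean_streaming(iter([1.0, 2.0, 3.0, 4.0])))
```

For small inputs the in-memory version is simple and fast. For data sets that exceed available memory, something closer to the streaming version (or a database doing the aggregation) becomes necessary.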
In January it was announced that Microsoft had acquired Revolution Analytics, a vendor mainly focused on building open source and open-core versions of R for enterprise as well as academic applications. Following this, it became public a couple of months ago that SQL Server 2016 will allow R commands to be executed from within its environment. This is particularly exciting for us at Decisive Facts because SQL Server and R are the two main tools that we use, and a more streamlined combination of the two would be very useful for us. This should also have an effect on the growth of R usage: many SQL Server users who were not previously using R can be expected to take an interest in it.