40 Years Of Analytics : Journey Of R

  • by Yatin Jog
  • 11 Months ago

The very first question is, what is R?

R is a dialect of S, a language developed by John Chambers and others at Bell Labs. S was initiated in 1976 as an internal statistical analysis environment that people at Bell Labs used to analyze data. Initially, it was implemented as a series of FORTRAN libraries for statistical routines that were tedious to write over and over again.

 

Early versions of the language did not contain functions for statistical modeling. In 1988, the system was rewritten in the C language. This was version three. Version four of the S language was released in 1998, and it is the version we use today. So, R is an implementation of the S language that was originally developed at Bell Labs.

In 1993, Bell Labs gave a corporation called StatSci, which later became Insightful Corporation, an exclusive license to develop and sell the S language. In 2004, Insightful purchased the S language from Lucent Technologies (which Bell Labs had become part of by then) for $2 million and became the owner. In 2006, Alcatel purchased Lucent Technologies, and the company is now called Alcatel-Lucent. Insightful sold its implementation of the S language under the product name S-PLUS. They built a number of fancy features into it, for example graphical user interfaces and all kinds of nice tools. In 2008, Insightful Corporation was acquired by a company called TIBCO for $25 million. TIBCO still develops S-PLUS, although as part of a variety of business analytics products. The fundamentals of the S language have not really changed since 1998.

 

R is a relatively recent development. It was created in New Zealand in 1991 by Ross Ihaka and Robert Gentleman. The first announcement of R to the public was made in 1993. In 1995, Martin Mächler convinced Ross and Robert to license R under the GNU General Public License, and that made R what we call free software.

In 1997, the R core group was formed. The core group controls the source code for R, and the primary source code can only be modified by its members. However, a number of people who are not in the core group have suggested changes to R that the core group has accepted.

One of the main benefits of R is that it runs on any standard computing platform or operating system. The core software of R is actually quite lean. Its functionality is divided into modular packages, so you don’t have to download and install a massive piece of software. Its graphics capabilities are very sophisticated and give the user a lot of control over how graphics are created.

There are a few drawbacks to R.

  • It’s essentially based on 40-year-old technology. The original S language developed in the 70s was based on a couple of principles, and the basic ideas have not changed much. As a result, there is little built-in support for dynamic or 3D graphics.
  • Another drawback of R is that its functionality is driven by consumer demand and user contributions. There is no corporation or company that you can complain to, and there’s no helpline that you can call to demand a specific implementation or feature. If the feature’s not there, then you have to build it.
  • Another drawback, which is a little more technical, is that the objects you manipulate in R have to be stored in the physical memory of the computer. If an object is bigger than the physical memory of the computer, you can’t load it into memory, and therefore you can’t do anything in R with that object. There have been a lot of advancements to deal with this, however.
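To make the memory point concrete, here is a minimal sketch in R. It measures how much memory a small vector occupies and then estimates the footprint of a larger table before you try to load it. The table dimensions (1.5 million rows by 120 columns) are a made-up example, not figures from this article.

```r
# A numeric vector of one million doubles lives entirely in RAM.
x <- numeric(1e6)

# object.size() estimates the memory the object occupies, in bytes.
size_bytes <- as.numeric(object.size(x))
cat("x occupies about", format(object.size(x), units = "auto"), "\n")

# Hypothetical data set: 1.5 million rows by 120 numeric columns.
# Each double takes 8 bytes, so the raw data alone needs about:
rows <- 1.5e6
cols <- 120
approx_gb <- rows * cols * 8 / 2^30
cat("Estimated size:", round(approx_gb, 2), "GB\n")
```

If the estimate is close to, or larger than, the memory in your machine, the object will not fit, and R typically needs some working memory on top of the raw data as well.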

 

Progress on the memory limitation has come both from the R language and from the hardware side: you can now buy computers with tremendous amounts of memory, so some of those problems have been resolved simply by improvements in technology. Still, as we enter the big data era, where you have larger and larger data sets, the model of loading objects into physical memory can be a limitation.

The basic R system is divided into two conceptual parts. There is the base R system that you download from CRAN, the Comprehensive R Archive Network, and then there is everything else. The base system contains what’s called the base package, which has all the low-level fundamental functions that you need to run the R system. The base system also contains other packages, including utils, stats, datasets, graphics and a number of other fundamental packages that more or less everyone might use. Then there is a series of recommended packages, such as boot for bootstrapping, class for classification, cluster, codetools, and a variety of others. These may not be critical packages, but they’re commonly used by many people.

Right now there are about 4,000 packages that have been developed by users and programmers all around the world. These packages are user contributed. They’re not controlled by the R core group, and new ones are uploaded to CRAN on a daily basis.
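You can see this split on your own installation. The sketch below lists the base and recommended packages by the priority R records for each one, and shows how a user-contributed package would be installed. It assumes a standard R installation; the package name "ggplot2" is only an illustrative choice, and the install call is commented out because it needs network access.

```r
# Packages that ship with the base system, selected by recorded priority.
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)

# Recommended packages (boot, class, cluster, ...), if this installation has them.
recommended_pkgs <- rownames(installed.packages(priority = "recommended"))
print(recommended_pkgs)

# Attach a recommended package only when it is actually installed.
if (requireNamespace("boot", quietly = TRUE)) {
  library(boot)
}

# A user-contributed package comes from CRAN and needs network access,
# so the call is left commented out; "ggplot2" is just an example name.
# install.packages("ggplot2")
```

Some minimal installations leave out the recommended packages, which is why the sketch checks for boot before attaching it.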

There are several documents that you can find on the R website.

  • An Introduction to R, which is a relatively long PDF document about the basics of how to use R and the language.
  • Writing R Extensions, which is really only useful to read if you’re thinking of developing R packages.
  • R Data Import/Export, which is useful for the various ways of getting data into and out of R.
  • R Installation and Administration, which is most useful if you want to build R from the source code.
  • R Internals, which is a really technical document about how R is designed and how it is implemented at a very low level.

 

So, that was a brief overview of R and its history.
