21 April 2015

The Unscientific Analysis of Languages popular with Indian Startups

Well, all this started with this one tweet

And finally ended with this one

And I had the entire dump of Hasjob postings. It was pretty cool of Kiran to send them across to me, saving me the time and effort of scraping that data. At that time I had very little idea what I would do with it. I know a little R, and this was the moment when I thought I could put that little knowledge to use.

So I got on with it.

Step 1: Set up R on my system
Step 2: Write some code to extract the data and cleanse it
Step 3: Generate the counts for words
Step 4: Manually pick up the technology words with counts
Step 5: Generate the image with language popularity

So as it stands, the top 5 technologies in demand at Indian startups are:

1. PHP
2. Android
3. Ruby
4. iOS
5. JavaScript


Surprised? No? At least I am, because the one technology no one talks about, PHP, seems to be the one most used by Indian startups. The rest sound very reasonable to me. What do you guys think?

The following is the code I wrote to extract the results. Let me know if I am missing something.

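# One-time setup: install the packages used below
install.packages("tm")
install.packages("RColorBrewer")
library(NLP)
library(tm)            # text mining: Corpus, tm_map, TermDocumentMatrix
library(RColorBrewer)  # colour palettes for the final image
library(MASS)          # provides write.matrix, used further down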

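# hasjob.content is assumed to be the already loaded Hasjob dump,
# with one job posting per row and its title in the headline column
corpus <- Corpus(VectorSource(hasjob.content$headline))
corpus <- tm_map(corpus, content_transformer(tolower))    # lower-case everything
corpus <- tm_map(corpus, removePunctuation)               # strip punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop common English words

# Rows are terms, columns are postings; targetPath is the output directory
td.mat <- as.matrix(TermDocumentMatrix(corpus))
write.matrix(format(td.mat, scientific = FALSE),
             file = file.path(targetPath, "data.csv"), sep = ",")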
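The code above stops at writing the term-document matrix to disk, so here is a rough sketch of how Steps 3 to 5 could be done in R itself. It is only an illustration under a few assumptions: the small td.mat below is made-up toy data standing in for the real matrix, tech.words is a hypothetical hand-picked list (I picked the real one manually), and a plain base R bar chart stands in for the final image.

```r
# Toy stand-in for td.mat: rows are terms, columns are postings.
# These numbers are made up for illustration, not taken from the Hasjob dump.
td.mat <- matrix(
  c(1, 0, 1, 1,
    0, 1, 1, 0,
    1, 1, 1, 1,
    0, 0, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("php", "android", "developer", "ruby"),
                  paste0("doc", 1:4))
)

# Step 3: total count of each word across all postings, highest first
word.counts <- sort(rowSums(td.mat), decreasing = TRUE)

# Step 4: keep only the technology words
# (tech.words is a hypothetical list chosen by hand)
tech.words <- c("php", "android", "ruby")
tech.counts <- word.counts[names(word.counts) %in% tech.words]

# Step 5: draw the popularity chart
barplot(tech.counts, las = 2, col = "steelblue",
        main = "Technology mentions in job headlines")
```

With the real data, the toy matrix would simply be replaced by the td.mat built from the corpus, and tech.words by whatever technology names turn up in the counts.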