21 April 2015

The Unscientific Analysis of Languages popular with Indian Startups

Well all this started with this one tweet

And finally ended with this one

And I had the entire dump of Hasjob postings. It was pretty cool of Kiran to send them across, saving me the time and effort of scraping that data. At the time I had very little idea what I would do with it. I know a little R, and this seemed like the moment to put that knowledge to use.

So I got on with it.

Step 1: Set up R on my system.
Step 2: Write some code to extract the data and cleanse it
Step 3: Generate the counts for words
Step 4: Manually pick up the technology words with counts
Step 5: Generate the image with language popularity

So as it stands, the top 5 technologies in demand at Indian startups are

1. PHP
2. Android
3. Ruby
4. iOS
5. JavaScript


Surprised? No? At least I am, because the one technology no one talks about, PHP, seems to be heavily used by Indian startups. The rest sound very reasonable to me. What do you guys think?

The following is the code I wrote to extract the results. Let me know if I am missing something.

install.packages("tm")
install.packages("RColorBrewer")
library(NLP)
library(tm)
library(RColorBrewer)
library(MASS)  # provides write.matrix

# hasjob.content is the data frame holding the Hasjob dump;
# targetPath is the directory the CSV is written to.
corpus <- Corpus(VectorSource(hasjob.content$headline))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# One row per term, one column per headline
td.mat <- as.matrix(TermDocumentMatrix(corpus))
write.matrix(format(td.mat, scientific=FALSE),
               file = paste(targetPath, "data.csv", sep="/"), sep=",")
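
The R code above stops at writing the term-document matrix; Steps 3 and 4 (counting the words and picking out the technology terms) are not shown. As a rough illustration only, here is a minimal Python sketch of those two steps. The sample headlines, the tiny stopword list and the `TECH_WORDS` set are all made up for the example and are not taken from the Hasjob dump.

```python
import string
from collections import Counter

# Hypothetical stand-ins: a tiny stopword list and a hand-picked technology list.
STOPWORDS = {"a", "an", "and", "the", "for", "at", "in", "with", "to", "of"}
TECH_WORDS = {"php", "android", "ruby", "ios", "javascript", "python", "java"}


def count_words(headlines):
    """Lowercase, strip punctuation, drop stopwords, and count what remains."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for headline in headlines:
        words = headline.lower().translate(strip_punct).split()
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


def tech_counts(counts, tech_words=TECH_WORDS):
    """Keep only the hand-picked technology words, most frequent first."""
    picked = {w: n for w, n in counts.items() if w in tech_words}
    return sorted(picked.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample headlines, only to show the shape of the data.
sample = [
    "Senior PHP Developer",
    "Android developer for a funded startup",
    "PHP and JavaScript engineer",
]
print(tech_counts(count_words(sample)))
```

Stripping punctuation is simple but lossy: a term such as "C++" or "Node.js" loses its punctuation and may merge with other words, which is one more reason the analysis is unscientific.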

03 March 2015

Fabric - Just the tool for continuous deployment

Originally published on the Near Engineering blog

One of the big problems with scaling complex and tightly integrated system clusters is deploying changes. As the number of tools grows, making changes and deploying at scale get all the more complex. Logging into each system by hand becomes impossible as you grow from a small unknown startup to one building large-scale systems.
A few months back, we decided that we had hit the limit on the time we could spend just deploying our RTB and DMP offerings at scale. We needed a radical solution to get it right.
Let me get to our problem statement. We need to deploy a setup that involves a concoction of technologies such as HAProxy, Redis, Nginx, PHP-FPM, ZMQ, Kafka, Elasticsearch and Logstash. These are just the tools that must be configured when a new machine is deployed in our cluster. Besides these, all the PHP code needs to be configured with the locations of the services, which differ by region. I believe this is a fairly typical system for most companies with pan-geo setups.
We looked at various options.
The first two work on the pull model, while Fabric uses the push model.
Pull models are good for certain use cases, but they become a problem because you have to log in to each machine and set things up there. With the push model, once a machine is ready for SSH, it's pretty straightforward to set up a cluster.
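To make the push model concrete, here is a minimal sketch in plain Python rather than Fabric itself. A control machine walks a list of hosts and sends each one the same commands over SSH, so a target needs nothing beyond a reachable SSH daemon. The host names and commands are placeholders, and the `dry_run` flag is an assumption added so the sketch can be inspected without touching any machine.

```python
import subprocess


def build_ssh_argv(host, command):
    """Return the argv for running one shell command on one host over ssh."""
    return ["ssh", host, command]


def push(hosts, commands, dry_run=True):
    """Push every command to every host, in order, from the control machine."""
    planned = []
    for host in hosts:
        for command in commands:
            argv = build_ssh_argv(host, command)
            planned.append(argv)
            if not dry_run:
                # check=True stops the rollout at the first failing command.
                subprocess.run(argv, check=True)
    return planned


# Placeholder hosts and commands, purely illustrative.
for argv in push(["web1.example.com", "web2.example.com"], ["uptime"]):
    print(" ".join(argv))
```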
Fabric, for its part, is highly configurable. Code for Fabric is written in Python; you could also say that Fabric is just another Python library. This really gives you great control over the config files that need to be written on each node of your cluster.
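Because a fabfile is ordinary Python, a config file can be built with normal string handling before it is written to a node. The sketch below is one way that might look for region-specific service locations; the region names, host addresses and the `render_redis_settings` helper are hypothetical and not taken from our setup.

```python
from string import Template

# Hypothetical per-region service locations.
REGIONS = {
    "us-east": {"redis_host": "10.0.1.10", "redis_port": 6379},
    "ap-south": {"redis_host": "10.0.2.10", "redis_port": 6379},
}

# A PHP-style settings fragment; $$ is how string.Template escapes a literal $.
SETTINGS_TEMPLATE = Template(
    "<?php\n"
    "$$config['redis_host'] = '${redis_host}';\n"
    "$$config['redis_port'] = ${redis_port};\n"
)


def render_redis_settings(region):
    """Render the settings fragment for one region; raises KeyError if unknown."""
    return SETTINGS_TEMPLATE.substitute(REGIONS[region])


print(render_redis_settings("ap-south"))
```

The rendered string could then be handed to `cuisine.file_write`, the same call the Nginx snippet further down uses for its config files.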
Looking at all this, we decided that the base system setup, which involves setting up users and various system-level config, would be done via Puppet, as it's very effective at that. For setting up the various machines in the cluster, we decided to go with Fabric; its customisation and extensibility are what won us over. With Cuisine, Fabric was a breeze. From setting up Nginx and PHP-FPM to Redis to code, we made sure Fabric did it just right for us.
The only issue we faced was with deploying Kafka. Fabric cannot run programs in the background with '&', as mentioned in its FAQ. The workaround was simple with dtach.
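As a sketch of that workaround: `dtach -n` starts a program detached from the terminal, so the remote command returns immediately and the process keeps running after Fabric disconnects. The helper below only builds the command line; the Kafka paths are placeholders, and the commented call assumes Fabric 1.x, where `sudo` is imported from `fabric.api`.

```python
def dtach_command(command, socket_template="/tmp/dtach.XXXX"):
    """Build a shell line that starts `command` detached via `dtach -n`.

    `mktemp -u` only generates a unique socket path; dtach creates the socket.
    """
    return "dtach -n `mktemp -u {0}` {1}".format(socket_template, command)


# Placeholder paths; adjust to wherever Kafka lives on the node.
kafka_start = dtach_command(
    "/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties"
)
print(kafka_start)

# Inside a Fabric 1.x task this would be run as:
#     from fabric.api import sudo
#     sudo(kafka_start)
```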
I love the way Fabric works, and it has a good future in deployments.
Some important Fabric script snippets that we use are:
1. To set up the EPEL and Remi repositories
from fabric.api import puts, sudo  # Fabric 1.x API
from fabric.colors import green
import cuisine

def setup_packages():
    puts(green('Setup EPEL & Remi'))
    sudo("rpm -Uvh --replacepkgs http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm")
    sudo("rpm -Uvh --replacepkgs http://rpms.famillecollet.com/enterprise/remi-release-6.rpm")
2. To set up Nginx
def install_nginx():
    sudo("rpm -Uvh http://nginx.org/packages/rhel/6/x86_64/RPMS/nginx-1.6.2-1.el6.ngx.x86_64.rpm")
    # Move the stock default site out of the way
    if cuisine.file_exists("/etc/nginx/conf.d/default.conf"):
        sudo("mv /etc/nginx/conf.d/default.conf /etc/nginx/conf.d/default.conf.default")
    nginx_conf = '''PUT in your Nginx Conf here'''
    php_params = '''Put in your PHP Param settings here'''
    # Write both config files as root
    with cuisine.mode_sudo():
        cuisine.file_write("/etc/nginx/php_params", php_params)
        cuisine.file_write("/etc/nginx/conf.d/dmp.conf", nginx_conf)
3. To install PHP-FPM with PECL and XHProf
def install_php():
    sudo("yum -y --enablerepo=remi install php php-devel php-fpm php-pear php-pecl-xhprof")