Hive Plays Well with JSON
Hive Plays Well with JSON Hive is an abstraction on Hadoop MapReduce. It provides a SQL-like interface for querying HDFS data, which accounts for most of its popularity. In Hive, table structured data [...]
Removing Duplicates from Order Data Using Spark If you work with data, there is a high probability that you have run into duplicate data in your data set. Removing duplicates in Big Data is [...]
Storing Nested Objects in Cassandra with Composite Columns One of the popular features of MongoDB is the ability to store arbitrarily nested objects and to index on any nested field. In this post I will [...]
Data Normalization with Spark Data normalization is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the [...]
Anomaly Detection with Robust Zscore Anomaly detection techniques based on statistical modeling are simple and effective. The Zscore based technique is one of them. Zscore is defined as the absolute difference between [...]
Bulk Insert, Update and Delete in Hadoop Data Lake A Hadoop Data Lake, unlike a traditional data warehouse, does not enforce schema on write and serves as a repository of data in different formats from various sources. If [...]
Handling Categorical Feature Variables in Machine Learning using Spark Categorical feature variables, i.e. feature variables with a fixed set of unique values, appear in the training data set for many real world problems. However, categorical variables [...]
Combating High Cardinality Features in Supervised Machine Learning A typical training data set for real world machine learning problems has a mixture of different types of data, including numerical and categorical. Many machine learning algorithms cannot [...]
In a project several years ago I built a rule engine from scratch. In a recent project, which needed a rule engine, I decided to take a different route. I decided to give the Drools rule engine [...]
Auto Training and Parameter Tuning for a ScikitLearn based Model for Leads Conversion Prediction This is a sequel to my last blog on CRM leads conversion prediction using Gradient Boosted Trees as implemented in ScikitLearn. The focus of [...]