Blogs or Expert Columns
ThirdEye Data launches 3 new Open Source solutions for Anomaly Detection and Predictive Analytics
Over the past 20 years, the Open Source Software (OSS) movement has given developers and programmers the freedom to experiment, innovate, and become more efficient. Part of the digital transformation that’s been facilitated by OSS has also allowed programmers to leverage Machine Learning to develop vital solutions for Anomaly Detection and Predictive Analytics. And so, ThirdEye Data has decided it’s time to make its own contribution to the OSS community by giving away 3 Artificial Intelligence-powered Open Source Software solutions that help businesses gain Anomaly [...]
What’s Driving the Cloud Data Warehouse Explosion?
The advent of powerful data warehouses in the cloud is changing the face of big data analytics, as companies move their workloads into the cloud. According to analysts and cloud executives, the phenomenon is accelerating, thanks largely to the potential to save large sums of money, analyze even bigger data sets, and eliminate the hassle of managing on-premise clusters. Amazon Web Services is largely credited with kicking off the cloud data warehousing (CDW) wave with Redshift. Since launching it in 2012, AWS has attracted 6,500 customers to Redshift and remains the company [...]
Embrace The Noise: A Case Study Of Text Annotation For Medical Imaging
In this post we'll discuss the recent paper TextRay: Mining Clinical Reports to Gain a Broad Understanding of Chest X-rays, focusing on the best practices the paper exemplifies with regard to labeling text data for NLP. What Is TextRay? TextRay was written by a team from Zebra Medical Vision, a for-profit company that applies deep learning to medical imaging. One of the core challenges of the medical imaging space is acquiring the labeled images (such as X-rays) needed to train models. The TextRay paper expands a core insight from a [...]
Deep Learning: Which Loss and Activation Functions should I use?
The purpose of this post is to provide guidance on which combination of final-layer activation function and loss function should be used in a neural network, depending on the business goal. This post assumes that the reader has knowledge of activation functions. An overview of these can be seen in the prior post: Deep Learning: Overview of Neurons and Activation Functions. What are you trying to solve? As with all machine learning problems, the business goal determines how you should evaluate its success. Are you trying to predict a numerical value? Examples: [...]
Bulk Mutation in an Integration Data Lake with Spark
Data lakes act as a repository of data from various sources, possibly in different formats. They can be used to build a data warehouse or to perform other data analysis activities. Data lakes are generally built on top of the Hadoop Distributed File System (HDFS), which is append only. HDFS is essentially a WORM file system, i.e. Write Once and Read Many Times. In an integration scenario, however, your source data streams may have updates and deletes. This post is about performing updates and deletes in an HDFS backed data lake. The Spark based solution is available in my open source project chombo. Virtual [...]
Learning Alarm Threshold from User Feedback using Decision Tree on Spark
Alarm fatigue is a phenomenon where someone is exposed to a large number of alarms, becomes desensitized to them and starts ignoring them. It’s been reported that security professionals ignore 32% of alarms because they are thought to be false. This kind of sensory overload can happen with monitoring systems in various domains, e.g. computer systems and networks, industrial monitoring systems and medical patient monitoring systems. Typically alarm flooding happens when alarm threshold levels are not set properly. How do we know what the proper alarm threshold level should be? That is the problem we will [...]
Pluggable Rule Driven Data Validation with Spark
Data validation is an essential component in any ETL data pipeline. As we all know, most Data Engineers and Scientists spend most of their time cleaning and preparing their data before they can even get to the core processing of the data. In this post we will go over a pluggable rule driven data validation solution implemented on Spark. Earlier I had posted about the same solution implemented on Hadoop. This post can be considered as a sequel to the earlier post. The solution is available in my open source project chombo on github. Anatomy of a Validator A validator [...]
Improving Elastic Search Query Result with Query Expansion using Topic Modeling
Query expansion is a process of reformulating a query to improve query results and, more specifically, to improve the recall for a query. Topic modeling is a Natural Language Processing (NLP) technique to discover hidden topics or concepts in documents. We will be going through a Query Expansion technique based on Topic Modeling. The solution is based on the Latent Dirichlet Allocation (LDA) algorithm as implemented in the Python gensim library. LDA is a popular Topic Modeling algorithm. The implementation is available from my open source project avenir on github. It provides a user friendly wrapper class around the gensim LDA implementation. Query Expansion Query expansion is the technique of expanding the [...]
Cassandra Range Query Made Simple
In Cassandra, rows are hash partitioned by default. If you want data sorted by some attribute, the column name sorting feature of Cassandra is usually exploited. If you look at the Cassandra slice range API, you will find that you can specify only the range start, range end and an upper limit on the number of columns fetched. However, in many applications the need is to paginate through the data, i.e. each call should fetch a predetermined number of items. There is no easy way to map the desired number of items to be returned to the column name [...]
Hive Plays Well with JSON
Hive is an abstraction on Hadoop Map Reduce. It provides a SQL like interface for querying HDFS data, which accounts for most of its popularity. In Hive, table structured data in HDFS is encapsulated with a table as in an RDBMS. The DDL for table creation in Hive looks very similar to table creation DDL in an RDBMS. In one of my recent projects, I had a need for storing and querying JSON formatted hierarchical data. Hive works well with flat record structured data. I wanted to find out how Hive handles hierarchically structured data. I found out that [...]
Removing Duplicates from Order Data Using Spark
If you work with data, there is a high probability that you have run into duplicate data in your data set. Removing duplicates in Big Data is a computationally intensive process, and parallel cluster processing with Hadoop or Spark becomes a necessity. In this post we will focus on deduplication based on exact match, whether for the whole record or a set of specified key fields. Deduplication can also be performed based on fuzzy matching. We will address deduplication for flat record oriented data only. The Spark based implementation is available in my open source project chombo. There is a corresponding Hadoop based [...]
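As a rough illustration of the exact-match deduplication described in the excerpt above, here is a minimal single-process Python sketch. It is not the chombo implementation and does not use Spark; the function name `dedupe` and the `key_indexes` parameter are hypothetical names chosen for this example.

```python
def dedupe(records, key_indexes=None):
    """Keep the first occurrence of each record.

    records: list of flat records (lists or tuples of fields).
    key_indexes: hypothetical parameter; field positions that form the
    match key. When None, the whole record is the key.
    """
    seen = set()
    unique = []
    for rec in records:
        if key_indexes is None:
            key = tuple(rec)
        else:
            key = tuple(rec[i] for i in key_indexes)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


orders = [
    ("o1", "cust7", "29.99"),
    ("o1", "cust7", "29.99"),   # exact duplicate of the first record
    ("o2", "cust7", "15.00"),
    ("o2", "cust9", "15.00"),   # same order id, different customer
]

whole_record = dedupe(orders)
by_order_id = dedupe(orders, key_indexes=[0])
```

Matching on the whole record removes only the exact copy, while matching on the first field alone also collapses records that share an order id. In a cluster setting the same idea is usually expressed as grouping by the key and keeping one record per group.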
Storing Nested Objects in Cassandra with Composite Columns
One of the popular features of MongoDB is the ability to store arbitrarily nested objects and be able to index on any nested field. In this post I will show how to store nested objects in Cassandra using composite columns. I have recently added this feature to my open source Cassandra project agiato. In Cassandra, as in many other NoSQL databases, stored data is highly denormalized. The denormalized data often manifests itself in the form of a nested object, e.g. when denormalizing one to many relations. In the solution presented here, the object data is stored in [...]
Data Normalization with Spark
Data normalization is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range. Unless the data is normalized, these algorithms don’t behave correctly. In this post, we will go through various data normalization techniques, as implemented on Spark. To provide some context, we will also discuss how different supervised learning algorithms are negatively impacted by a lack of normalization. The Spark based implementation is available in my open source project chombo. There is also a Hadoop based implementation in the same project. [...]
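To make "bringing all the attribute values within some desired range" concrete, here is a minimal Python sketch of two common normalization techniques, min-max scaling and z-score scaling. The excerpt does not say which techniques the post covers, so treating these two as representative is an assumption, and the function names are hypothetical. This is plain Python, not the Spark implementation in chombo.

```python
from statistics import mean, pstdev


def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly map values so the minimum becomes lo and the maximum hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


def zscore_scale(values):
    """Center values on the mean and divide by the population std deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("zero standard deviation; z-score scaling is undefined")
    return [(v - mu) / sigma for v in values]


ages = [2, 4, 6]
scaled = min_max_scale(ages)
standardized = zscore_scale(ages)
```

Min-max scaling guarantees a fixed output range but is sensitive to extreme values, whereas z-score scaling has no fixed range but keeps features on a comparable spread.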
Anomaly Detection with Robust Zscore
Anomaly detection with various statistical modeling based techniques is simple and effective. The Zscore based technique is one among them. Zscore is defined as the absolute difference between a data value and the mean, normalized with the standard deviation. A data point with a Zscore value above some threshold is considered to be a potential outlier. One criticism against Zscore is that it’s prone to be influenced by outliers. To remedy that, a technique called robust Zscore can be used, which is much more tolerant of outliers. In this post, I will go over a robust Zscore based implementation on Hadoop [...]
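The robust Zscore mentioned in the excerpt above is commonly computed by replacing the mean with the median and the standard deviation with the median absolute deviation (MAD). The following is a minimal Python sketch of that common formulation. Whether the post's Hadoop implementation uses exactly this form is an assumption; the 0.6745 scale factor and the 3.5 threshold are conventional choices, not values taken from the post, and the function names are hypothetical.

```python
from statistics import median


def robust_zscores(values, scale=0.6745):
    """Robust z-score: scale * |x - median| / MAD for each value."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; robust z-score is undefined")
    return [scale * abs(v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Return values whose robust z-score exceeds the threshold."""
    scores = robust_zscores(values)
    return [v for v, z in zip(values, scores) if z > threshold]


readings = [10, 11, 9, 10, 12, 10, 95]
outliers = flag_outliers(readings)
```

Because the median and MAD barely move when a single extreme value is present, the extreme reading cannot mask itself by inflating the statistics used to score it, which is the weakness of the ordinary Zscore.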
Bulk Insert, Update and Delete in Hadoop Data Lake
A Hadoop Data Lake, unlike a traditional data warehouse, does not enforce schema on write and serves as a repository of data with different formats from various sources. If the data collected in a data lake is immutable, it simply accumulates in an append only fashion and is easy to handle. Such data tends to be fact data, e.g. user behavior tracking data or sensor data. However, dimension data or master data, e.g. customer data or product data, will typically be mutable. Generally it arrives in batches from some external source system, reflecting incremental inserts, updates and deletes in the [...]
Handling Categorical Feature Variables in Machine Learning using Spark
Categorical feature variables, i.e. feature variables with a fixed set of unique values, appear in the training data set for many real world problems. However, categorical variables pose a serious problem for many Machine Learning algorithms. Some examples of such algorithms are Logistic Regression, Support Vector Machine (SVM) and any Regression algorithm. In this post we will go over a Spark based solution to alleviate the problem. The solution implementation can be found in my open source projects chombo and avenir. We will be using CRM data as the use case. Categorical Feature Variable Problem The underlying data type of a categorical [...]
Combating High Cardinality Features in Supervised Machine Learning
A typical training data set for real world machine learning problems has a mixture of different types of data, including numerical and categorical. Many machine learning algorithms cannot handle categorical variables. For those that can, categorical data can still pose a serious problem if it has high cardinality, i.e. too many unique values. In this post we will go through a technique to convert high cardinality categorical attributes to numerical values, based on how the categorical variable correlates with the class or target variable. The Map Reduce implementations are available in my open source projects avenir and chombo. Categorical Variables Some Machine [...]
Ruling with Drools Rule Engine
In a project several years ago I built a rule engine from scratch. In a recent project, which needed a rule engine, I decided to take a different route. I decided to give the Drools rule engine from JBoss a try. It has worked out well so far. In this post, I will share my experience with it. I will use insurance as an example to demonstrate how to use it. Why Rule Engine You should seriously consider a rule engine when the following conditions apply. You have complex business logic. Business logic changes often. You want to provide visibility of the business [...]
How to build your own AlphaZero AI using Python and Keras
Teach a machine to learn Connect4 strategy through self-play and deep learning. In this article I’ll attempt to cover three things: two reasons why AlphaZero is a massive step forward for Artificial Intelligence; how you can build a replica of the AlphaZero methodology to play the game Connect4; and how you can adapt the code to plug in other games. AlphaGo → AlphaGo Zero → AlphaZero In March 2016, DeepMind’s AlphaGo beat 18-time world champion Go player Lee Sedol 4–1 in a series watched by over 200 million people. A machine had learnt [...]
Uber’s Big Data Platform: 100+ Petabytes with Minute Latency
By Reza Shiftehfar. Uber is committed to delivering safer and more reliable transportation across our global markets. To accomplish this, Uber relies heavily on making data-driven decisions at every level, from forecasting rider demand during high traffic events to identifying and addressing bottlenecks in our driver-partner sign-up process. Over time, the need for more insights has resulted in over 100 petabytes of analytical data that needs to be cleaned, stored, and served with minimum latency through our Hadoop-based Big Data platform. Since 2014, we have worked to develop a Big Data solution that ensures [...]
Andrew Ng launches ‘AI for Everyone’, a new Coursera program aimed at business professionals
Andrew Ng, a computer scientist who led Google’s AI division, Google Brain, and formerly served as vice president and chief scientist at Baidu, is a veritable celebrity in the artificial intelligence (AI) industry. After leaving Baidu, he debuted an online curriculum of classes centered around machine learning (Deeplearning.ai) and soon after launched Landing.ai, a startup with the goal of “revitalizing manufacturing through AI.” (One of its first partners was Taiwanese company Foxconn, which produces the [...]
Third Eye Data Unveils Safera Crime Analysis System
Innovation continues to flow from Third Eye Consulting Services & Solutions, which just announced the release of its powerful Safera crime analysis and prediction system. Billed as a solution for “ushering in a Safe Era for Businesses and Communities”, the system harnesses powerful predictive analytical tools to help combat crime. Safera has quickly become a vital resource for public safety officials, law enforcement agencies, and corporate users to detect and predict suspicious activities, crime, or fraud before it happens. “Our secure system architecture gives Safera tremendous power and utility,” says Dj Das, CEO and founder of the consulting firm [...]
The Hybrid Cloud Market Just Got A Heck Of A Lot More Compelling
Let’s start with a basic premise that the vast majority of the world’s workloads remain in private data centers. Cloud infrastructure vendors are working hard to shift those workloads, but technology always moves a lot slower than we think. That is the lens through which many cloud companies operate. The idea that you operate both on-prem and in the cloud with multiple vendors is the whole idea behind the notion of the hybrid cloud. It’s where companies like Microsoft, IBM, Dell and Oracle are placing their bets. These [...]
