Blogs or Expert Columns
Hive Plays Well with JSON
Hive is an abstraction on Hadoop MapReduce. It provides a SQL-like interface for querying HDFS data, which accounts for most of its popularity. In Hive, table-structured data in HDFS is encapsulated in a table, as in an RDBMS. The DDL for table creation in Hive looks very similar to table creation DDL in an RDBMS. In one of my recent projects, I needed to store and query JSON-formatted hierarchical data. Hive works well with flat, record-structured data. I wanted to find out how Hive handles hierarchically structured data. I found out that [...]
Removing Duplicates from Order Data Using Spark
If you work with data, there is a high probability that you have run into duplicate data in your data set. Removing duplicates in Big Data is a computationally intensive process, and parallel cluster processing with Hadoop or Spark becomes a necessity. In this post we will focus on deduplication based on exact match, whether for the whole record or a set of specified key fields. Deduplication can also be performed based on fuzzy matching. We will address deduplication for flat record-oriented data only. The Spark-based implementation is available in my open source project chombo. There is a corresponding Hadoop based [...]
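As a rough illustration of exact-match deduplication on either the whole record or a set of key fields, here is a minimal single-machine sketch in plain Python. It is not the chombo implementation; the function name `dedupe`, the `key_fields` parameter and the sample order records are all hypothetical, and a cluster implementation would group records by the same key across partitions instead of using an in-memory set.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def dedupe(records: Iterable[Tuple[str, ...]],
           key_fields: Optional[Sequence[int]] = None) -> List[Tuple[str, ...]]:
    """Keep the first record seen for each key.

    When key_fields is None the whole record is the key; otherwise the key
    is built from the field positions listed in key_fields.
    """
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec) if key_fields is None else tuple(rec[i] for i in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical order records: (order id, customer id, amount)
orders = [
    ("o1", "c7", "2.50"),
    ("o1", "c7", "2.50"),  # exact duplicate of the whole record
    ("o1", "c7", "3.00"),  # same order id and customer id, different amount
    ("o2", "c9", "8.00"),
]

whole_record = dedupe(orders)               # match on every field
by_key = dedupe(orders, key_fields=[0, 1])  # match on order id and customer id only
```

Matching on key fields is the stricter of the two: the third record survives whole-record matching but is dropped when only the order id and customer id are compared.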
Storing Nested Objects in Cassandra with Composite Columns
One of the popular features of MongoDB is the ability to store arbitrarily nested objects and to index on any nested field. In this post I will show how to store nested objects in Cassandra using composite columns. I have recently added this feature to my open source Cassandra project agiato. In Cassandra, as in many other NoSQL databases, stored data is highly denormalized. The denormalized data often manifests itself in the form of a nested object, e.g., when denormalizing one-to-many relations. In the solution presented here, the object data is stored in [...]
Data Normalization with Spark
Data normalization is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range. Unless the data is normalized, these algorithms don’t behave correctly. In this post, we will go through various data normalization techniques, as implemented on Spark. To provide some context, we will also discuss how different supervised learning algorithms are negatively impacted by lack of normalization. The Spark-based implementation is available in my open source project chombo. There is also a Hadoop-based implementation in the same project. [...]
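To make "bringing all the attribute values within some desired range" concrete, the sketch below shows two common normalization techniques, min-max scaling and z-score standardization, in plain Python. The excerpt does not say which techniques the post covers, so these two are an assumption; the function names and the sample values are hypothetical, and this is not the chombo code.

```python
from statistics import mean, pstdev


def min_max(values, lo=0.0, hi=1.0):
    """Linearly rescale values into the range [lo, hi]."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # All values are identical; map them to the lower bound.
        return [lo for _ in values]
    return [lo + (v - v_min) * (hi - lo) / span for v in values]


def z_score(values):
    """Center values on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute values on an arbitrary scale
incomes = [20.0, 40.0, 60.0, 100.0]

scaled = min_max(incomes)        # every value now lies in [0, 1]
standardized = z_score(incomes)  # mean 0, standard deviation 1
```

Min-max scaling gives a bounded range but is sensitive to extreme values, while z-score standardization is unbounded but keeps the relative spread of the data.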
Anomaly Detection with Robust Zscore
Anomaly detection techniques based on statistical modeling are simple and effective. The Zscore based technique is one among them. Zscore is defined as the absolute difference between a data value and the mean, normalized by the standard deviation. A data point with a Zscore value above some threshold is considered to be a potential outlier. One criticism against Zscore is that it’s prone to be influenced by outliers. To remedy that, a technique called robust Zscore can be used, which is much more tolerant of outliers. In this post, I will go over a robust Zscore based implementation on Hadoop [...]
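One widely used formulation of a robust Zscore replaces the mean with the median and the standard deviation with the median absolute deviation (MAD). The sketch below assumes that formulation, together with the conventional scaling constant 0.6745 and a threshold of 3.5; the excerpt does not state which variant or threshold the post uses, and the function name and sample data are hypothetical.

```python
from statistics import median


def robust_z_scores(values):
    """Return |x - median| scaled by the median absolute deviation (MAD).

    The constant 0.6745 makes the score comparable to an ordinary z-score
    when the data is roughly normally distributed.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the robust z-score is undefined for this data")
    return [0.6745 * abs(v - med) / mad for v in values]


# Hypothetical readings with one obvious outlier
readings = [10, 12, 11, 13, 12, 95]
threshold = 3.5  # assumed cutoff, not taken from the post

scores = robust_z_scores(readings)
outliers = [v for v, s in zip(readings, scores) if s > threshold]
```

Because the median and MAD barely move when a single extreme value is added, the outlier cannot mask itself by inflating the statistics it is measured against.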
Bulk Insert, Update and Delete in Hadoop Data Lake
A Hadoop Data Lake, unlike a traditional data warehouse, does not enforce schema on write and serves as a repository of data in different formats from various sources. If the data collected in a data lake is immutable, it simply accumulates in an append-only fashion and is easy to handle. Such data tends to be fact data, e.g., user behavior tracking data or sensor data. However, dimension data or master data, e.g., customer data or product data, will typically be mutable. Generally it arrives in batches from some external source system, reflecting incremental inserts, updates and deletes in the [...]
Handling Categorical Feature Variables in Machine Learning using Spark
Categorical feature variables, i.e., feature variables with a fixed set of unique values, appear in the training data set for many real-world problems. However, categorical variables pose a serious problem for many Machine Learning algorithms. Some examples of such algorithms are Logistic Regression, Support Vector Machine (SVM) and any Regression algorithm. In this post we will go over a Spark-based solution to alleviate the problem. The solution implementation can be found in my open source projects chombo and avenir. We will be using CRM data as the use case. Categorical Feature Variable Problem: The underlying data type of a categorical [...]
Combating High Cardinality Features in Supervised Machine Learning
A typical training data set for real-world machine learning problems has a mixture of different types of data, including numerical and categorical. Many machine learning algorithms cannot handle categorical variables. Even for those that can, categorical data can pose a serious problem if it has high cardinality, i.e., too many unique values. In this post we will go through a technique to convert high cardinality categorical attributes to numerical values, based on how the categorical variable correlates with the class or target variable. The MapReduce implementations are available in my open source projects avenir and chombo. Categorical Variables: Some Machine [...]
Ruling with Drools Rule Engine
In a project several years ago I built a rule engine from scratch. In a recent project, which needed a rule engine, I decided to take a different route. I decided to give the Drools rule engine from JBoss a try. It has worked out well so far. In this post, I will share my experience with it. I will use insurance as an example to demonstrate how to use it. Why Rule Engine: You should seriously consider a rule engine when the following conditions apply: you have complex business logic; business logic changes often; you want to provide visibility of the business [...]
How to build your own AlphaZero AI using Python and Keras
Teach a machine to learn Connect4 strategy through self-play and deep learning. In this article I’ll attempt to cover three things: two reasons why AlphaZero is a massive step forward for Artificial Intelligence; how you can build a replica of the AlphaZero methodology to play the game Connect4; and how you can adapt the code to plug in other games. AlphaGo → AlphaGo Zero → AlphaZero: In March 2016, DeepMind’s AlphaGo beat 18-time world champion Go player Lee Sedol 4–1 in a series watched by over 200 million people. A machine had learnt [...]
Uber’s Big Data Platform: 100+ Petabytes with Minute Latency
By Reza Shiftehfar. Uber is committed to delivering safer and more reliable transportation across our global markets. To accomplish this, Uber relies heavily on making data-driven decisions at every level, from forecasting rider demand during high traffic events to identifying and addressing bottlenecks in our driver-partner sign-up process. Over time, the need for more insights has resulted in over 100 petabytes of analytical data that needs to be cleaned, stored, and served with minimum latency through our Hadoop-based Big Data platform. Since 2014, we have worked to develop a Big Data solution that ensures [...]
Andrew Ng launches ‘AI for Everyone’, a new Coursera program aimed at business professionals
Andrew Ng, a computer scientist who led Google’s AI division, Google Brain, and formerly served as vice president and chief scientist at Baidu, is a veritable celebrity in the artificial intelligence (AI) industry. After leaving Baidu, he debuted an online curriculum of classes centered around machine learning — Deeplearning.ai — and soon after launched Landing.ai, a startup with the goal of “revitalizing manufacturing through AI.” (One of its first partners was Taiwanese company Foxconn, which produces the [...]
Third Eye Data Unveils Safera Crime Analysis System
Innovation continues to flow from Third Eye Consulting Services & Solutions, which just announced the release of its powerful Safera crime analysis and prediction system. Billed as a solution for “ushering in a Safe Era for Businesses and Communities”, the system harnesses powerful predictive analytical tools to help combat crime. Safera has quickly become a vital resource for public safety officials, law enforcement agencies, and corporate users to detect and predict suspicious activities, crime, or fraud before it happens. “Our secure system architecture gives Safera tremendous power and utility,” says Dj Das, CEO and founder of the consulting firm [...]
The Hybrid Cloud Market Just Got A Heck Of A Lot More Compelling
Let’s start with a basic premise that the vast majority of the world’s workloads remain in private data centers. Cloud infrastructure vendors are working hard to shift those workloads, but technology always moves a lot slower than we think. That is the lens through which many cloud companies operate. The idea that you operate both on-prem and in the cloud with multiple vendors is the whole idea behind the notion of the hybrid cloud. It’s where companies like Microsoft, IBM, Dell and Oracle are placing their bets. These [...]
Using Presto in our Big Data Platform on AWS
At Netflix, the Big Data Platform team is responsible for building a reliable data analytics platform shared across the whole company. In general, Netflix product decisions are very data driven. So we play a big role in helping different teams gain product and consumer insights from a multi-petabyte scale data warehouse (DW). Their use cases range from analyzing A/B test results to analyzing user streaming experience to training data models for our recommendation algorithms. We shared our overall architecture in a previous blog post. The underpinning of our big data platform is that we leverage AWS S3 for our [...]
Outlier Detection using Apache Spark Solution
Sometimes an outlier is defined with respect to a context. Whether a data point should be labeled as an outlier depends on the associated context. For a bank ATM, transactions that are considered normal between 6 AM and 10 PM may be considered anomalous between 10 PM and 6 AM. In this case, the context is the hour of the day. In this post, we will go through some contextual outlier detection techniques based on statistical modeling of the data. The Spark-based implementation is available in the open source projects on GitHub [...]
Microsoft’s AI Roadmap
Digital transformation is in full effect, and giants of the tech industry are investing heavily in new technologies. Due to its nearly limitless potential, artificial intelligence is at the forefront of much of this research, and Microsoft has been making headlines with new technologies, major acquisitions, and innovative ideas. The tech giant has long been moving toward a cloud-based future, and investment in AI is helping solidify its path toward becoming the AI leader in a number of fields. Here are a few technologies Microsoft has invested in recently and the potential impact they’ll have on the company’s future [...]
Harnessing Machine Learning for Anomaly Detections in Web Server Logs
Detecting Anomaly in Web Server Logs with Microsoft Azure Cloud – For FREE and at one-tenth the effort! Every website has web server logs which record the intricate details of site visitors – their browsing behaviors, clicks, actions etc. Web server logs soon become very large and bloated as they log all this information, one record per line. Within this maze of data lie deep secrets about the website. Secrets like what site visitors are actually doing on the website, how well the server is responding to the requests from site visitors, what are the actions that site [...]
Web Server Logs
Web Server logs are server log files of a web server. A server log is a log file (or several files) automatically created and maintained by a server consisting of a list of activities it performed. A typical example is a web server log which maintains a history of page requests. The W3C maintains a standard format (the Common Log Format) for web server log files, but other proprietary formats exist. More recent entries are typically appended to the end of the file. Information about the request, including client IP address, request date/time, page requested, HTTP code, bytes served, [...]
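As a hedged illustration of the fields listed above, the sketch below parses one line in the Common Log Format with a regular expression. The pattern, the function name `parse_clf` and the sample line are assumptions made for illustration; real logs often use extended or proprietary formats that this pattern will not match.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Parse one Common Log Format line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Month abbreviations (%b) are parsed using the current locale's names.
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # A "-" in the size field means no bytes were returned.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
entry = parse_clf(sample)
```

Returning `None` for lines that do not match lets a caller count and skip malformed records instead of failing on the first one.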
Marmaray: An Open Source Generic Data Ingestion and Dispersal Framework and Library for Apache Hadoop
Connecting users worldwide on our platform all day, every day requires an enormous amount of data management. When you consider the hundreds of operations and data science teams analyzing large sets of anonymous, aggregated data, using a variety of different tools to better understand and maintain the health of our dynamic marketplace, this challenge is even more daunting. Three years ago, Uber adopted the open source Apache Hadoop framework as its data platform, making it possible to manage petabytes of data across computer clusters. However, given our many teams, tools, and data sources, we needed a way to reliably [...]
How Search Engines Use Machine Learning: 9 Things We Know for Sure
When we first started hearing about machine learning in the early 2010s, it seemed scary. But once it was explained to us (and we realized how technology is already being used to provide us with solutions), we started to get down to the practical questions: How are search engines using machine learning? How will it affect SEO? Machine learning is essentially using algorithms to calculate trends, value, or other characteristics of specific things based on historical data. Google has even declared itself a machine learning-first company. If you want to learn more about the tactical side of this technology, Eric Enge has a great [...]
Step into A New Era Of Public Service With Smarter Infrastructure
This year at Smart City Expo World Congress, the industry-leading event for urbanization, Microsoft will bring city leaders and solution experts from around the world to demonstrate innovative technologies currently empowering the digital transformation of smart cities. This blog post is the third in a series about how a connected city, powered by our intelligent cloud and intelligent edge platform, will help you step into a new era of public service. Join Microsoft and its partners at SCEWC 2018. By 2050, the UN projects that 68% of the world population will live in cities or urban areas. To keep pace with urbanization, innovative cities use technology like the [...]
Facial Recognition Tech Is Ready for Its Post-Phone Future
One year ago, Craig Federighi opened his eyes, stared into the brand-new iPhone X, and showed the world how he could unlock it with his face. Or, at least, he tried. It took the Apple executive a few attempts and one back-up phone to get the screen to unlock without a fingerprint or a passcode. But then, like magic, he was in. This was Apple’s annual fall hardware show, where the company dangles its newest iPhones before the world and sets the tone for consumer products to come. Executives danced around the stage to show off the iPhone X's seemingly endless [...]
Anheuser-Busch InBev brews up game-changing business solutions with Microsoft Azure
Anheuser-Busch InBev, headquartered in Leuven, Belgium, isn’t just a beverage company, it’s a technology company. From its Beer Garage in Silicon Valley to its Global Analytics Center in Bengaluru, India, the company known as AB InBev is pushing the innovation envelope. The company is using technology to drive commercial and operational growth and increase sustainability by moving its IT operations to the cloud, and it is gaining more significant insights into business operations by breaking down data silos and building a global analytics platform. AB InBev chose Microsoft Azure as the best platform to support these game-changing advances. With [...]