Blogs or Expert Columns

Language Understanding (LUIS) in Azure

Language Understanding (LUIS) allows your application to understand what a person wants in their own words. Azure LUIS uses machine learning to allow developers to build applications that can receive user input in natural language and extract meaning from it. A client application that converses with the user can pass user input to a LUIS app and receive relevant, detailed information back. Several Microsoft technologies work with LUIS: the Bot Framework allows a chatbot to talk with a user via text input, and the Bing Speech API converts spoken language requests into text. Once converted to text, LUIS processes the requests. What is [...]
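
As a rough sketch of that request/response flow, the Python snippet below sends one utterance to a LUIS v2 endpoint and reads back the top-scoring intent; the region, app ID, and subscription key are placeholders you would replace with your own.

```python
import requests

# Hypothetical region, app ID, and key -- substitute your own LUIS values.
endpoint = "https://westus.api.cognitive.microsoft.com/luis/v2.0/apps/<app-id>"
params = {"subscription-key": "<your-key>", "q": "Book me a flight to Cairo"}

result = requests.get(endpoint, params=params).json()

# The JSON response carries the top-scoring intent and extracted entities.
print(result["topScoringIntent"]["intent"])
print(result["entities"])
```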

Amazon Machine Learning

Amazon Machine Learning (Amazon ML) is a robust, cloud-based service that makes it easy for developers of all skill levels to use machine learning technology. Amazon ML provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology. Once your models are ready, Amazon ML makes it easy to obtain predictions for your application using simple APIs, without having to implement custom prediction-generation code or manage any infrastructure. This section introduces the key concepts and terms that will help you understand what you need [...]
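
As a minimal sketch of those "simple APIs", the snippet below requests a real-time prediction through boto3; the model ID, endpoint URL, and feature record are hypothetical placeholders.

```python
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# Ask a trained (hypothetical) model for a real-time prediction.
response = ml.predict(
    MLModelId="ml-ExampleModelId",
    Record={"age": "34", "plan": "premium"},  # feature name -> string value
    PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
)
print(response["Prediction"])
```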

Amazon Lex

Amazon Lex is an AWS service for building conversational interfaces into any application using voice and text. With Amazon Lex, the same conversational engine that powers Amazon Alexa is now available to any developer, enabling you to build sophisticated natural-language chatbots into your new and existing applications. Amazon Lex provides the deep functionality and flexibility of natural language understanding (NLU) and automatic speech recognition (ASR) so you can build highly engaging user experiences with lifelike, conversational interactions and create new categories of products. Amazon Lex enables any developer to build conversational chatbots quickly. With Amazon Lex, no deep [...]
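
To make that concrete, here is a hedged sketch of sending one line of text to a Lex bot through boto3; the bot name, alias, and user ID are placeholders.

```python
import boto3

lex = boto3.client("lex-runtime", region_name="us-east-1")

response = lex.post_text(
    botName="OrderFlowers",      # hypothetical bot
    botAlias="prod",
    userId="demo-user-1",
    inputText="I would like to order some roses",
)
print(response["message"])       # the bot's textual reply
print(response["dialogState"])   # e.g. ElicitSlot or Fulfilled
```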

Amazon Elasticsearch Service

Amazon Elasticsearch Service (Amazon ES) is a managed service that makes it easy to create a domain and deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics. With Amazon ES, you get direct access to the Elasticsearch APIs so that existing code and applications work seamlessly with the service. Additionally, Amazon ES offers the following benefits of a managed service (a short usage sketch follows the list):
- Cluster scaling options
- Self-healing clusters
- Replication for high availability
- Data durability
- Enhanced [...]
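
Because the standard Elasticsearch REST APIs are exposed directly, plain HTTP calls work against the domain endpoint. A minimal sketch, assuming a hypothetical open-access domain (production domains normally require signed or authenticated requests):

```python
import requests

domain = "https://search-mydomain-abc123.us-east-1.es.amazonaws.com"

# Index a document, then search for it with the usual Elasticsearch APIs.
requests.put(f"{domain}/logs/_doc/1",
             json={"level": "ERROR", "msg": "disk full"})
hits = requests.get(f"{domain}/logs/_search",
                    params={"q": "level:ERROR"}).json()
print(hits["hits"]["hits"])
```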

Amazon RDS

What Is Amazon Relational Database Service (Amazon RDS)? Amazon Relational Database Service (Amazon RDS) is a web service that makes it easier to set up, operate, and scale a relational database in the cloud. It provides cost-efficient, resizable capacity for an industry-standard relational database and manages common database administration tasks. Overview of Amazon RDS Why do you want a managed relational database service? Because Amazon RDS takes over many of the difficult or tedious management tasks of a relational database: When you buy a server, you get CPU, memory, storage, and IOPS, all bundled together. With Amazon RDS, these [...]
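
As a sketch of how little "set up" involves compared with racking a server, the boto3 call below provisions a small managed MySQL instance; the identifier and credentials are placeholders.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="demo-mysql",   # hypothetical instance name
    DBInstanceClass="db.t3.micro",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me-please",
    AllocatedStorage=20,                 # GiB
)
```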

Amazon Aurora

Amazon Aurora (Aurora) is a fully managed, MySQL- and PostgreSQL-compatible relational database engine. It combines the speed and reliability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. Aurora makes it simple and cost-effective to set up, operate, and scale your MySQL and PostgreSQL deployments, freeing you to focus on your business and applications. Amazon RDS provides administration for Aurora by handling routine database tasks such as provisioning, patching, backup, recovery, failure detection, and repair. Amazon RDS also provides push-button migration tools to convert your existing Amazon RDS for MySQL and Amazon RDS for [...]
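
MySQL compatibility means an ordinary MySQL driver is all a client needs; a minimal sketch, assuming a hypothetical cluster endpoint and the PyMySQL driver:

```python
import pymysql

conn = pymysql.connect(
    host="demo.cluster-abc123.us-east-1.rds.amazonaws.com",  # placeholder
    user="admin",
    password="change-me-please",
    database="appdb",
)
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")  # Aurora reports a MySQL-compatible version
    print(cur.fetchone())
```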

Amazon Redshift

Amazon Redshift System Overview

Amazon Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing service.

Topics:
- Data Warehouse System Architecture
- Performance
- Columnar Storage
- Internal Architecture and System Operation
- Workload Management
- Using Amazon Redshift with Other Services

An Amazon Redshift data warehouse is an enterprise-class relational database query and management system. Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools. When you execute analytic queries, you retrieve, compare, and evaluate large amounts of data in multiple-stage operations to produce a final result. Amazon Redshift achieves efficient storage and [...]
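
Since Redshift speaks the PostgreSQL wire protocol, a standard driver can run those multi-stage analytic queries; a minimal sketch with psycopg2 against a hypothetical cluster and sales table:

```python
import psycopg2

conn = psycopg2.connect(
    host="demo.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,                # Redshift's default port
    dbname="analytics",
    user="awsuser",
    password="change-me-please",
)
cur = conn.cursor()
# A typical analytic query: aggregate a large fact table and rank the result.
cur.execute("""
    SELECT region, SUM(amount) AS revenue
    FROM sales
    GROUP BY region
    ORDER BY revenue DESC
""")
print(cur.fetchall())
```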

AWS Data Pipeline

What is AWS Data Pipeline? AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up. The following components of AWS Data Pipeline work together to manage your data: A pipeline definition specifies the business logic of your data management. For more information, see Pipeline Definition File Syntax. A pipeline schedules and runs tasks. You [...]
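
A hedged boto3 sketch of that flow: create a pipeline, attach a minimal definition, and activate it. The object IDs and field values are illustrative placeholders rather than a complete workflow.

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline = dp.create_pipeline(name="demo-pipeline", uniqueId="demo-001")

# A pipeline definition is a set of objects with key/value fields.
dp.put_pipeline_definition(
    pipelineId=pipeline["pipelineId"],
    pipelineObjects=[{
        "id": "Default",
        "name": "Default",
        "fields": [{"key": "scheduleType", "stringValue": "ondemand"}],
    }],
)
dp.activate_pipeline(pipelineId=pipeline["pipelineId"])
```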

Amazon Athena

What is Amazon Athena? Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds. Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. Athena scales automatically—executing queries in parallel—so results are fast, even with large datasets and [...]
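
A minimal sketch of pointing Athena at S3 data via boto3; the database, table, and results bucket are placeholders.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

job = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
# Poll get_query_execution() with this ID until the query completes.
print(job["QueryExecutionId"])
```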

Couchbase 

Couchbase is the merger of two popular NoSQL technologies: Membase, which adds persistence, replication, and sharding to the high-performance Memcached technology, and CouchDB, which pioneered the document-oriented model based on JSON. Like other NoSQL technologies, both Membase and CouchDB are built from the ground up on a highly distributed architecture, with data sharded across machines in a cluster. Built around the Memcached protocol, Membase offers an easy migration path for existing Memcached users who want to add persistence, sharding, and fault resilience to their familiar Memcached model. CouchDB, on the other hand, provides first-class support for storing JSON [...]
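
A minimal sketch of that key/value-plus-JSON-document model, assuming the 2.x-era Couchbase Python SDK (newer SDKs use a Cluster-based API) and a local "default" bucket:

```python
from couchbase.bucket import Bucket

bucket = Bucket("couchbase://localhost/default")

# Memcached-style access by key, CouchDB-style JSON documents as values.
bucket.upsert("user::42", {"name": "Ada", "roles": ["admin"]})
print(bucket.get("user::42").value)
```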

Apache HBase

What is Apache HBase?

Apache HBase is a popular and highly efficient column-oriented NoSQL database built on top of the Hadoop Distributed File System (HDFS) that allows performing read/write operations on large datasets in real time using key/value data.

Introduction to Apache HBase

Apache HBase is an Apache Hadoop project: an open-source, non-relational, distributed Hadoop database that had its genesis in Google's Bigtable. HBase is written in Java. Today it is an integral part of the Apache Software Foundation and the Hadoop ecosystem. It is a high-availability database that runs exclusively on top of HDFS and provides the capabilities [...]
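
As a sketch of those real-time key/value reads and writes, the snippet below uses the Thrift-based happybase client; the host, table, and column family are placeholders, and it assumes the HBase Thrift server is running.

```python
import happybase

connection = happybase.Connection("hbase-host")  # hypothetical host
table = connection.table("users")

# Write cells into the 'cf' column family, then read the row back by key.
table.put(b"row-1", {b"cf:name": b"Ada", b"cf:city": b"London"})
print(table.row(b"row-1"))
```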

Apache Ignite

What is Apache Ignite?

Apache Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads, delivering in-memory speeds at petabyte scale.

Durable Memory

Ignite's durable memory component treats RAM not just as a caching layer but as a complete, fully functional storage layer. This means that users can turn persistence on and off as needed. If persistence is turned off, then Ignite can act as a distributed in-memory database or in-memory data grid, depending on whether you prefer to use SQL or key-value APIs. If persistence is turned on, then Ignite becomes a distributed, horizontally scalable [...]
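
A minimal key-value sketch using Ignite's thin Python client (pyignite) against a local node; whether the data is purely in-memory or also persisted depends on how the cluster is configured:

```python
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)  # default thin-client port

cache = client.get_or_create_cache("quotes")
cache.put(1, "in-memory speed")
print(cache.get(1))
```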

Hadoop MapReduce

MapReduce Tutorial: Introduction

In this MapReduce tutorial blog, I am going to introduce you to MapReduce, one of the core building blocks of processing in the Hadoop framework. Before moving ahead, I would suggest you get familiar with HDFS concepts, which I covered in my previous HDFS tutorial blog; this will help you understand the MapReduce concepts quickly and easily. Google released a paper on MapReduce technology in December 2004, and this became the genesis of the Hadoop processing model. So, MapReduce is a programming model that allows us to perform parallel and distributed processing on huge data sets. The [...]
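
To see the model itself, here is the classic word-count sketch written for Hadoop Streaming, which lets plain Python scripts act as the map and reduce phases (file names are illustrative):

```python
# mapper.py -- a minimal Hadoop Streaming mapper for word count.
# Hadoop pipes input lines in on stdin and collects key/value pairs from stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

And the matching reducer, which receives its input grouped and sorted by key:

```python
# reducer.py -- sums the counts for each word.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```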

Apache Oozie 

OVERVIEW The blueprint for Enterprise Hadoop includes Apache™ Hadoop’s original data storage and data processing layers and also adds components for services that enterprises must have in a modern data architecture: data integration and governance, security and operations. Apache Oozie provides some of the operational services for a Hadoop cluster, specifically around job scheduling within the cluster. WHAT OOZIE DOES Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack, with YARN as its architectural center, and supports [...]
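
Oozie also exposes a REST API (by default on port 11000), so its scheduling state can be inspected from any HTTP client; a hedged sketch with a placeholder host and workflow job ID:

```python
import requests

oozie = "http://oozie-host:11000/oozie"  # hypothetical server

# Server health, then the status of one (placeholder) workflow job.
print(requests.get(f"{oozie}/v1/admin/status").json())
info = requests.get(f"{oozie}/v1/job/0000001-200101000000000-oozie-W",
                    params={"show": "info"}).json()
print(info["status"])  # e.g. RUNNING or SUCCEEDED
```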

MongoDB

MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.

Document Database

A record in MongoDB is a document, which is a data structure composed of field-and-value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents (see the sketch below). The advantages of using documents are:
- Documents (i.e., objects) correspond to native data types in many programming languages.
- Embedded documents and arrays reduce the need for expensive joins.
- Dynamic schema supports fluent polymorphism.

Key Features

High Performance: MongoDB provides high-performance data persistence. In particular, [...]
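
A minimal pymongo sketch of those embedded documents and arrays, assuming a local mongod and illustrative names:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client.appdb.users  # database and collection are created lazily

# One document embeds an array of sub-documents -- no join required
# to read a user together with all of their addresses.
users.insert_one({
    "name": "Ada",
    "addresses": [{"city": "London"}, {"city": "Zurich"}],
})
print(users.find_one({"addresses.city": "London"}))
```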

Apache Kudu

Introducing Apache Kudu

Kudu is a columnar storage manager developed for the Apache Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operation. Kudu's design sets it apart. Some of Kudu's benefits include:
- Fast processing of OLAP workloads.
- Integration with MapReduce, Spark, and other Hadoop ecosystem components.
- Tight integration with Apache Impala, making it a good, mutable alternative to using HDFS with Apache Parquet (see the sketch below).
- A strong but flexible consistency model, allowing you to choose consistency requirements on a per-request basis, including the option for strict-serializable [...]
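
That mutability is visible in the client API. A hedged sketch with the kudu-python client, assuming a placeholder master address and an existing "metrics" table:

```python
import kudu

client = kudu.connect(host="kudu-master-host", port=7051)  # placeholder
table = client.table("metrics")

# Rows are inserted (and can later be updated) in place, unlike
# immutable Parquet files on HDFS.
session = client.new_session()
session.apply(table.new_insert({"host": "web-1", "ts": 1700000000, "value": 0.42}))
session.flush()
```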

Redis

Overview of Redis Architecture

Redis is an in-memory, key-value data store, and one of the most popular key-value stores in use today, adopted by many of the biggest IT brands in the world. Amazon ElastiCache supports Redis, which makes Redis a very powerful and must-know key-value data store. In this post I will give you a brief introduction to Redis architecture.

What Is an In-Memory, Key-Value Store?

A key-value store is a storage system where data is stored in the form of key-value pairs. When we say in-memory key-value store, we mean that the key-value pairs are stored in [...]
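
A minimal sketch of that in-memory key-value model with the redis-py client against a local server:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Every read and write is served from RAM -- the source of Redis's speed.
r.set("session:42", "alice")
print(r.get("session:42"))    # b'alice'
r.expire("session:42", 3600)  # a TTL, the common caching pattern
```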

Apache Tez

Apache Tez Introduction

The Apache TEZ® project is aimed at building an application framework which allows for a complex directed acyclic graph (DAG) of tasks for processing data. It is currently built atop Apache Hadoop YARN. The two main design themes for Tez are:
- Empowering end users with:
  - Expressive dataflow definition APIs
  - A flexible Input-Processor-Output runtime model
  - A data-type-agnostic design
  - Simplified deployment
- Execution performance:
  - Performance gains over MapReduce
  - Optimal resource management
  - Plan reconfiguration at runtime
  - Dynamic physical data flow decisions

By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can be used to process data, [...]
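
Hive is the easiest place to see Tez in action: the same SQL can be handed to Tez as one DAG instead of a chain of MapReduce jobs. A sketch using PyHive against a hypothetical HiveServer2 endpoint and sales table:

```python
from pyhive import hive

cur = hive.connect(host="hive-host", port=10000).cursor()  # placeholder host

# Route this session's queries through Tez rather than MapReduce.
cur.execute("SET hive.execution.engine=tez")
cur.execute("SELECT region, COUNT(*) FROM sales GROUP BY region")
print(cur.fetchall())
```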

Apache Drill

Apache Drill: Drill is an Apache open-source SQL query engine for Big Data exploration. Apache Drill is designed from the ground up to support high-performance analysis on the semi-structured and rapidly evolving data coming from modern Big Data applications, while still providing the familiarity and ecosystem of ANSI SQL, the industry-standard query language. Drill provides plug-and-play integration with existing Apache Hive and Apache HBase deployments (a short query sketch follows below).

What's New in Apache Drill 1.12

Drill 1.12 provides the following new features and improvements:
- Kafka and OpenTSDB storage plugins (DRILL-4779, DRILL-5337)
- SSL/TLS support (DRILL-5431)
- Network encryption support (DRILL-5682)
- Queue-based memory assignment for [...]
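
Drill's REST API (default port 8047) accepts SQL over plain HTTP; the sketch below queries the sample employee.json file that ships on Drill's classpath:

```python
import requests

payload = {
    "queryType": "SQL",
    "query": "SELECT full_name FROM cp.`employee.json` LIMIT 3",
}
result = requests.post("http://localhost:8047/query.json", json=payload).json()
print(result["rows"])
```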

Presto

WHAT IS PRESTO?

Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics, and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

WHO USES IT?

Facebook uses Presto for interactive queries against several internal data stores, including their 300 PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte of data each day. [...]
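
A minimal client-side sketch with PyHive, assuming a placeholder coordinator host and that the built-in tpch catalog is configured:

```python
from pyhive import presto

cursor = presto.connect(host="presto-coordinator", port=8080).cursor()

# The same interactive SQL applies whether a table holds gigabytes or petabytes.
cursor.execute("SELECT name FROM tpch.tiny.nation LIMIT 5")
print(cursor.fetchall())
```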

Apache Sqoop

Before starting with this Apache Sqoop tutorial, let us take a step back. Can you recall the importance of data ingestion, as we discussed in our earlier blog on Apache Flume? As we know, Apache Flume is a data ingestion tool for unstructured sources, but organisations store their operational data in relational databases. So there was a need for a tool that could import and export data from relational databases. This is why Apache Sqoop was born. Sqoop integrates easily with Hadoop and can dump structured data from relational databases onto HDFS, complementing the power of Hadoop. This [...]
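
Sqoop itself is driven from the command line; the sketch below just shells out to a typical import, with a placeholder JDBC URL, credentials, and HDFS path:

```python
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host/sales",  # placeholder source database
    "--username", "etl",
    "--password", "change-me-please",
    "--table", "orders",
    "--target-dir", "/user/etl/orders",  # HDFS destination
    "--num-mappers", "4",                # parallel import tasks
], check=True)
```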

Apache Storm 

OVERVIEW

A system for processing streaming data in real time: Apache™ Storm adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning, and continuous monitoring of operations. Storm integrates with YARN via Apache Slider; YARN manages Storm while also accounting for cluster resources used by the data governance, security, and operations components of a modern data architecture.

WHAT STORM DOES

Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm is extremely fast, with the ability to process over a million records per second [...]
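
Storm topologies are usually written in Java, but Python components can run inside them via the multilang protocol; a hedged sketch of a counting bolt using the streamparse library (topology wiring omitted):

```python
from streamparse import Bolt

class WordCountBolt(Bolt):
    """Counts words arriving from an upstream spout, one tuple at a time."""

    def initialize(self, conf, ctx):
        self.counts = {}

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] = self.counts.get(word, 0) + 1
        self.emit([word, self.counts[word]])
```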

Spark Streaming 

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming can be used to ingest live data and process it in real time, and its ever-growing user base includes household names like Uber, Netflix, and Pinterest. When it comes to real-time data analytics, Spark Streaming provides a single platform to ingest data for fast, live processing in Apache Spark. Through this blog, I will introduce you to this exciting new domain of Spark Streaming, and we will go through a complete use case, Twitter sentiment [...]
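
Before the use case, here is the canonical network word count, which shows the micro-batch model in a few lines (feed the socket with e.g. `nc -lk 9999`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts

ssc.start()
ssc.awaitTermination()
```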

Apache Cassandra

What is Apache Cassandra™? Apache Cassandra™, a top-level Apache project born at Facebook and built on Amazon's Dynamo and Google's Bigtable, is a distributed database for managing large amounts of structured data across many commodity servers while providing highly available service and no single point of failure. Apache Cassandra™ offers capabilities that relational databases and other NoSQL databases simply cannot match, such as continuous availability, linear-scale performance, operational simplicity, and easy data distribution across multiple data centers and cloud availability zones. Apache Cassandra™'s architecture is responsible for its ability to scale, perform, and offer continuous uptime. Rather than using a legacy master-slave or a manual and [...]
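
Because every node is a peer, a client simply names one or more contact points; a minimal sketch with the DataStax Python driver, creating a keyspace replicated three ways:

```python
from cassandra.cluster import Cluster

# Any node can serve as a contact point -- there is no master to single out.
session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute(
    "CREATE TABLE IF NOT EXISTS demo.users (id uuid PRIMARY KEY, name text)"
)
```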